Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can handle different types of input data, including tabular, image, and text data.
The SageMaker XGBoost algorithm allows you to easily run XGBoost training and inference on SageMaker. XGBoost (eXtreme Gradient Boosting) is a popular and efficient open source implementation of the gradient boosted tree algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm works well in ML competitions because of its powerful handling of different data types, relationships, distributions, and the variety of hyperparameters you can specify. You can use XGBoost for regression, classification (binary and multi-class) and ranking problems. You can use GPUs to speed up training on large datasets.
Today we are excited to announce that SageMaker XGBoost now offers fully distributed GPU training.
Starting with version 1.5-1 and higher, you can now use all GPUs when training on multi-GPU instances. The new feature addresses your need for fully distributed GPU training when dealing with large datasets. This means you can use multiple Amazon Elastic Compute Cloud (Amazon EC2) GPU instances and utilize all GPUs on each instance.
Distributed GPU training with multiple GPU instances
With SageMaker XGBoost version 1.2-2 or later, you can use one or more single-GPU instances for training. The `tree_method` hyperparameter must be set to `gpu_hist`. When using more than one instance (a distributed setup), the data needs to be split among the instances (the same as the non-GPU distributed training steps described for the XGBoost algorithm). Although this option is performant and can be used in various training setups, it doesn't extend to using all the GPUs when you choose a multi-GPU instance such as g5.12xlarge.
With SageMaker XGBoost version 1.5-1 and later, you can now use all the GPUs on each instance when using multi-GPU instances. The ability to use all GPUs on multi-GPU instances is offered by integrating the Dask framework.
You can use this option to finish training quickly. In addition to saving time, this option is also useful for working around blockers such as (soft) limits on the maximum number of usable instances, or if the training job can't provision a large number of single-GPU instances for some reason.
The configuration for using this option is the same as the previous option, except for the following differences:
- Add the new hyperparameter `use_dask_gpu_training` with the string value `true`.
- When creating the TrainingInput, set the distribution parameter to `FullyReplicated`, whether using single or multiple instances. The underlying Dask framework carries out the data load and splits the data among the Dask workers. This is different from the data distribution setting used for all other distributed training with SageMaker XGBoost.
Note that splitting the data into smaller files still applies to Parquet, where Dask reads each file as a partition. Because you'll have one Dask worker per GPU, the number of files should be greater than instance count * GPU count per instance. Also, making each file very small and having a very large number of files can degrade performance. For more information, see Avoiding Very Large Graphs. For CSV, we still recommend splitting large files into smaller ones to reduce data download times and enable quicker reads. However, it's not a requirement.
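For illustration, the following is a minimal sketch, assuming Dask with S3 support (s3fs) is installed and using placeholder S3 paths and illustrative instance counts, of writing a CSV dataset out as enough Parquet files for the Dask workers:

```python
# Hypothetical data preparation sketch: write the dataset as Parquet so that the
# number of files comfortably exceeds instance_count * gpus_per_instance.
import dask.dataframe as dd

instance_count = 10      # illustrative: number of training instances
gpus_per_instance = 4    # illustrative: for example, g4dn.12xlarge has 4 GPUs
min_files = instance_count * gpus_per_instance  # one Dask worker per GPU

# Read the raw CSV data (placeholder path); blocksize controls partition size,
# so roughly one ~100 MB Parquet file is written per partition below.
df = dd.read_csv("s3://<your-bucket>/raw-data/*.csv", blocksize="100MB")

print(f"{df.npartitions} output files for {min_files} Dask workers")
assert df.npartitions > min_files, "use more partitions or fewer GPUs"

# Each partition is written as one Parquet file that Dask later reads as a partition.
df.to_parquet("s3://<your-bucket>/train-parquet/", write_index=False)
```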
Currently, the input formats supported by this option are:
- text/csv
- application/x-parquet
The following input modes are supported:
The code looks like this:
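What follows is a minimal sketch rather than the original notebook code: it assumes the SageMaker Python SDK, placeholder S3 paths and bucket names, and an illustrative instance type and count. Only the `use_dask_gpu_training` hyperparameter and the `FullyReplicated` distribution differ from a regular GPU training job.

```python
# Minimal sketch of SageMaker XGBoost fully distributed GPU training with Dask.
# Bucket names, paths, and instance settings are placeholders/assumptions.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = session.boto_region_name

# Retrieve the SageMaker XGBoost container, version 1.5-1 or later.
image_uri = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.5-1")

hyperparameters = {
    "objective": "binary:logistic",
    "num_round": "1000",
    "tree_method": "gpu_hist",        # required for GPU training
    "use_dask_gpu_training": "true",  # enables the fully distributed multi-GPU option
}

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=2,                  # illustrative: multiple multi-GPU instances
    instance_type="ml.g4dn.12xlarge",  # illustrative: 4 GPUs per instance
    hyperparameters=hyperparameters,
    output_path="s3://<your-bucket>/xgboost-dask-output",  # placeholder
    sagemaker_session=session,
)

# With the Dask option, the distribution must be FullyReplicated;
# Dask handles loading and splitting the data among its workers.
train_input = TrainingInput(
    "s3://<your-bucket>/train/",       # placeholder: prefix containing Parquet files
    content_type="application/x-parquet",
    distribution="FullyReplicated",
)
validation_input = TrainingInput(
    "s3://<your-bucket>/validation/",  # placeholder
    content_type="application/x-parquet",
    distribution="FullyReplicated",
)

estimator.fit({"train": train_input, "validation": validation_input})
```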
The following screenshots show a successful training job log from a notebook.
Benchmarking
We compared the evaluation metrics to ensure that the model quality did not deteriorate in the multi-GPU training path compared to the single GPU training. We also benchmarked large data sets to ensure that our distributed GPU setups were efficient and scalable.
Billable time refers to the absolute wall-clock time. Training time is only the XGBoost training time, measured from the `train()` call until the model is saved to Amazon Simple Storage Service (Amazon S3).
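As an aside, the billable time can be read back from the training job description; the following is a small sketch assuming boto3 and a placeholder job name (the training time in the tables below is instead measured around the XGBoost `train()` call itself, as described above).

```python
# Sketch: retrieve the billable and total training time that SageMaker reports
# for a completed training job. The job name is a placeholder.
import boto3

sm_client = boto3.client("sagemaker")
job_name = "<your-training-job-name>"

desc = sm_client.describe_training_job(TrainingJobName=job_name)
print("Billable time per instance (s):", desc["BillableTimeInSeconds"])
print("Total training time (s):", desc["TrainingTimeInSeconds"])
```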
Performance benchmarks on large datasets
Using multiple GPUs is usually appropriate for large datasets with computationally heavy training. For testing, we created a dummy dataset with 2,497,248,278 rows and 28 features. The dataset was 150 GB and consisted of 1,419 files, each 105-115 MB in size. We saved the data in Parquet format in an S3 bucket. To simulate somewhat heavy training, we used this dataset for a binary classification task with 1,000 rounds to compare performance between the single GPU training path and the multi-GPU training path.
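The exact dummy dataset isn't reproduced here; purely as an illustration, a synthetic dataset of a similar shape could be generated and written to Parquet with Dask along these lines (row counts, chunk sizes, and paths are placeholders):

```python
# Hypothetical sketch of generating a large synthetic binary-classification dataset
# in Parquet format; sizes and S3 paths are illustrative assumptions only.
import dask.array as da
import dask.dataframe as dd

n_rows, n_features = 2_497_248_278, 28
chunk_rows = 2_000_000  # controls partition (and output file) size

X = da.random.random((n_rows, n_features), chunks=(chunk_rows, n_features))
y = da.random.randint(0, 2, size=(n_rows, 1), chunks=(chunk_rows, 1))

columns = ["label"] + [f"f{i}" for i in range(n_features)]
ddf = dd.from_dask_array(da.concatenate([y, X], axis=1), columns=columns)

# One Parquet file is written per partition into the placeholder S3 prefix.
ddf.to_parquet("s3://<your-bucket>/synthetic-data/", write_index=False)
```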
The following tables compare the billable time and training time between the single GPU training path and the multi-GPU training path.
Single GPU training path

| Instance type | Number of instances | Billable time/instance (s) | Training time (s) |
| --- | --- | --- | --- |
| g4dn.xlarge | 20 | Out of memory | |
| g4dn.2xlarge | 20 | Out of memory | |
| g4dn.4xlarge | 15 | 1710 | 1551.9 |
| g4dn.4xlarge | 16 | 1592 | 1412.2 |
| g4dn.4xlarge | 17 | 1542 | 1352.2 |
| g4dn.4xlarge | 18 | 1423 | 1281.2 |
| g4dn.4xlarge | 19 | 1346 | 1220.3 |
Multi-GPU training path

| Instance type | Number of instances | Billable time/instance (s) | Training time (s) |
| --- | --- | --- | --- |
| g4dn.12xlarge | 7 | Out of memory | |
| g4dn.12xlarge | 8 | 1143 | 784.7 |
| g4dn.12xlarge | 9 | 1039 | 710.73 |
| g4dn.12xlarge | 10 | 978 | 676.7 |
| g4dn.12xlarge | 12 | 940 | 614.35 |
We can see that using multi-GPU instances results in lower training time and lower overall billable time. The single GPU training path still has some advantages: each instance downloads and reads only part of the data, so its data download time is lower, and it doesn't incur Dask's overhead. As a result, the gap between training time and total time is smaller. However, because more GPUs are used in parallel, the multi-GPU setup can reduce training time significantly.

You should use an EC2 instance type that has enough compute power and memory to avoid out-of-memory errors when dealing with large datasets.

It's possible to reduce total time even further with the single GPU setup by using more instances or more powerful instance types. However, it can cost more. For example, the following tables compare the training time and cost of single GPU training with g4dn.8xlarge instances and multi-GPU training with g4dn.12xlarge instances.
Single GPU training path

| Instance type | Number of instances | Billable time/instance (s) | Cost ($) |
| --- | --- | --- | --- |
| g4dn.8xlarge | 15 | 1679 | 15.22 |
| g4dn.8xlarge | 17 | 1509 | 15.51 |
| g4dn.8xlarge | 19 | 1326 | 15.22 |
Multi-GPU training path

| Instance type | Number of instances | Billable time/instance (s) | Cost ($) |
| --- | --- | --- | --- |
| g4dn.12xlarge | 8 | 1143 | 9.93 |
| g4dn.12xlarge | 10 | 978 | 10.63 |
| g4dn.12xlarge | 12 | 940 | 12.26 |
The cost calculation is based on the On-Demand price for each instance type. For more information, see Amazon EC2 G4 Instances.
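As a sanity check, the costs above are roughly the billable seconds per instance, times the instance count, times the hourly price. The following sketch assumes us-east-1 On-Demand prices of about $2.176/hour for g4dn.8xlarge and $3.912/hour for g4dn.12xlarge; actual prices vary by Region and over time.

```python
# Sketch of the cost calculation implied by the tables above. The hourly prices are
# assumed us-east-1 On-Demand rates and may change; check the Amazon EC2 G4 page.
def training_cost(billable_seconds, instance_count, price_per_hour):
    """Cost = billable seconds per instance * instance count * hourly price."""
    return billable_seconds / 3600 * instance_count * price_per_hour

print(training_cost(1679, 15, 2.176))  # ~15.2: 15 x g4dn.8xlarge, single GPU path
print(training_cost(1143, 8, 3.912))   # ~9.9:  8 x g4dn.12xlarge, multi-GPU path
```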
Model quality benchmarks
For model quality, we compared evaluation metrics between the Dask GPU option and the single-GPU option, training on different instance types and instance counts. For different tasks, we used different datasets and hyperparameters, with each dataset split into training, validation, and test sets.
For the binary classification (`binary:logistic`) task, we used the HIGGS dataset in CSV format. The training partition of the dataset has 9,348,181 rows and 28 features. The number of rounds used was 1,000. The following table summarizes the results.
Multi-GPU training with Dask

| Instance type | GPUs per instance | Number of instances | Billable time/instance (s) | Accuracy (%) | F1 (%) | ROC AUC (%) |
| --- | --- | --- | --- | --- | --- | --- |
| g4dn.2xlarge | 1 | 1 | 343 | 75.97 | 77.61 | 84.34 |
| g4dn.4xlarge | 1 | 1 | 413 | 76.16 | 77.75 | 84.51 |
| g4dn.8xlarge | 1 | 1 | 413 | 76.16 | 77.75 | 84.51 |
| g4dn.12xlarge | 4 | 1 | 157 | 76.16 | 77.74 | 84.52 |
For the regression (`reg:squarederror`) task, we used the NYC green taxi trip dataset (with some modifications) in Parquet format. The training partition of the dataset has 72,921,051 rows and 8 features. The number of rounds used was 500. The following table shows the results.
Multi-GPU training with Dask

| Instance type | GPUs per instance | Number of instances | Billable time/instance (s) | MSE | R² | MAE |
| --- | --- | --- | --- | --- | --- | --- |
| g4dn.2xlarge | 1 | 1 | 775 | 21.92 | 0.7787 | 2.43 |
| g4dn.4xlarge | 1 | 1 | 770 | 21.92 | 0.7787 | 2.43 |
| g4dn.8xlarge | 1 | 1 | 705 | 21.92 | 0.7787 | 2.43 |
| g4dn.12xlarge | 4 | 1 | 253 | 21.93 | 0.7787 | 2.44 |
Model quality metrics are similar between the multi-GPU (Dask) training variant and the existing training variant. Model quality remains consistent when using a distributed configuration with multiple instances or GPUs.
Conclusion
In this post, we reviewed how you can use different combinations of instance types and instance counts for distributed GPU training with SageMaker XGBoost. For most use cases, you can use single GPU instances; this option covers a wide range of use cases and is very performant. For training on large datasets with many rounds, you can use multi-GPU instances, which can deliver fast training with a smaller number of instances. Overall, you can use SageMaker XGBoost's distributed GPU setup to speed up your XGBoost training immensely.
To learn more about distributed training using SageMaker and Dask, see Amazon SageMaker built-in LightGBM now offers distributed training using Dask.
About the authors
Dheeraj Thakur is a solutions architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and loves building and experimenting in the analytics and AI/ML space.
Dewan Chowdhury is a software development engineer with Amazon Web Services. He works on Amazon SageMaker algorithms and JumpStart offerings. Apart from building AI/ML infrastructure, he is also passionate about building scalable distributed systems.
Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric spatio-temporal clustering. He has published numerous papers in ACL, ICDM, and KDD conferences, and in the Royal Statistical Society: Series A journal.
Tony Cruz