
    Amazon SageMaker XGBoost now offers fully distributed GPU training

    31 May 2023


    Amazon SageMaker provides a set of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can handle different types of input data, including tabular, image, and text.

    The SageMaker XGBoost algorithm allows you to easily run XGBoost training and inference on SageMaker. XGBoost (eXtreme Gradient Boosting) is a popular and efficient open source implementation of the gradient boosted tree algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm works well in ML competitions because of its powerful handling of different data types, relationships, distributions, and the variety of hyperparameters you can specify. You can use XGBoost for regression, classification (binary and multi-class) and ranking problems. You can use GPUs to speed up training on large datasets.
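
    To ground the terminology, the following is a minimal sketch of training a gradient boosted tree model with the open-source XGBoost library outside of SageMaker. The random data and parameter values are purely illustrative.

    # Minimal standalone XGBoost sketch (not SageMaker-specific); data and
    # parameter values are illustrative only.
    import numpy as np
    import xgboost as xgb

    # Illustrative random data: 1,000 rows, 10 features, binary labels
    X = np.random.rand(1000, 10)
    y = np.random.randint(0, 2, size=1000)

    dtrain = xgb.DMatrix(X, label=y)
    params = {
        "objective": "binary:logistic",  # use "reg:squarederror" for regression
        "eval_metric": "auc",
        "tree_method": "hist",           # "gpu_hist" trains on a GPU in the XGBoost versions discussed here
        "max_depth": 6,
    }
    booster = xgb.train(params, dtrain, num_boost_round=100)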

    Today we are excited to announce that SageMaker XGBoost now offers fully distributed GPU training.

    Starting with version 1.5-1, SageMaker XGBoost uses all of the GPUs when you train on a multi-GPU instance. The new feature addresses the need for fully distributed GPU training on large datasets: you can use multiple Amazon Elastic Compute Cloud (Amazon EC2) GPU instances and all of the GPUs on each instance.

    Distributed GPU training with multiple GPU instances

    With SageMaker XGBoost 1.2-2 or later, you can use one or more single-GPU instances for training. The tree_method hyperparameter must be set to gpu_hist. When using more than one instance (a distributed setup), the data must be split across the instances, following the same steps as non-GPU distributed training with the XGBoost algorithm. Although this option is effective and works in a variety of training settings, it doesn't extend to using all of the GPUs when you choose a multi-GPU instance such as g5.12xlarge. A hedged sketch of this setup follows.
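
    The following is a minimal sketch of that pre-1.5-1 option, assuming an XGBoost 1.2-2 container and two single-GPU ml.g4dn.xlarge instances. Sharding the data with the ShardedByS3Key distribution is our assumption for "splitting the data between instances"; the bucket, prefix, and hyperparameter values are placeholders.

    # Hedged sketch: several single-GPU instances, tree_method=gpu_hist, data
    # sharded across instances. Placeholders: <bucket>, <prefix>.
    import sagemaker
    from sagemaker.inputs import TrainingInput

    region = sagemaker.Session().boto_region_name
    container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-2")

    estimator = sagemaker.estimator.Estimator(
        image_uri=container,
        role=sagemaker.get_execution_role(),
        instance_count=2,                  # two single-GPU instances
        instance_type="ml.g4dn.xlarge",    # 1 GPU per instance
        output_path="s3://<bucket>/<prefix>/output",
        hyperparameters={
            "objective": "reg:squarederror",
            "num_round": "100",
            "tree_method": "gpu_hist",     # required for GPU training
        },
    )

    # Each instance downloads and trains on only its shard of the data
    train_input = TrainingInput(
        "s3://<bucket>/<prefix>/train/",
        content_type="text/csv",
        distribution="ShardedByS3Key",
    )
    estimator.fit({"train": train_input})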

    With SageMaker XGBoost 1.5-1 and later, you can now use all of the GPUs on each instance, even when you use multiple multi-GPU instances. The ability to use all GPUs on multi-GPU instances is provided by integrating the Dask framework.

    You can use this option to finish training quickly. In addition to saving time, this option also helps you work around blockers such as (soft) limits on the maximum number of usable instances, or situations in which a training job can't provision a large number of single-GPU instances for some reason.

    The configuration for this option is the same as for the previous option, except for the following differences:

    • Add the new hyperparameter use_dask_gpu_training with the string value true.
    • Set the distribution parameter of the TrainingInput to FullyReplicated, whether you use a single instance or multiple instances. The underlying Dask framework handles loading the data and splitting it between the Dask workers. This differs from the data distribution setting used for all other distributed training with SageMaker XGBoost.

    Note that splitting the data into smaller files still applies for Parquet, where Dask reads each file as a partition. Because you will have one Dask worker per GPU, the number of files should exceed the number of instances multiplied by the number of GPUs per instance. Also, making each file too small and having too many files can degrade performance. For more information, see Avoiding Very Large Graphs. For CSV, we still recommend splitting large files into smaller ones to reduce data download times and enable faster reads, but this is not a requirement. A small sanity check for the file-count guidance is sketched below.
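
    The following is a hedged helper, not part of the SageMaker feature itself, that checks the file-count rule of thumb above by counting Parquet objects under an S3 prefix. The bucket, prefix, and instance values are placeholders.

    # Sanity check: number of Parquet part files should exceed
    # instance_count * GPUs per instance (one Dask worker per GPU).
    import boto3

    def count_parquet_files(bucket, prefix):
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        count = 0
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                if obj["Key"].endswith(".parquet"):
                    count += 1
        return count

    instance_count = 8
    gpus_per_instance = 4   # for example, g4dn.12xlarge has 4 GPUs
    n_files = count_parquet_files("<bucket>", "<prefix>/train/")
    assert n_files > instance_count * gpus_per_instance, (
        f"Only {n_files} files for {instance_count * gpus_per_instance} Dask workers"
    )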

    Currently, the input formats supported by this option are:

    • text/csv
    • application/x-parquet

    The following input modes are supported:

    The code looks like this:

    import os
    import boto3
    import re
    import sagemaker
    from sagemaker.session import Session
    from sagemaker.inputs import TrainingInput
    from sagemaker.xgboost.estimator import XGBoost

    role = sagemaker.get_execution_role()
    region = sagemaker.Session().boto_region_name
    session = Session()

    bucket = "<Specify S3 Bucket>"
    prefix = "<Specify S3 prefix>"

    hyperparams = {
        "objective": "reg:squarederror",
        "num_round": "500",
        "verbosity": "3",
        "tree_method": "gpu_hist",
        "eval_metric": "rmse",
        "use_dask_gpu_training": "true",
    }

    output_path = "s3://{}/{}/output".format(bucket, prefix)

    content_type = "application/x-parquet"
    instance_type = "ml.g4dn.2xlarge"

    xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")
    xgb_script_mode_estimator = sagemaker.estimator.Estimator(
        image_uri=xgboost_container,
        hyperparameters=hyperparams,
        role=role,
        instance_count=1,
        instance_type=instance_type,
        output_path=output_path,
        max_run=7200,
    )

    train_data_uri = "<specify the S3 uri for training dataset>"
    validation_data_uri = "<specify the S3 uri for validation dataset>"

    train_input = TrainingInput(
        train_data_uri, content_type=content_type
    )

    validation_input = TrainingInput(
        validation_data_uri, content_type=content_type
    )

    xgb_script_mode_estimator.fit({"train": train_input, "validation": validation_input})
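
    The example above runs on a single ml.g4dn.2xlarge. As a hedged sketch, assuming the same hyperparameters, role, and input channels defined above, scaling out to several multi-GPU instances only changes the instance configuration; the instance type and count below are illustrative.

    # Same job scaled out; hyperparams still includes use_dask_gpu_training
    xgb_script_mode_estimator = sagemaker.estimator.Estimator(
        image_uri=xgboost_container,
        hyperparameters=hyperparams,
        role=role,
        instance_count=8,                   # 8 instances
        instance_type="ml.g4dn.12xlarge",   # 4 GPUs each -> 32 Dask workers
        output_path=output_path,
        max_run=7200,
    )
    xgb_script_mode_estimator.fit({"train": train_input, "validation": validation_input})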

    The following screenshots show a successful training job log from a notebook.


    We compared the evaluation metrics to ensure that the model quality did not deteriorate in the multi-GPU training path compared to the single GPU training. We also benchmarked large data sets to ensure that our distributed GPU setups were efficient and scalable.

    Billing time refers to the absolute wall-clock time. Training time is only the XGBoost training time, measured from the train() call until the model is saved to Amazon Simple Storage Service (Amazon S3).
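
    As a hedged aside, both numbers can be read for a completed job from the DescribeTrainingJob API; note that the API's TrainingTimeInSeconds covers the whole training run on the instance rather than only the train() call measured here, and the job name below is a placeholder.

    # Read billable and training time for a finished job
    import boto3

    sm = boto3.client("sagemaker")
    desc = sm.describe_training_job(TrainingJobName="<your-training-job-name>")
    print("Billable seconds:", desc.get("BillableTimeInSeconds"))
    print("Training seconds:", desc.get("TrainingTimeInSeconds"))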

    Performance benchmarks on large datasets

    Using multiple GPUs is usually appropriate for large datasets with complex training. We created a dummy dataset with 2,497,248,278 rows and 28 features for testing. The data set was 150 GB and consisted of 1,419 files. Each file size was 105-115 MB. We saved the data in Parquet format to an S3 bucket. To simulate somewhat complex training, we used this dataset for a binary classification task with 1000 rounds to compare the performance between a single GPU training path and a multi-GPU training path.

    The following table shows the billable training time and a performance comparison between the single-GPU training path and the multi-GPU training path.

    Single-GPU training path
    Instance type    Number of instances    Billing time per instance (s)    Training time (s)
    g4dn.xlarge      20                     Out of memory
    g4dn.2xlarge     20                     Out of memory
    g4dn.4xlarge     15                     1710                             1551.9
    g4dn.4xlarge     16                     1592                             1412.2
    g4dn.4xlarge     17                     1542                             1352.2
    g4dn.4xlarge     18                     1423                             1281.2
    g4dn.4xlarge     19                     1346                             1220.3

    Multi-GPU training path (Dask)
    Instance type    Number of instances    Billing time per instance (s)    Training time (s)
    g4dn.12xlarge    7                      Out of memory
    g4dn.12xlarge    8                      1143                             784.7
    g4dn.12xlarge    9                      1039                             710.73
    g4dn.12xlarge    10                     978                              676.7
    g4dn.12xlarge    12                     940                              614.35

    We can see that using multi-GPU instances results in lower training time and lower overall billing time. The single-GPU training path still has the advantage that each instance downloads and reads only its own portion of the data, so data download times are lower, and it doesn't incur Dask's overhead; as a result, the gap between training time and total time is smaller. However, because it uses more GPUs, the multi-GPU setup can reduce training time significantly.

    You should use an EC2 instance type with enough compute and memory to avoid out-of-memory errors when dealing with large datasets.

    It is possible to reduce the total time even further with a single-GPU setup by using more instances or a more powerful instance type. However, it can be more expensive. For example, the following table compares training time and cost for the single-GPU g4dn.8xlarge instance type against the multi-GPU g4dn.12xlarge instance type.

    Single-GPU training path
    Instance type    Number of instances    Billing time per instance (s)    Cost ($)
    g4dn.8xlarge     15                     1679                             15.22
    g4dn.8xlarge     17                     1509                             15.51
    g4dn.8xlarge     19                     1326                             15.22

    Multi-GPU training path (Dask)
    Instance type    Number of instances    Billing time per instance (s)    Cost ($)
    g4dn.12xlarge    8                      1143                             9.93
    g4dn.12xlarge    10                     978                              10.63
    g4dn.12xlarge    12                     940                              12.26

    The cost calculation is based on the On-Demand price for each instance type. For more information, see Amazon EC2 G4 Instances.
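
    As a rough check, the cost figures in the table are consistent with cost = number of instances x (billing time per instance in hours) x hourly instance price. The sketch below reproduces two rows, assuming hourly On-Demand rates of about $2.176 for g4dn.8xlarge and $3.912 for g4dn.12xlarge (our assumption based on us-east-1 pricing; check the G4 pricing page for your Region).

    # Hedged reconstruction of the cost arithmetic behind the table
    def training_cost(instance_count, billing_seconds_per_instance, hourly_rate):
        return instance_count * (billing_seconds_per_instance / 3600) * hourly_rate

    print(training_cost(15, 1679, 2.176))   # g4dn.8xlarge row, ~$15.22
    print(training_cost(8, 1143, 3.912))    # g4dn.12xlarge row, ~$9.93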

    Model quality benchmarks

    For model quality, we compared evaluation metrics between the Dask multi-GPU option and the single-GPU option, training on different instance types and instance counts. For different tasks, we used different datasets and hyperparameters, with each dataset split into training, validation, and test sets.
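
    The post doesn't show how these metrics were computed; as a generic, hedged sketch, the classification and regression metrics reported in the following tables could be produced with scikit-learn on a held-out test set. The arrays below are random placeholders standing in for real labels and model predictions.

    import numpy as np
    from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                                 mean_squared_error, r2_score, mean_absolute_error)

    # Placeholder labels/predictions for a binary classification task
    y_true = np.random.randint(0, 2, size=1000)
    y_prob = np.random.rand(1000)               # predicted probabilities
    y_pred = (y_prob >= 0.5).astype(int)
    print("accuracy:", accuracy_score(y_true, y_pred))
    print("F1:", f1_score(y_true, y_pred))
    print("ROC AUC:", roc_auc_score(y_true, y_prob))

    # Placeholder targets/predictions for a regression task
    y_reg_true = np.random.rand(1000) * 10
    y_reg_pred = y_reg_true + np.random.randn(1000)
    print("MSE:", mean_squared_error(y_reg_true, y_reg_pred))
    print("R2:", r2_score(y_reg_true, y_reg_pred))
    print("MAE:", mean_absolute_error(y_reg_true, y_reg_pred))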

    For the binary classification (binary:logistic) task, we used the HIGGS dataset in CSV format. The training partition of the dataset has 9,348,181 rows and 28 features. The number of rounds used was 1,000. The following table summarizes the results.

    Multi-GPU training with Dask
    Instance type    GPUs per instance    Number of instances    Billing time per instance (s)    Accuracy %    F1 %     ROC AUC %
    g4dn.2xlarge     1                    1                      343                              75.97         77.61    84.34
    g4dn.4xlarge     1                    1                      413                              76.16         77.75    84.51
    g4dn.8xlarge     1                    1                      413                              76.16         77.75    84.51
    g4dn.12xlarge    4                    1                      157                              76.16         77.74    84.52

    For the regression (reg:squarederror) task, we used the NYC green taxi dataset (with some modifications) in Parquet format. The training partition of the dataset has 72,921,051 rows and 8 features. The number of rounds used was 500. The following table shows the results.

    Multi-GPU training with Dask
    Instance type    GPUs per instance    Number of instances    Billing time per instance (s)    MSE      R2       MAE
    g4dn.2xlarge     1                    1                      775                              21.92    0.7787   2.43
    g4dn.4xlarge     1                    1                      770                              21.92    0.7787   2.43
    g4dn.8xlarge     1                    1                      705                              21.92    0.7787   2.43
    g4dn.12xlarge    4                    1                      253                              21.93    0.7787   2.44

    Model quality metrics are similar between the multi-GPU (Dask) training variant and the existing training variant. Model quality remains consistent when using a distributed configuration with multiple instances or GPUs.

    Conclusion

    In this post, we showed how you can use different combinations of instance types and instance counts for distributed GPU training with SageMaker XGBoost. For most use cases, you can use single-GPU instances; this option covers a wide range of scenarios and is very efficient. You can use multi-GPU instances to train on large datasets with many rounds, which can deliver fast training with a small number of instances. Overall, you can use SageMaker XGBoost's distributed GPU setup to speed up your XGBoost training immensely.

    To learn more about distributed training using SageMaker and Dask, see Amazon SageMaker built-in LightGBM now offers distributed training using Dask.


    About the authors

    Dheeraj Thakur is a solutions architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and loves building and experimenting in the analytics and AI/ML space.

    Dewan Chowdhury is a software development engineer with Amazon Web Services. He works on Amazon SageMaker algorithms and JumpStart offerings. In addition to building AI/ML infrastructure, he is also passionate about building scalable distributed systems.

    Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric spatio-temporal clustering. He has published numerous papers at ACL, ICDM, and KDD conferences, and in the Royal Statistical Society: Series A journal.

