The AI Book

Customizing a Cloud-Based Machine Learning Environment — Part 2 | by Chaim Rand | May 2023


    Additional solutions to increase your development flexibility

    Chaim Rand
    Towards Data Science
    Photo by Murillo Gomez on Unsplash

This is the second part of a two-part post on customizing your cloud-based AI model training environment. The first part, a prerequisite for this one, introduced the conflict that can arise between the desire to use pre-built, purpose-designed training environments and the requirement that we be able to adapt the environment to the needs of our project. The key to discovering potential customization opportunities is a deep understanding of the end-to-end flow of running a training workload in the cloud. We described this flow for the Amazon SageMaker managed training service while highlighting the importance of analyzing the publicly available underlying code. We then presented the first customization method, installing pip package dependencies at the very start of training, and showed its limitations.

    In this post, we present two additional methods. Both methods involve creating our own Docker image, but they are fundamentally different in their approach. The first method uses the image provided by the cloud service and expands it according to the needs of the project. The second takes a user-defined (cloud agnostic) Docker image and extends it to support training in the cloud. As we’ll see, each has its pros and cons, and the best option will depend heavily on the details of your project.

Building a fully functional, performance-optimized Docker image for training on cloud-based GPUs can be painful, requiring you to navigate many intertwined hardware and software dependencies. It is even more difficult to do this across a variety of training applications and hardware platforms. Rather than trying to do it ourselves, our first choice will always be to use a pre-built image created for us by our cloud service provider. If we need to customize this image, we simply create a new Dockerfile that extends the official image and adds the required dependencies.

The AWS Deep Learning Containers (DLC) GitHub repository contains instructions for extending an official AWS DLC. This involves logging in to the Deep Learning Containers image repository to pull the base image, building the extended image, and then uploading it to your Amazon Elastic Container Registry (ECR) account.
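As a rough sketch, that pull-extend-push workflow maps to shell commands along these lines. The DLC registry account 763104351884 and region us-east-1 match the example below, while the target repository name and tag are placeholders; check the AWS DLC documentation for the exact commands for your region:

```shell
# log in to the public DLC registry so docker can pull the base image
aws ecr get-login-password --region us-east-1 | docker login --username AWS \
    --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

# build the extended image from our Dockerfile
docker build -t extended-dlc:latest .

# log in to your own ECR account, then tag and push the result
aws ecr get-login-password --region us-east-1 | docker login --username AWS \
    --password-stdin <account-number>.dkr.ecr.us-east-1.amazonaws.com
docker tag extended-dlc:latest <account-number>.dkr.ecr.us-east-1.amazonaws.com/<tag>
docker push <account-number>.dkr.ecr.us-east-1.amazonaws.com/<tag>
```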

    The following code block shows how to extend the official AWS DLC from our SageMaker example (Part 1). We show three types of extensions:

1. Linux package: We install Nvidia Nsight Systems for advanced GPU profiling of our training workloads.
2. Conda package: We install the s5cmd conda package, which we use to pull data files from cloud storage.
3. pip package: We install a specific version of the opencv-python pip package.
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker

    # install nsys
    ADD https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/NsightSystems-linux-cli-public-2023.1.1.127-3236574.deb ./
    RUN apt install -y ./NsightSystems-linux-cli-public-2023.1.1.127-3236574.deb

# install s5cmd
    RUN conda install -y s5cmd

    # install opencv
    RUN pip install opencv-python==4.7.0.72

For more information on extending the official AWS DLC, including how to upload the resulting image to ECR, see here. The code block below shows how to modify the training job deployment script to use the extended image:

    from sagemaker.pytorch import PyTorch

# define the training job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./source_dir',
    role='<arn role>',
    image_uri='<account-number>.dkr.ecr.us-east-1.amazonaws.com/<tag>',
    job_name='demo',
    instance_type='ml.g5.xlarge',
    instance_count=1
)

A similar option for customizing an official image, assuming you have access to its corresponding Dockerfile, is simply to make the desired edits to the Dockerfile and build the image from scratch. This option is documented here for AWS DLCs. Note, however, that although it is based on the same Dockerfile, the resulting image may differ due to differences in build environments and updated package versions.

Customizing your environment by extending the official Docker image is a great way to get the most out of a fully functional, fully verified, cloud-optimized training environment predefined by the cloud service, while retaining the freedom and flexibility to make the additions and adaptations you need. However, this option also has its limitations, as we will show with an example.

    Training in a user-defined Python environment

For various reasons, you may want to be able to train in a user-defined Python environment. This may be for reproducibility, platform independence, safety/security/compliance considerations, or some other purpose. One option you might consider would be to extend the official Docker image with your own custom Python environment; that way, you would at least be able to take advantage of the platform-specific installations and image optimizations. However, this can be difficult if your intended use relies on Python-based automation. For example, in a managed training environment, the Dockerfile ENTRYPOINT runs a Python script that performs all kinds of actions, including downloading the source code directory from cloud storage, installing Python dependencies, running the user-defined training script, and more. This Python script resides in the predefined Python environment of the official Docker image. Programming the automation script to run the training script in a separate Python environment is possible, but it may require manual changes to the predefined environment and can get very messy. In the next section, we present a cleaner way to do this.
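To make the cross-environment launch concrete, here is a minimal sketch of the core step such an automation script has to perform: invoking the training script under a different interpreter via a subprocess. The interpreter path for the target conda environment is hypothetical; in the demo at the bottom we let the current interpreter stand in for it.

```python
import os
import subprocess
import sys
import tempfile

def run_in_env(python_path, script, args=()):
    """Launch a script under a different interpreter, e.g. the python of
    another conda environment such as /opt/conda/envs/myenv/bin/python
    (hypothetical path)."""
    cmd = [python_path, script, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True)

# demo: write a tiny stand-in training script and run it; here the current
# interpreter plays the role of the target environment's python
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print('training started')")
    script_path = f.name
result = run_in_env(sys.executable, script_path)
print(result.stdout.strip())  # training started
os.unlink(script_path)
```

This covers only the launch itself; the real automation script would still need to replicate the code download and dependency installation steps inside the target environment, which is exactly the messiness the next section avoids.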

The final scenario we will cover is when you need to run in a specific environment defined by your own Docker image. As before, the motivation for this can be regulatory, or the desire to run the same image in the cloud as you do locally ("on-prem"). Some cloud services allow you to bring your own user-defined image and adapt it for use in the cloud. In this section, we show two ways in which Amazon SageMaker supports this.

    BYO Option 1: SageMaker Training Toolkit

The first option, documented here, allows you to add the specialized (managed) training launch flow described in Part 1 to your own Python environment. This essentially allows you to train with your custom image in SageMaker just as you would with an official image. In particular, you can reuse the same image for multiple projects/experiments and rely on the SageMaker APIs to download the experiment-specific code when the training environment starts up (as described in Part 1). You do not need to create and upload a new image every time you change the training code.

    The code block below shows how to take a custom image and enhance it with the SageMaker training toolkit, following the instructions described here.

    FROM user_defined_docker_image

    RUN echo "conda activate user_defined_conda_env" >> ~/.bashrc
    SHELL ["/bin/bash", "--login", "-c"]

    ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
RUN conda activate user_defined_conda_env \
    && pip install --no-cache-dir -U sagemaker-pytorch-training sagemaker-training

# sagemaker uses jq to compile executable
RUN apt-get update \
    && apt-get -y upgrade --only-upgrade systemd \
    && apt-get install -y --allow-change-held-packages --no-install-recommends jq

    # SageMaker assumes conda environment is in Path
    ENV PATH /opt/conda/envs/user_defined_conda_env/bin:$PATH

    # delete entry point and args if provided by parent Dockerfile
    ENTRYPOINT []
    CMD []

    BYO Option 2: Configure the Entrypoint

The second option, documented here, allows you to train in SageMaker in a user-defined Docker environment with zero changes to the Docker image. All that is required is to explicitly set the entrypoint of the training Docker container. One way to do this (as documented here) is to pass the ContainerEntrypoint and/or ContainerArguments parameters of the AlgorithmSpecification in the API request. Unfortunately, as of this writing, this option is not supported by the SageMaker Python API (version 2.146.1). However, we can enable it easily by extending the SageMaker Session class, as shown in the code block below:

    from sagemaker.session import Session

# customized Session class that supports container entrypoint settings
class SessionEx(Session):
    def __init__(self, **kwargs):
        self.user_entrypoint = kwargs.pop('user_entrypoint', None)
        self.user_arguments = kwargs.pop('user_arguments', None)
        super(SessionEx, self).__init__(**kwargs)

    def _get_train_request(self, **kwargs):
        train_request = super(SessionEx, self)._get_train_request(**kwargs)
        if self.user_entrypoint:
            train_request["AlgorithmSpecification"]["ContainerEntrypoint"] = \
                [self.user_entrypoint]
        if self.user_arguments:
            train_request["AlgorithmSpecification"]["ContainerArguments"] = \
                self.user_arguments
        return train_request

    from sagemaker.pytorch import PyTorch

# create a session with a user-defined entrypoint and arguments;
# SageMaker will run: docker run --entrypoint python <user image> path2file.py
sm_session = SessionEx(user_entrypoint='python',
                       user_arguments=['path2file.py'])

# define the training job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./source_dir',
    role='<arn role>',
    image_uri='<account-number>.dkr.ecr.us-east-1.amazonaws.com/<tag>',
    job_name='demo',
    instance_type='ml.g5.xlarge',
    instance_count=1,
    sagemaker_session=sm_session
)
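Because the interesting part of this pattern is just injecting two fields into the training request dictionary, it can be illustrated without a SageMaker installation. The sketch below mocks the base request with a hypothetical stand-in function; only the AlgorithmSpecification field names come from the actual CreateTrainingJob API.

```python
# Standalone mock (no sagemaker dependency) of the request-injection pattern.
def base_train_request():
    # hypothetical stand-in for what Session._get_train_request returns
    return {"AlgorithmSpecification": {"TrainingImage": "<image-uri>",
                                       "TrainingInputMode": "File"}}

def inject_entrypoint(request, entrypoint=None, arguments=None):
    # add the container entrypoint/arguments fields only when provided,
    # mirroring the conditional logic in SessionEx above
    spec = request["AlgorithmSpecification"]
    if entrypoint:
        spec["ContainerEntrypoint"] = [entrypoint]
    if arguments:
        spec["ContainerArguments"] = arguments
    return request

req = inject_entrypoint(base_train_request(), "python", ["path2file.py"])
print(req["AlgorithmSpecification"]["ContainerEntrypoint"])  # ['python']
```

The existing fields of the request are left untouched, which is why subclassing Session and post-processing the parent's request is a low-risk way to paper over the missing SDK support.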

    Optimizing your Docker image

One of the downsides of the BYO options is that you miss out on the specializations of the official pre-built images. You can manually and selectively re-introduce some of them into your custom image. For example, the SageMaker documentation contains detailed instructions for adding Amazon EFA support. Moreover, you always have the option of browsing the publicly available Dockerfiles and picking out the pieces you want.

In this two-part post, we discussed various methods for customizing your cloud-based training environment. The methods we chose were intended to demonstrate solutions to different types of use cases; in practice, the best solution will depend directly on the needs of your project. You may decide to create a single custom Docker image for all your training experiments and combine it with the option of installing experiment-specific dependencies (the first method). You may find that another method not covered here, such as tweaking part of the sagemaker Python package, better suits your needs. The bottom line is that when you are faced with the need to customize your training environment, you have options; and if the standard options we have listed are not enough, don't despair, get creative!
