Additional solutions to increase your development flexibility
This is the second part of a two-part post on customizing your cloud-based AI model training environment. In the first part, a prerequisite for this one, we introduced the tension that can arise between the desire to use pre-built, purpose-designed training environments and the need to adapt the environment to the requirements of our project. The key to discovering customization opportunities is a deep understanding of the end-to-end flow of running a training workload in the cloud. We described this flow for the Amazon SageMaker managed training service, highlighting the importance of analyzing the publicly available underlying code. We then presented the first customization method, installing pip package dependencies at the very start of training, and showed its limitations.
In this post, we present two additional methods. Both involve creating our own Docker image, but they differ fundamentally in approach. The first takes the image provided by the cloud service and extends it according to the needs of the project. The second takes a user-defined (cloud-agnostic) Docker image and extends it to support training in the cloud. As we will see, each has its pros and cons, and the best option will depend heavily on the details of your project.
Building a fully functional, performance-optimized Docker image for training on a cloud-based GPU can be painful, requiring you to navigate many intertwined HW and SW dependencies. Doing this for a variety of training applications and HW platforms is harder still. Rather than attempting it ourselves, our first choice will always be to use a pre-built image created for us by our cloud service provider. If we need to customize this image, we simply create a new Dockerfile that extends the official image and adds the required dependencies.
The AWS Deep Learning Containers (DLC) GitHub repository contains instructions for extending an official AWS DLC. This requires logging in to the Deep Learning Containers image repository to pull the image, building the extended image, and then uploading it to your Amazon Elastic Container Registry (ECR) account.
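DLC training-image URIs follow a predictable naming scheme (registry account, region, framework name, and build tags). The sketch below assembles one from its components purely as an illustration; the helper function is ours, and in practice the SageMaker Python SDK's `sagemaker.image_uris.retrieve` can look up the correct URI for you.

```python
# Hypothetical helper illustrating how AWS DLC GPU training-image URIs are
# composed. The component values below are taken from the Dockerfile example
# later in this post.

def dlc_training_image_uri(account: str, region: str, framework: str,
                           version: str, py: str, cuda: str, ubuntu: str) -> str:
    """Assemble a GPU training DLC URI from its naming-scheme components."""
    return (f"{account}.dkr.ecr.{region}.amazonaws.com/{framework}-training:"
            f"{version}-gpu-{py}-{cuda}-{ubuntu}-sagemaker")

uri = dlc_training_image_uri("763104351884", "us-east-1", "pytorch",
                             "1.13.1", "py39", "cu117", "ubuntu20.04")
print(uri)
```

This is only a convenience for reasoning about the URI structure; always take the authoritative image list from the DLC repository documentation.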
The following code block shows how to extend the official AWS DLC from our SageMaker example (Part 1). We show three types of extensions:
- Linux package: We install Nvidia Nsight Systems for advanced GPU profiling of our training workloads.
- Conda package: We install the s5cmd conda package, which we use to pull data files from cloud storage.
- pip package: We install a specific version of the opencv-python pip package.
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker

# install nsys
ADD https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/NsightSystems-linux-cli-public-2023.1.1.127-3236574.deb ./
RUN apt install -y ./NsightSystems-linux-cli-public-2023.1.1.127-3236574.deb

# install s5cmd
RUN conda install -y s5cmd

# install opencv
RUN pip install opencv-python==4.7.0.72
For more information on extending an official AWS DLC, including how to upload the resulting image to ECR, see here. The code block below shows how to modify the training job deployment script to use the extended image:
from sagemaker.pytorch import PyTorch

# define the training job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./source_dir',
    role='<arn role>',
    image_uri='<account-number>.dkr.ecr.us-east-1.amazonaws.com/<tag>',
    job_name='demo',
    instance_type='ml.g5.xlarge',
    instance_count=1
)
A similar option for customizing an official image, assuming you have access to its corresponding Dockerfile, is to make the desired edits directly in the Dockerfile and build the image from scratch. This option is documented here for AWS DLCs. Note, however, that even though it is based on the same Dockerfile, the resulting image may differ due to differences in build environments and updated package versions.
Customizing your environment by extending the official Docker image is a great way to get the most out of a fully functional, fully verified, cloud-optimized training environment pre-built by the cloud service, while giving you the freedom and flexibility to make the additions and adaptations you need. However, this option also has its limitations, as we will show with an example.
Training in a user-defined Python environment
For various reasons, you may want the ability to train in a user-defined Python environment. This may be for reproducibility, platform independence, safety/security/compliance considerations, or some other purpose. One option you might consider is to extend the official Docker image with your own custom Python environment. That way, you would at least be able to take advantage of the platform-specific installations and optimizations of the official image. However, this can be difficult when the managed flow relies on Python-based automation. For example, in the SageMaker managed training environment, the Docker image's ENTRYPOINT runs a Python script that performs all kinds of actions, including downloading the source code directory from cloud storage, installing Python dependencies, running the user-defined training script, and more. This Python script resides in the predefined Python environment of the official Docker image. Programming this automated flow to run the training script in a separate, user-defined Python environment is possible, but may require manual changes to the predefined environment and can get quite messy. In the next section, we present a cleaner way to do this.
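To make the difficulty concrete, the sketch below emulates, in plain Python with no SageMaker dependency, the kind of launch sequence such an ENTRYPOINT script performs: fetch the source directory, install its dependencies, then run the user's training script. The helper name and paths are our assumptions for illustration, not SageMaker's actual code.

```python
from typing import List, Optional

# Illustration only: a simplified model of the launch sequence a managed
# training ENTRYPOINT script performs. The function name and the container
# code directory are hypothetical.

def build_launch_commands(source_uri: str, entry_point: str,
                          requirements: Optional[str] = None) -> List[str]:
    code_dir = "/opt/ml/code"  # assumed container code directory
    cmds = [f"aws s3 cp {source_uri} {code_dir} --recursive"]
    if requirements:
        # dependencies land in whichever Python environment runs this script,
        # which is why swapping in a separate user environment gets messy
        cmds.append(f"pip install -r {code_dir}/{requirements}")
    cmds.append(f"python {code_dir}/{entry_point}")
    return cmds

for cmd in build_launch_commands("s3://my-bucket/source_dir",
                                 "train.py", "requirements.txt"):
    print(cmd)
```

The point of the sketch is the coupling: every step, including the `pip install`, runs inside the image's predefined Python environment, so redirecting only the final step into a custom environment requires invasive changes.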
The final scenario we’ll cover is when you need to run in a specific environment defined by your own Docker image. As before, the reason for this may be regulatory, or simply a desire to run the same image in the cloud as you do locally (“on-prem”). Some cloud services allow you to bring your own user-defined image and adapt it for use in the cloud. In this section, we show two ways in which Amazon SageMaker supports this.
BYO Option 1: SageMaker Training Toolkit
The first option, documented here, allows you to add the specialized (managed) training launch flow described in Part 1 to your own Docker image. This essentially allows you to train with your custom image in SageMaker just as you would with an official image. In particular, you can reuse the same image for multiple projects/experiments and rely on the SageMaker APIs to download the experiment-specific code when starting the training environment (as described in Part 1). You don’t need to build and upload a new image every time you change the training code.
The code block below shows how to take a custom image and enhance it with the SageMaker training toolkit, following the instructions described here.
FROM user_defined_docker_image

RUN echo "conda activate user_defined_conda_env" >> ~/.bashrc
SHELL ["/bin/bash", "--login", "-c"]
ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
RUN conda activate user_defined_conda_env \
&& pip install --no-cache-dir -U sagemaker-pytorch-training sagemaker-training
# sagemaker requires the jq utility
RUN apt-get update \
&& apt-get -y upgrade --only-upgrade systemd \
&& apt-get install -y --allow-change-held-packages --no-install-recommends \
jq
# SageMaker assumes the conda environment is in PATH
ENV PATH /opt/conda/envs/user_defined_conda_env/bin:$PATH
# delete entry point and args if provided by parent Dockerfile
ENTRYPOINT []
CMD []
BYO Option 2: Configure the Entrypoint
The second option, documented here, allows you to train in SageMaker with a user-defined Docker environment with zero changes to the Docker image. All that is required is to explicitly configure the ENTRYPOINT of the Docker container. One way to do this (as documented here) is to pass the ContainerEntrypoint and/or ContainerArguments parameters of the AlgorithmSpecification in the API request. Unfortunately, as of this writing, this option is not supported by the SageMaker Python API (version 2.146.1). However, we can enable it easily by extending the SageMaker Session class, as shown in the code block below:
from sagemaker.session import Session

# customized session class that supports adding container entrypoint settings
class SessionEx(Session):
    def __init__(self, **kwargs):
        self.user_entrypoint = kwargs.pop('user_entrypoint', None)
        self.user_arguments = kwargs.pop('user_arguments', None)
        super(SessionEx, self).__init__(**kwargs)

    def _get_train_request(self, **kwargs):
        train_request = super(SessionEx, self)._get_train_request(**kwargs)
        if self.user_entrypoint:
            train_request["AlgorithmSpecification"]["ContainerEntrypoint"] = \
                [self.user_entrypoint]
        if self.user_arguments:
            train_request["AlgorithmSpecification"]["ContainerArguments"] = \
                self.user_arguments
        return train_request
from sagemaker.pytorch import PyTorch

# create a session with a user defined entrypoint and arguments
# SageMaker will run 'docker run --entrypoint python <user image> path2file.py'
sm_session = SessionEx(user_entrypoint='python',
                       user_arguments=['path2file.py'])

# define the training job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./source_dir',
    role='<arn role>',
    image_uri='<account-number>.dkr.ecr.us-east-1.amazonaws.com/<tag>',
    job_name='demo',
    instance_type='ml.g5.xlarge',
    instance_count=1,
    sagemaker_session=sm_session
)
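The effect of the extended session can be reasoned about without any SageMaker dependency: the sketch below reproduces, on a plain dictionary, the two fields the SessionEx override injects into the low-level training request. The helper function is ours, for illustration only.

```python
# Illustration only: the fields that a SessionEx-style override adds to the
# AlgorithmSpecification section of the low-level training request.

def inject_entrypoint(train_request, entrypoint=None, arguments=None):
    spec = train_request.setdefault("AlgorithmSpecification", {})
    if entrypoint:
        # the API expects the entrypoint as a list of strings
        spec["ContainerEntrypoint"] = [entrypoint]
    if arguments:
        spec["ContainerArguments"] = list(arguments)
    return train_request

request = inject_entrypoint({"AlgorithmSpecification": {"TrainingImage": "<tag>"}},
                            entrypoint="python", arguments=["path2file.py"])
print(request["AlgorithmSpecification"])
```

Viewed this way, the customization is just two extra keys in the request dictionary, which is why a thin Session subclass is enough to enable it.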
Optimizing your Docker image
One of the downsides of the BYO options is that you miss out on the specializations of the official pre-built images. You can manually and selectively re-introduce some of them into your custom image. For example, the SageMaker documentation contains detailed instructions for integrating Amazon EFA support. Moreover, you always have the option to browse the publicly available Dockerfiles and borrow the pieces you need.
In this two-part post, we have discussed various methods for customizing your cloud-based training environment. The methods we chose were intended to demonstrate solutions for different types of use cases. In practice, the best solution will depend directly on the needs of your project. You may decide to create a single custom Docker image for all your training experiments and combine this with the option to install experiment-specific dependencies (using the first method). You may find that another method not covered here, such as one that involves tweaking portions of the SageMaker Python training toolkit, better suits your needs. The bottom line is that when you are faced with the need to customize your training environment, you have options; and if the standard options we’ve listed aren’t enough, don’t despair, get creative!