We are thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability allows for the seamless addition of SageMaker HyperPod managed compute to EKS clusters, using automated node and job resiliency features for foundation model (FM) development.

FMs are typically trained on large-scale compute clusters with hundreds or thousands of accelerators. Under such circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of these interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.

Since its inception, SageMaker HyperPod was designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With the EKS support in HyperPod, you can now also benefit from the resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using the HyperPod compute and managed Kubernetes control plane on the EKS cluster.

AI startups like Observea and Articul8, and enterprises like Thomson Reuters use this new feature set to manage their ML model development lifecycle:

“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce time we spent for undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”

– Observea

“As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”

– Articul8 AI

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.

The post is organized into the following three sections:

Overview of Amazon EKS support in SageMaker HyperPod – This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introducing three key resiliency features HyperPod compute provides on the EKS cluster. Additionally, this section explains how HyperPod provides a smooth developer experience for admins and scientists.
HyperPod cluster setup and node resiliency features – This section provides a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, emphasizing how its built-in resiliency features provide infrastructure stability. This section is especially beneficial for admins.
Training job resiliency with the job auto resume functionality – In this section, we demonstrate how scientists can submit and manage their distributed training jobs using either the native Kubernetes CLI (kubectl) or optionally the new HyperPod CLI (hyperpod) with automatic job recovery enabled.

Overview of EKS support in SageMaker HyperPod

This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides smooth user experiences for admins and scientists.

Architecture overview

Amazon EKS support in HyperPod uses a 1-to-1 mapping between an EKS cluster (serving as the Kubernetes control plane) and a HyperPod compute (attached as a group of worker nodes). This architecture involves three virtual private clouds (VPCs), each hosting different types of resources:

Amazon EKS VPC – An AWS managed VPC hosts the EKS control plane. This VPC doesn’t appear in the customer account. Amazon EKS creates a highly available endpoint for the managed Kubernetes API server that you use to communicate with your cluster (using tools like kubectl). The managed endpoint uses Network Load Balancer to load balance Kubernetes API servers.
HyperPod VPC – An AWS managed VPC hosts the HyperPod compute. This VPC doesn’t appear in the customer account. The nodes connect to the EKS control plane through a cross-account elastic network interface (ENI).
SageMaker user VPC – A user-managed VPC in your account hosts resources such as Amazon FSx for Lustre, which is optionally associated with Amazon Simple Storage Service (Amazon S3) using a data repository association.

Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services on your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.

The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.

HyperPod-managed resiliency features

Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions:

Deep health checks – This is a managed health check that stress tests GPU and AWS Trainium instances and performs Elastic Fabric Adapter (EFA) checks. These checks can be run during the cluster creation, update, or node replacement phases, and can be enabled or disabled through HyperPod APIs.
Automated node recovery – HyperPod performs managed, lightweight, and non-invasive checks, coupled with automated node replacement capability. The HyperPod monitoring agent continuously monitors and detects potential issues, including memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory (OOM) crashes. Based on the underlying issue, the monitoring agent either replaces or reboots the node.
Job auto resume – SageMaker HyperPod provides a job auto resume capability using the Kubeflow Training Operator for PyTorch to provide recovery and continuation of training jobs in the event of interruptions or failures. The extension makes sure the job waits and restarts after the node is replaced.

User experiences

In addition to the aforementioned managed resiliency features, SageMaker HyperPod provides smooth user experiences for both admins and scientists that are critical for managing a large cluster and running large-scale training jobs on them as part of the Amazon EKS integration:

Admin experience – SageMaker HyperPod provides APIs and a console experience to create and manage node groups in the EKS cluster, along with the ability to SSH into the cluster nodes. SageMaker HyperPod also provides a mechanism to install additional dependencies on the cluster nodes using lifecycle scripts, and an API-based mechanism to provide cluster software updates and improve overall observability.
Scientist experience – Along with enabling scientists to train FMs using Amazon EKS as the orchestrator, SageMaker HyperPod provides additional capabilities that simplify model training. With the HyperPod CLI, scientists can submit training jobs by providing a .yaml file and manage jobs (list, describe, view, cancel) without needing to use kubectl. Scientists can use open source tools like Kueue (a Kubernetes tool for job queuing) and adjacent SageMaker capabilities like managed MLflow to manage their experiments and training runs. Scientists can also access native SageMaker distributed training libraries that provide performance improvements of up to 20%. You can also enable SageMaker HyperPod compute with Amazon EKS support using third-party tools like KubeRay, which runs on the Kubernetes API. This allows you to bring your preferred job submission and management capabilities used with other Kubernetes clusters into your HyperPod environment.

HyperPod compute setup and node resiliency features

In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.

Prerequisites

You need to have the following in place prior to the HyperPod compute deployment:

EKS cluster – You can associate HyperPod compute with an existing EKS cluster that satisfies the set of prerequisites. Alternatively, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the architecture guide for step-by-step setup instructions.
Custom resources – Running multi-node distributed training requires various components, such as device plugins, CSI drivers, and Training Operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. The HyperPod Helm charts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to the developer guide for installation.

HyperPod compute setup

With the aforementioned resources successfully deployed, you’re now prepared to create the HyperPod compute. The cluster configuration is specified using a JSON file; the following code provides an example:

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "NodeRecovery": "Automatic"
}
EOL

The provided configuration file contains two key highlights:

"OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
"NodeRecovery": "Automatic" – Enables HyperPod's automated node recovery functionality
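
If you create several clusters or instance groups, you may prefer to generate the configuration file rather than edit it by hand. The following is a minimal sketch, not an official tool: the helper name build_cluster_config and its parameters are hypothetical, and the keys and values simply mirror the JSON example above.

```python
import json


def build_cluster_config(cluster_name, eks_cluster_arn, execution_role,
                         bucket_name, security_group, subnet_id,
                         instance_type="ml.p5.48xlarge", instance_count=4):
    """Return a dict mirroring the cluster-config.json example (hypothetical helper)."""
    return {
        "ClusterName": cluster_name,
        "Orchestrator": {"Eks": {"ClusterArn": eks_cluster_arn}},
        "InstanceGroups": [
            {
                "InstanceGroupName": "worker-group-1",
                "InstanceType": instance_type,
                "InstanceCount": instance_count,
                "LifeCycleConfig": {
                    "SourceS3Uri": f"s3://{bucket_name}",
                    "OnCreate": "on_create.sh",
                },
                "ExecutionRole": execution_role,
                "ThreadsPerCore": 1,
                # Run deep health checks when instances are added.
                "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"],
            }
        ],
        "VpcConfig": {
            "SecurityGroupIds": [security_group],
            "Subnets": [subnet_id],
        },
        # Enable automated node recovery.
        "NodeRecovery": "Automatic",
    }


if __name__ == "__main__":
    # All argument values below are placeholders for illustration only.
    config = build_cluster_config(
        cluster_name="ml-cluster",
        eks_cluster_arn="<eks-cluster-arn>",
        execution_role="<execution-role-arn>",
        bucket_name="<bucket-name>",
        security_group="<security-group-id>",
        subnet_id="<subnet-id>",
    )
    print(json.dumps(config, indent=4))
```

Redirecting the printed output to cluster-config.json gives a file you can pass to the create-cluster command in the same way as the hand-written one.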

You can create a HyperPod compute with the following AWS CLI command (you need version 2.17.47 or newer):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json

{
    "ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

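-----------------------------------------------------------------------------------------------------------------------
|                                                    ListClusters                                                     |
+---------------------------------------------------------------------------------------------------------------------+
||                                                 ClusterSummaries                                                  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||                           ClusterArn                           | ClusterName  | ClusterStatus  |  CreationTime    ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||  arn:aws:sagemaker:us-east-2:111111111111:cluster/wccy5z4n4m49 |  ml-cluster  |  Creating      |  1723724079.337  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|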

Alternatively, you can verify the cluster status through the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.
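
Rather than re-running the status command by hand, you can poll until the cluster is ready. The following is a sketch under stated assumptions: the function name wait_for_status is hypothetical, and the default target status "InService" and failure status "Failed" are assumptions about the cluster status values, so adjust them to what your cluster reports. The status lookup is passed in as a callable so the polling logic stays independent of any SDK.

```python
import time


def wait_for_status(get_status, target="InService", failed=("Failed",),
                    interval_s=30.0, max_polls=60, sleep=time.sleep):
    """Poll get_status() until it returns target.

    Raises RuntimeError if a status listed in `failed` is seen, and
    TimeoutError if the target is not reached within max_polls polls.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"cluster entered status {status!r}")
        sleep(interval_s)
    raise TimeoutError(f"status did not reach {target!r} after {max_polls} polls")
```

With boto3 installed, get_status could be, for example, lambda: boto3.client("sagemaker").describe_cluster(ClusterName="ml-cluster")["ClusterStatus"].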

Node resiliency features

To gain further insight into the instances, you can use kubectl get nodes and examine the node labels. The sagemaker.amazonaws.com/node-health-status label reveals the life stage of each node. For instance, nodes with the ml.m5.2xlarge instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks. The following code shows an example:

$ kubectl get nodes --show-labels=true
NAME                         …  LABELS
hyperpod-i-023cfe933b3b34369 …  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  …
hyperpod-i-045961b6424401838 …  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, …
hyperpod-i-074b81fdb5bf52e19 …  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, …
hyperpod-i-0ae97710b3033cdb1 …  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  …
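
On a large cluster, reading labels by eye becomes impractical. As an illustration, the following sketch groups node names by the value of the sagemaker.amazonaws.com/node-health-status label. It assumes the input is the parsed output of kubectl get nodes -o json; the function name nodes_by_health and the "Unknown" bucket for unlabeled nodes are choices made for this example.

```python
from collections import defaultdict

HEALTH_LABEL = "sagemaker.amazonaws.com/node-health-status"


def nodes_by_health(node_list):
    """Group node names by the value of the node-health-status label.

    node_list is the parsed JSON from `kubectl get nodes -o json`.
    Nodes without the label are grouped under "Unknown".
    """
    groups = defaultdict(list)
    for item in node_list.get("items", []):
        metadata = item.get("metadata", {})
        status = metadata.get("labels", {}).get(HEALTH_LABEL, "Unknown")
        groups[status].append(metadata.get("name"))
    return dict(groups)
```

You could feed it with json.load(sys.stdin) after piping the kubectl output into a small script, then print the size of each group to see how many nodes are still unschedulable.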

The deep health check logs are stored in the CloudWatch log group at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>. The log streams are named DeepHealthCheckResults/<log_stream_id>. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason. For example:

# Example 1
{
    "level": "error",
    "ts": "2024-08-15T21:15:22Z",
    "msg": "Encountered FaultyInstance. Replace the Instance. Region: us-east-2, InstanceType: p5.48xlarge. ERROR:Bandwidth has less than threshold: Expected minimum threshold :80,NCCL Test output Bw: 30"
}
# Example 2
{
    "level": "error",
    "ts": "2024-08-15T21:15:22Z",
    "msg": "Encountered Unknownerror. Replace the Instance. Region: us-east-2, InstanceType: p5.48xlarge. ERROR: Crash detected in dcgm test"
}
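
To summarize failures across many log entries, you can pull the structured pieces out of the msg field. The following sketch is based only on the two sample messages shown here; the real message format may vary, so treat the regular expression as an assumption. The function name parse_deep_health_check is hypothetical.

```python
import json
import re

# Pattern inferred from the two sample messages; not a documented format.
_MSG_RE = re.compile(
    r"Encountered (?P<fault>\w+)\..*?"
    r"Region: (?P<region>[\w-]+),\s*"
    r"InstanceType: (?P<instance_type>[\w.]+?)\.\s*"
    r"ERROR:\s*(?P<error>.*)",
    re.DOTALL,
)


def parse_deep_health_check(record_json):
    """Extract fault, region, instance type, and error text from one log record.

    Returns None when the msg field does not match the assumed pattern.
    """
    record = json.loads(record_json)
    match = _MSG_RE.search(record.get("msg", ""))
    if match is None:
        return None
    fields = match.groupdict()
    # Collapse any line wrapping inside the error text.
    fields["error"] = " ".join(fields["error"].split())
    fields["ts"] = record.get("ts")
    return fields
```

Counting the returned fault or error values over a log stream gives a quick view of which failure reasons dominate.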

You can check the progress of the deep health check with the following values for the sagemaker.amazonaws.com/deep-health-check label on each node:

sagemaker.amazonaws.com/deep-health-check: InProgress
sagemaker.amazonaws.com/deep-health-check: Passed
sagemaker.amazonaws.com/deep-health-check: Failed

If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:

sagemaker.amazonaws.com/node-health-status: Schedulable

To replace a specific node in your cluster manually, you can modify its label.

For the complete list of resilience-related Kubernetes labels, refer to the AWS documentation.

Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch log stream:

Example log group name – /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>
Example log stream name – SagemakerHealthMonitoringAgent/<your_node_group_name>/<instance_id>

The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent. For example:

# Example 1
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "KernelDeadlock",
    "with condition details ": {
        "type": "KernelDeadlock",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932213Z",
        "reason": "KernelHasNoDeadlock",
        "message": "kernel has no deadlock"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
# Example 2
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "NvidiaErrorTerminate",
    "with condition details ": {
        "type": "NvidiaErrorTerminate",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932283Z",
        "reason": "NvidiaNoErrorRequiredTerminate",
        "message": "Nvidia no error required terminate"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
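
Both sample events report a status of "False", which, judging by reasons such as KernelHasNoDeadlock, means the problem condition is absent. To surface only the events that report a problem, you can filter on that field. The following sketch assumes that a status of "True" marks an active condition (the usual Kubernetes node condition convention) and that each log line holds one JSON event; both are assumptions, and the function names are hypothetical.

```python
import json

# Key exactly as it appears in the sample logs, including the trailing space.
DETAILS_KEY = "with condition details "


def is_active_fault(event):
    """Return True when the event's condition status is "True" (assumed to mean a fault)."""
    return event.get(DETAILS_KEY, {}).get("status") == "True"


def active_faults(log_lines):
    """Yield (timestamp, condition type, message) for events with an active condition."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if is_active_fault(event):
            details = event[DETAILS_KEY]
            yield event.get("ts"), details.get("type"), details.get("message")
```

For example, list(active_faults(lines)) returns an empty list for a stream that contains only healthy heartbeats like the two samples.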

When the deep health checks or the health monitoring agent identify issues in a node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule to avoid scheduling pods on it, and then the node is replaced or rebooted.

You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file system up to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.

You can find the Container Insights setup guide in the Amazon EKS Support in Amazon SageMaker HyperPod Workshop.

Training job resiliency with the job auto resume functionality

In addition to infrastructure resiliency features, you can use the job auto resume capability, which uses the Kubeflow Training Operator for PyTorch, to recover and continue training jobs in the event of interruptions or failures. The job auto resume feature attempts to continue the job, whereas the HyperPod node auto recovery functionality works on resolving node failures (node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example in the awsome-distributed-training repository.

To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes the following annotations and nodeSelector:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: fsdpjob
  namespace: kubeflow
  # config for HyperPod job auto-resume
  annotations:
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
    # ...
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          # schedule pods only on nodes marked Schedulable
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
# ...

With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true" and sagemaker.amazonaws.com/job-max-retry-count: "2", SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable, ensuring that only nodes that have passed basic health checks and are available for running workloads are used for resumed jobs.
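
Because a missing annotation or node selector silently disables this behavior, it can be worth checking a manifest before you submit it. The following sketch works on a manifest that has already been loaded into a Python dict (for example, with a YAML parser of your choice). The function name auto_resume_problems is hypothetical, and the checks cover only the three settings described above.

```python
AUTO_RESUME = "sagemaker.amazonaws.com/enable-job-auto-resume"
MAX_RETRY = "sagemaker.amazonaws.com/job-max-retry-count"
HEALTH_LABEL = "sagemaker.amazonaws.com/node-health-status"


def auto_resume_problems(manifest):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    annotations = manifest.get("metadata", {}).get("annotations", {})
    if annotations.get(AUTO_RESUME) != "true":
        problems.append(f'annotation {AUTO_RESUME} is not "true"')
    retry = annotations.get(MAX_RETRY)
    # Kubernetes annotation values are strings, so the count must be quoted.
    if not (isinstance(retry, str) and retry.isdigit()):
        problems.append(f"annotation {MAX_RETRY} must be a quoted non-negative integer")
    replica_specs = manifest.get("spec", {}).get("pytorchReplicaSpecs", {})
    for role, spec in replica_specs.items():
        selector = spec.get("template", {}).get("spec", {}).get("nodeSelector", {})
        if selector.get(HEALTH_LABEL) != "Schedulable":
            problems.append(f"{role}: nodeSelector does not require {HEALTH_LABEL}=Schedulable")
    return problems
```

Running this check in a pre-submit script catches common mistakes, such as writing the retry count as an unquoted number.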

Submit the PyTorchJob using the kubectl command:

kubectl apply -f fsdp.yaml

With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issues during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob:

kubectl describe pytorchjob -n kubeflow <job-name>

In the event of a hardware failure, the Kubeflow training job restarts as follows:

Start Time: 2024-07-11T05:53:10Z
Enable job auto-resume 27

Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed
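
If you capture this output in a script, a rough way to see how often a job has restarted is to count the PyTorchJobRestarting events. The following is a simple sketch with a hypothetical function name. It counts event lines only, so it is an approximation: Kubernetes can fold repeated events into a single line (as in "x2 over 9m45s"), and the count isn't guaranteed to match the retry counter that HyperPod uses.

```python
def count_restarts(describe_output):
    """Count event lines whose whitespace-separated fields include PyTorchJobRestarting."""
    return sum(
        1
        for line in describe_output.splitlines()
        if "PyTorchJobRestarting" in line.split()
    )
```

Applied to the sample output above, the function counts the two Warning events, one for the Master replica and one for the Worker replica.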

When you submit a training job with the HyperPod CLI, you can also request the job to be auto resumed in the following way:

hyperpod start-job \
    --config-file ./config.yaml \
    --auto-resume true \
    --max-retry 2

Refer to config.yaml for the full configuration. For other CLI options, refer to the documentation in the GitHub repository.

Clean up

To delete your SageMaker HyperPod compute, use either the SageMaker console or the following AWS Command Line Interface (AWS CLI) command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion can take a few minutes. Deletion is complete when the cluster no longer appears on the SageMaker console.

Conclusion

With support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources using a familiar Kubernetes interface. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, SageMaker HyperPod automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS support further enhances this capability with deep health checks: whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check process to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and resumes training processes in the event of unexpected interruptions, involving node replacement and job resubmission.

For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod Workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you’re interested in deploying HyperPod with step-by-step commands, you can start from the aws-do-hyperpod repository.

About the authors

Keita Watanabe is a Senior GenAI Specialist Solutions Architect in the world-wide specialist organization at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect in the world-wide specialist organization at AWS. In his role, he focuses on helping customers with the orchestration and scaling of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing generative AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Manoj Ravi is a Senior Product Manager on the Amazon SageMaker team. He is passionate about building next-gen AI products and works on applications and tools to make foundation model development and deployment effortless for customers. He holds an MBA from the Haas School of Business and a master’s degree from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.
