
KT’s journey to reduce training time for a vision transformers model using Amazon SageMaker


KT Corporation is one of the largest telecommunications providers in South Korea, offering a wide range of services including fixed-line telephone, mobile communication, internet, and AI services. KT’s AI Food Tag is an AI-based dietary management solution that identifies the type and nutritional content of food in photos using a computer vision model. This vision model developed by KT relies on a model pre-trained with a large amount of unlabeled image data to analyze the nutritional content and calorie information of various foods. The AI Food Tag can help patients with chronic diseases such as diabetes manage their diets. KT used AWS and Amazon SageMaker to train this AI Food Tag model 29 times faster than before and optimize it for production deployment with a model distillation technique. In this post, we describe KT’s model development journey and success using SageMaker.

Introducing the KT project and defining the problem

The AI Food Tag model pre-trained by KT is based on the vision transformers (ViT) architecture and has more model parameters than their previous vision model in order to improve accuracy. To shrink the model size for production, KT is using a knowledge distillation (KD) technique to reduce the number of model parameters without significant impact on accuracy. With knowledge distillation, the pre-trained model is called a teacher model, and a lightweight output model is trained as a student model, as illustrated in the following figure. The lightweight student model has fewer model parameters than the teacher, which reduces memory requirements and allows for deployment on smaller, cheaper instances. The student maintains acceptable accuracy even though it is smaller by learning from the outputs of the teacher model.

The teacher model remains unchanged during KD, but the student model is trained using the output logits of the teacher model as labels to calculate loss, as sketched in the example that follows. With this KD paradigm, both the teacher and the student need to fit in a single GPU’s memory for training. KT initially used two GPUs (A100 80 GB) in their internal, on-premises environment to train the student model, but the process took about 40 days to cover 300 epochs. To accelerate training and generate a student model in less time, KT partnered with AWS. Together, the teams significantly reduced model training time. This post describes how the team used Amazon SageMaker Training, the SageMaker Data Parallelism Library, Amazon SageMaker Debugger, and Amazon SageMaker Profiler to successfully develop a lightweight AI Food Tag model.
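
The post doesn’t include KT’s actual distillation code, so the following is only a minimal sketch of one KD training step under common assumptions: the names teacher_model, student_model, and optimizer are placeholders, and the KL-divergence-on-softened-logits loss with a temperature and weighting factor is a standard KD formulation, not necessarily KT’s exact loss.

import torch
import torch.nn.functional as F

def distillation_step(teacher_model, student_model, optimizer, images, labels,
                      temperature=4.0, alpha=0.9):
    # One KD training step: the frozen teacher provides soft targets for the student.
    # temperature and alpha are illustrative values, not KT's settings.
    teacher_model.eval()
    with torch.no_grad():  # the teacher stays fixed during KD
        teacher_logits = teacher_model(images)

    student_logits = student_model(images)

    # Soft-target loss: match the student's distribution to the teacher's softened distribution
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Optional hard-label loss against the ground-truth labels
    ce_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()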

Building a distributed training environment with SageMaker

SageMaker Training is a managed machine learning (ML) training environment on AWS that provides a suite of features and tools to simplify the training experience and can be useful in distributed computing, as illustrated in the following diagram.

The distributed model training environment with SageMaker Training

SageMaker customers can also access built-in Docker images with various pre-installed deep learning frameworks and the necessary Linux, NCCL, and Python packages for model training. Data scientists or ML engineers who want to run model training can do so without the burden of configuring training infrastructure or managing Docker and the compatibility of different libraries.

During a 1-day workshop, we were able to set up a distributed training configuration based on SageMaker within KT’s AWS account, accelerate KT’s training scripts using the SageMaker Distributed Data Parallel (DDP) library, and even test a training job using two ml.p4d.24xlarge instances. In this section, we describe KT’s experience working with the AWS team and using SageMaker to develop their model.

In the proof of concept, we wanted to speed up the training job by using the SageMaker DDP library, which is optimized for AWS infrastructure during distributed training. To change from PyTorch DDP to SageMaker DDP, you simply need to declare the torch_smddp package and change the backend to smddp, as shown in the following code:

import smdistributed.dataparallel.torch.torch_smddp
import torch.distributed as dist  # standard PyTorch distributed entry point

# Initialize the process group with the SageMaker DDP backend instead of 'nccl'
dist.init_process_group(backend='smddp',
                        rank=args.rank,
                        world_size=args.world_size)

To learn more about the SageMaker DDP library, refer to SageMaker’s Data Parallelism Library.

Analyzing the causes of slow training speed with SageMaker Debugger and Profiler

The first step in optimizing and accelerating a training workload is understanding and diagnosing where bottlenecks occur. For KT’s training job, we measured the training time per iteration of the data loader, forward pass, and backward pass:

1 iter time – dataloader : 0.00053 sec, forward : 7.77474 sec, backward: 1.58002 sec
2 iter time – dataloader : 0.00063 sec, forward : 0.67429 sec, backward: 24.74539 sec
3 iter time – dataloader : 0.00061 sec, forward : 0.90976 sec, backward: 8.31253 sec
4 iter time – dataloader : 0.00060 sec, forward : 0.60958 sec, backward: 30.93830 sec
5 iter time – dataloader : 0.00080 sec, forward : 0.83237 sec, backward: 8.41030 sec
6 iter time – dataloader : 0.00067 sec, forward : 0.75715 sec, backward: 29.88415 sec
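
These timings aren’t produced by a SageMaker feature; the following is a minimal sketch of how such per-iteration measurements can be collected, assuming placeholder names (train_loader, student_model, criterion, optimizer) and using time.perf_counter() with torch.cuda.synchronize() so GPU work is attributed to the right phase. It is not KT’s actual instrumentation.

import time
import torch

t_data = time.perf_counter()
for i, (inp, target) in enumerate(train_loader, start=1):  # train_loader is a placeholder
    t0 = time.perf_counter()  # time spent waiting on the data loader
    out = student_model(inp.cuda(non_blocking=True))
    loss = criterion(out, target.cuda(non_blocking=True))
    torch.cuda.synchronize()  # wait for queued GPU work so it counts toward the forward pass
    t1 = time.perf_counter()
    loss.backward()
    torch.cuda.synchronize()
    t2 = time.perf_counter()
    print(f"{i} iter time - dataloader : {t0 - t_data:.5f} sec, "
          f"forward : {t1 - t0:.5f} sec, backward: {t2 - t1:.5f} sec")
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    t_data = time.perf_counter()  # start of the next iteration's data-loading window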

Looking at the time in the standard output for each iteration, we saw that the backward pass’s run time fluctuated significantly from iteration to iteration. This variation is unusual and can impact total training time. To find the cause of this inconsistent training speed, we first tried to identify resource bottlenecks by using the System Monitor (SageMaker Debugger UI), which lets you debug training jobs on SageMaker Training and check the status of resources such as the managed training platform’s CPU, GPU, network, and I/O within a set number of seconds.

The SageMaker Debugger UI provides detailed and essential data that can help identify and diagnose bottlenecks in a training job. Specifically, the CPU utilization line chart and the per-instance CPU/GPU utilization heat map caught our eye.

In the CPU utilization line chart, we noticed that some CPUs were being used at 100%.

The CPU utilization line chart with a CPU bottleneck

In the heat map (where darker colors indicate higher utilization), we noted that a few CPU cores had high utilization throughout the training, whereas GPU utilization wasn’t consistently high over time.

The CPU utilization heat map with a CPU bottleneck

From here, we began to suspect that one of the causes of the slow training speed was a CPU bottleneck. We reviewed the training script code to see if anything was causing the CPU bottleneck. The most suspicious part was the large value of num_workers in the data loader, so we changed this value to 0 or 1 to reduce CPU utilization, as in the sketch below. We then ran the training job again and checked the results.
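
A minimal sketch of the kind of change this involves; train_dataset, the batch size, and the commented-out "before" value of num_workers are placeholders, not KT’s actual code.

from torch.utils.data import DataLoader

# Before (hypothetical): a large num_workers spawned many loader processes
# and kept a few CPU cores at 100%.
# train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True,
#                           num_workers=32, pin_memory=True)

# After: reducing num_workers relieved the CPU bottleneck for this workload.
train_loader = DataLoader(
    train_dataset,     # placeholder dataset object
    batch_size=1024,   # placeholder batch size
    shuffle=True,
    num_workers=1,     # 0 loads data in the main process; 1 uses a single worker process
    pin_memory=True,
)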

The following screenshots show the CPU utilization line chart, GPU utilization, and heat map after mitigating the CPU bottleneck.

The CPU utilization line chart after mitigating a CPU bottleneck

The GPU utilization after mitigating a CPU bottleneck

The CPU utilization heat map after mitigating a CPU bottleneck

By simply changing num_workers, we saw a significant decrease in CPU utilization and an overall increase in GPU utilization. This was an important change that improved training speed considerably. Still, we wanted to see where we could further optimize GPU utilization. For this, we used SageMaker Profiler.

SageMaker Profiler helps identify optimization clues by providing visibility into utilization by operation, including tracking GPU and CPU utilization metrics and the kernel consumption of GPU/CPU within training scripts. It helps users understand which operations are consuming resources. First, to use SageMaker Profiler, you need to add a ProfilerConfig to the function that invokes the training job using the SageMaker SDK, as shown in the following code:

from sagemaker import ProfilerConfig, Profiler
from sagemaker.debugger import (ProfilerRule, rule_configs)

rules = [ProfilerRule.sagemaker(rule_configs.ProfilerReport())]
profiler_config = ProfilerConfig(profile_params=Profiler(cpu_profiling_duration=3600))

from sagemaker.pytorch import PyTorch

region_name = "us-west-2"
image_uri = f'763104351884.dkr.ecr.{region_name}.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker'

estimator = PyTorch(
    entry_point="train.py",           # KT's training script
    source_dir="src",
    role=role,                        # execution role, assumed to be defined earlier
    image_uri=image_uri,
    instance_count=4,
    instance_type="ml.p4d.24xlarge",
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
    profiler_config=profiler_config,  # attaches SageMaker Profiler to the training job
    hyperparameters=hyperparameters,
    sagemaker_session=sagemaker_session,
)

In the SageMaker Python SDK, you have the flexibility to add the annotate functions for SageMaker Profiler to select the code or steps in the training script that need profiling. The following is an example of the code that you should declare for SageMaker Profiler in the training scripts:

import smppy

SMProf = smppy.SMProfiler.instance()
config = smppy.Config()
config.profiler = {
    "EnableCuda": "1",
}
SMProf.configure(config)
SMProf.start_profiling()

…

with smppy.annotate("Forward"):
    student_out = student_model(inp)

with smppy.annotate("Backward"):
    loss.backward()

…

SMProf.stop_profiling()

After adding the preceding code, if you run a training job with these training scripts, you can get information about the operations consumed by the GPU kernels (as shown in the following figure) after the training runs for a period of time. In the case of KT’s training scripts, we ran it for one epoch and got the following results.

Time Spent By All GPU Kernels (1)

When we checked the top five operation consumption times of the GPU kernels in the SageMaker Profiler results, we found that for the KT training script, the most time was consumed by the matrix product operation, which is a general matrix multiplication (GEMM) operation on GPUs. With this important insight from SageMaker Profiler, we began investigating ways to accelerate these operations and improve GPU utilization.

Speeding up training time

We reviewed various ways to reduce the computation time of matrix multiplication and applied two PyTorch features.

Shard optimizer states with ZeroRedundancyOptimizer

As described in Zero Redundancy Optimizer (ZeRO), the DeepSpeed/ZeRO technique enables training a large model efficiently and with better training speed by eliminating redundancies in the memory used by the model. ZeroRedundancyOptimizer in PyTorch uses this approach of sharding the optimizer state to reduce the memory usage of each process in Distributed Data Parallel (DDP). DDP synchronizes gradients in the backward pass so that all optimizer replicas iterate over the same parameters and gradient values, but instead of every process holding the full optimizer state, each process maintains only its own shard of the optimizer state, which reduces memory usage.

To use it, you keep your existing optimizer class in optimizer_class and declare a ZeroRedundancyOptimizer with the model parameters and the learning rate as the remaining arguments.

from torch.distributed.optim import ZeroRedundancyOptimizer  # shards optimizer state across DDP ranks

student_optimizer = ZeroRedundancyOptimizer(
    student_model.parameters(),
    optimizer_class=torch.optim.AdamW,  # the original (non-sharded) optimizer class
    lr=initial_lr
)
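
One practical consequence of sharding, not covered in the original post, is that each rank holds only its own slice of the optimizer state, so saving a full optimizer checkpoint requires consolidating it first. A brief sketch, assuming torch.distributed has been initialized as dist (as in the earlier snippet):

# Gather the sharded optimizer state onto rank 0 before saving a checkpoint
student_optimizer.consolidate_state_dict(to=0)
if dist.get_rank() == 0:
    torch.save(student_optimizer.state_dict(), "student_optimizer.pt")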

Automatic mixed precision

Automatic mixed precision (AMP) uses the torch.float32 data type for some operations and torch.bfloat16 or torch.float16 for others, for the benefit of fast computation and reduced memory usage. In particular, because deep learning models are typically more sensitive to exponent bits than to fraction bits in their computations, torch.bfloat16, which keeps the same number of exponent bits as torch.float32, allows them to learn quickly with minimal loss. torch.bfloat16 only runs on instances with the NVIDIA A100 (Ampere) architecture or later, such as ml.p4d.24xlarge, ml.p4de.24xlarge, and ml.p5.48xlarge.

To apply AMP, you can declare torch.cuda.amp.autocast in the training scripts as shown in the following code and set dtype to torch.bfloat16.

# dtype must be the torch.bfloat16 dtype object, not the string "torch.bfloat16"
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    teacher = teacher_model(input_data)
    student = student_model(input_data)
    loss = loss_fn(teacher, student, target)  # distillation loss between teacher and student outputs
    loss.requires_grad_(True)
    loss.backward()
    student_optimizer.step()
    student_optimizer.zero_grad(set_to_none=True)
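
Note that because this example autocasts to torch.bfloat16, gradient scaling is generally unnecessary; if you autocast to torch.float16 instead, PyTorch’s torch.cuda.amp.GradScaler is typically used to avoid gradient underflow.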

Results in SageMaker Profiler

After applying the two features to the training scripts and running a training job for one epoch again, we checked the top five operation consumption times for the GPU kernels in SageMaker Profiler. The following figure shows our results.

Time Spent By All GPU Kernels (2)

We can see that the GEMM operation, which was at the top of the list before applying the two Torch features, has disappeared from the top five operations, replaced by the ReduceScatter operation, which typically occurs in distributed training.

Training speed results of the KT distilled model

We increased the training batch size by 128 to account for the memory savings from applying the two Torch features, resulting in a final batch size of 1152 instead of 1024. The training of the final student model was able to run 210 epochs in 1 day; the training time and speedup between KT’s internal training environment and SageMaker are summarized in the following table.

Training Environment | Training GPU spec. | Number of GPUs | Training Time (hours) | Epochs | Hours per Epoch | Reduction Ratio
KT’s internal training environment | A100 (80 GB) | 2 | 960 | 300 | 3.20 | 29
Amazon SageMaker | A100 (40 GB) | 32 | 24 | 210 | 0.11 | 1

The scalability of AWS allowed us to complete the training job 29 times faster than before, using 32 GPUs instead of two on premises. As a result, using more GPUs on SageMaker significantly reduced training time with no difference in overall training costs.

Conclusion

Park Sang-min (Vision AI Serving Technology Team Leader) from the AI2XL Lab in KT’s Convergence Technology Center commented on the collaboration with AWS to develop the AI Food Tag model:

“Recently, as there are more transformer-based models in the vision field, the model parameters and required GPU memory are increasing. We are using lightweight technology to solve this issue, and it takes a lot of time, about a month to learn once. Through this PoC with AWS, we were able to identify the resource bottlenecks with the help of SageMaker Profiler and Debugger, resolve them, and then use SageMaker’s data parallelism library to complete the training in about one day with optimized model code on four ml.p4d.24xlarge instances.”

SageMaker helped save Sang-min’s team weeks of time in model training and development.

Based on this collaboration on the vision model, AWS and the SageMaker team will continue to collaborate with KT on various AI/ML research projects to improve model development and service productivity through applying SageMaker capabilities.

To learn more about related features in SageMaker, check out the following:


About the authors

Youngjoon Choi, AI/ML Expert SA, has experienced enterprise IT in various industries such as manufacturing, high-tech, and finance as a developer, architect, and data scientist. He conducted research on machine learning and deep learning, specifically on topics like hyperparameter optimization and domain adaptation, presenting algorithms and papers. At AWS, he specializes in AI/ML across industries, providing technical validation using AWS services for distributed training/large-scale models and building MLOps. He proposes and reviews architectures, aiming to contribute to the expansion of the AI/ML ecosystem.

Jung Hoon Kim is an account SA of AWS Korea. Based on experiences in application architecture design, development, and systems modeling in various industries such as hi-tech, manufacturing, finance, and the public sector, he works on AWS Cloud journeys and workload optimization on AWS for enterprise customers.

Rock Sakong is a researcher at KT R&D. He has conducted research and development for vision AI in various fields, mainly on facial attributes (gender/glasses, hats, and so on) and face recognition technology. Currently, he is working on lightweight technology for vision models.

Manoj Ravi is a Senior Product Manager for Amazon SageMaker. He is passionate about building next-gen AI products and works on software and tools to make large-scale machine learning easier for customers. He holds an MBA from Haas School of Business and a Masters in Information Systems Management from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.
