Azure Optimized Stack with DeepSpeed for Hyperscale Model Training

2022-08-19 22:11:19 | By Mr. Valogin VG


Azure Machine Learning (AzureML) now provides an optimized stack that uses the latest NVIDIA GPU technology with Quantum InfiniBand to efficiently train and fine-tune large models like Megatron-Turing and GPT-3.

In recent years, large-scale transformer-based deep learning models trained on huge amounts of data have powered new products and a range of cognitive tasks. These models have grown in size and magnitude, and customers' needs for training and fine-tuning them have grown accordingly.

Training and fine-tuning models of this kind require a complex, distributed architecture, and setting up such an architecture involves several manual, error-prone steps. With this new optimized stack, AzureML offers a better experience in terms of usability and performance, providing an easy-to-use training pipeline. The proposed AzureML stack bundles hardware, OS, VM image, and a Docker image (with optimized PyTorch, DeepSpeed, ONNX Runtime, and other Python packages) for performance and scalability without complexity.
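As an illustration of what "a simple-to-use training pipeline" looks like in practice, a distributed training job on such a stack could be declared with an AzureML CLI v2 job spec along these lines. This is a hedged sketch: the environment, compute cluster, and file names are placeholders, not values documented for this particular stack.

```yaml
# Hypothetical AzureML CLI v2 command-job spec (names are placeholders)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  deepspeed train.py --deepspeed ds_config.json
code: ./src
environment: azureml:my-pytorch-deepspeed-env@latest  # Docker image with PyTorch, DeepSpeed, ONNX Runtime
compute: azureml:ndma100-cluster                      # NDm A100 v4 cluster (placeholder name)
resources:
  instance_count: 2                                   # number of VM nodes
distribution:
  type: pytorch
  process_count_per_instance: 8                       # one process per A100 GPU
```

Submitting such a spec (e.g., `az ml job create -f job.yml`) lets AzureML handle process launch and inter-node wiring, which is exactly the manual, error-prone part the stack aims to remove.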

Optimized stack for scalable distributed training on Azure

A possible experimental setup consists of the NDm A100 v4-series, which includes two-socket AMD EPYC 7V12 64-core CPUs, 1.7TB of main memory, and eight A100 80GB GPUs. A balanced PCIe topology connects four GPUs to each CPU, and each GPU has its own topology-agnostic 200 Gb/s NVIDIA Mellanox HDR InfiniBand link. The 1.7TB of main memory, combined with the DeepSpeed library's offload capabilities, allows scaling to large model sizes. This setup can be used both in AzureML studio and with Azure VMSS, but the AzureML studio solution is recommended because it is the easiest way to get the setup running correctly.
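To see why ZeRO-style partitioning and the large host memory matter, a back-of-the-envelope estimate of per-GPU model-state memory under ZeRO-3 can be sketched in Python. The 16-bytes-per-parameter figure follows the ZeRO paper's accounting for mixed-precision Adam training; activation memory is deliberately excluded, and the numbers are illustrative, not measurements from this stack.

```python
# Rough per-GPU memory estimate for mixed-precision Adam training under
# ZeRO-3, which partitions parameters, gradients, and optimizer states
# across all data-parallel GPUs. Per-parameter cost: 2 bytes (fp16 params)
# + 2 bytes (fp16 grads) + 12 bytes (fp32 master params, momentum,
# variance) = 16 bytes.

def zero3_state_bytes_per_gpu(num_params: int, num_gpus: int) -> float:
    """Model-state bytes per GPU with ZeRO-3 (activations excluded)."""
    BYTES_PER_PARAM = 16  # fp16 params + grads, plus fp32 Adam states
    return BYTES_PER_PARAM * num_params / num_gpus

# A 2-trillion-parameter model spread over 1024 GPUs:
gib = zero3_state_bytes_per_gpu(2_000_000_000_000, 1024) / 2**30
print(f"{gib:.1f} GiB per GPU")  # ~29.1 GiB of model state per 80 GB A100
```

Without partitioning (`num_gpus=1`) the same model would need roughly 29 TiB of state, which is why offloading to the 1.7TB of host memory and sharding across GPUs are both needed at this scale.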

Differences between distributed architecture and AzureML training setup

The proposed AzureML stack enables efficient training of 2x larger model sizes (2 trillion vs. 1 trillion parameters), scaling to 2x more GPUs (1024 vs. 512), and up to 1.8x higher compute throughput per GPU (150 TFLOPs vs. 81 TFLOPs). The stack also offers near-linear scalability, both as model size grows and as the number of GPUs increases. Thanks to DeepSpeed ZeRO-3 with its CPU offloading capabilities and this new AzureML stack, an efficient throughput of 157 TFLOPs per GPU is maintained as the model grows from 175 billion to 2 trillion parameters, and, for a given model size (e.g., 175 billion parameters in the following graph), linear scaling is achieved as the number of GPUs increases.
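ZeRO-3 with CPU offloading, as described above, is enabled through DeepSpeed's JSON configuration file. A minimal sketch follows; the values are illustrative (the micro-batch size matches the BS/GPU=8 case from the reported results) and a real job would tune them to the model and hardware.

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu" }
  }
}
```

Stage 3 partitions parameters, gradients, and optimizer states across GPUs, while the two `offload_*` sections push optimizer state and parameters into the node's 1.7TB of host memory.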

More detailed results are described in the extended DeepSpeed technical blog.

a. Throughput/GPU vs. model size, from 175 billion to 2 trillion parameters (BS/GPU=8);

b. Linear performance scaling as the number of GPU devices increases, for the 175B model (BS/GPU=16).


InfoQ.com and all content copyright © 2006-2022 C4Media Inc.