During Llama 3 training, Nvidia's H100 GPUs proved to be the most failure-prone components, and the HBM3 memory on those GPUs failed particularly often.
Meta trained Llama 3 405B on a GPU supercomputer with 16,384 Nvidia H100 80 GB GPUs. During the 54-day training period, the cluster experienced 419 unexpected component failures. On average, something went wrong roughly every three hours.
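The headline figure follows directly from those two numbers; a quick back-of-the-envelope check in plain Python, using only the values quoted above:

failures = 419
training_days = 54
hours_between_failures = training_days * 24 / failures
print(f"Mean time between failures: {hours_between_failures:.1f} hours")  # ~3.1 hours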
Fragile component
In about half of the cases, the H100 GPUs were involved in some way. In 17.2 percent of cases, the failure was traced to a fault in the HBM3 memory; in another 30 percent, some other problem with the GPU (or NVLink) was the cause.
Only two CPUs failed during training; the remaining failures stemmed from a mix of other hardware and software errors. The GPUs proved to be by far the most fragile components of the cluster, although the impact of the defects remained limited: Meta reports an effective training time of more than 90 percent.
The company used several techniques to limit the impact of component failures. Given the size of the training cluster, failures were inevitable, so a large parallel system has to be able to cope with local problems if it is to train efficiently. Meta used software tooling to detect faulty GPUs and resolve problems quickly, usually automatically; only three incidents required significant manual intervention.
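Meta has not detailed its recovery tooling here, but the general pattern described (detect a faulty node, take it out of the pool, resume from the last checkpoint) can be sketched roughly as follows. All names in this sketch (run_health_checks, cordon_node, restart_from_checkpoint) are hypothetical placeholders rather than Meta's actual software, and the fault check is simulated:

import random
import time

def run_health_checks(nodes):
    # Simulated GPU/HBM health check; a real system would query node telemetry.
    return [n for n in nodes if random.random() < 0.001]

def cordon_node(node):
    # Placeholder: remove the faulty node from the scheduler's pool.
    print(f"cordoning faulty node {node}")

def restart_from_checkpoint(nodes):
    # Placeholder: relaunch the training job on the remaining healthy nodes.
    print(f"resuming on {len(nodes)} healthy nodes from the last checkpoint")

def supervise(nodes, poll_interval_s=60, iterations=10):
    # Poll node health; on a fault, cordon the node and restart the job.
    for _ in range(iterations):
        faulty = run_health_checks(nodes)
        if faulty:
            for node in faulty:
                cordon_node(node)
                nodes.remove(node)
            restart_from_checkpoint(nodes)
        time.sleep(poll_interval_s)

supervise([f"node{i:05d}" for i in range(2048)], poll_interval_s=1, iterations=3)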
Economies of scale
The numbers illustrate the challenges of training large AI models. Training clusters have grown so large that hardware problems during a run are routine, and it is up to researchers and infrastructure specialists to keep training going even when components fail.
On average, Meta encountered an issue every three hours, and much larger AI clusters are in the pipeline. Elon Musk's xAI wants to build a cluster with 100,000 Nvidia H100 GPUs. If components fail at the same per-GPU rate, that works out to roughly two failures per hour, so xAI's engineers will need recovery to run smoothly about every half hour.
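That projection is a simple extrapolation of Meta's reported rate, not a published figure for xAI's planned cluster, and it assumes failures scale linearly with GPU count:

failures = 419
days = 54
gpus_meta = 16_384
gpus_xai = 100_000

failures_per_gpu_hour = failures / (days * 24 * gpus_meta)
projected_failures_per_hour = failures_per_gpu_hour * gpus_xai
print(f"Projected failures per hour at 100,000 GPUs: {projected_failures_per_hour:.1f}")  # ~2.0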