Alibaba unveils the network and datacenter design it uses for large language model training

Alibaba has revealed its datacenter design for LLM training, which consists of an Ethernet-based network in which each host contains eight GPUs and nine NICs, each with two 200Gbps ports.
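
As a rough back-of-the-envelope sketch, the per-host figures above can be multiplied out to get the aggregate network capacity of one host. The constant names are illustrative, and the port speed is treated as 200 gigabits per second, which is an assumption of this sketch rather than something the figures here break down further.

```python
# Per-host figures quoted in the article; names are illustrative.
GPUS_PER_HOST = 8
NICS_PER_HOST = 9
PORTS_PER_NIC = 2
PORT_SPEED_GBIT = 200  # assumption: gigabits per second per port


def host_ports(nics: int = NICS_PER_HOST, ports_per_nic: int = PORTS_PER_NIC) -> int:
    """Total number of network ports on one host."""
    return nics * ports_per_nic


def host_capacity_gbit(speed_gbit: int = PORT_SPEED_GBIT) -> int:
    """Aggregate one-direction capacity of one host, in gigabits per second."""
    return host_ports() * speed_gbit


def gbit_to_gbyte(gbit: float) -> float:
    """Convert gigabits to gigabytes (8 bits per byte)."""
    return gbit / 8


if __name__ == "__main__":
    total = host_capacity_gbit()
    print(f"ports per host: {host_ports()}")
    print(f"aggregate capacity: {total} Gbit/s ({gbit_to_gbyte(total)} GByte/s)")
```

Under these assumptions a host has 18 ports and 3,600 gigabits per second of aggregate capacity, which shows how much difference the bits-versus-bytes distinction makes when reading port speeds.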

The tech giant, which also offers one of the best large language models (LLMs) around in its 110-billion-parameter Qwen model, says this design has been used in production for eight months and aims to maximize the utilization of a GPU's PCIe capabilities, increasing the send/receive capacity of the network.

Another feature that increases speed is the use of NVLink for the intra-host network, which provides more bandwidth between the GPUs inside a host. Each port on a NIC is connected to a different top-of-rack (ToR) switch, avoiding a single point of failure, a design that Alibaba calls rail-optimized.
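
The fault-tolerance idea can be illustrated with a small sketch: if the two ports of every NIC go to different top-of-rack switches, losing one switch still leaves every NIC with a live uplink. The `Nic` class, the switch names `tor-a` and `tor-b`, and the wiring in `build_host` are hypothetical simplifications for illustration, not Alibaba's actual topology.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Nic:
    name: str
    uplinks: Tuple[str, str]  # switch reached by port 0 and by port 1


def build_host(num_nics: int = 9) -> List[Nic]:
    """Hypothetical wiring: port 0 of each NIC goes to tor-a, port 1 to tor-b."""
    return [Nic(f"nic{i}", ("tor-a", "tor-b")) for i in range(num_nics)]


def reachable_nics(host: List[Nic], failed_tor: str) -> List[str]:
    """Names of NICs that still have at least one uplink to a healthy switch."""
    return [
        nic.name
        for nic in host
        if any(tor != failed_tor for tor in nic.uplinks)
    ]


if __name__ == "__main__":
    host = build_host()
    for tor in ("tor-a", "tor-b"):
        alive = reachable_nics(host, tor)
        print(f"{tor} down: {len(alive)} of {len(host)} NICs still reachable")
```

By contrast, a NIC whose two ports both land on the same switch drops out of `reachable_nics` as soon as that switch fails, which is the single point of failure the dual-switch wiring is meant to avoid.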

Each pod contains 15,000 GPUs

A new type of network is required because traffic patterns in LLM training differ from those of general cloud computing: the traffic is low-entropy and bursty. There is also a higher sensitivity to faults and single points of failure.

"Based on the unique characteristics of LLM training, we decided to build a new network architecture specifically for LLM training. We should meet the following goals; scalability, high performance, and single-ToR fault tolerance," the company said.

Another part of the infrastructure that was revealed was the cooling mechanism. As no vendor could provide a solution to keep chips below 105°C, the temperature at which switches begin to shut down, Alibaba designed and built its own vapor chamber heat sink, which uses more wicked pillars at the center of the chip to carry heat away more efficiently.

The design for LLM training is encapsulated in pods that contain 15,000 GPUs, and each pod can be located in a single datacenter. "All datacenter buildings in commission in Alibaba Cloud have an overall power constraint of 18MW, and an 18MW building can accommodate approximately 15K GPUs. In conjunction with HPN, each single building perfectly houses an entire Pod, making predominant links inside the same building," Alibaba wrote.
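
The two numbers in that quote imply a simple power budget per GPU. The sketch below just divides one by the other; the function name is illustrative, and the result should be read as a whole-building budget per GPU (presumably including hosts, switches and cooling), not the draw of the GPU chip itself.

```python
# Figures from Alibaba's statement; names are illustrative.
BUILDING_POWER_WATTS = 18_000_000  # 18MW building power constraint
GPUS_PER_POD = 15_000              # approximately 15K GPUs per building


def power_budget_per_gpu_watts(
    building_watts: float = BUILDING_POWER_WATTS,
    gpus: int = GPUS_PER_POD,
) -> float:
    """Building power divided evenly across the GPUs it houses."""
    return building_watts / gpus


if __name__ == "__main__":
    budget = power_budget_per_gpu_watts()
    print(f"power budget per GPU: {budget:.0f} W ({budget / 1000:.1f} kW)")
```

That works out to roughly 1.2kW of building power for every GPU in a pod.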

Alibaba also wrote that it expects model parameters to continue to rise by an order of magnitude in the next several years, from one trillion to 10 trillion parameters, and that its new architecture is planned to support this and scale up to 100,000 GPUs.

Via The Register


James Capell
B2B Editor, Web Hosting

