'A virtual DPU within a GPU': Could clever hardware hack be behind DeepSeek's groundbreaking AI efficiency?

A person's hand using DeepSeek on their mobile phone
(Image credit: Adobe Stock)

  • A new approach called DualPipe seems to be the key to DeekSeek's success
  • One expert describes it as an on-GPU virtual DPU that maximizes bandwidth efficiency
  • While DeepSeek has used Nvidia GPUs only, one wonders how AMD's Instinct would fare

China’s DeepSeek AI chatbot has stunned the tech industry, representing a credible alternative to OpenAI’s ChatGPT at a fraction of the cost.

A recent paper revealed DeepSeek V3 was trained on a cluster of 2,048 Nvidia H800 GPUs – crippled versions of the H100 (we can only imagine how much more powerful it would be running on AMD Instinct accelerators!). It reportedly required 2.79 million GPU-hours for pretraining, fine-tuning on 14.8 trillion tokens, and cost - according to calculations made by The Next Platform - a mere $5.58 million.

But exactly how DeepSeek's developers managed this feat is likely down to a clever hack.

A virtual DPU on the GPU itself

First, some background. DeepSeek is an advanced Mixture-of-Experts (MoE) language model designed to optimize performance by selectively activating only the most relevant parts of its architecture for each task. The third version of the model, DeepSeek-V3, features a total of 671 billion parameters, with only 37 billion activated for any given token prediction. This selective activation massively reduces computational costs while maintaining high performance and accuracy – which you’ll see if you try it.

It’s easy to be skeptical of DeepSeek and the claims made regarding its training, but the paper reveals some of the magic the developers came up with to make the most of the crippled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.

According to the information published by DeepSeek, DualPipe overlaps forward and backward computation, reduces latency, and optimizes data movement across GPUs. By efficiently managing communication, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute cores (Streaming Multiprocessors) between computation and communication, preventing data transfer bottlenecks as the model scales.

A commenter on The Next Platform describes DualPipe as "essentially creating a virtual DPU on the GPU itself to handle all-to-all communication," which highlights its role in optimizing data transfer efficiency.

The paper goes into further detail, "In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink.”

Example DualPipe scheduling

Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication. (Image credit: DeekSeek)

You might also like

TOPICS
Wayne Williams
Editor

Wayne Williams is a freelancer writing news for TechRadar Pro. He has been writing about computers, technology, and the web for 30 years. In that time he wrote for most of the UK’s PC magazines, and launched, edited and published a number of them too.

You must confirm your public display name before commenting

Please logout and then login again, you will then be prompted to enter your display name.

Read more
Nvidia H800 GPU
A look at the unbelievable Nvidia GPU that powers DeepSeek's AI global ambition
Nvidia HQ
Nvidia calls DeepSeek an 'excellent AI advancement' and praises the Chinese AI app's ingenuity
DeepSeek
Nvidia out? DeepSeek pairs with banned Chinese tech giant to deliver unbelievably low pricing on AI inference which could cause Nvidia's house of cards to come crashing
Cerebras WSE-3
DeepSeek on steroids: Cerebras embraces controversial Chinese ChatGPT rival and promises 57x faster inference speeds
SambaNova runs DeepSeek
Nvidia rival claims DeepSeek world record as it delivers industry-first performance with 95% fewer chips
A person using DeepSeek on their smartphone
Only two weeks in and AI phenomenon DeepSeek is officially growing faster than ChatGPT
Latest in Pro
Finger Presses Orange Button Domain Name Registration on Black Keyboard Background. Closeup View
I visited the world’s first registered .com domain – and you won’t believe what it’s offering today
Racks of servers inside a data center.
Modernizing data centers: an efficient path forward
Dr. Peter Zhou, President of Huawei Data Storage Product Line
Why AI commonization is so important for business intelligent transformation and what Huawei’s data storage has to offer
Wix automation
The world's leading website builder aims to save businesses time with new tool
Data Breach
Thousands of healthcare records exposed online, including private patient information
China
Juniper patches security flaws which could have let hackers take over your router
Latest in News
Quordle on a smartphone held in a hand
Quordle hints and answers for Monday, March 17 (game #1148)
NYT Strands homescreen on a mobile phone screen, on a light blue background
NYT Strands hints and answers for Monday, March 17 (game #379)
NYT Connections homescreen on a phone, on a purple background
NYT Connections hints and answers for Monday, March 17 (game #645)
Apple iPhone 16 Pro HANDS ON
Leaked iPhone 17 dummy units may have given us our best look yet at all four models
A super close up image of the Google Gemini app in the Play Store
It's official: Google Assistant will be retired for phones this year, with Gemini taking over
Quordle on a smartphone held in a hand
Quordle hints and answers for Sunday, March 16 (game #1147)