Nvidia vs AMD vs Huawei: DeepSeek V4 1.6T: Day 0 to Day 43 Performance on GB300, MI355X, Ascend & B200
"We hit 100x faster inference on Day 26. But that wasn't the end of the story." (InferenceX engineering team)
Imagine buying a Formula 1 car that arrives in a million pieces. The chassis is revolutionary. The engine specs are insane. But on race day, it's not turning laps.
That's exactly what happened when DeepSeek V4 launched on April 24, 2026.
The model itself was brilliant. A 1.6‑trillion‑parameter Mixture‑of‑Experts (MoE) architecture with 49 billion active parameters per token, a 1‑million‑token context window, and the kind of coding benchmark scores that made Anthropic's Claude Opus 4.6 sweat (93.5 on LiveCodeBench).
But here's the part nobody puts in the press release: Day‑0 inference performance was unpolished.
The open‑source community scrambled. Frameworks crashed. CUDA kernels stuttered. And in the weeks that followed, something remarkable happened: a 100x performance improvement emerged not from Nvidia or AMD, but from the global open‑source ecosystem rallying together.
This is the story of DeepSeek V4's first 43 days across four hardware platforms: GB300 NVL72, MI355X, Ascend 950DT, and B200.
Day 0 Reality Check: The Model Launched Fast, But Not Ready
Let's talk about what April 24, 2026 actually looked like.
DeepSeek V4 dropped as two MIT‑licensed MoE models:
- V4‑Pro: 1.6 trillion total parameters, ~49 billion active per token, 1M context window
- V4‑Flash: 284B total, ~13B active, 1M context window
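To put the sparsity of the two models in perspective, here is a small sketch that computes the share of weights touched per token from the parameter counts quoted above. The function and dictionary names are illustrative only and do not come from any DeepSeek tooling.

```python
# Parameter counts quoted in the list above, in billions.
MODELS = {
    "V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters that participate in a single token's forward pass."""
    return active_b / total_b


for name, p in MODELS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of weights active per token")
# V4-Pro: 3.1% of weights active per token
# V4-Flash: 4.6% of weights active per token
```

Per‑token compute tracks the active count, while memory capacity still has to hold the full total, which is why the hardware sections below keep coming back to HBM size.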
The architecture was genuinely ahead of its time.
CUDA vLLM and CUDA SGLang worked "out of the box," but "out of the box" didn't mean fast. It meant the model could load and run. That's it. And that's where the real work began.
TensorRT‑LLM, Nvidia's "official" inference engine, didn't even work well at launch. SemiAnalysis had to submit a pull request to fix Nvidia's own open‑source kernel launch code.
The developer reception was mixed. On one hand, architectural innovations like Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) drew genuine praise. On the other hand, there was a consensus that DeepSeek V4 still trailed cutting‑edge closed‑source models by roughly three to six months, particularly in complex reasoning tasks.
DeepSeek itself was honest about the gap. This wasn't an OpenAI killer on Day 0. It was an open‑source foundation waiting to be optimized.
Quick side note: This is actually healthy for the ecosystem. A "perfect" launch leaves no room for community optimization. The messier the Day‑0, the more dramatic the improvement curve.
The 100x Performance Jump in 26 Days
Here's where the story gets interesting.
Under the technical leadership of HaiShaw, the InferenceX engineering team pulled multiple all‑nighters to measure and improve DeepSeek V4's performance across every major framework.
The result? A more than 100x performance improvement by Day 26.
How does 100x happen in less than a month? Three key factors:
Factor 1: Framework optimizations hit master branch fast.
SGLang and vLLM became the backbone of the global ML ecosystem, so much so that both teams launched their own companies (Inferact and RadixArk), raising hundreds of millions of dollars each to continue building open‑source inference infrastructure.
These weren't hobbyist projects. These were production‑grade engines that prioritized iteration speed over feature polish.
Factor 2: The community fixed what vendors missed.
Nvidia's TensorRT‑LLM didn't work well for DeepSeek V4. The SemiAnalysis team literally had to fix the open‑source mHC kernel launch code themselves. ROCm didn't play nicely either.
But here's the thing: when you're working in the open, a bottleneck gets identified, fixed, and merged in days, not quarters.
Factor 3: Batch invariance and KV cache compression changed the game.
vLLM v0.22 introduced batch invariance, delivering 28.9% latency improvements while preserving accuracy. Rust frontends replaced Python inference hot paths. KV cache compression via CSA + HCA, requiring only 27% of the single‑token inference FLOPs and 10% of the KV cache compared to V3.2 in the 1M context setting, meant that long‑context workloads became dramatically cheaper to run.
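Batch invariance is easier to appreciate with a toy example. Floating‑point addition is not associative, so if a kernel changes how it splits a reduction depending on batch size, the same request can produce slightly different numbers depending on what else is in the batch. The sketch below simulates that in plain Python. The chunk‑size heuristic is hypothetical, and none of this is vLLM's actual implementation.

```python
def chunked_sum(row, chunk_size):
    """Reduce `row` in fixed-size chunks, then add the partial sums in order."""
    total = 0.0
    for start in range(0, len(row), chunk_size):
        partial = 0.0
        for x in row[start:start + chunk_size]:
            partial += x
        total += partial
    return total


def batch_variant_sum(batch):
    # Hypothetical heuristic: a lone request splits its reduction more finely
    # (more parallelism), while larger batches reduce each row in one pass.
    chunk_size = 2 if len(batch) == 1 else 4
    return [chunked_sum(row, chunk_size) for row in batch]


def batch_invariant_sum(batch):
    # Fixed reduction order, regardless of how many rows share the batch.
    return [chunked_sum(row, 2) for row in batch]


row = [1e16, 1.0, 1.0, 1.0]            # values chosen to expose rounding
other = [0.5, 0.25, 0.125, 0.0625]     # an unrelated request in the same batch

alone = batch_variant_sum([row])[0]
batched = batch_variant_sum([row, other])[0]
print(alone == batched)  # False: the result depends on batch composition

stable_alone = batch_invariant_sum([row])[0]
stable_batched = batch_invariant_sum([row, other])[0]
print(stable_alone == stable_batched)  # True: same order, same bits
```

Fixing the reduction order is the whole idea. How a real engine does that without giving up throughput is the hard engineering part, and it is not shown here.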
Pause on the KV cache figure for a moment: 10% of what V3.2 needed at 1M context. That's the difference between a model that bankrupts your cloud bill and one that scales.
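Here is a rough sketch of what the 10% KV cache figure means for concurrency. The article does not give V3.2's per‑token KV size, so `BASELINE_BYTES_PER_TOKEN` and `KV_BUDGET_GIB` below are hypothetical placeholders. Only the 0.10 ratio and the 1M context length come from the text.

```python
CONTEXT_TOKENS = 1_000_000       # 1M-token context, from the text
KV_RATIO_V4_VS_V32 = 0.10        # V4 needs 10% of V3.2's KV cache, from the text

# Assumptions for illustration only (not published figures):
BASELINE_BYTES_PER_TOKEN = 64 * 1024   # hypothetical V3.2 KV bytes per token
KV_BUDGET_GIB = 512.0                  # hypothetical HBM left over for KV cache


def kv_cache_gib(bytes_per_token: float, tokens: int, ratio: float = 1.0) -> float:
    """KV cache size in GiB for one sequence of `tokens` tokens."""
    return bytes_per_token * tokens * ratio / 2**30


def max_sequences(budget_gib: float, per_sequence_gib: float) -> int:
    """How many full-length sequences fit in the KV budget."""
    return int(budget_gib // per_sequence_gib)


baseline_gib = kv_cache_gib(BASELINE_BYTES_PER_TOKEN, CONTEXT_TOKENS)
v4_gib = kv_cache_gib(BASELINE_BYTES_PER_TOKEN, CONTEXT_TOKENS, KV_RATIO_V4_VS_V32)

print(f"baseline: {baseline_gib:.1f} GiB per 1M-token sequence")
print(f"V4-style: {v4_gib:.1f} GiB per 1M-token sequence")
print("baseline concurrency:", max_sequences(KV_BUDGET_GIB, baseline_gib))
print("V4-style concurrency:", max_sequences(KV_BUDGET_GIB, v4_gib))
```

Whatever the real baseline is, the ratio carries through: the same memory budget holds about ten times as many full‑length sequences.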
Hardware Wars: How Each Platform Unlocked V4's True Speed
Now, let's talk about the four hardware platforms that ran DeepSeek V4 over those 43 days. Each had a distinct story.
NVIDIA GB300 NVL72: The Rack‑Scale Powerhouse
The GB300 NVL72 is not a server. It's a supercomputer in a rack: 72 NVIDIA Blackwell Ultra GPUs, 36 Grace CPUs, 37 TB of GPU memory, and 130 TB/s of NVLink bandwidth.
When DeepSeek V4 launched, SemiAnalysis's own GB300 cluster was down. CoreWeave scrambled to find two spare dev GB300 NVL72 racks, contributing compute to the open‑source community. Those racks ran around the clock to drive improvements.
The results were staggering. Compared to Hopper‑based platforms, GB300 NVL72 delivers up to a 50x overall increase in AI factory output performance, a 10x boost in user responsiveness, and a 5x improvement in throughput per megawatt.
For DeepSeek V4 specifically, the GB300 NVL72 FP4 disaggregation unlocked test‑time scaling at a level Hopper simply couldn't touch.
NVIDIA B200: The Workhorse Still Has Legs
The B200 is the reliable middle child of the Blackwell family. 180 GB of HBM3E memory, 6.1 TB/s bandwidth, solid performance, but increasingly positioned as the "entry‑level" option as GB300 ramps.
In pure training terms, B200 remains competitive. AMD's MI355X matches it in FP8 Llama 70B training (roughly 2.5–3.5x MI300X, trading blows with B200/GB200).
But for inference, the gap is widening. The GB200 NVL72 (72 B200s) delivered up to 28x the throughput of a comparable MI355X cluster in one DeepSeek‑R1 benchmark.
AMD MI355X: The Upstart with HBM Advantage
AMD's MI355X is the underdog that keeps surprising people.
288 GB of HBM3E memory vs. B200's 180 GB. 8 TB/s bandwidth. FP4 and FP8 optimizations that AMD claims deliver 1.4x higher throughput than B200 when serving DeepSeek‑R1 at scale.
In vLLM/SGLang benchmarks, MI355X delivers roughly 20–30% higher inference throughput on DeepSeek R1 and Llama 3 70B than B200.
However, and this is important, AMD's advantage is regime dependent. For dense architectures and smaller MoE models, B200 still leads. But when scaling to frontier‑class MoE models like DeepSeek‑R1 beyond a single node, all 8‑GPU systems hit a "scaling ceiling" due to communication bottlenecks.
The takeaway: MI355X is a memory monster, and memory matters for MoE.
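A back‑of‑the‑envelope sketch shows why capacity matters here. It takes V4‑Pro's 1.6 trillion parameters and the HBM sizes quoted in this article, and counts the minimum number of GPUs needed just to hold the weights. It assumes exactly 1 byte per parameter for FP8 and 0.5 bytes for FP4, and it ignores quantization scale factors, KV cache, activations, and runtime overhead, so treat the results as lower bounds.

```python
import math

TOTAL_PARAMS = 1.6e12                         # V4-Pro total parameters
BYTES_PER_PARAM = {"FP8": 1.0, "FP4": 0.5}    # idealized, no scale-factor overhead
HBM_GB = {"MI355X": 288, "B200": 180}         # capacities as quoted in this article


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight footprint in decimal GB."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(params: float, bytes_per_param: float, hbm_gb: float) -> int:
    """Smallest GPU count whose combined HBM can hold the weights alone."""
    return math.ceil(weight_gb(params, bytes_per_param) / hbm_gb)


for precision, nbytes in BYTES_PER_PARAM.items():
    for gpu, hbm in HBM_GB.items():
        n = min_gpus_for_weights(TOTAL_PARAMS, nbytes, hbm)
        print(f"{precision} on {gpu}: {weight_gb(TOTAL_PARAMS, nbytes):.0f} GB "
              f"of weights, at least {n} GPUs")
```

Under these assumptions, FP8 weights fit on a single 8‑GPU MI355X node but not on a single 8‑GPU B200 node. Every GPU freed from holding weights is also memory freed for KV cache and larger batches.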
The Missing Comparison: GB200 NVL72 vs GB300 NVL72 vs MI355X
One critical gap in public benchmarks: direct GB200 vs GB300 vs MI355X comparisons for DeepSeek V4 specifically. Most existing benchmarks focus on DeepSeek‑R1 or Llama. The InferenceX engineering team has been the first to test multi‑node FP4 and FP8 MI355X performance, but even that is early.
Huawei Ascend 950DT: China's Answer to Nvidia Gets Tested
Here's where things get strategically fascinating.
Huawei's Ascend 950DT was co‑designed in part for DeepSeek V4 inference. This wasn't an afterthought. The model's architecture and Huawei's accelerator roadmaps were aligned from the start.
What is the Ascend 950DT?
The 950DT variant includes 144 GB of in‑house HBM with nearly 4 TB/s of bandwidth. The 950PR variant includes 128 GB HBM at 1.6 TB/s. Both target exascale FP8 workloads in Huawei's Atlas SuperPoDs.
Huawei's biggest SuperCluster delivers 524 ExaFLOPS of FP8 compute and a full ZettaFLOP of FP4 compute, using 64 Atlas SuperPoDs housing 524,288 accelerators.
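The cluster numbers above can be sanity‑checked with simple division. The sketch below uses only the figures quoted in the previous paragraph and reads "a full ZettaFLOP" as 1,000 ExaFLOPS. The helper name is illustrative.

```python
ACCELERATORS = 524_288
SUPERPODS = 64
FP8_EXAFLOPS = 524.0
FP4_EXAFLOPS = 1000.0   # "a full ZettaFLOP" = 1,000 ExaFLOPS


def pflops_per_chip(cluster_exaflops: float, chips: int) -> float:
    """Average per-accelerator throughput in PFLOPS (1 ExaFLOPS = 1,000 PFLOPS)."""
    return cluster_exaflops * 1000 / chips


chips_per_pod = ACCELERATORS // SUPERPODS
fp8_per_chip = pflops_per_chip(FP8_EXAFLOPS, ACCELERATORS)
fp4_per_chip = pflops_per_chip(FP4_EXAFLOPS, ACCELERATORS)

print(f"accelerators per SuperPoD: {chips_per_pod}")   # 8192
print(f"FP8 per accelerator: {fp8_per_chip:.2f} PFLOPS")
print(f"FP4 per accelerator: {fp4_per_chip:.2f} PFLOPS")
```

The headline figures work out to about 1 PFLOPS of FP8 and just under 2 PFLOPS of FP4 per accelerator. The cluster total therefore comes from chip count rather than from unusually fast individual chips.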
For context, the largest Western cluster in development today is xAI's Colossus 2 with over 550,000 Nvidia GB200 and GB300 GPUs. Huawei's scale is now comparable to Nvidia's best.
DeepSeek V4 on Ascend 950DT: What Actually Worked
SemiAnalysis published the first analysis of Ascend 950DT DeepSeek V4 inference, breaking down compute‑communication overlap and the different compute streams Huawei implemented to optimize performance.
Key advantages of Huawei's approach:
- Power constraints? China doesn't have them. US datacenters are often power‑limited; China can afford massive scale even with lower per‑chip efficiency.
- System‑scale competition: Huawei competes on dense packaging and aggressive networking, not raw per‑chip supremacy. If memory and networking can be owned and systems can scale far beyond a single rack, aggregate throughput can offset die‑level gaps.
- The "supernode" architecture: Huawei claims its system outperforms Nvidia's GB200 NVL72 on some metrics due to ultra‑fast chip interconnections.
Where Ascend Lags
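Let's be clear: from a single‑chip absolute compute standpoint, Ascend 950PR still trails B200. The 950PR's FP8 compute is 1 PFLOPS; B200's FP4 compute hits 9 PFLOPS. The two figures are quoted at different precisions, so they don't form a like‑for‑like ratio, but they still point to a large gap.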
But, and this is the critical nuance, raw peak FLOPS isn't everything. In specific scenarios like inference prefill and recommendation workloads, Ascend's deep optimizations deliver tangible gains. For domestic Chinese deployments, a 30–40% power efficiency gap is an acceptable tradeoff for supply chain security and strategic independence.
Huawei has also described Day‑0 inference performance support for DeepSeek V4 in their documentation. But actual training costs on Huawei chips have not been disclosed, and performance comparisons with Nvidia's latest remain difficult due to export restrictions.
The "Day 43" Inflection Point: What Actually Happened
By Day 43, the picture had fundamentally shifted.
The iterative improvements from Day 0 onward, recorded via InferenceX's open‑source images and recipes across multiple frameworks, revealed a clear pattern. Performance was regime dependent.
Long‑context workloads: These saw gains first. V4's hybrid CSA+HCA attention compresses KV caches so aggressively that 1M‑token inference becomes economically viable. The long‑context benefit kicked in immediately because it's baked into the architecture.
Short‑context workloads: These lagged early on, depending more on kernel bring‑up and prefill optimization. But by Day 26–43, vLLM and SGLang patches had closed the gap.
Agentic tasks: DeepSeek V4's true strength. The model scored 80.6% on SWE‑bench Verified (matching Claude Opus 4.6's 80.8%) and led all models on LiveCodeBench at 93.5.
The Role of Quantization: FP4 as a Game Changer
FP4 quantization was critical to the performance curve. By using FP4 precision, DeepSeek V4 could run larger effective batch sizes on the same memory footprint. This is why GB300's FP4 disaggregation and Huawei's FP4 support are so strategically important.
The math is simple: lower precision = more tokens per second = lower cost per token. DeepSeek V4‑Pro costs $3.48 per million output tokens, compared to Claude's roughly $75. That's a cost advantage of roughly 20x (75 / 3.48 ≈ 21.6).
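The price comparison is easy to turn into a quick calculator. The two per‑million‑token prices are the ones quoted above. The 500‑million‑token workload is a made‑up example, and the function name is illustrative.

```python
V4_PRO_USD_PER_M_OUTPUT = 3.48    # from the text
CLAUDE_USD_PER_M_OUTPUT = 75.0    # "roughly $75", from the text


def output_cost_usd(output_tokens: int, usd_per_million: float) -> float:
    """Cost of generating `output_tokens` tokens at a per-million-token price."""
    return output_tokens / 1_000_000 * usd_per_million


ratio = CLAUDE_USD_PER_M_OUTPUT / V4_PRO_USD_PER_M_OUTPUT
print(f"price ratio: {ratio:.1f}x")   # price ratio: 21.6x

# Hypothetical workload: 500 million output tokens per month.
monthly_tokens = 500_000_000
print(f"V4-Pro: ${output_cost_usd(monthly_tokens, V4_PRO_USD_PER_M_OUTPUT):,.0f}")
print(f"Claude: ${output_cost_usd(monthly_tokens, CLAUDE_USD_PER_M_OUTPUT):,.0f}")
```

This only compares list prices for output tokens. Input pricing, caching discounts, and quality differences would all change the real bill.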
What DeepSeek V4's Learning Curve Means for Your Deployment
If you take one thing away from this 43‑day analysis, let it be this:
Don't benchmark AI models at launch.
Day‑0 performance is the floor, not the ceiling.
The real competitive advantage isn't which hardware vendor you choose; it's how quickly you can iterate through framework optimizations, kernel patches, and community contributions.
Here are actionable takeaways for your own deployment planning:
If you're prioritizing raw throughput at any scale: GB300 NVL72 is the king. The 50x AI factory output improvement over Hopper is not marketing hype.
If you're memory‑constrained: MI355X's 288 GB HBM3E is a serious advantage for MoE models. Watch for communication bottlenecks beyond 8 GPUs.
If you're deploying in China or have supply chain concerns: Ascend 950DT is becoming a viable alternative, especially at cluster scale. The performance gap is narrowing faster than most Western observers realize.
If you're budget‑conscious: Don't sleep on community‑optimized vLLM + SGLang on B200 clusters. The open‑source performance improvements over 43 days exceeded what any single vendor delivered at launch.
The first 43 days of DeepSeek V4's life taught us something fundamental about modern AI: performance is a process, not a product.
Day 0, the model was rough. TensorRT‑LLM didn't work well. ROCm lagged. Community patches were flying in faster than vendors could merge them.
Day 26, the model hit 100x performance improvements. Not because of new silicon, not because of a vendor miracle, but because thousands of engineers across vLLM, SGLang, and the open‑source community treated inference optimization as a collective sport.
Day 43, the hardware arms race looked completely different. GB300 NVL72 emerged as the absolute throughput king. MI355X proved memory matters more than peak FLOPS for MoE. And Huawei Ascend 950DT demonstrated that even with a per‑chip disadvantage, cluster‑scale solutions can be genuinely competitive.
The Chinese AI ecosystem now dominates the open‑model landscape. Kimi K2.6 beats Nvidia's own Nemotron 3 Ultra on coding. DeepSeek V4 matches Claude Opus 4.6 on SWE‑bench at 1/20th the cost.
This isn't a race with a finish line. It's an ongoing optimization marathon where the winners will be those who embrace iterative measurement, contribute to open‑source tooling, and plan deployment across multiple hardware architectures.