The GPU choice used to be simple. Buy H100s if you could. Rent them on AWS or GCP if you couldn't. Wait in line either way. In 2026, that calculus has changed. The Blackwell generation — NVIDIA B200, B100, and GB200 NVL72 — is shipping in real volume after a long ramp. AMD MI300X and MI325X are now production reality at several hyperscalers, not just a credible threat. Cloud GPU spot prices have cooled off from their 2024 peak. And US export controls (rules from the Bureau of Industry and Security that gate which chips can be sold where) keep reshaping what you can buy directly. The result is a real decision tree — not a default.
Getting this wrong is expensive in both directions. Over-spec on Blackwell when your team mostly fine-tunes 7B models, and you pay a 40–60% hardware premium for FP8 throughput (the speed at which the chip multiplies in low-precision math) that your training loop never touches. Under-spec on a real pre-training run — H100 SXM5 for a 70B+ model on a deadline — and the cluster becomes the bottleneck before your data pipeline does. Neither mistake kills the project. Both compound. GPU lead times sit at 8–14 weeks even in calm supply, and resizing a cluster mid-run is brutal.
The 2026 lineup: what is actually shipping
H100 SXM5 is still the most available high-end training GPU in mid-2026. Memory bandwidth is roughly 3.35 TB/s. FP8 tensor throughput tops out around 3,958 TFLOPS with sparsity, or about 1,979 TFLOPS dense. 80 GB of HBM3 (high-bandwidth memory stacked next to the chip) is the standard config. An HGX H100 node is eight GPUs wired together over NVLink 4.0 (NVIDIA's GPU-to-GPU fabric). That is the workhorse for most mid-scale runs: 1B to 30B parameter models. On-demand cloud sits around $2.50–$3.50 per GPU-hour on the major providers (estimate; varies by region and commitment).
H200 SXM5 is the obvious upgrade if your training is memory-bound (where the bottleneck is moving weights, not crunching numbers). It packs 141 GB of HBM3e at roughly 4.8 TB/s, which is 43% more memory bandwidth than H100 at the same compute. Bandwidth is the chip's straw: wider straw, more drink per gulp. If your training step spends its time waiting on weights, H200 wins on throughput-per-dollar.
Blackwell raises the bar again. B200 delivers about 9 petaFLOPS in FP4 and 4.5 petaFLOPS in FP8 (both dense), with 192 GB of HBM3e. The interconnect story is where it gets interesting. A GB200 NVL72 rack ties 72 GPUs together over NVLink 5.0 at 1.8 TB/s GPU-to-GPU. That lets you train at scales which used to demand expensive InfiniBand fabrics. Above 30B parameters, the GB200 NVL72 rack, not a single card, is the right unit of planning.
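To make the bandwidth comparison concrete, here is a minimal back-of-envelope sketch. It uses the approximate bandwidth figures quoted above and assumes a hypothetical 13B-parameter model held in BF16 (2 bytes per parameter). The function names are illustrative, and the result is a lower bound at peak bandwidth, not a measured step time.

```python
# Approximate peak HBM bandwidth in TB/s, as quoted in the text.
HBM_BANDWIDTH_TBPS = {"H100 SXM5": 3.35, "H200 SXM5": 4.8}


def weight_stream_ms(params_billion: float, bytes_per_param: int,
                     bandwidth_tbps: float) -> float:
    """Milliseconds to read every weight once at peak bandwidth (lower bound)."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / (bandwidth_tbps * 1e12) * 1e3


def bandwidth_uplift(base: str, upgrade: str) -> float:
    """Fractional bandwidth gain of `upgrade` over `base` (0.43 means 43%)."""
    return HBM_BANDWIDTH_TBPS[upgrade] / HBM_BANDWIDTH_TBPS[base] - 1.0


if __name__ == "__main__":
    # Hypothetical example: 13B parameters in BF16 (2 bytes each).
    for chip, bw in HBM_BANDWIDTH_TBPS.items():
        print(f"{chip}: {weight_stream_ms(13, 2, bw):.2f} ms per full weight read")
    print(f"H200 uplift over H100: {bandwidth_uplift('H100 SXM5', 'H200 SXM5'):.0%}")
```

The saving only shows up in wall-clock time if the workload really is bandwidth-bound, so it is worth profiling before assuming the full uplift.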
AMD MI300X / MI325X: the ROCm reality check
MI300X is a real competitor. It ships 192 GB of unified HBM3 at roughly 5.3 TB/s, with FP8 throughput near 2.6 petaFLOPS (dense). For inference-heavy work such as long-context serving and large KV-cache batches, the memory advantage over H100 is real and measurable. Microsoft Azure runs MI300X at scale on its ND MI300X v5 instances, and AMD's training results on MI325X are credible.
The honest catch is the software. ROCm (AMD's CUDA equivalent) has come a long way: PyTorch, FlashAttention-3, and Triton kernels all work. But the ecosystem gap with CUDA is still real. The major model labs build CUDA-first. Custom CUDA kernels, NVIDIA Apex, and most quantization tooling need porting, which is slow work if your team has no ROCm experience.
Practical take: AMD shines if you let a cloud provider manage the ROCm stack, or if your workload is plain PyTorch. If your stack depends on custom kernels or CUDA-specific tooling, price the porting work honestly before you commit.
| Chip | Memory | Bandwidth | FP8 dense | Best fit |
|---|---|---|---|---|
| H100 SXM5 | 80 GB HBM3 | 3.35 TB/s | ~1.98 PFLOPS | Mid-scale fine-tune & training (1B–30B) |
| H200 SXM5 | 141 GB HBM3e | 4.8 TB/s | ~1.98 PFLOPS | Memory-bound training, long-context |
| B100 | 192 GB HBM3e | 8 TB/s | ~3.5 PFLOPS | Drop-in upgrade where power budget is tight |
| B200 | 192 GB HBM3e | 8 TB/s | ~4.5 PFLOPS | Frontier pretraining, FP8 end-to-end |
| GB200 NVL72 (per rack) | 13.4 TB pooled | NVLink 5, 1.8 TB/s GPU↔GPU | ~360 PFLOPS / rack | 100B+ pretraining at scale |
| AMD MI300X | 192 GB HBM3 | 5.3 TB/s | ~2.6 PFLOPS | KV-cache-heavy serving, memory-bound work |
| AMD MI325X | 256 GB HBM3e | 6.0 TB/s | ~2.6 PFLOPS | Same as MI300X, with more memory headroom |
- Strength: 192 GB of unified HBM3 on MI300X (256 GB of HBM3e on MI325X), among the largest single-accelerator memory pools in this class. Strong for 70B+ inference and memory-bound fine-tuning.
- ROCm friction: FlashAttention-3 and Triton work. Custom CUDA kernels and NVIDIA-specific libs still need porting — budget the time if your stack relies on them.
- Availability: AMD is easier to get on cloud (Azure, Oracle Cloud) than to procure on-prem in most regions. Factor this into your capex vs. opex framing.
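As a rough way to read the memory column in the table above, the sketch below estimates the minimum GPU count needed just to hold training state. It rests on two assumptions that are not from the article: about 16 bytes per parameter for a mixed-precision Adam full fine-tune (weights, gradients, and optimizer state, excluding activations), and a hypothetical 70% of HBM usable for that state. Treat the output as a starting point for sizing, not a guarantee that a job fits.

```python
import math

# Per-accelerator memory in GB, taken from the table above.
HBM_GB = {
    "H100 SXM5": 80,
    "H200 SXM5": 141,
    "B200": 192,
    "AMD MI300X": 192,
    "AMD MI325X": 256,
}

# Assumption: mixed-precision Adam needs ~16 bytes per parameter
# (weights + gradients + optimizer state), activations excluded.
BYTES_PER_PARAM_TRAIN = 16

# Assumption (hypothetical): 70% of HBM is available for that state;
# the rest is left for activations, buffers, and fragmentation.
USABLE_PERCENT = 70


def min_gpus_for_training(params_billion: float, chip: str) -> int:
    """Smallest GPU count whose usable HBM holds the sharded training state."""
    state_gb = params_billion * BYTES_PER_PARAM_TRAIN  # 1e9 params * bytes / 1e9
    usable_gb_x100 = HBM_GB[chip] * USABLE_PERCENT     # kept as an integer product
    return math.ceil(state_gb * 100 / usable_gb_x100)


if __name__ == "__main__":
    for chip in HBM_GB:
        print(f"70B full fine-tune on {chip}: at least {min_gpus_for_training(70, chip)} GPUs")
```

Parameter-efficient methods such as LoRA or QLoRA need far less state than this, so the estimate applies to full fine-tuning and pretraining only.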
Cloud vs. on-prem: the capex/opex framing
Cloud is not always more expensive than on-prem over three years. But the comparison only works with honest TCO (Total Cost of Ownership: hardware plus power, cooling, network, storage, and the people who keep it running). On-prem means buying the gear. It also means power and cooling: each H100 SXM5 card is rated at up to 700 W, so an eight-GPU HGX H100 node draws roughly 10 kW under load, and a GB200 NVL72 rack can pull more than 120 kW. Most commercial colocation facilities are not provisioned for that without an upgrade.
Cloud strips those line items out and replaces them with compute-hours: roughly $2.50 per GPU-hour (H100 on-demand) up to $8–12 per GPU-hour (H200 / B200 estimates), depending on provider, region, and commitment. If your team runs training in bursts, say a few large jobs per quarter, reserved cloud instances at 30–40% off on-demand usually beat an owned cluster running below 60% utilization. On-prem only wins when you can hold utilization above 70–80% non-stop, your data center has the power density (15+ kW per rack for H100, 60+ kW for Blackwell), and you have the engineers to operate bare-metal GPUs.
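The utilization argument above can be sketched as a simple break-even calculation. The cloud rate and reserved discount come from the ranges in the text. The on-prem capex and opex inputs are hypothetical placeholders, not quotes, and should be replaced with your own numbers. The function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def reserved_rate(on_demand_per_gpu_hour: float, discount: float) -> float:
    """Cloud rate after a reserved-instance discount (0.35 means 35% off)."""
    return on_demand_per_gpu_hour * (1.0 - discount)


def annual_on_prem_cost(capex_per_gpu: float, years: float,
                        opex_per_gpu_year: float) -> float:
    """Straight-line amortized capex plus yearly power, cooling, and staffing."""
    return capex_per_gpu / years + opex_per_gpu_year


def on_prem_cost_per_used_hour(capex_per_gpu: float, years: float,
                               opex_per_gpu_year: float, utilization: float) -> float:
    """Effective cost of each GPU-hour that actually runs a job."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    annual = annual_on_prem_cost(capex_per_gpu, years, opex_per_gpu_year)
    return annual / (HOURS_PER_YEAR * utilization)


def break_even_utilization(capex_per_gpu: float, years: float,
                           opex_per_gpu_year: float, cloud_rate: float) -> float:
    """Utilization above which owning is cheaper than renting at cloud_rate."""
    annual = annual_on_prem_cost(capex_per_gpu, years, opex_per_gpu_year)
    return annual / (HOURS_PER_YEAR * cloud_rate)


if __name__ == "__main__":
    cloud = reserved_rate(3.00, 0.35)  # within the ranges quoted in the text
    # Hypothetical placeholders: $30,000 all-in capex per GPU over 3 years,
    # $3,000 per GPU-year for power, cooling, and operations.
    be = break_even_utilization(30_000, 3, 3_000, cloud)
    print(f"Reserved cloud rate: ${cloud:.2f}/GPU-hour")
    print(f"Break-even utilization: {be:.0%}")
```

The break-even point moves quickly with the opex term, which is why power, cooling, and staffing belong in the model rather than in a footnote.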
One regional wrinkle is worth flagging. In markets where data sovereignty rules push workloads on-prem — including Thailand's PDPA, Indonesia's PDP Law, China's PIPL, and similar regimes globally — high-density GPU power (30+ kW per rack with precision cooling) sits in only a handful of colocation facilities. If sovereignty or sectoral regulation forces on-prem, build the data-center readiness assessment into your timeline. It is not a two-week exercise.
Software readiness: PyTorch, FlashAttention, and the FP8 path
FP8 training is the headline feature on Blackwell and H200. The path from headline to actual throughput matters. Moving from BF16 (the higher-precision default) to FP8 is like switching from drafting in pencil to drafting in ink: faster, but mistakes are harder to erase. Transformer Engine, NVIDIA's mixed-precision FP8 library, is stable on PyTorch 2.2+ and plugs into HuggingFace Accelerate and DeepSpeed. FlashAttention-3 supports H100 and H200 natively and gives big speedups on attention-heavy models.
On B200 / GB200, end-to-end FP8 training (forward and backward passes in FP8, with selective FP32 accumulation) is the target setup. As of mid-2026, fully automatic FP8 training without manual precision tuning is still maturing for non-standard architectures.
The practical baseline for most teams is BF16 + FlashAttention-3 on H100 or H200. It is well-tested and fast. FP8 is worth chasing when your model is a standard transformer and your team has time to watch the loss curves carefully, because instability shows up more often in FP8 than in BF16.
- BF16 + FlashAttention-3 on H100 / H200: production-stable, well-documented. The default starting point for most fine-tuning and mid-scale pretraining.
- FP8 end-to-end on B200 / GB200: highest theoretical throughput. Needs Transformer Engine and careful loss-curve monitoring. Not the default for non-standard architectures yet.
- Apple M-series (M3 Ultra / M4 Max): fine for prototyping and single-GPU fine-tuning up to ~13B. Memory bandwidth of roughly 546 GB/s on M4 Max and about 819 GB/s on M3 Ultra is competitive for small-batch inference. Not a substitute for a training cluster.
- AWS Trainium2 / Google TPU v5e: compelling for cloud-only teams running standard transformer architectures at scale. Lower $/FLOPS than on-demand GPUs for sustained workloads, but with real framework lock-in (JAX / XLA on TPU; Neuron SDK on Trainium).
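The advice to watch loss curves during FP8 runs can be partly automated. The sketch below is one simple way to do it, not a Transformer Engine feature: it flags steps where the loss is non-finite or jumps well above a trailing window. The function name, window size, and threshold are all illustrative assumptions and would need tuning per model.

```python
import math
from collections import deque
from statistics import mean, pstdev


def find_loss_spikes(losses, window=50, threshold_sigma=4.0, min_sigma=1e-8):
    """Return step indices whose loss is NaN/inf or exceeds the trailing
    mean by more than threshold_sigma standard deviations."""
    spikes = []
    recent = deque(maxlen=window)  # trailing window of healthy losses
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            spikes.append(step)
            continue
        if len(recent) == window:
            mu = mean(recent)
            sigma = max(pstdev(recent), min_sigma)
            if loss > mu + threshold_sigma * sigma:
                spikes.append(step)
                continue  # keep the spike out of the baseline window
        recent.append(loss)
    return spikes


if __name__ == "__main__":
    # Synthetic, slowly decreasing loss with two injected problems.
    losses = [2.0 - 0.001 * i for i in range(100)]
    losses[60] = 5.0
    losses[80] = float("nan")
    print(find_loss_spikes(losses, window=20))
```

In practice a check like this would run against the logged training loss and trigger a fallback, for example resuming from the last checkpoint in BF16. That recovery policy is a team decision and is not shown here.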
The right GPU is not the fastest GPU. It is the one your software stack can saturate, your data center can power, and your procurement team can actually receive.
Decision matrix: which chip for which workload
Use the list below as a starting point. Adjust for your software stack and what you can actually procure.
- Prototyping and single-GPU fine-tuning (up to 13B): Apple M4 Max / M3 Ultra for local work. A single H100 PCIe or a cloud A100 80 GB for shared team runs. Cost-effective and well-supported.
- Multi-GPU fine-tuning (7B–70B, LoRA / QLoRA / full fine-tune): the H100 SXM5 8-GPU HGX node is the reference. Broadly available, mature CUDA ecosystem, cloud or on-prem both work.
- Memory-bound fine-tuning or long-context training (70B+, long sequences): H200 SXM5 or AMD MI300X, with 141 GB and 192 GB of HBM respectively. Pick H200 if CUDA matters, MI300X if cloud-managed ROCm is acceptable.
- Pretraining at scale (30B–200B+, runs that last weeks): a GB200 NVL72 cluster via cloud reservation or co-location. Or H100 / H200 clusters with NVLink + InfiniBand if Blackwell is constrained. Trainium2 for AWS-native teams on standard architectures.
- Inference-leaning training (training with production serving in mind, batched inference): AMD MI300X / MI325X. 192 GB HBM3 at 5.3 TB/s is the strongest single-card profile for KV-cache-heavy serving. Pair with vLLM on ROCm.
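The list above can be encoded as a small lookup function, which is handy for sanity-checking a planning spreadsheet. This is a minimal sketch: the `Workload` fields and the `recommend` function are hypothetical names, the thresholds mirror the bullets, and any case the list does not cover returns an explicit "no direct match" instead of a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    params_billion: float
    kind: str                   # "prototype", "fine_tune", "pretrain", "inference_leaning"
    memory_bound: bool = False  # long sequences or weight-streaming bottleneck
    needs_cuda: bool = True     # custom kernels or CUDA-specific tooling


def recommend(w: Workload) -> str:
    """Map a workload onto the starting-point recommendations in the list."""
    if w.kind == "inference_leaning":
        return "AMD MI300X / MI325X with vLLM on ROCm"
    if w.kind == "prototype" and w.params_billion <= 13:
        return "Apple M4 Max / M3 Ultra locally, or a single H100 PCIe / cloud A100 80 GB"
    if w.kind == "pretrain" and w.params_billion >= 30:
        return "GB200 NVL72 cluster, or H100 / H200 clusters if Blackwell is constrained"
    if w.memory_bound and w.params_billion >= 70:
        return "H200 SXM5" if w.needs_cuda else "AMD MI300X"
    if w.kind in ("fine_tune", "pretrain") and w.params_billion <= 70:
        return "H100 SXM5 8-GPU HGX node"
    return "No direct match in the list; size this workload manually"


if __name__ == "__main__":
    print(recommend(Workload(7, "fine_tune")))
    print(recommend(Workload(70, "fine_tune", memory_bound=True, needs_cuda=False)))
    print(recommend(Workload(120, "pretrain")))
```

The order of the checks matters: memory-bound 70B+ work is tested before the generic 7B–70B case so that it lands on H200 or MI300X and not on the H100 node.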
Supply, lead times, and export controls
US export controls (the Entity List restrictions and the AI Diffusion Rule update) affect direct procurement of H100, H200, B100, and B200 in many countries. The Diffusion Rule, as published in January 2025, sorts countries into tiers. Tier 1 (close US allies) is effectively unrestricted. Tier 2 covers most of the world: it is subject to country-level caps, with a low-volume exception of roughly 1,700 H100-equivalents per buyer per year that can be deployed without a license. Tier 3 is restricted. The framework has been under revision since publication, so confirm the current BIS position before you rely on the tiers. For an enterprise team sizing an 8–64 GPU cluster, that exception threshold is far above what you will actually deploy.
The practical effect is on procurement channels: buy through authorized regional distributors, or through cloud providers (who hold their own licenses). Direct grey-market imports of large node counts carry legal and operational risk anywhere in the world. Lead times for H100 HGX nodes through authorized channels run 8–14 weeks globally. B200 / GB200 NVL72 on-prem procurement runs longer and varies by order size and supplier relationship. Cloud is still the fastest path to Blackwell capacity.
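The 8–14 week lead time for H100 HGX nodes translates directly into an ordering deadline. The sketch below works backwards from a target training start date. The two-week installation and burn-in buffer is a hypothetical assumption, and the function name is illustrative.

```python
from datetime import date, timedelta

# Lead-time range for H100 HGX nodes through authorized channels (from the text).
LEAD_WEEKS_MIN = 8
LEAD_WEEKS_MAX = 14


def order_window(target_start: date, install_buffer_weeks: int = 2):
    """Return (safe_order_by, optimistic_order_by) for a target start date.

    install_buffer_weeks is a hypothetical allowance for racking,
    cabling, and burn-in after delivery.
    """
    ready_by = target_start - timedelta(weeks=install_buffer_weeks)
    safe = ready_by - timedelta(weeks=LEAD_WEEKS_MAX)
    optimistic = ready_by - timedelta(weeks=LEAD_WEEKS_MIN)
    return safe, optimistic


if __name__ == "__main__":
    safe, optimistic = order_window(date(2026, 9, 1))
    print(f"Order by {safe} to be safe; {optimistic} at the latest if supply is calm")
```

Blackwell on-prem orders run longer than this range, so the constants would need adjusting for B200 or GB200 NVL72.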
If you are sizing a training cluster or weighing cloud vs. on-prem GPU infrastructure, HarmonyX has worked through this decision with banks, telcos, and government agencies in 2025. Workloads ranged from domain-specific fine-tuning to multi-node pretraining runs. We scope the architecture, model the TCO, and navigate the procurement and regulatory landscape — including AI Diffusion Rule tiering, regional incentive programmes (BOI in Thailand, similar elsewhere), and global data-center supply dynamics. Happy to compare notes with your team.