Training neural networks in 32-bit floating point (FP32) is almost always unnecessary and always expensive. Modern GPU architectures — NVIDIA Volta, Ampere, Hopper — include Tensor Cores that execute matrix multiplications in 16-bit precision at 2–4x the throughput of FP32 CUDA cores. Mixed precision training exploits this by running the forward and backward passes in 16-bit while maintaining 32-bit master weights for the optimizer update. The result: near-identical model quality at roughly double the training throughput. This guide explains how it works, which format to choose, and what can go wrong.
1. The Hardware Foundation: Tensor Cores
Understanding why mixed precision training works requires understanding Tensor Cores. A Tensor Core is a specialized matrix-multiplication unit on modern NVIDIA GPUs that performs D = A × B + C on small matrix tiles (4×4 in the original Volta design) per clock, operating on FP16, BF16, INT8, or TF32 inputs and accumulating in FP32. This FP32 accumulation is the key property: Tensor Core operations accept lower-precision inputs while maintaining numerical accuracy in the accumulation.
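The effect of FP32 accumulation can be illustrated without GPU hardware. The sketch below (plain NumPy, an analogy rather than Tensor Core code) compares a dot product of FP16 inputs accumulated in FP32 against the same dot product accumulated in FP16:

```python
import numpy as np

# Illustrative only: FP16 inputs with an FP32 accumulator keep the dot
# product accurate; an FP16 accumulator drifts as rounding errors pile up.
rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float16)
b = rng.standard_normal(4096).astype(np.float16)

acc32 = np.float32(0.0)   # FP16 inputs, FP32 accumulation (Tensor Core style)
acc16 = np.float16(0.0)   # FP16 inputs, FP16 accumulation (naive)
for x, y in zip(a, b):
    acc32 += np.float32(x) * np.float32(y)
    acc16 = np.float16(acc16 + x * y)

exact = float(a.astype(np.float64) @ b.astype(np.float64))
err32, err16 = abs(float(acc32) - exact), abs(float(acc16) - exact)
print(err32 < err16)  # FP32 accumulation is the more accurate of the two
```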
On an A100 GPU, Tensor Core peak throughput for BF16 matrix multiplication is 312 TFLOPS, versus 19.5 TFLOPS for FP32 CUDA cores — a 16x theoretical ratio. In practice, achievable speedups in deep learning workloads are 1.5–3x depending on operation mix, but even 1.5x translates directly to 33% cost reduction for the same model and dataset.
The critical condition for Tensor Core utilization: matrix dimensions must be multiples of 8 (for FP16/BF16) or multiples of 4 (for TF32). A linear layer with hidden size 512 → 512 fully utilizes Tensor Cores; a layer with 500 → 500 falls back to slower kernels (500 is not divisible by 8). This is why model dimension choices matter for performance: powers of 2, or multiples of 64 or 128, are Tensor Core-friendly. When designing model architectures, align all matrix dimensions to multiples of 64 at a minimum.
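A minimal helper for that alignment rule (an illustrative utility, not part of any framework):

```python
def pad_to_multiple(dim: int, multiple: int = 64) -> int:
    """Round a layer dimension up to the nearest Tensor Core-friendly multiple."""
    return ((dim + multiple - 1) // multiple) * multiple

print(pad_to_multiple(500), pad_to_multiple(512))  # -> 512 512
```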
2. FP16 vs. BF16: Choosing Your Format
Both FP16 and BF16 are 16-bit floating point formats, but they allocate their bits differently:
- FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits. Dynamic range: ~6×10⁻⁵ to ~65,504. Precision: ~3 decimal digits.
- BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits. Dynamic range: same as FP32 (~1.2×10⁻³⁸ to ~3.4×10³⁸). Precision: ~2 decimal digits.
The practical difference: FP16's narrow dynamic range makes it susceptible to gradient underflow (very small gradients rounding to zero) and overflow (gradient explosion clipping to infinity). BF16's FP32-matching dynamic range makes these problems essentially disappear. The tradeoff is slightly lower mantissa precision in BF16, but neural network training is generally tolerant of this.
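Both failure modes are easy to reproduce, since NumPy's float16 is IEEE FP16:

```python
import numpy as np

# FP16's range limits, demonstrated directly:
tiny = np.float16(1e-8)     # below the smallest FP16 subnormal (~6e-8) -> 0.0
huge = np.float16(70000.0)  # above FP16's max finite value (65504) -> inf
print(tiny, huge)

# The same values are unremarkable in FP32, and therefore in BF16's range:
print(np.float32(1e-8), np.float32(70000.0))
```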
The recommendation for 2025: use BF16 unless you have a specific reason not to. BF16 does not require loss scaling, is supported on all A100/H100 class hardware, and produces training runs that are numerically nearly indistinguishable from FP32 runs. The only case for preferring FP16 is when deploying to older hardware (pre-Ampere) that lacks BF16 support.
3. Loss Scaling for FP16 Training
If you must use FP16, loss scaling is essential. FP16's limited dynamic range means gradients computed during the backward pass can underflow to zero — particularly the small gradients deep in the network or early in training. Loss scaling addresses this by multiplying the loss by a large scalar (the "scale factor") before the backward pass, scaling up all gradients proportionally and shifting them out of the underflow region, then dividing the final gradients by the same scale factor before the optimizer update.
Dynamic loss scaling is the standard approach: start with a large scale factor (e.g., 65536) and double it after every N consecutive steps without overflow. When overflow is detected (any gradient contains NaN or Inf), skip the optimizer step, halve the scale factor, and continue. This adaptive approach automatically tracks the largest stable scale factor.
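The policy is simple enough to sketch in a few lines of Python. The class below illustrates the idea rather than PyTorch's internals; the names are assumptions for the sketch, and the growth interval of 2000 steps mirrors GradScaler's default.

```python
# Sketch of the dynamic loss scaling policy described above (illustrative).
class DynamicLossScaler:
    def __init__(self, init_scale=65536.0, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._steps_since_overflow = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should run this iteration."""
        if found_overflow:
            self.scale /= 2                   # back off and skip this step
            self._steps_since_overflow = 0
            return False
        self._steps_since_overflow += 1
        if self._steps_since_overflow == self.growth_interval:
            self.scale *= 2                   # probe a larger scale
            self._steps_since_overflow = 0
        return True

scaler = DynamicLossScaler()
print(scaler.update(found_overflow=True), scaler.scale)  # -> False 32768.0
```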
PyTorch's torch.cuda.amp.GradScaler (torch.amp.GradScaler in recent releases) handles this automatically. The integration is minimal:

```python
scaler = torch.cuda.amp.GradScaler()

for input, target in loader:
    optimizer.zero_grad()
    # Forward pass runs in FP16 under autocast.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        output = model(input)
        loss = criterion(output, target)
    # Backward runs outside autocast, on the scaled loss.
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales gradients; skips the step on overflow
    scaler.update()          # grows or shrinks the scale factor
```
For BF16, omit the scaler entirely — it is not needed and adds overhead without benefit.
4. Flash Attention: Exploiting Precision for Memory Efficiency
Flash Attention is one of the most impactful deep learning kernel optimizations of the past few years, and it directly intersects with mixed precision training. Standard attention computes the full O(n²) attention weight matrix and stores it in GPU memory; for long sequences, this becomes the dominant memory bottleneck. Flash Attention reimplements scaled dot-product attention with tiled computation: it processes the attention in blocks that fit in SRAM (the fast on-chip memory), never materializing the full attention matrix in HBM (the slower off-chip GPU memory).
The memory savings are dramatic: Flash Attention v2 reduces attention memory from O(n²) to O(n), enabling sequence lengths 4–8x longer than standard attention allows. The throughput improvement comes from the memory hierarchy: A100 HBM bandwidth is ~2 TB/s, while SRAM bandwidth is roughly 10x higher. By keeping intermediate results in SRAM, Flash Attention becomes compute-bound rather than bandwidth-bound, achieving near-theoretical Tensor Core utilization.
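The core trick, computing an exact softmax over tiles without ever holding all the scores at once, can be sketched for a single query vector in NumPy. This is an illustration of the online-softmax algorithm, not the fused CUDA kernel:

```python
import numpy as np

def tiled_attention(q, K, V, block=64):
    """Attention for one query vector, processing K/V in tiles.

    Numerically equivalent to softmax(K @ q) @ V, but never materializes
    the full score vector: the online-softmax idea behind Flash Attention.
    """
    m = -np.inf                 # running max of scores, for stable softmax
    denom = 0.0                 # running softmax denominator
    out = np.zeros(V.shape[1])  # running weighted sum of values
    for start in range(0, len(K), block):
        s = K[start:start + block] @ q      # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)           # rescale earlier partial results
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        out = out * scale + p @ V[start:start + block]
        m = m_new
    return out / denom

# Sanity check against the direct O(n^2) computation:
rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(16), rng.standard_normal((128, 16)), rng.standard_normal((128, 8))
p = np.exp(K @ q - (K @ q).max())
print(np.allclose(tiled_attention(q, K, V), (p / p.sum()) @ V))  # -> True
```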
Flash Attention v3 (H100) combines this with hardware-accelerated BF16 GEMM and achieves up to 75% of H100 peak compute for the attention operation — compared to under 30% for standard PyTorch attention. For transformers trained with sequences of 4096+ tokens, Flash Attention is the single highest-ROI optimization available.
5. Ops That Must Stay in FP32
Mixed precision training does not mean everything runs in 16-bit. Certain operations are numerically sensitive and must remain in FP32:
- Softmax: Small probability values can underflow in FP16, producing zero probabilities and zero gradients. Keep softmax in FP32.
- Layer normalization: Variance computation involves subtraction of nearly-equal numbers (catastrophic cancellation risk). Keep layernorm in FP32.
- Loss computation: Cross-entropy loss with small probabilities is prone to FP16 underflow. Keep the loss computation in FP32.
- Optimizer update: Optimizer states (Adam momentum/variance) accumulate small updates over many steps. FP16 precision is insufficient; keep optimizer states in FP32.
PyTorch's torch.autocast context manager handles these automatically, keeping a list of "always-FP32" operations that override the autocast context. You do not need to manually audit every operation — but you should know the list exists and verify it matches your model's operation profile if you implement custom CUDA kernels.
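The softmax case is easy to demonstrate: in FP16, a modest logit gap already rounds a probability to exactly zero, which would zero its gradient as well. A NumPy sketch (autocast avoids this by running softmax in FP32):

```python
import numpy as np

def softmax(x):
    x = x - x.max()         # standard max-subtraction for stability
    e = np.exp(x)
    return e / e.sum()

logits = np.array([0.0, 20.0], dtype=np.float16)
p16 = softmax(logits)                      # exp(-20) ~ 2e-9 underflows FP16
p32 = softmax(logits.astype(np.float32))   # FP32 keeps the small probability
print(p16)  # -> [0. 1.]
print(p32[0] > 0)
```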
6. Validating Numerical Equivalence
The final step in deploying mixed precision training is validation: confirm that BF16/FP16 training converges to the same model quality as FP32. This is not guaranteed — some model architectures and tasks are more sensitive to precision loss than others.
The validation protocol: train a small variant of the model (same architecture, reduced scale) in both FP32 and BF16/FP16 for enough steps to converge. Then compare final loss values (within 0.5% for BF16), gradient norms (qualitatively similar), and loss curve shapes (visually near-identical). If all three agree, mixed precision is safe to scale up.
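A trivial helper for the loss comparison, using the 0.5% tolerance from above (illustrative; names are assumptions):

```python
def losses_match(loss_fp32: float, loss_16bit: float, tol: float = 0.005) -> bool:
    """True if the 16-bit run's final loss is within tol (0.5%) of the FP32 run's."""
    return abs(loss_16bit - loss_fp32) / abs(loss_fp32) <= tol

print(losses_match(2.000, 2.008), losses_match(2.000, 2.030))  # -> True False
```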
Key Takeaways
- BF16 is the right default for Ampere+ hardware. No loss scaling, wide dynamic range, and essentially no model quality degradation.
- Tensor Core utilization requires dimension alignment. Design model dimensions as multiples of 64 to maximize Tensor Core throughput.
- Flash Attention is mandatory for transformer training. 2–4x attention speedup and sequence length scaling are not optional optimizations.
- Some ops must stay in FP32. Softmax, layernorm, loss computation, and optimizer states — PyTorch autocast handles this automatically.
- Validate equivalence before full-scale runs. A small-scale numerical comparison saves expensive surprises at scale.
Conclusion
Mixed precision training is the minimum viable optimization for any serious deep learning training workload. The combination of BF16 computation, Flash Attention, and Tensor Core-aligned dimensions routinely delivers 2–3x throughput improvement over naive FP32 training — translating directly to 50–67% cost reduction per training run at equivalent model quality. For a team running $1M in annual GPU compute, this is $500,000–$670,000 in savings from optimizations that require days of implementation, not months.
The Deepiix platform applies mixed precision and attention optimizations automatically across all training workloads, along with continuous performance regression testing that catches precision-related issues before they reach production runs. Contact us to learn more.