
FP4/FP8 Quantization: Storage and Compute for Low-Precision Inference


1. Quantization Format Comparison

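| Format | Bit width | Representation | Scale granularity | Hardware support |
| --- | --- | --- | --- | --- |
| FP8 E4M3 | 8 bit | 4 bit exp + 3 bit mantissa | per-128x128 block | H100/H200/B200 |
| FP8 E5M2 | 8 bit | 5 bit exp + 2 bit mantissa | same as above | H100/H200/B200 |
| FP4 E2M1 (NVFP4) | 4 bit | 2 bit exp + 1 bit mantissa | per-16 element, FP8 E4M3 scale | B200 native; no hardware support on H100/H200 |
| MXFP4 | 4 bit | same as E2M1 | per-32 element, shared E8M0 exponent | B200 |
| INT4 (W4A16/GPTQ/AWQ) | 4 bit | integer + zero-point + scale | per-group (32-128) | all GPUs (CUDA) |
| W4A8 | W: 4 bit, A: 8 bit | mixed | per-group, separately for weights and activations | Marlin kernel (H100+) |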

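NVFP4 (E2M1) value set: {0, +/-0.5, +/-1, +/-1.5, +/-2, +/-3, +/-4, +/-6}. The 4-bit format has 16 encodings, which yield 15 distinct values because +0 and -0 both represent zero.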
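The value set above can be reproduced by decoding all 16 four-bit codes. The sketch below assumes the usual E2M1 bit layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit); `decode_e2m1` is an illustrative helper name, not a function from any library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: sign | 2-bit exponent (bias 1) | 1-bit mantissa."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal range: the only magnitudes are 0 and 0.5.
        return sign * man * 0.5
    return sign * (1 + 0.5 * man) * 2.0 ** (exp - 1)


values = [decode_e2m1(c) for c in range(16)]
magnitudes = sorted({abs(v) for v in values})
print(magnitudes)
```

Codes 0 and 8 decode to +0 and -0, which is why 16 encodings give only 15 distinct values.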

2. FP4 Storage + FP8 Compute Mechanism

DeepSeek-V4-Pro is the first publicly released FP4+FP8 mixed checkpoint. Core strategy: load FP4, cast to FP8 on-chip, then run FP8xFP8 MMA.

Evidence chain:

  1. config.json: "expert_dtype": "fp4", "quantization_config": {"fmt": "e4m3", "quant_method": "fp8"}
  2. Model card: “FP4 + FP8 Mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.”
  3. kernel.py fp4_gemm_kernel: explicitly commented as “FP8 act x FP4 weight GEMM”
  4. Technical Report: “FP4xFP8 peak FLOPs are the same as FP8xFP8 on existing hardware”. FP4 therefore only saves memory and bandwidth; it does not speed up compute.

Full compute path:

GPU Memory:  weight stored as float4_e2m1fn_x2 [N, K//2] + E8M0 scale [N, K//32]
On-chip:     load 32-element FP4 block -> cast to FP8 -> FP8xFP8 tensor-core MMA
Accumulate:  FP32 -> apply act_scale x weight_scale -> output BF16

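Key takeaway: the value of FP4 lies in reducing memory footprint and bandwidth consumption; the computation itself still runs at FP8 precision.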
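To make the storage layout concrete, here is a minimal NumPy reference for unpacking a `[N, K//2]` packed FP4 weight with a `[N, K//32]` E8M0 scale. It is a sketch under stated assumptions, not the kernel: the nibble order (low nibble first) is assumed, the E8M0 byte is decoded as 2^(byte - 127), and the names `dequant_fp4` and `fp4_gemm_reference` are hypothetical. It also applies the weight scale before the matmul, whereas the compute path above applies scales after FP32 accumulation, and it returns FP32 rather than BF16 because NumPy has no BF16 type.

```python
import numpy as np

# E2M1 code -> value lookup table (index is the 4-bit code).
E2M1_LUT = np.array(
    [0, 0.5, 1, 1.5, 2, 3, 4, 6, -0.0, -0.5, -1, -1.5, -2, -3, -4, -6],
    dtype=np.float32,
)


def dequant_fp4(packed: np.ndarray, scale_e8m0: np.ndarray) -> np.ndarray:
    """packed: uint8 [N, K//2]; scale_e8m0: uint8 [N, K//32]. Returns float32 [N, K]."""
    lo = packed & 0x0F  # assumption: low nibble holds the even-index element
    hi = packed >> 4
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)  # [N, K]
    scale = np.exp2(scale_e8m0.astype(np.float32) - 127.0)  # power-of-two scale
    return E2M1_LUT[codes] * np.repeat(scale, 32, axis=1)


def fp4_gemm_reference(act: np.ndarray, packed: np.ndarray, scale_e8m0: np.ndarray) -> np.ndarray:
    """act: float32 [M, K], already dequantized. Returns float32 [M, N]."""
    return act @ dequant_fp4(packed, scale_e8m0).T


# One row, K = 32: every byte packs code 2 (1.0) and code 7 (6.0); scale byte 128 means 2^1.
packed = np.full((1, 16), 0x72, dtype=np.uint8)
scale = np.array([[128]], dtype=np.uint8)
w = dequant_fp4(packed, scale)
y = fp4_gemm_reference(np.ones((1, 32), dtype=np.float32), packed, scale)
```

A reference like this is mainly useful for checking a simulator's byte accounting and for validating kernel outputs on small shapes.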

3. Hardware Differences: B200/GB200 vs H100/H200

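| Feature | H100/H200 | B200/GB200 |
| --- | --- | --- |
| FP8 Tensor Core | ~1,979 TFLOPS (dense, H100 SXM) | ~2x H100 |
| FP4 Native MMA | not supported | supported (NVFP4/MXFP4, ~2x FP8 peak) |
| HBM capacity | 80GB (H100) / 141GB (H200) | 192GB (B200) |
| HBM bandwidth | 3.35 TB/s (H100) / 4.8 TB/s (H200) | 8 TB/s (B200) |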

Implications for simulator modeling:

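  • Running an FP4 checkpoint on H100/H200: the weights must be pre-converted FP4 -> FP8 (convert.py:cast_e2m1fn_to_e4m3fn); expert memory doubles, but FP8 compute then runs normally.
  • B200: FP4 native MMA can be used directly (theoretically 2x FP8 FLOPs), but the current DeepSeek-V4 kernel still takes the cast+FP8 path.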
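Since the benefit of FP4 on these paths is bandwidth rather than FLOPs, a simulator might estimate a lower bound on the time to stream expert weights from HBM. The sketch below is one way to do that, using the HBM bandwidth figures from the table in this section; the function name, the format keys, and the pure weight-streaming model (no activation traffic, no overlap effects) are assumptions made for illustration.

```python
# HBM bandwidth in TB/s, taken from the hardware comparison table.
HBM_BW_TBPS = {"H100": 3.35, "H200": 4.8, "B200": 8.0}

# Bytes per weight element including scale metadata (hypothetical format keys).
BYTES_PER_ELEM = {
    "fp4": 0.5 + 1 / 32,           # packed 4-bit value + one scale byte per 32 elements
    "fp8_preexpand": 1 + 1 / 128,  # 8-bit value + one scale byte per 128 elements
}


def weight_read_time_us(n_elems: int, hardware: str, runtime_format: str) -> float:
    """Lower-bound time (microseconds) to read n_elems weights once from HBM."""
    n_bytes = n_elems * BYTES_PER_ELEM[runtime_format]
    return n_bytes / (HBM_BW_TBPS[hardware] * 1e12) * 1e6


n = 4096 * 2048 * 3  # illustrative size of one expert
t_h100 = weight_read_time_us(n, "H100", "fp8_preexpand")
t_b200 = weight_read_time_us(n, "B200", "fp4")
print(f"H100 (pre-expanded FP8): {t_h100:.2f} us, B200 (FP4 storage): {t_b200:.2f} us")
```

In a bandwidth-bound decode step the gap between the two paths comes from both the narrower storage format and the higher HBM bandwidth, even when the MMA itself runs at FP8 rate.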

4. FP4 Compatibility Path on H100

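cast_e2m1fn_to_e4m3fn() implementation:

  1. Multiply the FP4 values {0, +/-0.5, ..., +/-6} by the per-32 E8M0 scale -> FP32
  2. Re-quantize to per-128 FP8 E4M3 with a new E8M0 block scale
  3. The conversion is lossless (the FP4 value range lies entirely within the FP8 E4M3 range)
  4. Cost: expert memory grows from N x K / 2 bytes (4-bit packed) to N x K bytes (8-bit), i.e. it doubles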
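Steps 1 to 3 above can be sketched in NumPy as follows. This is not the code from convert.py: the helper names are hypothetical, E4M3 is simulated with a table of its representable magnitudes (fn variant, bias 7, maximum 448), and the new per-128 scale is assumed to be the largest of the four per-32 exponents in the group. The returned flag reports whether the round trip was exact for the given input.

```python
import numpy as np


def e4m3fn_magnitudes() -> np.ndarray:
    """All non-negative finite FP8 E4M3 (fn) values: 4-bit exponent, bias 7, 3-bit mantissa."""
    vals = set()
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # this encoding is NaN
            if e == 0:
                vals.add(m / 8 * 2.0 ** -6)          # subnormals
            else:
                vals.add((1 + m / 8) * 2.0 ** (e - 7))
    return np.array(sorted(vals))


E4M3 = e4m3fn_magnitudes()


def requantize_to_e4m3(fp4_vals: np.ndarray, exp32: np.ndarray, group: int = 128):
    """fp4_vals: [K] decoded E2M1 values; exp32: [K//32] integer power-of-two exponents."""
    # Step 1: apply the per-32 scale.
    x = fp4_vals * np.exp2(np.repeat(exp32, 32).astype(np.float64))
    x = x.reshape(-1, group)
    # Step 2: one power-of-two scale per 128 elements (assumed: max exponent in the group).
    new_exp = exp32.reshape(-1, group // 32).max(axis=1)
    q = x / np.exp2(new_exp.astype(np.float64))[:, None]
    # Snap each magnitude to the closest representable E4M3 value.
    idx = np.abs(np.abs(q)[..., None] - E4M3).argmin(axis=-1)
    q8 = np.sign(q) * E4M3[idx]
    # Step 3: check whether anything was lost.
    return q8, new_exp, bool(np.array_equal(q8, q))


rng = np.random.default_rng(0)
e2m1 = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])
vals = rng.choice(e2m1, 256) * rng.choice([-1.0, 1.0], 256)
exp32 = np.array([0, -1, -2, -3, 5, 5, 4, 2])
q8, new_exp, lossless = requantize_to_e4m3(vals, exp32)
print(lossless)
```

Under this choice of scale, exactness depends on how far apart the per-32 exponents within one 128-element group are, so a simulator or converter may want to keep the check rather than assume it.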

5. Scale Metadata Strategy Comparison

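| Strategy | Memory | Performance | Use case |
| --- | --- | --- | --- |
| FP4 storage + on-chip dequant | smallest: N*K/2 + N*K/32 (scale) | extra cast overhead (hidden behind MMA) | B200 or bandwidth-bound decode |
| Pre-expanded to FP8, resident | 2x FP4: N*K + N*K/128 (FP8 scale) | no cast overhead, direct FP8 MMA | H100/H200 (no FP4 hardware) |
| Block scale (per-128x128) | FP8 standard | standard | non-expert parameters |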

6. Simulator Modeling Formulas

Expert weight memory [per-rank]:

if native_fp4:
  n_local_experts x 3 x (dim x inter_dim / 2 + dim x inter_dim / 32)  [bytes]

if fp4_to_fp8 (H100/H200):
  n_local_experts x 3 x (dim x inter_dim + dim x inter_dim / 128)     [bytes]

if fp8_native:
  n_local_experts x 3 x (dim x inter_dim + dim x inter_dim / (128x128) x 2)  [bytes]
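The three formulas above translate directly into a small helper. The sketch below mirrors them term by term; the function name and the example sizes (8 local experts, dim 4096, inter_dim 2048) are illustrative assumptions, and the factor 3 is read here as the number of weight matrices per expert.

```python
def expert_weight_bytes(mode: str, n_local_experts: int, dim: int, inter_dim: int) -> int:
    """Per-rank expert weight memory in bytes for one of the three runtime modes."""
    elems = dim * inter_dim
    if mode == "native_fp4":
        per_matrix = elems // 2 + elems // 32             # packed FP4 + per-32 scale
    elif mode == "fp4_to_fp8":
        per_matrix = elems + elems // 128                 # FP8 + per-128 scale
    elif mode == "fp8_native":
        per_matrix = elems + elems // (128 * 128) * 2     # FP8 + per-128x128 block scale
    else:
        raise ValueError(f"unknown mode: {mode}")
    return n_local_experts * 3 * per_matrix


sizes = {
    mode: expert_weight_bytes(mode, n_local_experts=8, dim=4096, inter_dim=2048)
    for mode in ("native_fp4", "fp4_to_fp8", "fp8_native")
}
for mode, nbytes in sizes.items():
    print(f"{mode:12s} {nbytes / 2**20:8.1f} MiB")
```

With scale metadata included, the pre-expanded FP8 footprint is 129/68 (about 1.9x) of the FP4 footprint rather than exactly 2x, because the FP4 layout carries one scale byte per 32 elements while the FP8 layout carries one per 128.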

Capability Matrix (the simulator should implement this as a lookup table):

| Checkpoint Format | Hardware | Runtime Format | Memory Multiplier | Compute Speed vs FP8 |
| --- | --- | --- | --- | --- |
| fp4 | B200 (native kernel) | fp4_native_mma | 1.0x | 2.0x |
| fp4 | B200 (current V4 kernel) | fp4_cast_fp8 | 1.0x | 1.0x |
| fp4 | H100 | fp4_to_fp8_preexpand | 2.0x | 1.0x |
| fp4 | H200 | fp4_to_fp8_preexpand | 2.0x | 1.0x |
| fp8 | B200/H100/H200 | fp8_native | 1.0x | 1.0x |
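One possible shape for that lookup table is a dictionary keyed by checkpoint format, hardware, and kernel variant. The values below are copied from the capability matrix; the key structure, the `kernel` labels ("native", "cast", "default"), and the names `RuntimeEntry` and `lookup` are assumptions for illustration only.

```python
from typing import NamedTuple


class RuntimeEntry(NamedTuple):
    runtime_format: str
    memory_multiplier: float
    compute_speed_vs_fp8: float


# (checkpoint_format, hardware, kernel) -> runtime behaviour
CAPABILITY = {
    ("fp4", "B200", "native"): RuntimeEntry("fp4_native_mma", 1.0, 2.0),
    ("fp4", "B200", "cast"): RuntimeEntry("fp4_cast_fp8", 1.0, 1.0),
    ("fp4", "H100", "default"): RuntimeEntry("fp4_to_fp8_preexpand", 2.0, 1.0),
    ("fp4", "H200", "default"): RuntimeEntry("fp4_to_fp8_preexpand", 2.0, 1.0),
    ("fp8", "B200", "default"): RuntimeEntry("fp8_native", 1.0, 1.0),
    ("fp8", "H100", "default"): RuntimeEntry("fp8_native", 1.0, 1.0),
    ("fp8", "H200", "default"): RuntimeEntry("fp8_native", 1.0, 1.0),
}


def lookup(checkpoint_format: str, hardware: str, kernel: str = "default") -> RuntimeEntry:
    """Return the runtime entry; FP4 on B200 falls back to the cast path by default."""
    if (checkpoint_format, hardware, kernel) == ("fp4", "B200", "default"):
        kernel = "cast"  # assumed default: the current V4 kernel path
    try:
        return CAPABILITY[(checkpoint_format, hardware, kernel)]
    except KeyError:
        raise ValueError(f"unsupported combination: {checkpoint_format}, {hardware}, {kernel}")


entry = lookup("fp4", "H100")
print(entry.runtime_format, entry.memory_multiplier, entry.compute_speed_vs_fp8)
```

Raising on unknown combinations, rather than silently defaulting, keeps the simulator from reporting numbers for configurations the matrix does not cover.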

7. Relation to Other Topics
