Simulator Modeling Guide: Memory and Throughput Formulas

1. The Five Components of GPU Memory

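Component            | Lifecycle                         | Typical share | Scope
Weights              | Resident                          | 30-80%        | per-rank (sharded by TP/PP/EP)
KV Cache             | Grows and is freed with requests  | 10-60%        | per-rank (sharded by TP/CP)
Activations          | Transient during the forward pass | 1-15%         | per-rank
Workspace            | Kernel temporary buffers          | 1-5%          | per-rank
Communication Buffer | NCCL/DeepEP double buffering      | 0.5-3%        | per-rank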

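Common modeling errors:

  • KV cache is often reported as a “per-request per-layer” size, but the simulator needs the per-rank peak = Sum(KV of all concurrent requests) / TP_size
  • In EP mode each rank holds only n_experts / EP_size routed experts, but the shared expert and attention are not sharded by EP
  • The activation peak occurs at the largest prefill chunk, not during decode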
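
The first item can be made concrete with a minimal sketch. The function name and the request sizes are hypothetical; each list entry is assumed to be one request's KV size already summed over all layers.

```python
def kv_peak_per_rank(per_request_kv_bytes, tp_size):
    """Per-rank KV peak = Sum(KV of all concurrent requests) / TP_size."""
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    return sum(per_request_kv_bytes) / tp_size


# Hypothetical workload: three concurrent requests (bytes, summed over layers).
concurrent_kv = [4_000_000_000, 1_500_000_000, 500_000_000]
peak = kv_peak_per_rank(concurrent_kv, tp_size=4)
print(f"per-rank KV peak: {peak / 1e9:.2f} GB")
```

Feeding a single request's per-layer size into the simulator instead would understate the peak by a factor of roughly (number of layers x number of concurrent requests).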

2. Weights Memory [per-rank]

def weights_memory_per_rank(config, tp, ep, pp):
    pp_layers = config['n_layers'] // pp

    # Attention (per layer, TP-sharded)
    attn_params_per_layer = (
        config['dim'] * config['q_lora_rank'] +                              # wq_a
        config['q_lora_rank'] * config['n_heads'] * config['head_dim'] // tp + # wq_b
        config['dim'] * config['head_dim'] +                                  # wkv
        config['n_heads'] * config['head_dim'] // tp * config['o_lora_rank'] + # wo_a
        config['o_groups'] * config['o_lora_rank'] // tp * config['dim']       # wo_b
    )

    # MoE experts (EP-sharded)
    n_local_experts = config['n_routed_experts'] // ep
    expert_params = 3 * config['dim'] * config['moe_inter_dim']  # w1 + w2 + w3

    # Shared expert (TP-sharded, not EP)
    shared_expert_params = 3 * config['dim'] * config['moe_inter_dim'] // tp

    # Bytes by precision
    attn_bytes = attn_params_per_layer * bytes_per_elem(config['attn_dtype'])
    expert_bytes = n_local_experts * expert_params * bytes_per_elem(config['expert_dtype'])
    shared_bytes = shared_expert_params * bytes_per_elem(config['shared_expert_dtype'])

    # Scale overhead
    attn_scale = scale_overhead(attn_params_per_layer, config['attn_dtype'])
    expert_scale = scale_overhead(n_local_experts * expert_params, config['expert_dtype'])

    # Small components (gate, norms, embeddings)
    gate_bytes = config['n_routed_experts'] * config['dim'] * 4       # FP32
    norm_bytes = 4 * config['dim'] * 4                                # 4 norms, FP32
    embed_bytes = config['vocab_size'] * config['dim'] // tp * 2      # BF16

    # HC (Hyper-Connection) parameters
    hc = config['hc_mult']  # 4
    hc_per_layer = (2 + hc) * hc * config['dim'] * hc * 4 * 2        # attn + ffn

    per_layer = (attn_bytes + expert_bytes + shared_bytes
                 + attn_scale + expert_scale + gate_bytes + norm_bytes + hc_per_layer)
    total = pp_layers * per_layer + embed_bytes
    return total

def bytes_per_elem(dtype):
    return {'fp4': 0.5, 'fp8': 1, 'bf16': 2, 'fp32': 4}[dtype]

def scale_overhead(n_params, dtype):
    if dtype == 'fp4':
        return n_params // 32 * 1       # per-32 E8M0 scale
    elif dtype == 'fp8':
        return n_params // (128*128) * 2  # per-128x128 block, 2D
    return 0
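
To see how EP changes the dominant term, the routed-expert part of the function above can be isolated. This is a sketch under assumed, hypothetical dimensions (256 routed experts, dim 4096, moe_inter_dim 2048), not the real model configuration; it uses the same FP4 rules as above (0.5 byte per parameter plus one 1-byte scale per 32 parameters).

```python
def routed_expert_bytes_per_rank(n_routed_experts, ep, dim, moe_inter_dim):
    """FP4 routed-expert bytes held by one rank for one layer (weights + scales)."""
    n_local_experts = n_routed_experts // ep             # EP shards routed experts only
    params = n_local_experts * 3 * dim * moe_inter_dim   # w1 + w2 + w3 per expert
    weight_bytes = params * 0.5                          # FP4: half a byte per parameter
    scale_bytes = params // 32 * 1                       # one 1-byte E8M0 scale per 32 params
    return weight_bytes + scale_bytes


# Hypothetical dimensions, for illustration only.
for ep in (8, 16):
    size = routed_expert_bytes_per_rank(256, ep, 4096, 2048)
    print(f"EP={ep}: {size / 1e6:.1f} MB per layer per rank")
```

Doubling EP halves this term, while the attention and shared-expert terms stay unchanged because they are sharded by TP only.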

3. KV Cache Memory [per-rank, per-request]

Heterogeneous CSA/HCA KV cache (see CSA/HCA Attention for details):

def kv_cache_per_request(config, seq_len, tp=1, cp=1):
    head_dim = config['head_dim']         # 512
    nope_dim = head_dim - config['rope_head_dim']  # 448
    bytes_per_entry = nope_dim * 1 + config['rope_head_dim'] * 2  # FP8 + BF16 = 576

    total_entries = 0
    for ratio in config['compress_ratios']:
        win = config['window_size']  # 128
        if ratio == 0:      total_entries += win
        elif ratio == 4:    total_entries += win + seq_len // 4
        else:               total_entries += win + seq_len // ratio

    kv_bytes = total_entries * bytes_per_entry // cp

    # Indexer (FP4, CSA layers only)
    n_csa = sum(1 for r in config['compress_ratios'] if r == 4)
    indexer_bytes = n_csa * (seq_len // 4) * config['index_head_dim'] * 0.5

    return kv_bytes + indexer_bytes

V4-Pro 1M-token example:

Component           | Calculation              | Size
CSA (29 layers)     | 29 x 250,128 x 576       | ~4.18 GB
HCA (31 layers)     | 31 x 7,940 x 576         | ~0.14 GB
SWA (1 layer)       | 1 x 128 x 576            | ~0.00007 GB
Indexer (29 layers) | 29 x 250,000 x 128 x 0.5 | ~0.46 GB
Total per-request   |                          | ~4.78 GB
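
The rows of this example can be recomputed with a short script. The inputs are inferred from the table and the code comments rather than taken from a real config file: seq_len = 1,000,000, window 128, 576 bytes per entry, index_head_dim 128, and an HCA compress ratio of 128 (inferred from 7,940 = 128 + 1,000,000 // 128).

```python
SEQ_LEN = 1_000_000
WINDOW = 128
BYTES_PER_ENTRY = 448 * 1 + 64 * 2   # FP8 nope part + BF16 rope part = 576
INDEX_HEAD_DIM = 128

# (label, n_layers, compress_ratio); ratio 0 means sliding window only.
LAYER_GROUPS = [("CSA", 29, 4), ("HCA", 31, 128), ("SWA", 1, 0)]


def entries_per_layer(ratio):
    """Cached entries in one layer: the window plus the compressed history."""
    return WINDOW if ratio == 0 else WINDOW + SEQ_LEN // ratio


sizes = {label: n * entries_per_layer(r) * BYTES_PER_ENTRY
         for label, n, r in LAYER_GROUPS}
sizes["Indexer"] = 29 * (SEQ_LEN // 4) * INDEX_HEAD_DIM * 0.5   # FP4, CSA layers only
total = sum(sizes.values())

for label, size in sizes.items():
    print(f"{label:8s} {size / 1e9:.5f} GB")
print(f"{'Total':8s} {total / 1e9:.5f} GB")
```

Sizes are in decimal GB (1e9 bytes), which is the unit the table appears to use.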

4. Activation / Workspace Memory [per-rank]

Peak activation occurs during the largest prefill chunk. With an HC multiplier of 4, the residual stream takes 4x its normal footprint.

Decode (bs=B, seq=1):

hc_residual       = B x hc(4) x dim x 4 bytes     (FP32)
q_buffer          = B x (n_heads/tp) x head_dim x 2  (BF16)
moe_workspace     = B x dim x 4                    (FP32 accumulator)
shared_expert_buf = B x inter_dim x 2 x 4          (gate+up, FP32)

peak = hc_residual + max(q_buffer, moe_workspace + shared_expert_buf)

Prefill (chunk_size=C):

activation_peak = B x C x dim x hc_mult x 4       (HC residual, FP32)
                + B x C x (n_heads/tp) x head_dim x 2  (Q buffer)
                + B x C x inter_dim x 2 x 4        (MoE gate+up, FP32)
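
The two formulas above translate directly into code. The sketch below uses hypothetical dimensions (dim 4096, 64 heads, head_dim 512, inter_dim 2048, TP 8) chosen only to show the gap between decode and prefill; the function names are illustrative.

```python
def decode_activation_peak(B, dim, n_heads, head_dim, inter_dim, tp, hc_mult=4):
    """Decode peak in bytes for batch size B and one new token per request."""
    hc_residual = B * hc_mult * dim * 4                # FP32 residual streams
    q_buffer = B * (n_heads // tp) * head_dim * 2      # BF16
    moe_workspace = B * dim * 4                        # FP32 accumulator
    shared_expert_buf = B * inter_dim * 2 * 4          # gate + up, FP32
    # Attention and MoE buffers are not live at the same time, hence max().
    return hc_residual + max(q_buffer, moe_workspace + shared_expert_buf)


def prefill_activation_peak(B, C, dim, n_heads, head_dim, inter_dim, tp, hc_mult=4):
    """Prefill peak in bytes for batch size B and chunk size C."""
    return (B * C * dim * hc_mult * 4                  # HC residual, FP32
            + B * C * (n_heads // tp) * head_dim * 2   # Q buffer, BF16
            + B * C * inter_dim * 2 * 4)               # MoE gate + up, FP32


dims = dict(dim=4096, n_heads=64, head_dim=512, inter_dim=2048, tp=8)  # hypothetical
decode = decode_activation_peak(B=32, **dims)
prefill = prefill_activation_peak(B=1, C=8192, **dims)
print(f"decode:  {decode / 1e6:.1f} MB")
print(f"prefill: {prefill / 1e6:.1f} MB")
```

With these numbers a single 8192-token prefill chunk needs over two hundred times the activation memory of a 32-request decode step, which is why the simulator should size activations from the largest prefill chunk.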

5. MoE Workspace Modeling

See the execution flow in MoE Inference for details.

EP all-to-all dispatch buffer [per-rank]:
  = max_tokens_per_rank x dim x bytes x 2 (double-buffer)
  where max_tokens_per_rank = batch x seq x n_activated_experts / EP_size
  (capped by DeepEP ElasticBuffer config)

Grouped GEMM workspace [per-rank]:
  = max_tokens_per_expert x inter_dim x 4 (FP32 accumulator)
  where max_tokens_per_expert = batch x seq x load_factor / n_local_experts
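
A minimal sketch of both buffers follows. The parameter buffer_cap_tokens is a hypothetical stand-in for the DeepEP ElasticBuffer limit, modeled here as a plain token cap; the sample sizes are illustrative only.

```python
import math


def ep_dispatch_buffer_bytes(batch, seq, n_activated_experts, ep_size,
                             dim, bytes_per_elem, buffer_cap_tokens=None):
    """All-to-all dispatch buffer per rank, double-buffered."""
    max_tokens_per_rank = math.ceil(batch * seq * n_activated_experts / ep_size)
    if buffer_cap_tokens is not None:
        # Assumption: the buffer config acts as an upper bound on tokens.
        max_tokens_per_rank = min(max_tokens_per_rank, buffer_cap_tokens)
    return max_tokens_per_rank * dim * bytes_per_elem * 2


def grouped_gemm_workspace_bytes(batch, seq, load_factor, n_local_experts, inter_dim):
    """Grouped GEMM FP32 accumulator workspace per rank."""
    max_tokens_per_expert = math.ceil(batch * seq * load_factor / n_local_experts)
    return max_tokens_per_expert * inter_dim * 4


print(ep_dispatch_buffer_bytes(1, 8192, 8, 8, 4096, 1))
print(ep_dispatch_buffer_bytes(1, 8192, 8, 8, 4096, 1, buffer_cap_tokens=4096))
print(grouped_gemm_workspace_bytes(1, 8192, 2.0, 32, 2048))
```

Rounding up with math.ceil is a conservative choice for a peak estimate; load_factor should come from profiling rather than a constant.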

6. FP4 Capability Matrix

See FP4/FP8 Quantization for details.

CAPABILITY_MATRIX = {
    # (checkpoint_format, hardware) -> (runtime_format, mem_multiplier, speed_vs_fp8)
    ('fp4', 'B200'):              ('fp4_native_mma',      1.0, 2.0),  # theoretical
    ('fp4', 'B200_current'):      ('fp4_cast_fp8',        1.0, 1.0),  # actual behavior in V4
    ('fp4', 'H100'):              ('fp4_to_fp8_preexpand', 2.0, 1.0),
    ('fp4', 'H200'):              ('fp4_to_fp8_preexpand', 2.0, 1.0),
    ('fp8', 'B200/H100/H200'):    ('fp8_native',          1.0, 1.0),
}

When H100/H200 runs an FP4 checkpoint, expert memory doubles. The simulator must account for this factor.
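
One way to apply the factor is to scale the checkpoint-size expert bytes by mem_multiplier before budgeting memory. The sketch below copies only the multipliers from the matrix above; the function name and the 40 GB figure are hypothetical.

```python
# mem_multiplier values copied from the capability matrix above.
MEM_MULTIPLIER = {
    ("fp4", "B200"): 1.0,
    ("fp4", "H100"): 2.0,
    ("fp4", "H200"): 2.0,
}


def runtime_expert_bytes(checkpoint_expert_bytes, checkpoint_format, hardware):
    """Expert bytes resident at runtime, after any format pre-expansion."""
    return checkpoint_expert_bytes * MEM_MULTIPLIER[(checkpoint_format, hardware)]


ckpt_bytes = 40_000_000_000  # hypothetical per-rank FP4 expert weights
for hw in ("B200", "H100"):
    print(hw, runtime_expert_bytes(ckpt_bytes, "fp4", hw) / 1e9, "GB")
```

A missing key raises KeyError, which is intentional: an unknown (format, hardware) pair should fail loudly rather than silently default to 1.0.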

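7. MTP Overhead Modeling

The MTP block reuses the parameters of the main model's last layer, so its cost should not be modeled by simply multiplying by (nextn+1).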

def mtp_overhead(config, batch_size, seq_len_or_1):
    dim = config['dim']
    vocab = config['vocab_size']     # 129,280
    nextn = config['num_nextn_predict_layers']  # 1

    # Extra params (small): enorm + hnorm + eh_proj
    mtp_weight_bytes = (2 * dim + 2 * dim * dim) * 1  # FP8

    # Extra activation
    mtp_activation = batch_size * seq_len_or_1 * 2 * dim * 2  # BF16

    # Logits buffer (persists until verify complete)
    logits_buffer = batch_size * seq_len_or_1 * vocab * 4 * nextn  # FP32

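    # Extra KV for speculative tokens (tiny).
    # NOTE: kv_per_token() is not defined in this guide; it is assumed to
    # return the per-layer KV bytes for a single token.
    extra_kv = nextn * kv_per_token(config) * config['n_layers']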

    return mtp_weight_bytes + mtp_activation + logits_buffer + extra_kv

Key point: in training, MTP overhead is (nextn+1) x total_activation, but in inference MTP runs sequentially with the main model, so the activation space can be reused.

8. Per-rank vs Global/Cluster

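Metric          | Per-rank                         | Global/Cluster Total
Weights         | Bytes actually stored on the GPU | Sum(all ranks) = total model params x bytes
KV cache        | Single-GPU KV pool capacity      | Sum(all ranks) x TP
Activation peak | Single-GPU max_chunk prefill peak | Not meaningful (peaks occur simultaneously and are not summed)
Throughput      | Single-GPU tokens/s              | Sum over all DP replicas
Max batch       | Limited by a single GPU          | = per-rank max_batch x DP_size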

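Common errors:

  1. Using “1.6T params” to compute single-GPU memory (it should be divided by EP x TP x PP)
  2. Using global KV to estimate single-GPU usage (it should be divided by TP)
  3. Summing activations across ranks (not meaningful)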
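
The per-rank versus global distinction for max batch can be sketched as follows. The budget rule (HBM minus weights, activation peak, and workspace leaves the KV pool) is an assumption of this sketch, and all byte figures are hypothetical.

```python
def max_batch_per_rank(hbm_bytes, weights_bytes, activation_peak_bytes,
                       workspace_bytes, kv_bytes_per_request_per_rank):
    """Requests that fit in one GPU's KV pool after fixed costs are reserved."""
    kv_pool = hbm_bytes - weights_bytes - activation_peak_bytes - workspace_bytes
    if kv_pool <= 0:
        return 0
    return kv_pool // kv_bytes_per_request_per_rank


def global_max_batch(per_rank_max_batch, dp_size):
    """Cluster-level max batch: only DP replicas add capacity."""
    return per_rank_max_batch * dp_size


per_rank = max_batch_per_rank(
    hbm_bytes=80_000_000_000,
    weights_bytes=50_000_000_000,         # already divided by EP x TP x PP
    activation_peak_bytes=2_000_000_000,
    workspace_bytes=3_000_000_000,
    kv_bytes_per_request_per_rank=1_200_000_000,
)
print(per_rank, global_max_batch(per_rank, dp_size=4))
```

Every input here is a per-rank quantity; mixing in a global figure such as total model parameters is exactly the first error listed above.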

9. Config Options vs Calibration Table

Should be config options (vary by deployment):

  • tp, ep, pp, dp, cp
  • batch_size, max_seq_len, chunk_size
  • kv_quant_bits, expert_dtype, attn_dtype
  • block_size (PagedAttention)
  • prefix_cache_hit_ratio
  • Hardware: hbm_capacity, hbm_bandwidth, flops_fp8, flops_fp4, nvlink_bw, pcie_bw

Should be calibration tables (require profiling):

  • kernel_efficiency: actual vs theoretical peak (50-80%)
  • all_to_all_latency(msg_size, ep_size): nonlinear, must be measured
  • grouped_gemm_efficiency(n_experts, tokens_per_expert): drops under load imbalance
  • sparse_attn_overhead: extra cost ratio from the CSA indexer + sparse gather
  • prefix_cache_hit_rate(workload): workload-dependent
  • ep_load_balance_factor: actual vs ideal uniform distribution
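
The split between the two lists could look like this in a simulator: deployment knobs live in a plain config object, while measured quantities are looked up in a table. The class name, the field selection, and every latency point below are hypothetical placeholders, not profiling results.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployConfig:
    """Config options: set per deployment, no profiling needed."""
    tp: int
    ep: int
    pp: int
    dp: int
    cp: int
    batch_size: int
    max_seq_len: int
    chunk_size: int


# Calibration table: ep_size -> sorted [(msg_bytes, latency_us)] from profiling.
ALL_TO_ALL_LATENCY_US = {
    8: [(65_536, 20.0), (1_048_576, 60.0), (16_777_216, 500.0)],
}


def all_to_all_latency_us(msg_size, ep_size):
    """Linearly interpolate between profiled points; clamp outside the range."""
    points = ALL_TO_ALL_LATENCY_US[ep_size]
    sizes = [s for s, _ in points]
    i = bisect_left(sizes, msg_size)
    if i == 0:
        return points[0][1]
    if i == len(points):
        return points[-1][1]
    (s0, l0), (s1, l1) = points[i - 1], points[i]
    return l0 + (l1 - l0) * (msg_size - s0) / (s1 - s0)


cfg = DeployConfig(tp=8, ep=8, pp=1, dp=4, cp=1,
                   batch_size=32, max_seq_len=131_072, chunk_size=8192)
print(all_to_all_latency_us(557_056, cfg.ep))
```

Clamping outside the profiled range is a simplification; for message sizes beyond the largest measured point, extending the table is safer than extrapolating.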

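10. Formulas That Must Not Be Hardcoded

  1. Arithmetic intensity crossover: compute_bound if FLOPs/Byte > hardware_oi; the OI threshold varies by GPU model
  2. Expert load imbalance: top-k routing is uniform in theory but skewed in practice -> grouped GEMM efficiency requires profiling
  3. Overlap efficiency: the actual DeepEP communication-compute overlap ratio depends on the kernel launch pattern
  4. KV cache fragmentation: PagedAttention utilization depends on the seq_len distribution
  5. CSA indexer FLOPs: theoretically seq/4 x 128 x 64; in practice there is additional Hadamard + FP4 quantization overhead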
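
Item 1 is the standard roofline test, where hardware_oi is peak FLOP/s divided by memory bandwidth. The sketch below shows why the threshold cannot be a constant: the same kernel lands on different sides of it on two GPUs. Both hardware entries are made-up illustrative numbers, not real product specifications.

```python
def hardware_oi(peak_flops, hbm_bandwidth):
    """Operational intensity at the roofline ridge point, in FLOPs per byte."""
    return peak_flops / hbm_bandwidth


def is_compute_bound(flops, bytes_moved, peak_flops, hbm_bandwidth):
    """compute_bound if FLOPs/Byte > hardware_oi."""
    return flops / bytes_moved > hardware_oi(peak_flops, hbm_bandwidth)


# Hypothetical hardware entries: (peak FLOP/s, HBM bytes/s).
GPUS = {
    "gpu_a": (1.0e15, 2.0e12),   # ridge point at 500 FLOPs/Byte
    "gpu_b": (2.0e15, 8.0e12),   # ridge point at 250 FLOPs/Byte
}

kernel_flops, kernel_bytes = 3.0e12, 1.0e10   # 300 FLOPs/Byte
for name, (peak, bw) in GPUS.items():
    bound = "compute" if is_compute_bound(kernel_flops, kernel_bytes, peak, bw) else "memory"
    print(f"{name}: {bound}-bound")
```

In a simulator, peak_flops and hbm_bandwidth would come from the hardware config options, and the resulting time estimate would then be scaled by the profiled kernel_efficiency.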

11. Relation to Other Topics
