Simulator Modeling Guide: Memory and Throughput Formulas

1. The Five Components of GPU Memory

Component	Lifetime	Typical share	Scope
Weights	Resident	30-80%	per-rank (sharded by TP/PP/EP)
KV Cache	Grows and is freed with requests	10-60%	per-rank (sharded by TP/CP)
Activations	Transient during the forward pass	1-15%	per-rank
Workspace	Temporary kernel buffers	1-5%	per-rank
Communication Buffer	NCCL/DeepEP double buffering	0.5-3%	per-rank

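Common modeling errors:

The KV cache is often reported as a "per-request, per-layer" size, but the simulator needs the per-rank peak = Sum(KV of all concurrent requests) / TP_size
In EP mode each rank holds only n_experts / EP_size routed experts, but the shared expert and the attention weights are not sharded by EP
The activation peak occurs at the largest prefill chunk, not during decode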
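
The first error above can be made concrete with a minimal sketch of the per-rank peak formula. The function name `kv_peak_per_rank` and the request sizes are illustrative assumptions; each list entry stands for one request's KV size summed over all layers.

```python
def kv_peak_per_rank(kv_bytes_per_request, tp_size):
    """Per-rank KV peak = Sum(KV of all concurrent requests) / TP_size."""
    return sum(kv_bytes_per_request) / tp_size

# Three concurrent requests with hypothetical all-layer KV sizes (bytes).
requests = [4.0e9, 1.0e9, 0.5e9]
peak = kv_peak_per_rank(requests, tp_size=2)
print(f"{peak / 1e9:.2f} GB per rank")
```

Feeding a single request's per-layer size into the simulator instead would understate the pool requirement by the layer count and by the number of concurrent requests.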

2. Weights Memory [per-rank]

def weights_memory_per_rank(config, tp, ep, pp):
    pp_layers = config['n_layers'] // pp

    # Attention (per layer, TP-sharded)
    attn_params_per_layer = (
        config['dim'] * config['q_lora_rank'] +                              # wq_a
        config['q_lora_rank'] * config['n_heads'] * config['head_dim'] // tp + # wq_b
        config['dim'] * config['head_dim'] +                                  # wkv
        config['n_heads'] * config['head_dim'] // tp * config['o_lora_rank'] + # wo_a
        config['o_groups'] * config['o_lora_rank'] // tp * config['dim']       # wo_b
    )

    # MoE experts (EP-sharded)
    n_local_experts = config['n_routed_experts'] // ep
    expert_params = 3 * config['dim'] * config['moe_inter_dim']  # w1 + w2 + w3

    # Shared expert (TP-sharded, not EP)
    shared_expert_params = 3 * config['dim'] * config['moe_inter_dim'] // tp

    # Bytes by precision
    attn_bytes = attn_params_per_layer * bytes_per_elem(config['attn_dtype'])
    expert_bytes = n_local_experts * expert_params * bytes_per_elem(config['expert_dtype'])
    shared_bytes = shared_expert_params * bytes_per_elem(config['shared_expert_dtype'])

    # Scale overhead
    attn_scale = scale_overhead(attn_params_per_layer, config['attn_dtype'])
    expert_scale = scale_overhead(n_local_experts * expert_params, config['expert_dtype'])

    # Small components (gate, norms, embeddings)
    gate_bytes = config['n_routed_experts'] * config['dim'] * 4       # FP32
    norm_bytes = 4 * config['dim'] * 4                                # 4 norms, FP32
    embed_bytes = config['vocab_size'] * config['dim'] // tp * 2      # BF16

    # HC (Hyper-Connection) parameters
    hc = config['hc_mult']  # 4
    hc_per_layer = (2 + hc) * hc * config['dim'] * hc * 4 * 2        # attn + ffn

    per_layer = (attn_bytes + expert_bytes + shared_bytes
                 + attn_scale + expert_scale + gate_bytes + norm_bytes + hc_per_layer)
    total = pp_layers * per_layer + embed_bytes
    return total

def bytes_per_elem(dtype):
    return {'fp4': 0.5, 'fp8': 1, 'bf16': 2, 'fp32': 4}[dtype]

def scale_overhead(n_params, dtype):
    if dtype == 'fp4':
        return n_params // 32 * 1       # per-32 E8M0 scale
    elif dtype == 'fp8':
        return n_params // (128*128) * 2  # per-128x128 block, 2D
    return 0
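
As a quick sanity check on the rules in `scale_overhead`, the sketch below expresses the scale bytes as a fraction of the quantized weight bytes. It restates the same per-32 and per-128x128 rules in a self-contained helper; the name `scale_overhead_fraction` and the tensor size are illustrative assumptions.

```python
def scale_overhead_fraction(dtype, n_params=1 << 20):
    """Scale bytes divided by weight bytes, using the same rules as scale_overhead."""
    bytes_per = {'fp4': 0.5, 'fp8': 1.0}[dtype]
    if dtype == 'fp4':
        scale_bytes = n_params // 32 * 1             # 1 byte per 32 elements
    else:
        scale_bytes = n_params // (128 * 128) * 2    # 2 bytes per 128x128 block
    return scale_bytes / (n_params * bytes_per)

fp4_fraction = scale_overhead_fraction('fp4')
fp8_fraction = scale_overhead_fraction('fp8')
print(f"fp4: {fp4_fraction:.4%}, fp8: {fp8_fraction:.4%}")
```

Under these rules the FP4 scales add a few percent on top of the expert weights, while the FP8 block scales are close to negligible.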

3. KV Cache Memory [per-rank, per-request]

Heterogeneous CSA/HCA KV cache (see CSA/HCA Attention for details):

def kv_cache_per_request(config, seq_len, tp=1, cp=1):
    # Note: tp is accepted but currently unused in this function.
    head_dim = config['head_dim']                   # 512
    rope_dim = config['rope_head_dim']              # 64
    nope_dim = head_dim - rope_dim                  # 448
    bytes_per_entry = nope_dim * 1 + rope_dim * 2   # FP8 + BF16 = 576
    win = config['window_size']                     # 128

    total_entries = 0
    for ratio in config['compress_ratios']:
        if ratio == 0:
            total_entries += win                     # SWA layer: window only
        else:
            total_entries += win + seq_len // ratio  # CSA (ratio 4) or HCA

    kv_bytes = total_entries * bytes_per_entry // cp

    # Indexer (FP4, CSA layers only)
    n_csa = sum(1 for r in config['compress_ratios'] if r == 4)
    indexer_bytes = n_csa * (seq_len // 4) * config['index_head_dim'] * 0.5

    return kv_bytes + indexer_bytes

V4-Pro example at 1M tokens:

Component	Calculation	Size
CSA (29 layers)	29 x 250,128 x 576	~4.18 GB
HCA (31 layers)	31 x 7,940 x 576	~0.14 GB
SWA (1 layer)	1 x 128 x 576	~0.00007 GB
Indexer (29 layers)	29 x 250,000 x 128 x 0.5	~0.46 GB
Total per-request		~4.78 GB
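
The rows of this table can be recomputed directly from the formula, which is a useful regression check for a simulator. The sketch below hard-codes the values shown in this section (window 128, 576 bytes per entry, index head dim 128, 1M taken as 1,000,000 tokens). The HCA compression ratio of 128 is an inference from the 7,940 entries in the table (128 + 1,000,000 // 128), not a value stated in the text.

```python
def kv_breakdown(seq_len=1_000_000):
    """Per-request KV bytes by layer type, following the table's calculations."""
    window = 128
    bytes_per_entry = 448 * 1 + 64 * 2   # FP8 nope + BF16 rope = 576
    # name -> (number of layers, compression ratio); ratio 0 means window only
    layers = {'CSA': (29, 4), 'HCA': (31, 128), 'SWA': (1, 0)}
    sizes = {}
    for name, (n_layers, ratio) in layers.items():
        entries = window + (seq_len // ratio if ratio else 0)
        sizes[name] = n_layers * entries * bytes_per_entry
    # FP4 indexer: 0.5 byte per element, CSA layers only
    sizes['Indexer'] = 29 * (seq_len // 4) * 128 // 2
    return sizes

sizes = kv_breakdown()
total = sum(sizes.values())
for name, size in sizes.items():
    print(f"{name:8s} {size / 1e9:.5f} GB")
print(f"{'Total':8s} {total / 1e9:.5f} GB")
```

The sizes are printed in decimal GB (1e9 bytes), which is the unit the table appears to use.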

4. Activation / Workspace Memory [per-rank]

The activation peak occurs in the largest prefill chunk. With an HC multiplier of 4, the residual stream takes 4x its usual footprint.

Decode (bs=B, seq=1):

hc_residual       = B x hc(4) x dim x 4 bytes     (FP32)
q_buffer          = B x (n_heads/tp) x head_dim x 2  (BF16)
moe_workspace     = B x dim x 4                    (FP32 accumulator)
shared_expert_buf = B x inter_dim x 2 x 4          (gate+up, FP32)

peak = hc_residual + max(q_buffer, moe_workspace + shared_expert_buf)

Prefill (chunk_size=C):

activation_peak = B x C x dim x hc_mult x 4       (HC residual, FP32)
                + B x C x (n_heads/tp) x head_dim x 2  (Q buffer)
                + B x C x inter_dim x 2 x 4        (MoE gate+up, FP32)
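
The decode and prefill expressions above translate into code as follows. This is a minimal sketch: the function names and the model dimensions in the usage lines (dim 4096, 64 heads, head_dim 128, inter_dim 2048, TP 8) are hypothetical placeholders, not values from any specific model.

```python
def decode_peak(B, dim, n_heads, head_dim, inter_dim, tp, hc=4):
    """Decode activation peak in bytes (seq=1)."""
    hc_residual = B * hc * dim * 4                      # FP32
    q_buffer = B * (n_heads // tp) * head_dim * 2       # BF16
    moe_workspace = B * dim * 4                         # FP32 accumulator
    shared_expert_buf = B * inter_dim * 2 * 4           # gate+up, FP32
    return hc_residual + max(q_buffer, moe_workspace + shared_expert_buf)

def prefill_peak(B, C, dim, n_heads, head_dim, inter_dim, tp, hc=4):
    """Prefill activation peak in bytes for chunk size C."""
    return (B * C * dim * hc * 4                        # HC residual, FP32
            + B * C * (n_heads // tp) * head_dim * 2    # Q buffer
            + B * C * inter_dim * 2 * 4)                # MoE gate+up, FP32

dims = dict(dim=4096, n_heads=64, head_dim=128, inter_dim=2048, tp=8)
decode_bytes = decode_peak(B=32, **dims)
prefill_bytes = prefill_peak(B=1, C=8192, **dims)
print(decode_bytes, prefill_bytes)
```

With these placeholder dimensions a single 8192-token prefill chunk needs over two hundred times the activation memory of a 32-request decode step, which is why the peak is sized from prefill.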

5. Modeling the MoE Workspace

See the execution flow in MoE Inference for details.

EP all-to-all dispatch buffer [per-rank]:
  = max_tokens_per_rank x dim x bytes x 2 (double-buffer)
  where max_tokens_per_rank = batch x seq x n_activated_experts / EP_size
  (capped by DeepEP ElasticBuffer config)

Grouped GEMM workspace [per-rank]:
  = max_tokens_per_expert x inter_dim x 4 (FP32 accumulator)
  where max_tokens_per_expert = batch x seq x load_factor / n_local_experts
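
A minimal sketch of both buffer formulas follows. The `cap_tokens` parameter is a hypothetical stand-in for whatever limit the DeepEP ElasticBuffer configuration imposes; it is not a real DeepEP option name. Rounding token counts up with `ceil` is also an assumption.

```python
import math

def ep_dispatch_buffer(batch, seq, n_activated, ep_size, dim,
                       bytes_per_elem, cap_tokens=None):
    """EP all-to-all dispatch buffer per rank, in bytes (double-buffered)."""
    max_tokens = math.ceil(batch * seq * n_activated / ep_size)
    if cap_tokens is not None:           # hypothetical buffer cap
        max_tokens = min(max_tokens, cap_tokens)
    return max_tokens * dim * bytes_per_elem * 2

def grouped_gemm_workspace(batch, seq, load_factor, n_local_experts, inter_dim):
    """Grouped GEMM workspace per rank, in bytes (FP32 accumulator)."""
    max_tokens_per_expert = math.ceil(batch * seq * load_factor / n_local_experts)
    return max_tokens_per_expert * inter_dim * 4

uncapped = ep_dispatch_buffer(64, 1, 8, 8, 4096, 1)
capped = ep_dispatch_buffer(64, 1, 8, 8, 4096, 1, cap_tokens=32)
gemm = grouped_gemm_workspace(64, 1, 1.5, 32, 2048)
print(uncapped, capped, gemm)
```

The `load_factor` argument is where a measured imbalance figure would enter, so it is better supplied from profiling than fixed in code.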

6. FP4 Capability Matrix

See FP4/FP8 Quantization for details.

CAPABILITY_MATRIX = {
    # (checkpoint_format, hardware) -> (runtime_format, mem_multiplier, speed_vs_fp8)
    ('fp4', 'B200'):              ('fp4_native_mma',      1.0, 2.0),  # theoretical
    ('fp4', 'B200_current'):      ('fp4_cast_fp8',        1.0, 1.0),  # V4 in practice
    ('fp4', 'H100'):              ('fp4_to_fp8_preexpand', 2.0, 1.0),
    ('fp4', 'H200'):              ('fp4_to_fp8_preexpand', 2.0, 1.0),
    ('fp8', 'B200'):              ('fp8_native',          1.0, 1.0),
    ('fp8', 'H100'):              ('fp8_native',          1.0, 1.0),
    ('fp8', 'H200'):              ('fp8_native',          1.0, 1.0),
}

Running an FP4 checkpoint on H100/H200 doubles expert memory. The simulator must include this factor.
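
One way to apply the multiplier is to look it up when sizing expert weights. The sketch below uses a two-entry excerpt of the matrix; the helper name `runtime_expert_bytes` and the 40 GB checkpoint size are hypothetical.

```python
# Excerpt of the capability matrix: (checkpoint_format, hardware) ->
# (runtime_format, mem_multiplier, speed_vs_fp8)
CAPS = {
    ('fp4', 'B200_current'): ('fp4_cast_fp8', 1.0, 1.0),
    ('fp4', 'H100'): ('fp4_to_fp8_preexpand', 2.0, 1.0),
}

def runtime_expert_bytes(checkpoint_bytes, checkpoint_format, hardware, caps=CAPS):
    """Return (runtime_format, resident expert bytes) after the memory multiplier."""
    runtime_format, mem_multiplier, _speed = caps[(checkpoint_format, hardware)]
    return runtime_format, checkpoint_bytes * mem_multiplier

fmt_h100, bytes_h100 = runtime_expert_bytes(40e9, 'fp4', 'H100')
fmt_b200, bytes_b200 = runtime_expert_bytes(40e9, 'fp4', 'B200_current')
print(fmt_h100, bytes_h100 / 1e9, fmt_b200, bytes_b200 / 1e9)
```

Raising a `KeyError` on an unknown pair is deliberate here: silently defaulting to a multiplier of 1.0 would hide exactly the doubling this section warns about.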

7. Modeling MTP Overhead

The MTP block reuses the parameters of the main model's last layer, so the overhead should not simply be multiplied by (nextn+1).

def mtp_overhead(config, batch_size, seq_len_or_1):
    dim = config['dim']
    vocab = config['vocab_size']     # 129,280
    nextn = config['num_nextn_predict_layers']  # 1

    # Extra params (small): enorm + hnorm + eh_proj
    mtp_weight_bytes = (2 * dim + 2 * dim * dim) * 1  # FP8

    # Extra activation
    mtp_activation = batch_size * seq_len_or_1 * 2 * dim * 2  # BF16

    # Logits buffer (persists until verify complete)
    logits_buffer = batch_size * seq_len_or_1 * vocab * 4 * nextn  # FP32

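    # Extra KV for speculative tokens (tiny), counted for every request in the
    # batch; kv_per_token is assumed to be defined elsewhere and to return the
    # KV bytes per token per layer
    extra_kv = batch_size * nextn * kv_per_token(config) * config['n_layers']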

    return mtp_weight_bytes + mtp_activation + logits_buffer + extra_kv

Key point: in training the MTP overhead is (nextn+1) x total_activation, but in inference MTP runs sequentially with the main model, so the activation space can be reused.
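
To see which term dominates, the sketch below evaluates the weight, activation, and logits terms of the formula for a decode step. The vocabulary size (129,280) and nextn (1) come from the comments above; `dim=4096` and `batch_size=256` are hypothetical, and the extra-KV term is left out because it depends on a helper not defined in this section.

```python
def mtp_terms(dim, vocab, nextn, batch_size, tokens):
    """MTP overhead terms in bytes, excluding the extra-KV term."""
    return {
        'weights': (2 * dim + 2 * dim * dim) * 1,              # FP8
        'activation': batch_size * tokens * 2 * dim * 2,       # BF16
        'logits': batch_size * tokens * vocab * 4 * nextn,     # FP32
    }

terms = mtp_terms(dim=4096, vocab=129_280, nextn=1, batch_size=256, tokens=1)
largest = max(terms, key=terms.get)
for name, size in terms.items():
    print(f"{name:10s} {size / 1e6:8.1f} MB")
print("largest term:", largest)
```

Under these assumptions the FP32 logits buffer is the largest term, so it is the part of the MTP model most worth getting right.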

8. Per-rank vs Global/Cluster

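Quantity	Per-rank	Global/Cluster Total
Weights	Bytes actually stored on the GPU	Sum(all ranks) = total model parameters x bytes
KV cache	KV pool capacity of one GPU	Sum(all ranks) = per-rank x TP per replica
Activation peak	Peak of the max_chunk prefill on one GPU	Not meaningful (peaks occur simultaneously and do not accumulate)
Throughput	tokens/s of one GPU	Sum over all DP replicas
Max batch	Limited by a single GPU	= per-rank max_batch x DP_size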

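Common errors:

Using "1.6T params" to size single-GPU memory (divide by EP x TP x PP first)
Using the global KV size to estimate a single GPU (divide by TP first)
Summing activations across ranks (not meaningful)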
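
The per-rank and global views of max batch can be kept apart with two small functions. This is a sketch under a simplifying assumption that is not stated in the text: the per-rank max batch is whatever fits in the HBM left over after weights, activation peak, and workspace. All byte counts in the usage lines are hypothetical.

```python
def max_batch_per_rank(hbm_bytes, weights_bytes, activation_peak,
                       workspace_bytes, kv_per_request_per_rank):
    """Requests that fit in the KV pool of one rank (assumed: leftover HBM)."""
    free = hbm_bytes - weights_bytes - activation_peak - workspace_bytes
    return max(0, free // kv_per_request_per_rank)

def global_max_batch(per_rank_max_batch, dp_size):
    """Cluster-level max batch = per-rank max_batch x DP_size."""
    return per_rank_max_batch * dp_size

GB = 10**9
per_rank = max_batch_per_rank(80 * GB, 40 * GB, 4 * GB, 2 * GB, 2 * GB)
cluster = global_max_batch(per_rank, dp_size=4)
print(per_rank, cluster)
```

Every input to `max_batch_per_rank` is already a per-rank quantity; passing a global weight or KV figure here reproduces the errors listed above.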

9. Configuration Options vs Calibration Table

Should be configuration options (they vary by deployment):

tp, ep, pp, dp, cp
batch_size, max_seq_len, chunk_size
kv_quant_bits, expert_dtype, attn_dtype
block_size (PagedAttention)
prefix_cache_hit_ratio
Hardware: hbm_capacity, hbm_bandwidth, flops_fp8, flops_fp4, nvlink_bw, pcie_bw

Should be a calibration table (requires profiling):

kernel_efficiency: achieved vs theoretical peak (50-80%)
all_to_all_latency(msg_size, ep_size): nonlinear, must be measured
grouped_gemm_efficiency(n_experts, tokens_per_expert): drops under load imbalance
sparse_attn_overhead: extra cost ratio of the CSA indexer + sparse gather
prefix_cache_hit_rate(workload): workload-dependent
ep_load_balance_factor: actual vs ideal uniform distribution
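
A calibration entry such as all_to_all_latency can be stored as measured points and interpolated at query time. The sketch below uses piecewise-linear interpolation with clamping at the table edges, which is one possible choice rather than a prescribed method. The sample points are made up for illustration and are not measurements.

```python
from bisect import bisect_left

def interpolate(table, x):
    """Piecewise-linear lookup in a sorted list of (x, y) points, clamped at the ends."""
    xs = [point[0] for point in table]
    ys = [point[1] for point in table]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)               # xs[i-1] < x <= xs[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Hypothetical (msg_size_bytes, latency_us) points for a single ep_size.
a2a_latency_us = [(1024, 10.0), (4096, 20.0), (16384, 80.0)]
mid_latency = interpolate(a2a_latency_us, 2560)
print(mid_latency)
```

A real table would hold one such curve per ep_size, since the text lists ep_size as a second argument.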

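10. Formulas That Must Not Be Hard-Coded

Arithmetic intensity crossover: compute_bound if FLOPs/Byte > hardware_oi; the OI threshold varies by GPU model
Expert load imbalance: top-k routing is uniform in theory but skewed in practice, so grouped GEMM efficiency needs profiling
Overlap efficiency: the actual communication-compute overlap ratio of DeepEP depends on the kernel launch pattern
KV cache fragmentation: PagedAttention utilization depends on the seq_len distribution
CSA indexer FLOPs: seq/4 x 128 x 64 in theory; in practice there is additional Hadamard and FP4 quantization overhead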
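
For the first item, the crossover can be derived from the hardware configuration instead of being fixed in code. In this minimal sketch `hardware_oi` is taken as peak FLOP/s divided by HBM bandwidth, and the hardware numbers are hypothetical round figures rather than specifications of any GPU.

```python
def is_compute_bound(flops, bytes_moved, peak_flops, hbm_bandwidth):
    """True if the kernel's FLOPs/Byte exceeds the hardware crossover point."""
    hardware_oi = peak_flops / hbm_bandwidth     # FLOPs per byte at the crossover
    return flops / bytes_moved > hardware_oi

# Hypothetical device: 1e15 FLOP/s peak, 2e12 B/s HBM bandwidth -> OI of 500.
hw = dict(peak_flops=1e15, hbm_bandwidth=2e12)
decode_like = is_compute_bound(flops=2e12, bytes_moved=1e10, **hw)    # 200 FLOPs/Byte
prefill_like = is_compute_bound(flops=1e13, bytes_moved=1e10, **hw)   # 1000 FLOPs/Byte
print(decode_like, prefill_like)
```

Because both hardware inputs come from the configuration, swapping the GPU model moves the crossover without touching the formula.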

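11. Links to Other Topics

For how MoE expert weights are sharded, see MoE Inference
For the concrete impact of FP4/FP8 precision on memory, see FP4/FP8 Quantization
For the per-layer CSA/HCA KV cache calculation, see CSA/HCA Attention
For the constraints that framework choice places on configuration parameters, see Inference Framework Comparison 2026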
