MoE Inference: Expert Parallelism, Memory, and Scheduling

July 10, 2026 · about 4 min read

The Router, Dispatch, Grouped GEMM, Combine, and Shared Expert paths in MoE inference

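1. What an Expert Is

Each expert is an independent SwiGLU FFN with three weight matrices:

w1: [dim, inter_dim] - gate projection
w3: [dim, inter_dim] - up projection
w2: [inter_dim, dim] - down projection

In a dense FFN, all tokens share the same set of parameters. In MoE, the router selects top-k experts for each token; the token passes through only those k FFNs, and their outputs are combined as a weighted sum.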

DeepSeek-V4-Pro parameters:

Parameter	Value
Routed experts	384
Activated (top-k)	6
Shared experts	1 (always active)
`moe_inter_dim`	3072
`dim`	7168
Size of each routed expert	3 x 3072 x 7168 x bytes_per_param
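The last table row can be worked out numerically. The following is a minimal sketch, assuming 0.5 bytes per parameter for FP4 and ignoring quantization scales; the function names are illustrative and not taken from any reference code.

```python
DIM = 7168            # model hidden size (`dim`)
MOE_INTER_DIM = 3072  # expert FFN hidden size (`moe_inter_dim`)
N_ROUTED = 384        # routed experts per MoE layer


def expert_params(dim: int = DIM, inter_dim: int = MOE_INTER_DIM) -> int:
    """Parameter count of one SwiGLU expert: w1 + w3 + w2."""
    return 3 * dim * inter_dim


def expert_bytes(bytes_per_param: float) -> float:
    """Weight bytes of one routed expert (quantization scales ignored)."""
    return expert_params() * bytes_per_param


if __name__ == "__main__":
    print("params per expert:", expert_params())
    print("FP4 MiB per expert:", expert_bytes(0.5) / 2**20)
    print("FP4 GB for all routed experts in one layer:",
          N_ROUTED * expert_bytes(0.5) / 1e9)
```

Under these assumptions one expert holds about 66M parameters (31.5 MiB in FP4), and the 384 routed experts of a single layer take roughly 12.7 GB.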

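Key distinction: "49B activated" only means that the compute per token is at the 49B level; memory must still hold all 1.6T parameters. Any token may be routed to any expert, and the larger the batch, the more evenly the experts are accessed.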

2. Full Execution Flow: Dispatch - Compute - Combine

Input x [bs, seq, dim]
    |
    v
+-- Gate/Router --+
|  scores = sqrt_softplus(x @ W_gate)  [bs*seq, n_experts]
|  top-k indices + weights (k=6)
|  Hash routing (first 3 layers): look up the tid2eid table directly
+-----------------+
    |
    v
+-- Dispatch (all-to-all) --+
|  EP mode: tokens are sent to the rank that owns their expert
|  TP mode: experts are local; tokens are grouped by expert ID
+----------------------------+
    |
    v
+-- Grouped GEMM --+
|  Each expert processes its assigned subset of tokens
|  SwiGLU: silu(w1(x)) * w3(x) -> w2(...)
|  FP4 weight x FP8 activation (see the quantization page)
+-------------------+
    |
    v
+-- Combine (all-to-all) --+
|  Results are aggregated back to the original token positions
|  y[idx] += expert_output * weight
+---------------------------+
    |
    v
y += shared_expert(x)   <-- applied to every token

The router uses a sqrt_softplus scoring function (not the standard softmax). The first 3 layers use hash routing, which looks up the tid2eid table directly and bypasses learned gating.
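The flow above can be sketched on a single process in plain Python. This is a toy illustration, not the reference implementation. It assumes sqrt_softplus means sqrt(softplus(z)) and that the top-k scores are normalized to sum to 1. It replaces the all-to-all steps with local grouping and omits hash routing. All function names are hypothetical.

```python
import math
from collections import defaultdict


def sqrt_softplus(z: float) -> float:
    # Assumed definition: sqrt(log(1 + e^z)), written in a numerically stable form.
    return math.sqrt(max(z, 0.0) + math.log1p(math.exp(-abs(z))))


def route(x, w_gate, k):
    """Return [(expert_id, weight), ...] for one token.

    w_gate holds one gate vector per expert; weights are normalized
    over the selected top-k experts (an assumption of this sketch).
    """
    scores = [sqrt_softplus(sum(a * b for a, b in zip(x, col))) for col in w_gate]
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    total = sum(scores[e] for e in top)
    return [(e, scores[e] / total) for e in top]


def moe_forward(tokens, w_gate, experts, shared_expert, k):
    # Router: choose top-k experts per token.
    routes = [route(x, w_gate, k) for x in tokens]

    # Dispatch: group token indices by expert (an all-to-all under real EP).
    groups = defaultdict(list)
    for idx, picks in enumerate(routes):
        for expert_id, weight in picks:
            groups[expert_id].append((idx, weight))

    # Grouped compute + combine: each expert runs once on its token batch,
    # and outputs are accumulated at the original token positions.
    y = [[0.0] * len(x) for x in tokens]
    for expert_id, members in groups.items():
        batch = [tokens[idx] for idx, _ in members]
        outputs = experts[expert_id](batch)
        for (idx, weight), out in zip(members, outputs):
            y[idx] = [a + weight * b for a, b in zip(y[idx], out)]

    # Shared expert: applied to every token.
    for idx, x in enumerate(tokens):
        y[idx] = [a + b for a, b in zip(y[idx], shared_expert(x))]
    return y
```

Each entry of `experts` takes a list of token vectors and returns a list of output vectors, which mirrors how a grouped GEMM processes one expert's tokens as a single batch.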

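3. Parallelism Strategies

Parallelism	What is split	Communication cost	When to use
TP (Tensor Parallel)	Each expert's weights, by row/column	all-reduce per layer	Few experts, within a single node
EP (Expert Parallel)	Each rank holds `n_experts/EP_size` complete experts	all-to-all dispatch+combine	Many experts (e.g. 384), across nodes
DP (Data Parallel)	Full model replica per rank; the batch is split	Gradient sync during training only	At inference = multiple instances
PP (Pipeline Parallel)	Layers	send/recv of activations	Very large models + high throughput
CP (Context Parallel)	KV cache along seq_len	all-gather KV	Very long contexts

The DeepSeek-V4-Pro reference code uses TP (experts divided evenly across world_size):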

self.n_local_experts = args.n_routed_experts // world_size  # 384 / TP

Production deployments use EP (DeepEP), which scales up to EP2048.

Simulator modeling formula:

weights_memory [per-rank] = (n_local_experts x expert_size + shared_expert_size + attn_size) x bytes_per_param

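Here n_local_experts = n_routed_experts / EP_size, and expert_size, shared_expert_size, and attn_size are parameter counts. The shared expert and the attention layers are not sharded by EP.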
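The formula translates directly into a small helper. This is a sketch under the formula's own simplification of one uniform bytes_per_param for every tensor; the function name and the numbers in the usage note are illustrative only.

```python
def weights_memory_per_rank(
    n_routed_experts: int,
    ep_size: int,
    expert_params: int,
    shared_expert_params: int,
    attn_params: int,
    bytes_per_param: float,
) -> float:
    """Per-rank weight bytes following the simulator formula above.

    Only routed experts are divided by EP; the shared expert and the
    attention weights are counted in full on every rank.
    """
    if n_routed_experts % ep_size != 0:
        raise ValueError("n_routed_experts must be divisible by ep_size")
    n_local_experts = n_routed_experts // ep_size
    params = n_local_experts * expert_params + shared_expert_params + attn_params
    return params * bytes_per_param
```

For example, `weights_memory_per_rank(384, 64, 66_060_288, 66_060_288, 0, 0.5)` counts 6 local routed experts plus one shared expert of the same (assumed) size, with attention left at zero as a placeholder.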

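4. Estimating Per-GPU Memory

Total parameter count alone is not enough for an MoE model. An inference deployment has to keep three measures apart:

Measure	Meaning	Used for
Total parameters	shared/dense parameters + parameters of all experts	Checkpoint size and overall model capacity
Active parameters	shared/dense parameters + top-k expert parameters that each token actually passes through	Per-token compute
Single-expert parameters	Weight size of one expert	Deciding whether plain EP lets an expert stay resident on one GPU

400B active means that 400B parameters take part in computing one token; it does not mean 400GB of memory. Weight memory also depends on precision: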

400B BF16 ≈ 800GB
400B FP8  ≈ 400GB
400B INT4 ≈ 200GB
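The conversion is a single multiplication. A minimal sketch, assuming 1 GB = 1e9 bytes and ignoring quantization scales and other metadata; the names are illustrative.

```python
# Bytes per parameter for each weight precision.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight size in GB for a parameter count given in billions."""
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"400B {precision}: {weight_gb(400, precision):.0f} GB")
```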

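The per-GPU deployment constraint is:

Per-GPU memory ≈
  dense / attention shard held by this GPU
+ expert shard or expert replicas held by this GPU
+ KV Cache
+ runtime / workspace / communication buffers

EP addresses "too many experts" by spreading different experts across GPUs. TP addresses a single matrix, a dense layer, or a single expert being too large. PP reduces the number of layers each stage holds. DP replicates model instances, which raises throughput but does not reduce per-instance memory.

If a single expert alone exceeds the usable HBM of one GPU, plain EP is still not enough and TP inside the expert is required:

One expert's FFN weights are split across multiple GPUs for computation

This stacks TP communication on top of MoE's existing token dispatch/combine all-to-all. Deployable does not mean cheap to deploy: the larger a single expert is, the more the system is pushed toward complex combinations of parallelism.

Using the 180–192GB HBM commonly quoted for B200 and counting weights only:

Single-expert parameters	BF16	FP8	INT4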
40B	80GB	40GB	20GB
80B	160GB	80GB	40GB
150B	300GB	150GB	75GB

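Headroom is also needed for the KV Cache, the dense shard, communication buffers, CUDA graphs, workspace, quantization scales, and memory fragmentation. So even though an 80B expert's BF16 weights (160GB) nominally fit, a production service would be very tight.

A practical order of checks:

Confirm that active parameters are a parameter count, not memory.
Convert parameter counts to bytes using the weight precision.
Separate out shared/dense, top-k, and single-expert parameters.
Decide whether dense/attention needs TP.
Decide whether a single expert can stay resident on one GPU after reserving KV Cache and buffers.
Use EP to spread experts, TP to split an oversized dense layer or expert, and PP to reduce the number of layers per stage.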
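The checklist can be turned into a rough decision helper. The sketch below is deliberately simplified: it does not model PP, it treats KV Cache and runtime buffers as fixed reservations, and every name and number in it is hypothetical rather than taken from a real deployment.

```python
from dataclasses import dataclass

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


@dataclass
class MoEModel:
    dense_params_b: float   # shared/dense + attention parameters, in billions
    expert_params_b: float  # parameters of ONE expert, in billions
    n_experts: int


def plan(model: MoEModel, precision: str, hbm_gb: float,
         kv_cache_gb: float, runtime_gb: float) -> list:
    """Walk the checklist and return the suggested parallelism steps."""
    bpp = BYTES_PER_PARAM[precision]              # parameters -> bytes
    dense_gb = model.dense_params_b * bpp         # dense/attention share
    expert_gb = model.expert_params_b * bpp       # one expert
    budget_gb = hbm_gb - kv_cache_gb - runtime_gb # what is left for weights

    steps = []
    if dense_gb > budget_gb:
        steps.append("TP for dense/attention")
    if expert_gb > budget_gb:
        steps.append("TP inside each expert")
    if dense_gb + model.n_experts * expert_gb > budget_gb:
        steps.append("EP to spread experts across GPUs")
    return steps
```

For instance, a hypothetical model with 20B dense parameters and sixteen 80B experts on a 180GB GPU with 40GB reserved needs both expert-internal TP and EP in BF16, but only EP in INT4.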

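5. Resident Experts vs. Offload

Strategy	Description	Cost	When to use
GPU-resident	All local expert weights live in HBM	High memory use	Production inference (latency-sensitive)
Layer-wise loading	Load the current layer's experts from CPU/NVMe for each layer	Bottlenecked by PCIe 4.0 bandwidth (~32 GB/s)	Offline / low throughput
Expert Cache (LRU)	Cache hot experts; load cold experts on demand	Miss penalty 10-100ms	Resource-constrained single GPU
CPU/NVMe Offload	FlexGen-style swap in/out	3-10x higher latency	Research / low-cost deployment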
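The Expert Cache (LRU) row can be illustrated with a small cache keyed by (layer, expert_id). This is a sketch only: `load_fn` is a hypothetical loader standing in for the CPU/NVMe-to-GPU copy, and `transfer_seconds` gives a bandwidth-only lower bound that ignores latency and software overhead.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache of expert weights keyed by (layer, expert_id)."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # hypothetical loader: key -> weights
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        weights = self.load_fn(key)          # slow path: copy into GPU memory
        self._cache[key] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used
        return weights


def transfer_seconds(num_bytes: float, bandwidth_gb_s: float = 32.0) -> float:
    """Lower bound on copy time at the given link bandwidth."""
    return num_bytes / (bandwidth_gb_s * 1e9)
```

With the table's ~32 GB/s figure, `transfer_seconds` shows why layer-wise loading is bandwidth-bound: every gigabyte of expert weights costs at least about 31 ms before any compute can start.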

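Conclusion: in production MoE inference, expert weights must stay resident in GPU memory (each rank's share after EP sharding). Offload is only for severely resource-constrained scenarios.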

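6. DeepEP in Practice

DeepEP is an all-to-all dispatch/combine library dedicated to EP:

FP8 low-precision communication support
V2: unifies the high-throughput and low-latency APIs into ElasticBuffer
Supports EP2048; SM usage drops from 24 to 4-6
0-SM PP/CP/Engram (RDMA)
NCCL Gin backend (lightweight)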

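Implications for the simulator:

EP all-to-all traffic [per-rank] = bs x seq x dim x 2 x bytes / EP_size (one dispatch plus one combine)
Communication can overlap with compute (a DeepEP V2 design goal)
Overlap condition: compute_per_byte >= 2 x d_moe = 6144 FLOPs/Byte (with d_moe = moe_inter_dim = 3072)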
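The two simulator formulas can be written as helpers. This sketch takes the formulas exactly as stated and assumes that compute_per_byte means expert FLOPs divided by bytes communicated; the function names are illustrative and are not part of DeepEP.

```python
def a2a_bytes_per_rank(bs: int, seq: int, dim: int,
                       bytes_per_elem: float, ep_size: int) -> float:
    """Per-rank all-to-all traffic; the factor 2 covers dispatch + combine."""
    return bs * seq * dim * 2 * bytes_per_elem / ep_size


def overlap_threshold(d_moe: int) -> int:
    """Arithmetic intensity (FLOPs/Byte) required by the overlap condition."""
    return 2 * d_moe


def can_overlap(compute_flops: float, comm_bytes: float, d_moe: int) -> bool:
    """True when compute per communicated byte meets the threshold."""
    return compute_flops / comm_bytes >= overlap_threshold(d_moe)


if __name__ == "__main__":
    # Hypothetical workload: bs=8, seq=1024, FP8 activations, EP64.
    print(a2a_bytes_per_rank(8, 1024, 7168, 1, 64))
    print(overlap_threshold(3072))
```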

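7. Related Topics

For the end-to-end relationship among tokens, hidden states, Attention, MoE, and the LM Head, see Token Flow and Hidden State
For the FP4 storage format of expert weights, see FP4/FP8 Quantization
For memory modeling of the MoE workspace, see the Simulator Modeling Guide
For a comparison of MoE support across frameworks, see Inference Framework Comparison 2026
For how the KV cache interacts with MoE layers, see CSA/HCA Attention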
