Step-3: Large yet Affordable: Cost-Effective Decoding via Model-System Co-Design
Published
Submitted by Elie Bakouch

Authors: StepFun, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen,
Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu,
Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang
Abstract
Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM optimized through hardware-aware model-system co-design to minimize decoding cost. Step-3 innovates along two key dimensions: (1) a novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and feed-forward network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost effectiveness: compared with models such as DeepSeek-V3 and Qwen3 MoE 235B, Step-3 significantly lowers the theoretical decoding cost, and its advantage widens as context grows longer. Step-3 achieves this low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are all critical to cost effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per GPU per second under a 50ms TPOT SLA (4K context, FP8, no MTP). This is higher than DeepSeek-V3's 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.
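The abstract does not spell out MFA's exact factorization (see the linked reports for the real formulation). As a rough illustration of the general idea behind factorized attention reducing KV-cache size, here is a generic low-rank KV sketch in NumPy: cache one small shared latent per token instead of full per-head K and V, and expand it back at decode time. All dimensions below are made-up illustrative values, not Step-3's.

```python
import numpy as np

# Generic low-rank factorized-attention sketch (NOT Step-3's exact MFA).
# Idea: cache a small shared latent per token instead of per-head K/V,
# then expand the latent into K and V when attention is computed.

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128  # illustrative sizes
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # shared compressor
W_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

seq_len = 512
h = rng.standard_normal((seq_len, d_model))  # hidden states for cached tokens

latent_cache = h @ W_down        # what actually lives in the KV cache
k = latent_cache @ W_k           # expanded on the fly at decode time
v = latent_cache @ W_v

# Cache footprint comparison (in floats):
full_cache_floats = seq_len * 2 * n_heads * d_head   # standard per-head K + V
latent_cache_floats = latent_cache.size              # shared latent only
# Here the latent cache is 16x smaller (512*128 vs 512*2*16*64 floats).
```

The 16x figure is specific to these toy dimensions; the point is only that the cached object shrinks from two full per-head tensors to one low-rank latent, at the price of re-expanding K and V during the attention computation.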
The Step-3 system report and model report are available at: https://stepfun.ai/research/en/step3
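As a back-of-envelope check on the throughput figures quoted in the abstract, note that under a TPOT (time-per-output-token) SLA, per-GPU token throughput equals the number of concurrently served sequences divided by the TPOT. The sketch below derives the implied per-GPU concurrency and relative per-token cost from the two published numbers; the derivation is ours, not from the report.

```python
# Back-of-envelope arithmetic on the quoted decoding throughput numbers.
# throughput (tokens/GPU/s) = concurrent sequences per GPU / TPOT (s),
# so concurrency = throughput * TPOT.

def implied_concurrency(tokens_per_gpu_per_s: float, tpot_s: float) -> float:
    """Sequences a GPU must serve simultaneously to sustain this throughput."""
    return tokens_per_gpu_per_s * tpot_s

TPOT = 0.050                                  # 50 ms SLA from the abstract
step3 = implied_concurrency(4039, TPOT)       # ~202 sequences per GPU
dsv3 = implied_concurrency(2324, TPOT)        # ~116 sequences per GPU

# At equal GPU pricing, relative cost per decoded token is the inverse
# throughput ratio: a Step-3 token costs ~0.58x a DeepSeek-V3 token here.
cost_ratio = 2324 / 4039
```

This is only consistency arithmetic on the two headline numbers; the actual Pareto-frontier claim in the paper rests on the full theoretical cost model, not this ratio alone.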