MiniCPM4: Ultra-Efficient Large Language Models on End Devices
Submitted by Chaojun Xiao
Authors: MiniCPM Team,
Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan,
Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun,
Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang,
Yudong Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Yukun Yan, Jiarui Yuan, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Ge Zhou, Jie Zhou, Wei Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun


Abstract
This paper introduces MiniCPM4, a highly efficient large language model (LLM)
designed explicitly for end-side devices. We achieve this efficiency through
systematic innovation in four key dimensions: model architecture, training
data, training algorithms, and inference systems. Specifically, in terms of
model architecture, we propose InfLLM v2, a trainable sparse attention
mechanism that accelerates both prefilling and decoding phases for long-context
processing. Regarding training data, we propose UltraClean, an efficient and
accurate pre-training data filtering and generation strategy, and UltraChat v2,
a comprehensive supervised fine-tuning dataset. These datasets enable
satisfactory model performance to be achieved using just 8 trillion training
tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient
pre-training strategy search, and improve existing post-training methods by
introducing chunk-wise rollout for load-balanced reinforcement learning and
BitCPM, a data-efficient ternary LLM. Regarding inference systems, we propose
CPM.cu that integrates sparse attention, model quantization, and speculative
sampling to achieve efficient prefilling and decoding. To meet diverse
on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B
parameters, respectively. Extensive evaluation results show that MiniCPM4
outperforms open-source models of similar size across multiple benchmarks,
highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B
demonstrates significant speed improvements over Qwen3-8B when processing long
sequences. Through further adaptation, MiniCPM4 successfully powers diverse
applications, including trustworthy survey generation and tool use with the
Model Context Protocol, clearly showcasing its broad usability.
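
The abstract's headline architectural change is trainable sparse attention (InfLLM v2), in which each query attends only to a small set of relevant key/value blocks instead of the full context. The page does not include the actual kernel; the sketch below illustrates only the general block-selection idea in PyTorch. The mean-pooled block scoring and the block_size and top_k values are assumptions chosen for readability, not details taken from the paper.

# Minimal block-sparse attention sketch (illustrative, not InfLLM v2 itself).
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    """Each query attends only to its top_k highest-scoring key/value blocks."""
    # q: (n_q, d); k, v: (n_kv, d). Single head, no batch, for readability.
    n_kv, d = k.shape
    n_blocks = (n_kv + block_size - 1) // block_size
    pad = n_blocks * block_size - n_kv
    k_blocks = F.pad(k, (0, 0, 0, pad)).view(n_blocks, block_size, d)
    v_blocks = F.pad(v, (0, 0, 0, pad)).view(n_blocks, block_size, d)

    # Cheap block relevance score: query dotted with the block's mean-pooled key.
    block_repr = k_blocks.mean(dim=1)                 # (n_blocks, d)
    block_scores = q @ block_repr.T                   # (n_q, n_blocks)
    top_idx = block_scores.topk(min(top_k, n_blocks), dim=-1).indices

    out = torch.zeros_like(q)
    for i in range(q.shape[0]):
        sel_k = k_blocks[top_idx[i]].reshape(-1, d)   # (top_k * block_size, d)
        sel_v = v_blocks[top_idx[i]].reshape(-1, d)
        attn = F.softmax(q[i] @ sel_k.T / d ** 0.5, dim=-1)
        out[i] = attn @ sel_v
    return out

# Usage: with an 8K-token context, each query reads only 4 blocks of 64 tokens
# (256 keys) instead of all 8192 keys.
q, k, v = torch.randn(16, 128), torch.randn(8192, 128), torch.randn(8192, 128)
print(block_sparse_attention(q, k, v).shape)          # torch.Size([16, 128])

Selecting whole contiguous blocks, rather than individual tokens, keeps memory access patterns regular, which is why block-level sparsity can translate into real prefilling and decoding speedups on end-side hardware.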