Reinforcement Pre-Training
Published
Submitted by Li Dong
Authors: Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei
Abstract
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling
paradigm for large language models and reinforcement learning (RL).
Specifically, we reframe next-token prediction as a reasoning task trained
using RL, where the model receives verifiable rewards for correctly predicting the
next token for a given context. RPT offers a scalable method to leverage vast
amounts of text data for general-purpose RL, rather than relying on
domain-specific annotated answers. By incentivizing the capability of
next-token reasoning, RPT significantly improves the language modeling accuracy
of predicting the next tokens. Moreover, RPT provides a strong pre-trained
foundation for further reinforcement fine-tuning. The scaling curves show that
increased training compute consistently improves the next-token prediction
accuracy. The results position RPT as an effective and promising scaling
paradigm to advance language model pre-training.

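To make the reward formulation in the abstract concrete, below is a minimal Python sketch of the verifiable-reward idea: the policy reasons about a context, commits to a next-token guess, and is scored against the token that actually follows in the corpus. The `Rollout` class and `next_token_reward` function are illustrative assumptions for this sketch, not the paper's implementation.

```python
# Minimal sketch of an RPT-style verifiable reward (illustrative, not the paper's code).
# Assumption: a policy produces a reasoning trace plus a committed next-token guess.

from dataclasses import dataclass


@dataclass
class Rollout:
    reasoning: str        # chain-of-thought produced before the final guess
    predicted_token: str  # the token the policy commits to after reasoning


def next_token_reward(rollout: Rollout, ground_truth_token: str) -> float:
    """Verifiable reward: 1.0 if the committed prediction matches the token that
    actually follows in the corpus, 0.0 otherwise. The corpus itself supplies the
    answer, so no domain-specific annotation is required."""
    return 1.0 if rollout.predicted_token == ground_truth_token else 0.0


# Toy usage: an ordinary pre-training sentence supplies both context and answer.
context = "the cat sat on the"
ground_truth = " mat"

sampled = Rollout(
    reasoning="Common idiom; the object following 'on the' is most likely 'mat'.",
    predicted_token=" mat",
)

print(next_token_reward(sampled, ground_truth))  # 1.0 -> this trajectory is reinforced
```

Under this framing, any plain text corpus becomes RL training data: each position yields a context, a rollout, and an automatically checkable answer, which is what allows general-purpose RL to scale without domain-specific annotated answers.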