Well Begun Is Half Done: Low-Resource Preference Alignment by Weak-to-Strong Decoding

Published

Submitted by Feifan Song

Authors: Feifan Song, Shaohang Wei, Wen Luo, Yuxuan Fan, Tianyu Liu, Guoyin Wang, Houfeng Wang

Abstract

Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Recently, low-resource methods for LLM alignment have become popular, yet they still struggle to produce content that is both high-quality and well aligned. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), which enhances the alignment ability of base models under the guidance of a small aligned model. The small model first drafts a well-aligned beginning, and the large base model then continues the rest, controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune the small-sized Pilot-3B as the draft model, which effectively enhances different base models under the WSD framework to outperform all baseline methods while avoiding degradation on downstream tasks, known as the alignment tax. Extensive experiments further examine the impact of different settings and time efficiency, along with in-depth analyses of the intrinsic mechanisms of WSD.
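To make the two-phase decoding concrete, here is a minimal sketch in Python with Hugging Face transformers. The confidence-threshold switch (`switch_threshold`) is an assumption standing in for the paper's actual auto-switch criterion, and greedy decoding in both phases is a simplification; see the linked repository for the real implementation.

```python
# Minimal sketch of Weak-to-Strong Decoding (WSD).
# ASSUMPTIONS: draft and base models share a tokenizer/vocabulary, and the
# auto-switch is approximated by a next-token confidence threshold on the
# draft model; the paper's actual switch mechanism may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def wsd_generate(prompt, draft_name, base_name,
                 switch_threshold=0.5, max_new_tokens=256):
    tok = AutoTokenizer.from_pretrained(base_name)
    draft = AutoModelForCausalLM.from_pretrained(draft_name).eval()
    base = AutoModelForCausalLM.from_pretrained(base_name).eval()

    ids = tok(prompt, return_tensors="pt").input_ids

    # Phase 1: the small aligned model drafts the beginning of the response.
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = draft(ids).logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            conf, next_id = probs.max(dim=-1)
            # Auto-switch stand-in: hand over to the base model once the
            # draft model's next-token confidence drops below the threshold.
            if conf.item() < switch_threshold:
                break
            ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)
            if next_id.item() == tok.eos_token_id:
                return tok.decode(ids[0], skip_special_tokens=True)

    # Phase 2: the large base model continues from the aligned prefix.
    out = base.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)
```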

Comments

Feifan Song
Paper author
Paper submitter

Code and resources: https://github.com/F2-Song/Weak-to-Strong-Decoding

