Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Published
Submitted by Ruizhong Qiu
Authors: Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong

Abstract

Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration--efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key--value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron, and our project homepage is at https://q-rz.github.io/p/saffron.
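To make the multifurcation idea concrete, here is a minimal toy sketch in PyTorch. Everything in it (architecture, sizes, names) is an illustrative assumption, not the released Saffron-1 model: where a conventional PRM must be called once per sampled continuation, an MRM-style head emits a reward for every candidate next token in a single forward pass, so expanding k branches of a search node costs one reward model evaluation instead of k.

```python
# Toy multifurcation reward model (MRM) sketch. All names and sizes are
# illustrative assumptions, not the released Saffron-1 model. A conventional
# PRM scores ONE sampled continuation per call; an MRM-style head emits a
# reward for EVERY candidate next token in a single forward pass.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000  # toy vocabulary size (assumption)

class ToyMRM(nn.Module):
    """Maps a token prefix to a reward for each possible next token."""
    def __init__(self, vocab_size: int = VOCAB_SIZE, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)  # one reward per candidate token

    def forward(self, prefix_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(self.embed(prefix_ids))  # (batch, seq, dim)
        return self.head(h[:, -1])                   # (batch, vocab) rewards

mrm = ToyMRM()
prefix = torch.randint(0, VOCAB_SIZE, (1, 12))  # a partial generation
rewards = mrm(prefix)                           # ONE evaluation ...
best = rewards.topk(8, dim=-1).indices          # ... scores 8 branches at once
print(best)
```

The abstract also mentions a Trie-based key--value caching strategy for tree search. Below is a plausible minimal data-structure sketch, again an assumption about the shape of the idea rather than the authors' implementation: sibling sequences in the search tree share a prefix, so per-token KV states can be stored once along a trie path and reused by every descendant branch instead of being recomputed per sequence.

```python
# Sketch of Trie-based KV cache sharing for tree search. The data structure
# and method names are assumptions, not the authors' implementation.
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode
        self.kv = None      # cached KV state for the token ending at this node

class KVTrie:
    def __init__(self):
        self.root = TrieNode()

    def lookup(self, token_ids):
        """Return KV states cached for the longest stored prefix of
        token_ids, plus the suffix that still needs a forward pass."""
        node, cached = self.root, []
        for i, tok in enumerate(token_ids):
            child = node.children.get(tok)
            if child is None or child.kv is None:
                return cached, token_ids[i:]
            cached.append(child.kv)
            node = child
        return cached, []

    def insert(self, token_ids, kv_states):
        """Store one KV state per token along the path so that sibling
        branches sharing this prefix can reuse them."""
        node = self.root
        for tok, kv in zip(token_ids, kv_states):
            node = node.children.setdefault(tok, TrieNode())
            node.kv = kv

# Usage sketch: cache a prefix once, then a sibling branch reuses it.
trie = KVTrie()
trie.insert([5, 7, 9], ["kv5", "kv7", "kv9"])  # placeholder KV states
cached, todo = trie.lookup([5, 7, 9, 2])
print(cached, todo)  # ['kv5', 'kv7', 'kv9'] [2] -> only token 2 needs compute
```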

Comments

Ruizhong Qiu
Paper author
Paper submitter

😲 Not just reasoning?! Inference scaling can now boost LLM safety!

🚀 Introducing our pioneering work, Saffron-1:

  • Reduces the attack success rate from 66% to 17.5% on the Ai2 Refusals benchmark

    • Using only 59.7 TFLOPs of compute
    • Against the latest jailbreak attacks
    • Without fine-tuning the language model

📖 Paper: https://arxiv.org/pdf/2506.06444

🖥️ Code: https://github.com/q-rz/saffron

🌐 Website: https://q-rz.github.io/p/saffron