Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs
Published
Submitted by Itamar Zimerman
Authors:
Roy Eisenstadt, Itamar Zimerman, Lior Wolf
Abstract
Recently, techniques such as explicit structured reasoning have demonstrated
strong test-time scaling behavior by enforcing a separation between the model's
internal "thinking" process and the final response. A key factor influencing
answer quality in this setting is the length of the thinking stage. When the
reasoning is too short, the model may fail to capture the complexity of the
task. Conversely, when it is too long, the model may overthink, leading to
unnecessary computation and degraded performance. This paper explores and
exploits the underlying mechanisms by which LLMs understand and regulate the
length of their reasoning during explicit thought processes. First, we show
that LLMs encode their progress through the reasoning process and introduce an
interactive progress bar visualization, which is then used to reveal insights
on the model's planning dynamics. Second, we manipulate the internal progress
encoding during inference to reduce unnecessary steps and generate a more
concise and decisive chain of thought. Our empirical results demonstrate that
this "overclocking" method mitigates overthinking, improves answer accuracy,
and reduces inference latency. Our code is publicly available.

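The two ideas in the abstract — reading out a "progress" signal from hidden states and then manipulating that signal to shorten the thinking stage — can be illustrated with a toy sketch. The code below is not the authors' implementation: it uses synthetic vectors in place of real LLM activations, a least-squares linear probe to regress progress, and a hypothetical `overclock` function that shifts a hidden state along the probe direction so the model would appear further along in its reasoning.

```python
import numpy as np

# Illustrative sketch (synthetic data, not real LLM activations):
# 1) fit a linear probe that predicts "thinking progress" in [0, 1]
#    from a hidden state;
# 2) steer a hidden state along the probe direction to raise its
#    apparent progress ("overclocking").

rng = np.random.default_rng(0)
d = 64                                   # hidden-state dimensionality (toy)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)   # ground-truth progress axis

# Synthetic "hidden states": progress is encoded linearly along `direction`,
# plus isotropic noise.
n = 2000
progress = rng.uniform(0.0, 1.0, size=n)
H = rng.normal(scale=0.1, size=(n, d)) + progress[:, None] * direction

# Fit the probe by least squares: progress ≈ H @ w + b.
X = np.hstack([H, np.ones((n, 1))])      # append a bias column
w, *_ = np.linalg.lstsq(X, progress, rcond=None)

def predict_progress(h):
    """Probe readout: estimated fraction of the reasoning completed."""
    return float(np.append(h, 1.0) @ w)

def overclock(h, alpha=0.3):
    """Shift a hidden state along the (normalized) probe direction so the
    model 'believes' it is further along in its thinking."""
    return h + alpha * w[:-1] / np.linalg.norm(w[:-1])

h = H[0]
print(round(predict_progress(h), 2))
print(round(predict_progress(overclock(h)), 2))  # strictly larger than before
```

In a real setting, `H` would be residual-stream activations collected during the thinking stage, and the steering vector would be added to the hidden states at chosen layers during inference; the sketch only shows why a linear probe suffices when progress is encoded along a single direction.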