GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition

Submitted by Haote Yang
Authors: Jingchao Wang, Haote Yang, Jiang Wu, Yifan He, Xingjian Wei, Yinfan Wang, Chengjin Liu, Lingli Ge, Lijun Wu, Bin Wang, Dahua Lin, Conghui He

Abstract

Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge by converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism that emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of Faithfully Recognize What You've Seen, which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for a fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in scenarios involving molecular images with functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points, both in SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to more effectively meet real-world needs, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at https://github.com/opendatalab/GTR-CoT.
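The abstract reports both SMILES-based and graph-based metrics. As a rough illustration of the former, the sketch below checks exact match between a predicted and a ground-truth SMILES after canonicalization with RDKit; this is a common evaluation convention, and the paper's actual protocol may differ. The helper `smiles_exact_match` is hypothetical.

```python
# A minimal sketch of SMILES-based exact-match evaluation (an assumption
# about the metric family, not the paper's exact protocol).
from rdkit import Chem

def smiles_exact_match(pred_smiles: str, gold_smiles: str) -> bool:
    """Compare two SMILES strings after canonicalization, so that
    chemically identical molecules written differently still match."""
    pred_mol = Chem.MolFromSmiles(pred_smiles)
    gold_mol = Chem.MolFromSmiles(gold_smiles)
    if pred_mol is None or gold_mol is None:
        return False  # an unparsable prediction counts as a miss
    return Chem.MolToSmiles(pred_mol) == Chem.MolToSmiles(gold_mol)

# Example: two different surface forms of ethanol should match.
assert smiles_exact_match("OCC", "CCO")
```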

Comments

Haote Yang
Paper author
Paper submitter

We introduce GTR-Mol-VLM, a novel framework for Optical Chemical Structure Recognition (OCSR). GTR-Mol-VLM features two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism, which emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of "Faithfully Recognize What You've Seen", which addresses the mismatch between abbreviated structures in images and their expanded annotations. A minimal sketch of the traversal idea follows.
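To make the mechanism concrete, the sketch below serializes a toy molecular graph into a chain-of-thought-style sequence of atom and bond prediction steps via breadth-first traversal. The data structures and step format are illustrative assumptions, not the paper's actual serialization.

```python
# A minimal sketch of Graph Traversal as Visual Chain of Thought: instead of
# emitting a SMILES string in one shot, the reasoning trace walks the
# molecular graph one atom (node) or bond (edge) at a time.
from collections import deque

# Toy molecular graph for acetic acid (heavy atoms only):
# atom id -> element symbol, and bonds as (atom_i, atom_j, bond_order).
atoms = {0: "C", 1: "C", 2: "O", 3: "O"}
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

def traverse_as_cot(atoms, bonds, start=0):
    """Breadth-first traversal yielding sequential atom/bond steps."""
    adj = {i: [] for i in atoms}
    for i, j, order in bonds:
        adj[i].append((j, order))
        adj[j].append((i, order))
    visited, seen_bonds = {start}, set()
    steps = [f"atom {start}: {atoms[start]}"]
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt, order in adj[cur]:
            edge = (min(cur, nxt), max(cur, nxt))
            if edge in seen_bonds:
                continue  # each bond is predicted exactly once
            seen_bonds.add(edge)
            if nxt not in visited:
                visited.add(nxt)
                steps.append(f"atom {nxt}: {atoms[nxt]}")
                queue.append(nxt)
            steps.append(f"bond {cur}-{nxt}: order {order}")
    return steps

for step in traverse_as_cot(atoms, bonds):
    print(step)
```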

To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations. We also introduced MolRec-Bench, the first benchmark designed for fine-grained evaluation of graph-parsing accuracy in OCSR.
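In the spirit of fine-grained graph-parsing evaluation, the sketch below scores a predicted molecular graph at the atom and bond level rather than only on the final SMILES string. It assumes atom indices are already aligned between prediction and ground truth; MolRec-Bench's real matching procedure is likely more involved, and `graph_f1` is a hypothetical helper.

```python
# A minimal sketch of graph-level scoring: micro F1 over the union of
# atom labels and typed bonds (an illustrative assumption, not the
# benchmark's actual metric).

def graph_f1(pred, gold):
    """`pred`/`gold` are (atoms, bonds) pairs: atoms is {id: symbol},
    bonds is a set of (i, j, order) tuples with i < j."""
    pred_items = {("atom", i, s) for i, s in pred[0].items()} | {("bond", *b) for b in pred[1]}
    gold_items = {("atom", i, s) for i, s in gold[0].items()} | {("bond", *b) for b in gold[1]}
    if not pred_items or not gold_items:
        return 0.0
    tp = len(pred_items & gold_items)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_items), tp / len(gold_items)
    return 2 * precision * recall / (precision + recall)

gold = ({0: "C", 1: "C", 2: "O", 3: "O"}, {(0, 1, 1), (1, 2, 2), (1, 3, 1)})
pred = ({0: "C", 1: "C", 2: "O", 3: "N"}, {(0, 1, 1), (1, 2, 2), (1, 3, 1)})
print(f"graph F1: {graph_f1(pred, gold):.2f}")  # one wrong atom lowers the score
```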

Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared with specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs.

Our repository is available at https://github.com/opendatalab/GTR-CoT.