
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Published on June 8

Submitted by Hou Pong (Ken) Chan on June 10

Authors: LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, and (3) a lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and progressively enhance its task-solving capabilities. In addition, we conduct a preliminary exploration of applying the reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. We also develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks: multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...


Hou Pong (Ken) Chan

Paper author

Paper submitter

🌟 Highlights:

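Lingshu supports more than 12 medical imaging modalities, including X-ray, CT, MRI, microscopy, ultrasound, histopathology, dermoscopy, fundus, OCT, digital photography, endoscopy, and PET.
Lingshu models achieve state-of-the-art (SOTA) performance on most medical multimodal/textual QA and report generation tasks at the 7B and 32B model sizes.
Lingshu-32B outperforms GPT-4.1 and Claude Sonnet 4 on most multimodal QA and report generation tasks.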
