Step-Audio 2 Technical Report

Published
Submitted by Gang Yu
Authors: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

Abstract

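This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, which significantly enhances its responsiveness to paralinguistic information such as speaking style and emotion. To make effective use of the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and can call external tools, such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results show that Step-Audio 2 achieves state-of-the-art performance on a range of audio understanding and conversational benchmarks compared with other open-source and commercial solutions. For more information, visit https://github.com/stepfun-ai/Step-Audio2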

Comments

Gang Yu
Paper submitter

Project page: https://github.com/stepfun-ai/Step-Audio2