NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Submitted by taesiri

Authors: NVIDIA, Aarti Basant, Abhijit Khairnar,
Abhijit Paithankar,
Abhinav Khattar,
Adi Renduchintala, Adithya Renduchintala,
Aditya Malte,
Akhiad Bercovich,
Akshay Hazare, Alejandra Rico,
Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov,
Ali Taghibakhshi, Amelia Barton,
Ameya Sunil Mahabaleshwarkar,
Amy Shen, Andrew Tao,
Ann Guan,
Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan, Ashton Sharabiani, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Banghua Zhu, Barnaby Simkin, Bilal Kartal, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Brian Yu, Bryan Catanzaro, Charles Wang, Charlie Truong, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christian Munley, Christopher Parisien, Dan Su, Daniel Afrimi, Daniel Korzekwa, Daniel Rohrer, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Dima Rekesh, Dina Yared, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Eileen Long, Elliott Ning, Eric Chung, Erick Galinkin, Evelina Bakhturina, Gargi Prasad, Gerald Shen, Haim Elisha, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Hoo Chang Shin, Hua Huang, Iain Cunningham, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jimmy Zhang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jonathan Cohen, Joseph Jennings, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kezhi Kong, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Kushan Ahmadian, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Luis Vega, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Mark Cai, Markus Kliegl, Marta Stepniewska-Dziubinska, Matvei Novikov, Mehrzad Samadi, Meredith Price, Meriem Boubdir, Michael Boone, Michael Evans, Michal Bien, Michal Zawalski, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Namit Dhameja, Nave Assaf, Negar Habibi, Nidhi Bhatia, Nikki Pope, Nima Tajbakhsh, Nirmal Kumar Juluru, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pablo Ribalta, Padmavathy Subramanian, Parth Chadha, Pavlo Molchanov, Peter Dykas, Peter Jin, Piotr Bialecki, Piotr Januszewski, Pradeep Thalasta, Prashant Gaikwad, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi Mahabadi, Rajen Patel, Ran El-Yaniv, Ranjit Rajan, Ria Cheruvu, Rima Shahbazyan, Ritika Borkar, Ritu Gala, Roger Waleffe, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Sahil Jain, Samuel Kriman,
Sanjeev Satheesh, Saori Kaji, Sarah Yurick, Saurav Muralidharan, Sean Narenthiran, Seonmyeong Bak, Sepehr Sameni, Seungju Han, Shanmugam Ramasamy, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shizhe Diao, Shreya Gopal, Shrimai Prabhumoye, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Siddhartha Jain, Somshubra Majumdar, Stefania Alborghetti,
Syeda Nahida Akter, Terry Kong, Tim Moon, Tomasz Hliwiak, Tomer Asida, Tony Wang, Twinkle Vashishth, Tyler Poon, Udi Karpas, Vahid Noroozi,
Venkat Srinivasan, Vijay Korthikanti, Vikram Fugro, Vineeth Kalluru, Vitaly Kurin, Vitaly Lavrukhin, Wasi Uddin Ahmad, Wei Du, Wonmin Byeon, Ximing Lu, Xin Dong, Yashaswi Karnati, Yejin Choi, Yian Zhang, Ying Lin, Yonggan Fu, Yoshi Suhara, Zhen Dong, Zhiyu Li, Zhongbo Zhu, Zijia Chen



Abstract
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which most of the self-attention layers of the common Transformer architecture are replaced with Mamba-2 layers, improving inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model, with the goal of enabling inference over up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while delivering up to 6x higher inference throughput in reasoning settings such as 8k input and 16k output tokens. We are releasing the Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints, along with the majority of our pre- and post-training datasets, on Hugging Face.
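Since the abstract notes that the checkpoints are released on Hugging Face, a minimal loading sketch may help readers try the model. The repository id, the need for `trust_remote_code`, and the chat-template behavior shown below are assumptions based on typical `transformers` usage, not details taken from the paper; the model card is the authoritative reference.

```python
# Minimal usage sketch, not the authors' reference implementation.
# Assumptions (verify against the model card):
#   - the checkpoint is published as "nvidia/NVIDIA-Nemotron-Nano-9B-v2";
#   - it loads via the standard transformers Auto* interface
#     (trust_remote_code=True is set in case a custom hybrid class is needed);
#   - bfloat16 matches the single-A10G deployment target in the abstract.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 22 GiB A10G target from the abstract
    device_map="auto",
    trust_remote_code=True,
)

# A reasoning-style prompt; long "thinking" outputs are where the
# Mamba-2-heavy stack is claimed to improve generation throughput.
messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```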