Common Diffusion Noise Schedules and Sample Steps are Flawed

Published
Submitted by AK
Authors: Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang

Abstract

AI-generated summary
The proposed fixes address flaws in diffusion noise schedules and samplers to ensure consistency between training and inference, enabling more accurate image generation.
We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR), and some implementations of diffusion samplers do not start from the last timestep. Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference time, creating a discrepancy between training and inference. We show that the flawed design causes real problems in existing implementations. In Stable Diffusion, it severely limits the model to only generating images with medium brightness and prevents it from generating very bright or very dark samples. We propose a few simple fixes: (1) rescale the noise schedule to enforce zero terminal SNR; (2) train the model with v prediction; (3) change the sampler to always start from the last timestep; (4) rescale classifier-free guidance to prevent over-exposure. These simple changes ensure the diffusion process is consistent between training and inference and allow the model to generate samples more faithful to the original data distribution.
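
As a rough illustration of fix (1), here is a minimal sketch, assuming a standard discrete beta schedule: it shifts and scales sqrt(alphas_cumprod) so that its final value becomes exactly 0 while the first value is preserved, then converts back to betas. This follows the rescaling idea described in the abstract, not necessarily the paper's exact reference code.

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a discrete beta schedule so the final timestep has zero SNR."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_bar_sqrt = alphas_cumprod.sqrt()

    # Remember the original first and last values.
    alphas_bar_sqrt_0 = alphas_bar_sqrt[0].clone()
    alphas_bar_sqrt_T = alphas_bar_sqrt[-1].clone()

    # Shift so the last value is exactly 0, then scale so the first value is preserved.
    alphas_bar_sqrt -= alphas_bar_sqrt_T
    alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T)

    # Convert back to betas.
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[0:1], alphas])
    return 1.0 - alphas

# Example: rescale the scaled-linear schedule used by Stable Diffusion.
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000, dtype=torch.float64) ** 2
betas = rescale_zero_terminal_snr(betas)
```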

Comments

Martin Kubovcik

Rescaling the schedule to zero terminal SNR does not work, because it becomes infinite at the last timestep.

sqrt_recip_alphas_cumprod tensor([ 1.0003, 1.0009, 1.0017, 1.0027, 1.0040, 1.0056, 1.0074, 1.0094, 1.0117, 1.0143, 1.0171, 1.0201, 1.0235, 1.0271, 1.0310, 1.0352, 1.0397, 1.0444, 1.0495, 1.0549, 1.0605, 1.0665, 1.0729, 1.0795, 1.0866, 1.0939, 1.1017, 1.1098, 1.1184, 1.1273, 1.1367, 1.1464, 1.1567, 1.1674, 1.1786, 1.1903, 1.2026, 1.2154, 1.2288, 1.2428, 1.2574, 1.2726, 1.2886, 1.3053, 1.3228, 1.3410, 1.3601, 1.3801, 1.4011, 1.4230, 1.4460, 1.4701, 1.4954, 1.5220, 1.5499, 1.5792, 1.6101, 1.6426, 1.6768, 1.7129, 1.7511, 1.7915, 1.8342, 1.8794, 1.9275, 1.9785, 2.0329, 2.0908, 2.1526, 2.2188, 2.2898, 2.3660, 2.4481, 2.5368, 2.6327, 2.7370, 2.8505, 2.9746, 3.1108, 3.2608, 3.4270, 3.6120, 3.8190, 4.0522, 4.3170, 4.6199, 4.9698, 5.3785, 5.8620, 6.4427, 7.1530, 8.0416, 9.1848, 10.7100, 12.8463, 16.0520, 21.3966, 32.0883, 64.1689, inf], dtype=torch.float64)
sqrt_recipm1_alphas_cumprod tensor([2.5133e-02, 4.1840e-02, 5.7956e-02, 7.3890e-02, 8.9761e-02, 1.0562e-01, 1.2150e-01, 1.3742e-01, 1.5340e-01, 1.6944e-01, 1.8555e-01, 2.0175e-01, 2.1805e-01, 2.3446e-01, 2.5099e-01, 2.6764e-01, 2.8443e-01, 3.0136e-01, 3.1846e-01, 3.3572e-01, 3.5317e-01, 3.7081e-01, 3.8865e-01, 4.0671e-01, 4.2499e-01, 4.4353e-01, 4.6231e-01, 4.8138e-01, 5.0072e-01, 5.2038e-01, 5.4035e-01, 5.6066e-01, 5.8133e-01, 6.0238e-01, 6.2383e-01, 6.4570e-01, 6.6801e-01, 6.9079e-01, 7.1407e-01, 7.3787e-01, 7.6223e-01, 7.8717e-01, 8.1273e-01, 8.3894e-01, 8.6585e-01, 8.9350e-01, 9.2193e-01, 9.5118e-01, 9.8132e-01, 1.0124e+00, 1.0445e+00, 1.0776e+00, 1.1118e+00, 1.1473e+00, 1.1841e+00, 1.2222e+00, 1.2618e+00, 1.3031e+00, 1.3460e+00, 1.3908e+00, 1.4375e+00, 1.4864e+00, 1.5376e+00, 1.5913e+00, 1.6478e+00, 1.7072e+00, 1.7699e+00, 1.8361e+00, 1.9063e+00, 1.9807e+00, 2.0599e+00, 2.1443e+00, 2.2346e+00, 2.3314e+00, 2.4354e+00, 2.5477e+00, 2.6693e+00, 2.8014e+00, 2.9456e+00, 3.1037e+00, 3.2779e+00, 3.4708e+00, 3.6858e+00, 3.9269e+00, 4.1995e+00, 4.5103e+00, 4.8681e+00, 5.2847e+00, 5.7760e+00, 6.3646e+00, 7.0828e+00, 7.9792e+00, 9.1302e+00, 1.0663e+01, 1.2807e+01, 1.6021e+01, 2.1373e+01, 3.2073e+01, 6.4161e+01, inf], dtype=torch.float64)
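
(For illustration only, a toy sketch of where those inf values come from, assuming a schedule whose final alphas_cumprod entry is exactly 0 as zero terminal SNR requires: the coefficients used to recover x0 from an epsilon prediction divide by that value and blow up at the last timestep.)

```python
import torch

# Toy schedule (not the one printed above): alphas_cumprod decays to exactly 0
# at the last timestep, as a zero-terminal-SNR schedule requires.
alphas_cumprod = torch.linspace(0.9999, 0.0, 100, dtype=torch.float64)

# Coefficients used to recover x0 from an epsilon prediction:
#   x0 = sqrt(1/alpha_bar_t) * x_t - sqrt(1/alpha_bar_t - 1) * eps
sqrt_recip_alphas_cumprod = torch.sqrt(1.0 / alphas_cumprod)
sqrt_recipm1_alphas_cumprod = torch.sqrt(1.0 / alphas_cumprod - 1.0)

print(sqrt_recip_alphas_cumprod[-1])    # tensor(inf, dtype=torch.float64)
print(sqrt_recipm1_alphas_cumprod[-1])  # tensor(inf, dtype=torch.float64)
```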

Peter Lin
Paper author

> Rescaling the schedule to zero terminal SNR does not work, because it becomes infinite at the last timestep.

Zero terminal SNR itself is not a problem; it works perfectly fine. The issue is that the math for many samplers is derived under the assumption that the model predicts epsilon, so many implementations first convert the model output to epsilon before doing the math. But epsilon prediction can never work with zero terminal SNR, hence the undefined division you are seeing. The solution is to derive the math directly from the v prediction or x0 prediction.
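
(For concreteness, a sketch of the two ways to recover x0, written in standard DDPM notation where alpha_bar_t is the cumulative product of alphas; the helper names are only illustrative:)

```python
import torch

def pred_x0_from_eps(x_t, eps, alpha_bar_t):
    # Epsilon parameterization: x0 = sqrt(1/a) * x_t - sqrt(1/a - 1) * eps.
    # Undefined at the last timestep when alpha_bar_t == 0 (zero terminal SNR).
    return torch.sqrt(1.0 / alpha_bar_t) * x_t - torch.sqrt(1.0 / alpha_bar_t - 1.0) * eps

def pred_x0_from_v(x_t, v, alpha_bar_t):
    # v parameterization: x0 = sqrt(a) * x_t - sqrt(1 - a) * v.
    # Well-defined for every alpha_bar_t in [0, 1], including exactly 0.
    return torch.sqrt(alpha_bar_t) * x_t - torch.sqrt(1.0 - alpha_bar_t) * v
```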

The DDPM and DDIM implementations in diffusers should work perfectly fine. What sampler are you using?
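
(For reference, recent diffusers versions expose these fixes as configuration flags; exact availability depends on your diffusers version, and the checkpoint name below is a placeholder for a model actually fine-tuned with v prediction and a zero-terminal-SNR schedule:)

```python
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Placeholder checkpoint: substitute a model fine-tuned with v prediction
# and a zero-terminal-SNR schedule.
pipe = StableDiffusionPipeline.from_pretrained("your/v-prediction-zero-snr-model")

pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config,
    rescale_betas_zero_snr=True,   # fix (1): enforce zero terminal SNR
    timestep_spacing="trailing",   # fix (3): sample from the last timestep
)

image = pipe(
    "a photo of a black cat in a dark room",
    guidance_rescale=0.7,          # fix (4): rescale classifier-free guidance
).images[0]
```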

Михаил Константинович Козлов

Hi! Thanks for the great work!

I want to fine-tune SDXL end to end. SDXL was trained with EulerDiscreteScheduler (epsilon prediction and EDM-style training), and I am not sure whether it will work if I switch to v prediction with trailing timesteps and zero terminal SNR. Could you give me a hint: should I use the pipeline I proposed, or switch to fine-tuning with DDPM and epsilon prediction?

The SDXL-Lightning paper talks about pure noise at step T, but does not mention the schedule or prediction type.

Martin Kubovcik

Yes, my understanding from the paper is that by default it only works with v prediction, and noise (epsilon) prediction does not make sense there. I tested it in my own environment (a Jupyter notebook). I am using DDIM. The authors talk about v prediction, but they do not say it is impossible to use another configuration. Thanks.

y1Ran

Can the ZeroSNR method work correctly with EulerDiscreteScheduler or EulerAncestralDiscreteScheduler?