Common Diffusion Noise Schedules and Sample Steps are Flawed

Published on May 15
Submitted by AK on Apr 12
Authors: Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang

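Abstract

We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR), and that some implementations of diffusion samplers do not start from the last timestep. Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference. We show that the flawed design causes real problems in existing implementations. In Stable Diffusion, it severely limits the model to generating only images with medium brightness and prevents it from generating very bright and very dark samples. We propose a few simple fixes: (1) rescale the noise schedule to enforce zero terminal SNR; (2) train the model with v prediction; (3) change the sampler to always start from the last timestep; (4) rescale classifier-free guidance to prevent over-exposure. These simple changes ensure that the diffusion process is congruent between training and inference and allow the model to generate samples more faithful to the original data distribution.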
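Fix (1) can be illustrated with a small sketch. It assumes the rescaling is a linear shift and scale of sqrt(alpha_bar) that keeps the first timestep unchanged and forces the last one to zero; the abstract does not spell out the procedure, so treat this as one plausible reading. The function names and the example schedule (a "scaled linear" schedule with 1000 steps and betas from 0.00085 to 0.012) are illustrative assumptions, not values taken from this page.

```python
import math


def terminal_alpha_bar(betas):
    """Cumulative product of (1 - beta) over all timesteps."""
    return math.prod(1.0 - b for b in betas)


def rescale_zero_terminal_snr(betas):
    """Shift and scale sqrt(alpha_bar) so the last timestep has SNR exactly 0.

    Assumption: a linear rescale that leaves the first timestep unchanged.
    """
    # Cumulative product of alphas, where alpha = 1 - beta.
    alphas_bar = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_bar.append(running)

    sqrt_ab = [math.sqrt(a) for a in alphas_bar]
    first, last = sqrt_ab[0], sqrt_ab[-1]

    # Shift so the last value is 0, then scale so the first value is preserved.
    scale = first / (first - last)
    sqrt_ab = [(s - last) * scale for s in sqrt_ab]
    new_ab = [s * s for s in sqrt_ab]

    # Convert the cumulative products back to per-step betas.
    new_betas = [1.0 - new_ab[0]]
    for prev, cur in zip(new_ab, new_ab[1:]):
        new_betas.append(1.0 - cur / prev)
    return new_betas


# Hypothetical example schedule (assumed values, see the note above).
T = 1000
start, end = math.sqrt(0.00085), math.sqrt(0.012)
betas = [(start + (end - start) * i / (T - 1)) ** 2 for i in range(T)]
new_betas = rescale_zero_terminal_snr(betas)

print("terminal alpha_bar before:", terminal_alpha_bar(betas))
print("terminal alpha_bar after: ", terminal_alpha_bar(new_betas))
```

With the original schedule the terminal alpha_bar is small but positive, so some signal leaks through at the last training timestep. After rescaling it is exactly zero, which matches the pure Gaussian noise the sampler starts from.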

Comments

Martin Kubovcik

Scheduling to zero terminal SNR doesn't work, because it becomes infinite at the last timestep.

sqrt_recip_alphas_cumprod tensor([ 1.0003,  1.0009,  1.0017,  1.0027,  1.0040,  1.0056,  1.0074,  1.0094,
     1.0117,  1.0143,  1.0171,  1.0201,  1.0235,  1.0271,  1.0310,  1.0352,
     1.0397,  1.0444,  1.0495,  1.0549,  1.0605,  1.0665,  1.0729,  1.0795,
     1.0866,  1.0939,  1.1017,  1.1098,  1.1184,  1.1273,  1.1367,  1.1464,
     1.1567,  1.1674,  1.1786,  1.1903,  1.2026,  1.2154,  1.2288,  1.2428,
     1.2574,  1.2726,  1.2886,  1.3053,  1.3228,  1.3410,  1.3601,  1.3801,
     1.4011,  1.4230,  1.4460,  1.4701,  1.4954,  1.5220,  1.5499,  1.5792,
     1.6101,  1.6426,  1.6768,  1.7129,  1.7511,  1.7915,  1.8342,  1.8794,
     1.9275,  1.9785,  2.0329,  2.0908,  2.1526,  2.2188,  2.2898,  2.3660,
     2.4481,  2.5368,  2.6327,  2.7370,  2.8505,  2.9746,  3.1108,  3.2608,
     3.4270,  3.6120,  3.8190,  4.0522,  4.3170,  4.6199,  4.9698,  5.3785,
     5.8620,  6.4427,  7.1530,  8.0416,  9.1848, 10.7100, 12.8463, 16.0520,
    21.3966, 32.0883, 64.1689,     inf], dtype=torch.float64)

sqrt_recipm1_alphas_cumprod tensor([2.5133e-02, 4.1840e-02, 5.7956e-02, 7.3890e-02, 8.9761e-02, 1.0562e-01,
    1.2150e-01, 1.3742e-01, 1.5340e-01, 1.6944e-01, 1.8555e-01, 2.0175e-01,
    2.1805e-01, 2.3446e-01, 2.5099e-01, 2.6764e-01, 2.8443e-01, 3.0136e-01,
    3.1846e-01, 3.3572e-01, 3.5317e-01, 3.7081e-01, 3.8865e-01, 4.0671e-01,
    4.2499e-01, 4.4353e-01, 4.6231e-01, 4.8138e-01, 5.0072e-01, 5.2038e-01,
    5.4035e-01, 5.6066e-01, 5.8133e-01, 6.0238e-01, 6.2383e-01, 6.4570e-01,
    6.6801e-01, 6.9079e-01, 7.1407e-01, 7.3787e-01, 7.6223e-01, 7.8717e-01,
    8.1273e-01, 8.3894e-01, 8.6585e-01, 8.9350e-01, 9.2193e-01, 9.5118e-01,
    9.8132e-01, 1.0124e+00, 1.0445e+00, 1.0776e+00, 1.1118e+00, 1.1473e+00,
    1.1841e+00, 1.2222e+00, 1.2618e+00, 1.3031e+00, 1.3460e+00, 1.3908e+00,
    1.4375e+00, 1.4864e+00, 1.5376e+00, 1.5913e+00, 1.6478e+00, 1.7072e+00,
    1.7699e+00, 1.8361e+00, 1.9063e+00, 1.9807e+00, 2.0599e+00, 2.1443e+00,
    2.2346e+00, 2.3314e+00, 2.4354e+00, 2.5477e+00, 2.6693e+00, 2.8014e+00,
    2.9456e+00, 3.1037e+00, 3.2779e+00, 3.4708e+00, 3.6858e+00, 3.9269e+00,
    4.1995e+00, 4.5103e+00, 4.8681e+00, 5.2847e+00, 5.7760e+00, 6.3646e+00,
    7.0828e+00, 7.9792e+00, 9.1302e+00, 1.0663e+01, 1.2807e+01, 1.6021e+01,
    2.1373e+01, 3.2073e+01, 6.4161e+01,        inf], dtype=torch.float64)

Peter Lin
Paper author

> Scheduling to zero terminal SNR doesn't work, because it becomes infinite at the last timestep.


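Zero terminal SNR itself is not the problem; it works fine. The problem is that the math for many samplers was derived under the assumption of epsilon prediction, so many implementations first convert the model output to epsilon and only then do the computation. Epsilon prediction can never work with zero terminal SNR, which is why you get an undefined division. The solution is to derive the math directly from v prediction or x0 prediction.

The DDPM and DDIM implementations in diffusers should work fine. Which sampler are you using?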
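To make the division issue concrete, here is a minimal scalar sketch. It assumes the usual forward process x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps and the v target v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0; the function names are hypothetical and not taken from any library.

```python
import math


def forward(x0, eps, alpha_bar):
    """Return the noisy sample x_t and the v target for a given alpha_bar."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = a * x0 + s * eps
    v = a * eps - s * x0
    return x_t, v


def x0_from_eps(x_t, eps, alpha_bar):
    """Recover x0 from an epsilon prediction; divides by sqrt(alpha_bar)."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha_bar)


def x0_from_v(x_t, v, alpha_bar):
    """Recover x0 from a v prediction; no division involved."""
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1.0 - alpha_bar) * v


x0, eps = 0.3, -1.2  # arbitrary example values

# At an intermediate timestep both routes recover x0.
x_t, v = forward(x0, eps, alpha_bar=0.5)
print(x0_from_eps(x_t, eps, 0.5), x0_from_v(x_t, v, 0.5))

# At zero terminal SNR (alpha_bar = 0) x_t is pure noise.
x_t, v = forward(x0, eps, alpha_bar=0.0)
print(x0_from_v(x_t, v, 0.0))  # still well defined
try:
    x0_from_eps(x_t, eps, 0.0)
except ZeroDivisionError as err:
    # Plain Python floats raise here; tensor libraries would yield inf or nan.
    print("epsilon route fails:", err)
```

At alpha_bar = 0 the noisy sample carries no information about x0, so the epsilon route has nothing to divide out, while the v target reduces to -x0 and stays finite.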

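Михаил Константинович Козлов

Hi! Thank you for the great work!

I want to fine-tune SDXL end to end. SDXL was trained with EulerDiscreteScheduler (epsilon prediction and EDM-style training), and I am not sure it will work if I switch to v prediction with trailing timesteps and zero terminal SNR. Could you give me a hint: should I use the pipeline I proposed, or switch to fine-tuning with DDPM and epsilon prediction?

The SDXL-Lightning paper talks about pure noise at step T, but it does not mention the schedule or the prediction type.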
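For readers unfamiliar with the term, the sketch below contrasts "trailing" timestep selection with the common "leading" one. It assumes that "trailing" means counting down from the final training timestep in equal strides, so sampling always starts at the last timestep as fix (3) in the abstract requires. The function names and the choice of 1000 training steps with 25 sampling steps are illustrative assumptions.

```python
def leading_timesteps(num_train, num_inference):
    """Multiples of the stride starting at 0; never reaches the last timestep."""
    step = num_train // num_inference
    return [i * step for i in range(num_inference)][::-1]


def trailing_timesteps(num_train, num_inference):
    """Count down from the last training timestep in equal strides."""
    step = num_train / num_inference
    return [round(num_train - i * step) - 1 for i in range(num_inference)]


lead = leading_timesteps(1000, 25)
trail = trailing_timesteps(1000, 25)
print("leading starts at: ", lead[0], "ends at:", lead[-1])
print("trailing starts at:", trail[0], "ends at:", trail[-1])
```

With leading spacing the sampler begins at a timestep whose noise level the model never sees paired with pure noise during training, whereas trailing spacing begins at the final timestep, where a zero terminal SNR schedule makes the input exactly pure noise.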

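Martin Kubovcik

Yes, I understood from the paper that by default it only works with v prediction, and that noise prediction is meaningless. I tested it in my own environment (a Jupyter notebook), and I am using DDIM. The authors discuss v prediction but do not say that other configurations are impossible. Thanks.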

y1Ran

Can the zero-SNR method work correctly with EulerDiscreteScheduler or EulerAncestralDiscreteScheduler?