Research Report: A Survey of Diffusion Models

Abstract

Diffusion models, a class of deep generative models, have emerged as state-of-the-art techniques in various domains, particularly in image and video synthesis. This report provides a comprehensive survey of diffusion models, covering their theoretical foundations, main variants, technical characteristics, and applications across computer vision, natural language processing, and multimodal domains. We analyze key challenges such as computational efficiency, privacy risks, and vulnerability to adversarial attacks, while also discussing promising future directions, including theory innovation and cross-domain applications. The report aims to serve as a reference for researchers and practitioners interested in understanding and leveraging diffusion models.

1. Introduction

Diffusion models have recently gained significant attention in the AI research community due to their superior generation quality and flexibility. These models, which are based on the mathematical framework of stochastic differential equations (SDEs), work by gradually perturbing data with noise in a forward process and learning to reverse this perturbation in a reverse process. Since the introduction of denoising diffusion probabilistic models (DDPM) in 2020, diffusion models have shown superior performance in image generation, video synthesis, and multimodal tasks, surpassing traditional methods such as generative adversarial networks (GANs) and autoregressive models.

This report systematically surveys diffusion models, focusing on three main aspects: theoretical foundations, technical variants, and practical applications. We also analyze current challenges and future research directions, aiming to provide a comprehensive overview of this rapidly evolving field.

2. Fundamental Principles and Theoretical Foundation

2.1 Core Concept

Diffusion models are probabilistic generative models that follow a two-stage process: a forward diffusion process and a reverse denoising process.

Forward Diffusion Process: Starting from the original data distribution, the model gradually adds Gaussian noise to the data over multiple time steps, transforming the data into a simple Gaussian distribution. This process can be mathematically represented as:

$$ q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0, (1-\bar{\alpha}_t)I\right) $$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, $\alpha_t = 1 - \beta_t$, and $\beta_t$ is the noise schedule.
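As a concrete illustration, here is a minimal PyTorch sketch of this closed-form noising step. It is a sketch under assumed values: the linear schedule and its endpoints are common illustrative choices, not prescribed by the text.

```python
import torch

# A linear beta schedule; T and the endpoint values are illustrative
# assumptions, not prescribed by the text.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # beta_t: per-step noise variance
alphas = 1.0 - betas                        # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    ab = alpha_bars.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
```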

Reverse Denoising Process: The model then learns to reverse this noise addition by predicting and removing the added noise at each time step, reconstructing the original data distribution. This is achieved through a parameterized neural network (typically a U-Net architecture).

2.2 Theoretical Background

Diffusion models are theoretically rooted in two key areas: non-equilibrium thermodynamics and score-based generative modeling.

Non-Equilibrium Thermodynamics: The forward process resembles physical diffusion, in which structure gradually dissolves into noise, while the reverse process is analogous to denoising in non-equilibrium systems. This physical interpretation provides a rigorous mathematical framework for understanding the model's behavior.

Score-Based Generative Modeling: The reverse process can be viewed as estimating the score function (the gradient of the log probability density) of the data distribution. This approach avoids explicit likelihood computation, making it feasible for high-dimensional data.
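For the Gaussian forward process defined above, this connection can be made explicit: the conditional score has a closed form, which is why score estimation reduces to noise prediction (a standard identity, restated here from the definitions in Section 2.1):

$$ \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{1-\bar{\alpha}_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar{\alpha}_t}} $$

where $\boldsymbol{\epsilon}$ is the Gaussian noise used to produce $\mathbf{x}_t$ from $\mathbf{x}_0$.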

2.3 Mathematical Formulation

The mathematical formulation of diffusion models can be expressed in terms of SDEs:

Forward SDE:

$$ d\mathbf{x}_t = f(\mathbf{x}_t, t)dt + g(t)d\mathbf{w}_t $$

Reverse SDE:

$$ d\mathbf{x}_t = \left[ f(\mathbf{x}_t, t) - g^2(t)\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) \right]dt + g(t)\,d\mathbf{\bar{w}}_t $$

where $\mathbf{w}_t$ and $\mathbf{\bar{w}}_t$ are standard Wiener processes (the latter running backward in time), and $\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)$ is the score of the marginal distribution at time $t$, which the model approximates with a learned score network.

For discrete-time models like DDPM, the forward process is defined as a Markov chain:

$$ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{\alpha_t} \mathbf{x}_{t-1}, \beta_t I) $$

and the reverse process $p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is parameterized by a neural network.
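A minimal sketch of the corresponding ancestral sampling loop, reusing the schedule tensors from the forward-process sketch above; `eps_model` is a hypothetical noise-prediction network, and the $\sigma_t^2 = \beta_t$ variance choice is one common option:

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, device="cpu"):
    """Ancestral sampling: start from pure noise and denoise step by step."""
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(x, t_batch)                      # predicted noise
        # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise               # sigma_t^2 = beta_t choice
    return x
```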

3. Main Variants and Technical Characteristics

3.1 Taxonomy of Diffusion Models

Diffusion models can be categorized based on their primary focus:

| Category | Main Variants | Key Characteristics |
| --- | --- | --- |
| Sampling-efficiency enhancement | DDIM, DPM-Solver, LDM | Reduces sampling steps, improves speed |
| Likelihood-maximization enhancement | DDPM, NCSN, SGM | Focuses on likelihood estimation |
| Data-generation enhancement | Stable Diffusion, DALL-E 3, Sora | Enhances generation quality and control |

3.2 Key Variants

3.2.1 Denoising Diffusion Probabilistic Models (DDPM)

DDPM is the representative discrete-time instantiation of diffusion models, built on a Markov chain. Its core idea is to optimize the reverse process through the variational lower bound (VLB):

$$ \mathcal{L}_{\text{VLB}} = \mathbb{E}_{q} \left[ -\log p_{\theta}(\mathbf{x}_0) + D_{\text{KL}} \left( q(\mathbf{x}_{1:T}|\mathbf{x}_0) \| p_{\theta}(\mathbf{x}_{1:T}|\mathbf{x}_0) \right) \right] $$

Strengths: theoretically rigorous, high generation quality.
Limitations: many sampling steps (typically 1,000+), slow sampling.
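In practice, DDPM is usually trained with the simplified noise-prediction objective rather than the full VLB. A minimal training-step sketch, assuming the `q_sample` helper and schedule defined in Section 2.1:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0: torch.Tensor) -> torch.Tensor:
    """Simplified objective: E_{t, eps} || eps - eps_theta(x_t, t) ||^2."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per sample
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)               # closed-form forward noising
    return F.mse_loss(eps_model(x_t, t), noise)
```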

3.2.2 Denoising Diffusion Implicit Models (DDIM)

DDIM breaks DDPM's fixed-step-count limitation by using an implicit, non-Markovian process. Its core update is:

$$ \frac{\mathbf{\tilde{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}} = \frac{\mathbf{\tilde{x}}_t}{\sqrt{\alpha_t}} + \left( \sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}} \right) \epsilon_{\theta}^{(t)}(\mathbf{\tilde{x}}_t) $$

where $\alpha_t$ denotes the cumulative product $\bar{\alpha}_t$ in DDIM's notation.

Strengths: supports arbitrary sampling step counts, yielding significant acceleration.
Limitations: generation quality slightly below that of DDPM.
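The update above is equivalent to predicting $\mathbf{x}_0$ and re-noising it to the earlier timestep. A minimal sketch of one deterministic DDIM step ($\eta = 0$), reusing the `alpha_bars` schedule above with a hypothetical `eps_model`:

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, x_t: torch.Tensor, t: int, t_prev: int) -> torch.Tensor:
    """One deterministic DDIM update from timestep t down to t_prev."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long, device=x_t.device)
    eps = eps_model(x_t, t_batch)
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()   # predicted clean sample
    return ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
```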

3.2.3 Score-Based Generative Models (SGM)

SGM is built on a continuous-time SDE framework and denoises by estimating the gradient of the log data density (the score function). DDPM can be viewed as a discretization of SGM.

Strengths: highly flexible; can be discretized into DDPM-style models.
Limitations: relatively high computational cost.

3.2.4 The DPM-Solver Family

DPM-Solver exploits the semi-linear structure of the diffusion ODE (a linear term plus a neural-network-parameterized nonlinear term), applying high-order numerical methods (e.g., 2nd/3rd-order modes) to approximate the ODE solution directly and reducing the number of function evaluations (NFE) to within about 10.

Strengths: 15%-30% speedup, supports both pixel-space and latent-space models.
Limitations: relies on optimization using empirical model statistics.

3.2.5 Latent Diffusion Models (LDM)

LDM performs diffusion in a latent space (e.g., the low-dimensional space produced by a VAE encoder), reducing computational cost.

Strengths: low computational cost, supports high-resolution generation.
Limitations: depends on an encoder-decoder structure.
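A conceptual sketch of the latent-diffusion generation pipeline follows; `vae_decode` and `eps_model` are hypothetical stand-ins for a trained VAE decoder and a latent-space noise predictor, not a real library API:

```python
import torch

@torch.no_grad()
def ldm_generate(vae_decode, eps_model, latent_shape, num_steps: int = 50):
    """Denoise entirely in latent space, then decode once back to pixels."""
    z = torch.randn(latent_shape)                          # noise in latent space
    steps = torch.linspace(T - 1, 0, num_steps + 1).long().tolist()
    for t, t_prev in zip(steps[:-1], steps[1:]):
        z = ddim_step(eps_model, z, t, t_prev)             # latent-space DDIM step
    return vae_decode(z)                                   # single decoder pass
```

The key design point is that every expensive denoising iteration operates on the small latent tensor; the full-resolution decoder runs only once.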

4. Applications Across Domains

4.1 Computer Vision

4.1.1 Image Generation

Stable Diffusion combines CLIP text conditioning with a U-Net backbone, improving text-image alignment.
DALL-E 3 relies on a large language model to encode text, generating high-fidelity images.
Point-E generates 3D point clouds, supporting text-to-3D conversion.

4.1.2 Image Restoration and Super-Resolution

DiffHand reconstructs 3D hand meshes with diffusion models, applied to medical image restoration.
SR-Diff combines diffusion models with autoencoders to improve hyperspectral image super-resolution.
DDM generates 4D cardiac data, supporting medical image analysis.

4.1.3 3D Generation

CoNFiLD combines conditional neural fields with diffusion to generate spatiotemporally continuous data (e.g., turbulence simulations).
NeuroCine decodes brain activity into video, enabling cross-modal perception transfer.

4.2 Natural Language Processing

4.2.1 Text Generation

DiffuSeq performs non-autoregressive text generation, supporting long-form composition.
DDLM improves text generation quality through continuous-paragraph denoising pretraining.

4.2.2 Speech Synthesis

NaturalSpeech 2 uses latent diffusion for natural-sounding TTS, supporting multiple speakers and styles.
Grad-TTS generates mel-spectrograms, improving speech synthesis quality.

4.2.3 Graph Generation

CDGraph combines diffusion models with graph neural networks to generate social-network graphs.
ProteinSGM models protein structures, supporting molecular generation.

4.3 Multimodal Applications

4.3.1 Text-to-Image/Video

Sora combines the DiT architecture with diffusion to generate high-definition video from text, supporting multi-view output.
Tune-A-Video generates temporally coherent video via spatiotemporal attention.

4.3.2 Cross-Modal Alignment

CLIP-guided diffusion models align text and images in a shared embedding space, improving the relevance of generated content.
MADiff models multi-agent coordination through attention mechanisms, applied to offline multi-agent RL.

4.3.3 Interdisciplinary Applications

Neuroscience: NeuroCine decodes brain activity into video, enabling cross-modal perception transfer.
Drug design: ProteinSGM and DiffFolding generate stable protein structures, supporting drug discovery.

5. Technical Challenges

5.1 Computational Efficiency

Main challenge: traditional diffusion models require hundreds or even thousands of time steps for sampling, resulting in slow inference.

Solutions:

  • Sampling acceleration: DDIM and DPM-Solver reduce the number of sampling steps via implicit processes or numerical ODE solvers.
  • Latent-space diffusion: LDM diffuses in a low-dimensional latent space, reducing computational cost.
  • Hardware optimization: deployment-level optimizations such as TensorRT (e.g., for LDM deployment).

Recent progress: DPM-Solver-v3 introduces "empirical model statistics" to optimize the ODE solution, compressing sampling to 5-10 steps with a 15%-30% speedup.

5.2 Privacy and Security

Main challenge: diffusion models risk memorizing and reproducing training data, and privacy risk grows with model scale.

Solutions:

  • Differential privacy: DP-LORA achieves privacy protection on CelebA-64 via parameter-efficient fine-tuning, requiring only 0.47M parameters.
  • Data preprocessing: simple measures such as deduplication are insufficient to eliminate memorization.
  • Defense mechanisms: adversarial training requires adding adversarial examples to the training set, but may sacrifice generation quality.

Recent progress: a Shanghai Jiao Tong University team uncovered a security vulnerability in diffusion models (the DIJA attack), achieving a 99.0% keyword attack success rate on the Dream-Instruct model.

5.3 Adversarial Attacks

Main challenge: diffusion models face multiple forms of adversarial attack, including noise-perturbation attacks, gradient-based attacks (e.g., FGSM), and condition-guidance attacks.

Solutions:

  • Adversarial training: Goodfellow et al. first proposed training neural networks on both benign and FGSM-generated adversarial examples to improve robustness.
  • Feature-fusion defense: a Siamese-network defense model combining adversarial training with feature mixing (SS-ResNet18) shows strong defensive performance on CIFAR-10 and SVHN.
  • Input-perturbation detection: Volcano Engine's Code Sandbox Agent and general sandbox-testing schemes support isolated code execution.

Recent progress: gradient-guidance methods treat diffusion-model sampling as policy optimization, allowing fine-tuning with policy-gradient methods.

6. Future Research Directions

6.1 Theory Innovation

6.1.1 Improved SDE Frameworks

Nonlinear diffusion models: move beyond the traditional linear diffusion constraint, increasing generative flexibility.
Schrödinger-bridge path optimization: solves an entropy-regularized optimal transport problem over path space, improving forward-generation efficiency.
Critically damped Langevin diffusion: introduces a velocity variable and learns the conditional velocity distribution given the data, improving generation quality.

6.1.2 Parameter-Efficient Fine-Tuning

Adapter modules: insert small learnable modules for parameter-efficient fine-tuning (see the sketch after this list).
Mixture-of-Prompts: combines multiple prompts to capture the subtle behavior of different denoising stages.
Neural-RDM: replaces the fixed noise schedule with learnable gated residual weights, improving training stability in deep networks.
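A minimal sketch of the adapter pattern referenced above: a small bottleneck MLP with a residual connection, trained while the base network stays frozen. The dimensions and activation are illustrative assumptions, not a specific published design.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck MLP with a residual connection, inserted after a frozen block."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down to a small width
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back up

    def forward(self, h):
        # Residual form: the module starts as a small perturbation of the
        # identity, so the frozen base model's behavior is largely preserved.
        return h + self.up(self.act(self.down(h)))
```

Only the adapter's parameters (roughly 2 x dim x bottleneck weights per block) are updated during fine-tuning, which is what makes the approach parameter-efficient.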

6.1.3 Integration with Reinforcement Learning

Offline RL: diffusion models serve as trajectory generators (e.g., Diffuser) and policy models (evaluated on benchmarks such as D4RL).
Multi-agent systems: MADiff models multi-agent coordination through attention mechanisms, applied to offline multi-agent RL.
Goal-conditioned RL: learn to reach specified goals via diffusion models, e.g., robotic grasping.

6.2 Cross-Domain Applications

6.2.1 Healthcare

Medical image restoration: diffusion models combined with classifiers (e.g., ConDiff) assist diabetic-foot-ulcer infection prediction.
3D molecular generation: ProteinSGM and DiffFolding generate stable protein structures, supporting drug discovery.
SAR image denoising: suppresses speckle noise, improving satellite image quality.

6.2.2 Climate and Remote Sensing

Cascaded diffusion models: used for hyperspectral-image super-resolution and land-cover classification, capturing complex spectral-spatial relationships.
Climate prediction: generate high-resolution satellite imagery to forecast precipitation and other meteorological variables.
Terrain estimation: produce accurate height estimates from a single optical remote-sensing image.

6.2.3 Education

Personalized learning: TabDDPM generates synthetic educational records, improving personalized learning outcomes.
Knowledge tracing: diffusion models augment deep knowledge tracing, addressing data scarcity.
Learning-content design: combine diffusion models with educational theory to generate customized learning materials.

6.2.4 Law and Ethics

Legal document generation: a hypothesized direction that would need to be combined with privacy-preserving techniques.
Ethical alignment: exploit the controllability of diffusion models to generate content that complies with ethical norms.
Copyright protection: digital watermarking attaches unique identifiers to AIGC outputs and the models that produce them.

7. Conclusion

Diffusion models have revolutionized the field of generative AI, offering superior performance in image, video, and multimodal generation tasks. Their mathematical rigor and flexibility make them a promising direction for future research and application.

This report has provided a comprehensive survey of diffusion models, including their theoretical foundations, main variants, technical characteristics, and applications across domains. We have also analyzed key challenges and discussed future research directions.

Future work should focus on improving computational efficiency, addressing privacy risks, and developing robust defenses against adversarial attacks. At the same time, exploring novel applications in domains such as climate science, education, and healthcare can further unlock the potential of diffusion models.

As diffusion models continue to evolve, they are likely to play an increasingly important role in shaping the future of AI applications.



Appendices

A.1 Notation

| Symbol | Definition |
| --- | --- |
| $\mathbf{x}_t$ | Data at time step $t$ |
| $\alpha_t$ | Noise schedule parameter |
| $\bar{\alpha}_t$ | Cumulative product of $\alpha_s$ up to step $t$ |
| $\beta_t$ | Noise variance at time step $t$ |
| $q(\mathbf{x}_t \mid \mathbf{x}_0)$ | Forward diffusion process |
| $p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ | Reverse denoising process |
| $\mathcal{L}_{\text{VLB}}$ | Variational lower bound loss |
| $\epsilon_{\theta}(\mathbf{x}_t, t)$ | Noise prediction network |

A.2 Common Datasets

| Dataset | Domain | Size | Challenges |
| --- | --- | --- | --- |
| FFHQ | Face images | 70,000 images | High resolution, diversity |
| CelebA | Face images | 202,599 images | Multi-label, attribute control |
| ImageNet | Natural images | 1.28 million images | Many classes, large scale |
| LSUN | Natural images | 10 million images | Many scenes, diversity |
| MNIST | Handwritten digits | 60,000 train / 10,000 test | High-dimensional data, mode collapse |
| SVHN | Street-view digits | 73,257 train / 26,032 test | Complex backgrounds, lighting variation |
| Diabetic Retinopathy Dataset (DRD) | Medical images | 1,416 images | Privacy-sensitive, small scale |

A.3 Evaluation Metrics

| Metric | Domain | Definition |
| --- | --- | --- |
| FID | Image/video generation | Fréchet distance between generated and real image features |
| PSNR | Super-resolution | Peak signal-to-noise ratio, measures image quality |
| SSIM | Super-resolution | Structural similarity, measures structural information |
| BLEU | Text generation | Similarity between generated and reference text |
| ROUGE | Text generation | Overlap between generated and reference text |
| mAP | Object detection | Mean average precision, measures detection performance |
| EER | Speech recognition | Equal error rate, measures recognition performance |
