A Self-Contained Tutorial on Ito Stochastic Differential Equations for Diffusion Models¶

In the machine learning community, there has recently been great interest in Stochastic Differential Equations (SDEs) in the context of diffusion models. SDEs are the concept driving powerful generative AI tools from image generation (see DALL-E 2 or Stable Diffusion) to protein generation (see RF-Diffusion). However, the accompanying papers assume an understanding of non-trivial results from the theory of SDEs - such as the time-reversal formula. As a result, these works are increasingly inaccessible to non-SDE experts. This blog post aims to help out here and give a self-contained introduction to SDEs for diffusion models. All mathematical results about SDEs that are used for diffusion models are proven here. More specifically, in this tutorial, you will learn:

  1. What SDEs are.
  2. How SDEs are implemented.
  3. The distribution and convergence properties of SDEs with affine drift coefficients.
  4. An overview of famous diffusion models and how they can be framed as SDEs with affine drift coefficients.
  5. How SDEs can be time-reversed and how that corresponds to novel generation.
  6. How to train a diffusion model.

First time hearing about SDEs?: If you have never heard of SDEs, I recommend first reading my previous tutorial, where I explore the Langevin SDE that is fundamental for machine learning.

More interested in practical implementations? In my next tutorial, we are going to put the theory into practice.

Required background: I assume that you have background knowledge in probability theory.

Acknowledgement: The content of this tutorial has partially grown out of several nights spent trying to understand the famous diffusion model SDE paper by Song et al..

In [1]:
import os
from itertools import product
from typing import Callable, List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm        # progress bars (requires ipywidgets: !pip install ipywidgets)
from celluloid import Camera          # getting the camera for animations
from IPython.display import HTML, Video

1. Introduction: What are Stochastic Differential Equations?¶

The solution of a stochastic differential equation (SDE) is a stochastic process. A stochastic process is a collection of random variables $X_t\in\mathbb{R}^d$ that evolve over time $t\geq 0$. While each $X_t$ is random for each individual $t$, the beauty of stochastic processes comes in studying how $X_{t+s}$ is related to $X_{t}$ - i.e. how values at different time points depend on each other.

1.1. Stochastic Processes and Brownian Motion¶

A fundamental stochastic process is Brownian motion $W_t$. A Brownian motion starts at $W_0=0$ and is characterized by:

  1. Normal increments: Increments are normally distributed with variance scaling proportional to the time difference: $W_{t+s}-W_{t}\sim\mathcal{N}(0,s\mathbf{I}_d)$ for all $t,s\geq 0$.
  2. Independent increments: $W_{t_1}-W_{t_2}$ is independent of $W_{t_2}-W_{t_3}$ for $t_1>t_2>t_3$.

We can sample trajectories of a Brownian motion by discretizing the stochastic process by: $$W_{t+s} = W_{t} +\sqrt{s} \epsilon\quad\text{where }\epsilon\sim\mathcal{N}(0,1)$$

Let's plot 5 trajectories:

In [2]:
def run_brownian_motion(n_traj: int = 5, n_grid_points: int = 10000, t_start: float = 0.0, t_end: float = 10.0):
    """Sample n_traj independent Brownian motion trajectories on a regular time grid."""
    time_grid = np.linspace(t_start,t_end,n_grid_points)
    step_size = time_grid[1]-time_grid[0]
    x_0 = np.array([n_traj*[0.0]])                                  # W_0 = 0
    gaussian_noise = np.random.normal(size=(len(time_grid)-1,n_traj))
    # W_{t+s} = W_t + sqrt(s)*eps  ->  cumulative sum of scaled increments
    brownian_motion = np.sqrt(step_size)*gaussian_noise.cumsum(axis=0)
    brownian_motion = np.concatenate([x_0,brownian_motion],axis=0)
    return time_grid,brownian_motion

fig,ax = plt.subplots(figsize=(9,9))
time,X_t = run_brownian_motion()
sample_labels = [f"Trajectory {i}" for i in range(1,6)]
ax.plot(time,X_t,label=sample_labels)
ax.set_xlabel("time t")
ax.set_ylabel("W_t")
ax.legend();

If you feel reminded of a graph describing a stock market index, good guess! Brownian motion is widely used in Mathematical Finance to model stock prices.

1.2 From Ordinary Differential Equations to Stochastic Differential Equations¶

To understand SDEs, let's start by reviewing classical ordinary differential equations (ODEs). An ODE is an equation with the following characteristics:

  1. Its solution $\mathbf{x}(t)$ is a deterministic function $\mathbf{x}:\mathbb{R}\to\mathbb{R}^d$.
  2. The rate of change $$\frac{d}{dt}\mathbf{x}(t) = f(\mathbf{x}(t),t)$$ is time- and location-dependent and given by a function $$f:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^d,\quad(\mathbf{x},t)\mapsto f(\mathbf{x},t).$$
  3. A fixed start value: $\mathbf{x}(0)=v\in\mathbb{R}^d$.

A Stochastic Differential Equation "is an ODE but just random". In other words, an SDE is an equation with the following characteristics:

  1. Its solution $X_t$ is a stochastic process. I.e. a sample of a stochastic process is a function $t\to X_t\in\mathbb{R}^d$.
  2. The rate of change $$dX(t) = f(X_t,t)dt+g(t)dW_t$$ is given by
    • A. A deterministic drift specified by a function $f$.
    • B. A random drift specified by a function $g\geq 0$ and a Brownian motion $W_t$.
  3. A start distribution $X_0\sim p_0$.

Intuitively, the above equation means that for (infinitesimally) small $s>0$, we have

$$X_{t+s}\approx X_{t}+s f(X_t,t)+g(t)(W_{t+s}-W_{t})=X_{t}+s f(X_t,t)+g(t)\sqrt{s}\,\epsilon\quad\text{where }\epsilon\sim\mathcal{N}(0,\mathbf{I}_d)$$

In other words, the conditional distribution for the next time step $X_{t+s}$ given the current time step $X_{t}$ is given by $$X_{t+s}|X_t\sim\mathcal{N}(X_{t}+s f(X_t,t),sg^2(t))$$

So the deterministic drift $f$ describes the infinitesimal change in mean and the volatility function $g$ describes the infinitesimal standard deviation.

2. Implementing SDEs with the Euler-Maruyama method¶

In this section, we simulate SDEs and look at a few example SDEs.

To simulate an SDE, we discretize it in time and then sample the next time step with the above update rule:

  • Input: number of steps $n_{\text{steps}}$, step size $s>0$
  • Sample $X_0\sim p_0$
  • Set $t=0$
  • For $i=1,...n_{\text{steps}}$:

    Sample $\epsilon\sim\mathcal{N}(0,I)$

    Set $X_{t+s} = X_{t} + sf(X_{t},t)+g(t)\sqrt{s}\epsilon$

  • Return: $[X_{0},X_{s},X_{2s},X_{3s},\dots,X_{sn_{\text{steps}}}]$

The above method is called the Euler-Maruyama method.

In [3]:
def run_sde(f_determ_drift: Callable,
            g_random_drift: Callable,
            x_start: np.ndarray,
            t_start: float = 0.0, 
            t_end: float = 1.0, 
            n_steps: int = 10000,
            **kwargs):
    """Function to run stochastic differential equation. We assume a deterministic initial distribution p_0."""
    
    #Number of trajectories, dimension of data:
    n_traj,dim_x = x_start.shape

    #Compute time grid for discretization and step size:
    time_grid = np.linspace(t_start,t_end,n_steps)
    step_size = time_grid[1]-time_grid[0]

    #Compute the random drift at every time point:
    random_drift_grid = g_random_drift(time_grid)
    
    #Sample random drift at every time point:
    noise = np.random.normal(size=(n_steps,n_traj,dim_x))
    random_drift_grid_sample = np.sqrt(step_size)*random_drift_grid[:,None,None]*noise
    
    #Initialize list of trajectory:
    x_traj = [x_start]
    
    
    for idx,time in tqdm(enumerate(time_grid)):
        
        #Get last location and time
        x = x_traj[idx]
        t = float(time_grid[idx])
        
        #Get deterministic drift and random drift sample
        determ_drift = step_size*f_determ_drift(x,t)
        random_drift_sample = random_drift_grid_sample[idx]
        
        #Compute next step:
        next_step = x + determ_drift + random_drift_sample
        
        #Save step:
        x_traj.append(next_step)

    return np.stack(x_traj),time_grid    

Note: There are fancier (but much less intuitive) methods to compute solutions other than the Euler-Maruyama method. These are beyond the scope of this tutorial.

2.1. Define Example Random Drift Functions g¶

Let's define two example random drift functions and plot them:

  • Linear: $g(t)=0.1+0.2t$
  • Periodic: $g(t)=\sqrt{\sin(2\pi t)+1.10}$
In [4]:
def linear_g(t,constant=0.2):
    """t - 1d np.ndarray or float
    Returns: 1d np.ndarray or float"""
    return 0.1+constant*t

def periodic_g(t):
    """t - 1d np.ndarray or float
    Returns: 1d np.ndarray or float"""
    return np.sqrt(np.sin(2*np.pi*t)+1.10)

def plot_g_random_drift(drift_func_list: List[Callable], min_t: float = 0.0, max_t: float = 5.0,n_grid_points: int = 10000):
    """Function to plot a random drift function g."""
    fig, axs = plt.subplots(1,len(drift_func_list),figsize=(len(drift_func_list)*6,6))
    time = np.linspace(min_t,max_t,n_grid_points)
    for idx,drift_func in enumerate(drift_func_list):
        axs[idx].plot(time,drift_func(time))
        axs[idx].set_title(drift_func.__name__)
        axs[idx].set_xlabel("time")
        axs[idx].set_ylabel("g")

plot_g_random_drift([linear_g,periodic_g])

2.2. Define Example Deterministic Drift Functions f¶

Let's define two example deterministic drift functions $f$:

  • Shift: $f(x,t)=-(x-4)$
  • Periodic: $f(x,t)=2\cos(2\pi t^2)\cos(2\pi x^2)$
In [5]:
def shift_f(x,t,shift=4.0):
    """x - shape (n,k): a space coefficients 
       t - shape (n) or float: time coefficients"""
    return -(x-shift)

def periodic_f(x,t):
    """x - shape (n,k): space coefficients 
       t - shape (n) or float: time coefficients"""

    if isinstance(t,float):
        return 2*np.cos(2*np.pi*t**2)*np.cos(2*np.pi*x**2)
    else:
        return 2*np.cos(2*np.pi*t**2)[:,None]*np.cos(2*np.pi*x**2)

def plot_f_determ_drift(determ_func_list: List[Callable],
                min_t: float = 0.0, 
                max_t: float = 2.5, 
                min_x: float = 0.0,
                max_x: float = 2.5,
                n_grid_points: int = 1000,
                plot_contours=False):
    """Function to plot the function f(x,t) in an Ito-SDE."""
    time = np.linspace(min_t,max_t,n_grid_points)
    oned_grid = np.linspace(min_x, max_x, n_grid_points)
    twod_grid = np.array([[x,t] for t,x in product(time,oned_grid)])
    extent = [min_x, max_x, min_t, max_t]
    
    fig,axs = plt.subplots(1,len(determ_func_list),figsize=(6*len(determ_func_list),6))
    
    for idx, determ_func in enumerate(determ_func_list):
        derivative = determ_func(twod_grid[:,0].reshape(-1,1),twod_grid[:,1])
        axs[idx].imshow(derivative.reshape(n_grid_points,n_grid_points).transpose(),interpolation='bilinear',origin='lower', extent = extent, cmap=plt.get_cmap('YlOrRd'))
        if plot_contours:
            axs[idx].contour(derivative.reshape(n_grid_points,n_grid_points).transpose(),origin='lower', extent = extent, colors='black')
        axs[idx].set_title(determ_func.__name__)
        axs[idx].set_xlabel("time t")
        axs[idx].set_ylabel("location x")

plot_f_determ_drift([shift_f,periodic_f])

2.3 Animation of SDEs¶

Let's write a function that animates SDEs over time.

In [6]:
def animate_ito_sde(f_drift: Callable, g_drift: Callable, fpath_anim: str = None, n_steps: int = 100000, n_samples: int = 400,n_grid_points: int = 100, initial_dist:str ="normal", t_end=4.0):
    """Function to animate and plot a 1d SDE over time."""
    if initial_dist == "normal":
        x_start = np.random.normal(size=n_samples).reshape(-1,1)
    elif initial_dist == "uniform":
        x_start = np.random.uniform(size=n_samples).reshape(-1,1)
    else:
        raise ValueError
        
    x_traj,time_grid = run_sde(f_drift,g_drift,x_start=x_start,t_end=t_end,n_steps=n_steps)
    x_traj = x_traj.squeeze()
    
    fig, axs = plt.subplots(3,1,figsize=(24,36))
    camera = Camera(fig)
    
    oned_grid = np.linspace(x_traj.min(), x_traj.max(), len(time_grid))
    twod_grid = np.array([[x,t] for t,x in product(time_grid,oned_grid)])
    extent = [time_grid.min(), time_grid.max(),x_traj.min(), x_traj.max()]
    derivative = f_drift(twod_grid[:,0].reshape(-1,1),twod_grid[:,1])
    derivative_plot = derivative.reshape(len(time_grid),len(time_grid)).transpose()
    
    g_drift_per_time = g_drift(time_grid)

    for idx in tqdm(range(1,len(x_traj),max(int(len(x_traj)/200),1))):
    
        #Plot evolution over time over deterministic drift:
        axs[0].imshow(derivative_plot,interpolation='bilinear',origin='lower', extent = extent, cmap=plt.get_cmap('YlOrRd'),aspect='auto')
        if idx>5:
            for idy in range(10):
                axs[0].plot(time_grid[:idx],x_traj[:idx,idy].squeeze()) #,label=sample_labels)
            axs[0].plot(time_grid[:idx],x_traj[:idx].squeeze().mean(axis=1),color='black',linewidth=10)
        axs[0].set_title("Example trajectories (color=f)")

        #Plot distribution:
        sns.kdeplot(x_traj[idx],ax=axs[1])
        axs[1].set_title("Distribution of X_t")

        #Plot random drift:
        axs[2].plot(time_grid[:idx],g_drift_per_time[:idx])
        axs[2].set_title("Random drift g")
        camera.snap()
        
    animation = camera.animate() # animation ready
    plt.close()
    return animation

Set hyperparameters.

In [7]:
RUN_ANIMATION = False
N_STEPS = 1000
N_SAMPLES = 10000

2.4. Run first simple SDE¶

Let's set:

  • $X_0\sim\mathcal{N}(0,1)$
  • $f(x,t)=-(x-4)$
  • $g(t)=\sqrt{\sin(2\pi t)+1.10}$

and run the SDE.

The below animation shows:

  • Top figure: 10 example trajectories (plotted over the function $f(x,t)$)
  • Middle figure: the change of distribution over time
  • Bottom figure: the change in drift $g(t)$ over time.
In [8]:
fpath = f"simple_sde.mp4"
animation = animate_ito_sde(shift_f,periodic_g,n_steps=N_STEPS,n_samples=N_SAMPLES)
animation.save(fpath)
HTML(animation.to_html5_video())

We can observe that:

  • The mean $\mathbb{E}[X_t]$ converges towards $4$ over time (black line in the 1st figure of the animation). The reason is that $f(x,t)=-(x-4)$ pushes $x$ towards $4$.
  • The curves periodically become "rougher" or noisier due to the periodic $g(t)=\sqrt{\sin(2\pi t)+1.10}$ (see plots 2 and 3 above).
  • Combining both effects, the distribution moves from a Gaussian distribution $\mathcal{N}(0,1)$ towards a Gaussian distribution $\mathcal{N}(4,\sigma^2(t))$ with mean 4 and a variance $\sigma^2(t)$ that shifts periodically (see plot 1 above).

2.5. Run complex SDE¶

Let's run a more complex SDE with:

  • $X_0\sim\mathcal{N}(0,1)$
  • $f(x,t)=2\cos(2\pi t^2)\cos(2\pi x^2)$
  • $g(t)=0.1+0.2t$
In [9]:
fpath = f"complex_sde.mp4"
animation = animate_ito_sde(periodic_f,linear_g,n_steps=N_STEPS,n_samples=N_SAMPLES)
animation.save(fpath)
HTML(animation.to_html5_video())
# HTML(f"""
# <video muted autoplay width="1280" height="960" controls>
#   <source src="complex_sde.mp4" type="video/mp4">
# </video>
# """)

We can observe that:

  • The mean $\mathbb{E}[X_t]$ stays constant over time (black line in the 1st figure of the animation). The reason is that $f$ is symmetric around $0$ and therefore does not push probability mass in any direction, and that we initialised with a zero-mean distribution ($\mathcal{N}(0,1)$).
  • The curves become "rougher" or noisier over time due to the linear increase of $g$.
  • The periodicity of $f$ causes the distribution to oscillate between a Gaussian shape and a shape with "ups and downs". However, as time increases, the noise coming from $g$ dominates and the distribution increasingly resembles a Gaussian.

3. The expectation, variance, and distribution of SDEs with affine drift coefficients¶

We want to answer 3 fundamental questions in this chapter:

  1. Can we say something about the expectation value $\mathbb{E}[X_t]$?
  2. Can we say something about the variance $\mathbb{V}[X_t]$?
  3. Can we say something about the distribution $X_t\sim p_t$?

In general, the distribution of $X_t$ can be relatively complex. However, under a simple assumption that covers all models relevant for machine learning, we can derive closed-form expressions.

3.1. Affine SDEs¶

In order to be able to say more, we make the following assumption of affine drift coefficients: we assume that $f$ has an affine form, i.e. $f(x,t)=a(t)x+b(t)$ for functions $a,b:\mathbb{R}\to\mathbb{R}$.

Then we have that for $t\geq 0$ and small $s\geq 0$: $$ \begin{align} X_{t+s}-X_{t}\approx sa(t)X_t+sb(t)+g(t)\sqrt{s}\epsilon \quad\text{(1)} \end{align} $$ where $\epsilon\sim\mathcal{N}(0,\mathbb{1})$.

3.2 Expectation for affine drift coefficients¶

Equation $(1)$ implies that: $$ \begin{align*} \mathbb{E}[X_{t+s}|X_0]-\mathbb{E}[X_{t}|X_0]\approx &sa(t)\mathbb{E}[X_t|X_0]+sb(t)+g(t)\sqrt{s}\mathbb{E}[\epsilon]\\ =&sa(t)\mathbb{E}[X_t|X_0]+sb(t) \end{align*} $$ as $\epsilon$ is independent of $X_0$. In other words, the function $e(t)=\mathbb{E}[X_t|X_0]$ satisfies the linear ODE: $$\begin{align*} e'(t)=\lim\limits_{s\to 0} \left[\frac{\mathbb{E}[X_{t+s}|X_0]-\mathbb{E}[X_{t}|X_0]}{s}\right]=a(t)e(t)+b(t),\quad e(0)=X_0 \end{align*}$$ The solution is therefore given by: $$\begin{align*} e(t)=\left(X_0+\int\limits_{0}^{t}\exp(-\int\limits_{0}^{s}a(r)dr)b(s)ds\right)\exp\left(\int\limits_{0}^{t}a(s)ds\right) \end{align*}$$ (This can be verified by inserting the right-hand side into the ODE.) Therefore, $$\begin{align*} \mathbb{E}[X_t]=\mathbb{E}[\mathbb{E}[X_t|X_0]]=\left(\mathbb{E}[X_0]+\int\limits_{0}^{t}\exp(-\int\limits_{0}^{s}a(r)dr)b(s)ds\right)\exp\left(\int\limits_{0}^{t}a(s)ds\right) \end{align*}$$

3.3 Variance for affine drift coefficients¶

Let $v(t)=\mathbb{V}[X_t|X_0]$. From equation $(1)$, we can derive: $$\begin{align*} &\mathbb{V}[X_{t+s}|X_0]=(1+sa(t))^2\mathbb{V}[X_t|X_0]+sg^2(t)=\left(1+2sa(t)+s^2a^2(t)\right)\mathbb{V}[X_t|X_0]+sg^2(t)\\ \Rightarrow&v'(t)=\lim\limits_{s\to 0}\frac{\mathbb{V}[X_{t+s}|X_0]-\mathbb{V}[X_t|X_0]}{s}=\lim\limits_{s\to 0}\left[2a(t)+sa^2(t)\right]\mathbb{V}[X_t|X_0]+g^2(t)=2a(t)\mathbb{V}[X_t|X_0]+g^2(t)=2a(t)v(t)+g^2(t)\\ \end{align*}$$

As $v(0)=0$, the solution is given by: $$\begin{align*} \mathbb{V}[X_{t}|X_0]=v(t)=\left(\int\limits_{0}^{t}\exp(-\int\limits_{0}^{s}2a(r)dr)g^2(s)ds\right)\exp\left(\int\limits_{0}^{t}2a(s)ds\right) \end{align*}$$

We use the law of total variance to get: $$\begin{align*} \mathbb{V}[X_{t}]=&\mathbb{E}[\mathbb{V}[X_{t}|X_0]]+\mathbb{V}[\mathbb{E}[X_t|X_0]]\\ =&\left(\int\limits_{0}^{t}\exp(-\int\limits_{0}^{s}2a(r)dr)g^2(s)ds\right)\exp\left(\int\limits_{0}^{t}2a(s)ds\right)+\exp\left(2\int\limits_{0}^{t}a(s)ds\right)\mathbb{V}[X_0] \end{align*}$$

3.4. Gaussian distribution of conditional distribution¶

Let's rearrange equation $(1)$ to get: $$ \begin{align} X_{t+s}\approx (1+sa(t))X_t+sb(t)+g(t)\sqrt{s}\epsilon \end{align} $$ We see that, in discretized form, $X_{t+s}$ is an autoregressive process driven by Gaussian white noise - i.e. each discretized $X_t|X_0$ is again normally distributed. Since the limit of Gaussian distributions is also Gaussian, $X_t|X_0$ is Gaussian distributed as well.

3.5. Summary¶

If the drift coefficients are affine, i.e. $f(x,t)=a(t)x+b(t)$, the SDE $$dX(t) = f(X_t,t)dt+g(t)dW_t=[a(t)X_t+b(t)]dt+g(t)dW_t$$ fulfils:

  1. Expectation value: $$\begin{align*}\mathbb{E}[X_t|X_0]&=\left(X_0+\int\limits_{0}^{t}\exp(-\int\limits_{0}^{s}a(r)dr)b(s)ds\right)\exp\left(\int\limits_{0}^{t}a(s)ds\right)\quad\text{(ECond)}\\ \mathbb{E}[X_t]&=\left(\mathbb{E}[X_0]+\int\limits_{0}^{t}\exp(-\int\limits_{0}^{s}a(r)dr)b(s)ds\right)\exp\left(\int\limits_{0}^{t}a(s)ds\right)\quad\text{(E)} \end{align*}$$
  2. Variance: $$\begin{align*} \mathbb{V}[X_{t}|X_0] =&\left(\int\limits_{0}^{t}\exp(-\int\limits_{0}^{s}2a(r)dr)g^2(s)ds\right)\exp\left(\int\limits_{0}^{t}2a(s)ds\right)\quad\text{(VCond)}\\ \mathbb{V}[X_{t}] =&\left(\mathbb{V}[X_0]+\int\limits_{0}^{t}\exp(-\int\limits_{0}^{s}2a(r)dr)g^2(s)ds\right)\exp\left(\int\limits_{0}^{t}2a(s)ds\right)\quad\text{(V)} \end{align*}$$
  3. Normal conditional distribution: $$\begin{align*} X_{t}|X_0 \sim&\mathcal{N}(\mathbb{E}[X_t|X_0],\mathbb{V}[X_t|X_0]) \end{align*}$$

Note that the above is crucial for diffusion models: the affine drift coefficients let us explicitly compute the distribution of $X_t|X_0$ as a normal distribution. This will later become crucial to train these models via score matching.
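To make the summary concrete, here is a minimal numerical sketch (not part of the original derivation) that evaluates (ECond) and (VCond) by trapezoidal quadrature for arbitrary vectorized callables a, b, g. The helper name affine_cond_moments and the example choices $a(t)=-1$, $b(t)=4$, $g(t)=1$ are illustrative assumptions; for that case, $\mathbb{E}[X_t|X_0=0]=4(1-e^{-t})$ and $\mathbb{V}[X_t|X_0]=\frac{1}{2}(1-e^{-2t})$.

In [ ]:
import numpy as np

def affine_cond_moments(a, b, g, x0, t, n_quad=2000):
    """Sketch: evaluate (ECond) and (VCond) for dX_t = [a(t)X_t + b(t)]dt + g(t)dW_t
    via trapezoidal quadrature. a, b, g are vectorized callables, x0 a fixed start value."""
    s = np.linspace(0.0, t, n_quad)
    a_vals = a(s)
    # cumulative integral A(s) = int_0^s a(r) dr on the grid:
    A_cum = np.concatenate([[0.0], np.cumsum(0.5*(a_vals[1:]+a_vals[:-1])*np.diff(s))])
    A_t = A_cum[-1]                                           # int_0^t a(s) ds
    cond_mean = (x0 + np.trapz(np.exp(-A_cum)*b(s), s))*np.exp(A_t)
    cond_var = np.trapz(np.exp(-2*A_cum)*g(s)**2, s)*np.exp(2*A_t)
    return cond_mean, cond_var

# Cross-check against a case with known closed form: a(t)=-1, b(t)=4, g(t)=1, X_0=0
mean, var = affine_cond_moments(lambda s: -np.ones_like(s),
                                lambda s: 4*np.ones_like(s),
                                lambda s: np.ones_like(s),
                                x0=0.0, t=2.0)
print(mean, 4*(1-np.exp(-2.0)))        # (ECond): 4*(1-exp(-t))
print(var, 0.5*(1-np.exp(-4.0)))       # (VCond): (1-exp(-2t))/2 for constant g=1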

3.6. Equations in action¶

Let's see the above equations in action and revisit the SDE with:

  • $X_0\sim\mathcal{N}(0,1)$
  • $f(x,t)=-(x-4)$
  • $g(t)=\sqrt{\sin(2\pi t)+1.10}$

$f$ is affine with $f(x,t)=a(t)x+b(t)$ with $a(t)=-1$ and $b(t)=+4$. Therefore, we can derive that: $$ \begin{align*} \mathbb{E}[X_t]&=\exp(-t)\int\limits_{0}^{t}\exp(s)4ds=4\exp(-t)\left[\exp(t)-1\right]=4[1-\exp(-t)]\\ \mathbb{V}[X_t]&=\left(1+\int\limits_{0}^{t}\exp(2s)[\sin(2\pi s)+1.10]ds\right)\exp(-2t)\\ &=\left(1+\frac{1.10}{2}[\exp(2t)-1]+\left[\exp(2s)\frac{2\sin(2\pi s)-2\pi\cos(2\pi s)}{4+4\pi^2}\right]_{0}^{t}\right)\exp(-2t)\\ &=\left(1+0.55[\exp(2t)-1]+\frac{1}{2}\exp(2t)\frac{\sin(2\pi t)-\pi\cos(2\pi t)}{1+\pi^2}+\frac{\pi}{2+2\pi^2}\right)\exp(-2t)\\ &=0.55+\exp(-2t)[1-0.55+\frac{\pi}{2+2\pi^2}]+\frac{1}{2}\frac{\sin(2\pi t)-\pi\cos(2\pi t)}{1+\pi^2}\\ &=0.55+\exp(-2t)[0.45+\frac{\pi}{2+2\pi^2}]+\frac{1}{2}\frac{\sin(2\pi t)-\pi\cos(2\pi t)}{1+\pi^2} \end{align*} $$

In [10]:
n_traj = 1000
import math

def ground_truth_mean(t):
    return 4*(1-np.exp(-t))

def ground_truth_std(t):
    summand_1 = 0.55 
    summand_2 = np.exp(-2*t)*(0.45+0.5*math.pi/(1+math.pi**2))
    summand_3 = 0.5*(np.sin(2*math.pi*t)-math.pi*np.cos(2*math.pi*t))/(1+math.pi**2)
    return np.sqrt(summand_1 + summand_2 + summand_3)

x_init = np.random.normal(size=n_traj).reshape(-1,1)
x_traj, time_grid = run_sde(shift_f,periodic_g,x_start=x_init,t_end=10.0)
mean_x_traj = x_traj.squeeze().mean(axis=1)[:-1]
std_x_traj = x_traj.squeeze().std(axis=1)[:-1]

fig,(ax1,ax2,ax3) = plt.subplots(1,3,figsize=(18,6))
ax1.plot(time_grid,ground_truth_mean(time_grid),label="derived formula")
ax1.plot(time_grid,mean_x_traj,label="simulation")
ax1.legend()
ax1.set_title("E[X_t]")

ax2.plot(time_grid,ground_truth_std(time_grid),label="derived formula")
ax2.plot(time_grid,std_x_traj,label="simulation")
ax2.legend()
ax2.set_title("std of X_t")

pd.DataFrame(x_traj[:-1,:10].squeeze(),index=time_grid).plot(ax=ax3)
ax3.set_title("Example trajectories")
Out[10]:
Text(0.5, 1.0, 'Example trajectories')

4. SDE types for diffusion models¶

Next, we would like to see how the above formulas can be applied to diffusion models. There are three frameworks to build SDEs for diffusion models: variance-exploding (VE), variance-preserving (VP), and sub-VP SDEs.

Throughout this chapter, we consider noise functions $\beta(t)$ that satisfy:

  1. $\beta(0)=0$
  2. $\beta'(t)\geq 0$
  3. $\beta(t)\to\infty$ for $t\to\infty$

4.1. Variance Preserving (VP) SDE¶

In the paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics, the authors proposed diffusion models that gradually add noise to the data, inspired by non-equilibrium thermodynamics. Later, it was shown by Song et al. that this construction can be re-framed as an SDE. Their diffusion runs from time $t=0$ to $t=T$ in $n$ steps $t_i=i\cdot T/n$ for $i=0,1,\dots,n$, and at each step

$$X_{t_{i+1}}=\sqrt{1-(\beta(t_{i+1})-\beta(t_i))}\,X_{t_i}+\sqrt{\beta(t_{i+1})-\beta(t_i)}\,\epsilon\quad\text{where }\epsilon\sim\mathcal{N}(0,\mathbf{I})$$

So the conditional distribution is given by:

\begin{align*} q(x_{t_{i+1}}|x_{t_i}) &= \mathcal{N}(x_{t_{i+1}};x_{t_{i}}\sqrt{1-(\beta(t_{i+1})-\beta(t_i))},(\beta(t_{i+1})-\beta(t_i))\mathbb{1}) \end{align*}

Let's compute $f$ for $n\to\infty$: \begin{align*} f(x,t)&=\lim\limits_{h\to 0}\frac{\mathbb{E}[X_{t+h}-X_{t}|X_t=x]}{h}\\ &=\lim\limits_{h\to 0}\frac{x\sqrt{1-(\beta(t+h)-\beta(t))}-x}{h}\\ &=x\lim\limits_{h\to 0}\frac{\sqrt{1-(\beta(t+h)-\beta(t))}-1}{h}\\ &=x\frac{d}{dh}\sqrt{1-\beta(t+h)+\beta(t)}|_{h=0}\\ &=-x\frac{1}{2}\beta'(t) \end{align*} by the chain rule.

Let's compute $g$ for $n\to\infty$: \begin{align*} g(t)=& \sqrt{\lim\limits_{h\to 0}\frac{\mathbb{V}[X_{t+h}|X_t=x]}{h}}\\ =&\sqrt{\lim\limits_{h\to 0}\frac{\beta(t+h)-\beta(t)}{h}}\\ =&\sqrt{\beta'(t)} \end{align*}

So the time-continuous diffusion model is given by the SDE: \begin{align*} dX_t = -\frac{1}{2}X_t\beta'(t)dt+\sqrt{\beta'(t)}dW_t \end{align*} Therefore, using equation (E) above, the expectation value is given by (setting $b(s)=0,a(t)=-\frac{1}{2}\beta'(t)$): $$\begin{align*} \mathbb{E}[X_t|X_0]&=\left(X_0+0\right)\exp\left(\int\limits_{0}^{t}-\frac{1}{2}\beta'(s)ds\right) =X_0\exp\left(-\frac{1}{2}\beta(t)\right)\\ \mathbb{E}[X_t]&=\mathbb{E}[X_0]\exp\left(-\frac{1}{2}\beta(t)\right) \end{align*} $$ As $\beta(t)$ is monotonically increasing, we see that the expectation decays exponentially towards $0$.

Using equation (V) above, the variance is given by (setting $a(t)=-\frac{1}{2}\beta'(t),g(t)=\sqrt{\beta'(t)}$):

$$\begin{align*} \mathbb{V}[X_{t}|X_0]=&\left(\int\limits_{0}^{t}\exp(\int\limits_{0}^{s}\beta'(r)dr)\beta'(s)ds\right)\exp\left(-\beta(t)\right)\\ =&\left(\int\limits_{0}^{t}\exp(\beta(s))\beta'(s)ds\right)\exp\left(-\beta(t)\right)\\ =&\left(\exp(\beta(t))-1\right)\exp\left(-\beta(t)\right)\\ =&1-\exp\left(-\beta(t)\right)\\ \\ \mathbb{V}[X_{t}] =&1-\exp\left(-\beta(t)\right)+\mathbb{V}[X_0]\exp\left(-\beta(t)\right)\\ =&1+\left(\mathbb{V}[X_0]-1\right)\exp\left(-\beta(t)\right) \end{align*}$$

Therefore, we see that the variance is exponentially converging to $1$. If $\mathbb{V}[X_0]=1$, i.e. the distribution has unit variance, we see that the variance is constantly $1$ - hence, the name variance-preserving (VP). In any case, we find that there is exponential convergence into the normal distribution:

$$X_t|X_0\sim \mathcal{N}(\mathbb{E}[X_t|X_0],\mathbb{V}[X_t|X_0])\to \mathcal{N}(0,1)\quad\text{as }t\to\infty$$

Let's simulate the SDE for the linear choice $\beta(t)=t$ (i.e. constant $\beta'(t)=1$) and $X_0\sim\mathcal{N}(1,9^2)$. We observe the proven exponential convergence towards zero mean and unit standard deviation:

In [11]:
n_traj = 10000
INIT_MEAN = 1
INIT_STD = 9
INIT_VAR = INIT_STD**2
BETA_MIN = 0.0
BETA_MAX = 1.0
import math

def linear_beta(t): return BETA_MIN+(BETA_MAX-BETA_MIN)*t
def vp_g(t): return np.ones(shape=t.shape)*np.sqrt(BETA_MAX-BETA_MIN)
def vp_f(x,t):
    # beta'(t) = BETA_MAX-BETA_MIN is constant for the linear beta(t) used here
    return -0.5*(BETA_MAX-BETA_MIN)*x
        

def ground_truth_mean(t): return INIT_MEAN*np.exp(-0.5*linear_beta(t))
def ground_truth_std(t): return np.sqrt(1+(INIT_VAR-1)*np.exp(-linear_beta(t)))

x_init = np.random.normal(loc=INIT_MEAN,scale=INIT_STD,size=n_traj).reshape(-1,1)
x_traj, time_grid = run_sde(vp_f,vp_g,x_start=x_init,t_end=15.0)
mean_x_traj = x_traj.squeeze().mean(axis=1)[:-1]
std_x_traj = x_traj.squeeze().std(axis=1)[:-1]

fig,(ax1,ax2,ax3) = plt.subplots(1,3,figsize=(18,6))
ax1.plot(time_grid,ground_truth_mean(time_grid),label="derived formula")
ax1.plot(time_grid,mean_x_traj,label="simulation")
ax1.legend()
ax1.set_title("E[X_t]")
ax1.hlines(0,ax1.get_xlim()[0],ax1.get_xlim()[1],linestyle='--',color='black')
#ax1.set_ylim(-1,INIT_MEAN+1)

ax2.plot(time_grid,ground_truth_std(time_grid),label="derived formula")
ax2.plot(time_grid,std_x_traj,label="simulation")
ax2.legend()
ax2.set_title("std of X_t")
ax2.hlines(1,ax2.get_xlim()[0],ax2.get_xlim()[1],linestyle='--',color='black')
ax2.set_ylim(0,INIT_STD+1)

pd.DataFrame(x_traj[:-1,:10].squeeze(),index=time_grid).plot(ax=ax3)
ax3.set_title("Example trajectories")
Out[11]:
Text(0.5, 1.0, 'Example trajectories')

Deep Dive (Optional): Rigorous proof for convergence. A rigorous proof for the convergence uses the Wasserstein distance. More specifically, for every test function $f:\mathbb{R}\to\mathbb{R}$ with Lipschitz constant $L\leq 1$ (using the same $Z\sim\mathcal{N}(0,1)$ in both terms, $\mathbb{E}[|Z|]\leq 1$, and $|1-\sqrt{1-x}|\leq x$ for $x\in[0,1]$), we find that: $$\begin{align*} |\mathbb{E}[f(X_t)]-\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[f(Z)]|&\leq \mathbb{E}\big[|\mathbb{E}[f(X_t)|X_0]-\mathbb{E}[f(Z)]|\big] \\ &= \mathbb{E}\big[|\mathbb{E}[f(\sqrt{\mathbb{V}[X_t|X_0]}Z+\mathbb{E}[X_t|X_0])]-\mathbb{E}[f(Z)]|\big]\\ &\leq \mathbb{E}\big[|\sqrt{\mathbb{V}[X_t|X_0]}Z+\mathbb{E}[X_t|X_0]-Z|\big]\\ &\leq \mathbb{E}\big[|\sqrt{\mathbb{V}[X_t|X_0]}-1||Z|\big]+\mathbb{E}\big[|\mathbb{E}[X_t|X_0]|\big]\quad\quad(\text{VP-Conv})\\ &\leq \mathbb{E}\big[|\sqrt{1-\exp(-\beta(t))}-1|\big]+\mathbb{E}[|X_0|]\exp\left(-\frac{1}{2}\beta(t)\right)\\ &\leq \exp(-\beta(t))+\mathbb{E}[|X_0|]\exp\left(-\frac{1}{2}\beta(t)\right)\\ &= \left(\exp\left(-\tfrac{1}{2}\beta(t)\right)+\mathbb{E}[|X_0|]\right)\exp\left(-\frac{1}{2}\beta(t)\right) \to 0 \quad \text{as }t\to\infty \end{align*}$$

As the right-hand side is independent of $f$, we find that the Wasserstein distance satisfies $$\mathcal{W}(X_t,\mathcal{N}(0,1))\leq \left(\exp\left(-\tfrac{1}{2}\beta(t)\right)+\mathbb{E}[|X_0|]\right)\exp\left(-\frac{1}{2}\beta(t)\right) \to 0 \quad \text{ as }t\to\infty$$ I.e. we find that the variance-preserving SDE converges exponentially in the Wasserstein distance to a normal distribution (assuming $\beta(t)\to\infty$ as $t\to\infty$).

4.2. Variance-Exploding SDE¶

Song et al. introduced a different flavour of diffusion models leading to a process that adds more and more noise. Here, we simply inject more and more noise over a number of steps: \begin{align*} q(x_{t_{i+1}}|x_{t_i}) =& \mathcal{N} (x_{t_{i+1}}; x_{t_i},(\beta(t_{i+1})-\beta(t_i))\mathbb{1}) \end{align*} where $\beta(t)$ satisfies the same properties as above (i.e. $\beta(0)=0,\beta'(t)>0,\beta(t)\to\infty$ for $t\to\infty$). We can immediately see that this can be described by an SDE with: \begin{align*} f(x,t)&=0\\ g(t)&=\lim\limits_{h\to 0}\sqrt{\frac{\beta(t+h)-\beta(t)}{h}}\\ &=\sqrt{\beta'(t)} \end{align*}

Therefore, using equation (E) above, the expectation value is given by (setting $b(s)=0,a(t)=0$): $$\begin{align*} \mathbb{E}[X_t|X_0]&=X_0, \quad \mathbb{E}[X_t]=\mathbb{E}[X_0] \end{align*} $$ I.e. the expectation value is constant.

Using equation (V) above, the variance is given by (setting $a(t)=0,g(t)=\sqrt{\beta'(t)}$):

$$\begin{align*} \mathbb{V}[X_{t}|X_0] =&\int\limits_{0}^{t}\exp(\int\limits_{0}^{s}0dr)\beta'(s)ds\exp\left(0\right) =\beta(t)\\ \mathbb{V}[X_{t}] =&\mathbb{V}[X_0]+\beta(t) \end{align*}$$

Therefore, we see that the variance is increasing monotonically with $\beta(t)$ (remember, we needed to set $\beta'(t)>0$) - hence, the name variance-exploding (VE) SDE.

Let's simulate the SDE for the linear choice $\beta(t)=t$ and $X_0\sim\mathcal{N}(1,1)$.

In [12]:
n_traj = 10000
INIT_MEAN = 1
INIT_STD = 1
INIT_VAR = INIT_STD**2
BETA_MIN = 0.0
BETA_MAX = 1.0
import math

def linear_beta(t): return BETA_MIN+(BETA_MAX-BETA_MIN)*t
def ve_g(t): return np.ones(shape=t.shape)*np.sqrt(BETA_MAX-BETA_MIN)
def ve_f(x,t): return 0*x
        

def ground_truth_mean(t): return INIT_MEAN*np.ones(shape=t.shape)
def ground_truth_std(t): return np.sqrt(INIT_VAR+linear_beta(t))

x_init = np.random.normal(loc=INIT_MEAN,scale=INIT_STD,size=n_traj).reshape(-1,1)
x_traj, time_grid = run_sde(ve_f,ve_g,x_start=x_init,t_end=20.0,n_steps=10000)
mean_x_traj = x_traj.squeeze().mean(axis=1)[:-1]
std_x_traj = x_traj.squeeze().std(axis=1)[:-1]

fig,(ax1,ax2,ax3) = plt.subplots(1,3,figsize=(18,6))
ax1.plot(time_grid,ground_truth_mean(time_grid),label="derived formula")
ax1.plot(time_grid,mean_x_traj,label="simulation")
ax1.legend()
ax1.set_title("E[X_t]")
ax1.hlines(INIT_MEAN,ax1.get_xlim()[0],ax1.get_xlim()[1],linestyle='--',color='black')
ax1.set_ylim(INIT_MEAN-1,INIT_MEAN+1)

ax2.plot(time_grid,ground_truth_std(time_grid),label="derived formula")
ax2.plot(time_grid,std_x_traj,label="simulation")
ax2.legend()
ax2.set_title("std of X_t")

pd.DataFrame(x_traj[:-1,:10].squeeze(),index=time_grid).plot(ax=ax3)
ax3.set_title("Example trajectories")
Out[12]:
Text(0.5, 1.0, 'Example trajectories')

Naturally, the model does not converge, as the variance increases monotonically. However, a rescaled version does converge. Let $Y_t=\frac{X_t}{\sqrt{\beta(t)}}$. Using equations (VCond) and (ECond), we find $v(t)=\mathbb{V}[X_t|X_0]=\beta(t)$ and $e(t)=\mathbb{E}[X_t|X_0]=X_0$, so the rescaled process converges as shown in the deep dive below.

Deep Dive (Optional): Rescaled VE-SDE convergence. Conditionally on $X_0$, $$Y_t|X_0 \sim \frac{1}{\sqrt{\beta(t)}}\left(\sqrt{v(t)}Z+e(t)\right)=Z+\frac{X_0}{\sqrt{\beta(t)}}\quad\text{for }Z\sim\mathcal{N}(0,1)$$ Then for every test function $f:\mathbb{R}\to\mathbb{R}$ with Lipschitz constant $L\leq 1$: $$\begin{align*} |\mathbb{E}[f(Y_t)]-\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[f(Z)]|&\leq \mathbb{E}\big[|\mathbb{E}[f(Y_t)|X_0]-\mathbb{E}[f(Z)]|\big]\\ &\leq \mathbb{E}\left[\left|Z+\frac{X_0}{\sqrt{\beta(t)}}-Z\right|\right]\\ &= \frac{\mathbb{E}[|X_0|]}{\sqrt{\beta(t)}}\to 0 \quad \text{for }t\to\infty \end{align*}$$

As the right-hand side is independent of $f$, we find that the Wasserstein distance satisfies $$\mathcal{W}(Y_t,\mathcal{N}(0,1))\leq \frac{\mathbb{E}[|X_0|]}{\sqrt{\beta(t)}}\to 0 \quad \text{for }t\to\infty$$ I.e. we find that the rescaled variance-exploding SDE converges in the Wasserstein distance to a normal distribution (assuming $\beta(t)\to\infty$ as $t\to\infty$).
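As a quick numerical sanity check (a sketch reusing x_traj, time_grid, and linear_beta from the VE cell above; not part of the original text), we can rescale the final simulated marginal and compare it to $\mathcal{N}(0,1)$:

In [ ]:
# Rescale the last simulated VE marginal by sqrt(beta(t)); it should be close to N(0,1),
# up to the residual mean E[X_0]/sqrt(beta(T)) = 1/sqrt(20) ~ 0.22 derived above.
T_last = time_grid[-1]
rescaled = x_traj[len(time_grid)-1].squeeze()/np.sqrt(linear_beta(T_last))
print(rescaled.mean(), rescaled.std())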

4.3. Sub-VP SDE¶

This SDE type is a modified version of the VP SDE. It is defined by:

$$f(x,t)=-\frac{1}{2}\beta'(t)x,\quad g(t)=\sqrt{\beta'(t)\left(1-\exp(-2\beta(t))\right)}$$

The expectation is given by $$\begin{align*}\mathbb{E}[X_t|X_0]&=X_0\exp\left(\int\limits_{0}^{t}-\frac{1}{2}\beta'(s)ds\right)\\ &=X_0\exp\left(-\frac{1}{2}\beta(t)\right)\\ \mathbb{E}[X_t]&=\mathbb{E}[X_0]\exp\left(-\frac{1}{2}\beta(t)\right) \end{align*}$$

and the conditional variance:

$$\begin{align*}\mathbb{V}[X_{t}|X_0] =&\left(\int\limits_{0}^{t}\exp\left(\int\limits_{0}^{s}\beta'(r)dr\right)\beta'(s)(1-\exp(-2\beta(s)))ds\right)\exp\left(-\int\limits_{0}^{t}\beta'(s)ds\right)\\ =&\left(\int\limits_{0}^{t}\exp(\beta(s))\beta'(s)(1-\exp(-2\beta(s)))ds\right)\exp\left(-\beta(t)\right)\\ =&\left(\int\limits_{0}^{t}\exp(\beta(s))\beta'(s)ds-\int\limits_{0}^{t}\beta'(s)\exp(-\beta(s))ds\right)\exp\left(-\beta(t)\right)\\ =&\left(\exp(\beta(t))-1+\exp(-\beta(t))-1\right)\exp\left(-\beta(t)\right)\\ =&1-2\exp(-\beta(t))+\exp(-2\beta(t))\\ =&(1-\exp(-\beta(t)))^2\\ \mathbb{V}[X_{t}]=&(1-\exp(-\beta(t)))^2+\mathbb{V}[X_0]\exp(-\beta(t)) \end{align*}$$

The above equation also shows that the variance of the sub-VP SDE is always smaller than the variance of the VP SDE, since $(1-\exp(-\beta(t)))^2\leq 1-\exp(-\beta(t))$ - hence the name.

Deep Dive (Optional): Rigorous proof for sub-VP convergence. Using the equation (VP-Conv) above (see 4.1.), we get that for every 1-Lipschitz $f$: $$\begin{align*} |\mathbb{E}[f(X_t)]-\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[f(Z)]| &\leq \mathbb{E}\big[|\sqrt{\mathbb{V}[X_t|X_0]}Z+\mathbb{E}[X_t|X_0]-Z|\big]\\ &\leq \mathbb{E}\big[|\sqrt{\mathbb{V}[X_t|X_0]}-1||Z|\big]+\mathbb{E}\big[|\mathbb{E}[X_t|X_0]|\big]\\ &\leq \mathbb{E}\big[|(1-\exp(-\beta(t)))-1|\big]+\mathbb{E}[|X_0|]\exp\left(-\frac{1}{2}\beta(t)\right)\\ &= \left(\exp\left(-\tfrac{1}{2}\beta(t)\right)+\mathbb{E}[|X_0|]\right)\exp\left(-\frac{1}{2}\beta(t)\right) \end{align*}$$

Therefore, we find that the Wasserstein distance satisfies $$\mathcal{W}(X_t,\mathcal{N}(0,1))\leq \left(\exp\left(-\tfrac{1}{2}\beta(t)\right)+\mathbb{E}[|X_0|]\right)\exp\left(-\frac{1}{2}\beta(t)\right) \to 0 \quad \text{ as }t\to\infty$$ I.e. we find that the sub-VP SDE also converges exponentially in the Wasserstein distance to a normal distribution (assuming $\beta(t)\to\infty$ as $t\to\infty$).

Let's plot the above formula and compare it to a simulation as well as to the VP simulation:
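Here is a minimal sketch of that comparison (reusing run_sde and the linear $\beta(t)=t$ schedule, so $\beta'(t)=1$; the initial distribution and plot layout are illustrative choices, not fixed by the text above):

In [ ]:
INIT_MEAN, INIT_STD = 1, 3
INIT_VAR = INIT_STD**2

def subvp_g(t): return np.sqrt(1.0-np.exp(-2*t))           # g(t) = sqrt(beta'(t)*(1-exp(-2*beta(t)))) with beta(t)=t
def subvp_f(x,t): return -0.5*x                            # f(x,t) = -0.5*beta'(t)*x with beta'(t)=1

def subvp_mean(t): return INIT_MEAN*np.exp(-0.5*t)
def subvp_std(t):  return np.sqrt((1-np.exp(-t))**2 + INIT_VAR*np.exp(-t))
def vp_std(t):     return np.sqrt(1+(INIT_VAR-1)*np.exp(-t))   # VP variance formula, for comparison

x_init = np.random.normal(loc=INIT_MEAN,scale=INIT_STD,size=10000).reshape(-1,1)
x_traj, time_grid = run_sde(subvp_f,subvp_g,x_start=x_init,t_end=15.0)

fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
ax1.plot(time_grid,subvp_mean(time_grid),label="derived formula (sub-VP)")
ax1.plot(time_grid,x_traj.squeeze().mean(axis=1)[:-1],label="simulation")
ax1.legend(); ax1.set_title("E[X_t]")

ax2.plot(time_grid,subvp_std(time_grid),label="derived formula (sub-VP)")
ax2.plot(time_grid,vp_std(time_grid),label="VP formula",linestyle='--')
ax2.plot(time_grid,x_traj.squeeze().std(axis=1)[:-1],label="simulation")
ax2.legend(); ax2.set_title("std of X_t")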

4.4. Overview over SDE types used for generative modelling¶

| SDE type | $f(x,t)$ | $g(t)$ | $\mathbb{E}[X_t;X_0]$ | $\mathbb{V}[X_t;X_0]$ | Variance asymptotics | Expectation asymptotics | Convergence into $\mathcal{N}(0,1)$? | Convergence until time $t=T$ |
|---|---|---|---|---|---|---|---|---|
| Variance-exploding (VE) | $0$ | $\sqrt{\beta'(t)}$ | $X_0$ | $\beta(t)$ | Exploding | Constant | $\frac{X_t}{\sqrt{\beta(t)}}$ | $\propto\frac{1}{\sqrt{\beta(T)}}$ |
| Variance-preserving (VP) | $-\frac{x\beta'(t)}{2}$ | $\sqrt{\beta'(t)}$ | $X_0\exp\left(-\frac{1}{2}\beta(t)\right)$ | $1-\exp\left(-\beta(t)\right)$ | Constant if $\mathbb{V}[X_0]=1$ | Exponential decay to $0$ | $X_t$ | $\propto\exp(-\frac{1}{2}\beta(T))$ |
| Sub-VP | $-\frac{x\beta'(t)}{2}$ | $\sqrt{\beta'(t)(1-\exp(-2\beta(t)))}$ | $X_0\exp\left(-\frac{1}{2}\beta(t)\right)$ | $(1-\exp(-\beta(t)))^2$ | Exponential convergence to variance $1$ | Exponential decay to $0$ | $X_t$ | $\propto\exp(-\frac{1}{2}\beta(T))$ |

4.5 Specific noise functions for known diffusion models¶

In this section, we go through known diffusion models in the literature and study what noise functions $\beta(t)$ they use.

4.5.1. Score-matching with Langevin dynamics (SMLD)¶

The work "Generative Modeling by Estimating Gradients of the Data Distribution " proposed to use a variance-exploding SDE from $t=0$ to $t=T=1$ such that the noise scales follow a geometric sequence from standard deviation $\sigma_{\text{min}}$ to $\sigma_{\text{max}}$: $$\mathbb{V}[X_t|X_0]=\beta(t)=\sigma^2_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{2t}$$ $$\Rightarrow \sqrt{\beta(t)}=\sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^t$$ $$\Rightarrow\beta'(t)=\sigma^2_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{2t}\log(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})2$$ $$\Rightarrow g(t)=\sqrt{\beta'(t)}=\sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{t}\sqrt{\log(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})2}$$

4.5.2. Denoising Diffusion Probabilistic Models (DDPMs)¶

In the work Denoising Diffusion Probabilistic Models, it was proposed to use a model that corresponds to a variance-preserving SDE running from $t=0$ to $t=T=1$ with $\beta'(t)=\beta_{\text{min}}+t(\beta_{\text{max}}-\beta_{\text{min}})$, which gives: $$\beta(t)=\frac{1}{2}t^2(\beta_{\text{max}}-\beta_{\text{min}})+t\beta_{\text{min}}$$
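Correspondingly, here is a minimal sketch of this VP schedule in numpy ($\beta_{\text{min}},\beta_{\text{max}}$ below are illustrative values, not fixed by the text):

In [ ]:
DDPM_BETA_MIN, DDPM_BETA_MAX = 0.1, 20.0   # illustrative values

def ddpm_beta_prime(t):
    return DDPM_BETA_MIN + t*(DDPM_BETA_MAX - DDPM_BETA_MIN)

def ddpm_beta(t):
    return 0.5*t**2*(DDPM_BETA_MAX - DDPM_BETA_MIN) + t*DDPM_BETA_MIN

def ddpm_f(x, t):
    # VP drift: f(x,t) = -0.5*beta'(t)*x
    return -0.5*ddpm_beta_prime(t)*x

def ddpm_g(t):
    # VP random drift: g(t) = sqrt(beta'(t))
    return np.sqrt(ddpm_beta_prime(t))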

4.5.3. DDPM Sub-VP SDEs¶

This is a variant of the former SDE that reframes it as a sub-VP SDE, as used by Song et al. in their experiments. Here, the sub-VP SDE runs from $t=0$ to $t=T=1$ and uses the same $\beta(t)$.

4.5.4. Summary¶

| Method | SDE type | $f(x,t)$ | $g(t)$ | $\mathbb{E}[X_t;X_0]$ | $\mathbb{V}[X_t;X_0]$ | Convergence into $\mathcal{N}(0,1)$? | Convergence until $T=1$ |
|---|---|---|---|---|---|---|---|
| Score matching with Langevin dynamics (SMLD) | Variance-exploding (VE) | $0$ | $\sigma_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{t}\sqrt{2\log(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})}$ | $X_0$ | $\sigma^2_{\text{min}}\left(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\right)^{2t}$ | $\frac{X_t}{\sigma_{\text{min}}}\left(\frac{\sigma_{\text{min}}}{\sigma_{\text{max}}}\right)^t$ | $\propto\frac{1}{\sigma_{\text{max}}}$ |
| Deep UL using Nonequ. Thermodynamics | Variance-preserving (VP) | $-x\frac{\beta_{\text{min}}+t(\beta_{\text{max}}-\beta_{\text{min}})}{2}$ | $\sqrt{\beta_{\text{min}}+t(\beta_{\text{max}}-\beta_{\text{min}})}$ | $X_0\exp\left(-\frac{1}{4}t^2(\beta_{\text{max}}-\beta_{\text{min}})-\frac{1}{2}t\beta_{\text{min}}\right)$ | $1-\exp\left(-\frac{1}{2}t^2(\beta_{\text{max}}-\beta_{\text{min}})-t\beta_{\text{min}}\right)$ | $X_t$ | $\propto\exp(-\frac{1}{4}(\beta_{\text{max}}+\beta_{\text{min}}))$ |
| Score-based generative modeling through sub-VP SDEs | Sub-VP | $-x\frac{\beta_{\text{min}}+t(\beta_{\text{max}}-\beta_{\text{min}})}{2}$ | $\sqrt{[\beta_{\text{min}}+t(\beta_{\text{max}}-\beta_{\text{min}})][1-\exp(-t^2(\beta_{\text{max}}-\beta_{\text{min}})-2t\beta_{\text{min}})]}$ | $X_0\exp\left(-\frac{1}{4}t^2(\beta_{\text{max}}-\beta_{\text{min}})-\frac{1}{2}t\beta_{\text{min}}\right)$ | $\left[1-\exp\left(-\frac{1}{2}t^2(\beta_{\text{max}}-\beta_{\text{min}})-t\beta_{\text{min}}\right)\right]^2$ | $X_t$ | $\propto\exp(-\frac{1}{4}(\beta_{\text{max}}+\beta_{\text{min}}))$ |

6. From noise to signal: Generative modelling with SDEs¶

So far, we have learnt that we can construct stochastic differential equations evolving from time $t=0$ to a certain time $t=T$. For affine drift coefficients, we have learnt how to compute their distributions over time. This is relevant for generative AI because, as we have seen, these SDEs converge under certain conditions to a normal distribution $\mathcal{N}(0,1)$ regardless (!!!) of their initial distribution $X_0\sim p_0$! Let me explain why that is significant...

Remember that for generative AI, the goal is to sample from a data distribution $p_{\text{data}}$, e.g. $p_{\text{data}}$ could be a distribution of images of celebrities and the goal is to generate a new picture of a fictional celebrity. If we set $p_{\text{data}}=p_0$ as the initial distribution of an SDE, we know that after a certain time $T>0$, the distribution of $X_T$ is $\mathcal{N}(0,1)$ (up to some negligible error). The significance of this is that while we cannot draw novel samples from $p_{\text{data}}$ (beyond what we have in our training data), we can easily sample from $\mathcal{N}(0,1)$. Therefore, the idea is to go backwards: if we can run an SDE backwards in time (from $t=T$ to $t=0$), we can sample $X_T\sim \mathcal{N}(0,1)$ and then evolve the SDE backwards to reach $X_0\sim p_{\text{data}}$. In other words, we would have learnt to sample from our data distribution. Here is a figure from Song et al. that illustrates this:

[Figure from Song et al.: the forward SDE maps data to noise; the reverse-time SDE maps noise back to data.]

In this chapter, we are going to learn how to reverse an SDE in time enabling us to build generative AI models.

6.1. Time-reversal formula¶

An important result about SDEs is that we can reverse them in time. What we mean by that is the following: imagine that we run an SDE $$dX(t) = f(X_t,t)dt+g(t)dW_t$$ from time $t=0$ (with $X_0\sim p_0$) to time $t=T$ (with $X_{T}\sim p_{T}$). The question is: if we only started with a sample $X_T\sim p_T$, is there an SDE that runs backwards in time (from $t=T$ to $t=0$) and gives the same stochastic process as $X_t$?

Let's first define what a reverse-time SDE is. It is given by the following equation:

$$d\bar{X}_t= \bar{f}(\bar{X}_t,t)dt + \bar{g}(t)d\bar{W}_t $$

where the Brownian motion $\bar{W}_u$ runs backwards in time (i.e. $\bar{W}_{t-s}-\bar{W}_{t}$ is independent of $\bar{W}_t$ for $s>0$). In discretized terms, that means: $$\begin{align*} \bar{X}_{t-s}-\bar{X}_t &\approx - s\bar{f}(\bar{X}_t,t)+\sqrt{s}\bar{g}(t)\epsilon \quad \text{ where } \epsilon\sim\mathcal{N}(0,I)\end{align*}$$

So we can reframe our question: given $f,g$, are there $\bar{f},\bar{g}$ such that the reverse-time diffusion process $\bar{X}_{t}$ has the same distribution as the forward-time process $X_t$?

An important result from the 80s from mathematician Brian Anderson is: Yes! We can reverse an SDE by setting

$$\begin{align*}\bar{g}&=g\quad(\text{g-Reverse})\\ \bar{f}(x,t)&=f(x,t)-g^2(t)\frac{\partial}{\partial x}\log p_{t}(x)\quad(\text{f-Reverse})\end{align*}$$

where $p_t(x)$ is the density of the distribution of $X_t$. In other words, the reverse-time SDE is given by: $$d\bar{X}_t= \left[f(\bar{X}_t,t)-g^2(t)\frac{\partial}{\partial x}\log p_{t}(\bar{X}_t)\right]dt + g(t)d\bar{W}_t \quad(\text{Reverse-SDE})$$

Intuitively, the statement says: to reverse an SDE, we need to:

  1. Add Gaussian noise $\bar{W}_t$ with the same random drift $g(t)$ as the forward equation.
  2. Go $f(x,t)$ backwards (i.e. $-f(x,t)$) and account for the change in distribution $p_t(x)$.

Proof?: Later in this post, I give a self-contained derivation of the above time-reversal formula. As this is not essential for the machine learning practitioner, I have put it in the appendix at the end of this tutorial.

6.2. Score-based generative modelling with SDEs¶

The time-reversal formula allows us to learn how to generate samples from a probability distribution $p_{\text{data}}$. In order to sample from $p_{\text{data}}$, we follow these steps:

  1. Setup - select a converging SDE. Select $f(x,t)$ and $g(t)$ with affine drift coefficients such that $X_t$ converges to a fixed distribution $q$ (e.g. $q=\mathcal{N}(0,1)$).
  2. Training - learn scores: Train a deep neural network $s_{\theta}(x,t)$ to approximate the score, i.e. $s_{\theta}(x,t)\approx\frac{\partial}{\partial x}\log p_t(x)$, where $p_t$ is the distribution of $X_t$ when $X_0\sim p_{\text{data}}$.
  3. Deployment - run the reverse-time SDE with the trained score (see the sketch after this list). To sample from $p_{\text{data}}$:
    3.1. Sample $X_T\sim q$.
    3.2. Run the reverse-time SDE from $T$ to $0$, replacing $\frac{\partial}{\partial x}\log p_{t}$ with $s_{\theta}$: $$d\bar{X}_t= [f(\bar{X}_t,t)-g^2(t)s_{\theta}(\bar{X}_t,t)]dt + g(t)d\bar{W}_t$$
      I.e. numerically with step size $s>0$:
    
    $$\begin{align*} \bar{X}_{t-s} &\approx \bar{X}_t +s\left[g^2(t)s_{\theta}(\bar{X}_t,t)-f(\bar{X}_t,t)\right]+\sqrt{s}g(t)\epsilon \quad \text{ where } \epsilon\sim\mathcal{N}(0,I)\quad(\text{Reverse Dynamics})\end{align*}$$
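As a concrete illustration of the (Reverse Dynamics) update, here is a minimal numpy sketch (the names sample_reverse_sde and exact_vp_score, and the toy setup with a Gaussian $p_{\text{data}}$, are illustrative assumptions; in practice score_fn would be the trained network $s_{\theta}$):

In [ ]:
def sample_reverse_sde(score_fn, f, g, x_T, t_end=1.0, n_steps=1000):
    """Euler-Maruyama discretization of the (Reverse Dynamics) update:
    X_{t-s} = X_t + s*(g(t)^2*score_fn(X_t,t) - f(X_t,t)) + sqrt(s)*g(t)*eps."""
    x = x_T
    time_grid = np.linspace(t_end, 0.0, n_steps+1)
    s = time_grid[0] - time_grid[1]                              # step size
    for t in time_grid[:-1]:
        eps = np.random.normal(size=x.shape)
        x = x + s*(g(t)**2*score_fn(x, t) - f(x, t)) + np.sqrt(s)*g(t)*eps
    return x

# Toy check with a Gaussian "data" distribution p_data = N(2, 0.5^2) and the VP SDE with beta(t)=t,
# for which p_t is Gaussian and the exact score is available in closed form:
m0, v0 = 2.0, 0.25
def exact_vp_score(x, t):
    mean, var = m0*np.exp(-0.5*t), 1+(v0-1)*np.exp(-t)
    return -(x-mean)/var

x_T = np.random.normal(size=(5000,1))                        # start from X_T ~ N(0,1)
x_0 = sample_reverse_sde(exact_vp_score,
                         lambda x,t: -0.5*x,                 # VP drift f with beta'(t)=1
                         lambda t: 1.0,                      # g(t) = sqrt(beta'(t)) = 1
                         x_T, t_end=15.0)
print(x_0.mean(), x_0.std())                                 # should be close to 2 and 0.5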

Score-based models?: The above explains why these methods are called score-based generative models: to run the reverse-time SDE, we "only" need to train a model to predict the score $\frac{\partial}{\partial x}\log p_t(x)$ - everything else is defined by us.

Remark - annealed Langevin dynamics: Remember from the previous tutorial that Langevin dynamics can sample from $p_{\text{data}}$ by the following (reverse-time) iteration: $$\bar{X}_{t-h} = \bar{X}_t+h\frac{\partial}{\partial x}\log p_{\text{data}}(\bar{X}_t)+\sqrt{h}\epsilon\quad \text{ where } \epsilon\sim\mathcal{N}(0,I)\quad\text{(Langevin Dynamics)}$$

For the variance exploding SDE, we have that:

$$\begin{align*}\bar{X}_{t-s}\approx\bar{X}_t+h\frac{\partial}{\partial x}\log p_t(\bar{X}_t)+\sqrt{h}\epsilon \end{align*}$$

for $h=sg^2(t)$. So the right-hand side is essentially a simulation of Langevin dynamics with one slight difference: instead of using $p_{\text{data}}$ at every time step, we use a noised version $$p_t(x_t)=\int \mathcal{N}(x_t;x_0,\beta(t))p_{\text{data}}(x_0)dx_0$$ As the noise level $\beta(t)$ decreases to zero for $t\to 0$, we have $p_{t}\to p_{\text{data}}$ - i.e. we anneal the distribution towards the data distribution. This is why the above method is called annealed Langevin dynamics. Its advantage is that in regions of low probability under $p_{\text{data}}$, we have very poor estimates of the scores (simply due to data scarcity). Therefore, at initialization one uses a noised version (introducing bias to reduce high variance). Yang Song's blog has great illustrations of this.

6.3. Training scores¶

The question remains how we train the neural network $s_{\theta}$. By construction, the input and output of $s_{\theta}$ have the same shape as our data vectors $X_0\sim p_{\text{data}}$ (plus a scalar time input). Therefore, the architecture of $s_{\theta}$ will depend on the data type we are looking at - for images, we might want to pick a transformer or a CNN.

It remains to define a differentiable loss $L(\theta)$. A natural choice is to use the mean-squared error to the true score $\frac{\partial}{\partial x}\log p_t(x)$, weighted by a function $\lambda:[0,T]\to\mathbb{R}_{+}$: $$\begin{align*} L_{\text{intractable}}(\theta)=&\frac{1}{T}\int\limits_{t=0}^{T}\lambda(t)\mathbb{E}_{x_0\sim p_{\text{data}}}\mathbb{E}_{x_t\sim p_{t|0}(\cdot|x_0)}\left[\|s_{\theta}(x_t,t)-\frac{\partial}{\partial x_t}\log p_t(x_t)\|^2\right]dt\\ =&\mathbb{E}_{t\sim\text{Unif}_{[0,T]}}\left[\lambda(t)\mathbb{E}_{x_0\sim p_{\text{data}}}\mathbb{E}_{x_t\sim p_{t|0}(\cdot|x_0)}\left[\|s_{\theta}(x_t,t)-\frac{\partial}{\partial x_t}\log p_t(x_t)\|^2\right]\right]\\ \end{align*} $$ where in the second line we have rewritten the time integral as an expectation over $t\sim\text{Unif}_{[0,T]}$. Notice that in the above, we cannot compute $p_t(x_t)$, as it depends on $p_0=p_{\text{data}}$ - which is exactly what we are trying to learn! However, writing $p_{t|0}(\cdot|x_0)$ for the conditional distribution of $X_t$ given $X_0=x_0$, it holds that:

$$\begin{align*} p_t(x_t)=\mathbb{E}_{x_0\sim p_{\text{data}}}\left[p_{t|0}(x_t|x_0)\right]=\int p_{t|0}(x_t|x_0)p_{\text{data}}(x)dx \end{align*} $$

and therefore $$\begin{align*} \frac{\partial}{\partial x_t}\log p_t(x_t) =&p_{t}(x_t)^{-1}\frac{\partial}{\partial x_t}p_t(x_t)\\ =&p_{t}(x_t)^{-1}\frac{\partial}{\partial x_t}\int p_{t|0}(x_t|x)p_{\text{data}}(x)dx\\ =&p_{t}(x_t)^{-1}\int \frac{\frac{\partial}{\partial x_t}p_{t|0}(x_t|x)}{p_{t|0}(x_t|x)}p_{\text{data}}(x)p_{t|0}(x_t|x)dx\\ =&p_{t}(x_t)^{-1}\int \frac{\partial}{\partial x_t}\log p_{t|0}(x_t|x)p_{\text{data}}(x)p_{t|0}(x_t|x)dx\\ =&\int \frac{\partial}{\partial x_t}\log p_{t|0}(x_t|x)p_{0|t}(x|x_t)dx\\ =&\mathbb{E}_{x_0'\sim p_{0|t}(\cdot|x_t)}\left[\frac{\partial}{\partial x_t}\log p_{t|0}(x_t|x_0')\right] \end{align*} $$

and therefore $$\begin{align*} \|s_{\theta}(x_t,t)-\frac{\partial}{\partial x_t}\log p_t(x_t)\|^2=&\|\mathbb{E}_{x_0'\sim p_{0|t}(\cdot|x_t)}\left[s_{\theta}(x_t,t)-\frac{\partial}{\partial x_t}\log p_{t|0}(x_t|x_0')\right]\|^2\\ \leq &\mathbb{E}_{x_0'\sim p_{0|t}(\cdot|x_t)}\left[\|s_{\theta}(x_t,t)-\frac{\partial}{\partial x_t}\log p_{t|0}(x_t|x_0')\|^2\right] \end{align*} $$ by Jensen's inequality. Inserting this bound into $L_{\text{intractable}}$, we get: $$\begin{align*} L_{\text{intractable}}(\theta) =&\mathbb{E}_{t\sim\text{Unif}_{[0,T]}}\left[\lambda(t)\mathbb{E}_{x_0\sim p_{\text{data}}}\mathbb{E}_{x_t\sim p_{t|0}(\cdot|x_0)}\left[\|s_{\theta}(x_t,t)-\frac{\partial}{\partial x_t}\log p_t(x_t)\|^2\right]\right]\\ \leq &\mathbb{E}_{t\sim\text{Unif}_{[0,T]}}\left[\lambda(t)\mathbb{E}_{x_0\sim p_{\text{data}}}\mathbb{E}_{x_t\sim p_{t|0}(\cdot|x_0)}\mathbb{E}_{x_0'\sim p_{0|t}(\cdot|x_t)}\left[\|s_{\theta}(x_t,t)-\frac{\partial}{\partial x_t}\log p_{t|0}(x_t|x_0')\|^2\right]\right]\\ = &\mathbb{E}_{t\sim\text{Unif}_{[0,T]}}\left[\lambda(t)\mathbb{E}_{x_0\sim p_{\text{data}}}\mathbb{E}_{x_t\sim p_{t|0}(\cdot|x_0)}\left[\|s_{\theta}(x_t,t)-\frac{\partial}{\partial x_t}\log p_{t|0}(x_t|x_0)\|^2\right]\right]\\ = & L_{\text{simple}}(\theta) \end{align*} $$ where we have used that if $x_t\sim p_t$ and $x_0'\sim p_{0|t}(\cdot|x_t)$, then the pair $(x_0',x_t)$ has the same joint distribution as $(x_0,x_t)$ with $x_0\sim p_{\text{data}}$ and $x_t\sim p_{t|0}(\cdot|x_0)$. Note that we can compute $L_{\text{simple}}(\theta)$ because it only depends on the conditional distribution $p_{t|0}(x_t|x_0)$, which is normally distributed because we assumed affine drift coefficients: $$p_{t|0}(x_t|x_0)=\mathcal{N}(x_t;\mathbb{E}[X_t|X_0=x_0],\mathbb{V}[X_t|X_0=x_0])$$

As we have shown above, for affine drift coefficients we can simply compute $\mathbb{E}[X_t|X_0=x_0],\mathbb{V}[X_t|X_0=x_0]$ in closed form. The score is then given by:

$$\begin{align*} \frac{\partial}{\partial x_t}\log p_{t|0}(x_t|x_0)=-\frac{x_t-\mathbb{E}[X_t|X_0=x_0]}{\mathbb{V}[X_t|X_0=x_0]} \end{align*} $$
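To make this concrete, here is a minimal 1-dimensional numpy sketch of a Monte Carlo estimate of $L_{\text{simple}}$. The helper name mc_simple_loss and the arguments cond_mean, cond_var, score_fn are illustrative assumptions; for affine SDEs, cond_var depends only on $t$, as derived above.

In [ ]:
def mc_simple_loss(score_fn, x0_samples, cond_mean, cond_var, T=1.0, n_mc=100000, lam=lambda t: 1.0):
    """Monte Carlo estimate of L_simple for a 1d affine SDE with known conditional moments.
    x0_samples: 1d array of data samples; cond_mean(x0,t) = E[X_t|X_0=x0]; cond_var(t) = V[X_t|X_0]."""
    t = np.random.uniform(0.0, T, size=n_mc)                        # t ~ Unif[0,T]
    x0 = x0_samples[np.random.randint(len(x0_samples), size=n_mc)]  # x0 ~ p_data (empirical)
    eps = np.random.normal(size=n_mc)
    xt = cond_mean(x0, t) + np.sqrt(cond_var(t))*eps                # x_t ~ p_{t|0}(.|x0)
    target = -(xt - cond_mean(x0, t))/cond_var(t)                   # closed-form conditional score
    return np.mean(lam(t)*(score_fn(xt, t) - target)**2)

For instance, for the VP SDE with $\beta(t)=t$ one would pass cond_mean = lambda x0,t: x0*np.exp(-0.5*t) and cond_var = lambda t: 1-np.exp(-t); in a real model, score_fn would be the (vectorized) neural network $s_{\theta}$ and the loss would be minimized with automatic differentiation rather than merely estimated with numpy.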

6.4 Score networks as denoising machines¶

Using that $p_{t|0}$ is Gaussian, we can write $X_t = \sqrt{\mathbb{V}[X_t|X_0=x_0]}\,\epsilon+\mathbb{E}[X_t|X_0=x_0]$ conditionally on $X_0=x_0$, where $\epsilon\sim\mathcal{N}(0,\mathbb{1}_d)$, and therefore: $$ \begin{align*} L_{\text{simple}}(\theta) &=\mathbb{E}_{t\sim\text{Unif}_{[0,T]}}\left[\lambda(t)\mathbb{E}_{x_0\sim p_{\text{data}}}\mathbb{E}_{x_t\sim p_{t|0}(\cdot|x_0)}\left[\left\|s_{\theta}(x_t,t)+\frac{x_t-\mathbb{E}[X_t|X_0=x_0]}{\mathbb{V}[X_t|X_0=x_0]}\right\|^2\right]\right]\\ &=\mathbb{E}_{t\sim\text{Unif}_{[0,T]}}\left[\lambda(t)\mathbb{E}_{x_0\sim p_{\text{data}}}\mathbb{E}_{\epsilon\sim\mathcal{N}(0,\mathbb{1})}\left[\left\|s_{\theta}(\sqrt{\mathbb{V}[X_t|X_0=x_0]}\epsilon+\mathbb{E}[X_t|X_0=x_0],t)+\frac{\epsilon}{\sqrt{\mathbb{V}[X_t|X_0=x_0]}}\right\|^2\right]\right]\\ &=\mathbb{E}_{t\sim\text{Unif}_{[0,T]}}\left[\lambda(t)\mathbb{E}_{x_0\sim p_{\text{data}}}\frac{1}{\mathbb{V}[X_t|X_0=x_0]}\mathbb{E}_{\epsilon\sim\mathcal{N}(0,\mathbb{1})}\left[\left\|\epsilon_{\theta}(\sqrt{\mathbb{V}[X_t|X_0=x_0]}\epsilon+\mathbb{E}[X_t|X_0=x_0],t)-\epsilon\right\|^2\right]\right] \end{align*} $$ for an alternative denoising network $$ \begin{align*}&\epsilon_{\theta}(x_t,t)=-s_{\theta}(x_t,t)\sqrt{\mathbb{V}[X_t|X_0=x_0]}\\ \Leftrightarrow &s_{\theta}(x_t,t)=-\frac{\epsilon_\theta(x_t,t)}{\sqrt{\mathbb{V}[X_t|X_0=x_0]}}\end{align*}$$ (note that for affine drift coefficients, $\mathbb{V}[X_t|X_0=x_0]$ does not depend on $x_0$, so this reparametrization is well-defined).

The network $\epsilon_{\theta}$ is called a denoising network because it learns to predict the noise that has converted the original image $x_0$ to the distorted image $x_t=\sqrt{\mathbb{V}[X_t|X_0=x_0]}\epsilon+\mathbb{E}[X_t|X_0]$. Therefore, training a diffusion model is equivalent to training a neural network to predict the noise that distorted your input.
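Since $\mathbb{V}[X_t|X_0]$ depends only on $t$ for affine SDEs, the two parametrizations can be converted into each other with a one-liner. A minimal sketch (eps_model, score_model, and cond_std are assumed, hypothetical callables):

In [ ]:
def score_from_denoiser(eps_model, cond_std):
    """Turn a denoising network eps_theta(x,t) into a score model s_theta(x,t) = -eps_theta(x,t)/std(t)."""
    return lambda x, t: -eps_model(x, t)/cond_std(t)

def denoiser_from_score(score_model, cond_std):
    """Inverse direction: eps_theta(x,t) = -s_theta(x,t)*std(t)."""
    return lambda x, t: -score_model(x, t)*cond_std(t)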

7. Summary¶

Finally, we summarize the key takeaways from this tutorial:

  1. Ito-SDEs. Ito-SDEs are defined by a deterministic drift $f$ and a random drift $g$ and a Brownian motion $W_t$: $$dX(t) = f(X_t,t)dt+g(t)dW_t$$
  2. Simulation. One can simulate SDEs in discrete time with the Euler-Maruyama method: $$X_{t+s} = X_{t} + sf(X_{t},t)+g(t)\sqrt{s}\epsilon\quad\text{where }\epsilon\sim\mathcal{N}(0,\mathbf{1}_d)$$
  3. Affine drifts allow simple formulas. All SDEs relevant for diffusion models have affine drift coefficients, i.e. $f(x,t)=a(t)x+b(t)$. In this case, one can derive closed-form expressions for the conditional expectation and conditional variance, and the conditional distributions are normal (i.e. the process is Gaussian).
  4. Training. Training a diffusion model corresponds to training a neural network $s_{\theta}(x,t)$ to approximate the score, $s_{\theta}(x,t)\approx \frac{\partial}{\partial x}\log p_t(x)$, by minimizing the following loss: $$\begin{align*} L(\theta) = &\mathbb{E}_{t\sim\text{Unif}_{[0,T]}}\left[\lambda(t)\mathbb{E}_{x_0\sim p_{\text{data}}}\mathbb{E}_{x_t\sim p_{t|0}(\cdot|x_0)}\left[\|s_{\theta}(x_t,t)-\frac{\partial}{\partial x_t}\log p_{t|0}(x_t|x_0)\|^2\right]\right] \end{align*}$$
  5. Sampling. Sampling from a diffusion model corresponds to sampling an initial value $\bar{X}_T\sim q$ from a distribution $q$ that the SDE converges to. Subsequently, the SDE is evolved in reverse time with the approximate score $s_{\theta}$: $$\begin{align*}&d\bar{X}_t= [f(\bar{X}_t,t)-g^2(t)s_{\theta}(\bar{X}_t,t)]dt + g(t)d\bar{W}_t\\ \Leftrightarrow& \bar{X}_{t-s} \approx \bar{X}_t +s\left[g^2(t)s_{\theta}(\bar{X}_t,t)-f(\bar{X}_t,t)\right]+\sqrt{s}g(t)\epsilon \quad \text{ where } \epsilon\sim\mathcal{N}(0,I)\end{align*}$$
  6. Most important SDEs for diffusion models: There are 2 big types of SDEs used in the literature for diffusion models:\
    A. The variance-preserving (VP) SDE exponentially converges to $\mathcal{N}(0,\mathbf{1}_d)$. The discrete counterpart is the first diffusion model introduced by [Sohl-Dickstein et al.](https://arxiv.org/pdf/1503.03585.pdf).\
    B. The variance-exploding (VE) SDE has a constant expectation value $\mathbb{E}[X_0]=\mathbb{E}[X_t]$ and a rescaled version converges to $\mathcal{N}(0,\mathbf{1}_d)$. The discrete counterpart is [Score matching with Langevin dynamics (SMLD)](https://arxiv.org/abs/1907.05600).

What's next? In my next tutorial, we are going to put the theory we developed into practice and train a diffusion model to generate novel images. We are also going to learn about conditional generation of images based on text, as well as about Neural ODEs and their connection to diffusion models.

Appendix: A Mathematical Derivation of the time-reversal formula¶

In the following, I would like to present a simple derivation of the time-reversal formula. For the reader only interested in using the formula above, I recommend skipping this section. Most proofs (including the original one) rely on Kolmogorov's forward equation. However, I don't like using a complicated formula to develop intuition for another formula. Therefore, I developed the derivation below, which does not rely on methods outside of traditional probability theory.

The approach of the proof is to express $f,g$ in terms of the following limits: \begin{align*}f(x,t) = \lim\limits_{h\to 0}\frac{\mathbb{E}[X_{t+h}-X_{t}|X_t=x]}{h}\\ g^2(t) = \lim\limits_{h\to 0}\frac{\mathbb{V}[X_{t+h}|X_t=x]}{h} \end{align*} This is a reverse definition: instead of defining $X_{t}$ via $f,g$ as above, we can also recover $f,g$ from $X_t$ via the above limits. We use this identity to derive the equations for $\bar{f},\bar{g}$ from $\bar{X}_t$.

A.1. Deriving $\bar{f}$¶

Since $\bar{X}_t$ is the time-reversal of the SDE, we can compute: \begin{align*} \mathbb{E}[\bar{X}_{t}-\bar{X}_{t+h}|\bar{X}_{t+h}=x] &=\int (x_t-x)p_{t|t+h}(x_t|x)dx_{t} \\ &=\int (x_t-x)p_{t+h|t}(x|x_{t})p_t(x_t)/p_{t+h}(x)dx_t\\ &\approx\frac{1}{p_{t+h}(x)}\int (x_t-x)\mathcal{N}(x;x_t+hf(x_{t},t),hg^2(t))p_t(x_t)dx_t \end{align*} We can approximate $f(x_t,t)$ via a first-order Taylor approximation: \begin{align*} f(x_{t},t) \approx f(x,t)+(x_t-x)\frac{\partial}{\partial x}f(x,t) \end{align*} and perform the substitution $x_t\to x_t-hf(x,t)$ (approximating $p_t(x_t-hf(x,t))\approx p_t(x_t)$, which becomes exact in the limit $h\to 0$) to get: \begin{align*} \mathbb{E}[\bar{X}_{t}-\bar{X}_{t+h}|\bar{X}_{t+h}=x] \approx&\frac{1}{p_{t+h}(x)}\int (x_t-x)\mathcal{N}(x_t+hf(x,t)+h(x_t-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))p_t(x_t)dx_t\\ =&\frac{1}{p_{t+h}(x)}\int (x_t-x)\mathcal{N}(x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))p_t(x_t)dx_t\\ &-\frac{hf(x,t)}{p_{t+h}(x)}\int\mathcal{N}(x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))p_t(x_t)dx_t\\ =&A+B \end{align*} Using partial integration, we get that: \begin{align*} \frac{A}{h} &= \frac{hg^2(t)}{hp_{t+h}(x)(1+h\frac{\partial}{\partial x}f(x,t))} \int (1+h\frac{\partial}{\partial x}f(x,t))\frac{x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t)-x}{hg^2(t)}\mathcal{N}(x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))p_t(x_t)dx_t \\&-\frac{hg^2(t)}{hp_{t+h}(x)(1+h\frac{\partial}{\partial x}f(x,t))} \int (1+h\frac{\partial}{\partial x}f(x,t))\frac{h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t)}{hg^2(t)}\mathcal{N}(x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))p_t(x_t)dx_t\\ =&-\frac{g^2(t)}{p_{t+h}(x)(1+h\frac{\partial}{\partial x}f(x,t))}\int \frac{\partial}{\partial x_t}\mathcal{N}(x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))p_t(x_t)dx_t\\ &-\frac{1}{p_{t+h}(x)} \int (x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t)\mathcal{N}(x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))p_t(x_t)dx_t\\ =&\frac{g^2(t)}{p_{t+h}(x)(1+h\frac{\partial}{\partial x}f(x,t))}\int \mathcal{N}(x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))\frac{\partial}{\partial x_t}p_t(x_t)dx_t\\ &-\frac{1}{p_{t+h}(x)} \int (x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t)\mathcal{N}(x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))p_t(x_t)dx_t\\ \to&\frac{g^2(t)}{p_{t}(x)(1+0)}\int \delta_{x}(x_t)\frac{\partial}{\partial x_t}p_t(x_t)dx_t\\ &-\frac{1}{p_{t}(x)} \int (x_t-x)\frac{\partial}{\partial x}f(x,t)\delta_{x}(x_t)p_t(x_t)dx_t\\ =&\frac{g^2(t)}{p_{t}(x)}\frac{\partial}{\partial x}p_t(x)-0=g^2(t)\frac{\partial}{\partial x}\log p_t(x) \end{align*} where $\delta_{x}$ is the Dirac delta distribution and we have used that for $x_n\to x$ and $\mu^2_n\to 0$ and any regular function $G$:

$$ \begin{align*} \int \mathcal{N}(z;x_n,\mu^2_n) G(z)dz \to \int\delta_{x}(z) G(z)dz = G(x)\quad\text{as }n\to\infty \end{align*} $$

For $B$, we compute: \begin{align*} \frac{B}{h} =& -\frac{f(x,t)}{p_{t+h}(x)}\int\mathcal{N}(x_t+h(x_t-hf(x,t)-x)\frac{\partial}{\partial x}f(x,t);x,hg^2(t))p_t(x_t)dx_t\\ &\to-\frac{f(x,t)}{p_t(x)}p_t(x)=-f(x,t) \end{align*}

As a result, we get that: \begin{align*} \bar{f}(x,t) = -\lim_{h\to 0}\frac{1}{h}\mathbb{E}[\bar{X}_{t-h}-\bar{X}_{t}|\bar{X}_t=x]=-\lim_{h\to 0}\frac{A+B}{h}=f(x,t)-g^2(t)\frac{\partial}{\partial x}\log p_t(x) \end{align*}

A.2. Deriving $\bar{g}$ for Gaussian SDEs¶

Next, we derive $\bar{g}$. In the following, we assume that the Ito-SDE describes a Gaussian process, i.e. the distribution of the tuple $(X_{t_1},X_{t_2},\dots,X_{t_n})$ is Gaussian for any choice of time points. As we have shown above, this is in particular the case for affine drift coefficients $f$ - so this proof covers all SDEs relevant for diffusion models.

For now, let's assume that the initial distribution of $X_0$ is deterministic, i.e. $X_0=x_0$. Let's write down the distribution of $X_{t-h},X_{t}$ for all $t,h\geq 0$:

$$X_{t-h}|X_0\sim\mathcal{N}(\mathbb{E}[X_{t-h}|X_0],\mathbb{V}[X_{t-h}|X_0])$$$$X_{t}|X_{t-h}=x_{t-h}\sim\mathcal{N}(x_{t-h}+hf(x_{t-h},t-h),hg^2(t-h))$$

Therefore, the conditional covariance matrix and mean vector of $(X_{t-h},X_{t})$ given $X_0$ look as follows (to leading order in $h$): $$ \begin{align*} \text{Cov}[(X_{t-h},X_{t})|X_0] &= \begin{pmatrix} \mathbb{V}[X_{t-h}|X_0] & \mathbb{V}[X_{t-h}|X_0] \\ \mathbb{V}[X_{t-h}|X_0] & \mathbb{V}[X_{t-h}|X_0] +hg^2(t-h) \end{pmatrix}\\ \mathbb{E}[(X_{t-h},X_{t})^T|X_0] &= [\mathbb{E}[X_{t-h}|X_0],\ \mathbb{E}[X_{t-h}|X_0]+h\mathbb{E}[f(X_{t-h},t-h)|X_0]]^T \end{align*} $$

We use the formula of conditional multivariate Gaussians to get: $$\begin{align*} \bar{g}^2(t)=\lim\limits_{h\to 0}\frac{1}{h}\mathbb{V}[\bar{X}_{t-h}|\bar{X}_t,X_0]=&\lim\limits_{h\to 0}\frac{1}{h}\left[\mathbb{V}[\bar{X}_{t-h}|X_0]-\frac{\mathbb{V}[\bar{X}_{t-h}|X_0]^2}{\mathbb{V}[X_{t-h}|X_0] +hg^2(t-h)}\right]\\ =&\lim\limits_{h\to 0}\frac{1}{h}\mathbb{V}[\bar{X}_{t-h}|X_0]\left[1-[1+\frac{hg^2(t-h)}{\mathbb{V}[\bar{X}_{t-h}|X_0]}]^{-1}\right]\\ =&\mathbb{V}[\bar{X}_{t}|X_0]\frac{d}{dh}\left[1-[1+\frac{hg^2(t-h)}{\mathbb{V}[\bar{X}_{t-h}|X_0]}]^{-1}\right]_{|h=0}\\ \end{align*}$$

We compute the derivative: $$\begin{align*} &\frac{d}{dh}_{| h=0}\left[1-\left[1+\frac{hg^2(t-h)}{\mathbb{V}[X_{t-h}|X_0]}\right]^{-1}\right]\\ =&\left[\left[1+\frac{hg^2(t-h)}{\mathbb{V}[X_{t-h}|X_0]}\right]^{-2}\right]_{|h=0}\frac{d}{dh}_{| h=0}\frac{hg^2(t-h)}{\mathbb{V}[X_{t-h}|X_0]}\\ =&\frac{d}{dh}_{| h=0}\frac{hg^2(t-h)}{\mathbb{V}[X_{t-h}|X_0]}\\ =&\frac{\mathbb{V}[X_{t-h}|X_0][g^2(t-h)-2hg(t-h)g'(t-h)]-hg^2(t-h)\frac{d}{dh}\mathbb{V}[X_{t-h}|X_0]}{(\mathbb{V}[X_{t-h}|X_0])^2}\Bigg|_{h=0}\\ =&\frac{g^2(t)}{\mathbb{V}[X_{t}|X_0]} \end{align*}$$ And therefore, $$\begin{align*} \bar{g}^2(t)=\mathbb{V}[X_{t}|X_0]\cdot\frac{g^2(t)}{\mathbb{V}[X_{t}|X_0]}=g^2(t) \end{align*} $$

For a non-deterministic initial distribution $X_0\sim p_0$, we use the law of total variance to get as well: $$\begin{align*} \frac{1}{h}\mathbb{V}[\bar{X}_{t-h}|\bar{X}_t]&=\frac{1}{h}\left[\mathbb{E}_{X_0}[\mathbb{V}[\bar{X}_{t-h}|\bar{X}_t,X_0]]+\mathbb{V}_{X_0}[\mathbb{E}[\bar{X}_{t-h}|\bar{X}_t,X_0]]\right]\\ &=\frac{1}{h}\mathbb{E}_{X_0}[\mathbb{V}[\bar{X}_{t-h}|\bar{X}_t,X_0]]+\frac{1}{h}\mathbb{V}_{X_0}[-h\bar{f}(\bar{X}_t,t)]\\ &=\frac{1}{h}\mathbb{E}_{X_0}[\mathbb{V}[\bar{X}_{t-h}|\bar{X}_t,X_0]]+h\mathbb{V}_{X_0}[\bar{f}(\bar{X}_t,t)]\\ &\to g^2(t)+0 \end{align*}$$

This finishes the proof.

Note: strictly speaking, we would still need to show that the reverse-time process is actually given by an SDE at all. However, in the Gaussian case, it is straightforward to show that the reverse-time Gaussian process is also a Gaussian process. By the above derivation, it is then also described by an SDE.