Statistics
==========

Mathematical foundations
-------------------------

This page uses several standard mathematical results. Here's a quick reference:

.. list-table:: Common mathematical results
   :header-rows: 1
   :widths: 40 40 20

   * - Result
     - Expression
     - Used in
   * - Gaussian integral
     - :math:`\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}`
     - :ref:`gaussian-1d-normalization`
   * - Taylor series for exponential
     - :math:`e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}`
     - :ref:`poisson-mean-proof`
   * - Exponential function property
     - :math:`e^{-\lambda} \cdot e^{\lambda} = e^0 = 1`
     - :ref:`poisson-mean-proof`

.. _gaussian-1d-normalization:

Gaussian distribution in 1D
----------------------------

What is a Gaussian distribution?
  The Gaussian (or normal) distribution is a continuous probability distribution that describes many natural phenomena. Its probability density function is:

  .. math::

    f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

  where:

  - :math:`\mu` is the mean (center of the distribution)
  - :math:`\sigma` is the standard deviation (spread of the distribution)
  - :math:`\sigma^2` is the variance

What do the parameters control?
  The mean :math:`\mu` shifts the distribution along the x-axis, while the standard deviation :math:`\sigma` controls how wide or narrow the bell curve appears.

  **Key properties:**

  - Total area under the curve equals 1 (normalization)
  - About 68% of values lie within :math:`\mu \pm \sigma`
  - About 95% of values lie within :math:`\mu \pm 2\sigma`
  - About 99.7% of values lie within :math:`\mu \pm 3\sigma`

Why does the normalization factor contain :math:`\sqrt{2\pi}`?
  The factor :math:`\frac{1}{\sigma\sqrt{2\pi}}` ensures that the integral over all space equals 1:

  .. math::

    \int_{-\infty}^{\infty} f(x)\,dx = 1

  This is required for a probability density function.

  .. dropdown:: Proof that the integral equals 1
    :icon: unlock
    :color: info

    Start with the Gaussian probability density function:

    .. math::

      \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\,dx

    **Step 1: Change of variables**

    Let :math:`u = \frac{x - \mu}{\sigma}`, so :math:`x = \sigma u + \mu` and :math:`dx = \sigma\,du`:

    .. math::

      \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) \sigma\,du = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{u^2}{2}\right)\,du

    **Step 2: Another substitution**

    Let :math:`v = \frac{u}{\sqrt{2}}`, so :math:`u = \sqrt{2}\,v` and :math:`du = \sqrt{2}\,dv`:

    .. math::

      \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-v^2} \sqrt{2}\,dv = \frac{\sqrt{2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-v^2}\,dv = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{-v^2}\,dv

    **Step 3: Gaussian integral**

    We now use the standard Gaussian integral result (this is a well-known mathematical constant, typically given without proof):

    .. math::

      \int_{-\infty}^{\infty} e^{-v^2}\,dv = \sqrt{\pi}

    Therefore:

    .. math::

      \frac{1}{\sqrt{\pi}} \cdot \sqrt{\pi} = 1

    This confirms that the normalization factor :math:`\frac{1}{\sigma\sqrt{2\pi}}` ensures the total probability equals 1, regardless of the values of :math:`\mu` and :math:`\sigma`.

How does changing the mean affect the distribution?
  Changing :math:`\mu` translates the entire curve horizontally without changing its shape:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-10, 10, 500)
    sigma = 1.5

    # Different means
    means = [-3, 0, 3]
    colors = ['#e74c3c', '#3498db', '#2ecc71']

    plt.figure(figsize=(10, 6))
    for mu, color in zip(means, colors):
        y = (1/(sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu)/sigma)**2)
        plt.plot(x, y, linewidth=2.5, label=f'μ = {mu}, σ = {sigma}', color=color)
        plt.axvline(mu, color=color, linestyle='--', alpha=0.5, linewidth=1.5)

    plt.xlabel('x', fontsize=12)
    plt.ylabel('Probability density', fontsize=12)
    plt.title('Effect of changing mean μ', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()

  The dashed vertical lines show the center (mean) of each distribution.

How does changing the standard deviation affect the distribution?
  Changing :math:`\sigma` controls the width of the distribution. Larger :math:`\sigma` means more spread out, smaller :math:`\sigma` means more concentrated around the mean:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-10, 10, 500)
    mu = 0

    # Different standard deviations
    sigmas = [0.5, 1.5, 3.0]
    colors = ['#9b59b6', '#e67e22', '#1abc9c']

    plt.figure(figsize=(10, 6))
    for sigma, color in zip(sigmas, colors):
        y = (1/(sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu)/sigma)**2)
        plt.plot(x, y, linewidth=2.5, label=f'μ = {mu}, σ = {sigma}', color=color)
        # Show ±σ range
        plt.axvspan(mu - sigma, mu + sigma, alpha=0.1, color=color)

    plt.xlabel('x', fontsize=12)
    plt.ylabel('Probability density', fontsize=12)
    plt.title('Effect of changing standard deviation σ', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()

  The shaded regions show the :math:`\mu \pm \sigma` range for each distribution.

Why does the peak height change with :math:`\sigma`?
  The normalization constraint means the total area must always equal 1. When you increase :math:`\sigma` (spreading out the distribution), the peak height must decrease to maintain the same total area:

  .. math::

    \text{Peak height} = f(\mu) = \frac{1}{\sigma\sqrt{2\pi}}

  This inverse relationship ensures conservation of total probability.

How do both parameters work together?
  Here's a comparison showing different combinations of :math:`\mu` and :math:`\sigma`:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-12, 12, 500)

    # Different combinations
    params = [
        (0, 1.0, '#3498db', 'Standard normal (μ=0, σ=1)'),
        (-4, 0.8, '#e74c3c', 'Narrow, left (μ=-4, σ=0.8)'),
        (3, 2.5, '#2ecc71', 'Wide, right (μ=3, σ=2.5)'),
    ]

    plt.figure(figsize=(10, 6))
    for mu, sigma, color, label in params:
        y = (1/(sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu)/sigma)**2)
        plt.plot(x, y, linewidth=2.5, label=label, color=color)
        plt.axvline(mu, color=color, linestyle='--', alpha=0.4, linewidth=1.5)

    plt.xlabel('x', fontsize=12)
    plt.ylabel('Probability density', fontsize=12)
    plt.title('Combined effects of μ and σ', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()

  Notice how the standard normal distribution (blue, :math:`\mu=0, \sigma=1`) serves as the canonical reference.

Gaussian distribution in 2D
----------------------------

What is a 2D Gaussian distribution?
  The 2D Gaussian extends the 1D case to two variables. For independent variables :math:`x` and :math:`y`, the probability density function is:

  .. math::

    f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2} - \frac{(y - \mu_y)^2}{2\sigma_y^2}\right)

  where:

  - :math:`\mu_x, \mu_y` are the means in the x and y directions
  - :math:`\sigma_x, \sigma_y` are the standard deviations in the x and y directions
  - The distribution is centered at :math:`(\mu_x, \mu_y)`

Why does the normalization factor become :math:`2\pi\sigma_x\sigma_y`?
  The 2D integral must equal 1:

  .. math::

    \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1

  Since the x and y variables are independent, the 2D Gaussian factorizes into a product of two 1D Gaussians:

  .. math::

    f(x, y) = \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2}\right) \times \frac{1}{\sigma_y\sqrt{2\pi}} \exp\left(-\frac{(y - \mu_y)^2}{2\sigma_y^2}\right)

  The normalization factor becomes :math:`\frac{1}{2\pi\sigma_x\sigma_y}` since :math:`\sqrt{2\pi} \times \sqrt{2\pi} = 2\pi`.

  .. dropdown:: Proof that the 2D integral equals 1
    :icon: unlock
    :color: info

    Start with the 2D Gaussian probability density function:

    .. math::

      \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma_x\sigma_y} \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2} - \frac{(y - \mu_y)^2}{2\sigma_y^2}\right)\,dx\,dy

    **Step 1: Separate the exponential**

    Since the exponent is a sum, we can split it into a product:

    .. math::

      \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2} - \frac{(y - \mu_y)^2}{2\sigma_y^2}\right) = \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2}\right) \exp\left(-\frac{(y - \mu_y)^2}{2\sigma_y^2}\right)

    **Step 2: Factor the double integral**

    The double integral of a product (when variables are independent) equals the product of the integrals:

    .. math::

      \frac{1}{2\pi\sigma_x\sigma_y} \int_{-\infty}^{\infty} \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2}\right)\,dx \int_{-\infty}^{\infty} \exp\left(-\frac{(y - \mu_y)^2}{2\sigma_y^2}\right)\,dy

    This can be rewritten as:

    .. math::

      \frac{1}{\sigma_x\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2}\right)\,dx \times \frac{1}{\sigma_y\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{(y - \mu_y)^2}{2\sigma_y^2}\right)\,dy

    **Step 3: Recognize 1D Gaussian integrals**

    Each integral is exactly the form of a normalized 1D Gaussian, which we know equals 1:

    .. math::

      \frac{1}{\sigma_x\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2}\right)\,dx = 1

    .. math::

      \frac{1}{\sigma_y\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{(y - \mu_y)^2}{2\sigma_y^2}\right)\,dy = 1

    **Step 4: Final result**

    Therefore:

    .. math::

      1 \times 1 = 1

    This proves that the 2D Gaussian is properly normalized. The key insight is that for independent variables, the 2D Gaussian is simply the product of two 1D Gaussians, and the product of two normalized distributions is also normalized.

What does a 2D Gaussian look like?
  A 2D Gaussian forms a bell-shaped surface. Here's a visualization with :math:`\mu_x = 0, \mu_y = 0, \sigma_x = 1, \sigma_y = 1`:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D

    # Create grid
    x = np.linspace(-4, 4, 100)
    y = np.linspace(-4, 4, 100)
    X, Y = np.meshgrid(x, y)

    # Parameters
    mu_x, mu_y = 0, 0
    sigma_x, sigma_y = 1, 1

    # 2D Gaussian
    Z = (1/(2*np.pi*sigma_x*sigma_y)) * np.exp(-0.5*((X-mu_x)**2/sigma_x**2 + (Y-mu_y)**2/sigma_y**2))

    # Create figure with 3D surface
    fig = plt.figure(figsize=(12, 5))
    
    # 3D surface plot
    ax1 = fig.add_subplot(121, projection='3d')
    surf = ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.9)
    ax1.set_xlabel('x', fontsize=11)
    ax1.set_ylabel('y', fontsize=11)
    ax1.set_zlabel('Probability density', fontsize=11)
    ax1.set_title('3D surface view', fontsize=12, fontweight='bold')
    fig.colorbar(surf, ax=ax1, shrink=0.5)

    # Contour plot
    ax2 = fig.add_subplot(122)
    contour = ax2.contour(X, Y, Z, levels=10, cmap='viridis')
    ax2.clabel(contour, inline=True, fontsize=8)
    ax2.set_xlabel('x', fontsize=11)
    ax2.set_ylabel('y', fontsize=11)
    ax2.set_title('Contour plot (top view)', fontsize=12, fontweight='bold')
    ax2.grid(True, alpha=0.3)
    ax2.set_aspect('equal')

    plt.tight_layout()

  The left panel shows the 3D bell surface, while the right panel shows contour lines representing constant probability density values.

How do different standard deviations affect the shape?
  When :math:`\sigma_x \neq \sigma_y`, the distribution becomes elliptical rather than circular:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-6, 6, 200)
    y = np.linspace(-6, 6, 200)
    X, Y = np.meshgrid(x, y)

    mu_x, mu_y = 0, 0

    # Three different cases
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    configs = [
        (1, 1, 'Circular (σₓ=1, σᵧ=1)'),
        (2, 0.8, 'Elliptical horizontal (σₓ=2, σᵧ=0.8)'),
        (0.6, 2, 'Elliptical vertical (σₓ=0.6, σᵧ=2)'),
    ]

    for ax, (sigma_x, sigma_y, title) in zip(axes, configs):
        Z = (1/(2*np.pi*sigma_x*sigma_y)) * np.exp(-0.5*((X-mu_x)**2/sigma_x**2 + (Y-mu_y)**2/sigma_y**2))
        
        contour = ax.contour(X, Y, Z, levels=8, cmap='plasma')
        ax.clabel(contour, inline=True, fontsize=8)
        ax.set_xlabel('x', fontsize=11)
        ax.set_ylabel('y', fontsize=11)
        ax.set_title(title, fontsize=12, fontweight='bold')
        ax.grid(True, alpha=0.3)
        ax.set_aspect('equal')
        ax.axhline(0, color='black', linewidth=0.5, alpha=0.3)
        ax.axvline(0, color='black', linewidth=0.5, alpha=0.3)

    plt.tight_layout()

  The contour lines show how the distribution spreads more in the direction with larger standard deviation.

How do you shift the center of a 2D Gaussian?
  Changing :math:`\mu_x` and :math:`\mu_y` translates the distribution without changing its shape:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-6, 6, 200)
    y = np.linspace(-6, 6, 200)
    X, Y = np.meshgrid(x, y)

    sigma_x, sigma_y = 1.2, 1.2

    # Different centers
    centers = [(0, 0), (-2, 2), (2, -1.5)]
    colors = ['#3498db', '#e74c3c', '#2ecc71']

    plt.figure(figsize=(10, 8))
    
    for (mu_x, mu_y), color in zip(centers, colors):
        Z = (1/(2*np.pi*sigma_x*sigma_y)) * np.exp(-0.5*((X-mu_x)**2/sigma_x**2 + (Y-mu_y)**2/sigma_y**2))
        
        contour = plt.contour(X, Y, Z, levels=6, colors=color, alpha=0.7, linewidths=2)
        plt.plot(mu_x, mu_y, 'o', color=color, markersize=10, 
                label=f'μₓ={mu_x}, μᵧ={mu_y}')

    plt.xlabel('x', fontsize=12)
    plt.ylabel('y', fontsize=12)
    plt.title('2D Gaussians at different centers', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.axis('equal')
    plt.tight_layout()

  Each colored circle marks the center :math:`(\mu_x, \mu_y)` of its corresponding distribution.

.. _poisson-mean-proof:

Poisson distribution
--------------------

What is a Poisson distribution?
  The Poisson distribution describes the probability of **counting a certain number of discrete events** in a **fixed interval** when these events occur independently at a constant average rate. Its probability mass function is:

  .. math::

    P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}

  where:

  - :math:`k` is the number of events (must be a non-negative integer: 0, 1, 2, 3, ...)
  - :math:`\lambda` is the average number of events (the rate parameter)
  - :math:`e \approx 2.71828` is Euler's number

What is an intuitive understanding of Poisson distribution?
  Think about counting rare, random events:

  **Example 1: Phone calls to a help desk**
  
  A help desk receives on average 3 calls per hour (:math:`\lambda = 3`). In any given hour, they might receive 0, 1, 2, 3, 4, or more calls. The Poisson distribution tells you the probability of each outcome. Getting exactly 3 calls is most likely, but 2 or 4 calls are also reasonably probable.

  **Example 2: Raindrops hitting a window**
  
  If raindrops hit a 10 cm × 10 cm window at an average rate of 20 drops per minute (:math:`\lambda = 20`), you can count how many drops actually hit in the next minute. Some minutes you might get 18 drops, others 22, but rarely 5 or 40.

  **Example 3: Electrons hitting a detector pixel**
  
  If a detector pixel expects 50 electrons on average (:math:`\lambda = 50`), you will actually count something like 45, 48, 52, or 55 electrons. The Poisson distribution tells you how likely each count is.

  **Key insight:** When events are rare and random, you cannot predict the exact count, but you can predict the probability distribution of possible counts.

Why is Poisson distribution important in electron microscopy?
  Electron detection is fundamentally a **counting process**. When electrons hit a detector, you count discrete events. The number of electrons arriving at each pixel follows Poisson statistics, not Gaussian statistics (though they can look similar at high counts).

  **Key properties:**

  - Mean equals :math:`\lambda`
  - Variance equals :math:`\lambda` (standard deviation is :math:`\sqrt{\lambda}`)

  .. _poisson-mean-proof:

  .. dropdown:: Proof that mean and variance equal λ
    :icon: unlock
    :color: info

    **Proof that the mean equals λ:**

    The mean (expected value) is:

    .. math::

      E[k] = \sum_{k=0}^{\infty} k \cdot P(k; \lambda) = \sum_{k=0}^{\infty} k \cdot \frac{\lambda^k e^{-\lambda}}{k!}

    The :math:`k=0` term contributes zero, so start from :math:`k=1`:

    .. math::

      E[k] = \sum_{k=1}^{\infty} k \cdot \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=1}^{\infty} \frac{k \cdot \lambda^k}{k!}

    Since :math:`k/k! = 1/(k-1)!`, we can simplify:

    .. math::

      E[k] = e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} = e^{-\lambda} \lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!}

    Let :math:`m = k-1`, then:

    .. math::

      E[k] = e^{-\lambda} \lambda \sum_{m=0}^{\infty} \frac{\lambda^m}{m!}

    **Key step:** The sum :math:`\sum_{m=0}^{\infty} \frac{\lambda^m}{m!}` is the Taylor series for :math:`e^{\lambda}`:

    .. math::

      e^{\lambda} = 1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots = \sum_{m=0}^{\infty} \frac{\lambda^m}{m!}

    Therefore:

    .. math::

      E[k] = e^{-\lambda} \lambda \cdot e^{\lambda} = \lambda \cdot e^{-\lambda + \lambda} = \lambda \cdot e^0 = \lambda

    **Proof that the variance equals λ:**

    First, calculate :math:`E[k^2]`:

    .. math::

      E[k^2] = \sum_{k=0}^{\infty} k^2 \cdot \frac{\lambda^k e^{-\lambda}}{k!}

    We can write :math:`k^2 = k(k-1) + k`, so:

    .. math::

      E[k^2] = E[k(k-1)] + E[k]

    For :math:`E[k(k-1)]`:

    .. math::

      E[k(k-1)] = \sum_{k=2}^{\infty} k(k-1) \cdot \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=2}^{\infty} \frac{\lambda^k}{(k-2)!}

    .. math::

      = e^{-\lambda} \lambda^2 \sum_{k=2}^{\infty} \frac{\lambda^{k-2}}{(k-2)!} = e^{-\lambda} \lambda^2 e^{\lambda} = \lambda^2

    Therefore:

    .. math::

      E[k^2] = \lambda^2 + \lambda

    The variance is:

    .. math::

      \text{Var}(k) = E[k^2] - (E[k])^2 = \lambda^2 + \lambda - \lambda^2 = \lambda


Why electron counting follows Poisson statistics
  In electron microscopy, you are literally counting individual electrons that hit each pixel of your detector. Each electron either hits a pixel or doesn't (discrete events), and electrons arrive randomly in time. This is exactly the scenario described by a Poisson distribution.

  The key insight: even if a pixel should receive λ = 100 electrons on average, you will not get exactly 100 every time. Instead, you get a distribution: sometimes 95, sometimes 103, sometimes 98. The standard deviation is √100 = 10 electrons. This uncertainty is fundamental—it comes from quantum randomness in when and where electrons arrive, not from imperfect equipment.

  For Poisson-distributed electron counts with mean :math:`\lambda`, we know that the standard deviation equals :math:`\sqrt{\lambda}`. This gives us a simple but powerful relationship:

  .. math::

    \text{SNR} = \frac{\lambda}{\sqrt{\lambda}} = \sqrt{\lambda}

The diminishing returns problem
  This square-root relationship reveals a fundamental challenge: **SNR improves slowly with dose**. You get diminishing returns—each additional electron contributes less to image quality than the previous one.

  The numbers tell the story:
  
  - To double your image quality (2× SNR), you need **4× more electrons**
  - To improve SNR by 10×, you need **100× more dose**
  - Going from 100 to 200 electrons: +10 SNR units (from 10 to 14.1)
  - Going from 900 to 1000 electrons: +1 SNR unit (from 30 to 31.6)

  This is why electron microscopy is so challenging—you cannot simply "blast more electrons" at the sample to get better images. Each sample has a radiation damage budget, and the square-root law means you hit diminishing returns quickly.


Why use Poisson if it eventually looks like Gaussian anyway?
  At high electron counts (:math:`\lambda \gtrsim 20`), Poisson does approximate Gaussian, but there are important reasons to understand Poisson:

  **1. Low-dose imaging**
  
  In cryo-EM and radiation-sensitive applications, you deliberately use low doses (10-50 electrons per pixel) to minimize sample damage. At these counts, Poisson is **discrete and asymmetric**—very different from Gaussian. Using Gaussian here would give wrong predictions.

  **2. Correct noise model**
  
  Poisson tells you that variance equals mean (:math:`\sigma^2 = \lambda`). This is different from Gaussian, where mean and variance are independent parameters. Knowing this relationship is crucial for:
  
  - Predicting SNR from dose: :math:`\text{SNR} = \sqrt{\lambda}`
  - Understanding that noise scales with :math:`\sqrt{\text{signal}}`, not independently
  - Proper image processing and denoising algorithms

  **3. Physical understanding**
  
  Poisson arises from the physics: electrons arrive randomly and independently. Gaussian arises from averaging many random variables (Central Limit Theorem). Using Poisson shows you understand the underlying counting process.

  **4. Dose optimization**
  
  Knowing SNR = :math:`\sqrt{\lambda}` (from Poisson) tells you exactly how much dose you need:
  
  - Want SNR = 5? Need :math:`\lambda = 25` electrons
  - Want SNR = 10? Need :math:`\lambda = 100` electrons (4× more!)
  
  This direct relationship comes from Poisson, not Gaussian.

  **Bottom line:** At high doses, you can use Gaussian as an approximation. But at low doses (where it matters most for beam-sensitive samples), you must use Poisson to correctly predict noise and optimize experiments.

What is signal-to-noise ratio intuitively?
  Signal-to-noise ratio (SNR) measures how much "real information" you have compared to random fluctuations:

  **Intuitive picture:**
  
  Imagine trying to hear someone speaking (signal) in a noisy room (noise). If they speak loudly and the room is quiet, you hear them clearly (high SNR). If they whisper or the room is very noisy, you struggle to understand them (low SNR).

  **In electron microscopy:**
  
  - **Signal**: The actual structure you want to see (mean electron count :math:`\lambda`)
  - **Noise**: Random fluctuations in the count (standard deviation :math:`\sqrt{\lambda}`)
  - **SNR**: How clearly you can distinguish the structure from random fluctuations

  For Poisson statistics, :math:`\text{SNR} = \lambda / \sqrt{\lambda} = \sqrt{\lambda}`. This means:

  - 100 electrons give SNR = 10 (10% relative noise)
  - 10,000 electrons give SNR = 100 (1% relative noise)
  - At low electron doses, Poisson noise (shot noise) dominates image quality

Why is standard deviation considered "noise"?
  The standard deviation measures **uncertainty** in your measurement. Here's why this matters:

  **The fundamental problem:**
  
  Imagine you want to compare two pixels to see if they have different structures:
  
  - Pixel A counts 100 electrons (with fluctuation ±10)
  - Pixel B counts 110 electrons (with fluctuation ±10)
  
  **Question:** Is pixel B really brighter, or is it just random fluctuation?
  
  If both measurements fluctuate by ±10, a difference of 10 electrons could be random chance! You cannot confidently say they're different.

  **Now consider:**
  
  - Pixel A counts 10,000 electrons (with fluctuation ±100)
  - Pixel B counts 10,100 electrons (with fluctuation ±100)
  
  Here, a 100-electron difference is real because fluctuations are only ±100. The difference (100) is exactly at the noise level, marginally detectable.

  **The SNR tells you:** How many "noise units" is your signal?
  
  - Low SNR (SNR < 3): Signal buried in noise, unreliable measurement
  - Medium SNR (SNR ≈ 5-10): Signal detectable but noisy
  - High SNR (SNR > 20): Signal clearly above noise, reliable measurement

Why does this concept solve a practical problem?
  **Problem 1: Can you detect a feature?**
  
  If an atom creates a brightness difference of 50 electrons but your noise is ±50 electrons (SNR = 1), you cannot see the atom! It's lost in fluctuations.
  
  If your noise is ±5 electrons (SNR = 10), the atom stands out clearly.

  **Problem 2: How much dose do you need?**
  
  SNR = √λ tells you directly: Want to detect a feature with SNR = 5? Need λ = 25 electrons minimum.
  
  This lets you plan experiments: balance image quality (need high SNR) against radiation damage (want low dose).

  **Problem 3: Can you trust differences between measurements?**
  
  If you measure two samples and find a 10% difference, is it real or noise? SNR tells you:
  
  - SNR = 100 means noise is 1%, so 10% difference is real
  - SNR = 5 means noise is 20%, so 10% difference could be random
  
  **Bottom line:** SNR quantifies **measurement reliability**. It answers "Can I trust what I'm seeing?" This is essential for any quantitative science.


  .. dropdown:: Derivation of SNR = √λ for Poisson statistics
    :icon: unlock
    :color: info

    Signal-to-noise ratio (SNR) is defined as the ratio of signal to noise:

    .. math::

      \text{SNR} = \frac{\text{Signal}}{\text{Noise}}

    **Step 1: Identify the signal**

    In counting statistics, the signal is the mean number of events:

    .. math::

      \text{Signal} = \text{Mean} = \lambda

    **Step 2: Identify the noise**

    The noise is quantified by the standard deviation. For a Poisson distribution, the variance equals :math:`\lambda`:

    .. math::

      \text{Variance} = \lambda

    Therefore, the standard deviation (noise) is:

    .. math::

      \text{Noise} = \sqrt{\text{Variance}} = \sqrt{\lambda}

    **Step 3: Calculate SNR**

    Substituting into the SNR definition:

    .. math::

      \text{SNR} = \frac{\lambda}{\sqrt{\lambda}} = \frac{\lambda}{\lambda^{1/2}} = \lambda^{1 - 1/2} = \lambda^{1/2} = \sqrt{\lambda}

    **Conclusion:**

    For Poisson-distributed counts, :math:`\text{SNR} = \sqrt{\lambda}`. This fundamental result shows that:

    - SNR increases with the square root of the count
    - To double SNR, you need to quadruple the number of events
    - Higher electron dose gives better SNR but at diminishing returns

What do the mean and variance being equal mean physically?
  If you expect 100 electrons on average (:math:`\lambda = 100`), the standard deviation of your count will be :math:`\sqrt{100} = 10` electrons. The relative uncertainty is :math:`10/100 = 10\%`.

  For 10,000 electrons: standard deviation is :math:`\sqrt{10000} = 100`, giving :math:`100/10000 = 1\%` relative uncertainty.

  **This is why increasing electron dose improves image quality**: more electrons means better counting statistics.

How does the Poisson distribution look for different values of λ?
  At low :math:`\lambda`, the distribution is asymmetric and discrete. At high :math:`\lambda`, it approaches a Gaussian:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import poisson

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    lambdas = [2, 10, 50]
    colors = ['#e74c3c', '#3498db', '#2ecc71']
    
    for ax, lam, color in zip(axes, lambdas, colors):
        k = np.arange(0, lam + 4*np.sqrt(lam))
        pmf = poisson.pmf(k, lam)
        
        ax.bar(k, pmf, color=color, alpha=0.7, edgecolor='black', linewidth=1.5)
        ax.axvline(lam, color='black', linestyle='--', linewidth=2, 
                   label=f'Mean λ = {lam}')
        ax.set_xlabel('Number of events (k)', fontsize=12)
        ax.set_ylabel('Probability P(k)', fontsize=12)
        ax.set_title(f'Poisson distribution (λ = {lam})', fontsize=13, fontweight='bold')
        ax.legend(fontsize=11)
        ax.grid(True, alpha=0.3, axis='y')
        
    plt.tight_layout()

  Notice how the distribution becomes more symmetric and bell-shaped as :math:`\lambda` increases.

When does Poisson approximate Gaussian?
  As :math:`\lambda` increases, the Poisson distribution approaches a Gaussian with :math:`\mu = \lambda` and :math:`\sigma = \sqrt{\lambda}`. Here's a comparison:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import poisson, norm

    lam = 50
    k = np.arange(0, 100)
    
    # Poisson distribution
    poisson_pmf = poisson.pmf(k, lam)
    
    # Gaussian approximation
    x = np.linspace(0, 100, 500)
    gaussian_pdf = norm.pdf(x, loc=lam, scale=np.sqrt(lam))
    
    plt.figure(figsize=(10, 6))
    plt.bar(k, poisson_pmf, color='#3498db', alpha=0.7, 
            label=f'Poisson (λ = {lam})', edgecolor='black', linewidth=1)
    plt.plot(x, gaussian_pdf, color='#e74c3c', linewidth=3, 
             label=f'Gaussian (μ = {lam}, σ = {np.sqrt(lam):.1f})')
    
    plt.xlabel('Number of events', fontsize=12)
    plt.ylabel('Probability density', fontsize=12)
    plt.title('Poisson distribution vs Gaussian approximation', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()


Understanding signal-to-noise ratio
  When you measure electron counts in microscopy, you face a fundamental challenge: every measurement contains both the real signal you want to see and random fluctuations you do not. Signal-to-noise ratio (SNR) quantifies how well you can distinguish the real information from the noise:

  .. math::

    \text{SNR} = \frac{\text{Signal}}{\text{Noise}} = \frac{\mu}{\sigma}

  where :math:`\mu` is the mean signal and :math:`\sigma` is the standard deviation representing measurement uncertainty.

The square-root law in action
  Here's how SNR scales with electron dose:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt

    # Electron counts (dose)
    dose = np.linspace(1, 1000, 1000)
    snr = np.sqrt(dose)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # Linear scale
    ax1.plot(dose, snr, color='#3498db', linewidth=2.5)
    ax1.set_xlabel('Electron dose (λ)', fontsize=12)
    ax1.set_ylabel('SNR = √λ', fontsize=12)
    ax1.set_title('SNR vs electron dose (linear scale)', fontsize=13, fontweight='bold')
    ax1.grid(True, alpha=0.3)

    # Add reference points
    ref_doses = [10, 100, 400, 900]
    for d in ref_doses:
        ax1.plot(d, np.sqrt(d), 'o', color='#e74c3c', markersize=8)
        ax1.annotate(f'{d} e⁻ → SNR={np.sqrt(d):.1f}', 
                    xy=(d, np.sqrt(d)), xytext=(10, 10),
                    textcoords='offset points', fontsize=9,
                    bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.3))

    # Log-log scale
    ax2.loglog(dose, snr, color='#2ecc71', linewidth=2.5)
    ax2.set_xlabel('Electron dose (λ)', fontsize=12)
    ax2.set_ylabel('SNR = √λ', fontsize=12)
    ax2.set_title('SNR vs electron dose (log-log scale)', fontsize=13, fontweight='bold')
    ax2.grid(True, alpha=0.3, which='both')
    
    # Show slope = 0.5 line
    dose_fit = np.array([10, 1000])
    snr_fit = np.sqrt(dose_fit)
    ax2.loglog(dose_fit, snr_fit, '--', color='#e74c3c', linewidth=2, 
              label='Slope = 0.5')
    ax2.legend(fontsize=11)

    plt.tight_layout()

  The log-log plot shows the square-root relationship as a straight line with slope 0.5. Notice the key reference points: 100 electrons gives SNR = 10 (10% noise), while 900 electrons gives SNR = 30 (3.3% noise). The improvement is substantial but requires much more dose.

Visualizing the impact on image quality
  The abstract mathematics of SNR becomes concrete when you see its effect on real images. Here's a simulation showing two atomic features imaged at different electron doses:

  .. plot::
    :include-source: False

    import numpy as np
    import matplotlib.pyplot as plt

    # Create a simple test pattern: two Gaussian spots
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)
    
    # True signal: sum of two Gaussians (representing two features/atoms)
    signal = 50 * np.exp(-(X+1)**2 - (Y+1)**2) + 30 * np.exp(-(X-1)**2 - (Y-1)**2)
    
    # Different dose levels
    doses = [10, 50, 200, 1000]
    
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    
    for idx, (dose_factor, ax) in enumerate(zip(doses, axes)):
        # Scale signal by dose
        scaled_signal = signal * dose_factor / 50
        
        # Add Poisson noise
        np.random.seed(42 + idx)
        noisy = np.random.poisson(scaled_signal)
        
        # Calculate SNR from the dose
        typical_lambda = np.mean(scaled_signal[scaled_signal > 10])
        snr = np.sqrt(typical_lambda)
        
        # Show noisy image
        im = ax.imshow(noisy, cmap='viridis', origin='lower')
        ax.set_title(f'Dose factor: {dose_factor}\nSNR ≈ {snr:.1f}', 
                     fontsize=13, fontweight='bold')
        ax.axis('off')
        plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
        
    plt.suptitle('Effect of electron dose on image quality', 
                fontsize=16, fontweight='bold', y=1.02)
    plt.tight_layout()

  At **dose factor 10** (leftmost, SNR ≈ 3.2), the two atomic features are barely distinguishable from the noise—you can see them if you know where to look, but they're not convincing. At **dose factor 50** (SNR ≈ 7.1), the features emerge more clearly, though still grainy. By **dose factor 200** (SNR ≈ 14.1), the image quality is quite good with well-defined features. Finally, at **dose factor 1000** (rightmost, SNR ≈ 31.6), you have an essentially noise-free image with smooth, sharp boundaries.

  This progression illustrates the practical meaning of SNR: it determines whether you can see what you're looking for.

The fundamental tradeoff
  This analysis reveals the central challenge in electron microscopy: **you want high SNR for clear images, but every electron deposits energy that damages your sample**. This is especially critical for biological specimens like proteins, which are extremely radiation-sensitive.

  Consider a typical cryo-EM experiment. With doses of 20-60 electrons/Ų, you get SNR = √20 ≈ 4.5 to √60 ≈ 7.7 per pixel—barely enough to distinguish features! This is why cryo-EM requires averaging thousands of identical particles and sophisticated image processing algorithms to extract high-resolution structural information.

  .. note::

    The SNR = √λ relationship explains why single-particle cryo-EM revolutionized structural biology: by averaging many copies of the same molecule, you effectively increase λ (and thus SNR) without increasing the per-particle dose. If you average 10,000 particles, you gain a factor of √10,000 = 100× in SNR compared to a single image.

  The Gaussian approximation works well when :math:`\lambda \gtrsim 20`. This is why high-dose electron microscopy images can be treated with Gaussian statistics.

Principal component analysis
-----------------------------

What problem does PCA solve?
  Suppose you measure 50 variables on each student (test scores, homework, attendance, etc). Most variables are correlated. PCA finds new coordinate axes (principal components) where:
  
  - **PC1** points in the direction of maximum variance
  - **PC2** is perpendicular to PC1 and captures the next most variance
  - Each subsequent PC is perpendicular to all previous ones
  
  Often just 2-3 principal components capture 80-95% of total variance, reducing 50 dimensions to 3.

Why does variance matter?
  High variance directions capture the relationships between variables. When test scores and homework are correlated, PC1 aligns with that correlation and captures their combined effect. This lets you represent high dimensional relationships in lower dimensions while preserving the structure of your data.

  The components themselves are meaningful: PC1 might represent "overall academic performance" while PC2 captures "test performance vs homework effort." You discover these patterns automatically from the data.

Does this work with higher dimensions?
  Yes! The 2D example below illustrates the concept, but PCA shines with high dimensional data. With 1000 gene expression measurements, PCA might find that just 10 components capture 90% of variance. Those 10 components reveal biological processes (cell cycle, stress response, etc) hidden in the original 1000 dimensions.

How does PCA work in 2D?
  Consider correlated data (test scores vs homework completion rate). PCA rotates the axes to align with the data's natural spread.

  .. plot:: notebooks/plot_statistics_pca_2d.py
     :include-source: False

  The red arrow (PC1) points where data varies most. The green arrow (PC2) captures remaining variation. In this example, PC1 explains ~89% of variance, PC2 explains ~11%.

Mathematical foundation
^^^^^^^^^^^^^^^^^^^^^^^

How does PCA work mathematically?
  PCA finds a set of orthogonal directions (principal components) that best represent your data's variance. The process involves three key steps:

  **Step 1: Center the data**
  
  Subtract the mean from each feature. If :math:`X` is your data matrix (:math:`n` samples by :math:`p` features), compute the centered data:

  .. math::

    X_{\text{centered}} = X - \bar{X}

  where :math:`\bar{X}` is the mean of each column.

  **Step 2: Compute the covariance matrix**
  
  The covariance matrix captures how features vary together:

  .. math::

    C = \frac{1}{n-1} X_{\text{centered}}^T X_{\text{centered}}

  This is a :math:`p \times p` symmetric matrix where entry :math:`C_{ij}` is the covariance between features :math:`i` and :math:`j`.

  **Why is it called covariance?** Covariance measures how two variables change together. If :math:`C_{ij} > 0`, when feature :math:`i` increases, feature :math:`j` tends to increase too (positive correlation). If :math:`C_{ij} < 0`, they move in opposite directions. If :math:`C_{ij} \approx 0`, they vary independently. The diagonal entries :math:`C_{ii}` are just the variance of each feature (how much it varies with itself). So the covariance matrix summarizes all pairwise relationships: variances on the diagonal, covariances off the diagonal.

  **Step 3: Find eigenvectors and eigenvalues**
  
  Solve the eigenvalue equation:

  .. math::

    C v = \lambda v

  where :math:`v` is an eigenvector and :math:`\lambda` is the corresponding eigenvalue. The eigenvectors are your principal components, and eigenvalues tell you how much variance each component captures.

Why do we need eigenvectors?
  Eigenvectors are special directions where the covariance matrix just stretches the vector without rotating it. When you project data onto an eigenvector direction, you maximize variance while keeping dimensions uncorrelated.

  The largest eigenvalue corresponds to the direction of maximum variance (PC1). The second largest gives the direction of maximum remaining variance perpendicular to PC1 (PC2), and so on.

What does orthogonal mean geometrically?
  Orthogonal means perpendicular. In the 2D plot above, PC1 and PC2 are at 90 degrees. Mathematically, two vectors :math:`v_1` and :math:`v_2` are orthogonal if their dot product is zero:

  .. math::

    v_1 \cdot v_2 = 0

  PCA guarantees all principal components are orthogonal. This is crucial because it means they capture independent patterns in your data. PC1 might capture "overall performance" while PC2 captures "consistency" as separate, uncorrelated aspects.

How do eigenvalues relate to variance?
  Each eigenvalue :math:`\lambda_i` equals the variance captured by its principal component:

  .. math::

    \text{Variance along PC}_i = \lambda_i

  The total variance in your data is:

  .. math::

    \text{Total variance} = \sum_{i=1}^{p} \lambda_i = \text{trace}(C)

  The fraction of variance explained by the first :math:`k` components is:

  .. math::

    \text{Variance explained} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i}

  This is what the scree plot shows: you keep components until the cumulative sum reaches 80-95% of total variance.

How do you transform data to the new coordinates?
  Once you have the eigenvectors (principal components), project your data onto them. If :math:`V` is the matrix of eigenvectors (each column is a PC), the transformed data is:

  .. math::

    X_{\text{PC}} = X_{\text{centered}} \cdot V

  Each column of :math:`X_{\text{PC}}` gives the coordinates in the new principal component space. To reduce dimensions, keep only the first :math:`k` columns corresponding to the largest :math:`k` eigenvalues.

  .. dropdown:: Example: PCA calculation in 2D
    :icon: unlock
    :color: info

    Suppose we have 3 data points in 2D:

    .. math::

      X = \begin{bmatrix} 2 & 4 \\ 3 & 5 \\ 5 & 7 \end{bmatrix}

    **Step 1: Center the data**

    Mean: :math:`\bar{X} = \begin{bmatrix} 3.33 & 5.33 \end{bmatrix}`

    .. math::

      X_{\text{centered}} = \begin{bmatrix} -1.33 & -1.33 \\ -0.33 & -0.33 \\ 1.67 & 1.67 \end{bmatrix}

    **Step 2: Compute covariance matrix**

    .. math::

      C = \frac{1}{2} \begin{bmatrix} -1.33 & -0.33 & 1.67 \\ -1.33 & -0.33 & 1.67 \end{bmatrix} \begin{bmatrix} -1.33 & -1.33 \\ -0.33 & -0.33 \\ 1.67 & 1.67 \end{bmatrix} = \begin{bmatrix} 1.56 & 1.56 \\ 1.56 & 1.56 \end{bmatrix}

    **Step 3: Find eigenvalues and eigenvectors**

    Solve :math:`\det(C - \lambda I) = 0`:

    .. math::

      \det \begin{bmatrix} 1.56 - \lambda & 1.56 \\ 1.56 & 1.56 - \lambda \end{bmatrix} = (1.56 - \lambda)^2 - 1.56^2 = 0

    This gives :math:`\lambda_1 = 3.12` and :math:`\lambda_2 = 0`.

    The eigenvectors are :math:`v_1 = \begin{bmatrix} 0.707 \\ 0.707 \end{bmatrix}` (45° diagonal) and :math:`v_2 = \begin{bmatrix} -0.707 \\ 0.707 \end{bmatrix}` (perpendicular).

    **Interpretation**: PC1 points along the diagonal (where data varies), PC2 is perpendicular (no variance). The data is perfectly correlated, lying on a line, so one component captures 100% of variance.

What are common applications?
  - **Face recognition**: Eigenfaces (compress face images from millions of pixels to ~100 coefficients)
  - **Gene expression**: Find patterns in thousands of genes across samples
  - **Image compression**: Store fewer components while preserving visual quality
  - **Noise reduction**: Keep high-variance components, discard low-variance noise

  .. plot:: notebooks/plot_statistics_pca_screen.py
     :include-source: False

  The scree plot shows diminishing returns after the first few components.

What are the limitations of PCA?
  PCA makes several assumptions that can break down in real applications:

  **Linearity assumption**: PCA only captures linear relationships. If your data lies on a curved manifold (like a Swiss roll), PCA will fail to find the true low dimensional structure. Non-linear methods like t-SNE, UMAP, or kernel PCA work better for such cases.

  **Variance equals importance**: PCA assumes high variance directions contain the signal. But sometimes noise has high variance while the signal you care about has low variance. For example, in gene expression data, housekeeping genes might vary a lot (high variance) but be biologically uninteresting, while disease-related genes vary subtly (low variance) but are crucial.

  **Interpretability**: Principal components are linear combinations like "0.3×gene1 + 0.5×gene2 - 0.2×gene3." This mathematical abstraction is hard to interpret biologically or physically. You lose the direct connection to your original measurements.

  **Sensitivity to scaling**: If one feature has values in the thousands while another is between 0 and 1, PCA will be dominated by the large-scale feature. You must standardize your data first, but this itself is a modeling choice that affects results.

How is PCA actually used in practice?
  Despite its limitations, PCA remains widely used because it is fast, simple, and often "good enough":

  **Preprocessing step**: Most commonly, PCA reduces dimensionality before feeding data to other algorithms. For example, reduce 10,000 gene measurements to 50 components, then run clustering or classification on those 50 components. This speeds up computation and often improves results by removing noise.

  **Exploratory data analysis**: Plot the first 2-3 principal components to visualize your data and spot clusters, outliers, or trends. This helps you understand your dataset before diving into complex modeling. Even if PCs are not perfectly interpretable, patterns in PC space often reveal real structure.

  **Quality control**: In genomics, the first few PCs often capture technical artifacts (batch effects, sample quality). You can identify and remove these before analyzing biological signals. If PC1 separates samples by sequencing batch rather than biological condition, you know something went wrong.

  **Noise filtering**: Keep only top PCs that capture signal, discard bottom PCs that capture noise. This is used in image denoising and signal processing. The assumption is that noise is spread across many dimensions while signal concentrates in a few.

  **Feature engineering**: Use PCs as features for machine learning models. Instead of feeding raw pixels, feed the first 100 PCs of an image. This reduces overfitting and speeds up training.

  The key insight is that PCA is rarely the final answer but rather a useful tool in a larger analysis pipeline. You use it to simplify, visualize, or preprocess data before applying other methods.

When should you use PCA?
  **Good when:**
  
  - Features are correlated (redundancy exists)
  - Many features, few samples (curse of dimensionality)
  - Need visualization of high-dimensional data
  
  **Avoid when:**
  
  - Features already uncorrelated
  - Need interpretable features (PCs are linear combinations)
  - Nonlinear structure dominates (use kernel PCA instead)

  .. note::

    Always center your data (subtract mean). If features have different scales, standardize them first (zero mean, unit variance).