Deep Learning
=============

Fundamentals
------------

What is deep learning?
    A technique to find patterns in data using neural networks, which can
    approximate any continuous function. Let's first understand how these
    networks learn through optimization.

What is gradient descent?
    An algorithm that minimizes functions by following their steepest downward
    slope. For a function :math:`f(\theta)`:

    .. math::

        \theta_{n+1} = \theta_n - \eta \nabla f(\theta_n)

    where :math:`\eta` is the learning rate controlling step size. Updates
    continue until changes become negligible. A code sketch of this update rule
    appears at the end of the Advanced Optimization section.

How does it handle multiple parameters?
    Each parameter updates independently using its partial derivative:

    .. math::

        \theta_{i, n+1} = \theta_{i, n} - \eta \frac{\partial f}{\partial \theta_i}
        \bigg|_{\theta = \theta_n} \quad \text{for } i = 1, 2, \ldots, d

    where :math:`d` is the number of parameters.

Training Process
----------------

What is a loss function?
    A measure of prediction error. For example, mean squared error:

    .. math::

        L(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i; \boldsymbol{\theta}))^2

    where:

    - :math:`y_i`: actual value
    - :math:`f(x_i; \boldsymbol{\theta})`: model's prediction
    - :math:`\boldsymbol{\theta}`: model parameters

    Perfect predictions yield zero loss.

What is stochastic gradient descent (SGD)?
    A faster version of gradient descent that uses mini-batches:

    .. math::

        L_{SGD}(\boldsymbol{\theta}) = \frac{1}{B} \sum_{i=1}^{B} (y_i - f(x_i; \boldsymbol{\theta}))^2

    where :math:`B` is the mini-batch size (32-512 samples).

    Benefits:

    - Lower computation per step
    - Natural noise aids optimization
    - Faster training cycles

How do mini-batches work?
    - Random sampling from dataset
    - Typically 32-512 samples per batch
    - Shuffle data between epochs
    - One epoch = full dataset pass

Advanced Optimization
---------------------

What limits SGD?
    Two main challenges:

    1. Oscillations from mini-batch noise
    2. Difficult learning rate selection

What improvements exist?
    Modern optimizers address SGD's limitations:

    - Momentum: Adds velocity term
    - AdaGrad: Parameter-specific rates
    - RMSProp: Adaptive step sizing
    - ADAM: Combines momentum and adaptation

How does ADAM work?
    ADAM adapts each parameter's learning rate using momentum and variance
    estimates:

    .. math::

        \theta_i^{(t+1)} = \theta_i^{(t)} - \eta \frac{\hat{m}_t^{(i)}}{\sqrt{\hat{v}_t^{(i)}} + \epsilon}

    where :math:`\hat{m}_t^{(i)}` tracks momentum and :math:`\hat{v}_t^{(i)}`
    scales updates based on gradient history.

Why is ADAM effective?
    Key benefits:

    - Momentum helps overcome local minima
    - Variance scaling prevents overshooting
    - Adapts to each parameter's landscape
    - Self-adjusts step sizes automatically

How are updates computed?
    For each parameter :math:`\theta_i`, ADAM tracks:

    1. Momentum (first moment):

       .. math::

          m_t^{(i)} = \beta_1 m_{t-1}^{(i)} + (1 - \beta_1) g_t^{(i)}

    2. Variance (second moment):

       .. math::

          v_t^{(i)} = \beta_2 v_{t-1}^{(i)} + (1 - \beta_2) (g_t^{(i)})^2

    3. Bias corrections:

       .. math::

          \hat{m}_t^{(i)} = \frac{m_t^{(i)}}{1 - \beta_1^t}, \quad
          \hat{v}_t^{(i)} = \frac{v_t^{(i)}}{1 - \beta_2^t}

What are ADAM's key implementation details?
    Essential points:

    - Per-parameter tracking
    - Shared hyperparameters
    - Bias correction at start
    - Time index :math:`t` for corrections

    The sketches below walk through plain gradient descent, a mini-batch SGD
    loop, and the ADAM update in code.
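
To make the update rule concrete, here is a minimal NumPy sketch of plain gradient descent. The
quadratic objective, learning rate, and stopping tolerance are illustrative choices rather than
values prescribed above.

.. code-block:: python

    import numpy as np

    def gradient_descent(grad_f, theta0, eta=0.1, tol=1e-6, max_iter=1000):
        """Follow the negative gradient until the updates become negligible."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            step = eta * grad_f(theta)       # eta is the learning rate
            theta = theta - step             # theta_{n+1} = theta_n - eta * grad f(theta_n)
            if np.linalg.norm(step) < tol:   # stop once changes become negligible
                break
        return theta

    # Illustrative objective: f(theta) = (theta - 3)^2 has gradient 2 * (theta - 3), minimum at 3.
    print(gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=[0.0]))  # approximately [3.]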
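
The same idea with mini-batches: a sketch of SGD minimizing the mean squared error of a linear
model :math:`y \approx w x + b` on synthetic data. The data, batch size, learning rate, and epoch
count are again illustrative assumptions.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: y = 2x + 1 plus noise (illustrative only).
    x = rng.uniform(-1.0, 1.0, size=1000)
    y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(1000)

    w, b = 0.0, 0.0
    eta, batch_size, epochs = 0.1, 32, 20     # B = 32 samples per mini-batch

    for _ in range(epochs):                   # one epoch = one full pass over the data
        order = rng.permutation(len(x))       # shuffle between epochs
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            err = (w * xb + b) - yb           # prediction error on this mini-batch
            grad_w = 2.0 * np.mean(err * xb)  # gradients of the mini-batch MSE loss
            grad_b = 2.0 * np.mean(err)
            w -= eta * grad_w
            b -= eta * grad_b

    print(w, b)  # should land close to 2.0 and 1.0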
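
Finally, a sketch of the ADAM update equations above on the same illustrative quadratic. The
hyperparameter values (:math:`\eta = 0.001`, :math:`\beta_1 = 0.9`, :math:`\beta_2 = 0.999`,
:math:`\epsilon = 10^{-8}`) are common defaults, not values taken from this document; in practice
you would normally use a library implementation such as ``torch.optim.Adam`` rather than writing
the loop by hand.

.. code-block:: python

    import numpy as np

    def adam(grad_f, theta0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
        """Per-parameter ADAM updates built from first- and second-moment estimates."""
        theta = np.asarray(theta0, dtype=float)
        m = np.zeros_like(theta)               # first moment (momentum)
        v = np.zeros_like(theta)               # second moment (variance)
        for t in range(1, steps + 1):          # time index t starts at 1
            g = grad_f(theta)
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g**2
            m_hat = m / (1.0 - beta1**t)       # bias corrections matter most early on
            v_hat = v / (1.0 - beta2**t)
            theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        return theta

    # Same illustrative quadratic as before; the minimum of (theta - 3)^2 is at 3.
    print(adam(lambda t: 2.0 * (t - 3.0), theta0=[0.0]))  # approximately [3.]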

Backpropagation (TBA)
---------------------

Activation Functions (TBA)
--------------------------

Build a simple neural network (TBA)
-----------------------------------

Glossary of terms
-----------------

PyTorch
    Open source deep learning framework developed by Meta. Provides automatic differentiation
    (autograd), GPU acceleration via CUDA, and neural network building blocks. Dominant framework
    for research. Supports dynamic computation graphs allowing flexible model architectures.
    Popular for computer vision, NLP, and scientific ML applications including electron microscopy
    analysis.

TensorFlow
    Open source deep learning framework developed by Google. Provides automatic differentiation,
    GPU/TPU acceleration, and production deployment tools. Popular for production ML systems. Uses
    static computation graphs (though TensorFlow 2.x added eager execution). Strong ecosystem for
    mobile and web deployment.

Autograd
    Automatic differentiation. System automatically computes gradients (derivatives) of functions,
    essential for training neural networks via backpropagation. PyTorch and TensorFlow track
    operations on tensors, building computational graph, then compute gradients automatically.
    Eliminates manual derivative calculation, enabling rapid model development.

Backpropagation
    Algorithm for training neural networks. Computes gradients of loss function with respect to
    network weights by applying chain rule backwards through network layers. Enables gradient
    descent optimization. GPU parallelization makes backpropagation fast even for networks with
    millions of parameters.

Gradient descent
    Optimization algorithm iteratively adjusting parameters to minimize loss function. Computes
    gradient (direction of steepest increase), steps in opposite direction. Learning rate controls
    step size. Variants: SGD (stochastic), Adam (adaptive learning rate), RMSprop. Critical for
    training neural networks and iterative reconstruction algorithms.

Batch size
    Number of samples processed together before updating model weights. Larger batches improve GPU
    utilization (more parallelism) and training stability but require more memory. Smaller batches
    provide noisier gradients but may generalize better. Typical values: 16 to 256 for vision
    tasks, 32 to 128 for scientific data.

Epoch
    One complete pass through entire training dataset. Training typically requires multiple epochs
    (10 to 1000s depending on problem). Each epoch, model sees all training samples once. Monitor
    validation loss per epoch to detect overfitting.

Learning rate
    Step size for gradient descent updates. Too high: training diverges. Too low: slow convergence.
    Typical values: 0.001 to 0.0001 for Adam optimizer. Learning rate schedules reduce rate during
    training (cosine annealing, step decay). Critical hyperparameter requiring tuning.

Overfitting
    Model memorizes training data rather than learning generalizable patterns. High training
    accuracy but poor validation accuracy. Causes: too many parameters, too few training samples,
    too many training epochs. Solutions: regularization (dropout, weight decay), data augmentation,
    early stopping.

Regularization
    Techniques preventing overfitting. L1/L2 regularization penalizes large weights. Dropout
    randomly deactivates neurons during training. Data augmentation creates variations of training
    samples. Batch normalization stabilizes training and reduces overfitting. Essential for deep
    networks with millions of parameters.

CNN
    Convolutional Neural Network. Architecture using convolution operations to process images.
    Convolution kernels (3×3, 5×5) slide across image detecting features (edges, textures,
    patterns). Multiple layers build hierarchical representations. Standard for image
    classification, segmentation, object detection.
    Used in microscopy for automated particle picking, defect detection, and pattern recognition.

RNN
    Recurrent Neural Network. Architecture for sequential data with feedback connections creating
    memory. Each step receives current input and previous hidden state. Useful for time series,
    text, video. Variants: LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit). Less common
    in microscopy except for temporal analysis.

Transformer
    Architecture using self-attention mechanism to process sequences. Replaces RNN sequential
    processing with parallel attention allowing much faster training on GPUs. Foundation for modern
    NLP (BERT, GPT) and increasingly used for vision (ViT) and scientific applications. Attention
    mechanism weighs importance of different input regions. Computational complexity scales as
    O(N²) with sequence length.

Attention
    Mechanism allowing model to focus on relevant parts of input. Query-key-value framework
    computes weighted sum where weights indicate importance. Self-attention relates positions
    within single sequence. Cross-attention relates positions between two sequences. Enables
    long-range dependencies and interpretability showing what model attends to.

U-Net
    CNN architecture for image segmentation. Encoder path downsamples image extracting features.
    Decoder path upsamples and refines segmentation. Skip connections preserve spatial information.
    Originally for biomedical image segmentation, now widely used in microscopy for particle
    segmentation, background subtraction, and denoising.

ResNet
    Residual Network. CNN using skip connections (residual connections) allowing training very deep
    networks (50 to 1000+ layers). Skip connections combat vanishing gradient problem. Each block
    learns residual (difference) rather than full transformation. Widely used as backbone for
    computer vision tasks including microscopy image analysis.

Inference
    Using trained model to make predictions on new data. Forward pass through network without
    backpropagation. Much faster than training. Can use lower precision (float16, int8) for speed.
    Real-time inference enables live microscope feedback.

Training
    Process of adjusting model parameters to minimize loss function on training data. Involves
    forward pass (compute predictions), loss calculation, backward pass (compute gradients via
    backpropagation), and weight update (gradient descent). Computationally intensive, benefits
    greatly from GPU acceleration. Can take hours to weeks depending on model size and dataset.

Fine-tuning
    Adapting pre-trained model to new task by continuing training on new dataset. Start with
    weights trained on large dataset (ImageNet), train on smaller domain-specific dataset. Much
    faster than training from scratch. Common when labeled data is limited. Transfer learning
    leverages knowledge from similar tasks.

Data augmentation
    Artificially increasing training data by applying transformations: rotations, flips, crops,
    brightness adjustments, noise addition. Improves generalization and reduces overfitting. For
    microscopy: rotate/flip images, add Poisson noise simulating counting statistics, adjust
    intensity scaling.

Loss function
    Measures difference between model predictions and ground truth. Optimization minimizes loss.
    Common losses: MSE (mean squared error) for regression, cross-entropy for classification, L1
    for robustness to outliers. For microscopy: L1 or L2 for denoising, cross-entropy for particle
    classification, custom losses for reconstruction.

Optimizer
    Algorithm updating model weights based on gradients.
    SGD is basic gradient descent. Adam combines momentum and adaptive learning rates and is the
    most popular default. AdamW adds weight decay. RMSprop adapts learning rate per parameter. Each
    has hyperparameters (learning rate, momentum, decay). Choice affects convergence speed and
    final performance.

Hyperparameter
    Configuration setting for training process, not learned from data. Examples: learning rate,
    batch size, number of layers, dropout rate, regularization strength. Requires manual tuning or
    automated search (grid search, random search, Bayesian optimization). Significantly impacts
    model performance.

Checkpoint
    Saved model state (weights, optimizer state) during training. Enables resuming interrupted
    training, selecting best model based on validation performance, and model deployment. Save
    periodically (every N epochs) and when validation loss improves. Essential for long training
    runs that may fail or be preempted.

Mixed precision training
    Training neural networks using both float16 and float32. Compute in float16 (2x faster, half
    memory) while maintaining critical operations in float32 for numerical stability. Loss scaling
    prevents underflow. Automatic mixed precision (AMP) in PyTorch and TensorFlow handles details
    automatically. Enables larger batch sizes and faster training on modern GPUs with Tensor Cores.

Tensor Core
    Specialized GPU hardware for mixed precision matrix multiplication. Operates on float16 inputs
    producing float32 output. Much faster than standard CUDA cores for dense matrix operations.
    A100 provides 312 TFLOPS (float16) tensor core performance versus 19.5 TFLOPS (float32)
    standard performance. Critical for training large deep learning models efficiently.

TPU
    Tensor Processing Unit. Google's custom ASIC (Application-Specific Integrated Circuit) for
    machine learning. Optimized for TensorFlow. Provides high throughput for training and
    inference. Alternative to NVIDIA GPUs. Cloud-only access via Google Cloud Platform.

NCCL
    NVIDIA Collective Communications Library. Optimized primitives for multi-GPU and multi-node
    communication: all-reduce, broadcast, reduce, gather, scatter. Essential for distributed deep
    learning training. Exploits NVLink for intra-node communication, InfiniBand for inter-node.
    PyTorch Distributed and Horovod use NCCL internally.

Distributed training
    Training single model across multiple GPUs or nodes. Data parallelism replicates model on each
    device, processes different data batches, synchronizes gradients. Model parallelism splits
    model across devices for models too large for single GPU. Pipeline parallelism processes
    different batches at different model stages simultaneously. Achieves near-linear speedup for
    large datasets and models.

Horovod
    Distributed training framework supporting PyTorch, TensorFlow, and others. Provides easy API
    for data-parallel training using MPI and NCCL. Designed for efficiency on HPC clusters. Common
    for large-scale scientific ML training across many GPUs.

Segmentation
    Computer vision task assigning class label to each pixel. Semantic segmentation labels classes
    (particle, background, vacuum). Instance segmentation distinguishes individual objects. For
    microscopy: segment nanoparticles, grain boundaries, defects, or regions of interest.

Classification
    Task assigning single label to entire input. For images: classify patterns by type, identify
    structures, categorize experimental conditions. Binary classification (two classes) or
    multi-class (many classes). Output is probability distribution over classes.

Regression
    Predicting continuous values rather than discrete classes. For microscopy: predict properties
    from images, estimate experimental parameters, determine optimal settings. Loss typically MSE
    or MAE (mean absolute error).

Encoder-Decoder
    Architecture with two parts. Encoder compresses input to lower-dimensional representation
    (bottleneck). Decoder reconstructs or generates output from encoding. Used in autoencoders
    (unsupervised), segmentation (U-Net), sequence-to-sequence models. Compress data to latent
    space, decode to desired output.

Latent space
    Lower-dimensional representation learned by encoder. Captures essential features, discards
    noise. Enables visualization (2D or 3D latent space), interpolation between samples, generation
    of new samples. Variational autoencoders (VAE) and GANs operate in latent space. Useful for
    clustering similar patterns and detecting anomalies.

GAN
    Generative Adversarial Network. Two networks compete: generator creates fake samples,
    discriminator distinguishes real from fake. Training converges when discriminator cannot tell
    difference. Applications: image generation, super-resolution, domain transfer. For microscopy:
    generate synthetic training data, denoise images, enhance resolution.

Batch normalization
    Normalization technique applied after each layer. Normalizes activations to have mean 0 and
    variance 1, reducing internal covariate shift. Stabilizes training, enables higher learning
    rates, provides regularization effect. Standard in modern CNNs. Alternatives: layer
    normalization, group normalization.

Dropout
    Regularization technique randomly setting fraction of neurons to zero during training. Forces
    network to learn redundant representations, preventing co-adaptation of neurons. Dropout rate
    typically 0.2 to 0.5. Disabled during inference. Simple but effective technique reducing
    overfitting.

ReLU
    Rectified Linear Unit. Activation function: f(x) = max(0, x). Most popular activation for
    hidden layers. Computationally efficient, helps avoid vanishing gradient problem. Variants:
    Leaky ReLU (small negative slope), PReLU (learned slope), GELU (smooth approximation).

Softmax
    Activation function for final layer in multi-class classification. Converts logits to
    probability distribution summing to 1. Computed as: softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ). Paired
    with cross-entropy loss for training. For K classes, outputs K probabilities.

Adam
    Adaptive Moment Estimation. Popular optimizer combining momentum and adaptive learning rates
    per parameter. Maintains running averages of gradient (first moment) and squared gradient
    (second moment). Default choice for most deep learning tasks. Hyperparameters: learning rate
    (α ≈ 0.001), β₁ ≈ 0.9, β₂ ≈ 0.999.

SGD
    Stochastic Gradient Descent. Basic optimizer updating weights using gradient of loss on
    mini-batch. Noisy updates but can find better minima than adaptive methods. Often uses momentum
    (accelerates in consistent direction). Requires careful learning rate tuning. Still competitive
    with proper tuning especially for computer vision.

Transfer learning
    Using knowledge from one task to improve performance on another. Typically: train on large
    dataset (ImageNet), fine-tune on small domain-specific dataset. Pre-trained model provides good
    feature extractor. Much faster than training from scratch. Essential technique when labeled
    data is limited.

Pre-trained model
    Model trained on large dataset, weights publicly available. ImageNet pre-trained models
    (ResNet, VGG, EfficientNet) common starting point for computer vision.
    BERT, GPT for NLP. Enables transfer learning. Pre-train on synthetic data, fine-tune on
    experimental data.

Vanishing gradient
    Problem where gradients become extremely small during backpropagation through many layers.
    Prevents effective weight updates in early layers. Common with sigmoid/tanh activations.
    Solutions: ReLU activation, skip connections (ResNet), batch normalization, careful
    initialization.

Exploding gradient
    Problem where gradients become extremely large during backpropagation. Causes unstable training
    with weights jumping to very large values. Solutions: gradient clipping, lower learning rate,
    batch normalization, proper weight initialization.

Weight initialization
    Initial values for neural network weights before training. Poor initialization causes
    vanishing/exploding gradients or slow convergence. Common methods: Xavier/Glorot (uniform
    variance across layers), He initialization (for ReLU), orthogonal initialization. Critical for
    deep networks.

Early stopping
    Regularization technique monitoring validation loss during training. Stop when validation loss
    stops improving to prevent overfitting. Simple but effective. Requires validation set separate
    from training and test sets.

Validation set
    Data subset used to evaluate model during training. Helps detect overfitting and tune
    hyperparameters. Separate from training set (used for weight updates) and test set (final
    evaluation). Typical split: 70% train, 15% validation, 15% test.

Cross-validation
    Technique for robust model evaluation. Split data into K folds, train K models each using
    different fold for validation. Average performance across folds. More reliable than single
    train/validation split. K=5 or K=10 common. Computationally expensive.

Confusion matrix
    Table showing model predictions versus ground truth for classification. Rows are actual
    classes, columns are predicted classes. Diagonal shows correct predictions. Off-diagonal shows
    errors. Enables computing precision, recall, F1 score.

Precision
    Fraction of positive predictions that are correct: TP/(TP+FP). High precision means few false
    positives. Important when false positives are costly.

Recall
    Fraction of actual positives correctly identified: TP/(TP+FN). High recall means few false
    negatives. Important when missing positives is costly. Also called sensitivity or true positive
    rate.

F1 score
    Harmonic mean of precision and recall: 2×(precision×recall)/(precision+recall). Balanced metric
    for classification. Ranges 0 to 1, higher is better. Useful when classes are imbalanced.

References
----------

.. bibliography::
    :filter: docname in docnames
    :style: plain