Deep Learning

Fundamentals

What is deep learning?

A technique for finding patterns in data using neural networks, which can approximate any continuous function on a bounded domain to arbitrary accuracy. Let’s first look at how these networks learn through optimization.

What is gradient descent?

An algorithm that minimizes functions by following their steepest downward slope. For a function \(f(\theta)\):

\[\theta_{n+1} = \theta_n - \eta \nabla f(\theta_n)\]

where \(\eta\) is the learning rate controlling step size. Updates continue until changes become negligible.
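
To make this concrete, here is a minimal NumPy sketch of the update rule on a toy quadratic \(f(\theta) = (\theta - 3)^2\); the function, learning rate, and stopping threshold are illustrative choices, not prescribed values.

```python
import numpy as np

def f(theta):
    # Example objective: minimized at theta = 3
    return (theta - 3.0) ** 2

def grad_f(theta):
    # Analytic derivative of the example objective
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial guess
eta = 0.1     # learning rate
for step in range(1000):
    update = eta * grad_f(theta)
    theta -= update                # theta_{n+1} = theta_n - eta * grad
    if abs(update) < 1e-8:         # stop when changes become negligible
        break

print(f"converged to theta = {theta:.6f} after {step + 1} steps")
```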

How does it handle multiple parameters?

Each parameter is updated using its own partial derivative, all evaluated at the current parameter values:

\[\theta_{i, n+1} = \theta_{i, n} - \eta \frac{\partial f}{\partial \theta_i} \bigg|_{\theta = \theta_n} \quad \text{for } i = 1, 2, ..., d\]

where \(d\) is the number of parameters.
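
In vectorized code all of these per-parameter updates happen in one array operation. A small sketch, again with an invented two-parameter quadratic as the objective:

```python
import numpy as np

def grad_f(theta):
    # Gradient of f(theta) = (theta_1 - 1)^2 + 10 * (theta_2 + 2)^2,
    # an arbitrary example objective with minimum at (1, -2)
    return np.array([2.0 * (theta[0] - 1.0), 20.0 * (theta[1] + 2.0)])

theta = np.zeros(2)   # both parameters start at 0
eta = 0.05
for _ in range(2000):
    theta -= eta * grad_f(theta)   # each component uses its own partial derivative

print(theta)   # approaches [1, -2]
```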

Training Process

What is a loss function?

A measure of prediction error. For example, mean squared error:

\[L(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i; \boldsymbol{\theta}))^2\]
where:
  • \(y_i\): actual value

  • \(f(x_i; \boldsymbol{\theta})\): model’s prediction

  • \(\boldsymbol{\theta}\): model parameters

Perfect predictions yield zero loss.
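
A short NumPy sketch of this loss for a toy linear model \(f(x; \theta) = \theta_0 + \theta_1 x\); the data and parameter values are invented for illustration:

```python
import numpy as np

def predict(x, theta):
    # Toy linear model f(x; theta) = theta_0 + theta_1 * x
    return theta[0] + theta[1] * x

def mse_loss(x, y, theta):
    # Mean squared error over N samples
    return np.mean((y - predict(x, theta)) ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 0.5 * x   # data generated exactly by theta = (2, 0.5)

print(mse_loss(x, y, np.array([0.0, 0.0])))   # 7.875: large loss for a bad guess
print(mse_loss(x, y, np.array([2.0, 0.5])))   # 0.0: perfect predictions give zero loss
```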

What is stochastic gradient descent (SGD)?

A variant of gradient descent that computes the loss (and its gradient) on a small random mini-batch instead of the full dataset:

\[L_{SGD}(\boldsymbol{\theta}) = \frac{1}{B} \sum_{i=1}^{B} (y_i - f(x_i; \boldsymbol{\theta}))^2\]
where \(B\) is the mini-batch size (typically 32-512 samples) and the sum runs over the samples in the current batch; a minimal sketch follows the list below. Benefits:
  • Lower computation per step

  • Natural noise aids optimization

  • Faster training cycles
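
The sketch below (continuing the toy linear-regression example) draws one random mini-batch and takes a single SGD step; the batch size, learning rate, and synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: y = 2 + 0.5 x plus noise (invented for illustration)
N = 1000
x = rng.uniform(0.0, 10.0, size=N)
y = 2.0 + 0.5 * x + rng.normal(scale=0.1, size=N)

theta = np.zeros(2)   # (intercept, slope)
eta = 0.01
B = 64                # mini-batch size

# Draw one mini-batch at random
idx = rng.choice(N, size=B, replace=False)
xb, yb = x[idx], y[idx]

# Gradient of the mini-batch MSE with respect to (theta_0, theta_1)
residual = yb - (theta[0] + theta[1] * xb)
grad = np.array([-2.0 * residual.mean(), -2.0 * (residual * xb).mean()])

theta -= eta * grad   # one SGD update
print(theta)          # one step moves the parameters toward the true values (2, 0.5)
```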

How do mini-batches work?
  • Random sampling from dataset

  • Typically 32-512 samples per batch

  • Shuffle data between epochs

  • One epoch = one full pass through the dataset (see the loop sketch below)
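
Putting these points together, a minimal NumPy-only sketch of the shuffling and batching loop; the dataset, batch size, and epoch count are placeholders, and the actual gradient update is left as a comment:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))   # invented dataset: 1000 samples, 3 features
y = rng.normal(size=1000)

batch_size = 64
num_epochs = 5

for epoch in range(num_epochs):
    # Shuffle data between epochs
    perm = rng.permutation(len(x))
    x_shuffled, y_shuffled = x[perm], y[perm]

    # One epoch = one full pass over the dataset, in mini-batches
    for start in range(0, len(x), batch_size):
        xb = x_shuffled[start:start + batch_size]
        yb = y_shuffled[start:start + batch_size]
        # (compute the mini-batch loss and gradient here, then update the parameters)
```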

Advanced Optimization

What limits SGD?
Two main challenges:
  1. Oscillations from mini-batch noise

  2. Difficult learning rate selection

What improvements exist?
Modern optimizers address SGD’s limitations:
  • Momentum: Adds velocity term

  • AdaGrad: Parameter-specific rates

  • RMSProp: Adaptive step sizing

  • ADAM: Combines momentum and adaptation

How does ADAM work?

ADAM adapts each parameter’s learning rate using momentum and variance estimates:

\[\theta_i^{(t+1)} = \theta_i^{(t)} - \eta \frac{\hat{m}_t^{(i)}}{\sqrt{\hat{v}_t^{(i)}} + \epsilon}\]

where \(\hat{m}_t^{(i)}\) tracks momentum and \(\hat{v}_t^{(i)}\) scales updates based on gradient history.

Why is ADAM effective?
Key benefits:
  • Momentum helps overcome local minima

  • Variance scaling prevents overshooting

  • Adapts to each parameter’s landscape

  • Self-adjusts step sizes automatically

How are updates computed?

For each parameter \(\theta_i\), ADAM tracks:

  1. Momentum (first moment):

\[m_t^{(i)} = \beta_1 m_{t-1}^{(i)} + (1 - \beta_1) g_t^{(i)}\]
  2. Variance (second moment):

\[v_t^{(i)} = \beta_2 v_{t-1}^{(i)} + (1 - \beta_2) (g_t^{(i)})^2\]
  3. Bias corrections:

\[\hat{m}_t^{(i)} = \frac{m_t^{(i)}}{1 - \beta_1^t}, \quad \hat{v}_t^{(i)} = \frac{v_t^{(i)}}{1 - \beta_2^t}\]

Here \(g_t^{(i)}\) is the gradient of the loss with respect to \(\theta_i\) at step \(t\), and \(\beta_1 \approx 0.9\), \(\beta_2 \approx 0.999\) are decay rates.
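
These three steps translate almost line for line into code. A from-scratch sketch for a small parameter vector, using an invented quadratic objective and the commonly used default hyperparameters:

```python
import numpy as np

def grad_f(theta):
    # Gradient of an arbitrary example objective with minimum at (1, -2)
    return np.array([2.0 * (theta[0] - 1.0), 20.0 * (theta[1] + 2.0)])

theta = np.zeros(2)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

m = np.zeros_like(theta)   # first moment (momentum) per parameter
v = np.zeros_like(theta)   # second moment (variance) per parameter

for t in range(1, 5001):                  # time index starts at 1 for bias correction
    g = grad_f(theta)
    m = beta1 * m + (1 - beta1) * g       # momentum update
    v = beta2 * v + (1 - beta2) * g**2    # variance update
    m_hat = m / (1 - beta1**t)            # bias corrections matter most at the start
    v_hat = v / (1 - beta2**t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)   # approaches [1, -2]
```
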
What are ADAM’s key implementation details?
Essential points:
  • Per-parameter tracking

  • Shared hyperparameters

  • Bias correction at start

  • Time index t for corrections

Backpropagation (TBA)

Activation Functions (TBA)

Build a simple neural network (TBA)

Glossary of terms

PyTorch

Open source deep learning framework developed by Meta. Provides automatic differentiation (autograd), GPU acceleration via CUDA, and neural network building blocks. Dominant framework for research. Supports dynamic computation graphs allowing flexible model architectures. Popular for computer vision, NLP, and scientific ML applications including electron microscopy analysis.

TensorFlow

Open source deep learning framework developed by Google. Provides automatic differentiation, GPU/TPU acceleration, and production deployment tools. Popular for production ML systems. Uses static computation graphs (though TensorFlow 2.x added eager execution). Strong ecosystem for mobile and web deployment.

Autograd

Automatic differentiation. System automatically computes gradients (derivatives) of functions, essential for training neural networks via backpropagation. PyTorch and TensorFlow track operations on tensors, building computational graph, then compute gradients automatically. Eliminates manual derivative calculation, enabling rapid model development.
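
A minimal sketch of PyTorch autograd in action (the function being differentiated is arbitrary):

```python
import torch

# Track operations on x so gradients can be computed automatically
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Build a computational graph: y = x_0^2 + 3 * x_1
y = x[0] ** 2 + 3.0 * x[1]

# Backpropagate through the graph
y.backward()

print(x.grad)   # tensor([4., 3.]): dy/dx_0 = 2*x_0 = 4, dy/dx_1 = 3
```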

Backpropagation

Algorithm for training neural networks. Computes gradients of loss function with respect to network weights by applying chain rule backwards through network layers. Enables gradient descent optimization. GPU parallelization makes backpropagation fast even for networks with millions of parameters.

Gradient descent

Optimization algorithm iteratively adjusting parameters to minimize loss function. Computes gradient (direction of steepest increase), steps in opposite direction. Learning rate controls step size. Variants: SGD (stochastic), Adam (adaptive learning rate), RMSprop. Critical for training neural networks and iterative reconstruction algorithms.

Batch size

Number of samples processed together before updating model weights. Larger batches improve GPU utilization (more parallelism) and training stability but require more memory. Smaller batches provide noisier gradients but may generalize better. Typical values: 16 to 256 for vision tasks, 32 to 128 for scientific data.

Epoch

One complete pass through entire training dataset. Training typically requires multiple epochs (10 to 1000s depending on problem). Each epoch, model sees all training samples once. Monitor validation loss per epoch to detect overfitting.

Learning rate

Step size for gradient descent updates. Too high: training diverges. Too low: slow convergence. Typical values: 0.001 to 0.0001 for Adam optimizer. Learning rate schedules reduce rate during training (cosine annealing, step decay). Critical hyperparameter requiring tuning.

Overfitting

Model memorizes training data rather than learning generalizable patterns. High training accuracy but poor validation accuracy. Causes: too many parameters, too few training samples, too many training epochs. Solutions: regularization (dropout, weight decay), data augmentation, early stopping.

Regularization

Techniques preventing overfitting. L1/L2 regularization penalizes large weights. Dropout randomly deactivates neurons during training. Data augmentation creates variations of training samples. Batch normalization stabilizes training and reduces overfitting. Essential for deep networks with millions of parameters.

CNN

Convolutional Neural Network. Architecture using convolution operations to process images. Convolution kernels (3×3, 5×5) slide across image detecting features (edges, textures, patterns). Multiple layers build hierarchical representations. Standard for image classification, segmentation, object detection. Used in microscopy for automated particle picking, defect detection, and pattern recognition.

RNN

Recurrent Neural Network. Architecture for sequential data with feedback connections creating memory. Each step receives current input and previous hidden state. Useful for time series, text, video. Variants: LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit). Less common in microscopy except for temporal analysis.

Transformer

Architecture using self-attention mechanism to process sequences. Replaces RNN sequential processing with parallel attention allowing much faster training on GPUs. Foundation for modern NLP (BERT, GPT) and increasingly used for vision (ViT) and scientific applications. Attention mechanism weighs importance of different input regions. Computational complexity scales as O(N²) with sequence length.

Attention

Mechanism allowing model to focus on relevant parts of input. Query-key-value framework computes weighted sum where weights indicate importance. Self-attention relates positions within single sequence. Cross-attention relates positions between two sequences. Enables long-range dependencies and interpretability showing what model attends to.

U-Net

CNN architecture for image segmentation. Encoder path downsamples image extracting features. Decoder path upsamples and refines segmentation. Skip connections preserve spatial information. Originally for biomedical image segmentation, now widely used in microscopy for particle segmentation, background subtraction, and denoising.

ResNet

Residual Network. CNN using skip connections (residual connections) allowing training very deep networks (50 to 1000+ layers). Skip connections combat vanishing gradient problem. Each block learns residual (difference) rather than full transformation. Widely used as backbone for computer vision tasks including microscopy image analysis.

Inference

Using trained model to make predictions on new data. Forward pass through network without backpropagation. Much faster than training. Can use lower precision (float16, int8) for speed. Real-time inference enables live microscope feedback.

Training

Process of adjusting model parameters to minimize loss function on training data. Involves forward pass (compute predictions), loss calculation, backward pass (compute gradients via backpropagation), and weight update (gradient descent). Computationally intensive, benefits greatly from GPU acceleration. Can take hours to weeks depending on model size and dataset.
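
A compact PyTorch sketch of this cycle on synthetic data; the model architecture, data shapes, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

# Placeholder model and synthetic data (shapes are arbitrary)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x = torch.randn(256, 10)
y = torch.randn(256, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    optimizer.zero_grad()      # clear gradients from the previous step
    pred = model(x)            # forward pass
    loss = loss_fn(pred, y)    # loss calculation
    loss.backward()            # backward pass (backpropagation)
    optimizer.step()           # weight update (gradient descent)
    print(epoch, loss.item())
```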

Fine-tuning

Adapting pre-trained model to new task by continuing training on new dataset. Start with weights trained on large dataset (ImageNet), train on smaller domain-specific dataset. Much faster than training from scratch. Common when labeled data is limited. Transfer learning leverages knowledge from similar tasks.
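
A sketch of the usual head-replacement pattern in PyTorch/torchvision, assuming a reasonably recent torchvision; the class count and the choice to freeze the backbone are illustrative:

```python
import torch.nn as nn
import torchvision

# Start from ImageNet-pretrained weights
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Replace the classification head for the new task (placeholder: 4 classes)
model.fc = nn.Linear(model.fc.in_features, 4)

# Optionally freeze the pretrained backbone and train only the new head first
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# ...then continue training on the smaller domain-specific dataset as usual...
```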

Data augmentation

Artificially increasing training data by applying transformations: rotations, flips, crops, brightness adjustments, noise addition. Improves generalization and reduces overfitting. For microscopy: rotate/flip images, add Poisson noise simulating counting statistics, adjust intensity scaling.
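
A NumPy-only sketch of the microscopy-style augmentations mentioned here (flips, 90° rotations, intensity scaling, Poisson noise); the probabilities and scaling ranges are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly transformed copy of a 2D image (illustrative choices)."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                    # random horizontal flip
    if rng.random() < 0.5:
        out = np.flipud(out)                    # random vertical flip
    out = np.rot90(out, k=rng.integers(4))      # random 90-degree rotation
    out = out * rng.uniform(0.9, 1.1)           # intensity scaling
    out = rng.poisson(np.clip(out, 0, None) * 100) / 100.0   # Poisson (counting) noise
    return out

image = rng.random((64, 64))   # placeholder image
augmented = augment(image)
print(augmented.shape)
```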

Loss function

Measures difference between model predictions and ground truth. Optimization minimizes loss. Common losses: MSE (mean squared error) for regression, cross-entropy for classification, L1 for robustness to outliers. For microscopy: L1 or L2 for denoising, cross-entropy for particle classification, custom losses for reconstruction.

Optimizer

Algorithm updating model weights based on gradients. SGD is basic gradient descent. Adam combines momentum and adaptive learning rates, most popular default. AdamW adds weight decay. RMSprop adapts learning rate per parameter. Each has hyperparameters (learning rate, momentum, decay). Choice affects convergence speed and final performance.

Hyperparameter

Configuration setting for training process, not learned from data. Examples: learning rate, batch size, number of layers, dropout rate, regularization strength. Requires manual tuning or automated search (grid search, random search, Bayesian optimization). Significantly impacts model performance.

Checkpoint

Saved model state (weights, optimizer state) during training. Enables resuming interrupted training, selecting the best model based on validation performance, and model deployment. Save periodically (every N epochs) and when validation loss improves. Essential for long training runs that may fail or be preempted.

Mixed precision training

Training neural networks using both float16 and float32. Compute in float16 (2x faster, half memory) while maintaining critical operations in float32 for numerical stability. Loss scaling prevents underflow. Automatic mixed precision (AMP) in PyTorch and TensorFlow handles details automatically. Enables larger batch sizes and faster training on modern GPUs with Tensor Cores.
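
A sketch of the common PyTorch AMP pattern (autocast plus gradient scaling); the model and data are placeholders, and a CUDA device is assumed for the float16 path:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1).to(device)   # placeholder model
x = torch.randn(64, 10, device=device)
y = torch.randn(64, 1, device=device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))   # loss scaling

for _ in range(10):
    optimizer.zero_grad()
    # Run the forward pass in float16 where safe, float32 where needed
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()   # scale loss to prevent float16 underflow
    scaler.step(optimizer)          # unscale gradients, then update weights
    scaler.update()
```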

Tensor Core

Specialized GPU hardware for mixed precision matrix multiplication. Operates on float16 inputs producing float32 output. Much faster than standard CUDA cores for dense matrix operations. A100 provides 312 TFLOPS (float16) tensor core performance versus 19.5 TFLOPS (float32) standard performance. Critical for training large deep learning models efficiently.

TPU

Tensor Processing Unit. Google’s custom ASIC (Application-Specific Integrated Circuit) for machine learning. Optimized for TensorFlow. Provides high throughput for training and inference. Alternative to NVIDIA GPUs. Cloud-only access via Google Cloud Platform.

NCCL

NVIDIA Collective Communications Library. Optimized primitives for multi-GPU and multi-node communication: all-reduce, broadcast, reduce, gather, scatter. Essential for distributed deep learning training. Exploits NVLink for intra-node communication, InfiniBand for inter-node. PyTorch Distributed and Horovod use NCCL internally.

Distributed training

Training single model across multiple GPUs or nodes. Data parallelism replicates model on each device, processes different data batches, synchronizes gradients. Model parallelism splits model across devices for models too large for single GPU. Pipeline parallelism processes different batches at different model stages simultaneously. Achieves near-linear speedup for large datasets and models.

Horovod

Distributed training framework supporting PyTorch, TensorFlow, and others. Provides easy API for data-parallel training using MPI and NCCL. Designed for efficiency on HPC clusters. Common for large-scale scientific ML training across many GPUs.

Segmentation

Computer vision task assigning class label to each pixel. Semantic segmentation labels classes (particle, background, vacuum). Instance segmentation distinguishes individual objects. For microscopy: segment nanoparticles, grain boundaries, defects, or regions of interest.

Classification

Task assigning single label to entire input. For images: classify patterns by type, identify structures, categorize experimental conditions. Binary classification (two classes) or multi-class (many classes). Output is probability distribution over classes.

Regression

Predicting continuous values rather than discrete classes. For microscopy: predict properties from images, estimate experimental parameters, determine optimal settings. Loss typically MSE or MAE (mean absolute error).

Encoder-Decoder

Architecture with two parts. Encoder compresses input to lower-dimensional representation (bottleneck). Decoder reconstructs or generates output from encoding. Used in autoencoders (unsupervised), segmentation (U-Net), sequence-to-sequence models. Compress data to latent space, decode to desired output.

Latent space

Lower-dimensional representation learned by encoder. Captures essential features, discards noise. Enables visualization (2D or 3D latent space), interpolation between samples, generation of new samples. Variational autoencoders (VAE) and GANs operate in latent space. Useful for clustering similar patterns and detecting anomalies.

GAN

Generative Adversarial Network. Two networks compete: generator creates fake samples, discriminator distinguishes real from fake. Training converges when discriminator cannot tell difference. Applications: image generation, super-resolution, domain transfer. For microscopy: generate synthetic training data, denoise images, enhance resolution.

Batch normalization

Normalization technique applied after each layer. Normalizes activations to have mean 0 and variance 1, reducing internal covariate shift. Stabilizes training, enables higher learning rates, provides regularization effect. Standard in modern CNNs. Alternatives: layer normalization, group normalization.

Dropout

Regularization technique randomly setting fraction of neurons to zero during training. Forces network to learn redundant representations, preventing co-adaptation of neurons. Dropout rate typically 0.2 to 0.5. Disabled during inference. Simple but effective technique reducing overfitting.
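
A minimal NumPy sketch of inverted dropout as described above; the rate is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: zero a random fraction of units, rescale the rest."""
    if not training:
        return activations                       # disabled during inference
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)     # rescale to keep the expected value

h = np.ones(10)
print(dropout(h, rate=0.5))         # roughly half the entries are zeroed
print(dropout(h, training=False))   # unchanged at inference time
```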

ReLU

Rectified Linear Unit. Activation function: f(x) = max(0, x). Most popular activation for hidden layers. Computationally efficient, helps avoid vanishing gradient problem. Variants: Leaky ReLU (small negative slope), PReLU (learned slope), GELU (smooth approximation).

Softmax

Activation function for final layer in multi-class classification. Converts logits to probability distribution summing to 1. Computed as: softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ). Paired with cross-entropy loss for training. For K classes, outputs K probabilities.
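
A numerically stable NumPy version of this formula:

```python
import numpy as np

def softmax(logits):
    """Convert logits to a probability distribution summing to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())   # approximately [0.66, 0.24, 0.10], sums to 1.0
```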

Adam

Adaptive Moment Estimation. Popular optimizer combining momentum and adaptive learning rates per parameter. Maintains running averages of gradient (first moment) and squared gradient (second moment). Default choice for most deep learning tasks. Hyperparameters: learning rate (α ≈ 0.001), β₁ ≈ 0.9, β₂ ≈ 0.999.

SGD

Stochastic Gradient Descent. Basic optimizer updating weights using gradient of loss on mini-batch. Noisy updates but can find better minima than adaptive methods. Often uses momentum (accelerates in consistent direction). Requires careful learning rate tuning. Still competitive with proper tuning especially for computer vision.

Transfer learning

Using knowledge from one task to improve performance on another. Typically: train on large dataset (ImageNet), fine-tune on small domain-specific dataset. Pre-trained model provides good feature extractor. Much faster than training from scratch. Essential technique when labeled data is limited.

Pre-trained model

Model trained on large dataset, weights publicly available. ImageNet pre-trained models (ResNet, VGG, EfficientNet) common starting point for computer vision. BERT, GPT for NLP. Enables transfer learning. Pre-train on synthetic data, fine-tune on experimental data.

Vanishing gradient

Problem where gradients become extremely small during backpropagation through many layers. Prevents effective weight updates in early layers. Common with sigmoid/tanh activations. Solutions: ReLU activation, skip connections (ResNet), batch normalization, careful initialization.

Exploding gradient

Problem where gradients become extremely large during backpropagation. Causes unstable training with weights jumping to very large values. Solutions: gradient clipping, lower learning rate, batch normalization, proper weight initialization.

Weight initialization

Initial values for neural network weights before training. Poor initialization causes vanishing/exploding gradients or slow convergence. Common methods: Xavier/Glorot (uniform variance across layers), He initialization (for ReLU), orthogonal initialization. Critical for deep networks.

Early stopping

Regularization technique monitoring validation loss during training. Stop when validation loss stops improving to prevent overfitting. Simple but effective. Requires validation set separate from training and test sets.
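
A sketch of the patience-based pattern; `train_one_epoch` and `validate` are hypothetical callables standing in for real training and validation code:

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5):
    """Stop when validation loss has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()              # stand-in for one pass over the training data
        val_loss = validate()          # stand-in for evaluation on the validation set
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0   # new best: keep training
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"stopping at epoch {epoch}, best validation loss {best_loss:.3f}")
                return

# Toy usage: validation loss improves for a few epochs, then plateaus
fake_losses = iter([1.0, 0.8, 0.7, 0.72, 0.73, 0.74, 0.75, 0.76] + [0.8] * 100)
train_with_early_stopping(lambda: None, lambda: next(fake_losses))
```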

Validation set

Data subset used to evaluate model during training. Helps detect overfitting and tune hyperparameters. Separate from training set (used for weight updates) and test set (final evaluation). Typical split: 70% train, 15% validation, 15% test.

Cross-validation

Technique for robust model evaluation. Split data into K folds, train K models each using different fold for validation. Average performance across folds. More reliable than single train/validation split. K=5 or K=10 common. Computationally expensive.

Confusion matrix

Table showing model predictions versus ground truth for classification. Rows are actual classes, columns are predicted classes. Diagonal shows correct predictions. Off-diagonal shows errors. Enables computing precision, recall, F1 score.

Precision

Fraction of positive predictions that are correct: TP/(TP+FP). High precision means few false positives. Important when false positives are costly.

Recall

Fraction of actual positives correctly identified: TP/(TP+FN). High recall means few false negatives. Important when missing positives is costly. Also called sensitivity or true positive rate.

F1 score

Harmonic mean of precision and recall: 2×(precision×recall)/(precision+recall). Balanced metric for classification. Ranges 0 to 1, higher is better. Useful when classes are imbalanced.
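
Precision, recall, and F1 reduce to a few lines given the confusion-matrix counts; the example numbers below are made up:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 80 true positives, 10 false positives, 20 false negatives
print(precision_recall_f1(tp=80, fp=10, fn=20))   # (0.888..., 0.8, 0.842...)
```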

References