CUDA fundamentals

This page covers CUDA C++ programming fundamentals for GPU computing. Content is based on the NVIDIA CUDA C++ Programming Guide and tutoring sessions with LLMs.

Note

Looking for hands-on tutorials? See CUDA tutorials for 5 progressive examples from basic 2D matrix operations to production-ready pipelined processing with CUDA streams.

Learning roadmap

Core fundamentals covered on this page:

  • [x] CPU–GPU memory architecture and data transfers

  • [ ] 2D matrix operations with CUDA

  • [ ] Parallelizing computation across threads

  • [ ] Configuring blocks and grids for large data

  • [ ] Optimizing memory access patterns

  • [ ] Using shared memory for performance gains

  • [ ] Synchronization and avoiding race conditions

  • [ ] Profiling and debugging CUDA kernels

  • [ ] Streams and overlapping computation with data transfer

  • [ ] Tiling techniques for matrix multiplication

  • [ ] Warp-level programming and intrinsics

  • [ ] Mixed precision and tensor cores

  • [ ] CUDA libraries (cuBLAS, cuDNN, Thrust)

The topics below go beyond what is covered above. As I learn more, I will document them throughout my website.

The full disciplinary stack (from bottom to top)

Understanding where CUDA fits in the complete computing stack:

  • Quantum and solid-state physics (electron behavior)

  • Semiconductor materials science (how transistors are built)

  • Device and VLSI engineering (how circuits are designed)

  • Computer architecture (how logic becomes compute units)

  • System architecture (how chips form clusters)

  • Compiler and runtime systems (how code maps to hardware)

  • Parallel programming models (CUDA, etc. — how you control it)

  • Numerical methods and algorithms (what problems you solve)

  • Scientific application domains (where GPU computing changes the world)

Intermediate level

Occupancy and scheduling

  • Occupancy modeling (how register pressure and shared memory affect SM concurrency)

  • Memory hierarchy deep dive (L1/L2 caching, prefetching, memory coalescing with hardware cache lines)

  • Asynchronous memory transfers and CUDA Graphs API (building dependency graphs for complex workflows)

Numerical and algorithmic optimization

  • Sparse linear algebra on GPUs (CSR, COO, ELL formats, SpMV kernels)

  • FFT and spectral methods (cuFFT and custom transforms)

  • Stencil computations and PDE solvers (time-stepping, ghost cells, halo exchange)

  • Monte Carlo methods (parallel RNGs like Philox, XORWOW, Sobol)

  • Reduction and scan patterns (warp-shuffle vs shared memory, hierarchical reductions)

Multi-GPU and distributed computing

  • NVLink and PCIe topology tuning

  • Unified Virtual Addressing (UVA) and peer-to-peer access

  • Multi-GPU programming with NCCL and MPI (domain decomposition, collective communication)

  • GPU clusters and CUDA-aware MPI (scaling scientific solvers to supercomputers)

Performance engineering and tools

  • Roofline analysis (quantifying compute vs bandwidth bounds)

  • CUDA Nsight Compute and Nsight Systems (instruction throughput, stalls, dependency graphs)

  • PTX/SASS inspection (reading compiler output to tune instruction-level behavior)

  • Autotuning frameworks (search-based kernel optimization like OpenTuner, Kernel Tuner)

Advanced language and ecosystem topics

  • CUDA with OpenMP/OpenACC interoperability

  • HIP and SYCL portability (writing kernels that run on AMD/Intel GPUs)

  • CUDA Graphs and Cooperative Groups

  • Dynamic parallelism (kernels launching kernels)

  • C++20 features in CUDA (templates, constexpr, thrust-style metaprogramming)

  • Python GPU bindings (Numba, CuPy, PyTorch extensions, CFFI bridges)

Domain-specific scientific applications

  • Finite element and finite volume GPU implementations

  • Molecular dynamics (LAMMPS, GROMACS implementations)

  • GPU-accelerated signal and image processing

  • Machine learning kernels (matrix multiply to custom activation and gradient ops)

  • Hybrid CPU–GPU solvers (splitting work between CPU control and GPU compute)

Research-level and experimental directions

  • Persistent kernels and task-based execution models

  • Graph and irregular computation on GPUs

  • GPU–CPU co-scheduling for heterogeneous architectures

  • Memory compression and precision-adaptive computing

  • Quantum-inspired and tensor network algorithms on GPU

Advanced level

GPU compiler and low-level architecture mastery

  • PTX-to-SASS compilation pipelines (learn how nvcc, ptxas, and driver JIT interact)

  • LLVM and NVVM IR (understanding CUDA’s compiler backend for custom IR passes)

  • Binary instrumentation tools (NVBit, CUPTI for runtime analysis at instruction granularity)

  • Register allocation and instruction scheduling (hardware scoreboards, dependencies, latency hiding)

  • Dynamic kernel specialization (JIT compiling PTX at runtime for adaptive workloads)

Advanced parallel algorithm design

  • GPU graph analytics (frontier-based BFS, PageRank, sparse gather-scatter, libraries like Gunrock)

  • Irregular memory access optimization (handling unstructured meshes and adaptive grids efficiently)

  • Load balancing and work redistribution (dynamic task assignment across warps or GPUs)

  • Parallel prefix sums, scans, and reductions for high-dimensional data

  • Asynchronous collective operations (overlapping computation with communication in multi-GPU systems)

System-level and HPC integration

  • GPUDirect RDMA (direct GPU-to-NIC communication for low-latency distributed HPC)

  • GPUDirect Storage (bypassing CPU for I/O-intensive workflows like HDF5, netCDF, large dataset streaming)

  • CUDA in containerized and cloud environments (Singularity, Docker with NVIDIA runtime, Kubernetes scheduling)

  • Slurm and job scheduler integration for GPU clusters

  • NUMA and CPU pinning for heterogeneous performance optimization

Advanced numerical computing and scientific software design

  • Mixed-precision iterative refinement (combining FP16/FP32/FP64 arithmetic intelligently)

  • GPU-accelerated preconditioners (AMG, ILU, domain-decomposition methods)

  • Automatic differentiation at kernel level (custom gradient operators for scientific ML or differentiable simulators)

  • Spectral element and discontinuous Galerkin methods (GPU mapping of high-order basis functions)

  • Adaptive mesh refinement (AMR) frameworks with dynamic data redistribution between GPUs

Cross-domain acceleration and frontier hardware

  • CUDA with TensorRT and cuQuantum (domain fusion for ML and physics simulations)

  • Hopper and Blackwell architecture deep dives (tensor memory accelerator, FP8 formats, DPX instructions)

  • FPGA–GPU hybrid workflows (offloading pre/post-processing to FPGAs or DPU units)

  • Heterogeneous memory systems (HBM3 vs GDDR7 tradeoffs, Unified Memory oversubscription tuning)

Meta-programming and performance portability

  • Template metaprogramming for kernel generation (compile-time specialization for problem dimensions)

  • Code generation frameworks (building your own mini-Thrust or Tensor compiler)

  • Performance portability layers (Kokkos, RAJA, and their CUDA backends)

  • AutoML for kernel optimization (AI-guided parameter tuning and kernel fusion)

Research and exascale computing

  • Exascale GPU programming models (evolving standards like OpenMP 5, SYCL 2020, CUDA Graphs 2.0)

  • Fault tolerance on GPUs (checkpointing, redundant computation, resilience to transient faults)

  • Energy-efficient GPU computing (DVFS tuning, instruction-level power modeling)

  • Domain decomposition at exascale (overlapping communication, topology-aware scheduling)


Beyond CUDA

Hardware-software co-design

  • SM microarchitecture design (studying warp schedulers, register file banking, SIMT divergence handling at RTL or simulator level)

  • Cache-coherence protocol simulation (using GPGPU-Sim, Multi2Sim, or Accel-Sim to test new GPU memory hierarchies)

  • Custom GPU ISA extensions (experimenting with new PTX opcodes, tensor instructions, or synchronization primitives)

  • Energy–performance tradeoff modeling (designing kernels that exploit power gating and clock domain scaling)

Compiler and runtime research

  • Custom CUDA frontends (using Clang/LLVM to emit NVVM or PTX directly)

  • MLIR and Polyhedral optimization (building domain-specific GPU compilers for PDEs, quantum simulation, or graph analytics)

  • Kernel fusion and scheduling research (compiler passes that reorder kernels for maximum cache reuse and minimal global memory trips)

  • Auto-schedulers and learned compiler models (like TVM or Triton, where AI models learn to generate optimized kernels)

  • Custom runtime systems (adaptive task graphs, persistent kernel scheduling frameworks beyond CUDA Graphs)

Next-generation parallel models

  • Task-based and dataflow execution models (Legion, PaRSEC, Charm++, Kokkos with HPX)

  • Asynchronous many-task (AMT) runtime research (overlapping thousands of fine-grained GPU tasks)

  • Heterogeneous memory systems (disaggregated memory and CXL Compute Express Link for GPU clusters)

  • Speculative parallelism and transactional memory on GPUs

Numerical methods and exascale algorithms

  • Mixed-precision stability theory (rigorous analysis of error propagation in hybrid-precision solvers)

  • Resilient iterative methods (algorithms that recover from bit flips or soft errors during GPU execution)

  • Asynchronous numerical methods (time-stepping without global synchronization barriers)

  • GPU-accelerated quantum simulation (tensor network contraction, variational algorithms, quantum circuit simulation at scale)

  • Adaptive multiphysics frameworks (dynamic load balancing between solvers like fluid, chemistry, radiation on GPUs)

Cross-disciplinary HPC research

  • AI and HPC fusion (neural operators, differentiable physics, scientific machine learning SciML)

  • Physics-informed neural networks (PINNs with custom GPU kernels for PDE learning)

  • Quantum–classical hybrid systems (CUDA interacting with QPUs or FPGA accelerators)

  • Data-driven compiler optimization (reinforcement learning tuning compiler flags and kernel schedules)

  • Exascale visualization pipelines (GPU-based volume rendering of petabyte-scale simulation data)

Emerging paradigms

  • Analog and optical accelerators (using GPU-like APIs to control optical tensor cores or analog compute arrays)

  • Neuromorphic co-processing (integrating CUDA with spiking neural hardware for asynchronous event-driven simulation)

  • GPU simulators for future ISAs (building the next CUDA for post-von-Neumann systems)

  • Software-defined GPUs (virtual GPU orchestration across datacenters, abstracting away physical hardware)

  • Heterogeneous exascale orchestration (co-scheduling GPUs, TPUs, and quantum nodes via unified runtime models)

Fundamental concepts

What is CUDA?

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model that enables developers to use GPUs for general-purpose computing. It extends C++ with keywords to define GPU functions (kernels) and manage device memory.

What is a kernel?

A kernel is a function that runs on the GPU, defined with the __global__ qualifier. It is called from CPU (host) code and executes in parallel across many GPU threads.
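
A minimal sketch of a kernel definition and its launch from host code (hypothetical names):

// A kernel is marked __global__: callable from host code, executed on the device
__global__ void hello(int* out)
{
    out[threadIdx.x] = threadIdx.x;   // each thread writes its own slot
}

// Host code: launch 1 block of 32 threads (d_out is a device allocation)
hello<<<1, 32>>>(d_out);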

What is heterogeneous programming?

Heterogeneous programming refers to using different types of processors (CPUs and GPUs) within a single application to leverage their unique strengths. In CUDA, this means writing code that runs on both the host (CPU) and device (GPU), allowing parallel processing and acceleration of compute-intensive tasks.

How are CUDA threads organized?

CUDA threads are organized hierarchically:

  • Thread: The smallest unit of execution

  • Block: A group of threads that can cooperate via shared memory

  • Grid: A collection of blocks launched by a single kernel

  • Cluster: (compute capability 9.0+) A group of blocks that can cooperate and access each other’s shared memory

The launch configuration is specified using the <<<numBlocks, threadsPerBlock>>> syntax.

What is threadIdx?

Built-in variable containing the thread index within its block. For multi-dimensional blocks, access components as threadIdx.x, threadIdx.y, threadIdx.z.

What is blockIdx?

Built-in variable containing the block index within the grid. For multi-dimensional grids, access components as blockIdx.x, blockIdx.y, blockIdx.z.

What is blockDim?

Built-in variable containing the dimensions (number of threads) of a block in each dimension (x, y, z).
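
Together, blockIdx, blockDim, and threadIdx give each thread a unique global index. A minimal sketch for a 2D launch (hypothetical kernel and variable names):

// Device: map each thread of a 2D launch to one matrix element
__global__ void fillIndex(int* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row
    if (row < height && col < width)
        out[row * width + col] = row * width + col;    // row-major linear index
}

// Host: 16×16 threads per block, grid rounded up to cover width × height
dim3 block(16, 16);
dim3 grid((width + 15) / 16, (height + 15) / 16);
fillIndex<<<grid, block>>>(d_out, width, height);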

How is memory shared at different thread hierarchy levels?

Memory sharing follows the thread hierarchy. Each thread has private registers, threads within a block share block-level shared memory, and all threads can access grid-level global memory. This hierarchy exists because registers are fastest (1 cycle) but limited in size, while global memory is slow (hundreds of cycles) but large. See the Memory hierarchy section below for detailed performance characteristics and visualization.

Memory hierarchy and performance

What is memory coalescing?

Memory coalescing occurs when consecutive threads access consecutive memory addresses, allowing the GPU to fetch all data in a single memory transaction. This dramatically improves performance.

Memory transaction visualization:

GPU Memory Bus (32-byte transactions):
Address: 1000  1004  1008  1012  1016  1020  1024  1028
        ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
Memory: │ A0  │ A1  │ A2  │ A3  │ A4  │ A5  │ A6  │ A7  │
        └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
        └──────────────── One transaction ───────────────┘
        (32 bytes = 8 floats)

Coalesced access (GOOD):
Thread 0 → A0 (1000)
Thread 1 → A1 (1004)
Thread 2 → A2 (1008)
Thread 3 → A3 (1012)
Result: ONE memory transaction for all 4 threads

Scattered access (BAD):
Thread 0 → A0  (1000) ─┐
Thread 1 → A8  (1032)  ├─ Three separate transactions
Thread 2 → A16 (1064) ─┘
Result: THREE memory transactions for 3 threads
Why does row-major layout enable coalescing?

When processing a 2D array row by row, consecutive threads naturally access consecutive memory locations, enabling coalesced access.

Column-major vs row-major access:

2D array in memory (row-major):
0   1   2   3   4   5   6   7   8   9  10  11
└───Row 0───┘└───Row 1───┘└───Row 2───┘

Accessing by rows (coalesced):
Thread 0 → index 0 ┐
Thread 1 → index 1 ├─ Consecutive addresses
Thread 2 → index 2 ┘   ONE transaction

Accessing by columns (scattered):
Thread 0 → index 0 (row 0) ┐
Thread 1 → index 4 (row 1) ├─ Large gaps between addresses
Thread 2 → index 8 (row 2) ┘   Multiple transactions
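
The two access patterns above can be written as kernels. A hedged sketch (hypothetical names): the copies are identical except for which index varies with threadIdx.x.

// Coalesced: consecutive threads (threadIdx.x) advance along a row,
// touching consecutive addresses.
__global__ void copyByRow(const float* in, float* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + col] = in[row * width + col];   // coalesced
}

// Strided: consecutive threads advance down a column, so neighboring
// threads sit `width` floats apart and transactions cannot combine.
__global__ void copyByColumn(const float* in, float* out, int width, int height)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + col] = in[row * width + col];   // strided by `width`
}
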
What is shared memory in CUDA?

Shared memory is fast, on-chip memory that can be accessed by all threads within a block. It enables threads to cooperate by sharing intermediate results, reducing expensive global memory accesses.

What is __syncthreads()?

A barrier synchronization function that ensures all threads in a block reach the same point before proceeding. Essential when using shared memory to prevent race conditions.
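
A minimal sketch combining shared memory and __syncthreads(): a block-level sum reduction (assumes blockDim.x is a power of two no larger than 256; names are hypothetical).

__global__ void blockSum(const float* in, float* blockSums, int n)
{
    __shared__ float tile[256];                       // one slot per thread

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;               // stage data in shared memory
    __syncthreads();                                  // wait until the tile is full

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();                              // all partial sums written before next step
    }

    if (tid == 0)
        blockSums[blockIdx.x] = tile[0];              // one partial sum per block
}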

Thread hierarchy

What are thread block clusters?

Thread block clusters (compute capability 9.0+) are groups of thread blocks that can share data within a kernel. Clusters provide an intermediate level of organization between blocks and the full grid.

Thread hierarchy levels:

Grid (entire kernel launch)
├── Cluster (group of blocks)
│   ├── Block (group of threads)
│   │   ├── Thread 0
│   │   ├── Thread 1
│   │   └── ...
│   └── Block
│       └── Threads...
└── Cluster
    └── Blocks...

Grid of thread block clusters (source: NVIDIA)

How do you specify cluster dimensions?

Use the __cluster_dims__(x, y, z) qualifier in the kernel definition to specify compile-time cluster size.

// Cluster with 2 blocks in X dimension, 1 in Y and Z
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{
    // Kernel code
}

int main()
{
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / 16, N / 16);

    // Grid dimension must be multiple of cluster size
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}
What is the maximum cluster size?

Cluster size limits are hardware dependent. On GPUs that support clusters (compute capability 9.0+, e.g., H100), up to 8 thread blocks per cluster is supported as a portable size; larger clusters may be available as a non-portable opt-in. Check your specific GPU architecture documentation for limits.

Memory hierarchy

What are the memory types in CUDA?

CUDA provides several memory types with different scopes, lifetimes, and performance characteristics.


CUDA memory hierarchy (source: NVIDIA)

Memory hierarchy table:

Memory Type       Scope     Lifetime      Speed           Use Case
───────────────────────────────────────────────────────────────────
Registers         Thread    Thread        Fastest         Local variables
Shared Memory     Block     Block         Very Fast       Thread cooperation
Global Memory     Grid      Application   Slow            Large datasets
Constant Memory   Grid      Application   Fast (cached)   Read-only data
Texture Memory    Grid      Application   Fast (cached)   Spatial locality

Memory sharing visualization:

Grid
┌────────────────────────────────────────────────┐
│  Global Memory (all blocks can access)         │
│  ┌──────────────┐     ┌──────────────┐         │
│  │ Block 0      │     │ Block 1      │         │
│  │ Shared Mem 0 │     │ Shared Mem 1 │         │
│  │ ┌──┐  ┌──┐   │     │ ┌──┐  ┌──┐   │         │
│  │ │T0│  │T1│   │     │ │T0│  │T1│   │         │
│  │ │R │  │R │   │     │ │R │  │R │   │         │
│  │ └──┘  └──┘   │     │ └──┘  └──┘   │         │
│  └──────────────┘     └──────────────┘         │
└────────────────────────────────────────────────┘

R = Thread private registers (fastest, private to each thread)
T0, T1 = Different threads within a block
Shared Mem = Accessible by all threads in block (very fast)
Global Memory = Accessible by all threads in grid (slow)
Why does this memory hierarchy exist?
  • Performance: Registers are fastest (1 cycle), shared memory is very fast (few cycles), global memory is slow (hundreds of cycles)

  • Hardware constraints: GPU streaming multiprocessors (SMs) have limited on-chip memory, so shared memory must be divided among active blocks

  • Programming model: Block-level sharing enables efficient cooperation algorithms (reductions, prefix sums) while grid-level sharing allows communication across the entire problem

Compilation

What is nvcc?

The NVIDIA CUDA Compiler (nvcc) compiles CUDA C/C++ code into executable binaries that run on NVIDIA GPUs. It handles both host (CPU) code and device (GPU) code, generating optimized machine code for each.

What is just-in-time (JIT) compilation?

JIT compilation compiles code at runtime rather than ahead of time. In CUDA, this enables dynamic optimization and GPU code generation based on specific hardware and runtime conditions.

When is JIT compilation useful in machine learning?

JIT compilation is valuable when deploying models across different GPU architectures. PyTorch’s torch.jit.script and torch.jit.trace compile models at runtime, optimizing operations (matrix multiplications, convolutions, custom kernels) for the target GPU’s compute capability. Examples include SM_80 for A100 and SM_86 for RTX 3090. This allows the same model code to run efficiently on different hardware without manual recompilation.

Memory management

What is linear memory?

Linear memory is a contiguous block of memory allocated on the GPU. It is the most common memory type for storing data arrays in CUDA programs and can be accessed efficiently by threads using simple indexing.

How do you allocate and free GPU memory?

Use cudaMalloc() to allocate memory on the device and cudaFree() to release it.

float* d_A;
size_t size = N * sizeof(float);

// Allocate memory on GPU
cudaMalloc(&d_A, size);

// Use memory...

// Free memory
cudaFree(d_A);
How do you copy data between host and device?

Use cudaMemcpy() to transfer data between CPU (host) and GPU (device) memory.

// Copy direction flags:
cudaMemcpyHostToDevice    // CPU → GPU
cudaMemcpyDeviceToHost    // GPU → CPU
cudaMemcpyDeviceToDevice  // GPU → GPU
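
For example, assuming h_A is a host array of N floats and d_A a device allocation of the same size:

// Destination first, then source, byte count, and direction
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);  // CPU → GPU
cudaMemcpy(h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);  // GPU → CPU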

What is the complete workflow for a CUDA program?

A typical CUDA program follows this pattern: allocate host memory, allocate device memory, copy data to device, launch kernel, copy results back, and free memory.

Memory workflow visualization:

Step 1: Allocate host memory (CPU)
┌─────────────────────────────────────┐
│ CPU Memory (Host)                   │
│ ┌──────┬──────┬──────┐              │
│ │ h_A  │ h_B  │ h_C  │              │
│ └──────┴──────┴──────┘              │
└─────────────────────────────────────┘

Step 2: Allocate device memory (GPU)
┌─────────────────────────────────────┐
│ GPU Memory (Device)                 │
│ ┌──────┬──────┬──────┐              │
│ │ d_A  │ d_B  │ d_C  │              │
│ └──────┴──────┴──────┘              │
└─────────────────────────────────────┘

Step 3: Copy input data (CPU → GPU)
┌─────────┐           ┌─────────┐
│ h_A, h_B│ --------→ │ d_A, d_B│
└─────────┘  memcpy   └─────────┘

Step 4: Launch kernel (GPU computes)
┌─────────────────────┐
│ d_C = d_A + d_B     │
│ (parallel on GPU)   │
└─────────────────────┘

Step 5: Copy result back (GPU → CPU)
┌─────────┐           ┌─────────┐
│   d_C   │ ←-------- │   h_C   │
└─────────┘  memcpy   └─────────┘

Step 6: Free memory
// Device kernel
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = 1000000;
    size_t size = N * sizeof(float);

    // 1. Allocate host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors (not shown)

    // 2. Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // 3. Copy input data to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // 4. Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // 5. Copy result back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // 6. Free memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}
What does float* d_A; cudaMalloc(&d_A, size); do?

This allocates GPU memory and stores the address in a pointer; an error-checked variant is sketched after the list below.

  • float* d_A; declares a pointer variable that will hold a GPU memory address

  • cudaMalloc(&d_A, size); allocates size bytes on the GPU and stores the address in d_A

  • The & operator passes the pointer’s address so cudaMalloc can modify it
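
A minimal error-checked variant (a common pattern, with a hypothetical helper name):

#include <cstdio>
#include <cuda_runtime.h>

// Allocate `size` bytes on the device with basic error checking
float* allocDeviceBuffer(size_t size)
{
    float* d_A = nullptr;
    cudaError_t err = cudaMalloc(&d_A, size);      // returns cudaSuccess on success
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }
    return d_A;
}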

Is cudaMemcpy synchronous or asynchronous?

cudaMemcpy is synchronous and blocks CPU execution until the transfer completes. This ensures data consistency but can impact performance.

  • cudaMemcpyHostToDevice: CPU waits until data is fully copied to GPU

  • cudaMemcpyDeviceToHost: CPU waits until GPU finishes all previous work AND copies data back

For asynchronous transfers, use cudaMemcpyAsync with CUDA streams.
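
A hedged sketch of an asynchronous copy overlapped with a kernel (assumes pinned host memory and hypothetical kernel/variable names):

// Asynchronous transfers require page-locked (pinned) host memory
float* h_A;
cudaMallocHost(&h_A, size);                       // pinned allocation

cudaStream_t stream;
cudaStreamCreate(&stream);

// Copy and kernel are issued on the same stream: they run in order
// with respect to each other, but do not block the CPU.
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream);
myKernel<<<blocks, threads, 0, stream>>>(d_A, n);
cudaMemcpyAsync(h_A, d_A, size, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);                    // wait for all work in the stream
cudaStreamDestroy(stream);
cudaFreeHost(h_A);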

Development tools

Installing CUDA toolkit

How do you install CUDA toolkit with conda?

Install CUDA toolkit and set up environment variables for development.

# Activate your environment
conda activate your-env

# Install CUDA toolkit
conda install -c nvidia \
    "cuda-toolkit=12.8.*" \
    "cuda-nvcc=12.8.*" \
    "cuda-cudart=12.8.*" \
    "cuda-cudart-dev=12.8.*" \
    "cuda-cudadevrt=12.8.*" \
    --update-specs

# Set environment variables
export CUDA_HOME="$CONDA_PREFIX"
export PATH="$CUDA_HOME/bin:$PATH"
export CPATH="$CUDA_HOME/targets/x86_64-linux/include:$CPATH"
export LIBRARY_PATH="$CUDA_HOME/targets/x86_64-linux/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"

# Verify installation
nvcc --version
ls $CONDA_PREFIX/targets/x86_64-linux/lib/libcudadevrt.a
How do you compile a CUDA program?

Use nvcc with optimization flags and proper include/library paths.

# Compile with optimization
nvcc -O2 program.cu -o program \
    -I"$CONDA_PREFIX/targets/x86_64-linux/include" \
    -L"$CONDA_PREFIX/targets/x86_64-linux/lib"

# Run
./program

Profiling and monitoring

How do you monitor GPU usage?

Use nvidia-smi to monitor GPU utilization, memory usage, and running processes.

# View current status
nvidia-smi

# Watch continuously (update every 1 second)
watch -n 1 nvidia-smi
What profiling tools are available?
  • nvprof: Legacy CUDA profiler (deprecated)

  • Nsight Compute (ncu): Detailed kernel profiling

  • Nsight Systems (nsys): System-wide performance analysis

# Check if Nsight tools are installed
which ncu
which nsys

PyTorch profiling

How do you profile PyTorch CUDA code?

Use PyTorch’s built-in profiler to analyze CPU and GPU time for operations.

import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True
) as prof:
    # Your training step or iteration
    model(input_data)

# Print top operations by CUDA time
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=10
))
What does high CPU time indicate?

High CPU time often means the CPU is waiting for the GPU or spending time launching many small kernels. Common causes include blocking calls like tensor.item() and torch.cuda.synchronize(), or dispatching many separate kernel launches instead of fused operations.

What are non-fused kernels?

When operations are written as separate tensor operations (add, multiply, FFT, crop), PyTorch dispatches separate CUDA kernels for each, causing multiple reads and writes to GPU memory. This creates overhead from kernel launches and memory transfers.

Non-fused operations:

Operation 1: x = a + b      → Kernel 1: read a,b write x
Operation 2: y = x * c      → Kernel 2: read x,c write y
Operation 3: z = y - d      → Kernel 3: read y,d write z

Result: 3 kernel launches, 6 memory reads, 3 memory writes

Fused version:
z = (a + b) * c - d         → Kernel 1: read a,b,c,d write z

Result: 1 kernel launch, 4 memory reads, 1 memory write
How can you reduce kernel launch overhead?

Use batched operations (gather/scatter) instead of loops with many small operations. Gather reads multiple positions in one operation, and scatter/fold writes batched results back.

Inefficient pattern (many kernels):

# Each iteration launches separate kernels
for pos in scan_positions:
    patch = obj[y:y+h, x:x+w]           # Extract (1 kernel)
    # ... process patch ...
    obj[y:y+h, x:x+w] += delta_patch    # Write back (1 kernel)

Efficient pattern (batched operations):

# GATHER: one operation for all positions
tiles = obj[ys, xs]  # Read all patches at once

# ... process all tiles ...

# SCATTER: one operation to write all results
obj.index_put_((ys, xs), delta_tiles, accumulate=True)

PyTorch optimization techniques

Training strategies

Why use different learning rates for different parameters?

Different parameters have different sensitivities to changes. For example, in ptychography, the probe is more sensitive to noise and artifacts than the object. Using slower learning rates for sensitive parameters prevents overfitting to noise.

Why separate optimizers for different parameter groups?

Different parameter groups often have different characteristics that benefit from separate optimization strategies.

Example parameter characteristics:

Object (high-dimensional, well-conditioned):
- 512×512 pixels = 262,144 amplitude + 262,144 phase parameters
- Smooth loss landscape
- Can handle faster updates

Aberrations (low-dimensional, stiff):
- Handful of parameters (10-100)
- Stiff loss landscape with sharp minima
- Requires careful, slower optimization
What are aten:: operations?

Low-level primitive operations that PyTorch actually executes on the GPU.

Common aten operations:

aten::mul        →  elementwise multiplication (x * y)
aten::add        →  elementwise addition
aten::_fft_c2c   →  complex-to-complex FFT
aten::div        →  elementwise division
aten::pow        →  exponentiation (x ** 2)

torch.compile and kernel fusion

What does torch.compile do?

torch.compile analyzes PyTorch code and generates optimized kernels by fusing operations. It reduces kernel launches per iteration and improves memory access patterns.

How can you optimize complex number operations?

Use conjugate multiplication (Psi.conj() * Psi).real instead of torch.abs(Psi) ** 2. This avoids separate sqrt and square operations, enabling better kernel fusion.

Operation comparison:

Inefficient: torch.abs(Psi) ** 2
Step 1: Compute sqrt(real² + imag²)     → Kernel 1
Step 2: Square the result               → Kernel 2
Total: 2 kernels

Efficient: (Psi.conj() * Psi).real
Step 1: Multiply conjugate pairs        → Kernel 1
Result: real² + imag² (same result)
Total: 1 kernel, fusible with other operations

CUDA graphs

What is the difference between torch.compile and CUDA graphs?

These are complementary optimizations working at different levels.

Optimization levels:

torch.compile (graph-level):
- Fuses operations into larger kernels
- Reduces kernel count per iteration
- Optimizes memory access patterns

CUDA Graphs (runtime-level):
- Records entire GPU operation sequence
- Replays sequence without CPU overhead
- Eliminates Python → CUDA driver latency
- Works with already-fused kernels
What does “capturable” mean for optimizers?

Setting capturable=True keeps the optimizer’s internal state (step counters, momentum buffers, bias correction) entirely on the GPU. This is essential for CUDA Graphs because graph capture cannot include CPU operations or allocations.

# Make optimizer compatible with CUDA graphs
optimizer = torch.optim.Adam(
    params,
    lr=0.001,
    capturable=True  # Keep all state on GPU
)
Why can’t you create tensors during graph capture?

During CUDA Graph capture, all tensors must already exist on the GPU. Creating new tensors (torch.zeros, torch.ones) involves CPU operations and memory allocations that break the graph. Pre-allocate all needed tensors before capture.


How do you use torch.compile effectively?

Precompute constants, use pure tensor operations, and enable full graph mode for maximum fusion.

counts = 1000.0

# Precompute constants once on GPU
with torch.no_grad():
    I_meas = (diffraction_patterns ** 2).contiguous()
    meas_scale = (counts / I_meas.mean(dim=(1,2), keepdim=True)).contiguous()

# Compile model and loss function
model = torch.compile(model, backend="inductor", fullgraph=True)

@torch.compile(backend="inductor", fullgraph=True)
def step_loss(batch_indices, I_meas, meas_scale):
    Psi = model(batch_indices)
    # Use conjugate form for better fusion
    I_pred = (Psi.conj() * Psi).real              # |Ψ|²
    pred_scale = counts / I_pred.mean((1,2), keepdim=True)
    diff = I_pred * pred_scale - I_meas * meas_scale
    return (diff * diff).mean()                   # MSE

# Training loop
for iteration in range(NUM_ITERATIONS):
    loss = step_loss(batch_indices, I_meas, meas_scale)
    optimizer_object.zero_grad(set_to_none=True)
    optimizer_aberrations.zero_grad(set_to_none=True)
    loss.backward()
    optimizer_object.step()
    optimizer_aberrations.step()
What does fullgraph=True do?

It tells the compiler to treat the entire model forward pass as one graph, reducing Python overhead and enabling more fusion. The Inductor backend analyzes the computation graph (FFTs, elementwise ops) and fuses small operations into fewer, larger GPU kernels.

What are the limitations of complex number support?

PyTorch Inductor currently has limited support for complex numbers. Calling .backward() through compiled functions that mix real and complex gradients can cause issues. Consider using custom CUDA kernels for complex-heavy workloads.

Custom CUDA kernels

When should you write custom CUDA kernels?

Custom kernels are valuable when PyTorch’s built-in operations cannot achieve the needed performance or functionality:

  • Cannot fuse complex operations in torch.compile, causing repeated VRAM transfers

  • Need mixed precision (FP16/FP32) beyond PyTorch FFT’s complex64/128

  • Can leverage Tensor cores with FP16 inputs

  • Fuse entire operation sequences into single kernel launch

What is a fused kernel example?

Combine multiple operations that would normally require separate kernels into one kernel.

// Fused intensity loss kernel
__global__ void fused_intensity_loss_kernel(
    Complex* Psi,      // Input: complex wavefield
    float* I_meas,     // Input: measured intensity
    float* meas_scale, // Input: measurement scale
    float* loss        // Output: scalar loss
) {
    // 1. Compute |Psi|²
    // 2. Normalize with online mean calculation
    // 3. Compute MSE
    // All in ONE kernel launch!
}
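
A more concrete, simplified version of the outline above (an assumption-laden sketch: the normalization scales are precomputed on the host, so the online mean step is omitted; the kernel accumulates a sum of squared errors that the host divides by N to get the MSE):

#include <cuComplex.h>

__global__ void fused_intensity_loss(
    const cuFloatComplex* Psi,   // complex wavefield
    const float* I_meas,         // measured intensity
    float pred_scale,            // precomputed prediction scale
    float meas_scale,            // precomputed measurement scale
    float* loss,                 // output: running sum of squared errors
    int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // |Psi|² without sqrt: re² + im²
        float I_pred = Psi[i].x * Psi[i].x + Psi[i].y * Psi[i].y;
        float diff = I_pred * pred_scale - I_meas[i] * meas_scale;
        atomicAdd(loss, diff * diff);   // intensity, scaling, and error in one launch
    }
}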

Floating point precision

What are the different floating point formats?

Different precision levels trade off accuracy for performance and memory usage.

Floating point format comparison:

Format      Bits    Sign    Exponent    Mantissa    Range
──────────────────────────────────────────────────────────
FP32        32      1       8           23          ±10³⁸
FP16        16      1       5           10          ±6.5×10⁴
BF16        16      1       8           7           ±10³⁸

FP32 (float): Standard precision
- Default for most computations
- Good accuracy and range

FP16 (half): Half precision
- 2x memory savings
- Faster on Tensor cores
- Limited range (may underflow/overflow)

BF16 (bfloat16): Brain float
- Same range as FP32 (8-bit exponent)
- Less precision (7-bit mantissa)
- Popular in deep learning
- Better numerical stability than FP16
When should you use mixed precision?

Load and store in FP16 to save memory bandwidth, compute in FP32 for accuracy. For example, load object and probe as FP16, multiply using Tensor cores (accumulate in FP32), then use FP32 for FFT operations.
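
A hedged sketch of this pattern with CUDA’s half-precision type (hypothetical names; real and imaginary parts stored as separate FP16 arrays):

#include <cuda_fp16.h>

// Object and probe are stored as __half to halve memory traffic;
// the arithmetic is done in FP32 for accuracy.
__global__ void complexProductFp16Storage(
    const __half* objRe, const __half* objIm,
    const __half* prbRe, const __half* prbIm,
    float* outRe, float* outIm, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float oR = __half2float(objRe[i]);   // promote to FP32 before computing
        float oI = __half2float(objIm[i]);
        float pR = __half2float(prbRe[i]);
        float pI = __half2float(prbIm[i]);
        outRe[i] = oR * pR - oI * pI;        // complex multiply in FP32
        outIm[i] = oR * pI + oI * pR;
    }
}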

How does mixed precision work with PyTorch complex numbers?

PyTorch FFT requires complex64, but you can store data in FP16 and convert when needed.

Mixed precision workflow:

Step 1: Store in FP16 (save memory)
Object/Probe arrays: FP16 storage

Step 2: Load and compute
Load O, P from FP16
Compute O × P with Tensor cores
Accumulate in FP32

Step 3: Convert for FFT
Convert to complex64 for cuFFT
Perform FFT operations

Additional optimizations

What is optimizer fusion?

Fused optimizers combine all Adam update steps (momentum 1, momentum 2, bias correction, weight update) into one GPU kernel instead of multiple separate elementwise operations. This reduces kernel launch overhead and memory transfers.

What is channels-last memory layout?

PyTorch allows storing tensors in channels-last layout (NHWC instead of NCHW). This improves cache utilization and memory coalescing for CUDA kernels.

Memory layout comparison:

NCHW (default):        [Batch, Channels, Height, Width]
Memory order:          all channel 0, then all channel 1, etc.

NHWC (channels-last):  [Batch, Height, Width, Channels]
Memory order:          all channels for pixel 0, then pixel 1, etc.

Channels-last benefits:
- Better memory coalescing
- Improved cache utilization
- Faster convolutions
What does torch.backends.cudnn.benchmark = True do?

Enables cuDNN to benchmark and select the fastest convolution algorithm for the current tensor shapes. This adds overhead on first run but improves subsequent iterations.

Why is multiplication faster than division on GPUs?

Floating-point multiplication is a simple, fully pipelined operation with high throughput on GPU arithmetic units. Division has no single-instruction hardware path; it is expanded into a multi-instruction sequence (reciprocal approximation plus refinement) and is several times slower. When possible, especially when the same divisor is reused, use reciprocal multiplication (x * (1/y)) instead of division (x / y), as shown in the sketch below.
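
A minimal sketch (hypothetical names): the host computes the reciprocal once and the kernel only multiplies.

// Host precomputes the reciprocal once and passes it to the kernel,
// so each thread multiplies instead of dividing.
__global__ void normalize(const float* in, float* out, float invNorm, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * invNorm;   // was: in[i] / norm
}

// Host:
// float invNorm = 1.0f / norm;     // one division total
// normalize<<<blocks, threads>>>(d_in, d_out, invNorm, n);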

NVIDIA libraries

What are the main NVIDIA GPU libraries?

NVIDIA provides several optimized libraries for common operations:

  • cuBLAS: Basic Linear Algebra Subprograms for GPU-accelerated matrix and vector operations

  • cuDNN: Deep neural network primitives (convolution, pooling, normalization, attention, activation)

  • CUTLASS: C++ template abstractions for high-performance matrix multiplication

  • Thrust: Parallel algorithms library resembling C++ STL for GPU programming

What is Thrust?

Thrust is a parallel algorithms library that provides GPU-accelerated versions of common algorithms with an STL-like interface.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

// Create device vector and initialize
thrust::device_vector<float> v(1000000);
thrust::sequence(v.begin(), v.end());

// Square all elements in parallel
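// (device lambdas like the one below require compiling with nvcc --extended-lambda)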
thrust::transform(
    v.begin(), v.end(), v.begin(),
    [] __device__ (float x) { return x*x; }
);
What is the C++ Standard Template Library?

The STL provides generic containers (vector, map, set) and algorithms (sort, reduce, transform) that work with any data type. They are implemented efficiently using templates and iterators.

What operations does cuDNN support?

cuDNN supports core deep learning operations including convolution, pooling, normalization, attention, and activation layers. These are also available in PyTorch and TensorFlow.

What are other NVIDIA tools?
  • TensorRT: C++ library for optimizing deep learning inference on NVIDIA GPUs

  • Nsight Compute: Detailed kernel profiling and analysis

  • Nsight Systems: System-wide performance analysis

  • Nsight Visual Studio Edition: Visual Studio extension for CUDA development

Debugging and validation

What is race condition checking?

Use the racecheck tool (run via compute-sanitizer --tool racecheck) to detect shared-memory data hazards, such as concurrent writes or simultaneous reads and writes from multiple threads, which cause undefined behavior.

How do you validate GPU results?

Compare GPU results against CPU reference implementation using relative error:

Relative error: |GPU - CPU| / |CPU| < ε

where ε is a small threshold (e.g., 10⁻⁵ for FP32, 10⁻³ for FP16).

Why compare GPU against CPU?

GPU floating-point math can reorder operations or fuse instructions (FMA), leading to slightly different numerical results than CPU. Validation ensures the differences are within acceptable tolerance.
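
A minimal host-side validation sketch (hypothetical helper; eps is the tolerance ε above):

#include <cmath>
#include <cstdio>

// Compare a GPU result against a CPU reference using relative error.
bool validate(const float* gpu, const float* cpu, int n, float eps)
{
    for (int i = 0; i < n; ++i) {
        float denom = std::fabs(cpu[i]) > 1e-12f ? std::fabs(cpu[i]) : 1e-12f;  // avoid divide-by-zero
        float relErr = std::fabs(gpu[i] - cpu[i]) / denom;
        if (relErr > eps) {
            std::printf("Mismatch at %d: gpu=%g cpu=%g relErr=%g\n", i, gpu[i], cpu[i], relErr);
            return false;
        }
    }
    return true;
}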

What is a reproducible baseline?

Establish a reproducible baseline for performance testing and debugging:

  • Single command with all arguments and environment variables specified

  • Fixed random seeds

  • Consistent input data

  • Unit tests for numerical accuracy

  • Version control for code and dependencies

Performance analysis

What questions should you ask when optimizing?

Start with these fundamental questions:

  1. Is this compute-bound or memory-bound? Use profiling to determine the bottleneck

  2. Where are we spending time? Identify top kernels and CPU overhead

  3. Why isn’t the GPU fully utilized? Check occupancy and memory access patterns

  4. Are we using the right algorithms? Consider mathematical reformulations

How do you identify bottlenecks?

Use profiling tools to measure:

  • Kernel execution time

  • Memory bandwidth utilization

  • Compute utilization

  • CPU-GPU transfer overhead

  • Kernel launch overhead

CUDA C++ extensions for PyTorch

When writing custom CUDA kernels for PyTorch (e.g., for ptychography reconstruction, FFTs, or custom neural network layers), you divide code into host (CPU) and device (GPU) operations.

Host code responsibilities

1. Input validation and tensor interfacing

#include <torch/extension.h>

torch::Tensor ptychography_forward(
    torch::Tensor probe,           // complex probe function
    torch::Tensor object,          // object transmission
    torch::Tensor positions        // scan positions
) {
    // Validate inputs
    TORCH_CHECK(probe.is_cuda(), "probe must be CUDA tensor");
    TORCH_CHECK(probe.dtype() == torch::kComplexFloat,
               "probe must be complex32");

    // Get tensor dimensions
    int64_t batch_size = probe.size(0);
    int64_t probe_size = probe.size(1);

2. Memory allocation and data preparation

// Allocate output tensor on GPU
auto options = torch::TensorOptions()
    .dtype(torch::kComplexFloat)
    .device(probe.device());
auto output = torch::zeros({batch_size, probe_size, probe_size}, options);

// Get raw pointers for CUDA kernel
auto probe_ptr = probe.data_ptr<c10::complex<float>>();
auto object_ptr = object.data_ptr<c10::complex<float>>();
auto output_ptr = output.data_ptr<c10::complex<float>>();

3. CUDA kernel launch configuration

// Calculate grid and block dimensions
const int threads = 256;
const int blocks = (batch_size * probe_size * probe_size + threads - 1) / threads;

// Launch kernel
ptychography_kernel<<<blocks, threads>>>(
    probe_ptr,
    object_ptr,
    output_ptr,
    batch_size,
    probe_size
);

// Check for kernel errors
cudaError_t error = cudaGetLastError();
TORCH_CHECK(error == cudaSuccess,
           "CUDA kernel failed: ", cudaGetErrorString(error));

4. Synchronization and output handling

    // Wait for kernel completion
    cudaDeviceSynchronize();

    return output;  // Return PyTorch tensor
}

// Bind to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &ptychography_forward, "Ptychography forward pass");
}

Device code (CUDA kernel)

__global__ void ptychography_kernel(
    const c10::complex<float>* probe,
    const c10::complex<float>* object,
    c10::complex<float>* output,
    int batch_size,
    int probe_size
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = batch_size * probe_size * probe_size;

    if (idx < total) {
        // Complex multiplication for ptychography
        output[idx] = probe[idx] * object[idx];
    }
}

Why this division?

  • Host: Handles PyTorch tensor interface, memory management, validation, kernel orchestration

  • Device: Pure computation (complex arithmetic, FFTs, ptychography operations)

  • PyTorch integration: Host code bridges Python/C++/CUDA seamlessly

Common host operations for scientific computing

  • Input/output tensor validation (types, shapes, device placement)

  • Memory allocation (output tensors, workspace buffers)

  • Data layout conversions (contiguous memory, proper strides)

  • Kernel launch parameters (grid/block sizing)

  • Error checking and synchronization

  • Gradient computation setup (for backward pass)

  • Multi-GPU coordination (if using distributed computing)

Typical workflow pattern

// 1. Validate inputs (host)
TORCH_CHECK(input.is_cuda(), "input must be on CUDA");
TORCH_CHECK(input.is_contiguous(), "input must be contiguous");

// 2. Allocate outputs (host)
auto output = torch::empty_like(input);

// 3. Get raw pointers (host)
auto input_ptr = input.data_ptr<float>();
auto output_ptr = output.data_ptr<float>();

// 4. Launch kernel (host calls device)
const int threads = 256;
const int blocks = (input.numel() + threads - 1) / threads;
my_kernel<<<blocks, threads>>>(input_ptr, output_ptr, input.numel());

// 5. Synchronize (host)
cudaDeviceSynchronize();

// 6. Return PyTorch tensor (host)
return output;

Building and using the extension

setup.py for compilation:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='ptychography_cuda',
    ext_modules=[
        CUDAExtension(
            'ptychography_cuda',
            ['ptychography.cpp', 'ptychography_kernel.cu'],
            extra_compile_args={
                'cxx': ['-O3'],
                'nvcc': ['-O3', '--use_fast_math']
            }
        )
    ],
    cmdclass={'build_ext': BuildExtension}
)

Using in Python:

import torch
import ptychography_cuda

# Create complex tensors on GPU
probe = torch.randn(32, 256, 256, dtype=torch.complex64, device='cuda')
object_wave = torch.randn(256, 256, dtype=torch.complex64, device='cuda')
positions = torch.randn(32, 2, device='cuda')

# Call custom CUDA kernel
result = ptychography_cuda.forward(probe, object_wave, positions)