CUDA tutorials
This page contains progressive tutorials for learning CUDA C++ programming through practical examples. Each tutorial builds on the previous one, from basic array addition to production-ready pipelined processing with CUDA streams.
For fundamental CUDA concepts, architecture, and memory model, see CUDA fundamentals.
Tutorial progression
Note
These tutorials progress from simple to realistic production scenarios:
Memory basics (0/6): 1D array addition with memory management fundamentals
Single block (1/6): Basic 2D matrix addition with 16×16 matrix using one block
Multiple blocks (2/6): Scale to 256×256 matrix using 256 blocks
Batch processing (3/6): Process 1,000 matrices simultaneously (65.5M elements)
Chunked processing (4/6): Handle 100,000 matrices with limited VRAM
Pipelined streams (5/6): Overlap data transfer and computation for maximum performance
Shared memory (6/6): Matrix multiplication with tiling for 10-20× speedup
Tutorial 0: Memory management basics
Understanding how to allocate and transfer data between CPU (host) and GPU (device) is fundamental to CUDA programming. This tutorial shows a complete example of 1D array addition.
Memory workflow visualization:
Computer with CPU and GPU:

   CPU side                          GPU side
   RAM (system memory)               VRAM (GPU memory)
   ┌─────────────────┐               ┌──────────────────┐
   │ h_A[1.0, 2.0]   │── memcpy ────▶│ d_A[1.0, 2.0]    │
   │ h_B[3.0, 4.0]   │               │ d_B[3.0, 4.0]    │
   └─────────────────┘               └──────────────────┘
                                              │ Kernel accesses VRAM
                                              ▼ using d_A, d_B
                                     ┌──────────────────┐
                                     │ 1000 GPU threads │
                                     │ compute in VRAM  │
                                     └──────────────────┘

Workflow:
1. Allocate h_A in RAM (CPU memory)
2. Allocate d_A in VRAM (GPU memory) using cudaMalloc
3. Copy h_A → d_A using cudaMemcpy (RAM to VRAM, happens once)
4. Launch kernel with d_A pointer (just passing an address, not copying data)
5. Kernel threads access d_A directly in VRAM (fast, no copying)
Here is a pedagogical example of adding two arrays:
// Problem: We want 1000 GPU threads to add two arrays
// Arrays must be in VRAM, and threads need to know where to find them

__global__ void addArrays(float* A, float* B, float* C, int N) {
    int i = threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];  // Accesses data in VRAM
    }
}

int main() {
    int N = 1000;
    size_t size = N * sizeof(float);  // 1000 × 4 bytes = 4000 bytes

    // Step 1: Allocate and initialize arrays in CPU RAM
    float *h_A = (float*)malloc(size);  // h_ = host (CPU)
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);
    // malloc(size) behind the scenes:
    // 1. Find 4000 bytes of free RAM
    // 2. Return starting address of the allocated memory (e.g., 0x5000)
    // 3. Store that address in h_A

    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Step 2: Allocate arrays in GPU VRAM
    float *d_A, *d_B, *d_C;  // d_ = device (GPU)
    cudaMalloc(&d_A, size);
    // Behind the scenes:
    // 1. Find 4000 bytes of free VRAM
    // 2. Get starting address of that VRAM block (e.g., 0x7f8a4b000000)
    // 3. Write that starting address into d_A (on CPU)
    // Result: d_A points to first element of the array in VRAM
    //
    // Summary:
    // d_A is a pointer variable defined as `float* d_A`
    //   =>  d_A = VRAM address (where array starts in GPU memory)
    //   => *d_A = data at that VRAM address (first element value)
    //   => &d_A = CPU RAM address (where d_A variable itself lives)
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Step 3: Copy data from RAM to VRAM (once)
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Step 4: Launch kernel with VRAM pointers
    addArrays<<<1, N>>>(d_A, d_B, d_C, N);  // Threads access data directly in VRAM

    // Step 5: Copy result back to RAM
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Step 6: Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}
Here is a visual:
CPU Memory (RAM)                         GPU Memory (VRAM)
┌────────────────────┐                   ┌──────────────────┐
│ Address: 0x1000    │                   │ Address: 0x7f8a..│
│ ┌────────────────┐ │                   │ ┌──────────────┐ │
│ │ d_A = 0x7f8a...│─┼──── points to ───▶│ │ [1.5, 2.3...]│ │
│ └────────────────┘ │                   │ └──────────────┘ │
│        ↑ &d_A      │                   │   ↑ *d_A = 1.5   │
└────────────────────┘                   └──────────────────┘

&d_A = 0x1000    (address of the pointer variable, on CPU)
 d_A = 0x7f8a... (VRAM address stored in d_A)
*d_A = 1.5       (data in VRAM)
Why pass pointers to kernels?
The problem is that GPUs have separate memory (VRAM) from CPUs (RAM). Your data must be in GPU memory before a kernel can process it. But how do you tell the kernel where to find that data in VRAM? If you tried to pass entire arrays by value, the program would copy gigabytes every kernel launch, which defeats the purpose of having data in fast VRAM.
The solution: First copy data from CPU RAM to GPU VRAM once using cudaMemcpy. Then pass pointers (GPU memory addresses in VRAM) to kernels. A pointer is just a small number (8 bytes) that tells threads where to find the array in VRAM. Thousands of threads can then work on the same dataset in VRAM without any copying. The naming convention uses d_ prefix for device (GPU/VRAM) pointers and h_ prefix for host (CPU/RAM) pointers.
Why doesn’t the kernel just know the VRAM addresses automatically?
Kernels are reusable functions, not hardcoded to specific data.
If addresses were hardcoded, you could only use each kernel once. By passing pointers as parameters, one kernel can process unlimited different datasets - same reason C++ functions take parameters instead of hardcoding values.
// Same kernel, different data!
__global__ void addArrays(float* A, float* B, float* C, int N) {
int i = threadIdx.x;
C[i] = A[i] + B[i]; // Works on ANY arrays you pass
}
int main() {
// Scenario 1: Physics simulation
float *d_physics_A, *d_physics_B, *d_physics_C;
cudaMalloc(&d_physics_A, size);
cudaMalloc(&d_physics_B, size);
cudaMalloc(&d_physics_C, size);
addArrays<<<1, N>>>(d_physics_A, d_physics_B, d_physics_C, N);
// Scenario 2: Same kernel, different data for ML
float *d_ml_A, *d_ml_B, *d_ml_C;
cudaMalloc(&d_ml_A, size);
cudaMalloc(&d_ml_B, size);
cudaMalloc(&d_ml_C, size);
addArrays<<<1, N>>>(d_ml_A, d_ml_B, d_ml_C, N);
// ▲ Same kernel code, different VRAM addresses!
// Scenario 3: Use outputs from previous kernels as inputs
addArrays<<<1, N>>>(d_physics_C, d_ml_C, d_final, N);
}
Tutorial 1: 2D matrix addition with single block (16×16)
Now we progress from 1D arrays to 2D matrices. We organize threads in a 2D layout (16×16) within a single block, where each thread computes one matrix element.
Here is the complete working example:
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void matAdd(float* A, float* B, float* C, int N) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    int index = i * N + j;
    if (i < N && j < N) {
        C[index] = A[index] + B[index];
    }
}

int main() {
    int N = 16;
    size_t size = N * N * sizeof(float);

    // Host allocation and initialization
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);
    for (int i = 0; i < N * N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Device allocation
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy to GPU
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(1, 1);
    matAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result back
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

Thread organization (1 block with 16×16 threads):
Single block with 16×16 = 256 threads
┌────────────────────────────────────────┐
│ Thread   Thread   Thread   ...  Thread │
│ (0,0)    (1,0)    (2,0)         (15,0) │
│                                        │
│ Thread   Thread   Thread   ...  Thread │
│ (0,1)    (1,1)    (2,1)         (15,1) │
│                                        │
│  ...      ...      ...     ...   ...   │
│                                        │
│ Thread   Thread   Thread   ...  Thread │
│ (0,15)   (1,15)   (2,15)        (15,15)│
└────────────────────────────────────────┘

Each thread (i,j) computes: C[i][j] = A[i][j] + B[i][j]

Key insight: Thread layout matches data layout!
- Thread (0,0)   handles matrix element [0][0]
- Thread (5,7)   handles matrix element [5][7]
- Thread (15,15) handles matrix element [15][15]

The kernel function:
__global__ void matAdd(float* A, float* B, float* C, int N) {
    // Get thread position in 2D block
    int i = threadIdx.x;  // row index (0-15)
    int j = threadIdx.y;  // column index (0-15)

    // Convert 2D (i,j) to 1D memory index
    int index = i * N + j;
    // Why? Memory is 1D, but we think in 2D
    // Example: Element [2][3] is at position 2×16 + 3 = 35

    // Bounds check (important for general code)
    if (i < N && j < N) {
        C[index] = A[index] + B[index];
    }
}

What each thread does:
Thread (0,0): index = 0×16 + 0 = 0  → C[0]  = A[0]  + B[0]   (element [0][0])
Thread (0,1): index = 0×16 + 1 = 1  → C[1]  = A[1]  + B[1]   (element [0][1])
Thread (1,0): index = 1×16 + 0 = 16 → C[16] = A[16] + B[16]  (element [1][0])
Thread (2,3): index = 2×16 + 3 = 35 → C[35] = A[35] + B[35]  (element [2][3])

All 256 threads execute this same code in parallel!
Each thread just has different threadIdx values.

Host code setup:
int main() {
    // Define problem size
    int N = 16;                            // 16×16 matrix
    size_t size = N * N * sizeof(float);   // 16×16×4 = 1,024 bytes

    // Allocate host (CPU) memory
    float *h_A = (float*)malloc(size);     // h_ prefix = host memory
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize matrices with sample values
    for (int i = 0; i < N * N; i++) {
        h_A[i] = 1.0f;                     // Matrix A filled with 1.0
        h_B[i] = 2.0f;                     // Matrix B filled with 2.0
    }
    // Expected result: C should be filled with 3.0

Memory visualization after initialization:
CPU RAM:
h_A: [1.0, 1.0, 1.0, ..., 1.0]  (256 elements × 4 bytes = 1,024 bytes)
h_B: [2.0, 2.0, 2.0, ..., 2.0]  (256 elements × 4 bytes = 1,024 bytes)
h_C: [?,   ?,   ?,   ..., ?  ]  (uninitialized, will hold results)

GPU memory allocation:
    // Declare device (GPU) pointers
    float *d_A, *d_B, *d_C;    // d_ prefix = device memory

    // Allocate GPU memory
    cudaMalloc(&d_A, size);    // Allocate 1,024 bytes in VRAM for A
    cudaMalloc(&d_B, size);    // Allocate 1,024 bytes in VRAM for B
    cudaMalloc(&d_C, size);    // Allocate 1,024 bytes in VRAM for C

What happens with cudaMalloc:
CPU side:                              GPU side:
┌─────────────────┐                    ┌──────────────────┐
│ d_A (pointer)   │───points to──────▶ │ VRAM block       │
│ value: 0x7f8a...│                    │ [empty, 1 KB]    │
└─────────────────┘                    └──────────────────┘

cudaMalloc(&d_A, 1024):
1. Find 1,024 free bytes in GPU VRAM
2. Return VRAM address (e.g., 0x7f8a4b000000)
3. Store that address in CPU variable d_A

Result: d_A contains a VRAM address (lives on CPU, points to GPU)

Data transfer to GPU:
    // Copy data from CPU RAM to GPU VRAM
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

Data flow visualization:
Before cudaMemcpy:
  CPU RAM                      GPU VRAM
  h_A: [1.0, 1.0, ...]         d_A: [?, ?, ...]  (empty)
  h_B: [2.0, 2.0, ...]         d_B: [?, ?, ...]  (empty)

After cudaMemcpy:
  CPU RAM                      GPU VRAM
  h_A: [1.0, 1.0, ...]         d_A: [1.0, 1.0, ...]  (copied!)
  h_B: [2.0, 2.0, ...]         d_B: [2.0, 2.0, ...]  (copied!)

This is a synchronous operation: CPU waits until copy completes
Time taken: ~1-10 microseconds for 1 KB over PCIe

Launching the kernel:
    // Configure launch parameters
    dim3 threadsPerBlock(16, 16);   // 16×16 = 256 threads per block
    dim3 numBlocks(1, 1);           // Just 1 block total

    // Launch kernel
    matAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

Understanding dim3:
dim3 is a CUDA type for 3D dimensions (x, y, z)

dim3 threadsPerBlock(16, 16);
- Creates 16×16 = 256 threads in a 2D grid within one block
- Equivalent to: dim3 threadsPerBlock(16, 16, 1);
- Each thread gets unique (threadIdx.x, threadIdx.y)

dim3 numBlocks(1, 1);
- Creates just 1 block
- For now, we only need 1 block (256 threads enough for 16×16 matrix)

What happens during kernel launch:
GPU Execution:
┌────────────────────────────────────────┐
│ Block (0,0) with 256 threads           │
│                                        │
│ Thread(0,0)      Thread(1,0)    ...    │
│      ↓                ↓          ↓     │
│ C[0]=A[0]+B[0]   C[1]=...       ...    │
│                                        │
│ All execute simultaneously!            │
│ Time: ~1 microsecond for 256 adds      │
└────────────────────────────────────────┘

CPU: Continues immediately (kernel launch is asynchronous)
GPU: Schedules 256 threads, executes addition, writes to d_C

Getting results back:
    // Copy result from GPU VRAM to CPU RAM
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

Data flow:
Before this line:
  GPU VRAM                     CPU RAM
  d_C: [3.0, 3.0, ...]         h_C: [?, ?, ...]

After cudaMemcpy:
  GPU VRAM                     CPU RAM
  d_C: [3.0, 3.0, ...]         h_C: [3.0, 3.0, ...]  (copied!)

This is synchronous: CPU waits for:
1. Kernel to finish computing
2. Data transfer to complete

Now h_C contains the results, we can use them on CPU!

Cleanup:
    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free CPU memory
    free(h_A);
    free(h_B);
    free(h_C);
}

Why cleanup is important:
cudaFree(d_A):
- Releases 1,024 bytes of GPU VRAM
- Makes that memory available for other GPU operations
- GPU memory is limited (typically 4-24 GB)

free(h_A):
- Releases 1,024 bytes of CPU RAM
- Standard C memory management

Without cleanup: Memory leak! Resources stay allocated until program ends.
2D to 1D memory layout visualization
Consider a 2D matrix B[3][4]:
Row 0: B[0][0] B[0][1] B[0][2] B[0][3]
Row 1: B[1][0] B[1][1] B[1][2] B[1][3]
Row 2: B[2][0] B[2][1] B[2][2] B[2][3]
Stored in 1D memory:
Index: 0 1 2 3 4 5 6 7 8 9 10 11
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
A[] │0,0│0,1│0,2│0,3│1,0│1,1│1,2│1,3│2,0│2,1│2,2│2,3│
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
└───────Row 0───────┘└───────Row 1───────┘└───────Row 2───────┘
Formula: A[i * N + j] where N = 4 (width)
Example mappings:
B[0][1] → A[0 × 4 + 1] = A[1] (row 0, column 1)
B[1][0] → A[1 × 4 + 0] = A[4] (row 1, column 0)
B[2][3] → A[2 × 4 + 3] = A[11] (row 2, column 3)
How it works:
i × N → skip i complete rows (each row is N elements)
+ j → move j positions within the current row
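To make the formula concrete, here is a small host-side sketch (plain C++, illustrative only and not part of the tutorial programs; the fill values are made up) that stores the 3×4 matrix above in a flat array and prints the three mappings:

#include <cstdio>

int main() {
    const int ROWS = 3, COLS = 4;      // same 3×4 matrix as above (N = width = 4)
    float A[ROWS * COLS];              // 2D data stored in 1D memory

    // Fill so that element [i][j] holds the value i*10 + j
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            A[i * COLS + j] = i * 10.0f + j;   // flat index = i * N + j

    // The example mappings listed above
    printf("B[0][1] -> A[%2d] = %.0f\n", 0 * COLS + 1, A[0 * COLS + 1]);  // A[1]
    printf("B[1][0] -> A[%2d] = %.0f\n", 1 * COLS + 0, A[1 * COLS + 0]);  // A[4]
    printf("B[2][3] -> A[%2d] = %.0f\n", 2 * COLS + 3, A[2 * COLS + 3]);  // A[11]
    return 0;
}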
Why do you need boundary checks in this 2D matrix addition kernel?
The problem is that you must launch whole blocks, not individual threads.
Example with N=100 and block size 16×16:
Problem: Need 100×100 threads (one per matrix element)
Block size: 16×16 threads per block
Calculate blocks needed:
blocks_x = ⌈100 / 16⌉ = ⌈6.25⌉ = 7 blocks
blocks_y = ⌈100 / 16⌉ = ⌈6.25⌉ = 7 blocks
Threads launched:
total_x = 7 blocks × 16 threads = 112 threads
total_y = 7 blocks × 16 threads = 112 threads
Total = 112 × 112 = 12,544 threads
But your matrix is only 100×100 = 10,000 elements!
Extra threads = 12,544 - 10,000 = 2,544 threads doing nothing
Visual representation of the thread grid:
Matrix (100×100) Thread grid (112×112)
┌──────────────┐ ┌──────────────┬──┐
│ │ │ │XX│ ← Extra threads
│ Valid │ │ Valid │XX│ (threads 100-111)
│ region │ │ region │XX│
│ 100×100 │ │ 100×100 │XX│
│ │ │ │XX│
│ │ │ │XX│
└──────────────┘ ├──────────────┼──┤
│XXXXXXXXXXXXXX│XX│ ← Extra row of threads
└──────────────┴──┘ (rows 100-111)
Without boundary check: Threads with i≥100 or j≥100 access invalid memory
With boundary check: if (i < N && j < N) prevents out-of-bounds access
Code pattern:
int i = blockIdx.x * blockDim.x + threadIdx.x; // Could be 0-111
int j = blockIdx.y * blockDim.y + threadIdx.y; // Could be 0-111
if (i < N && j < N) // Only process if within 0-99 range
C[i * N + j] = A[i * N + j] + B[i * N + j];
When you do not need boundary checks: When problem size is guaranteed to be a multiple of block size (for example, N=256 with 16×16 blocks gives exactly 16×16 blocks with no remainder).
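Here is a minimal sketch of the ceiling-division launch configuration for N=100. It assumes a kernel using the global-index pattern above (named matAddGlobal here; the name and the uninitialized device buffers are only for illustration):

#include <cuda_runtime.h>

// Kernel using the global-index pattern shown above
__global__ void matAddGlobal(float* A, float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)                      // discard out-of-range threads
        C[i * N + j] = A[i * N + j] + B[i * N + j];
}

int main() {
    int N = 100;                             // NOT a multiple of 16
    size_t size = N * N * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size); cudaMalloc(&d_B, size); cudaMalloc(&d_C, size);

    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,   // ceil(100/16) = 7
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);  // ceil(100/16) = 7
    // 7×7 blocks × 256 threads = 12,544 threads; the boundary check
    // discards the 2,544 that fall outside the 100×100 matrix.
    matAddGlobal<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

    cudaDeviceSynchronize();
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}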
What’s next?
We are using a single block containing 256 threads to add two 16×16 matrices. What if we want to add larger matrices, like 256×256? See the next tutorial on using multiple blocks.
Tutorial 2: 2D matrix addition with multiple blocks (256×256)
Recall that a block is a group of threads that can cooperate via shared memory, and a grid is the collection of blocks launched by a single kernel.
For larger matrices like 256×256, we divide the matrix into smaller 16×16 chunks, where each chunk is processed by one block. This means we need 16 blocks in the X direction and 16 blocks in the Y direction, giving us 16 × 16 = 256 blocks total. Each block contains 16 × 16 = 256 threads that process the elements in that chunk.
2D block organization visualization:
Grid (multiple blocks arranged in 2D)
┌─────────┬─────────┬─────────┐
│  Block  │  Block  │  Block  │
│  (0,0)  │  (1,0)  │  (2,0)  │
├─────────┼─────────┼─────────┤
│  Block  │  Block  │  Block  │
│  (0,1)  │  (1,1)  │  (2,1)  │
├─────────┼─────────┼─────────┤
│  Block  │  Block  │  Block  │
│  (0,2)  │  (1,2)  │  (2,2)  │
└─────────┴─────────┴─────────┘

Each block contains 16×16 threads:

Block (0,0) expanded:
┌──────────────────────────────┐
│ T(0,0)   T(1,0)  ...  T(15,0)│
│ T(0,1)   T(1,1)  ...  T(15,1)│
│  ...      ...    ...   ...   │
│ T(0,15)   ...         T(15,15)│
└──────────────────────────────┘
       256 threads per block
Here is a sample implementation:
// Kernel definition
__global__ void MatAdd(float* d_A, float* d_B, float* d_C, int N) {
    // Calculate global thread indices
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Boundary check (prevents out-of-bounds access when N not divisible by block size)
    if (i < N && j < N) {
        int index = i * N + j;   // Convert 2D to 1D index
        d_C[index] = d_A[index] + d_B[index];
    }
}

int main() {
    ...
    // 2D thread organization
    dim3 threadsPerBlock(16, 16);     // x=16, y=16, z=1
    dim3 numBlocks(N/16, N/16, 1);    // Explicit z dimension

    MatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
    ...
}
Why use multiple blocks for large matrices?
GPUs achieve parallelism by distributing blocks across multiple streaming multiprocessors (SMs).
Comparison for 256×256 matrix (65,536 elements):
Option 1: One block with 1024 threads (maximum per block)
- Launch: 1 block × 1024 threads = 1,024 threads
- Problem: Need 65,536 threads (one per element)
- Result: Must run kernel 64 times sequentially or use complex indexing
- Parallelism: Only 1 SM utilized (out of 82 for RTX 3090, 108 for A100)
Why only 1 SM?
• 1 block = assigned to exactly 1 SM (blocks never split across SMs)
• That one SM executes the block's warps using only its own cores
• Other 81 SMs sit completely idle
• Massive waste of GPU resources
Option 2: 256 blocks with 256 threads each
- Launch: 256 blocks × 256 threads = 65,536 threads
- All elements processed in parallel
- Parallelism: All 82 SMs working (256 blocks distributed across 82 SMs)
How blocks distribute:
• Each SM gets multiple blocks: 256 blocks ÷ 82 SMs ≈ 3 blocks per SM
• Each SM processes its assigned blocks' warps concurrently
• All SMs work simultaneously → ~82× more parallelism
Result: Option 2 is dramatically faster (~50-80× in practice)
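The SM counts above (82, 108) are examples for specific GPUs. Below is a small sketch (illustrative, not part of the tutorial programs) that queries your own GPU's SM count at runtime and estimates how many blocks land on each SM:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Query the actual number of SMs instead of assuming 82 (RTX 3090) or 108 (A100)
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // device 0

    int N = 256;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / 16, N / 16);          // 256 blocks for a 256×256 matrix
    int totalBlocks = numBlocks.x * numBlocks.y;

    printf("GPU: %s with %d SMs\n", prop.name, prop.multiProcessorCount);
    printf("Launching %d blocks -> roughly %d blocks per SM\n",
           totalBlocks, totalBlocks / prop.multiProcessorCount);
    return 0;
}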
Tutorial 3: Batch processing multiple matrices
Process 1,000 matrices of 256×256 elements simultaneously using a 3D grid configuration where the Z dimension represents the batch index.
Problem setup:
Batch size:     1,000 matrices
Matrix size:    256×256 elements each
Total elements: 1,000 × 256 × 256 = 65,536,000 elements

Memory layout (flattened):
Matrix 0:   [elements 0          to 65,535]
Matrix 1:   [elements 65,536     to 131,071]
Matrix 2:   [elements 131,072    to 196,607]
...
Matrix 999: [elements 65,470,464 to 65,535,999]

3D grid organization:
Grid dimensions:
- X dimension: 16 blocks    (for 256 columns ÷ 16)
- Y dimension: 16 blocks    (for 256 rows ÷ 16)
- Z dimension: 1,000 blocks (one per matrix in batch)

Total blocks      = gridDim.x × gridDim.y × gridDim.z
                  = 16 × 16 × 1,000 = 256,000 blocks

Threads per block = blockDim.x × blockDim.y × blockDim.z
                  = 16 × 16 × 1 = 256 threads

Total threads     = Total blocks × Threads per block
                  = 256,000 × 256 = 65,536,000 threads

Warp organization inside each block:
Since each block has 256 threads and warps contain 32 threads:

Warps per block     = 256 threads ÷ 32 threads/warp = 8 warps
Total warps in grid = 256,000 blocks × 8 warps/block = 2,048,000 warps

These 2,048,000 warps get distributed across the GPU's streaming
multiprocessors (SMs). Each SM executes warps in groups, scheduling them to
hide memory latency.

On an A100 GPU with 108 SMs:
Warps per SM ≈ 2,048,000 warps ÷ 108 SMs ≈ 18,963 warps/SM

The SM's warp scheduler executes these warps over time, switching between
warps whenever one stalls on memory access.

Thread-to-element mapping
Each thread computes one element in one matrix. Think of each block as processing a 16×16 chunk of elements within a specific matrix:
Single matrix (256×256) divided into 16×16 chunks:

Matrix 0:                                ← blockIdx.z = 0
┌────────┬────────┬─────┬────────┐
│ Block  │ Block  │ ... │ Block  │       ← blockIdx.y = 0
│ (0,0)  │ (1,0)  │     │ (15,0) │
│ 16×16  │ 16×16  │     │ 16×16  │
├────────┼────────┼─────┼────────┤
│ Block  │ Block  │ ... │ Block  │       ← blockIdx.y = 1
│ (0,1)  │ (1,1)  │     │ (15,1) │
├────────┼────────┼─────┼────────┤
│  ...   │  ...   │ ... │  ...   │
├────────┼────────┼─────┼────────┤
│ Block  │ Block  │ ... │ Block  │       ← blockIdx.y = 15
│ (0,15) │ (1,15) │     │ (15,15)│
└────────┴────────┴─────┴────────┘
   ↑        ↑              ↑
  bx=0     bx=1           bx=15

Inside each block, 16×16 threads process the elements. For example,
Block (3, 5, 0) — the chunk at column 3, row 5 of matrix 0 — contains threads
with threadIdx.x = 0..15 and threadIdx.y = 0..15.

Important: threadIdx always ranges from 0 to (blockDim - 1)
- threadIdx.x ranges from 0 to 15 (always resets for each block)
- threadIdx.y ranges from 0 to 15 (always resets for each block)
- Every block has its own local thread numbering starting at (0,0)
- To get the global position, we combine blockIdx and threadIdx

Coordinate calculation:
For a thread at (threadIdx.x, threadIdx.y) in block (blockIdx.x, blockIdx.y, blockIdx.z):

// Which matrix in the batch
int batch_idx = blockIdx.z;                  // Range: 0 to 999

// Position within that matrix
int row = blockIdx.y * 16 + threadIdx.y;     // Range: 0 to 255
int col = blockIdx.x * 16 + threadIdx.x;     // Range: 0 to 255

Example: Thread at threadIdx(5, 7) in block(3, 10, 42):
- Processing matrix 42
- Row = 10 × 16 + 7 = 167
- Col = 3 × 16 + 5 = 53
- Computing element at position (167, 53) in matrix 42
Here is the complete implementation:
// Kernel for batch matrix addition
__global__ void batchMatAdd(float* d_A, float* d_B, float* d_C,
                            int N, int batch_size)
{
    // Which matrix in the batch
    int batch_idx = blockIdx.z;

    // Position within this matrix
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check
    if (row < N && col < N && batch_idx < batch_size) {
        // Calculate flat index: skip to correct matrix, then to element
        int matrix_offset = batch_idx * N * N;
        int element_offset = row * N + col;
        int index = matrix_offset + element_offset;

        d_C[index] = d_A[index] + d_B[index];
    }
}

int main() {
    const int N = 256;              // Matrix dimension
    const int batch_size = 1000;    // Number of matrices

    // Total elements for all matrices
    size_t total_size = batch_size * N * N * sizeof(float);

    // Allocate host memory
    float* h_A = (float*)malloc(total_size);
    float* h_B = (float*)malloc(total_size);
    float* h_C = (float*)malloc(total_size);

    // Initialize with sample data
    for (int i = 0; i < batch_size * N * N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, total_size);
    cudaMalloc(&d_B, total_size);
    cudaMalloc(&d_C, total_size);

    // Copy data to device
    cudaMemcpy(d_A, h_A, total_size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, total_size, cudaMemcpyHostToDevice);

    // Configure 3D grid: 16×16×1000 blocks, each with 16×16 threads
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N/16, N/16, batch_size);

    // Launch kernel
    batchMatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N, batch_size);

    // Copy results back
    cudaMemcpy(h_C, d_C, total_size, cudaMemcpyDeviceToHost);

    // Verify (check first and last matrix)
    printf("First element: %f (expected 3.0)\n", h_C[0]);
    printf("Last element:  %f (expected 3.0)\n", h_C[batch_size * N * N - 1]);

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}
How does GPU handle 256,000 blocks?
Block assignment to SMs:
GPU: NVIDIA A100 (108 SMs)
Total blocks: 256,000 blocks
Total warps: 2,048,000 warps (each block has 8 warps)
Block assignment (initial wave):
256,000 blocks ÷ 108 SMs ≈ 2,370 blocks per SM
Block to SM mapping:
SM 0: Assigned blocks 0, 108, 216, ... (2,370 blocks total)
SM 1: Assigned blocks 1, 109, 217, ... (2,370 blocks total)
...
SM 107: Assigned blocks 107, 215, 323, ...
How SMs actually execute: warp-level scheduling:
Important: SMs don't execute "blocks" - they execute warps!
Each block (256 threads) contains 8 warps (32 threads each).
Inside SM 0 (each SM operates independently):
┌─────────────────────────────────────────┐
│ SM 0 Warp Scheduler │
│ │
│ Warps assigned to THIS SM: │
│ - Block 0, Warp 0 (ready) ← execute │
│ - Block 0, Warp 1 (stalled) ← skip │
│ - Block 0, Warp 2 (ready) ← execute │
│ - ... │
│ - Block 108, Warp 0 (ready) ← execute │
│ - Block 108, Warp 1 (ready) ← execute │
│ - ... │
└─────────────────────────────────────────┘
Each SM has its own warp scheduler that:
1. Manages only the warps from blocks assigned to THIS SM
2. Every few clock cycles, picks ready warps (not stalled on memory)
3. Executes those warps on this SM's CUDA cores
4. When a warp stalls (memory access), switches to another ready warp
Key point: Each SM independently switches between its own warps.
The 108 SMs on A100 all do this in parallel, each managing their
own set of blocks and warps.
Timeline:
The GPU doesn't process blocks sequentially. Instead:
- Blocks are assigned to SMs as resources become available
- Within each SM, warps from multiple blocks execute concurrently
- Warp scheduler switches between warps every few cycles
- When a block finishes, SM gets assigned a new block
This happens automatically—no explicit management needed!
Tutorial 4: Processing large batches with limited VRAM
The previous example processed 1,000 matrices. What if you need to process 100,000 matrices but your GPU VRAM can only hold 1,000 at a time? You process them in chunks using a loop.
Problem setup:
Given:
- Matrix size: 256×256 elements per matrix
- Total matrices: 100,000 matrices (entire dataset)
- Chunk size: 1,000 matrices (GPU VRAM limit per chunk)

Calculate:
- Total elements:     100,000 matrices × 256 × 256 = 6,553,600,000 elements
- Elements per chunk: 1,000 matrices × 256 × 256 = 65,536,000 elements
- Number of chunks:   100,000 matrices ÷ 1,000 matrices/chunk = 100 chunks

Solution: Process in 100 chunks of 1,000 matrices each
Chunk 0:  Matrices 0-999
Chunk 1:  Matrices 1,000-1,999
Chunk 2:  Matrices 2,000-2,999
...
Chunk 99: Matrices 99,000-99,999

Implementation strategy:
For each chunk:
1. Copy 1,000 matrices from host (CPU) to device (GPU)
2. Launch kernel to process these 1,000 matrices
3. Copy results back from device to host
4. Repeat for next chunk

This is chunked (batched) processing; Tutorial 5 turns it into true "streaming" or "pipelined processing" by overlapping these steps.
Here is the complete implementation:
#include <cuda_runtime.h>
#include <cstdio>

// Same kernel as before (processes up to 1,000 matrices)
__global__ void batchMatAdd(float* d_A, float* d_B, float* d_C,
                            int N, int batch_size)
{
    int batch_idx = blockIdx.z;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N && batch_idx < batch_size) {
        int matrix_offset = batch_idx * N * N;
        int element_offset = row * N + col;
        int index = matrix_offset + element_offset;
        d_C[index] = d_A[index] + d_B[index];
    }
}

int main() {
    const int N = 256;                    // Matrix dimension
    const int total_matrices = 100000;    // Total dataset size
    const int chunk_size = 1000;          // GPU memory limit
    const int num_chunks = total_matrices / chunk_size;

    // Size for one chunk
    size_t chunk_bytes = chunk_size * N * N * sizeof(float);

    // Allocate host memory for entire dataset
    // (size_t cast: 100,000 × 256 × 256 elements exceeds the 32-bit int range)
    float* h_A_all = (float*)malloc((size_t)total_matrices * N * N * sizeof(float));
    float* h_B_all = (float*)malloc((size_t)total_matrices * N * N * sizeof(float));
    float* h_C_all = (float*)malloc((size_t)total_matrices * N * N * sizeof(float));

    // Initialize data (would typically load from file)
    for (size_t i = 0; i < (size_t)total_matrices * N * N; i++) {
        h_A_all[i] = 1.0f;
        h_B_all[i] = 2.0f;
    }

    // Allocate device memory (only enough for one chunk)
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, chunk_bytes);
    cudaMalloc(&d_B, chunk_bytes);
    cudaMalloc(&d_C, chunk_bytes);

    // Grid configuration for 1,000 matrices
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N/16, N/16, chunk_size);

    printf("Processing %d matrices in %d chunks of %d\n",
           total_matrices, num_chunks, chunk_size);

    // Process each chunk
    for (int chunk = 0; chunk < num_chunks; chunk++) {
        // Calculate starting position in the full dataset (element offset)
        // Chunk 0: offset = 0 * 1000 * 256 * 256 = 0           (start at matrix 0)
        // Chunk 1: offset = 1 * 1000 * 256 * 256 = 65,536,000  (start at matrix 1000)
        // Chunk 2: offset = 2 * 1000 * 256 * 256 = 131,072,000 (start at matrix 2000)
        // size_t is needed here: later offsets overflow a 32-bit int
        size_t offset = (size_t)chunk * chunk_size * N * N;

        printf("Processing chunk %d/%d (matrices %d-%d)\n",
               chunk + 1, num_chunks,
               chunk * chunk_size, (chunk + 1) * chunk_size - 1);

        // Copy chunk from host to device
        cudaMemcpy(d_A, h_A_all + offset, chunk_bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B_all + offset, chunk_bytes, cudaMemcpyHostToDevice);

        // Process this chunk
        batchMatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N, chunk_size);

        // Copy results back
        cudaMemcpy(h_C_all + offset, d_C, chunk_bytes, cudaMemcpyDeviceToHost);
    }

    // Verify results
    bool success = true;
    for (size_t i = 0; i < (size_t)total_matrices * N * N; i++) {
        if (h_C_all[i] != 3.0f) {
            success = false;
            break;
        }
    }
    printf("Result: %s\n", success ? "PASS" : "FAIL");

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A_all); free(h_B_all); free(h_C_all);
    return 0;
}
Tutorial 5: Pipelined processing with CUDA streams
Tutorial 4 processes chunks sequentially - each chunk must fully complete (copy to GPU → compute → copy back) before the next chunk begins. This leaves the GPU idle during data transfers and the CPU waiting during computation.
Tutorial 5 uses CUDA streams to process multiple chunks concurrently - while one chunk computes on the GPU, another chunk’s data is being copied in, and a third chunk’s results are being copied out. This dramatically improves hardware utilization.
Key concept:
Without streams (previous example):
════════════════════════════════════════════════════════════
Chunk 0: [H→D]---[Compute]---[D→H]
Chunk 1:                           [H→D]---[Compute]---[D→H]
Chunk 2:                                                      [H→D]---[Compute]---[D→H]

GPU idle during copies! CPU waits for GPU!

With streams (this example):
════════════════════════════════════════════════════════════
Stream 0: [H→D]---[Compute]---[D→H]
Stream 1:         [H→D]---[Compute]---[D→H]
Stream 2:                 [H→D]---[Compute]---[D→H]

Operations overlap! Much better utilization!

Why asynchronous operations matter:
The regular ``cudaMemcpy`` is blocking - when you call it, your CPU waits until the entire transfer completes before moving to the next line of code. Similarly, kernel launches with the default stream wait for all previous operations to finish. This creates idle time:

- While data copies from CPU to GPU, the GPU compute units sit idle
- While GPU computes, the CPU cannot prepare the next chunk
- While results copy back, both CPU and GPU wait

``cudaMemcpyAsync`` is non-blocking - it returns immediately after queuing the transfer, letting your CPU continue executing code. Combined with streams, this allows multiple chunks to be processed concurrently. While one chunk computes in stream 0, another chunk can copy data in via stream 1, and a third chunk can copy results out via stream 2 - all happening simultaneously.

Requirements for overlapping:
1. Use ``cudaMemcpyAsync`` instead of ``cudaMemcpy`` (non-blocking transfers)
2. Use pinned (page-locked) host memory with ``cudaMallocHost`` (enables DMA)
3. Use multiple CUDA streams to manage concurrent operations
4. Launch kernels in different streams (allows parallel execution)
Complete implementation
Here is the full code for Tutorial 5. Read through it first to see the overall structure, then continue to the section-by-section breakdown below.
Key parameters:
Given (same dataset as Tutorial 4):
- Matrix size: 256×256 elements per matrix
- Total matrices: 100,000 matrices (entire dataset)
- Chunk size: 1,000 matrices (amount processed per chunk)
- Number of streams: 4 streams (new in Tutorial 5!)

Calculate:
- Total chunks:      100,000 ÷ 1,000 = 100 chunks
- Chunks per stream: 100 chunks ÷ 4 streams = 25 chunks/stream
- Concurrent chunks: 4 chunks processing simultaneously (one per stream)

Key difference from Tutorial 4:
Tutorial 4 processes 1 chunk at a time sequentially (100 iterations)
Tutorial 5 processes 4 chunks at a time concurrently (still 100 iterations,
but chunks overlap in execution)

#include <cuda_runtime.h>
#include <cstdio>

__global__ void batchMatAdd(float* d_A, float* d_B, float* d_C,
                            int N, int batch_size)
{
    int batch_idx = blockIdx.z;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N && batch_idx < batch_size) {
        int matrix_offset = batch_idx * N * N;
        int element_offset = row * N + col;
        int index = matrix_offset + element_offset;
        d_C[index] = d_A[index] + d_B[index];
    }
}

int main() {
    const int N = 256;
    const int total_matrices = 100000;
    const int chunk_size = 1000;
    const int num_chunks = total_matrices / chunk_size;
    const int num_streams = 4;

    size_t chunk_bytes = chunk_size * N * N * sizeof(float);

    // Allocate pinned host memory
    // (size_t cast: 100,000 × 256 × 256 elements exceeds the 32-bit int range)
    float *h_A_all, *h_B_all, *h_C_all;
    cudaMallocHost(&h_A_all, (size_t)total_matrices * N * N * sizeof(float));
    cudaMallocHost(&h_B_all, (size_t)total_matrices * N * N * sizeof(float));
    cudaMallocHost(&h_C_all, (size_t)total_matrices * N * N * sizeof(float));

    // Initialize data
    for (size_t i = 0; i < (size_t)total_matrices * N * N; i++) {
        h_A_all[i] = 1.0f;
        h_B_all[i] = 2.0f;
    }

    // Allocate device memory for each stream
    float *d_A[num_streams], *d_B[num_streams], *d_C[num_streams];
    cudaStream_t streams[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaMalloc(&d_A[i], chunk_bytes);
        cudaMalloc(&d_B[i], chunk_bytes);
        cudaMalloc(&d_C[i], chunk_bytes);
        cudaStreamCreate(&streams[i]);
    }

    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N/16, N/16, chunk_size);

    // Process chunks using streams
    for (int chunk = 0; chunk < num_chunks; chunk++) {
        int stream_idx = chunk % num_streams;
        size_t offset = (size_t)chunk * chunk_size * N * N;   // element offset into the full dataset

        cudaMemcpyAsync(d_A[stream_idx], h_A_all + offset, chunk_bytes,
                        cudaMemcpyHostToDevice, streams[stream_idx]);
        cudaMemcpyAsync(d_B[stream_idx], h_B_all + offset, chunk_bytes,
                        cudaMemcpyHostToDevice, streams[stream_idx]);

        batchMatAdd<<<numBlocks, threadsPerBlock, 0, streams[stream_idx]>>>(
            d_A[stream_idx], d_B[stream_idx], d_C[stream_idx], N, chunk_size);

        cudaMemcpyAsync(h_C_all + offset, d_C[stream_idx], chunk_bytes,
                        cudaMemcpyDeviceToHost, streams[stream_idx]);
    }

    // Wait for all streams to complete
    for (int i = 0; i < num_streams; i++) {
        cudaStreamSynchronize(streams[i]);
    }

    // Verify results
    bool success = true;
    for (size_t i = 0; i < (size_t)total_matrices * N * N; i++) {
        if (h_C_all[i] != 3.0f) {
            success = false;
            break;
        }
    }
    printf("Result: %s\n", success ? "PASS" : "FAIL");

    // Cleanup
    for (int i = 0; i < num_streams; i++) {
        cudaFree(d_A[i]); cudaFree(d_B[i]); cudaFree(d_C[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFreeHost(h_A_all);
    cudaFreeHost(h_B_all);
    cudaFreeHost(h_C_all);
    return 0;
}
Code breakdown
Now let’s examine each section in detail to understand how streams enable concurrent processing.
Section 1: Allocate pinned memory
// Tutorial 4 used: malloc(...) for pageable host memory
// Tutorial 5 uses: cudaMallocHost for pinned memory
// (size_t cast because the element count exceeds the 32-bit int range)

float *h_A_all, *h_B_all, *h_C_all;
cudaMallocHost(&h_A_all, (size_t)total_matrices * N * N * sizeof(float));
cudaMallocHost(&h_B_all, (size_t)total_matrices * N * N * sizeof(float));
cudaMallocHost(&h_C_all, (size_t)total_matrices * N * N * sizeof(float));
Here we use pinned memory, which is required for truly asynchronous transfers.
What is pinned memory?
Important: Pinned memory is HOST (CPU) memory, not GPU memory!
Pinned memory (also called page-locked memory) is a special type of CPU RAM that the operating system guarantees will stay at a fixed physical location and never be swapped to disk.
Memory location clarification:
Computer Architecture:
┌─────────────────────────────────────────────────────────┐
│ CPU GPU │
│ │
│ RAM (Host Memory) VRAM (Device Memory) │
│ ┌─────────────────────┐ ┌──────────────────┐ │
│ │ Regular malloc() │ │ cudaMalloc() │ │
│ │ (pageable) │ │ (always pinned) │ │
│ ├─────────────────────┤ └──────────────────┘ │
│ │ cudaMallocHost() │ ↑ │
│ │ (pinned/page-locked)│────DMA────────┘ │
│ └─────────────────────┘ Fast async transfer │
│ ↑ │
│ Lives in CPU RAM, not GPU VRAM! │
└─────────────────────────────────────────────────────────┘
Regular host memory (pageable, malloc):
float* data = (float*)malloc(1000 * sizeof(float));
Where it lives: CPU RAM (pageable memory)
What happens:
- OS gives you virtual memory addresses in CPU RAM
- Physical location in CPU RAM can change over time
- OS can move pages to disk (swap) if RAM is low
- When you access the data, OS might need to swap it back from disk
This is good: Lets you use more memory than physically available
This is bad: GPU cannot directly access memory that might move or swap
Pinned host memory (page-locked, cudaMallocHost):
float* data;
cudaMallocHost(&data, 1000 * sizeof(float));
Where it lives: CPU RAM (pinned/page-locked memory)
What happens:
- Memory is allocated at a fixed physical CPU RAM address
- OS promises this CPU RAM will never move to a different address
- OS promises this CPU RAM will never swap to disk
- GPU can directly read from this CPU RAM via DMA (Direct Memory Access)
This is good: GPU can transfer from CPU RAM without OS intervention (fast, async)
This is bad: Reduces CPU RAM available for other programs (limited resource)
What problem does non-pinned memory solve?
OS uses virtual memory to let programs use more memory than physically available in RAM. The OS divides memory into fixed-size blocks called pages (typically 4 KB each). When RAM gets full, the OS moves less-used pages from RAM to disk storage. This process is called swapping or paging (paging = moving pages between RAM and disk). This lets your computer run many programs simultaneously even with limited RAM.
When you access swapped-out memory, the OS transparently swaps it back from disk to RAM. This is invisible to your program but creates unpredictable delays. For pageable memory (memory divided into pages that can move), allocated with malloc, the OS has full control and can swap pages to disk at any time based on memory pressure.
For GPU transfers, this creates problems because the GPU needs stable physical RAM addresses. If memory is on disk, the GPU cannot access it until the OS swaps it back. Pinned memory prevents swapping by telling the OS “keep these pages in RAM permanently” - the OS locks those pages in physical RAM and never moves them to disk, giving the GPU reliable direct access via DMA.
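To see the difference yourself, here is a rough, self-contained timing sketch (illustrative only; the 256 MB buffer size is arbitrary and the numbers vary by system) that compares a host-to-device copy from pageable versus pinned memory using CUDA events:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t bytes = 256 * 1024 * 1024;           // 256 MB test buffer

    float *pageable = (float*)malloc(bytes);          // regular pageable RAM
    float *pinned;
    cudaMallocHost(&pinned, bytes);                   // page-locked RAM

    float *d_buf;
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Time a copy from pageable memory (driver must stage it internally)
    cudaEventRecord(start);
    cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Pageable H->D: %.2f ms\n", ms);

    // Time a copy from pinned memory (direct DMA from fixed physical pages)
    cudaEventRecord(start);
    cudaMemcpy(d_buf, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Pinned   H->D: %.2f ms\n", ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_buf); cudaFreeHost(pinned); free(pageable);
    return 0;
}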
Section 2: Create multiple device buffers
// Tutorial 4: One pointer (float *d_A) pointing to one GPU memory allocation
// Tutorial 5: Array of pointers, each pointing to separate GPU memory

// Declare arrays of pointers (with num_streams = 4, this creates 4-element arrays)
// d_A is a 4-element array: d_A[0], d_A[1], d_A[2], d_A[3]
// Each element is a pointer that will point to separate GPU memory
float *d_A[num_streams], *d_B[num_streams], *d_C[num_streams];

// Array to hold 4 stream objects (execution contexts)
cudaStream_t streams[num_streams];

// Create memory allocations and streams
for (int i = 0; i < num_streams; i++) {
    // Allocate GPU memory for stream i's input A
    // &d_A[i] returns CPU RAM address where pointer d_A[i] is stored
    // cudaMalloc writes GPU VRAM address into d_A[i]
    cudaMalloc(&d_A[i], chunk_bytes);

    // Allocate GPU memory for stream i's input B
    cudaMalloc(&d_B[i], chunk_bytes);

    // Allocate GPU memory for stream i's output C
    cudaMalloc(&d_C[i], chunk_bytes);

    // Create execution stream i
    // Each stream is an independent queue of GPU operations
    cudaStreamCreate(&streams[i]);
}
Why multiple buffers? If all streams shared one set of buffers, they would overwrite each other’s data. Each stream needs its own workspace in GPU memory to process chunks independently. This multiplies VRAM usage by the number of streams, but enables true concurrency.
What is a buffer and why is it called that?
A buffer is just memory. We call it a “buffer” because it temporarily holds data flowing between systems that operate at different speeds (CPU ↔ GPU).
Why buffers are critical for streaming:
Without multiple buffers:
- Stream 0 copies data into d_A
- Stream 1 tries to copy into d_A → OVERWRITES stream 0's data!
- Result: Data corruption, wrong results
With multiple buffers (one per stream):
- Stream 0 uses d_A[0], d_B[0], d_C[0]
- Stream 1 uses d_A[1], d_B[1], d_C[1]
- Stream 2 uses d_A[2], d_B[2], d_C[2]
- Stream 3 uses d_A[3], d_B[3], d_C[3]
- Result: Each stream has its own workspace, no conflicts!
This is why Tutorial 5 needs num_streams sets of buffers. Each stream
operates independently on its own data, enabling true concurrent processing.
Without separate buffers, streams cannot overlap their operations.
Section 3: Process chunks asynchronously
Recall what we’re working with:
Total dataset: 100,000 matrices (each 256×256 elements)
Chunk size: 1,000 matrices per chunk
Total chunks: 100 chunks to process (100,000 ÷ 1,000)
Number of streams: 4 streams for concurrent processing
Terminology reminder:
Chunk: A portion of data (1,000 matrices). What we’re processing.
Stream: An execution context on the GPU. How we process it concurrently.
The loop processes all 100 chunks by assigning each chunk to one of the 4 streams in round-robin fashion. Multiple streams process different chunks simultaneously.
// Process all 100 chunks, assigning each to a stream in round-robin fashion
for (int chunk = 0; chunk < num_chunks; chunk++) {
    // Determine which stream to use (cycles through 0, 1, 2, 3, 0, 1, ...)
    // Chunk 0→Stream 0, Chunk 1→Stream 1, Chunk 4→Stream 0 (reuse), etc.
    int stream_idx = chunk % num_streams;

    // Calculate starting element offset (not byte offset!) for this chunk
    // "offset" tells us where this chunk's data starts in the h_A_all array
    // Chunk 0: offset = 0 × 1000 × 256 × 256 = 0           (start at element 0)
    // Chunk 1: offset = 1 × 1000 × 256 × 256 = 65,536,000  (skip first 1000 matrices)
    // Chunk 2: offset = 2 × 1000 × 256 × 256 = 131,072,000 (skip first 2000 matrices)
    // This is an element index, not bytes. We use it with pointer arithmetic.
    // size_t (64-bit) is needed because later offsets exceed the 32-bit int range.
    size_t offset = (size_t)chunk * chunk_size * N * N;

    // Copy input matrix A for this chunk from CPU RAM to GPU VRAM
    // Source:      h_A_all + offset (pinned CPU RAM, starting at chunk's data)
    // Destination: d_A[stream_idx]  (GPU VRAM buffer for this stream)
    // Returns immediately (async), CPU continues to next line
    cudaMemcpyAsync(d_A[stream_idx], h_A_all + offset, chunk_bytes,
                    cudaMemcpyHostToDevice, streams[stream_idx]);

    // Copy input matrix B for this chunk from CPU RAM to GPU VRAM
    // Uses same stream, so this waits for previous operation in this stream
    cudaMemcpyAsync(d_B[stream_idx], h_B_all + offset, chunk_bytes,
                    cudaMemcpyHostToDevice, streams[stream_idx]);

    // Launch kernel to process this chunk in GPU VRAM
    // Reads from: d_A[stream_idx], d_B[stream_idx] (inputs in VRAM)
    // Writes to:  d_C[stream_idx] (output in VRAM)
    // Third launch argument (0) is shared memory size (not used here)
    // Last argument specifies which stream to use
    batchMatAdd<<<numBlocks, threadsPerBlock, 0, streams[stream_idx]>>>(
        d_A[stream_idx], d_B[stream_idx], d_C[stream_idx], N, chunk_size);

    // Copy result back from GPU VRAM to CPU RAM
    // Source:      d_C[stream_idx] (GPU VRAM with computed results)
    // Destination: h_C_all + offset (pinned CPU RAM, write at chunk's location)
    // Returns immediately (async), loop continues to process next chunk
    cudaMemcpyAsync(h_C_all + offset, d_C[stream_idx], chunk_bytes,
                    cudaMemcpyDeviceToHost, streams[stream_idx]);
}

Offset and pointer arithmetic visualization:
h_A_all array (100,000 matrices in CPU RAM):
┌──────────────┬──────────────┬──────────────┬─────┬──────────────┐
│   Chunk 0    │   Chunk 1    │   Chunk 2    │ ... │   Chunk 99   │
│ 1000 matrices│ 1000 matrices│ 1000 matrices│     │ 1000 matrices│
│ 65,536,000   │ 65,536,000   │ 65,536,000   │     │ 65,536,000   │
│ elements     │ elements     │ elements     │     │ elements     │
└──────────────┴──────────────┴──────────────┴─────┴──────────────┘
↑              ↑              ↑                    ↑
offset=0       offset=        offset=              offset=
               65,536,000     131,072,000          6,488,064,000

When chunk=1:
  offset = 1 × 1000 × 256 × 256 = 65,536,000

  cudaMemcpyAsync(d_A[stream_idx], h_A_all + offset, chunk_bytes, ...)
                                   ↑
                                   Pointer arithmetic!

h_A_all is float*, so +offset means:
  "skip 65,536,000 float elements"
  = skip 65,536,000 × 4 bytes
  = skip 262,144,000 bytes
  = start at chunk 1's data

Key point: When you add to a pointer in C/C++, the compiler automatically
multiplies by sizeof(type). So h_A_all + offset means "start offset floats
into the array", not "start offset bytes into memory".
Here, cudaMemcpyAsync returns immediately, letting the loop continue. Operations within a stream execute in order (copy → compute → copy), but different streams execute concurrently. While stream 0 computes, stream 1 copies data in, and stream 2 copies results out.
What actually happens: CPU queues work, GPU executes it:
CPU behavior (very fast, microseconds per iteration):
- Loop runs through all 100 chunks rapidly
- Each iteration just QUEUES operations (doesn't wait for them)
- `cudaMemcpyAsync` returns immediately (just adds to stream's queue)
- Kernel launch returns immediately (just adds to stream's queue)
- CPU finishes the entire loop in milliseconds
- Then CPU continues to other work (or waits at cudaStreamSynchronize)

GPU behavior (slow, processes when ready):
- Receives queued operations in each stream
- Executes operations in each stream sequentially (H→D, compute, D→H)
- BUT executes different streams concurrently
- Takes seconds to process all 100 chunks

Timeline example (to scale):

Time →     0ms    1ms                                    2000ms (2 seconds)
           ▼      ▼                                      ▼
CPU:       |████| Loop finishes instantly! → [CPU free to do other work]
                  ↑ All 100 chunks queued

GPU:       |──────────────────────────────────────────────────────────|
Stream 0:  |─H→D─|─Compute─|─D→H─|─H→D─|─Compute─|─D→H─|... (chunks 0,4,8,12...)
Stream 1:    |─H→D─|─Compute─|─D→H─|─H→D─|─Compute─|─D→H─|... (chunks 1,5,9,13...)
Stream 2:      |─H→D─|─Compute─|─D→H─|─H→D─|─Compute─|...     (chunks 2,6,10,14...)
Stream 3:        |─H→D─|─Compute─|─D→H─|─H→D─|─Compute|       (chunks 3,7,11,15...)
                  ↑ GPU crunching numbers for ~2 seconds total

What happens in practice:

1. for loop runs (100 iterations): ~1 millisecond
   - Each iteration just queues the two input copies, the kernel launch, and
     the output copy for one chunk
   - No waiting, just scheduling work
   - Loop exits immediately after queuing all 100 chunks

2. GPU processes queued work: ~2000 milliseconds (2 seconds)
   - Works through each stream's queue sequentially
   - Processes 4 streams concurrently
   - Stream 0 processes chunks 0→4→8→12→...→96  (25 chunks total)
   - Stream 1 processes chunks 1→5→9→13→...→97  (25 chunks total)
   - Stream 2 processes chunks 2→6→10→14→...→98 (25 chunks total)
   - Stream 3 processes chunks 3→7→11→15→...→99 (25 chunks total)
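A tiny standalone sketch of this behavior (illustrative only; dummyKernel and the buffer size are made up for the demo) uses cudaStreamQuery to show that the CPU reaches the next line while the queued work is still running:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}

int main() {
    const int n = 1 << 24;                       // 16M floats (~64 MB)
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));       // pinned, so the copy can be async
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Queue a copy and a kernel; both calls return immediately
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    dummyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

    // cudaStreamQuery returns cudaErrorNotReady while queued work is still running
    cudaError_t status = cudaStreamQuery(stream);
    printf("Right after queuing: %s\n",
           status == cudaSuccess ? "stream already finished" : "stream still busy");

    cudaStreamSynchronize(stream);               // now the CPU actually waits
    printf("After synchronize: stream finished\n");

    cudaStreamDestroy(stream);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}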
How to choose the number of streams?
The optimal number of streams depends on the balance between copy time and compute time for your workload.
Theoretical guideline (simplified rule of thumb):
Goal: Keep GPU compute fully utilized (never idle waiting for data)
To achieve full overlap, while GPU computes one chunk, you need enough
streams to be copying data for upcoming chunks.
Derivation:
- During compute time T, the GPU is busy computing one chunk
- During that same time, the copies for upcoming chunks must finish
- Copy time for one chunk = C (includes both H→D and D→H)
- To keep compute fed, copies must deliver one chunk at least every T
- One stream finishes its copies every C, so roughly C / T streams must be copying
Formula: num_streams ≥ C / T (this keeps compute continuously fed)
However, the formula num_streams ≥ 2 × (C / T) + 1 adds safety margin for:
- Separate H→D and D→H copy engines (factor of 2)
- One extra stream for buffer (the +1)
Example:
- Copy H→D: 10 ms
- Compute: 50 ms
- Copy D→H: 10 ms
- Total copy: C = 20 ms, Compute: T = 50 ms
Basic formula: 20 / 50 = 0.4 → Need at least 1 stream
Conservative formula: 2 × (20 / 50) + 1 = 1.8 → Round up to 2 streams
Reality check: This is a rough guideline, not a law!
- Assumes perfect overlap (rarely happens in practice)
- Ignores PCIe bandwidth limits, copy engine contention
- Best approach: Start with 4 streams, then measure and tune
Practical considerations:
1. VRAM limitation (most important):
- Each stream needs its own buffers (d_A, d_B, d_C)
- More streams = more VRAM usage
- Example: 4 streams × 3 buffers × 1000 matrices × 256² × 4 bytes = 3 GB
- If VRAM is limited, reduce num_streams or chunk_size
2. Compute-bound vs memory-bound:
- Compute-bound (kernel time >> copy time): Use 2-4 streams
- Memory-bound (copy time >> kernel time): Use 4-8 streams
- Balanced workload: Use 3-4 streams
3. Diminishing returns:
- Beyond 4-8 streams, speedup plateaus
- More streams = more synchronization overhead
- GPU hardware has limited concurrent copy engines (typically 2)
4. Hardware constraints:
- PCIe lanes limit concurrent transfers
- GPU copy engines (typically 2: one H→D, one D→H)
- SM occupancy and scheduling capacity
Typical recommendations:
- Start with 4 streams (good balance for most workloads)
- If compute-heavy: Try 2-3 streams
- If memory-bound: Try 6-8 streams
- Always measure performance with your specific workload
- Watch VRAM usage: nvidia-smi to monitor memory consumption
How to test:
Run your code with different num_streams values (2, 4, 8, 16) and measure:
- Total execution time (lower is better)
- VRAM usage (must fit in available memory)
- GPU utilization (nvidia-smi dmon -s u)
Choose the smallest num_streams that gives near-maximum speedup.
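A minimal timing harness for this test might look like the sketch below. Here runPipeline is a hypothetical stand-in for your Tutorial 5 pipeline (shown as an empty stub), and the measured wall-clock time includes all GPU work because of the surrounding cudaDeviceSynchronize calls:

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Hypothetical helper: runs the stream pipeline with the given stream count
// and returns only after all of its streams have been synchronized.
void runPipeline(int numStreams) {
    // ... allocate per-stream buffers, loop over chunks, cudaStreamSynchronize ...
}

int main() {
    int counts[] = {2, 4, 8, 16};
    for (int numStreams : counts) {
        cudaDeviceSynchronize();                       // clean starting point
        auto t0 = std::chrono::high_resolution_clock::now();

        runPipeline(numStreams);                       // includes its own stream syncs
        cudaDeviceSynchronize();                       // make sure all GPU work is done

        auto t1 = std::chrono::high_resolution_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("num_streams = %2d -> %.1f ms\n", numStreams, ms);
    }
    return 0;
}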
What speedup can we expect with streaming?
Ideal speedup calculation:
Assume for one chunk:
- Copy H→D: 10 ms
- Compute: 50 ms
- Copy D→H: 10 ms
- Total: 70 ms per chunk
Without streams (100 chunks):
Total time = 100 × 70 ms = 7,000 ms = 7 seconds
With streams (4 concurrent):
Copies overlap with compute, so once the pipeline fills the GPU is
(almost) continuously computing. With these numbers compute dominates:
- Total time ≈ max(total H→D, total compute, total D→H)
             ≈ max(1,000 ms, 5,000 ms, 1,000 ms) = 5,000 ms
- Plus pipeline fill/drain: a few tens of ms
- Total: ~5,050 ms ≈ 5 seconds
Speedup: 7.0 / 5.0 ≈ 1.4× for this copy/compute ratio
(The gain is the copy time hidden behind compute; copy-heavy workloads
benefit far more, up to ~3× when H→D, compute, and D→H are balanced.)
Real-world considerations:
- PCIe bandwidth limits (16 GB/s typical)
- GPU compute capability and occupancy
- Memory allocation overhead
- Synchronization costs
Typical real speedup: 2-3× for transfer-heavy workloads, less when compute dominates (still very significant!)
Section 4: Synchronize and verify
// Wait for all streams to finish before reading results
for (int i = 0; i < num_streams; i++) {
    cudaStreamSynchronize(streams[i]);
}

// Now it's safe to verify results in h_C_all
bool success = true;
for (size_t i = 0; i < (size_t)total_matrices * N * N; i++) {
    if (h_C_all[i] != 3.0f) {
        success = false;
        break;
    }
}
Why synchronize? Because all operations are asynchronous, the loop completes before the GPU finishes processing. Without synchronization, you would read h_C_all while async copies are still writing to it, causing race conditions. cudaStreamSynchronize blocks the CPU until that stream completes all queued operations.
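As a variation on the same idea (a sketch, not part of the tutorial code; it plugs into the Tutorial 5 main() right after the chunk loop), you can record an event at the end of each stream's queue and wait on those events. This is functionally equivalent here, but events can also be polled without blocking:

// Record one event per stream after all of its work has been queued
cudaEvent_t done[num_streams];
for (int i = 0; i < num_streams; i++) {
    cudaEventCreate(&done[i]);
    cudaEventRecord(done[i], streams[i]);   // marks the end of this stream's queue
}

// Block until each stream reaches its event, then clean up
for (int i = 0; i < num_streams; i++) {
    cudaEventSynchronize(done[i]);
    cudaEventDestroy(done[i]);
}
// h_C_all is now safe to read, exactly as after cudaStreamSynchronize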
Tutorial 7: Batch processing with streams and tiling (100k images of 1024×1024)
Process 100,000 matrices (1024×1024 each) using both tiling and streams to maximize GPU utilization.
Problem: Large batch matrix operations
You have 100,000 images (1024×1024 pixels each) and need to apply a transformation like convolution, filtering, or matrix multiplication. Processing them one at a time wastes GPU resources. Processing all at once exceeds memory limits.
Note
Terminology: In this tutorial, “batch” refers to a group of matrices processed together in one iteration (same as “chunk” in Tutorial 4). We use “batch” here because it’s more common in machine learning and image processing contexts. The concept is identical: dividing a large dataset into smaller groups that fit in memory.
Challenge:
Total data: 100,000 matrices × 1024×1024 × 4 bytes = 400 GB
GPU memory: Typically 8-24 GB
Need to process in batches while keeping GPU busy
Solution approach:
Combine Tutorial 5 (streams) and Tutorial 6 (tiling):
Divide 100,000 matrices into batches that fit in GPU memory
Use multiple streams to overlap CPU→GPU transfer, computation, and GPU→CPU transfer
Within each computation, use tiled matrix operations for efficiency
Pipeline the batches to maximize throughput
Step 1: Understanding the pipeline
We’ll process matrices in batches of 1,000, using 3 streams for triple buffering:
Time →
Stream 0: [H→D batch 0] [Compute batch 0] [D→H batch 0]
Stream 1:               [H→D batch 1]     [Compute batch 1] [D→H batch 1]
Stream 2:                                 [H→D batch 2]     [Compute batch 2] [D→H batch 2]
Result: All three stages happen simultaneously across different batches
Step 2: Memory organization
For 100,000 matrices in batches of 1,000, we need a realistic memory strategy:
Storage (disk/database): 100,000 matrices (400 GB total)
├─ Load batches on-demand as needed
└─ Save processed results back to disk
Host memory (pinned, for fast transfer):
Stream 0 buffers: input (4 GB) + output (4 GB) = 8 GB
Stream 1 buffers: input (4 GB) + output (4 GB) = 8 GB
Stream 2 buffers: input (4 GB) + output (4 GB) = 8 GB
Total: 24 GB (realistic for modern systems)
GPU memory (device):
Stream 0: input (4 GB) + output (4 GB) = 8 GB
Stream 1: input (4 GB) + output (4 GB) = 8 GB
Stream 2: input (4 GB) + output (4 GB) = 8 GB
Total: 24 GB (fits on a 32 GB GPU; a 16 GB GPU needs smaller chunks or fewer streams)
Data flow:
Disk → Pinned buffer → GPU → Pinned buffer → Disk
Key: Only 3 batches in memory at once, cycling through all 100 batches
Step 3: Implementation strategy
// Configuration
#define N 1024 // Matrix dimension
#define TILE_SIZE 16 // Tile size for shared memory
#define CHUNK_SIZE 1000 // Matrices per chunk
#define NUM_CHUNKS 100 // Total chunks (100,000 / 1,000)
#define NUM_STREAMS 3 // Triple buffering
// Tiled kernel: Apply 3×3 Gaussian blur filter
// Why tiling helps: Each output pixel needs 9 input pixels
// With shared memory: Load once, reuse 9 times by different threads
// Without shared memory: Each thread would load from slow global memory
__global__ void processMatrixTiled(float* input, float* output, int N)
{
// Shared memory tile (with halo for neighboring pixels)
__shared__ float tile[TILE_SIZE + 2][TILE_SIZE + 2];
// Calculate global position
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
// Load tile cooperatively (including 1-pixel border)
int tileRow = threadIdx.y + 1; // +1 for halo
int tileCol = threadIdx.x + 1;
if (row < N && col < N) {
tile[tileRow][tileCol] = input[row * N + col];
} else {
tile[tileRow][tileCol] = 0.0f;
}
    // Load halo (border pixels needed for filter)
    // Out-of-range neighbours are set to 0 so edge pixels stay well defined
    if (threadIdx.y == 0) {
        tile[0][tileCol] = (row > 0) ? input[(row-1) * N + col] : 0.0f;             // Top
    }
    if (threadIdx.y == TILE_SIZE-1) {
        tile[TILE_SIZE+1][tileCol] = (row < N-1) ? input[(row+1) * N + col] : 0.0f; // Bottom
    }
    if (threadIdx.x == 0) {
        tile[tileRow][0] = (col > 0) ? input[row * N + (col-1)] : 0.0f;             // Left
    }
    if (threadIdx.x == TILE_SIZE-1) {
        tile[tileRow][TILE_SIZE+1] = (col < N-1) ? input[row * N + (col+1)] : 0.0f; // Right
    }
    // Corner halo cells (read by the tile's corner threads; without these they
    // would read uninitialized shared memory)
    if (threadIdx.x == 0 && threadIdx.y == 0)
        tile[0][0] = (row > 0 && col > 0) ? input[(row-1) * N + (col-1)] : 0.0f;
    if (threadIdx.x == TILE_SIZE-1 && threadIdx.y == 0)
        tile[0][TILE_SIZE+1] = (row > 0 && col < N-1) ? input[(row-1) * N + (col+1)] : 0.0f;
    if (threadIdx.x == 0 && threadIdx.y == TILE_SIZE-1)
        tile[TILE_SIZE+1][0] = (row < N-1 && col > 0) ? input[(row+1) * N + (col-1)] : 0.0f;
    if (threadIdx.x == TILE_SIZE-1 && threadIdx.y == TILE_SIZE-1)
        tile[TILE_SIZE+1][TILE_SIZE+1] = (row < N-1 && col < N-1) ? input[(row+1) * N + (col+1)] : 0.0f;
__syncthreads();
// Apply 3×3 Gaussian blur using shared memory
// Gaussian kernel weights (normalized):
// [1 2 1]
// [2 4 2] / 16
// [1 2 1]
if (row < N && col < N) {
float result =
tile[tileRow-1][tileCol-1] * 1.0f + // Top-left
tile[tileRow-1][tileCol] * 2.0f + // Top
tile[tileRow-1][tileCol+1] * 1.0f + // Top-right
tile[tileRow][tileCol-1] * 2.0f + // Left
tile[tileRow][tileCol] * 4.0f + // Center
tile[tileRow][tileCol+1] * 2.0f + // Right
tile[tileRow+1][tileCol-1] * 1.0f + // Bottom-left
tile[tileRow+1][tileCol] * 2.0f + // Bottom
tile[tileRow+1][tileCol+1] * 1.0f; // Bottom-right
output[row * N + col] = result / 16.0f; // Normalize
}
}
int main()
{
// Total: 100,000 matrices
int totalMatrices = NUM_CHUNKS * CHUNK_SIZE;
size_t matrixSize = N * N * sizeof(float);
size_t chunkBytes = CHUNK_SIZE * matrixSize; // 4 GB per chunk
// ═══════════════════════════════════════════════════════════
// STEP 1: Allocate pinned host memory for NUM_STREAMS chunks
// ═══════════════════════════════════════════════════════════
// Realistic: Only allocate enough for in-flight chunks
// With 3 streams: 3 chunks in (12 GB) + 3 chunks out (12 GB) = 24 GB
float *h_input[NUM_STREAMS], *h_output[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
cudaMallocHost(&h_input[i], chunkBytes); // 4 GB per stream
cudaMallocHost(&h_output[i], chunkBytes); // 4 GB per stream
}
// Total host memory: 3 × (4 GB + 4 GB) = 24 GB (realistic!)
// Larger dataset stored in regular pageable memory or on disk
// For this example, we'll simulate loading from disk chunk by chunk
// ═══════════════════════════════════════════════════════════
// STEP 2: Allocate device memory (one set per stream)
// ═══════════════════════════════════════════════════════════
float *d_input[NUM_STREAMS], *d_output[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
cudaMalloc(&d_input[i], chunkBytes); // 4 GB per stream
cudaMalloc(&d_output[i], chunkBytes); // 4 GB per stream
}
// Total device memory: 3 × (4 GB + 4 GB) = 24 GB
// ═══════════════════════════════════════════════════════════
// STEP 3: Create streams
// ═══════════════════════════════════════════════════════════
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamCreate(&streams[i]);
}
// ═══════════════════════════════════════════════════════════
// STEP 4: Process chunks with pipelining
// ═══════════════════════════════════════════════════════════
// Grid configuration for 1000 matrices per chunk
dim3 threadsPerBlock(TILE_SIZE, TILE_SIZE); // 16×16 = 256 threads
dim3 blocksPerMatrix((N + TILE_SIZE - 1) / TILE_SIZE,
(N + TILE_SIZE - 1) / TILE_SIZE); // 64×64 blocks
for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
// Use streams in round-robin fashion
int streamIdx = chunk % NUM_STREAMS;
cudaStream_t stream = streams[streamIdx];
// ───────────────────────────────────────────────────────
// Wait for the previous chunk issued on this stream before
// reusing its pinned buffers, then save its results
// ───────────────────────────────────────────────────────
if (chunk >= NUM_STREAMS) {
int prevChunk = chunk - NUM_STREAMS;
cudaStreamSynchronize(stream); // Previous chunk on this stream is now complete
// Save results from h_output[streamIdx]: write to disk, database, or network
printf("Chunk %d processed (1000 matrices)\n", prevChunk);
}
// ───────────────────────────────────────────────────────
// Load chunk from disk/pageable memory into pinned buffer
// ───────────────────────────────────────────────────────
// In real application: read from file, database, or network
// For now, simulate by generating data on-the-fly
for (int i = 0; i < CHUNK_SIZE * N * N; i++) {
h_input[streamIdx][i] = rand() / (float)RAND_MAX;
}
// ───────────────────────────────────────────────────────
// Pipeline three operations on this stream:
// ───────────────────────────────────────────────────────
// 1. Transfer input chunk to GPU (async)
cudaMemcpyAsync(d_input[streamIdx], h_input[streamIdx],
chunkBytes, cudaMemcpyHostToDevice, stream);
// 2. Process all matrices in chunk (async on stream)
for (int m = 0; m < CHUNK_SIZE; m++) {
size_t matOffset = m * N * N;
processMatrixTiled<<<blocksPerMatrix, threadsPerBlock, 0, stream>>>(
d_input[streamIdx] + matOffset,
d_output[streamIdx] + matOffset,
N
);
}
// 3. Transfer results back to host (async)
cudaMemcpyAsync(h_output[streamIdx], d_output[streamIdx],
chunkBytes, cudaMemcpyDeviceToHost, stream);
}
// ═══════════════════════════════════════════════════════════
// STEP 5: Wait for final chunks and save results
// ═══════════════════════════════════════════════════════════
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamSynchronize(streams[i]);
// Save final results
printf("Final chunk %d processed\n", NUM_CHUNKS - NUM_STREAMS + i);
}
// Cleanup
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamDestroy(streams[i]);
cudaFree(d_input[i]);
cudaFree(d_output[i]);
cudaFreeHost(h_input[i]);
cudaFreeHost(h_output[i]);
}
return 0;
}
Summary
Tutorial 7 combines the best techniques from the series:
Tiling (Tutorial 6): Each matrix uses shared memory for 10-20× speedup
Streams (Tutorial 5): Multiple chunks pipeline to hide transfer latency
Chunking (Tutorials 3-4): Process data in manageable chunks
Key takeaways:
Memory hierarchy: Use shared memory (fast) for repeated access
Asynchronous execution: Overlap transfers and computation with streams
Resource management: Allocate GPU memory per stream, reuse across chunks
Pipeline depth: 3 streams achieve good balance between complexity and performance
This pattern scales to:
- Image processing (filters, transformations)
- Deep learning inference (batch predictions)
- Scientific computing (simulations, data analysis)
- Video processing (frame-by-frame operations)
For conceptual material on GPU architecture and CUDA fundamentals, see CUDA fundamentals.
Tutorial 8: Using cuBLAS for production matrix multiplication
Problem
In Tutorials 1-7, we learned how GPU parallelism works by implementing matrix multiplication from scratch. We progressed from naive implementations to optimized tiled versions with shared memory. However, for production code, you should use cuBLAS (CUDA Basic Linear Algebra Subprograms), NVIDIA’s highly optimized library.
Why use cuBLAS?
Performance: 2-5× faster than hand-written kernels (uses advanced optimizations)
Maintained: NVIDIA updates it for new GPU architectures
Tensor cores: Automatically uses specialized hardware when available
Less code: Single function call vs hundreds of lines of kernel code
When to write custom kernels:
Non-standard operations (custom transformations, domain-specific algorithms)
Fused operations (combine multiple operations to reduce memory traffic)
Research/experimentation (testing new algorithms)
When to use cuBLAS:
Standard linear algebra (matrix multiplication, vector operations, factorizations)
Production code (reliability and performance matter)
Scientific applications: electron microscopy (ptychography reconstruction uses extensive matrix operations and FFTs), molecular dynamics simulations, climate modeling
Deep learning (neural network layers are matrix multiplications)
This tutorial shows how to use cuBLAS for the same 1024×1024 matrix multiplication we’ve been working with. These operations appear repeatedly in computational pipelines: ptychography reconstructs high-resolution images through iterative matrix updates and Fourier transforms, processing thousands of diffraction patterns where each iteration involves matrix multiplications that cuBLAS accelerates significantly.
Solution
Instead of writing kernel code with shared memory tiling (Tutorial 6), we call a single cuBLAS function. The library handles all optimizations internally.
What cuBLAS does behind the scenes:
When you call cublasSgemm, the library:
Analyzes your GPU architecture: Detects compute capability, memory hierarchy, and available features
Selects optimal algorithm: Chooses tile sizes, blocking strategies, and memory access patterns
Uses specialized hardware: Activates tensor cores on modern GPUs (Volta/Turing/Ampere) for massive speedup
Manages memory efficiently: Handles shared memory allocation, register usage, and bank conflicts
Launches optimized kernels: Uses multiple kernel variants depending on matrix dimensions
For example, on an RTX 3090:
Detects tensor cores available
Uses 256×128 tiles (much larger than our 16×16 in Tutorial 6)
Employs warp-level matrix operations
Achieves 140+ GFLOPS vs our 27 GFLOPS
The cuBLAS GEMM function:
// GEMM: General Matrix-Matrix multiplication
// Computes: C = alpha * op(A) * op(B) + beta * C
// where op(X) can be X or X^T (transpose)
cublasSgemm(handle, // cuBLAS context
transA, // CUBLAS_OP_N (no transpose) or CUBLAS_OP_T
transB, // CUBLAS_OP_N or CUBLAS_OP_T
m, n, k, // Matrix dimensions
&alpha, // Pointer to scaling factor
A, lda, // Matrix A and leading dimension
B, ldb, // Matrix B and leading dimension
&beta, // Pointer to scaling factor
C, ldc); // Matrix C (output) and leading dimension
Concrete example: Multiply two 1024×1024 matrices
// We want: C = A × B where A, B, C are all 1024×1024
const int N = 1024;
const float alpha = 1.0f; // Scaling factor for A × B (1.0 = no scaling)
const float beta = 0.0f; // Scaling factor for C (0.0 = overwrite C)
// If beta=1.0, would compute C = A×B + C (accumulate)
// Matrix dimensions for standard multiplication:
// A is 1024×1024 (m × k), B is 1024×1024 (k × n), C is 1024×1024 (m × n)
int m = N, n = N, k = N;
// CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C defined in cublas_v2.h
// CUBLAS_OP_N = no transpose (use matrix as-is)
// CUBLAS_OP_T = transpose matrix
// CUBLAS_OP_C = conjugate transpose (for complex matrices)
// Row-major trick: pass d_B first and d_A second.
// cuBLAS sees our row-major buffers as column-major (i.e. transposed), so this
// call computes B^T × A^T = (A × B)^T in its column-major view, which is
// exactly A × B when the result buffer is read back in row-major order
cublasSgemm(handle, // cuBLAS library context (created once, reused)
CUBLAS_OP_N, // Don't transpose the first matrix argument
CUBLAS_OP_N, // Don't transpose the second matrix argument
m, n, k, // Dimensions: C(m×n) = A(m×k) × B(k×n)
&alpha, // Pointer to alpha scaling factor
d_B, N, // Matrix B passed first (row-major trick), leading dimension N
d_A, N, // Matrix A passed second, leading dimension N
&beta, // Pointer to beta scaling factor
d_C, N); // Matrix C on device (output), leading dimension N
// Leading dimension (lda, ldb, ldc) = stride in elements between consecutive
// columns in cuBLAS's column-major view; for a dense N×N matrix this is N
Column-major vs row-major storage:
cuBLAS follows Fortran convention (column-major), while C/C++ uses row-major:
Row-major (C/C++): Column-major (Fortran):
Matrix [0 1 2] Matrix [0 1 2]
[3 4 5] [3 4 5]
Stored as: 0,1,2,3,4,5 Stored as: 0,3,1,4,2,5
(rows consecutive) (columns consecutive)
cuBLAS therefore sees each of our row-major buffers as the transpose of the matrix we intended. If we passed A and B unchanged with CUBLAS_OP_N, the result read back in row-major order would be B × A rather than A × B. The standard fix, used in the call above, is to swap the A and B arguments: cuBLAS then computes (A × B)^T in its column-major view, which is exactly A × B when the output buffer is read back in row-major order. No data is moved and no explicit transpose is needed, as long as we stay consistent with this convention throughout the computation. (Quick check with 2×2 matrices: for A = [[1,2],[3,4]] and B = [[5,6],[7,8]] stored row-major, the swapped call produces [[19,22],[43,50]], which is A × B.)
Implementation
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
// ═══════════════════════════════════════════════════════════
// STEP 1: Setup and initialization
// ═══════════════════════════════════════════════════════════
const int N = 1024;
const size_t bytes = N * N * sizeof(float);
// Allocate host memory
float *h_A = (float*)malloc(bytes);
float *h_B = (float*)malloc(bytes);
float *h_C = (float*)malloc(bytes);
// Initialize matrices with simple values
for (int i = 0; i < N * N; i++) {
h_A[i] = rand() / (float)RAND_MAX;
h_B[i] = rand() / (float)RAND_MAX;
}
// ═══════════════════════════════════════════════════════════
// STEP 2: Allocate device memory
// ═══════════════════════════════════════════════════════════
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, bytes);
cudaMalloc(&d_B, bytes);
cudaMalloc(&d_C, bytes);
// Transfer data to GPU
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
// ═══════════════════════════════════════════════════════════
// STEP 3: Initialize cuBLAS
// ═══════════════════════════════════════════════════════════
cublasHandle_t handle;
cublasCreate(&handle);
// ═══════════════════════════════════════════════════════════
// STEP 4: Perform matrix multiplication using cuBLAS
// ═══════════════════════════════════════════════════════════
// Compute C = A × B using GEMM (General Matrix-Matrix multiplication)
const float alpha = 1.0f; // Scaling factor for A × B
const float beta = 0.0f; // Scaling factor for C (0 means overwrite)
// Row-major trick (see the column-major discussion above): pass d_B first
// and d_A second so the result read back in row-major order is C = A × B
cublasSgemm(handle,
CUBLAS_OP_N, CUBLAS_OP_N, // No transpose needed
N, N, N, // m, n, k (all N for square matrices)
&alpha, // Scaling factor alpha
d_B, N, // Matrix B passed first, leading dimension N
d_A, N, // Matrix A passed second, leading dimension N
&beta, // Scaling factor beta
d_C, N); // Matrix C, leading dimension N
// ═══════════════════════════════════════════════════════════
// STEP 5: Copy result back
// ═══════════════════════════════════════════════════════════
cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
printf("Matrix multiplication completed\n");
printf("Result matrix C = A × B (1024×1024)\n");
// ═══════════════════════════════════════════════════════════
// STEP 6: Cleanup
// ═══════════════════════════════════════════════════════════
cublasDestroy(handle);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
return 0;
}
Compilation:
nvcc -o tutorial8 tutorial8.cu -lcublas
Key differences from custom kernels:
No kernel code: cuBLAS handles all GPU operations internally
Column-major handling: Swap A and B to account for storage order
Handle management: Create/destroy cuBLAS context
Linking: Need to link against the cuBLAS library (-lcublas)
Performance comparison
Typical performance for 1024×1024 matrix multiplication on modern GPUs:
| Implementation | Time (ms) | GFLOPS | Notes |
|---|---|---|---|
| Naive (Tutorial 1) | ~800 ms | ~2.7 | Global memory only |
| Tiled (Tutorial 6) | ~80 ms | ~27 | Shared memory, 16×16 tiles |
| cuBLAS (Tutorial 8) | ~15 ms | ~140 | Highly optimized, tensor cores |
Why is cuBLAS faster?
Register blocking: Uses registers for innermost loops (faster than shared memory)
Warp-level primitives: Uses specialized instructions for matrix operations
Tensor cores: On Volta/Turing/Ampere GPUs, uses dedicated matrix multiplication units
Auto-tuning: Selects optimal tile sizes and strategies based on GPU architecture
Memory access patterns: Optimized coalescing and bank conflict avoidance
cuBLAS features
Beyond basic matrix multiplication, cuBLAS provides a comprehensive suite of linear algebra operations:
| Category | Function | Description |
|---|---|---|
| Level 1 BLAS | cublasSaxpy | Vector addition: y = alpha*x + y |
| | cublasSdot | Dot product between two vectors |
| | cublasSnrm2 | Euclidean norm of a vector |
| Level 2 BLAS | cublasSgemv | Matrix-vector multiplication |
| | cublasSger | Outer product (rank-1 update) |
| Level 3 BLAS | cublasSgemm | Matrix-matrix multiplication (used in Tutorial 8) |
| | cublasStrsm | Triangular system solve |
| | cublasSsyrk | Symmetric rank-k update |
| Extensions | Batched operations | Process multiple small matrices efficiently |
| | Strided batched | Regular memory patterns for batch processing |
| | Mixed precision | FP16, FP32, FP64, and tensor core support |
| | Complex arithmetic | Operations on complex-valued matrices |
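As a quick illustration of one of the Level 1 routines from the table, the sketch below uses cublasSaxpy to compute y = alpha*x + y on the GPU (a minimal example with arbitrary sizes; error checking omitted for brevity):
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    const int n = 1024;          // Vector length (arbitrary for this sketch)
    const float alpha = 2.0f;    // y = alpha*x + y

    // Host vectors
    float *h_x = (float*)malloc(n * sizeof(float));
    float *h_y = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 3.0f; }

    // Device vectors
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    // SAXPY: y = alpha*x + y, stride 1 for both vectors
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f (expected 5.0)\n", h_y[0]);  // 2*1 + 3 = 5

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
Compile it the same way as Tutorial 8, linking against cuBLAS with -lcublas (the file name is arbitrary).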
Summary
Tutorial 8 demonstrates production-ready GPU programming:
Learned through Tutorials 1-7: Memory hierarchy, parallelism, tiling, optimization
Use in production (Tutorial 8): cuBLAS for standard operations
Key takeaways:
Education vs production: Learning how things work internally (custom kernels) vs using optimized libraries (cuBLAS)
Performance: cuBLAS achieves roughly 5× speedup over our hand-written tiled kernel (80 ms → 15 ms in the table above), and more on GPUs with tensor cores
Maintainability: Library updates automatically benefit from new GPU features
When to customize: Only write custom kernels for specialized operations not in cuBLAS
The complete GPU programming workflow:
Understand the fundamentals (Tutorials 1-7): Memory hierarchy, parallelism, optimization
Use libraries when possible (Tutorial 8): cuBLAS, cuFFT, cuDNN, Thrust
Write custom kernels only when needed: Unique operations, fused operations, research
Tutorial 9: Batched cuBLAS for large-scale processing
Problem
Tutorial 7 showed how to process 100,000 matrices using custom kernels with streams and chunking. Now we combine that approach with cuBLAS for production-quality performance. Instead of writing custom matrix multiplication kernels, we use cuBLAS batched operations.
The challenge: 100,000 matrices of 1024×1024 each:
Total data: 400 GB (too large for GPU memory)
Need to process in chunks with pipelining
Want production performance from cuBLAS
Solution: Combine Tutorial 7 (chunking + streams) with Tutorial 8 (cuBLAS)
Process data in chunks of 1,000 matrices (4 GB per chunk)
Use 3 CUDA streams for pipelining
Use cublasSgemmStridedBatched for efficient batch processing
Solution
cuBLAS batched operations:
cuBLAS provides two functions for processing multiple matrices:
cublasSgemmBatched: For matrices at arbitrary memory locations (array of pointers)
cublasSgemmStridedBatched: For matrices at regular intervals in memory (most efficient)
We use cublasSgemmStridedBatched because our matrices are stored contiguously:
// Process multiple matrices in one call
cublasSgemmStridedBatched(
handle,
CUBLAS_OP_N, CUBLAS_OP_N,
N, N, N, // Matrix dimensions (all N×N)
&alpha,
d_A, N, N*N, // A matrices: leading dim N, stride N*N between matrices
d_B, N, N*N, // B matrices: leading dim N, stride N*N between matrices
&beta,
d_C, N, N*N, // C matrices: leading dim N, stride N*N between matrices
batchCount); // Number of matrices to multiply
Key parameters:
Stride: Distance in elements between consecutive matrices (N*N for contiguous storage)
Batch count: How many matrix multiplications to perform
Single kernel launch: cuBLAS handles all matrices efficiently in one call
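To make the stride and batch-count parameters concrete before the full pipeline, here is a minimal standalone sketch that multiplies a small batch of matrices in one call (sizes are arbitrary; data initialization and error checking are omitted):
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 64;                 // Each matrix is n×n (small, for illustration)
    const int batchCount = 8;         // Number of matrices in the batch
    const long long strideElems = (long long)n * n;   // Elements between consecutive matrices
    const size_t bytes = (size_t)batchCount * n * n * sizeof(float);

    // Batches stored contiguously: matrix m starts at element offset m * n * n
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    // ... fill d_A and d_B here (e.g., cudaMemcpy from host) ...

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // One call multiplies all batchCount matrix pairs:
    // leading dimension = n, stride between matrices = n*n elements
    cublasSgemmStridedBatched(handle,
                              CUBLAS_OP_N, CUBLAS_OP_N,
                              n, n, n,
                              &alpha,
                              d_A, n, strideElems,
                              d_B, n, strideElems,
                              &beta,
                              d_C, n, strideElems,
                              batchCount);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}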
Implementation
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
// ═══════════════════════════════════════════════════════════
// STEP 1: Configuration
// ═══════════════════════════════════════════════════════════
const int N = 1024; // Matrix size (1024×1024)
const int CHUNK_SIZE = 1000; // Matrices per chunk
const int NUM_CHUNKS = 100; // Total 100 chunks = 100,000 matrices
const int NUM_STREAMS = 3; // Streams for pipelining
const size_t matrixSize = N * N * sizeof(float);
const size_t chunkBytes = CHUNK_SIZE * matrixSize; // 4 GB per chunk
printf("Processing %d matrices (%d chunks of %d)\n",
NUM_CHUNKS * CHUNK_SIZE, NUM_CHUNKS, CHUNK_SIZE);
printf("Chunk size: %.2f GB\n", chunkBytes / (1024.0*1024.0*1024.0));
// ═══════════════════════════════════════════════════════════
// STEP 2: Allocate pinned host memory (for async transfers)
// ═══════════════════════════════════════════════════════════
float *h_A[NUM_STREAMS], *h_B[NUM_STREAMS], *h_C[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
cudaMallocHost(&h_A[i], chunkBytes); // Input matrices A
cudaMallocHost(&h_B[i], chunkBytes); // Input matrices B
cudaMallocHost(&h_C[i], chunkBytes); // Output matrices C
}
// ═══════════════════════════════════════════════════════════
// STEP 3: Allocate device memory (one buffer per stream)
// ═══════════════════════════════════════════════════════════
float *d_A[NUM_STREAMS], *d_B[NUM_STREAMS], *d_C[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
cudaMalloc(&d_A[i], chunkBytes);
cudaMalloc(&d_B[i], chunkBytes);
cudaMalloc(&d_C[i], chunkBytes);
}
// ═══════════════════════════════════════════════════════════
// STEP 4: Create cuBLAS handle and streams
// ═══════════════════════════════════════════════════════════
cublasHandle_t handle;
cublasCreate(&handle);
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamCreate(&streams[i]);
}
// ═══════════════════════════════════════════════════════════
// STEP 5: Process chunks with pipelining
// ═══════════════════════════════════════════════════════════
const float alpha = 1.0f;
const float beta = 0.0f;
for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
int streamIdx = chunk % NUM_STREAMS;
cudaStream_t stream = streams[streamIdx];
// Set cuBLAS to use this stream
cublasSetStream(handle, stream);
// ───────────────────────────────────────────────────────
// Wait for the previous chunk issued on this stream before
// reusing its pinned buffers, then save its results
// ───────────────────────────────────────────────────────
if (chunk >= NUM_STREAMS) {
int prevChunk = chunk - NUM_STREAMS;
cudaStreamSynchronize(stream); // Previous chunk on this stream is now complete
// Save results from h_C[streamIdx]: write to disk
printf("Chunk %d processed (1000 matrices)\n", prevChunk);
}
// ───────────────────────────────────────────────────────
// Load chunk from disk into pinned memory
// ───────────────────────────────────────────────────────
// In real application: read from file
// For now: generate random data
for (int i = 0; i < CHUNK_SIZE * N * N; i++) {
h_A[streamIdx][i] = rand() / (float)RAND_MAX;
h_B[streamIdx][i] = rand() / (float)RAND_MAX;
}
// ───────────────────────────────────────────────────────
// Pipeline: Transfer in → Compute → Transfer out
// ───────────────────────────────────────────────────────
// Transfer input chunk to GPU (async)
cudaMemcpyAsync(d_A[streamIdx], h_A[streamIdx],
chunkBytes, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(d_B[streamIdx], h_B[streamIdx],
chunkBytes, cudaMemcpyHostToDevice, stream);
// Perform batched matrix multiplication (async on stream)
// Process all 1,000 matrices in the chunk with one call
cublasSgemmStridedBatched(
handle,
CUBLAS_OP_N, CUBLAS_OP_N,
N, N, N, // Matrix dimensions
&alpha,
d_A[streamIdx], N, N*N, // A: leading dim, stride
d_B[streamIdx], N, N*N, // B: leading dim, stride
&beta,
d_C[streamIdx], N, N*N, // C: leading dim, stride
CHUNK_SIZE); // Number of matrices
// Transfer results back to host (async)
cudaMemcpyAsync(h_C[streamIdx], d_C[streamIdx],
chunkBytes, cudaMemcpyDeviceToHost, stream);
}
// ═══════════════════════════════════════════════════════════
// STEP 6: Wait for final chunks and save results
// ═══════════════════════════════════════════════════════════
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamSynchronize(streams[i]);
printf("Final chunk %d processed\n", NUM_CHUNKS - NUM_STREAMS + i);
}
printf("\nAll 100,000 matrices processed successfully!\n");
// ═══════════════════════════════════════════════════════════
// STEP 7: Cleanup
// ═══════════════════════════════════════════════════════════
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamDestroy(streams[i]);
cudaFree(d_A[i]);
cudaFree(d_B[i]);
cudaFree(d_C[i]);
cudaFreeHost(h_A[i]);
cudaFreeHost(h_B[i]);
cudaFreeHost(h_C[i]);
}
cublasDestroy(handle);
return 0;
}
Compilation:
nvcc -o tutorial9 tutorial9.cu -lcublas
Key differences from Tutorial 7:
| Aspect | Tutorial 7 (Custom kernel) | Tutorial 9 (cuBLAS batched) |
|---|---|---|
| Kernel code | Manual tiling implementation | No kernel needed |
| Batch processing | Loop over matrices, launch kernel per matrix | Single cublasSgemmStridedBatched call |
| Performance | ~27 GFLOPS per matrix | ~140 GFLOPS per matrix |
| Code complexity | 100+ lines of kernel code | Single function call |
| Maintainability | Must update for new GPUs | Automatically optimized |
Why use batched operations
Performance benefits:
Reduced kernel launch overhead: One launch for 1,000 matrices vs 1,000 launches
Better GPU utilization: cuBLAS schedules work efficiently across SMs
Optimized memory access: Library handles data layout for maximum throughput
When to use batched operations:
Processing many small to medium matrices (common in deep learning, simulations)
Matrices stored contiguously in memory (use StridedBatched)
Need production performance without custom optimization
When not to use:
Single large matrix (use regular cublasSgemm)
Matrices at non-uniform memory locations (may need cublasSgemmBatched with a pointer array; see the sketch after this list)
Custom operations not available in cuBLAS
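For the pointer-array case, cublasSgemmBatched takes device-side arrays of matrix pointers, one entry per matrix. Below is a minimal sketch of how those pointer arrays are set up; the helper name and its parameters are illustrative, and error checking is omitted:
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Multiplies batchCount pairs of n×n matrices that may live anywhere in
// device memory (they need not be contiguous). d_Amats, d_Bmats, d_Cmats
// are host arrays holding the device pointers to the individual matrices.
void batchedGemm(cublasHandle_t handle, int n, int batchCount,
                 float **d_Amats, float **d_Bmats, float **d_Cmats)
{
    const float alpha = 1.0f, beta = 0.0f;

    // cublasSgemmBatched expects the pointer arrays themselves in GPU memory,
    // so copy the host arrays of device pointers to the device first
    const float **d_Aarray;
    const float **d_Barray;
    float **d_Carray;
    cudaMalloc(&d_Aarray, batchCount * sizeof(float*));
    cudaMalloc(&d_Barray, batchCount * sizeof(float*));
    cudaMalloc(&d_Carray, batchCount * sizeof(float*));
    cudaMemcpy(d_Aarray, d_Amats, batchCount * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Barray, d_Bmats, batchCount * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Carray, d_Cmats, batchCount * sizeof(float*), cudaMemcpyHostToDevice);

    // One call performs all batchCount multiplications
    cublasSgemmBatched(handle,
                       CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n,
                       &alpha,
                       d_Aarray, n,
                       d_Barray, n,
                       &beta,
                       d_Carray, n,
                       batchCount);

    cudaFree(d_Aarray); cudaFree(d_Barray); cudaFree(d_Carray);
}
When the matrices are contiguous, cublasSgemmStridedBatched avoids building these pointer arrays entirely, which is why it is the preferred choice in this tutorial.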
Summary
Tutorial 9 completes the progression from basics to production:
Tutorials 1-6: Learn GPU fundamentals (memory, parallelism, optimization)
Tutorial 7: Apply concepts at scale (chunking, streams, custom kernels)
Tutorial 8: Use optimized libraries (cuBLAS for single operation)
Tutorial 9: Combine both (cuBLAS batched + chunking + streams)
Key takeaways:
Batched operations: Process multiple matrices efficiently with single library call
Production workflow: Chunking + streams + cuBLAS = optimal throughput
Simplicity: 5-10× speedup with less code than custom kernels
Scalability: Pattern works for millions of matrices with streaming from disk
Tutorial 10: Multi-operation pipeline (cuBLAS + cuFFT)
Problem
Real applications often need multiple operations on the same data. For example, processing 100,000 images (1024×1024) where each image needs:
Matrix multiplication: Apply a transformation matrix (using cuBLAS)
FFT: Transform to frequency domain (using cuFFT)
This appears in ptychography reconstruction, where each diffraction pattern undergoes matrix updates followed by Fourier transforms. Processing 100,000 patterns requires efficient pipelining of both operations.
The challenge:
~800 GB of complex-valued data in total (too large for GPU memory)
Two different library operations per image
Need to maintain high GPU utilization throughout the pipeline
Solution: Combine cuBLAS and cuFFT in a streaming pipeline:
Process in chunks of 1,000 images (8 GB per chunk of complex data)
Use 3 streams for pipelining
Apply both operations to each chunk before moving to the next
Solution
Understanding FFT (Fast Fourier Transform):
The Fourier transform converts data from spatial/time domain to frequency domain. For images:
Spatial domain: Pixel intensities at positions \((x, y)\)
Frequency domain: Amplitude and phase of different spatial frequencies
Complex numbers in CUDA:
C and C++ don’t have built-in complex number types like Python’s 3+4j. CUDA libraries define their own types to represent complex numbers.
cuFFT provides cufftComplex - a struct that holds the real and imaginary parts:
// This is already defined in cufft.h when you #include <cufft.h>
// You don't write this yourself - the library provides it
typedef struct {
float x; // Real part
float y; // Imaginary part
} cufftComplex;
Think of it as a simple container holding two floats. When you create a cufftComplex variable, you’re allocating space for both the real and imaginary components.
Example usage:
#include <cufft.h> // Provides cufftComplex type
// Create the complex number 3 + 4i
cufftComplex z;
z.x = 3.0f; // Real part (the "3" in 3+4i)
z.y = 4.0f; // Imaginary part (the "4" in 3+4i)
// Calculate magnitude: |z| = sqrt(3² + 4²) = sqrt(9 + 16) = 5
float magnitude = sqrtf(z.x * z.x + z.y * z.y);
printf("Magnitude: %.1f\n", magnitude); // Prints: 5.0
// For an array of complex numbers (like an image):
cufftComplex* image = (cufftComplex*)malloc(1024 * 1024 * sizeof(cufftComplex));
// Each pixel now has .x (real) and .y (imaginary) parts
For our \(1024 \times 1024\) image:
Input: \(1024 \times 1024\) complex values (8 MB per image: 2 floats × 4 bytes each)
Output: \(1024 \times 1024\) complex values in frequency domain (same size)
Operation: 2D FFT processes rows, then columns
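Before looking at the batched pipeline, here is what a single-image 2D FFT looks like with cuFFT (a minimal sketch; the input would normally be copied from the host, and error checking is omitted):
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int N = 1024;                                   // Image dimension
    const size_t bytes = (size_t)N * N * sizeof(cufftComplex);

    // Device buffer holding one 1024×1024 complex image
    cufftComplex *d_image;
    cudaMalloc(&d_image, bytes);
    // ... copy image data from the host with cudaMemcpy ...

    // Plan a single 2D complex-to-complex transform
    cufftHandle plan;
    cufftPlan2d(&plan, N, N, CUFFT_C2C);

    // Forward FFT, in place: spatial domain → frequency domain
    cufftExecC2C(plan, d_image, d_image, CUFFT_FORWARD);

    // The inverse transform would use CUFFT_INVERSE (remember: cuFFT does not
    // normalize, so divide by N*N afterwards to recover the original scale)

    cufftDestroy(plan);
    cudaFree(d_image);
    return 0;
}
Compile with nvcc and -lcufft (the file name is arbitrary).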
Using multiple CUDA libraries together:
Both cuBLAS and cuFFT work with CUDA streams, allowing us to pipeline operations:
// For each chunk:
// 1. Transfer data to GPU (async on stream)
// 2. Apply matrix multiplication using cuBLAS (async on stream)
// - Works on complex matrices using cublasCgemm
// 3. Apply 2D FFT using cuFFT (async on stream)
// - Transforms spatial domain → frequency domain
// 4. Transfer results back (async on stream)
No custom kernels needed - both libraries handle GPU optimization internally.
What’s actually running on the GPU:
When you call cublasCgemmStridedBatched and cufftExecC2C, each library launches its own highly optimized GPU kernels:
cuBLAS call → Launches multiple kernels internally:
Tiling kernels (like Tutorial 6, but more sophisticated)
Register blocking kernels
Tensor core kernels (on modern GPUs)
Each library kernel is pre-compiled and optimized
cuFFT call → Launches FFT kernels internally:
Cooley-Tukey algorithm kernels for row transforms
Additional kernels for column transforms
Twiddle factor calculations
Memory transpose operations
These kernels execute sequentially on the same stream:
Stream timeline:
──────────────────────────────────────────────────────────────
[Transfer In] → [cuBLAS kernels] → [cuFFT kernels] → [Transfer Out]
(multiple) (multiple)
The stream ensures operations happen in order
Key insight: You’re not writing kernels, but you’re still using them! Each library function is like a black box containing dozens of optimized GPU kernels. Different CUDA libraries can share streams and work on the same data buffer, enabling complex multi-operation pipelines without writing any GPU kernel code yourself.
Implementation
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
// ═══════════════════════════════════════════════════════════
// STEP 1: Configuration
// ═══════════════════════════════════════════════════════════
const int N = 1024; // Image size (1024×1024)
const int CHUNK_SIZE = 1000; // Images per chunk
const int NUM_CHUNKS = 100; // Total 100 chunks = 100,000 images
const int NUM_STREAMS = 3; // Streams for pipelining
const size_t imageSize = N * N * sizeof(cufftComplex);
const size_t chunkBytes = CHUNK_SIZE * imageSize;
printf("Processing %d images (%d chunks of %d)\n",
NUM_CHUNKS * CHUNK_SIZE, NUM_CHUNKS, CHUNK_SIZE);
printf("Operations per image: Matrix multiplication + FFT\n");
printf("Chunk size: %.2f GB\n", chunkBytes / (1024.0*1024.0*1024.0));
// ═══════════════════════════════════════════════════════════
// STEP 2: Allocate pinned host memory
// ═══════════════════════════════════════════════════════════
cufftComplex *h_data[NUM_STREAMS]; // Input/output data
cufftComplex *h_result[NUM_STREAMS]; // Final results
for (int i = 0; i < NUM_STREAMS; i++) {
cudaMallocHost(&h_data[i], chunkBytes);
cudaMallocHost(&h_result[i], chunkBytes);
}
// ═══════════════════════════════════════════════════════════
// STEP 3: Allocate device memory
// ═══════════════════════════════════════════════════════════
cufftComplex *d_data[NUM_STREAMS]; // Working buffer on GPU
cufftComplex *d_temp[NUM_STREAMS]; // Temporary for matrix mult
for (int i = 0; i < NUM_STREAMS; i++) {
cudaMalloc(&d_data[i], chunkBytes);
cudaMalloc(&d_temp[i], chunkBytes);
}
// ═══════════════════════════════════════════════════════════
// STEP 4: Initialize cuBLAS and cuFFT
// ═══════════════════════════════════════════════════════════
// cuBLAS handle
cublasHandle_t cublasHandle;
cublasCreate(&cublasHandle);
// cuFFT plan for batched 2D FFT
// Process multiple 1024×1024 FFTs in one call
cufftHandle fftPlan;
int rank = 2; // 2D FFT
int n[2] = {N, N}; // Dimensions
int batch = CHUNK_SIZE; // Number of FFTs per batch
cufftPlanMany(&fftPlan, rank, n,
NULL, 1, N*N, // Input layout
NULL, 1, N*N, // Output layout
CUFFT_C2C, batch); // Complex-to-complex, batched
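// Note: this example reuses one plan across all streams for simplicity.
// A cuFFT plan owns a single work area, so for fully concurrent execution
// on several streams it is safer to create one plan per stream.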
// Create CUDA streams
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamCreate(&streams[i]);
}
// ═══════════════════════════════════════════════════════════
// STEP 5: Process chunks with multi-operation pipeline
// ═══════════════════════════════════════════════════════════
const cufftComplex alpha = {1.0f, 0.0f}; // Scaling factor for matrix mult
const cufftComplex beta = {0.0f, 0.0f};
for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
int streamIdx = chunk % NUM_STREAMS;
cudaStream_t stream = streams[streamIdx];
// Set both libraries to use this stream
cublasSetStream(cublasHandle, stream);
cufftSetStream(fftPlan, stream);
// ───────────────────────────────────────────────────────
// Wait for the previous chunk issued on this stream before
// reusing its pinned buffers, then save its results
// ───────────────────────────────────────────────────────
if (chunk >= NUM_STREAMS) {
int prevChunk = chunk - NUM_STREAMS;
cudaStreamSynchronize(stream); // Previous chunk on this stream is now complete
// Save results from h_result[streamIdx]: write to disk
printf("Chunk %d processed (1000 images: MatMul + FFT)\n", prevChunk);
}
// ───────────────────────────────────────────────────────
// Load chunk from disk
// ───────────────────────────────────────────────────────
// In real application: read diffraction patterns from file
// For now: generate random complex data
for (int i = 0; i < CHUNK_SIZE * N * N; i++) {
h_data[streamIdx][i].x = rand() / (float)RAND_MAX;
h_data[streamIdx][i].y = rand() / (float)RAND_MAX;
}
// ───────────────────────────────────────────────────────
// Pipeline: Transfer → Matrix Mult → FFT → Transfer back
// ───────────────────────────────────────────────────────
// Transfer input to GPU (async)
cudaMemcpyAsync(d_data[streamIdx], h_data[streamIdx],
chunkBytes, cudaMemcpyHostToDevice, stream);
// Operation 1: Matrix multiplication using cuBLAS (async)
// Apply transformation to all images in chunk
cublasCgemmStridedBatched(
cublasHandle,
CUBLAS_OP_N, CUBLAS_OP_N,
N, N, N,
&alpha,
d_data[streamIdx], N, N*N,
d_data[streamIdx], N, N*N, // Multiply with itself (demo)
&beta,
d_temp[streamIdx], N, N*N,
CHUNK_SIZE);
// Copy result back to d_data for FFT
cudaMemcpyAsync(d_data[streamIdx], d_temp[streamIdx],
chunkBytes, cudaMemcpyDeviceToDevice, stream);
// Operation 2: 2D FFT using cuFFT (async)
// Transform all images to frequency domain
cufftExecC2C(fftPlan, d_data[streamIdx], d_data[streamIdx],
CUFFT_FORWARD);
// Transfer results back to host (async)
cudaMemcpyAsync(h_result[streamIdx], d_data[streamIdx],
chunkBytes, cudaMemcpyDeviceToHost, stream);
}
// ═══════════════════════════════════════════════════════════
// STEP 6: Wait for final chunks
// ═══════════════════════════════════════════════════════════
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamSynchronize(streams[i]);
printf("Final chunk %d processed\n", NUM_CHUNKS - NUM_STREAMS + i);
}
printf("\nAll 100,000 images processed successfully!\n");
printf("Each image: Matrix multiplication + 2D FFT\n");
// ═══════════════════════════════════════════════════════════
// STEP 7: Cleanup
// ═══════════════════════════════════════════════════════════
cufftDestroy(fftPlan);
cublasDestroy(cublasHandle);
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamDestroy(streams[i]);
cudaFree(d_data[i]);
cudaFree(d_temp[i]);
cudaFreeHost(h_data[i]);
cudaFreeHost(h_result[i]);
}
return 0;
}
Compilation:
nvcc -o tutorial10 tutorial10.cu -lcublas -lcufft
Key points:
Link both libraries: -lcublas -lcufft
Both libraries share the same CUDA stream
Operations execute sequentially on the stream (transfer → cuBLAS → cuFFT → transfer)
No custom kernels needed
cuFFT batched operations
Similar to cuBLAS, cuFFT provides batched operations for processing multiple FFTs efficiently:
| Function | Description |
|---|---|
| cufftPlanMany | Creates a plan for batched FFT operations |
| cufftExecC2C | Complex-to-complex FFT (forward or inverse) |
| cufftExecR2C | Real-to-complex FFT (forward) |
| cufftExecC2R | Complex-to-real FFT (inverse) |
| cufftSetStream | Associates FFT plan with CUDA stream |
Plan parameters:
rank: Dimensionality (1D, 2D, or 3D)
n[]: Size of each dimension
batch: Number of FFTs to perform
CUFFT_C2C: Transform type (complex-to-complex, real-to-complex, etc.)
Can I call cuBLAS/cuFFT from within custom kernels?
Short answer: No, you cannot call cuBLAS or cuFFT functions from inside your own __global__ kernel.
Why not:
cuBLAS and cuFFT functions are host functions (run on CPU) that launch their own GPU kernels internally. They cannot be called from device code (GPU kernels).
What happens between kernel launches:
Between separate kernel launches:
Kernel 1 finishes → writes results to global memory (VRAM)
GPU waits for all threads to complete
Kernel 2 launches → reads data from global memory (VRAM)
Is the VRAM write→read a bottleneck?
Yes and no, depending on the operation.
The concern is real: Writing to VRAM and reading back from it takes time. For a \(1024 \times 1024\) complex matrix (8 MB):
Write time: ~0.01 ms (at ~800 GB/s VRAM bandwidth)
Read time: ~0.01 ms
Total memory traffic: ~0.02 ms
Here’s the key insight: Fusion reduces memory traffic
Separate kernels (Tutorial 10):
─────────────────────────────────────────────────────────────
Kernel 1 (cuBLAS): Read input (8 MB) → Compute → Write output (8 MB)
Kernel 2 (cuFFT): Read input (8 MB) → Compute → Write output (8 MB)
Total VRAM traffic: 32 MB (4 transfers)
Memory time: ~0.04 ms (32 MB at 800 GB/s)
Hypothetical fused kernel:
─────────────────────────────────────────────────────────────
Fused kernel: Read input (8 MB) → MatMul → FFT → Write output (8 MB)
Total VRAM traffic: 16 MB (2 transfers)
Memory time: ~0.02 ms (16 MB at 800 GB/s)
Savings: 50% memory traffic, ~0.02 ms saved
The ptychography case is interesting: You have 100,000 images where each image undergoes multiple operations:
Matrix multiplication (probe update)
Forward FFT (propagate to detector)
Apply constraints (amplitude/phase corrections)
Inverse FFT (back to real space)
Another matrix multiplication (object update)
Should you fuse these operations?
For ptychography with \(1024 \times 1024\) images, the library approach (Tutorial 10) is still optimal because:
Each operation is compute-intensive (FFT: ~0.5 ms, matrix mult: ~0.1 ms)
Memory traffic between kernels (~0.02 ms) is only 3% of total time
cuBLAS and cuFFT are 5-10× faster than custom implementations
Processing 100,000 images takes minutes either way
When fusion WOULD matter:
Small images: \(256 \times 256\) or smaller where computation is fast
Simple operations: Element-wise operations (scaling, phase shifts) rather than FFT/matrix mult
Tight iteration loops: Iterative refinement with 10+ operations per image per iteration
Memory-bound regime: When GPU is waiting on memory rather than computing
Important note about ptychography: many ptychography operations are memory-bound. The constraint application steps (amplitude corrections, phase updates, element-wise multiplications) are typically very fast computations (~0.01 ms) but require reading and writing full images. For these operations, fusion can help significantly.
Practical recommendation for ptychography reconstruction:
The pipeline has two types of operations:
Compute-intensive (use libraries, don’t fuse):
Matrix multiplication (probe/object updates): cuBLAS
Forward/inverse FFT: cuFFT
These dominate computation time (~0.6 ms total)
Memory traffic between them is negligible (~3% of time)
Memory-bound (consider fusion):
Amplitude constraint application
Phase corrections
Element-wise multiplications
These are fast (~0.01 ms each) but happen many times
Fusing these into a single kernel reduces intermediate VRAM writes
Optimal approach:
for each iteration:
// Use library (no fusion needed, computation dominates)
cuBLAS: probe update // ~0.1 ms
cuFFT: forward transform // ~0.5 ms
// Fuse memory-bound operations (saves VRAM traffic)
custom kernel: {
apply amplitude constraint
phase correction
element-wise operations
} // ~0.01 ms total
cuFFT: inverse transform // ~0.5 ms
cuBLAS: object update // ~0.1 ms
This hybrid approach gives you the best of both worlds: optimized libraries for heavy computation, fused kernels for memory-bound operations.
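As a sketch of what such a fused memory-bound kernel could look like, the kernel below combines an amplitude (modulus) constraint with a scaling step in a single pass over the data; the kernel name, the specific constraint, and the parameters are illustrative assumptions rather than a prescribed ptychography implementation:
#include <cufft.h>
#include <math.h>

// One pass over the wavefield: enforce the measured amplitude while keeping
// the current phase, then apply a scale factor, all without writing
// intermediate results back to global memory between steps.
__global__ void fusedConstraintKernel(cufftComplex* field,
                                      const float* measuredAmplitude,
                                      float scale,
                                      int numElements)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numElements) return;

    cufftComplex v = field[idx];
    float mag = sqrtf(v.x * v.x + v.y * v.y);

    if (mag > 1e-12f) {
        // Replace the magnitude with the measured one, keep the phase
        float factor = measuredAmplitude[idx] / mag;
        v.x *= factor;
        v.y *= factor;
    } else {
        // Degenerate pixel: take the measured amplitude with zero phase
        v.x = measuredAmplitude[idx];
        v.y = 0.0f;
    }

    // Additional element-wise step folded into the same kernel
    field[idx].x = v.x * scale;
    field[idx].y = v.y * scale;
}
Launched with a 1D grid covering all pixels (for example, 256 threads per block), one such kernel replaces what would otherwise be two or three separate element-wise kernels, each with its own full read and write of the image.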
When the savings DON’T matter (like Tutorial 10):
For our \(1024 \times 1024\) matrices:
cuBLAS computation time: ~0.1 ms (memory traffic is ~20% of total time)
cuFFT computation time: ~0.5 ms (FFT is compute-intensive)
Memory transfer time: ~0.02 ms (only ~3% of total time)
The computation dominates, so the intermediate VRAM write→read is negligible compared to the actual cuBLAS and cuFFT work.
Bottom line: For large, compute-intensive operations like \(1024 \times 1024\) matrix multiplication and FFT, the intermediate VRAM traffic between kernels is a small fraction of total time. The benefit of using highly optimized libraries far outweighs the cost of the extra memory traffic.
How can I fuse kernels?
If you need to combine operations in a single kernel launch, you have two options:
Write custom fused kernel: Implement both operations yourself in one kernel
__global__ void fusedMatMulAndFFT(float* input, float* output, int N) {
// Your custom matrix multiplication code
// + Your custom FFT code
// Both in one kernel - no intermediate global memory writes
}
Pros: No intermediate memory traffic, single launch overhead
Cons: Much harder to implement, lose cuBLAS/cuFFT optimizations
Use device-side libraries (limited availability)
CUDA provides some device-callable math libraries (like curand for random numbers), but cuBLAS and cuFFT are not available as device functions.
For Tutorial 10’s use case:
The sequential approach (separate cuBLAS and cuFFT calls) is actually fine because:
Data stays in GPU memory between calls (no CPU←→GPU transfer)
Kernel launch overhead is microseconds (negligible compared to computation)
You get highly optimized implementations from both libraries
The stream ensures operations execute in order without CPU intervention
When to worry about kernel launch overhead:
Processing very small matrices (< 64×64) where launch overhead dominates
Launching thousands of tiny kernels in a loop
Operations that need custom fusion (e.g., applying constraints between matrix ops)
For our 1024×1024 images, cuBLAS + cuFFT as separate calls is the right choice!
Real-world application: Ptychography
This pattern directly applies to ptychography reconstruction:
Load diffraction pattern: Read from detector (1024×1024 complex array)
Apply probe update: Matrix multiplication with probe function (cuBLAS)
Propagate to detector: 2D FFT to frequency domain (cuFFT)
Apply constraints: Can add custom kernel if needed
Inverse FFT: Transform back to real space (cuFFT)
Update object: Another matrix operation (cuBLAS)
Processing 100,000 diffraction patterns with this pipeline achieves high-resolution image reconstruction in minutes instead of hours.
Multi-library pipeline benefits
Why combine libraries:
No reinventing the wheel: Use highly optimized implementations
Composability: Libraries work together through CUDA streams
Maintainability: Updates benefit all operations automatically
Performance: Each library is tuned for its specific operation
Common library combinations:
cuBLAS + cuFFT: Signal processing, ptychography, convolution
cuBLAS + cuDNN: Deep learning (matrix ops + neural network layers)
cuFFT + Thrust: FFT processing with sorting/reduction
cuSolver + cuBLAS: Linear system solving with matrix operations
Summary
Tutorial 10 demonstrates production-scale multi-operation pipelines:
Tutorials 1-6: GPU fundamentals
Tutorial 7: Large-scale processing with custom kernels
Tutorials 8-9: Single library optimization (cuBLAS)
Tutorial 10: Multi-library pipeline (cuBLAS + cuFFT)
Key takeaways:
Library composition: Multiple CUDA libraries work together via streams
No custom kernels: Standard operations use optimized library implementations
Real-world pattern: Typical for scientific computing (matrix ops + FFT)
Scalability: Same pattern works for any number of images with chunking
Complete workflow:
Learn fundamentals with custom kernels (understand how GPUs work)
Replace custom code with libraries (cuBLAS, cuFFT, cuDNN)
Combine libraries in pipelines (handle complex real-world applications)
Mix custom kernels with libraries (Tutorial 11 shows how)
Tutorial 11: Hybrid pipeline (custom kernels + CUDA libraries)
Problem
Real-world applications often need a mix of custom kernels and optimized libraries. For example, in signal processing:
Custom operations: Domain-specific transformations (thresholding, masking, custom filters)
Standard operations: FFT transforms, matrix multiplications
Writing everything as custom kernels means reinventing the wheel for standard operations. Using only libraries means you cannot implement custom domain logic. The solution: combine custom CUDA kernels with CUDA libraries.
This tutorial shows a simple image filtering pipeline that:
Uses custom kernel to apply a threshold mask
Uses cuFFT for frequency domain transformation
Uses custom kernel to apply frequency domain filtering
Uses cuFFT for inverse transform back to spatial domain
Solution overview
Image filtering pipeline:
1. Apply threshold mask: Mask pixels below threshold
→ Custom kernel (domain-specific logic)
2. Transform to frequency domain: spectrum = FFT(masked_image)
→ cuFFT (optimized library)
3. Apply frequency filter: Filter high/low frequencies
→ Custom kernel (element-wise operations)
4. Transform back: filtered_image = IFFT(spectrum)
→ cuFFT (inverse transform)
Why this hybrid approach:
Custom kernels: Fast for simple element-wise operations and custom logic
cuFFT: Highly optimized FFT implementation using specialized algorithms
Best of both worlds: Speed of libraries + flexibility of custom code
Example: For 1024×1024 image, ~2 ms total (vs ~8 ms if implementing FFT manually)
Step 1: Define custom kernels
Custom kernel 1: Apply threshold mask
// Set pixels below threshold to zero
__global__ void applyThreshold(
const float* input,
float* output,
float threshold,
int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
// Keep pixel if above threshold, otherwise set to zero
output[idx] = (input[idx] >= threshold) ? input[idx] : 0.0f;
}
Custom kernel 2: Apply frequency filter
// Apply low-pass filter in frequency domain
// Attenuate high frequencies (reduce noise)
__global__ void frequencyFilter(
cufftComplex* spectrum,
int width,
int height,
float cutoff_frequency)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= width || y >= height) return;
// Calculate distance from center (DC component)
int center_x = width / 2;
int center_y = height / 2;
float dx = (float)(x - center_x) / width;
float dy = (float)(y - center_y) / height;
float distance = sqrtf(dx * dx + dy * dy);
// Low-pass filter: Attenuate based on distance
float attenuation = 1.0f / (1.0f + powf(distance / cutoff_frequency, 4.0f));
int idx = y * width + x;
spectrum[idx].x *= attenuation;
spectrum[idx].y *= attenuation;
}
Custom kernel 3: Normalize after IFFT
// cuFFT doesn't normalize, so we need to divide by N
__global__ void normalize(
cufftComplex* data,
int N,
float scale)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
data[idx].x *= scale;
data[idx].y *= scale;
}
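Custom kernel 4: Convert real data to complex
cuFFT's C2C transform expects cufftComplex input, so before the forward FFT the masked real image must be expanded into complex values (real part = pixel value, imaginary part = 0). A plain memcpy cannot do this, because the real and imaginary parts are interleaved in memory; a minimal conversion kernel:
// Expand a real-valued image into cufftComplex values (imaginary part = 0)
__global__ void realToComplex(
    const float* input,
    cufftComplex* output,
    int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;
    output[idx].x = input[idx];  // Real part = pixel value
    output[idx].y = 0.0f;        // Imaginary part = 0
}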
Step 2: Set up cuFFT library
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>
int main() {
// ═══════════════════════════════════════════════════════════
// Configuration
// ═══════════════════════════════════════════════════════════
const int WIDTH = 1024;
const int HEIGHT = 1024;
const int TOTAL_SIZE = WIDTH * HEIGHT;
const float THRESHOLD = 0.3f; // Threshold for masking
const float CUTOFF_FREQ = 0.1f; // Low-pass filter cutoff
// ═══════════════════════════════════════════════════════════
// Initialize cuFFT
// ═══════════════════════════════════════════════════════════
// Create 2D FFT plan for complex-to-complex transform
cufftHandle plan_forward, plan_inverse;
cufftPlan2d(&plan_forward, HEIGHT, WIDTH, CUFFT_C2C);
cufftPlan2d(&plan_inverse, HEIGHT, WIDTH, CUFFT_C2C);
// Create CUDA stream for ordering operations
cudaStream_t stream;
cudaStreamCreate(&stream);
// Associate FFT plans with stream
cufftSetStream(plan_forward, stream);
cufftSetStream(plan_inverse, stream);
printf("Image filtering pipeline\n");
printf("Image size: %d×%d\n", WIDTH, HEIGHT);
printf("Threshold: %.2f\n", THRESHOLD);
printf("Cutoff frequency: %.2f\n", CUTOFF_FREQ);
// ═══════════════════════════════════════════════════════════
// Allocate GPU memory
// ═══════════════════════════════════════════════════════════
float *d_input_real, *d_output_real;
cufftComplex *d_complex_input, *d_spectrum, *d_filtered;
cudaMalloc(&d_input_real, TOTAL_SIZE * sizeof(float));
cudaMalloc(&d_output_real, TOTAL_SIZE * sizeof(float));
cudaMalloc(&d_complex_input, TOTAL_SIZE * sizeof(cufftComplex));
cudaMalloc(&d_spectrum, TOTAL_SIZE * sizeof(cufftComplex));
cudaMalloc(&d_filtered, TOTAL_SIZE * sizeof(cufftComplex));
// Load input image (for demo, we'll use synthetic data)
// In real application: cudaMemcpy from host
// ... load image data ...
Step 3: Implement the hybrid pipeline
// ═══════════════════════════════════════════════════════════
// Set up kernel launch configuration
// ═══════════════════════════════════════════════════════════
// For 1D kernels (threshold, normalize)
int threadsPerBlock_1D = 256;
int blocks_1D = (TOTAL_SIZE + threadsPerBlock_1D - 1) / threadsPerBlock_1D;
// For 2D kernel (frequency filter)
dim3 threadsPerBlock_2D(16, 16);
dim3 blocks_2D((WIDTH + 15) / 16, (HEIGHT + 15) / 16);
// ═══════════════════════════════════════════════════════════
// Pipeline execution (Custom + cuFFT + Custom + cuFFT)
// ═══════════════════════════════════════════════════════════
printf("\nProcessing pipeline:\n");
// ───────────────────────────────────────────────────────────
// STEP 1: Apply threshold (custom kernel)
// ───────────────────────────────────────────────────────────
printf(" [1/4] Applying threshold mask...\n");
applyThreshold<<<blocks_1D, threadsPerBlock_1D, 0, stream>>>(
d_input_real,
d_output_real,
THRESHOLD,
TOTAL_SIZE
);
// Convert to complex for FFT (real part = masked data, imag = 0)
// A plain memcpy is not enough here: cufftComplex interleaves real and
// imaginary parts, so we use the realToComplex kernel defined in Step 1
realToComplex<<<blocks_1D, threadsPerBlock_1D, 0, stream>>>(
d_output_real, d_complex_input, TOTAL_SIZE);
// ───────────────────────────────────────────────────────────
// STEP 2: Forward FFT (cuFFT library)
// ───────────────────────────────────────────────────────────
printf(" [2/4] Computing FFT...\n");
cufftExecC2C(plan_forward, d_complex_input, d_spectrum, CUFFT_FORWARD);
// ───────────────────────────────────────────────────────────
// STEP 3: Apply frequency filter (custom kernel)
// ───────────────────────────────────────────────────────────
printf(" [3/4] Applying frequency domain filter...\n");
frequencyFilter<<<blocks_2D, threadsPerBlock_2D, 0, stream>>>(
d_spectrum,
WIDTH,
HEIGHT,
CUTOFF_FREQ
);
// ───────────────────────────────────────────────────────────
// STEP 4: Inverse FFT (cuFFT library)
// ───────────────────────────────────────────────────────────
printf(" [4/4] Computing inverse FFT...\n");
cufftExecC2C(plan_inverse, d_spectrum, d_filtered, CUFFT_INVERSE);
// Normalize (cuFFT doesn't normalize by default)
float scale = 1.0f / (WIDTH * HEIGHT);
normalize<<<blocks_1D, threadsPerBlock_1D, 0, stream>>>(
d_filtered,
TOTAL_SIZE,
scale
);
// Wait for all operations to complete
cudaStreamSynchronize(stream);
printf("\nPipeline complete!\n");
// Copy result back to host if needed
// cudaMemcpy(host_output, d_filtered, ...);
// ═══════════════════════════════════════════════════════════
// Cleanup
// ═══════════════════════════════════════════════════════════
cufftDestroy(plan_forward);
cufftDestroy(plan_inverse);
cudaStreamDestroy(stream);
cudaFree(d_input_real);
cudaFree(d_output_real);
cudaFree(d_complex_input);
cudaFree(d_spectrum);
cudaFree(d_filtered);
return 0;
}
Compilation:
nvcc -o tutorial11 tutorial11.cu -lcufft
Key points about hybrid pipelines
1. All operations share the same stream
// Set library to use our stream
cufftSetStream(plan_forward, stream);
cufftSetStream(plan_inverse, stream);
// Custom kernels also use the same stream
myKernel<<<blocks, threads, 0, stream>>>(...);
This ensures operations execute in order without extra synchronization between custom kernels and library calls.
2. Data stays on GPU throughout
GPU Memory (no CPU transfers during processing):
────────────────────────────────────────────────────────
Custom kernel → cuFFT → Custom kernel → cuFFT → Custom kernel
↓ ↓ ↓ ↓ ↓
All operations read/write GPU VRAM (fast, ~800 GB/s)
No CPU ←→ GPU transfers (would be slow, ~16 GB/s PCIe)
Everything happens on GPU from input to output.
3. Custom kernels for simple operations, libraries for complex ones
| Operation | Complexity | Implementation |
|---|---|---|
| Thresholding | Simple | Custom kernel |
| Frequency filtering | Simple | Custom kernel |
| Normalization | Simple | Custom kernel |
| 2D FFT | Complex | cuFFT library |
4. Flexibility of custom kernels
Custom kernels let you implement domain-specific logic:
// Your custom logic (not available in any library)
if (input[idx] >= threshold) {
output[idx] = input[idx]; // Keep value
} else {
output[idx] = 0.0f; // Mask out
}
Libraries provide optimized standard operations, custom kernels provide flexibility.
Performance comparison
For 1024×1024 images:
| Approach | Time | Notes |
|---|---|---|
| CPU (NumPy + SciPy) | ~50 ms | Single-threaded |
| Custom kernels only | ~8 ms | Manual FFT implementation (slow) |
| Libraries only | N/A | Cannot implement custom logic |
| Hybrid (Tutorial 11) | ~2 ms | Best of both worlds |
Breakdown for hybrid approach:
Threshold kernel: ~0.1 ms
Forward FFT (cuFFT): ~0.8 ms
Frequency filter kernel: ~0.2 ms
Inverse FFT (cuFFT): ~0.8 ms
Normalize kernel: ~0.1 ms
Total: ~2 ms (25× faster than CPU)
Why the hybrid approach works
Principle 1: Use the right tool for each operation
Custom kernels: Quick to write for simple logic (thresholding, masking, element-wise ops)
cuFFT: Years of optimization, specialized algorithms and hardware acceleration
Writing FFT from scratch would take weeks and still be slower than cuFFT.
Principle 2: Libraries handle complex algorithms
Manual FFT implementation:
- Cooley-Tukey algorithm
- Bit-reversal permutation
- Twiddle factor computation
- Memory access optimization
- Edge case handling
→ Weeks of work, still slower than cuFFT
cuFFT:
- One function call
- Optimized by NVIDIA engineers
- Hardware-specific tuning
→ ~0.8 ms for 1024×1024
Principle 3: Minimize synchronization
// Bad: Synchronize after every operation
myKernel1<<<...>>>();
cudaDeviceSynchronize(); // ❌ Unnecessary wait
myKernel2<<<...>>>();
cudaDeviceSynchronize(); // ❌ Unnecessary wait
// Good: Let stream handle ordering
myKernel1<<<..., stream>>>();
myKernel2<<<..., stream>>>(); // ✅ Automatically waits for kernel1
cufftExec(...); // ✅ Waits for kernel2
cudaStreamSynchronize(stream); // ✅ Only sync at end
Real-world applications
This hybrid pattern appears in many domains:
Image processing:
Custom kernel: Preprocessing (crop, normalize)
cuFFT: Frequency analysis
Custom kernel: Apply filters
cuFFT: Inverse transform
Custom kernel: Postprocessing
Scientific simulation:
cuBLAS: Matrix operations
Custom kernel: Apply boundary conditions
cuFFT: Transform to frequency domain
Custom kernel: Apply physics constraints
cuFFT: Transform back
Audio processing:
Custom kernel: Window function
cuFFT: Spectrogram computation
Custom kernel: Noise reduction
cuFFT: Inverse transform
Custom kernel: Normalize output
Summary
Tutorial 11 shows how to combine custom kernels with CUDA libraries:
Tutorials 1-6: GPU fundamentals (memory, parallelism, optimization)
Tutorial 7: Large-scale processing (chunking, streams)
Tutorials 8-9: Library usage (cuBLAS for matrix operations)
Tutorial 10: Multi-library pipelines (cuBLAS + cuFFT)
Tutorial 11: Hybrid approach (custom kernels + libraries)
Key takeaways:
Mix custom and library code: Libraries for complex algorithms, custom kernels for simple logic
Share CUDA streams: Ensures correct execution order between custom and library operations
Keep data on GPU: All operations run on GPU, no CPU transfers
Don’t reinvent the wheel: Use cuFFT instead of writing FFT from scratch
Custom kernels add flexibility: Implement domain-specific operations not available in libraries
When to use hybrid approach:
Need standard operations (FFT, matrix multiply) plus custom logic
Want speed of optimized libraries plus flexibility of custom code
Processing pipeline requires both simple and complex operations
Production code where both performance and maintainability matter
This completes the CUDA tutorial series from fundamentals to hybrid pipelines combining custom kernels with optimized libraries.
Next steps
For GPU optimization and profiling: See GPU optimization to learn how to profile your code with NSight, identify bottlenecks, and apply kernel fusion for memory-bound operations
For conceptual GPU architecture: See CUDA fundamentals for fundamentals of GPU hardware and programming model
To-do
[ ] Benchmark for GPU
[ ] Random matrix generation, 1024×1024 px, 100,000 of them
[ ] Mixed-precision example with FP16/cuBLAS
[ ] Multi-GPU