CUDA tutorials

This page contains progressive tutorials for learning CUDA C++ programming through practical examples. Each tutorial builds on the previous one, from basic array addition to production-ready pipelined processing with CUDA streams.

For fundamental CUDA concepts, architecture, and memory model, see CUDA fundamentals.

Tutorial progression

Note

These tutorials progress from simple to realistic production scenarios:

  1. Memory basics (0/6): 1D array addition with memory management fundamentals

  2. Single block (1/6): Basic 2D matrix addition with 16×16 matrix using one block

  3. Multiple blocks (2/6): Scale to 256×256 matrix using 256 blocks

  4. Batch processing (3/6): Process 1,000 matrices simultaneously (65.5M elements)

  5. Chunked processing (4/6): Handle 100,000 matrices with limited VRAM

  6. Pipelined streams (5/6): Overlap data transfer and computation for maximum performance

  7. Shared memory (6/6): Matrix multiplication with tiling for 10-20× speedup

Tutorial 0: Memory management basics

Understanding how to allocate and transfer data between CPU (host) and GPU (device) is fundamental to CUDA programming. This tutorial shows a complete example of 1D array addition.

Memory workflow visualization:

Computer with CPU and GPU
┌──────────────────────────────────────────────────────┐
│ CPU Side                    GPU Side                 │
│                                                      │
│ RAM (System Memory)         VRAM (GPU Memory)        │
│ ┌─────────────────┐         ┌──────────────────┐     │
│ │ h_A[1.0, 2.0]   │────────▶│ d_A[1.0, 2.0]    │     │
│ │ h_B[3.0, 4.0]   │ memcpy  │ d_B[3.0, 4.0]    │     │
│ └─────────────────┘         └──────────────────┘     │
│                                      │               │
│                              Kernel accesses         │
│                              VRAM using d_A, d_B     │
│                                      ▼               │
│                             ┌──────────────────┐     │
│                             │ 1000 GPU threads │     │
│                             │ compute in VRAM  │     │
│                             └──────────────────┘     │
└──────────────────────────────────────────────────────┘

Workflow:
1. Allocate h_A in RAM (CPU memory)
2. Allocate d_A in VRAM (GPU memory) using cudaMalloc
3. Copy h_A → d_A using cudaMemcpy (RAM to VRAM, happens once)
4. Launch kernel with d_A pointer (just passing an address, not copying data)
5. Kernel threads access d_A directly in VRAM (fast, no copying)

Here is a pedagogical example of adding two arrays:

#include <cuda_runtime.h>
#include <cstdlib>

// Problem: We want 1000 GPU threads to add two arrays
// Arrays must be in VRAM, and threads need to know where to find them

__global__ void addArrays(float* A, float* B, float* C, int N)
{
    int i = threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];  // Accesses data in VRAM
    }
}

int main()
{
    int N = 1000;
    size_t size = N * sizeof(float);  // 1000 × 4 bytes = 4000 bytes

    // Step 1: Allocate and initialize arrays in CPU RAM
    float *h_A = (float*)malloc(size);  // h_ = host (CPU)
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // malloc(size) behind the scenes:
    // 1. Find 4000 bytes of free RAM
    // 2. Return starting address of the allocated memory (e.g., 0x5000)
    // 3. Store that address in h_A

    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Step 2: Allocate arrays in GPU VRAM
    float *d_A, *d_B, *d_C;  // d_ = device (GPU)

    cudaMalloc(&d_A, size);  // Behind the scenes:
    // 1. Find 4000 bytes of free VRAM
    // 2. Get starting address of that VRAM block (e.g., 0x7f8a4b000000)
    // 3. Write that starting address into d_A (on CPU)
    // Result: d_A points to first element of the array in VRAM
    //
    // Summary:
    // d_A is a pointer variable defined as `float* d_A`
    // => d_A    = VRAM address (where array starts in GPU memory)
    // => *d_A   = data at that VRAM address (first element value)
    // => &d_A   = CPU RAM address (where d_A variable itself lives)

    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Step 3: Copy data from RAM to VRAM (once)
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Step 4: Launch kernel with VRAM pointers
    addArrays<<<1, N>>>(d_A, d_B, d_C, N);
    // Threads access data directly in VRAM

    // Step 5: Copy result back to RAM
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Step 6: Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}
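
The listing above omits error handling for brevity. In real code you would check the cudaError_t returned by each CUDA call; here is a minimal sketch of one common pattern (the CUDA_CHECK name is our own convention, not part of the CUDA API):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Minimal error-checking helper: aborts with a readable message on failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_A, size));
//   CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));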

Here is a visual of how d_A, &d_A, and *d_A relate:

CPU Memory (RAM)              GPU Memory (VRAM)
┌────────────────┐            ┌──────────────────┐
│ Address:0x1000 │            │ Address:0x7f8a...│
│ ┌────────────┐ │            │ ┌──────────────┐ │
│ │ d_A        │ │            │ │ [1.5, 2.3...]│ │
│ │ = 0x7f8a...│─┼─points to─▶│ └──────────────┘ │
│ └────────────┘ │            │        ↑         │
│       ↑        │            │     *d_A = 1.5   │
│    &d_A        │            └──────────────────┘
└────────────────┘

&d_A = 0x1000 (address of pointer on CPU)
d_A  = 0x7f8a... (VRAM address stored in d_A)
*d_A = 1.5 (data in VRAM)

Why pass pointers to kernels?

The problem is that GPUs have separate memory (VRAM) from CPUs (RAM). Your data must be in GPU memory before a kernel can process it. But how do you tell the kernel where to find that data in VRAM? If you tried to pass entire arrays by value, the program would copy gigabytes every kernel launch, which defeats the purpose of having data in fast VRAM.

The solution: First copy data from CPU RAM to GPU VRAM once using cudaMemcpy. Then pass pointers (GPU memory addresses in VRAM) to kernels. A pointer is just a small number (8 bytes) that tells threads where to find the array in VRAM. Thousands of threads can then work on the same dataset in VRAM without any copying. The naming convention uses d_ prefix for device (GPU/VRAM) pointers and h_ prefix for host (CPU/RAM) pointers.

Why doesn’t the kernel just know the VRAM addresses automatically?

Kernels are reusable functions, not hardcoded to specific data.

If addresses were hardcoded, you could only use each kernel once. By passing pointers as parameters, one kernel can process unlimited different datasets - same reason C++ functions take parameters instead of hardcoding values.

// Same kernel, different data!
__global__ void addArrays(float* A, float* B, float* C, int N) {
    int i = threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];  // Works on ANY arrays you pass
    }
}

int main() {
    // (assume N and size are defined as in the earlier example)
    // Scenario 1: Physics simulation
    float *d_physics_A, *d_physics_B, *d_physics_C;
    cudaMalloc(&d_physics_A, size);
    cudaMalloc(&d_physics_B, size);
    cudaMalloc(&d_physics_C, size);
    addArrays<<<1, N>>>(d_physics_A, d_physics_B, d_physics_C, N);

    // Scenario 2: Same kernel, different data for ML
    float *d_ml_A, *d_ml_B, *d_ml_C;
    cudaMalloc(&d_ml_A, size);
    cudaMalloc(&d_ml_B, size);
    cudaMalloc(&d_ml_C, size);
    addArrays<<<1, N>>>(d_ml_A, d_ml_B, d_ml_C, N);
    // ▲ Same kernel code, different VRAM addresses!

    // Scenario 3: Use outputs from previous kernels as inputs
    float *d_final;
    cudaMalloc(&d_final, size);
    addArrays<<<1, N>>>(d_physics_C, d_ml_C, d_final, N);
}

Tutorial 1: 2D matrix addition with single block (16×16)

Now we progress from 1D arrays to 2D matrices. We organize threads in a 2D layout (16×16) within a single block, where each thread computes one matrix element.

Here is the complete working example:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void matAdd(float* A, float* B, float* C, int N)
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    int index = i * N + j;

    if (i < N && j < N) {
        C[index] = A[index] + B[index];
    }
}

int main()
{
    int N = 16;
    size_t size = N * N * sizeof(float);

    // Host allocation and initialization
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    for (int i = 0; i < N * N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Device allocation
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy to GPU
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(1, 1);
    matAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result back
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

Thread organization (1 block with 16×16 threads):

Single block with 16×16 = 256 threads
┌────────────────────────────────────────┐
│ Thread  Thread  Thread  ...  Thread    │
│ (0,0)   (1,0)   (2,0)        (15,0)    │
│                                        │
│ Thread  Thread  Thread  ...  Thread    │
│ (0,1)   (1,1)   (2,1)        (15,1)    │
│                                        │
│   ...     ...     ...    ...   ...     │
│                                        │
│ Thread  Thread  Thread  ...  Thread    │
│ (0,15)  (1,15)  (2,15)       (15,15)   │
└────────────────────────────────────────┘

Each thread (i,j) computes: C[i][j] = A[i][j] + B[i][j]

Key insight: Thread layout matches data layout!
- Thread (0,0) handles matrix element [0][0]
- Thread (5,7) handles matrix element [5][7]
- Thread (15,15) handles matrix element [15][15]

The kernel function:

__global__ void matAdd(float* A, float* B, float* C, int N)
{
    // Get thread position in 2D block
    int i = threadIdx.x;  // row index (0-15)
    int j = threadIdx.y;  // column index (0-15)

    // Convert 2D (i,j) to 1D memory index
    int index = i * N + j;
    // Why? Memory is 1D, but we think in 2D
    // Example: Element [2][3] is at position 2×16 + 3 = 35

    // Bounds check (important for general code)
    if (i < N && j < N) {
        C[index] = A[index] + B[index];
    }
}

What each thread does:

Thread (0,0): index = 0×16 + 0 = 0   → C[0]  = A[0]  + B[0]   (element [0][0])
Thread (0,1): index = 0×16 + 1 = 1   → C[1]  = A[1]  + B[1]   (element [0][1])
Thread (1,0): index = 1×16 + 0 = 16  → C[16] = A[16] + B[16]  (element [1][0])
Thread (2,3): index = 2×16 + 3 = 35  → C[35] = A[35] + B[35]  (element [2][3])

All 256 threads execute this same code in parallel!
Each thread just has different threadIdx values.

Host code setup:

int main()
{
    // Define problem size
    int N = 16;  // 16×16 matrix
    size_t size = N * N * sizeof(float);  // 16×16×4 = 1,024 bytes

    // Allocate host (CPU) memory
    float *h_A = (float*)malloc(size);  // h_ prefix = host memory
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize matrices with sample values
    for (int i = 0; i < N * N; i++) {
        h_A[i] = 1.0f;  // Matrix A filled with 1.0
        h_B[i] = 2.0f;  // Matrix B filled with 2.0
    }
    // Expected result: C should be filled with 3.0

Memory visualization after initialization:

CPU RAM:
h_A: [1.0, 1.0, 1.0, ..., 1.0]  (256 elements × 4 bytes = 1,024 bytes)
h_B: [2.0, 2.0, 2.0, ..., 2.0]  (256 elements × 4 bytes = 1,024 bytes)
h_C: [?, ?, ?, ..., ?]           (uninitialized, will hold results)

GPU memory allocation:

// Declare device (GPU) pointers
float *d_A, *d_B, *d_C;  // d_ prefix = device memory

// Allocate GPU memory
cudaMalloc(&d_A, size);  // Allocate 1,024 bytes in VRAM for A
cudaMalloc(&d_B, size);  // Allocate 1,024 bytes in VRAM for B
cudaMalloc(&d_C, size);  // Allocate 1,024 bytes in VRAM for C

What happens with cudaMalloc:

CPU side:                          GPU side:
┌─────────────────┐                ┌──────────────────┐
│ d_A (pointer)   │───points to───▶│ VRAM block       │
│ value: 0x7f8a...│                │ [empty, 1KB]     │
└─────────────────┘                └──────────────────┘

cudaMalloc(&d_A, 1024):
1. Find 1,024 free bytes in GPU VRAM
2. Return VRAM address (e.g., 0x7f8a4b000000)
3. Store that address in CPU variable d_A

Result: d_A contains a VRAM address (lives on CPU, points to GPU)

Data transfer to GPU:

// Copy data from CPU RAM to GPU VRAM
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

Data flow visualization:

Before cudaMemcpy:
CPU RAM                     GPU VRAM
h_A: [1.0, 1.0, ...]        d_A: [?, ?, ...]  (empty)
h_B: [2.0, 2.0, ...]        d_B: [?, ?, ...]  (empty)

After cudaMemcpy:
CPU RAM                     GPU VRAM
h_A: [1.0, 1.0, ...]        d_A: [1.0, 1.0, ...]  (copied!)
h_B: [2.0, 2.0, ...]        d_B: [2.0, 2.0, ...]  (copied!)

This is a synchronous operation: CPU waits until copy completes
Time taken: ~1-10 microseconds for 1 KB over PCIe

Launching the kernel:

// Configure launch parameters
dim3 threadsPerBlock(16, 16);  // 16×16 = 256 threads per block
dim3 numBlocks(1, 1);           // Just 1 block total

// Launch kernel
matAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

Understanding dim3:

dim3 is a CUDA type for 3D dimensions (x, y, z)

dim3 threadsPerBlock(16, 16);
- Creates 16×16 = 256 threads in a 2D grid within one block
- Equivalent to: dim3 threadsPerBlock(16, 16, 1);
- Each thread gets unique (threadIdx.x, threadIdx.y)

dim3 numBlocks(1, 1);
- Creates just 1 block
- For now, we only need 1 block (256 threads enough for 16×16 matrix)

What happens during kernel launch:

GPU Execution:
┌────────────────────────────────────────┐
│ Block (0,0) with 256 threads           │
│                                        │
│ Thread(0,0)  Thread(1,0)  ...         │
│    ↓            ↓           ↓          │
│ C[0]=A[0]+B[0]  C[1]=...   ...        │
│                                        │
│ All execute simultaneously!            │
│ Time: ~1 microsecond for 256 adds     │
└────────────────────────────────────────┘

CPU: Continues immediately (kernel launch is asynchronous)
GPU: Schedules 256 threads, executes addition, writes to d_C

Getting results back:

// Copy result from GPU VRAM to CPU RAM
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

Data flow:

Before this line:
GPU VRAM                    CPU RAM
d_C: [3.0, 3.0, ...]        h_C: [?, ?, ...]

After cudaMemcpy:
GPU VRAM                    CPU RAM
d_C: [3.0, 3.0, ...]        h_C: [3.0, 3.0, ...]  (copied!)

This is synchronous: CPU waits for:
1. Kernel to finish computing
2. Data transfer to complete

Now h_C contains the results, we can use them on CPU!
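
A quick verification step (not part of the original listing) could look like this, since every element of C should equal 1.0 + 2.0:

// Hedged sketch: verify the result on the CPU after the copy back.
bool ok = true;
for (int i = 0; i < N * N; i++) {
    if (h_C[i] != 3.0f) { ok = false; break; }   // expected 3.0 everywhere
}
printf("matAdd result: %s\n", ok ? "PASS" : "FAIL");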

Cleanup:

    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free CPU memory
    free(h_A);
    free(h_B);
    free(h_C);
}

Why cleanup is important:

cudaFree(d_A):
- Releases 1,024 bytes of GPU VRAM
- Makes that memory available for other GPU operations
- GPU memory is limited (typically 4-24 GB)

free(h_A):
- Releases 1,024 bytes of CPU RAM
- Standard C memory management

Without cleanup: Memory leak! Resources stay allocated until program ends.

2D to 1D memory layout visualization

Consider a 2D matrix B[3][4]:

Row 0: B[0][0]  B[0][1]  B[0][2]  B[0][3]
Row 1: B[1][0]  B[1][1]  B[1][2]  B[1][3]
Row 2: B[2][0]  B[2][1]  B[2][2]  B[2][3]

Stored in 1D memory:
Index:  0   1   2   3   4   5   6   7   8   9  10  11
      ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
A[]   │0,0│0,1│0,2│0,3│1,0│1,1│1,2│1,3│2,0│2,1│2,2│2,3│
      └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
      └────Row 0─────┘└────Row 1─────┘└────Row 2─────┘

Formula: A[i * N + j] where N = 4 (width)

Example mappings:
B[0][1]  →  A[0 × 4 + 1]  =  A[1]   (row 0, column 1)
B[1][0]  →  A[1 × 4 + 0]  =  A[4]   (row 1, column 0)
B[2][3]  →  A[2 × 4 + 3]  =  A[11]  (row 2, column 3)

How it works:
i × N      →  skip i complete rows (each row is N elements)
+ j        →  move j positions within the current row
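
To sanity-check the formula on the CPU, here is a minimal host-only sketch (plain C++, no CUDA required) that reproduces the example mappings above; the flatten helper is ours:

#include <cassert>

// Row-major index: skip i full rows of width N, then move j within the row.
int flatten(int i, int j, int N) { return i * N + j; }

int main()
{
    const int N = 4;                 // matrix width from the example above
    assert(flatten(0, 1, N) == 1);   // B[0][1] -> A[1]
    assert(flatten(1, 0, N) == 4);   // B[1][0] -> A[4]
    assert(flatten(2, 3, N) == 11);  // B[2][3] -> A[11]
    return 0;
}
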
Why do you need boundary checks in this 2D matrix addition kernel?

The problem is that you must launch whole blocks, not individual threads.

Example with N=100 and block size 16×16:

Problem: Need 100×100 threads (one per matrix element)
Block size: 16×16 threads per block

Calculate blocks needed:
blocks_x = ⌈100 / 16⌉ = ⌈6.25⌉ = 7 blocks
blocks_y = ⌈100 / 16⌉ = ⌈6.25⌉ = 7 blocks

Threads launched:
total_x = 7 blocks × 16 threads = 112 threads
total_y = 7 blocks × 16 threads = 112 threads
Total = 112 × 112 = 12,544 threads

But your matrix is only 100×100 = 10,000 elements!
Extra threads = 12,544 - 10,000 = 2,544 threads doing nothing

Visual representation of the thread grid:

Matrix (100×100)              Thread grid (112×112)
┌──────────────┐              ┌──────────────┬──┐
│              │              │              │XX│ ← Extra threads
│   Valid      │              │   Valid      │XX│   (threads 100-111)
│   region     │              │   region     │XX│
│   100×100    │              │   100×100    │XX│
│              │              │              │XX│
│              │              │              │XX│
└──────────────┘              ├──────────────┼──┤
                              │XXXXXXXXXXXXXX│XX│ ← Extra row of threads
                              └──────────────┴──┘   (rows 100-111)

Without boundary check: Threads with i≥100 or j≥100 access invalid memory
With boundary check: if (i < N && j < N) prevents out-of-bounds access

Code pattern:

int i = blockIdx.x * blockDim.x + threadIdx.x;  // Could be 0-111
int j = blockIdx.y * blockDim.y + threadIdx.y;  // Could be 0-111

if (i < N && j < N)  // Only process if within 0-99 range
    C[i * N + j] = A[i * N + j] + B[i * N + j];

When you do not need boundary checks: When problem size is guaranteed to be a multiple of block size (for example, N=256 with 16×16 blocks gives exactly 16×16 blocks with no remainder).
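
In code, the usual pattern is ceiling division when computing the grid size. A short sketch for the N=100 example above:

// Grid sizing with ceiling division for N = 100 and 16×16 blocks.
int N = 100;
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,   // ⌈100/16⌉ = 7
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);  // ⌈100/16⌉ = 7
// 7 × 7 blocks × 256 threads = 12,544 threads launched;
// the if (i < N && j < N) check discards the 2,544 extra ones.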

What’s next?

We are using a single block containing 256 threads to add two 16×16 matrices. What if we want to add larger matrices, like 256×256? See the next tutorial on using multiple blocks.

Tutorial 2: 2D matrix addition with multiple blocks (256×256)

Recall that a block is a group of threads that can cooperate via shared memory, and a grid is the collection of blocks launched by a single kernel.

For larger matrices like 256×256, we divide the matrix into smaller 16×16 chunks, where each chunk is processed by one block. This means we need 16 blocks in the X direction and 16 blocks in the Y direction, giving us 16 × 16 = 256 blocks total. Each block contains 16 × 16 = 256 threads that process the elements in that chunk.

2D block organization visualization:

Grid (multiple blocks arranged in 2D)
┌─────────┬─────────┬─────────┐
│ Block   │ Block   │ Block   │
│ (0,0)   │ (1,0)   │ (2,0)   │
├─────────┼─────────┼─────────┤
│ Block   │ Block   │ Block   │
│ (0,1)   │ (1,1)   │ (2,1)   │
├─────────┼─────────┼─────────┤
│ Block   │ Block   │ Block   │
│ (0,2)   │ (1,2)   │ (2,2)   │
└─────────┴─────────┴─────────┘

Each block contains 16×16 threads:
Block (0,0) expanded:
┌──────────────────────────────┐
│ T(0,0) T(1,0) ... T(15,0)    │
│ T(0,1) T(1,1) ... T(15,1)    │
│   ...    ...   ...   ...     │
│ T(0,15) ...      T(15,15)    │
└──────────────────────────────┘

256 threads per block

Here is a sample implementation:

// Kernel definition
__global__ void MatAdd(float* d_A, float* d_B, float* d_C, int N)
{
    // Calculate global thread indices
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Boundary check (prevents out-of-bounds access when N not divisible by block size)
    if (i < N && j < N) {
        int index = i * N + j;  // Convert 2D to 1D index
        d_C[index] = d_A[index] + d_B[index];
    }
}

int main()
{
    ...
    // 2D thread organization
    dim3 threadsPerBlock(16, 16);     // x=16, y=16, z=1
    dim3 numBlocks(N/16, N/16, 1);    // Explicit z dimension
    MatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
    ...
}

Why use multiple blocks for large matrices?

GPUs achieve parallelism by distributing blocks across multiple streaming multiprocessors (SMs).

Comparison for 256×256 matrix (65,536 elements):

Option 1: One block with 1024 threads (maximum per block)
- Launch: 1 block × 1024 threads = 1,024 threads
- Problem: Need 65,536 threads (one per element)
- Result: Must run kernel 64 times sequentially or use complex indexing
- Parallelism: Only 1 SM utilized (out of 82 for RTX 3090, 108 for A100)

Why only 1 SM?
• 1 block = assigned to exactly 1 SM (blocks never split across SMs)
• Only that one SM's CUDA cores do any work
• The other 81 SMs sit completely idle
• Massive waste of GPU resources

Option 2: 256 blocks with 256 threads each
- Launch: 256 blocks × 256 threads = 65,536 threads
- All elements processed in parallel
- Parallelism: All 82 SMs working (256 blocks distributed across 82 SMs)

How blocks distribute:
• Each SM gets multiple blocks: 256 blocks ÷ 82 SMs ≈ 3 blocks per SM
• Each SM processes its assigned blocks' warps concurrently
• All SMs work simultaneously → ~82× more parallelism

Result: Option 2 is dramatically faster (~50-80× in practice)

Tutorial 3: Batch processing multiple matrices

Process 1,000 matrices of 256×256 elements simultaneously using a 3D grid configuration where the Z dimension represents the batch index.

Problem setup:

Batch size: 1,000 matrices
Matrix size: 256×256 elements each
Total elements: 1,000 × 256 × 256 = 65,536,000 elements

Memory layout (flattened):
Matrix 0: [elements 0 to 65,535]
Matrix 1: [elements 65,536 to 131,071]
Matrix 2: [elements 131,072 to 196,607]
...
Matrix 999: [elements 65,470,464 to 65,535,999]

3D grid organization:

Grid dimensions:
- X dimension: 16 blocks (for 256 columns ÷ 16)
- Y dimension: 16 blocks (for 256 rows ÷ 16)
- Z dimension: 1,000 blocks (one per matrix in batch)

Total blocks = gridDim.x × gridDim.y × gridDim.z
             = 16 × 16 × 1,000
             = 256,000 blocks

Threads per block = blockDim.x × blockDim.y × blockDim.z
                  = 16 × 16 × 1
                  = 256 threads

Total threads = Total blocks × Threads per block
              = 256,000 × 256
              = 65,536,000 threads

Warp organization inside each block:

Since each block has 256 threads and warps contain 32 threads:

Warps per block = 256 threads ÷ 32 threads/warp = 8 warps

Total warps in grid = 256,000 blocks × 8 warps/block
                    = 2,048,000 warps

These 2,048,000 warps get distributed across the GPU's streaming
multiprocessors (SMs). Each SM executes warps in groups, scheduling
them to hide memory latency. On an A100 GPU with 108 SMs:

Warps per SM ≈ 2,048,000 warps ÷ 108 SMs ≈ 18,963 warps/SM

The SM's warp scheduler executes these warps over time, switching
between warps whenever one stalls on memory access.
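
The 108-SM figure is specific to the A100. Rather than hard-coding it, you can query the device at runtime; a minimal sketch (ours, not part of the tutorial code):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("SMs: %d, warp size: %d, max threads per block: %d\n",
           prop.multiProcessorCount, prop.warpSize, prop.maxThreadsPerBlock);
    return 0;
}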

Thread-to-element mapping

Each thread computes one element in one matrix. Think of each block as processing a 16×16 chunk of elements within a specific matrix:

Single matrix (256×256) divided into 16×16 chunks:

Matrix 0:                    ← blockIdx.z = 0
┌────────┬────────┬─────┬────────┐
│ Block  │ Block  │ ... │ Block  │  ← blockIdx.y = 0
│ (0,0)  │ (1,0)  │     │ (15,0) │
│ 16×16  │ 16×16  │     │ 16×16  │
├────────┼────────┼─────┼────────┤
│ Block  │ Block  │ ... │ Block  │  ← blockIdx.y = 1
│ (0,1)  │ (1,1)  │     │ (15,1) │
├────────┼────────┼─────┼────────┤
│  ...   │  ...   │ ... │  ...   │
├────────┼────────┼─────┼────────┤
│ Block  │ Block  │ ... │ Block  │  ← blockIdx.y = 15
│ (0,15) │ (1,15) │     │(15,15) │
└────────┴────────┴─────┴────────┘
   ↑        ↑              ↑
bx=0     bx=1          bx=15

Inside each block, 16×16 threads process the elements:

Block (3, 5, 0):  ← Processing chunk at column 3, row 5, matrix 0
┌──┬──┬──┬───┬──┐
│T │T │T │...│T │  ← threadIdx.y = 0
│00│10│20│   │F0│
├──┼──┼──┼───┼──┤
│T │T │T │...│T │  ← threadIdx.y = 1
│01│11│21│   │F1│
├──┼──┼──┼───┼──┤
│..│..│..│...│..│
├──┼──┼──┼───┼──┤
│T │T │T │...│T │  ← threadIdx.y = 15
│0F│1F│2F│   │FF│
└──┴──┴──┴───┴──┘
 ↑  ↑  ↑      ↑
tx=0 1  2    15

Important: threadIdx always ranges from 0 to (blockDim - 1)
- threadIdx.x ranges from 0 to 15 (always resets for each block)
- threadIdx.y ranges from 0 to 15 (always resets for each block)
- Every block has its own local thread numbering starting at (0,0)
- To get the global position, we combine blockIdx and threadIdx

Coordinate calculation:

For a thread at (threadIdx.x, threadIdx.y) in block (blockIdx.x, blockIdx.y, blockIdx.z):

// Which matrix in the batch
int batch_idx = blockIdx.z;          // Range: 0 to 999

// Position within that matrix
int row = blockIdx.y * 16 + threadIdx.y;  // Range: 0 to 255
int col = blockIdx.x * 16 + threadIdx.x;  // Range: 0 to 255

Example: Thread at threadIdx(5, 7) in block(3, 10, 42):
- Processing matrix 42
- Row = 10 × 16 + 7 = 167
- Col = 3 × 16 + 5 = 53
- Computing element at position (167, 53) in matrix 42

Here is the complete implementation:

#include <cuda_runtime.h>
#include <cstdio>

// Kernel for batch matrix addition
__global__ void batchMatAdd(float* d_A, float* d_B, float* d_C,
                             int N, int batch_size)
{
    // Which matrix in the batch
    int batch_idx = blockIdx.z;

    // Position within this matrix
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check
    if (row < N && col < N && batch_idx < batch_size) {
        // Calculate flat index: skip to correct matrix, then to element
        int matrix_offset = batch_idx * N * N;
        int element_offset = row * N + col;
        int index = matrix_offset + element_offset;

        d_C[index] = d_A[index] + d_B[index];
    }
}

int main()
{
    const int N = 256;              // Matrix dimension
    const int batch_size = 1000;    // Number of matrices

    // Total elements for all matrices
    size_t total_size = batch_size * N * N * sizeof(float);

    // Allocate host memory
    float* h_A = (float*)malloc(total_size);
    float* h_B = (float*)malloc(total_size);
    float* h_C = (float*)malloc(total_size);

    // Initialize with sample data
    for (int i = 0; i < batch_size * N * N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, total_size);
    cudaMalloc(&d_B, total_size);
    cudaMalloc(&d_C, total_size);

    // Copy data to device
    cudaMemcpy(d_A, h_A, total_size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, total_size, cudaMemcpyHostToDevice);

    // Configure 3D grid: 16×16×1000 blocks, each with 16×16 threads
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N/16, N/16, batch_size);

    // Launch kernel
    batchMatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N, batch_size);

    // Copy results back
    cudaMemcpy(h_C, d_C, total_size, cudaMemcpyDeviceToHost);

    // Verify (check first and last matrix)
    printf("First element: %f (expected 3.0)\n", h_C[0]);
    printf("Last element: %f (expected 3.0)\n", h_C[batch_size * N * N - 1]);

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}

How does GPU handle 256,000 blocks?

Block assignment to SMs:

GPU: NVIDIA A100 (108 SMs)
Total blocks: 256,000 blocks
Total warps: 2,048,000 warps (each block has 8 warps)

Block assignment (initial wave):
256,000 blocks ÷ 108 SMs ≈ 2,370 blocks per SM

Block to SM mapping (simplified round-robin illustration; the hardware scheduler actually assigns blocks dynamically):
SM 0: Assigned blocks 0, 108, 216, ... (2,370 blocks total)
SM 1: Assigned blocks 1, 109, 217, ... (2,370 blocks total)
...
SM 107: Assigned blocks 107, 215, 323, ...

How SMs actually execute: warp-level scheduling:

Important: SMs don't execute "blocks" - they execute warps!

Each block (256 threads) contains 8 warps (32 threads each).

Inside SM 0 (each SM operates independently):
┌─────────────────────────────────────────┐
│ SM 0 Warp Scheduler                     │
│                                         │
│ Warps assigned to THIS SM:              │
│ - Block 0, Warp 0   (ready)   ← execute │
│ - Block 0, Warp 1   (stalled) ← skip    │
│ - Block 0, Warp 2   (ready)   ← execute │
│ - ...                                   │
│ - Block 108, Warp 0 (ready)   ← execute │
│ - Block 108, Warp 1 (ready)   ← execute │
│ - ...                                   │
└─────────────────────────────────────────┘

Each SM has its own warp scheduler that:
1. Manages only the warps from blocks assigned to THIS SM
2. Every few clock cycles, picks ready warps (not stalled on memory)
3. Executes those warps on this SM's CUDA cores
4. When a warp stalls (memory access), switches to another ready warp

Key point: Each SM independently switches between its own warps.
The 108 SMs on A100 all do this in parallel, each managing their
own set of blocks and warps.

Timeline:

The GPU doesn't process blocks sequentially. Instead:

- Blocks are assigned to SMs as resources become available
- Within each SM, warps from multiple blocks execute concurrently
- Warp scheduler switches between warps every few cycles
- When a block finishes, SM gets assigned a new block

This happens automatically—no explicit management needed!

Tutorial 4: Processing large batches with limited VRAM

The previous example processed 1,000 matrices. What if you need to process 100,000 matrices but your GPU VRAM can only hold 1,000 at a time? You process them in chunks using a loop.

Problem setup:

Given:
- Matrix size: 256×256 elements per matrix
- Total matrices: 100,000 matrices (entire dataset)
- Chunk size: 1,000 matrices (GPU VRAM limit per chunk)

Calculate:
- Total elements: 100,000 matrices × 256 × 256 = 6,553,600,000 elements
- Elements per chunk: 1,000 matrices × 256 × 256 = 65,536,000 elements
- Number of chunks: 100,000 matrices ÷ 1,000 matrices/chunk = 100 chunks

Solution: Process in 100 chunks of 1,000 matrices each

Chunk 0: Matrices 0-999
Chunk 1: Matrices 1,000-1,999
Chunk 2: Matrices 2,000-2,999
...
Chunk 99: Matrices 99,000-99,999

Implementation strategy:

For each chunk:
1. Copy 1,000 matrices from host (CPU) to device (GPU)
2. Launch kernel to process these 1,000 matrices
3. Copy results back from device to host
4. Repeat for next chunk

Tutorial 4 performs these steps strictly one chunk at a time; Tutorial 5 turns the same loop into pipelined ("streamed") processing by overlapping them with CUDA streams.

Here is the complete implementation:

#include <cuda_runtime.h>
#include <cstdio>

// Same kernel as before (processes up to 1,000 matrices)
__global__ void batchMatAdd(float* d_A, float* d_B, float* d_C,
                             int N, int batch_size)
{
    int batch_idx = blockIdx.z;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N && batch_idx < batch_size) {
        int matrix_offset = batch_idx * N * N;
        int element_offset = row * N + col;
        int index = matrix_offset + element_offset;

        d_C[index] = d_A[index] + d_B[index];
    }
}

int main()
{
    const int N = 256;                    // Matrix dimension
    const int total_matrices = 100000;    // Total dataset size
    const int chunk_size = 1000;          // GPU memory limit
    const int num_chunks = total_matrices / chunk_size;

    // Size for one chunk
    size_t chunk_bytes = chunk_size * N * N * sizeof(float);

    // Allocate host memory for entire dataset
    // (cast to size_t: 100,000 × 256 × 256 elements overflows a 32-bit int)
    float* h_A_all = (float*)malloc((size_t)total_matrices * N * N * sizeof(float));
    float* h_B_all = (float*)malloc((size_t)total_matrices * N * N * sizeof(float));
    float* h_C_all = (float*)malloc((size_t)total_matrices * N * N * sizeof(float));

    // Initialize data (would typically load from file)
    for (size_t i = 0; i < (size_t)total_matrices * N * N; i++) {
        h_A_all[i] = 1.0f;
        h_B_all[i] = 2.0f;
    }

    // Allocate device memory (only enough for one chunk)
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, chunk_bytes);
    cudaMalloc(&d_B, chunk_bytes);
    cudaMalloc(&d_C, chunk_bytes);

    // Grid configuration for 1,000 matrices
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N/16, N/16, chunk_size);

    printf("Processing %d matrices in %d chunks of %d\n",
           total_matrices, num_chunks, chunk_size);

    // Process each chunk
    for (int chunk = 0; chunk < num_chunks; chunk++) {
        // Calculate starting element offset in the full dataset
        // Chunk 0: offset = 0 * 1000 * 256 * 256 = 0 (start at matrix 0)
        // Chunk 1: offset = 1 * 1000 * 256 * 256 = 65,536,000 (start at matrix 1000)
        // Chunk 2: offset = 2 * 1000 * 256 * 256 = 131,072,000 (start at matrix 2000)
        // size_t avoids 32-bit integer overflow for the later chunks
        size_t offset = (size_t)chunk * chunk_size * N * N;

        printf("Processing chunk %d/%d (matrices %d-%d)\n",
               chunk + 1, num_chunks,
               chunk * chunk_size, (chunk + 1) * chunk_size - 1);

        // Copy chunk from host to device
        cudaMemcpy(d_A, h_A_all + offset, chunk_bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B_all + offset, chunk_bytes, cudaMemcpyHostToDevice);

        // Process this chunk
        batchMatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N, chunk_size);

        // Copy results back
        cudaMemcpy(h_C_all + offset, d_C, chunk_bytes, cudaMemcpyDeviceToHost);
    }

    // Verify results
    bool success = true;
    for (size_t i = 0; i < (size_t)total_matrices * N * N; i++) {
        if (h_C_all[i] != 3.0f) {
            success = false;
            break;
        }
    }
    printf("Result: %s\n", success ? "PASS" : "FAIL");

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A_all); free(h_B_all); free(h_C_all);

    return 0;
}

Tutorial 5: Pipelined processing with CUDA streams

Tutorial 4 processes chunks sequentially - each chunk must fully complete (copy to GPU → compute → copy back) before the next chunk begins. This leaves the GPU idle during data transfers and the CPU waiting during computation.

Tutorial 5 uses CUDA streams to process multiple chunks concurrently - while one chunk computes on the GPU, another chunk’s data is being copied in, and a third chunk’s results are being copied out. This dramatically improves hardware utilization.

Key concept:

Without streams (previous example):
════════════════════════════════════════════════════════════
Chunk 0: [H→D]---[Compute]---[D→H]
Chunk 1:                           [H→D]---[Compute]---[D→H]
Chunk 2:                                                     [H→D]---[Compute]---[D→H]

GPU idle during copies! CPU waits for GPU!

With streams (this example):
════════════════════════════════════════════════════════════
Stream 0: [H→D]---[Compute]---[D→H]
Stream 1:      [H→D]---[Compute]---[D→H]
Stream 2:           [H→D]---[Compute]---[D→H]

Operations overlap! Much better utilization!

Why asynchronous operations matter:

The regular ``cudaMemcpy`` is blocking - when you call it, your CPU waits
until the entire transfer completes before moving to the next line of code.
Similarly, kernel launches with the default stream wait for all previous
operations to finish.

This creates idle time:
- While data copies from CPU to GPU, the GPU compute units sit idle
- While GPU computes, the CPU cannot prepare the next chunk
- While results copy back, both CPU and GPU wait

``cudaMemcpyAsync`` is non-blocking - it returns immediately after queuing
the transfer, letting your CPU continue executing code. Combined with
streams, this allows multiple chunks to be processed concurrently. While
one chunk computes in stream 0, another chunk can copy data in via
stream 1, and a third chunk can copy results out via stream 2 - all
happening simultaneously.

Requirements for overlapping:

1. Use ``cudaMemcpyAsync`` instead of cudaMemcpy (non-blocking transfers)
2. Use pinned (page-locked) host memory with cudaMallocHost (enables DMA)
3. Use multiple CUDA streams to manage concurrent operations
4. Launch kernels in different streams (allows parallel execution)

Complete implementation

Here is the full code for Tutorial 5. Read through it first to see the overall structure, then continue to the section-by-section breakdown below.

Key parameters:

Given (same dataset as Tutorial 4):
- Matrix size: 256×256 elements per matrix
- Total matrices: 100,000 matrices (entire dataset)
- Chunk size: 1,000 matrices (amount processed per chunk)
- Number of streams: 4 streams (new in Tutorial 5!)

Calculate:
- Total chunks: 100,000 ÷ 1,000 = 100 chunks
- Chunks per stream: 100 chunks ÷ 4 streams = 25 chunks/stream
- Concurrent chunks: 4 chunks processing simultaneously (one per stream)

Key difference from Tutorial 4:
Tutorial 4 processes 1 chunk at a time sequentially (100 iterations)
Tutorial 5 processes 4 chunks at a time concurrently (still 100 iterations,
but chunks overlap in execution)

#include <cuda_runtime.h>
#include <cstdio>

__global__ void batchMatAdd(float* d_A, float* d_B, float* d_C,
                             int N, int batch_size)
{
    int batch_idx = blockIdx.z;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N && batch_idx < batch_size) {
        int matrix_offset = batch_idx * N * N;
        int element_offset = row * N + col;
        int index = matrix_offset + element_offset;

        d_C[index] = d_A[index] + d_B[index];
    }
}

int main()
{
    const int N = 256;
    const int total_matrices = 100000;
    const int chunk_size = 1000;
    const int num_chunks = total_matrices / chunk_size;
    const int num_streams = 4;

    size_t chunk_bytes = chunk_size * N * N * sizeof(float);

    // Allocate pinned host memory
    float *h_A_all, *h_B_all, *h_C_all;
    // Cast to size_t: 100,000 × 256 × 256 elements overflows a 32-bit int
    cudaMallocHost(&h_A_all, (size_t)total_matrices * N * N * sizeof(float));
    cudaMallocHost(&h_B_all, (size_t)total_matrices * N * N * sizeof(float));
    cudaMallocHost(&h_C_all, (size_t)total_matrices * N * N * sizeof(float));

    // Initialize data
    for (size_t i = 0; i < (size_t)total_matrices * N * N; i++) {
        h_A_all[i] = 1.0f;
        h_B_all[i] = 2.0f;
    }

    // Allocate device memory for each stream
    float *d_A[num_streams], *d_B[num_streams], *d_C[num_streams];
    cudaStream_t streams[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaMalloc(&d_A[i], chunk_bytes);
        cudaMalloc(&d_B[i], chunk_bytes);
        cudaMalloc(&d_C[i], chunk_bytes);
        cudaStreamCreate(&streams[i]);
    }

    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N/16, N/16, chunk_size);

    // Process chunks using streams
    for (int chunk = 0; chunk < num_chunks; chunk++) {
        int stream_idx = chunk % num_streams;
        size_t offset = (size_t)chunk * chunk_size * N * N;

        cudaMemcpyAsync(d_A[stream_idx], h_A_all + offset, chunk_bytes,
                      cudaMemcpyHostToDevice, streams[stream_idx]);
        cudaMemcpyAsync(d_B[stream_idx], h_B_all + offset, chunk_bytes,
                      cudaMemcpyHostToDevice, streams[stream_idx]);

        batchMatAdd<<<numBlocks, threadsPerBlock, 0, streams[stream_idx]>>>(
            d_A[stream_idx], d_B[stream_idx], d_C[stream_idx], N, chunk_size);

        cudaMemcpyAsync(h_C_all + offset, d_C[stream_idx], chunk_bytes,
                      cudaMemcpyDeviceToHost, streams[stream_idx]);
    }

    // Wait for all streams to complete
    for (int i = 0; i < num_streams; i++) {
        cudaStreamSynchronize(streams[i]);
    }

    // Verify results
    bool success = true;
    for (size_t i = 0; i < (size_t)total_matrices * N * N; i++) {
        if (h_C_all[i] != 3.0f) {
            success = false;
            break;
        }
    }
    printf("Result: %s\n", success ? "PASS" : "FAIL");

    // Cleanup
    for (int i = 0; i < num_streams; i++) {
        cudaFree(d_A[i]); cudaFree(d_B[i]); cudaFree(d_C[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFreeHost(h_A_all); cudaFreeHost(h_B_all); cudaFreeHost(h_C_all);

    return 0;
}

Code breakdown

Now let’s examine each section in detail to understand how streams enable concurrent processing.

Section 1: Allocate pinned memory

// Tutorial 4 used plain malloc (pageable host memory)
// Tutorial 5 uses cudaMallocHost for pinned memory

float *h_A_all, *h_B_all, *h_C_all;
// Cast to size_t: 100,000 × 256 × 256 elements overflows a 32-bit int
cudaMallocHost(&h_A_all, (size_t)total_matrices * N * N * sizeof(float));
cudaMallocHost(&h_B_all, (size_t)total_matrices * N * N * sizeof(float));
cudaMallocHost(&h_C_all, (size_t)total_matrices * N * N * sizeof(float));

Here we use pinned memory, which is required for asynchronous transfers with cudaMemcpyAsync.

What is pinned memory?

Important: Pinned memory is HOST (CPU) memory, not GPU memory!

Pinned memory (also called page-locked memory) is a special type of CPU RAM that the operating system guarantees will stay at a fixed physical location and never be swapped to disk.

Memory location clarification:

Computer Architecture:
┌─────────────────────────────────────────────────────────┐
│ CPU                              GPU                    │
│                                                         │
│ RAM (Host Memory)                VRAM (Device Memory)   │
│ ┌─────────────────────┐          ┌──────────────────┐   │
│ │ Regular malloc()    │          │ cudaMalloc()     │   │
│ │ (pageable)          │          │ (always pinned)  │   │
│ ├─────────────────────┤          └──────────────────┘   │
│ │ cudaMallocHost()    │                ↑                │
│ │ (pinned/page-locked)│────DMA────────┘                 │
│ └─────────────────────┘       Fast async transfer       │
│         ↑                                               │
│    Lives in CPU RAM, not GPU VRAM!                      │
└─────────────────────────────────────────────────────────┘

Regular host memory (pageable, malloc):

float* data = (float*)malloc(1000 * sizeof(float));

Where it lives: CPU RAM (pageable memory)

What happens:
- OS gives you virtual memory addresses in CPU RAM
- Physical location in CPU RAM can change over time
- OS can move pages to disk (swap) if RAM is low
- When you access the data, OS might need to swap it back from disk

This is good: Lets you use more memory than physically available
This is bad: GPU cannot directly access memory that might move or swap

Pinned host memory (page-locked, cudaMallocHost):

float* data;
cudaMallocHost(&data, 1000 * sizeof(float));

Where it lives: CPU RAM (pinned/page-locked memory)

What happens:
- Memory is allocated at a fixed physical CPU RAM address
- OS promises this CPU RAM will never move to a different address
- OS promises this CPU RAM will never swap to disk
- GPU can directly read from this CPU RAM via DMA (Direct Memory Access)

This is good: GPU can transfer from CPU RAM without OS intervention (fast, async)
This is bad: Reduces CPU RAM available for other programs (limited resource)

What problem does non-pinned memory solve?

OS uses virtual memory to let programs use more memory than physically available in RAM. The OS divides memory into fixed-size blocks called pages (typically 4 KB each). When RAM gets full, the OS moves less-used pages from RAM to disk storage. This process is called swapping or paging (paging = moving pages between RAM and disk). This lets your computer run many programs simultaneously even with limited RAM.

When you access swapped-out memory, the OS transparently swaps it back from disk to RAM. This is invisible to your program but creates unpredictable delays. For pageable memory (memory divided into pages that can move), allocated with malloc, the OS has full control and can swap pages to disk at any time based on memory pressure.

For GPU transfers, this creates problems because the GPU needs stable physical RAM addresses. If memory is on disk, the GPU cannot access it until the OS swaps it back. Pinned memory prevents swapping by telling the OS “keep these pages in RAM permanently” - the OS locks those pages in physical RAM and never moves them to disk, giving the GPU reliable direct access via DMA.

Section 2: Create multiple device buffers

// Tutorial 4: One pointer (float *d_A) pointing to one GPU memory allocation
// Tutorial 5: Array of pointers, each pointing to separate GPU memory

// Declare arrays of pointers (with num_streams = 4, this creates 4-element arrays)
// d_A is a 4-element array: d_A[0], d_A[1], d_A[2], d_A[3]
// Each element is a pointer that will point to separate GPU memory
float *d_A[num_streams], *d_B[num_streams], *d_C[num_streams];

// Array to hold 4 stream objects (execution contexts)
cudaStream_t streams[num_streams];

// Create memory allocations and streams
for (int i = 0; i < num_streams; i++) {
    // Allocate GPU memory for stream i's input A
    // &d_A[i] returns CPU RAM address where pointer d_A[i] is stored
    // cudaMalloc writes GPU VRAM address into d_A[i]
    cudaMalloc(&d_A[i], chunk_bytes);

    // Allocate GPU memory for stream i's input B
    cudaMalloc(&d_B[i], chunk_bytes);

    // Allocate GPU memory for stream i's output C
    cudaMalloc(&d_C[i], chunk_bytes);

    // Create execution stream i
    // Each stream is an independent queue of GPU operations
    cudaStreamCreate(&streams[i]);
}

Why multiple buffers? If all streams shared one set of buffers, they would overwrite each other’s data. Each stream needs its own workspace in GPU memory to process chunks independently. This multiplies VRAM usage by the number of streams, but enables true concurrency.

What is a buffer and why is it called that?

A buffer is just memory. We call it a “buffer” because it temporarily holds data flowing between systems that operate at different speeds (CPU ↔ GPU).

Why buffers are critical for streaming:

Without multiple buffers:
- Stream 0 copies data into d_A
- Stream 1 tries to copy into d_A → OVERWRITES stream 0's data!
- Result: Data corruption, wrong results

With multiple buffers (one per stream):
- Stream 0 uses d_A[0], d_B[0], d_C[0]
- Stream 1 uses d_A[1], d_B[1], d_C[1]
- Stream 2 uses d_A[2], d_B[2], d_C[2]
- Stream 3 uses d_A[3], d_B[3], d_C[3]
- Result: Each stream has its own workspace, no conflicts!

This is why Tutorial 5 needs num_streams sets of buffers. Each stream operates independently on its own data, enabling true concurrent processing. Without separate buffers, streams cannot overlap their operations.

Section 3: Process chunks asynchronously

Recall what we’re working with:

  • Total dataset: 100,000 matrices (each 256×256 elements)

  • Chunk size: 1,000 matrices per chunk

  • Total chunks: 100 chunks to process (100,000 ÷ 1,000)

  • Number of streams: 4 streams for concurrent processing

Terminology reminder:

  • Chunk: A portion of data (1,000 matrices). What we’re processing.

  • Stream: An execution context on the GPU. How we process it concurrently.

The loop processes all 100 chunks by assigning each chunk to one of the 4 streams in round-robin fashion. Multiple streams process different chunks simultaneously.

// Process all 100 chunks, assigning each to a stream in round-robin fashion
for (int chunk = 0; chunk < num_chunks; chunk++) {
    // Determine which stream to use (cycles through 0, 1, 2, 3, 0, 1, ...)
    // Chunk 0→Stream 0, Chunk 1→Stream 1, Chunk 4→Stream 0 (reuse), etc.
    int stream_idx = chunk % num_streams;

    // Calculate starting element offset (not byte offset!) for this chunk
    // "offset" tells us where this chunk's data starts in the h_A_all array
    // Chunk 0: offset = 0 × 1000 × 256 × 256 = 0 (start at element 0)
    // Chunk 1: offset = 1 × 1000 × 256 × 256 = 65,536,000 (skip first 1000 matrices)
    // Chunk 2: offset = 2 × 1000 × 256 × 256 = 131,072,000 (skip first 2000 matrices)
    // This is an element index, not bytes. We use it with pointer arithmetic.
    // size_t is required: chunk 99's offset (6,488,064,000) overflows a 32-bit int.
    size_t offset = (size_t)chunk * chunk_size * N * N;

    // Copy input matrix A for this chunk from CPU RAM to GPU VRAM
    // Source: h_A_all + offset (pinned CPU RAM, starting at chunk's data)
    // Destination: d_A[stream_idx] (GPU VRAM buffer for this stream)
    // Returns immediately (async), CPU continues to next line
    cudaMemcpyAsync(d_A[stream_idx], h_A_all + offset, chunk_bytes,
                  cudaMemcpyHostToDevice, streams[stream_idx]);

    // Copy input matrix B for this chunk from CPU RAM to GPU VRAM
    // Uses same stream, so this waits for previous operation in this stream
    cudaMemcpyAsync(d_B[stream_idx], h_B_all + offset, chunk_bytes,
                  cudaMemcpyHostToDevice, streams[stream_idx]);

    // Launch kernel to process this chunk in GPU VRAM
    // Reads from: d_A[stream_idx], d_B[stream_idx] (inputs in VRAM)
    // Writes to: d_C[stream_idx] (output in VRAM)
    // Fourth argument (0) is shared memory size (not used here)
    // Last argument specifies which stream to use
    batchMatAdd<<<numBlocks, threadsPerBlock, 0, streams[stream_idx]>>>(
        d_A[stream_idx], d_B[stream_idx], d_C[stream_idx], N, chunk_size);

    // Copy result back from GPU VRAM to CPU RAM
    // Source: d_C[stream_idx] (GPU VRAM with computed results)
    // Destination: h_C_all + offset (pinned CPU RAM, write at chunk's location)
    // Returns immediately (async), loop continues to process next chunk
    cudaMemcpyAsync(h_C_all + offset, d_C[stream_idx], chunk_bytes,
                  cudaMemcpyDeviceToHost, streams[stream_idx]);
}

Offset and pointer arithmetic visualization:

h_A_all array (100,000 matrices in CPU RAM):
┌──────────────────┬──────────────────┬──────────────────┬─────┬──────────────────┐
│  Chunk 0         │  Chunk 1         │  Chunk 2         │ ... │  Chunk 99        │
│  1000 matrices   │  1000 matrices   │  1000 matrices   │     │  1000 matrices   │
│  65,536,000 elem │  65,536,000 elem │  65,536,000 elem │     │  65,536,000 elem │
└──────────────────┴──────────────────┴──────────────────┴─────┴──────────────────┘
 ↑                  ↑                  ↑                         ↑
 offset=0           offset=65,536,000  offset=131,072,000        offset=6,488,064,000
 h_A_all+0          h_A_all+65536000   h_A_all+131072000         h_A_all+6488064000

When chunk=1:
offset = 1 × 1000 × 256 × 256 = 65,536,000

cudaMemcpyAsync(d_A[stream_idx], h_A_all + offset, chunk_bytes, ...)
                                 ↑
                                 Pointer arithmetic!
                                 h_A_all is float*, so +offset means:
                                 "skip 65,536,000 float elements"
                                 = skip 65,536,000 × 4 bytes
                                 = skip 262,144,000 bytes
                                 = start at chunk 1's data

Key point: When you add to a pointer in C/C++, the compiler automatically
multiplies by sizeof(type). So h_A_all + offset means "start offset floats
into the array", not "start offset bytes into memory".

Here, cudaMemcpyAsync returns immediately, letting the loop continue. Operations within a stream execute in order (copy → compute → copy), but different streams execute concurrently. While stream 0 computes, stream 1 copies data in, and stream 2 copies results out.

What actually happens: CPU queues work, GPU executes it:

CPU behavior (very fast, microseconds per iteration):
- Loop runs through all 100 chunks rapidly
- Each iteration just QUEUES operations (doesn't wait for them)
- `cudaMemcpyAsync` returns immediately (just adds to stream's queue)
- Kernel launch returns immediately (just adds to stream's queue)
- CPU finishes the entire loop in milliseconds
- Then CPU continues to other work (or waits at cudaStreamSynchronize)

GPU behavior (slow, processes when ready):
- Receives queued operations in each stream
- Executes operations in each stream sequentially (H→D, compute, D→H)
- BUT executes different streams concurrently
- Takes seconds to process all 100 chunks

Timeline example (to scale):

Time →    0ms          1ms                                    2000ms (2 seconds)
          ▼            ▼                                      ▼
CPU:      |████| Loop finishes instantly! → [CPU free to do other work]
               ↑
               All 100 chunks queued

GPU:      |─────────────────────────────────────────────────────────────────|
Stream 0: |─H→D─|─Compute─|─D→H─|─H→D─|─Compute─|─D→H─|... (chunks 0,4,8,12...)
Stream 1:     |─H→D─|─Compute─|─D→H─|─H→D─|─Compute─|─D→H─|... (chunks 1,5,9,13...)
Stream 2:         |─H→D─|─Compute─|─D→H─|─H→D─|─Compute─|... (chunks 2,6,10,14...)
Stream 3:             |─H→D─|─Compute─|─D→H─|─H→D─|─Compute| (chunks 3,7,11,15...)
                      ↑ GPU crunching numbers for ~2 seconds total

What happens in practice:
1. for loop runs (100 iterations): ~1 millisecond
   - Each iteration just queues 3 async operations per stream
   - No waiting, just scheduling work
   - Loop exits immediately after queuing all 100 chunks

2. GPU processes queued work: ~2000 milliseconds (2 seconds)
   - Works through each stream's queue sequentially
   - Processes 4 streams concurrently
   - Stream 0 processes chunks 0→4→8→12→...→96 (25 chunks total)
   - Stream 1 processes chunks 1→5→9→13→...→97 (25 chunks total)
   - Stream 2 processes chunks 2→6→10→14→...→98 (25 chunks total)
   - Stream 3 processes chunks 3→7→11→15→...→99 (25 chunks total)
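
If you want to observe this queuing behavior from the CPU side, cudaStreamQuery offers a non-blocking check of a stream's queue; a small sketch:

// Sketch: non-blocking check of whether a stream has drained its queue.
// Called right after the loop, this usually reports "still busy" because
// the CPU finished queuing long before the GPU finished executing.
if (cudaStreamQuery(streams[0]) == cudaSuccess) {
    printf("Stream 0 has finished all queued work\n");
} else {
    printf("Stream 0 still has pending work (cudaErrorNotReady)\n");
}
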
How to choose the number of streams?

The optimal number of streams depends on the balance between copy time and compute time for your workload.

Theoretical guideline (simplified rule of thumb):

Goal: Keep GPU compute fully utilized (never idle waiting for data)

To achieve full overlap, while GPU computes one chunk, you need enough
streams to be copying data for upcoming chunks.

Derivation:
- During compute time T, the GPU is busy computing one chunk
- In that same window, the copies for upcoming chunks must finish
- Copy time for one chunk = C (includes both H→D and D→H)
- The fraction of each compute period spent covering copies is C / T
- So roughly C / T chunks must be copying in flight while one chunk computes

Formula: num_streams ≥ C / T (this keeps compute continuously fed)

However, the formula num_streams ≥ 2 × (C / T) + 1 adds safety margin for:
- Separate H→D and D→H copy engines (factor of 2)
- One extra stream for buffer (the +1)

Example:
- Copy H→D: 10 ms
- Compute:  50 ms
- Copy D→H: 10 ms
- Total copy: C = 20 ms, Compute: T = 50 ms

Basic formula: 20 / 50 = 0.4 → Need at least 1 stream
Conservative formula: 2 × (20 / 50) + 1 = 1.8 → Round up to 2 streams

Reality check: This is a rough guideline, not a law!
- Assumes perfect overlap (rarely happens in practice)
- Ignores PCIe bandwidth limits, copy engine contention
- Best approach: Start with 4 streams, then measure and tune

Practical considerations:

1. VRAM limitation (most important):
   - Each stream needs its own buffers (d_A, d_B, d_C)
   - More streams = more VRAM usage
   - Example: 4 streams × 3 buffers × 1000 matrices × 256² × 4 bytes = 3 GB
   - If VRAM is limited, reduce num_streams or chunk_size

2. Compute-bound vs memory-bound:
   - Compute-bound (kernel time >> copy time): Use 2-4 streams
   - Memory-bound (copy time >> kernel time): Use 4-8 streams
   - Balanced workload: Use 3-4 streams

3. Diminishing returns:
   - Beyond 4-8 streams, speedup plateaus
   - More streams = more synchronization overhead
   - GPU hardware has limited concurrent copy engines (typically 2)

4. Hardware constraints:
   - PCIe lanes limit concurrent transfers
   - GPU copy engines (typically 2: one H→D, one D→H)
   - SM occupancy and scheduling capacity

Typical recommendations:

- Start with 4 streams (good balance for most workloads)
- If compute-heavy: Try 2-3 streams
- If memory-bound: Try 6-8 streams
- Always measure performance with your specific workload
- Watch VRAM usage: nvidia-smi to monitor memory consumption

How to test:

Run your code with different num_streams values (2, 4, 8, 16) and measure:
- Total execution time (lower is better)
- VRAM usage (must fit in available memory)
- GPU utilization (nvidia-smi dmon -s u)

Choose the smallest num_streams that gives near-maximum speedup.
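
One way to measure total execution time is with CUDA events. Here is a hedged sketch (variable names are ours) wrapping the processing loop from the code above:

// Sketch: time the whole pipeline with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);

// ... the chunk loop from above: queue async copies + kernels per stream ...

for (int i = 0; i < num_streams; i++) {
    cudaStreamSynchronize(streams[i]);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("num_streams = %d: %.1f ms\n", num_streams, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
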
What speedup can we expect with streaming?

Ideal speedup calculation:

Assume for one chunk:
- Copy H→D: 10 ms
- Compute:  50 ms
- Copy D→H: 10 ms
- Total:    70 ms per chunk

Without streams (100 chunks):
Total time = 100 × 70 ms = 7,000 ms = 7 seconds

With streams (4 in flight):
Once the pipeline fills, copies overlap with compute. A kernel this large
saturates the GPU, so compute for different chunks effectively runs one
after another, but the copy engines stream the next chunk in and the
previous results out while the current chunk computes.

After pipeline fills:
- One chunk completes every max(H→D, Compute, D→H) = max(10, 50, 10) = 50 ms
- 100 chunks × 50 ms ≈ 5,000 ms
- Plus initial fill and final drain: ~20 ms
- Total: ~5,020 ms ≈ 5 seconds

Speedup: 7.0 / 5.0 ≈ 1.4× for this compute-heavy example
(the closer copy time gets to compute time, the larger the win)

Real-world considerations:

- PCIe bandwidth limits (16 GB/s typical)
- GPU compute capability and occupancy
- Memory allocation overhead
- Synchronization costs

Typical real speedup: 1.5-3×, depending on how much of the baseline time goes to transfers (still very significant!)

Section 4: Synchronize and verify

// Wait for all streams to finish before reading results
for (int i = 0; i < num_streams; i++) {
    cudaStreamSynchronize(streams[i]);
}

// Now it's safe to verify results in h_C_all
bool success = true;
for (int i = 0; i < total_matrices * N * N; i++) {
    if (h_C_all[i] != 3.0f) {
        success = false;
        break;
    }
}

Why synchronize? Because all operations are asynchronous, the loop completes before the GPU finishes processing. Without synchronization, you would read h_C_all while async copies are still writing to it, causing race conditions. cudaStreamSynchronize blocks the CPU until that stream completes all queued operations.
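
As a small addition (not in the original listing), the synchronization calls also return error codes; checking them catches failed copies or kernel launches that would otherwise go unnoticed.

// Hedged sketch: same synchronization loop, but checking the returned status
for (int i = 0; i < num_streams; i++) {
    cudaError_t err = cudaStreamSynchronize(streams[i]);
    if (err != cudaSuccess) {
        fprintf(stderr, "Stream %d failed: %s\n", i, cudaGetErrorString(err));
    }
}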

Tutorial 6: Matrix multiplication with shared memory (1024×1024)

In previous tutorials, we did matrix addition where each thread computes C[i][j] = A[i][j] + B[i][j]. This is simple: read two values, add them, write one value. Matrix multiplication is very different and much more complex.

Key difference: Each thread does MUCH more work!

  • Matrix addition: Thread reads 2 values, does 1 addition, writes 1 value

  • Matrix multiplication (N=1024): Thread reads 2048 values, does 1024 multiplications and 1024 additions, writes 1 value

For matrix multiplication, computing just ONE output element C[i][j] requires: - Reading an entire row from A (1024 values) - Reading an entire column from B (1024 values) - Computing 1024 multiplications and summing them up

This creates a performance bottleneck that we’ll solve step by step using shared memory.

Problem size: We’ll work with 1024×1024 matrices throughout this tutorial. This large size makes the performance issues obvious and demonstrates why shared memory optimization is essential.

Understanding matrix multiplication

Let’s start with what matrix multiplication actually does:

Matrix multiplication formula:

C[i][j] = Σ(A[i][k] × B[k][j]) for k = 0 to N-1

Example for N=4:

Matrix A (4×4):        Matrix B (4×4):        Matrix C (4×4):
┌           ┐          ┌           ┐          ┌           ┐
│ 1  2  3  4│          │ 5  6  7  8│          │ 70  ...   │
│ 5  6  7  8│    ×     │ 9 10 11 12│    =     │ ...  ...  │
│ 9 10 11 12│          │13 14 15 16│          │ ...  ...  │
│13 14 15 16│          │17 18 19 20│          │ ...  ...  │
└           ┘          └           ┘          └           ┘

Computing C[0][0]:
C[0][0] = A[0][0]×B[0][0] + A[0][1]×B[1][0] + A[0][2]×B[2][0] + A[0][3]×B[3][0]
        = 1×5 + 2×9 + 3×13 + 4×17
        = 5 + 18 + 39 + 68
        = 130

Key observation: Computing one element requires reading an entire row of A
and an entire column of B (N elements each).
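
To make the formula concrete, here is a plain CPU reference implementation (a sketch of my own for checking results; matMulCPU is a name I'm introducing). Matrices are stored row-major in 1D arrays, matching the GPU code that follows.

// Hedged sketch: CPU reference for C[i][j] = Σ A[i][k] × B[k][j]
void matMulCPU(const float* A, const float* B, float* C, int N)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[i * N + k] * B[k * N + j];  // row of A dotted with column of B
            }
            C[i * N + j] = sum;
        }
    }
}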

Step 1: Simple implementation

Let’s start with the most straightforward way to implement matrix multiplication on GPU. Each thread computes one output element.

// Naive matrix multiplication kernel (slow - for educational purposes)
__global__ void matMulNaive(float* A, float* B, float* C, int N)
{
    // Calculate which element this thread computes
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;

        // Each thread reads entire row of A and column of B from global memory
        for (int k = 0; k < N; k++) {
            // Every access goes to slow global memory (400-800 cycle latency!)
            sum += A[row * N + k] * B[k * N + col];
        }

        C[row * N + col] = sum;
    }
}

int main()
{
    int N = 1024;  // 1024×1024 matrices
    size_t size = N * N * sizeof(float);

    // Allocate and initialize host matrices
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize matrices (for example, with random values)
    for (int i = 0; i < N * N; i++) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    // Allocate device matrices
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel with 16×16 thread blocks
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
    // Why (N + 15) / 16 instead of N / 16?
    // This is ceiling division: ensures we have enough blocks for any N
    // Examples:
    //   N = 1024: (1024 + 15) / 16 = 1039 / 16 = 64 blocks ✓
    //   N = 1020: (1020 + 15) / 16 = 1035 / 16 = 64 blocks (need 64, not 63!)
    //   N = 1000: (1000 + 15) / 16 = 1015 / 16 = 63 blocks (need 63, not 62!)
    // Formula: (N + TILE_SIZE - 1) / TILE_SIZE always rounds UP

    matMulNaive<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result back
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}

What each thread does (visual explanation):

Example: Thread computing C[5][3] (row 5, column 3)

Matrix A (1024×1024):              Matrix B (1024×1024):
Row 0                              Col 0  Col 3 (target)
Row 1                              │      │
Row 2                              │      │
Row 3                              │      │
Row 4                              │      │
Row 5 (target) ►►►►►►►►►►►►►►►     │      ▼
┌─────────────────────────────┐    ┌──────▓▓▓────┐
│                             │    │      ▓▓▓    │
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│    │      ▓▓▓    │
│                             │    │      ▓▓▓    │
│                             │    │      ▓▓▓    │
│                             │    │      ▓▓▓    │
│                             │    │      ▓▓▓    │
└─────────────────────────────┘    └──────▓▓▓────┘
 1024 elements (shaded)              1024 elements (shaded)
 from A[5][0] to A[5][1023]          from B[0][3] to B[1023][3]

Computation:
C[5][3] = A[5][0]×B[0][3] + A[5][1]×B[1][3] + ... + A[5][1023]×B[1023][3]
        = (1024 multiplications) + (1024 additions)
        = Reads 2048 values from global memory

Grid configuration:
- 64×64 = 4,096 blocks (covers entire 1024×1024 output)
- 16×16 = 256 threads per block
- Total: 1,048,576 threads, each computing one output element

This works correctly and is similar to how we’ve written previous kernels. But there’s a performance problem we need to understand.

Step 2: Understanding the performance problem

Let’s analyze what happens when this kernel runs. Focus on one block computing a 16×16 chunk of the output matrix C.

Memory access pattern visualization:

Block computing C[0:16][0:16] (256 threads) for 1024×1024 matrices

Thread (0,0) computing C[0][0]:
- Reads A[0][0], A[0][1], A[0][2], ..., A[0][1023]     (1024 reads)
- Reads B[0][0], B[1][0], B[2][0], ..., B[1023][0]     (1024 reads)
- Total: 2048 reads from global memory

Thread (0,1) computing C[0][1]:
- Reads A[0][0], A[0][1], A[0][2], ..., A[0][1023]     (same 1024 reads!)
- Reads B[0][1], B[1][1], B[2][1], ..., B[1023][1]     (1024 reads)
- Total: 2048 reads from global memory

Notice: Thread (0,0) and Thread (0,1) BOTH read the same A[0][:] row!

All 16 threads in row 0 read the same A[0][:] data → 16× redundant reads
All 16 threads in col 0 read the same B[:][0] data → 16× redundant reads

The redundancy problem:

For a 16×16 block (256 threads), N=1024 matrices:

Each thread: 2048 global memory reads
Total reads by all threads: 256 × 2048 = 524,288 reads

But how much unique data do we actually need?
- Block needs 16 rows from A: 16 × 1024 = 16,384 elements
- Block needs 16 cols from B: 1024 × 16 = 16,384 elements
- Total unique data: 32,768 elements

Redundancy factor: 524,288 / 32,768 = 16×

We're reading the same data 16 times! This is wasteful.

Why is this a problem?:

Global memory (VRAM) is slow:
- Latency: ~400-800 cycles per access
- For 2048 reads per thread: 2048 × 400 = 819,200 cycles just waiting for memory!

Even with thousands of threads computing in parallel, memory latency is the bottleneck.

What do you mean by “cycle”?

A cycle (or clock cycle) is the basic unit of time in a processor. It’s one tick of the GPU’s internal clock.

Simple analogy:

Think of a cycle like a heartbeat:
- Your heart beats at ~60 beats per minute
- GPU's clock "beats" at 1-2 GHz (1-2 billion beats per second!)
- Each beat (cycle) is when the GPU can do one basic operation

GPU clock speed:

Modern GPU: 1.5 GHz clock speed
- 1.5 GHz = 1,500,000,000 cycles per second
- Each cycle = 1/1,500,000,000 seconds ≈ 0.67 nanoseconds
- Extremely fast, but memory is still much slower!

Why cycles matter for memory:

When we say "400 cycles latency":
- GPU requests data from global memory
- Must wait 400 clock cycles before data arrives
- During those 400 cycles, GPU could do 400 other operations!
- This is why memory latency is expensive

Shared memory at 20 cycles:
- Only wait 20 clock cycles for data
- 20× faster than 400 cycles
- GPU wastes less time waiting, does more useful work

Real-world timing:

At 1.5 GHz GPU clock:
- 400 cycles = 400 × 0.67 ns ≈ 267 nanoseconds (global memory)
- 20 cycles = 20 × 0.67 ns ≈ 13 nanoseconds (shared memory)

Even though nanoseconds sound fast, in GPU time this is huge!
A GPU can execute billions of operations per second, so every
nanosecond counts.

Step 3: Introducing shared memory

To solve the redundancy problem, we need a faster place to store data that multiple threads can reuse. This is called shared memory.

What is shared memory?:

GPU memory hierarchy (fastest to slowest):

1. Registers (per thread)
   - Fastest: ~1 cycle latency
   - Tiny: ~255 registers per thread
   - Private to each thread

2. Shared memory (per block)
   - Fast: ~20-40 cycle latency (10-20× faster than global memory!)
   - Small: ~48-96 KB per SM
   - Shared by all threads in a block (NOT divided by warp!)
   - All threads in the block can access any part of it
   - Explicitly managed by programmer

3. Global memory (all threads)
   - Slow: ~400-800 cycle latency
   - Large: GBs of VRAM
   - Accessible by all threads
   - Where cudaMalloc allocates

Key insight:

If we load data from global memory into shared memory once,
then all 256 threads in the block can reuse it from shared memory!

Instead of:
- The 16 threads that need A[0][k] each reading it from global memory (16× redundant)

We can:
- One thread loads A[0][k] into shared memory once
- Every thread that needs it reads it from shared memory (10-20× faster per access!)

Is shared memory divided per warp?

No! Shared memory is shared by the entire block, not divided among warps.

How shared memory works:

Block with 256 threads (8 warps × 32 threads):
- Total shared memory allocated: 2 KB (two 16×16 tiles)
- Each warp does NOT get 2KB ÷ 8 = 256 bytes
- ALL 256 threads can access the FULL 2 KB

Think of it like a shared whiteboard:
- NOT: Each warp gets their own section of the whiteboard
- YES: All threads can read/write anywhere on the whiteboard

Memory access pattern:

__shared__ float tileA[16][16];  // 1 KB total

Block has 256 threads (warps 0-7):
- Thread in warp 0 can read tileA[15][15]
- Thread in warp 7 can read tileA[0][0]
- Thread in warp 3 can read tileA[8][12]
- Any thread can access any element!

Shared memory is a single pool accessible by all threads in the block.

Why this matters:

If shared memory WERE divided per warp:
- Warp 0 could only see 1/8 of the data
- Warp 7 could only see their 1/8 of the data
- Defeats the purpose! (We want ALL threads to reuse the SAME data)

Because it's shared across the entire block:
- All 256 threads can reuse the same loaded tiles
- Perfect for matrix multiplication (all threads need same rows/columns)
- This is what makes shared memory so powerful!

Important note:

While shared memory is accessible by all threads in a block, execution
happens in warps (32 threads at a time). So when multiple warps access
shared memory, the GPU schedules them appropriately to avoid conflicts.

But there’s a catch: shared memory is small (48-96 KB). We can’t fit entire 1024×1024 matrices. The solution is tiling.

What are typical shared memory sizes?

Shared memory size varies by GPU architecture and is much smaller than global memory (VRAM).

Shared memory per Streaming Multiprocessor (SM):

GPU Architecture         Shared Memory per SM
─────────────────────────────────────────────
Kepler (GTX 700)         48 KB
Maxwell (GTX 900)        48 KB or 96 KB (configurable)
Pascal (GTX 1000)        64 KB or 96 KB (configurable)
Volta (V100)             96 KB
Turing (RTX 2000)        64 KB
Ampere (RTX 3090)        100 KB
Ampere (A100)            164 KB
Ada Lovelace (RTX 4000)  100 KB
Hopper (H100)            228 KB

Why so small?:

Global memory (VRAM):     8-80 GB (billions of bytes)
Shared memory per SM:     48-228 KB (tens to hundreds of kilobytes)

Ratio: roughly 100,000:1 difference!

Shared memory is:
- On-chip (physically on the GPU die, not external VRAM chips)
- Extremely fast (20-40 cycle latency vs 400-800 for global memory)
- Limited by chip area (silicon real estate is expensive)
- Shared among ALL blocks running on that SM

Practical implications:

If your GPU has 96 KB shared memory per SM:
- One 1024×1024 float matrix = 1024 × 1024 × 4 bytes = 4 MB
- Only a tiny fraction of it fits in shared memory at once!
- Must process in smaller tiles (e.g., 16×16 = 1 KB per tile)

If you try to allocate too much:
- Kernel launch fails with "out of resources" error
- Or fewer blocks can run per SM (lower occupancy = slower)

This is why tiling is essential for large matrices!
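
If you want the exact numbers for your own card, the runtime can report them (a minimal standalone sketch using cudaGetDeviceProperties):

#include <cstdio>
#include <cuda_runtime.h>

// Hedged sketch: query how much shared memory your GPU actually offers
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("GPU: %s\n", prop.name);
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("Shared memory per SM:    %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}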

Step 4: Tiling strategy

Since shared memory is too small for full matrices, we break the computation into smaller pieces called tiles. Let’s build this understanding step by step.

4.1 The problem: Shared memory is too small

To compute one 16×16 output tile C[0:16][0:16] from 1024×1024 matrices,
we need data from both A and B:

Visual showing what data is needed:

        Matrix A (1024×1024)                    Matrix B (1024×1024)
        ┌──────────────────────────────┐        ┌──────────────────────────────┐
        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
Rows    │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
0-15    │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
(16     │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
rows)   │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        │                              │        │      ...     │               │
        │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        └──────────────────────────────┘        └──────────────────────────────┘
            ALL 1024 columns                       Cols 0-15 (16 cols)
                ↓                                  ALL 1024 rows
                                                        ↓

                            Output: C[0:16][0:16]
                            ┌──────────────┐
                            │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│  16×16 tile
                            │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│  (256 elements)
                            │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
                            │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
                            │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
                            └──────────────┘

The math: Each element C[i][j] = Σ(A[i][k] × B[k][j]) for k = 0 to 1023

Key insight: We need ALL 1024 values from each strip to compute the final result!
- For C[0][0]: Need A[0][0] through A[0][1023] AND B[0][0] through B[1023][0]
- For C[0][1]: Need A[0][0] through A[0][1023] AND B[0][1] through B[1023][1]
- Cannot skip any values - all 1024 terms are needed for the sum

Data requirements:
- Need 16 rows of A: 16 × 1024 elements = 16,384 floats = 64 KB
- Need 16 cols of B: 1024 × 16 elements = 16,384 floats = 64 KB
- Total: 128 KB needed in shared memory simultaneously

But shared memory available: Only 48-96 KB per SM!

128 KB > 96 KB → Doesn't fit! We need a different approach.

4.2 The solution: Break into smaller tiles

Instead of loading all 1024 elements at once, we break them into manageable chunks:

The key idea: C[i][j] = Σ(A[i][k] × B[k][j]) for k = 0 to 1023

Break the 1024-wide strips into 64 sequential chunks of width 16:

Matrix A (one row, 1024 elements total):
┌──────┬──────┬──────┬─────────┬──────┐
│Tile 0│Tile 1│Tile 2│   ...   │Tile63│
│  0-15│16-31 │32-47 │         │1008+ │
└──────┴──────┴──────┴─────────┴──────┘
 1 KB   1 KB   1 KB             1 KB

Matrix B (one column, 1024 elements total):
┌──────┐
│Tile 0│  ← elements 0-15
│  0-15│
├──────┤
│Tile 1│  ← elements 16-31
│16-31 │
├──────┤
│Tile 2│  ← elements 32-47
│32-47 │
├──────┤
│  ... │
├──────┤
│Tile63│  ← elements 1008-1023
│1008+ │
└──────┘
 Each tile = 1 KB

Sequential processing (one iteration at a time):

Iteration 0:  Load A[i][0:15]    and B[0:15][j]    → Compute partial sum (2 KB)
Iteration 1:  Load A[i][16:31]   and B[16:31][j]   → Add to sum (2 KB)
Iteration 2:  Load A[i][32:47]   and B[32:47][j]   → Add to sum (2 KB)
...
Iteration 63: Load A[i][1008:1023] and B[1008:1023][j] → Final sum (2 KB)

4.3 How tile sharing works between threads

This section shows how all 256 threads in a block cooperate and share data through iterations.

In iteration 0, the ENTIRE BLOCK loads TWO 16×16 tiles into shared memory:

Shared memory (accessible by all 256 threads in the block):
┌─────────────────────┐  ┌─────────────────────┐
│     tileA (16×16)   │  │     tileB (16×16)   │
│   A[0:15][0:15]     │  │   B[0:15][0:15]     │
│                     │  │                     │
│   ONE copy shared   │  │   ONE copy shared   │
│   by 256 threads    │  │   by 256 threads    │
└─────────────────────┘  └─────────────────────┘
     1 KB                      1 KB

Now all 256 threads work IN PARALLEL using these shared tiles:

Thread (0,0): Uses row 0 of tileA, col 0 of tileB
              → Computes partial sum for C[0][0]
              → (2.1×1.5) + (3.4×2.3) + ... = 87.4

Thread (0,1): Uses row 0 of tileA, col 1 of tileB
              → Computes partial sum for C[0][1]
              → (2.1×4.2) + (3.4×1.8) + ... = 73.2

Thread (5,7): Uses row 5 of tileA, col 7 of tileB
              → Computes partial sum for C[5][7]
              → (6.3×2.7) + (4.1×5.9) + ... = 95.8

Thread (15,15): Uses row 15 of tileA, col 15 of tileB
                → Computes partial sum for C[15][15]
                → (3.8×7.2) + (9.1×2.4) + ... = 81.3

The magic: All 256 threads read from the SAME tiles in shared memory!
- Tiles loaded ONCE from global memory (slow)
- Accessed MANY TIMES by different threads (fast - shared memory)
- No redundant loading!

Iteration 1: Block loads NEW tiles into shared memory (overwrites old data)

┌─────────────────────┐  ┌─────────────────────┐
│     tileA (16×16)   │  │     tileB (16×16)   │
│  A[0:15][16:31]     │  │  B[16:31][0:15]     │
│                     │  │                     │
│   NEW data from     │  │   NEW data from     │
│   columns 16-31     │  │   rows 16-31        │
└─────────────────────┘  └─────────────────────┘

Same threads, NEW tiles, ACCUMULATE to existing sums:

Thread (0,0): Uses row 0 of NEW tileA, col 0 of NEW tileB
              → Adds to sum: 87.4 + [(5.7×3.2) + (2.8×6.1) + ...] = 87.4 + 64.8
              → Current sum for C[0][0] = 152.2

Thread (0,1): Uses row 0 of NEW tileA, col 1 of NEW tileB
              → Adds to sum: 73.2 + [(5.7×1.9) + (2.8×4.5) + ...] = 73.2 + 58.6
              → Current sum for C[0][1] = 131.8

Thread (5,7): Uses row 5 of NEW tileA, col 7 of NEW tileB
              → Adds to sum: 95.8 + [(3.4×8.1) + (7.2×3.3) + ...] = 95.8 + 71.4
              → Current sum for C[5][7] = 167.2

Thread (15,15): Uses row 15 of NEW tileA, col 15 of NEW tileB
                → Adds to sum: 81.3 + [(6.1×4.7) + (8.3×2.1) + ...] = 81.3 + 69.5
                → Current sum for C[15][15] = 150.8

Pattern continues for 62 more iterations (iteration 2, 3, ..., 63)
Each iteration: Load 2 new tiles → All threads compute → Accumulate sums


After iteration 63, all threads have computed their full 1024-term sum:

- Thread (0,0) has final C[0][0] = 2,847
- Thread (0,1) has final C[0][1] = 3,012.5
- Thread (5,7) has final C[5][7] = 2,691.8
- Thread (15,15) has final C[15][15] = 2,934.1

Result: Each iteration uses only 2 KB, well within 48-96 KB shared memory limit!
After 64 iterations, all 256 output elements are complete.

Let’s follow ONE thread computing ONE output element through the entire process:

Thread (3,5) in Block (0,0) computes C[3][5].

The formula: C[3][5] = A[3][0]×B[0][5] + A[3][1]×B[1][5] + ... + A[3][1023]×B[1023][5]
                     = 1024 multiplication-and-add operations

Thread (3,5)'s private variable:
float sum = 0.0f;  // Lives in this thread's register, NOT shared!

─────────────────────────────────────────────────────────────────
ITERATION 0: Process elements k = 0 to 15
─────────────────────────────────────────────────────────────────

Step 1: ALL threads cooperatively load tiles (each thread loads 2 elements total)
        Thread (3,5) loads TWO elements:
        - For tileA: tileA[3][5] = A[3][5]
        - For tileB: tileB[3][5] = B[3][5]

        Thread (0,0) loads TWO elements:
        - For tileA: tileA[0][0] = A[0][0]
        - For tileB: tileB[0][0] = B[0][0]

        Thread (7,2) loads TWO elements:
        - For tileA: tileA[7][2] = A[7][2]
        - For tileB: tileB[7][2] = B[7][2]

        ... all 256 threads load, filling both tiles (512 elements total)

        Result: Shared memory now contains:
        tileA[0:15][0:15] = A[0:15][0:15]
        tileB[0:15][0:15] = B[0:15][0:15]

Visual: What Thread (3,5) accesses across iterations

Matrix A (1024×1024) - Thread (3,5) always uses row 3:
Row 0
Row 1
Row 2
Row 3 ►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►
┌─────┬─────┬─────┬───┬─────┐
│▓▓▓▓▓│     │     │...│     │  ▓ = Iter 0: A[3][0:15]
│ 0-15│16-31│32-47│   │1008+│    Loads into tileA[3][5] = A[3][5]
└─────┴─────┴─────┴───┴─────┘
│     │▓▓▓▓▓│     │...│     │  ▓ = Iter 1: A[3][16:31]
│     │16-31│     │   │     │    Loads into tileA[3][5] = A[3][21]
└─────┴─────┴─────┴───┴─────┘
│     │     │▓▓▓▓▓│...│     │  ▓ = Iter 2: A[3][32:47]
│     │     │32-47│   │     │    Loads into tileA[3][5] = A[3][37]
└─────┴─────┴─────┴───┴─────┘
              ... continues for 64 iterations
│     │     │     │...│▓▓▓▓▓│  ▓ = Iter 63: A[3][1008:1023]
│     │     │     │   │1008+│    Loads into tileA[3][5] = A[3][1013]
└─────┴─────┴─────┴───┴─────┘

Matrix B (1024×1024) - Thread (3,5) always uses column 5:
     Col 5 (target)
        │
        ▼
┌──────▓▓▓──────┐  ▓ = Iter 0: B[0:15][5]
│      ▓▓▓      │    Loads into tileB[3][5] = B[3][5]
│      0-15     │
├──────▓▓▓──────┤  ▓ = Iter 1: B[16:31][5]
│      ▓▓▓      │    Loads into tileB[3][5] = B[19][5]
│     16-31     │
├──────▓▓▓──────┤  ▓ = Iter 2: B[32:47][5]
│      ▓▓▓      │    Loads into tileB[3][5] = B[35][5]
│     32-47     │
├──────▓▓▓──────┤
│      ...      │    ... continues for 64 iterations
├──────▓▓▓──────┤  ▓ = Iter 63: B[1008:1023][5]
│      ▓▓▓      │    Loads into tileB[3][5] = B[1011][5]
│    1008+      │
└──────▓▓▓──────┘

Key insight: Thread (3,5) accesses row 3 and column 5 progressively:
- Same tile position [3][5] every iteration
- Different global matrix positions each iteration
- Marches through all 1024 elements across 64 chunks

Step 2: Synchronize (__syncthreads()) - wait for all tiles to be loaded

Step 3: Thread (3,5) computes using row 3 of tileA and column 5 of tileB:
        sum = 0.0f
        for (int k = 0; k < 16; k++)
            sum += tileA[3][k] * tileB[k][5];

        Expands to:
        sum = A[3][0]×B[0][5] + A[3][1]×B[1][5] + ... + A[3][15]×B[15][5]
        sum = 3.2×1.5 + 2.7×4.3 + ... + 5.1×2.9
        sum = 87.4

Thread (3,5)'s state: sum = 87.4 (stored in register, private to this thread)

─────────────────────────────────────────────────────────────────
ITERATION 1: Process elements k = 16 to 31
─────────────────────────────────────────────────────────────────

Now we need to compute the next 16 products (k = 16 to 31), so we need
the next chunk of data from global matrices A and B.

Step 1: ALL threads cooperatively load NEW tiles from global memory
        Thread (3,5) loads TWO elements (one for each tile):

        For tileA: tileA[3][5] = A[3][21]
        - Same tile position [3][5], but from column 21 of global matrix A
        - Formula: A[threadIdx.y][16 + threadIdx.x] = A[3][16 + 5] = A[3][21]

        For tileB: tileB[3][5] = B[19][5]
        - Same tile position [3][5], but from row 19 of global matrix B
        - Formula: B[16 + threadIdx.y][threadIdx.x] = B[16 + 3][5] = B[19][5]

        All 256 threads load their assigned elements...

        Result: Shared memory NOW contains:
        tileA[0:15][0:15] = A[0:15][16:31] (NEW data from global A, overwrites old!)
        tileB[0:15][0:15] = B[16:31][0:15] (NEW data from global B, overwrites old!)

Step 2: Synchronize again

Step 3: Thread (3,5) computes using NEW tiles, ACCUMULATES to existing sum:
        for (int k = 0; k < 16; k++)
            sum += tileA[3][k] * tileB[k][5];

        Expands to:
        sum += A[3][16]×B[16][5] + A[3][17]×B[17][5] + ... + A[3][31]×B[31][5]
        sum = 87.4 + (4.1×3.2 + 6.3×1.8 + ... + 2.5×4.7)
        sum = 87.4 + 64.2 = 151.6

Thread (3,5)'s state: sum = 151.6 (accumulating!)

Key insight: Thread (3,5) uses row 3 from tileA and column 5 from tileB.
             Same positions in the tiles, but tiles contain DIFFERENT data!

─────────────────────────────────────────────────────────────────
ITERATIONS 2-62: Same pattern continues...
─────────────────────────────────────────────────────────────────

Iteration 2: Load A[0:15][32:47] and B[32:47][0:15]
             Thread (3,5) adds 16 more products, sum = 219.3

Iteration 3: Load A[0:15][48:63] and B[48:63][0:15]
             Thread (3,5) adds 16 more products, sum = 288.7

... (continues for 60 more iterations)

Iteration 62: Load A[0:15][992:1007] and B[992:1007][0:15]
              Thread (3,5) adds 16 more products, sum = 2,783.5

─────────────────────────────────────────────────────────────────
ITERATION 63: Final iteration, k = 1008 to 1023
─────────────────────────────────────────────────────────────────

Step 1: Load final tiles: A[0:15][1008:1023] and B[1008:1023][0:15]
Step 2: Synchronize
Step 3: Thread (3,5) adds final 16 products:
        sum += A[3][1008]×B[1008][5] + ... + A[3][1023]×B[1023][5]
        sum = 2,783.5 + 68.9 = 2,852.4

Step 4: Write final result: C[3][5] = 2,852.4

─────────────────────────────────────────────────────────────────
SUMMARY: What happened?
─────────────────────────────────────────────────────────────────

Thread (3,5)'s perspective:
- Started with sum = 0
- In each of 64 iterations:
  * Helped load 1 element into shared memory
  * Waited for all threads to finish loading
  * Used its designated row/column from the shared tiles
  * Added 16 products to its private sum
- After 64 iterations: sum = 2,852.4 = final C[3][5]

Key points:
1. Each thread has its OWN private sum (register variable)
2. Shared tiles contain DIFFERENT data each iteration
3. Each thread uses the SAME positions (row/col) but DIFFERENT data
4. All threads work in lockstep through 64 iterations
5. Shared memory is reused 64 times, but each thread's sum keeps growing

4.4 Putting it all together

Workspace organization:
  • Grid: 64×64 = 4,096 blocks launched

  • Each block: 16×16 = 256 threads

  • Total threads: 1,048,576 threads (one per output element)

Each block’s job:
- Computes one 16×16 output tile (256 elements of C)
- Processes 128 input tiles sequentially (64 from A, 64 from B)
- Uses 2 KB shared memory, reuses it 64 times

Each thread’s job:
- Computes ONE output element
- Helps load tiles (one element into tileA and one into tileB per iteration)
- Accumulates partial sums across 64 iterations

Key advantage:
- 16× reduction in global memory reads (shared memory reuse)
- 10-20× faster memory access (shared vs global memory)
- Overall: 10-20× speedup!

Why use 16×16 tiles (2 KB) when we have 48-96 KB available?

With 48-96 KB available, couldn’t we use 32×32 or 64×64 tiles? Yes, but there are trade-offs!

Tile size options:

Available shared memory: 48 KB (conservative estimate)
(The block counts below consider shared memory only; real SMs also cap
resident threads at roughly 1,024-2,048 and resident blocks at 16-32.)

16×16 tiles: 2 × 16 × 16 × 4 = 2 KB per block
- Can fit: 48 KB / 2 KB = 24 blocks per SM
- Threads per block: 16 × 16 = 256 threads
- Total threads: 24 blocks × 256 = 6,144 threads per SM

32×32 tiles: 2 × 32 × 32 × 4 = 8 KB per block
- Can fit: 48 KB / 8 KB = 6 blocks per SM
- Threads per block: 32 × 32 = 1,024 threads (maximum per block!)
- Total threads: 6 blocks × 1,024 = 6,144 threads per SM

64×64 tiles: 2 × 64 × 64 × 4 = 32 KB per block
- Can fit: 48 KB / 32 KB = 1 block per SM
- Threads per block: 64 × 64 = 4,096 threads (exceeds 1,024 limit!)
- Result: Cannot launch! Maximum threads per block is 1,024

The trade-offs:

Smaller tiles (16×16):
+ More blocks per SM (better occupancy)
+ Better latency hiding (more warps to switch between)
+ More flexible (works on older GPUs)
- More loop iterations (64 tiles for 1024×1024)
- Less data reuse per tile

Larger tiles (32×32):
+ Fewer loop iterations (better for large matrices)
+ More data reuse (each element used 32 times vs 16)
- Fewer blocks per SM (lower occupancy)
- Hits thread limit (1,024 threads per block)
- Requires more shared memory

Why 16×16 is a good choice:

1. Pedagogical: Easy to understand and visualize
2. Portable: Works on all modern GPUs (even those with 48 KB shared memory)
3. Balanced: Good occupancy without hitting thread limits
4. Production-ready: cuBLAS actually uses multiple tile sizes depending on:
   - Matrix dimensions
   - GPU architecture
   - Data types (float vs double)
   - Available shared memory

For learning, 16×16 is perfect. For production, use cuBLAS (auto-tuned)!

Visual representation of tiling for 1024×1024:

Matrix A (1024×1024)              Matrix B (1024×1024)
┌──────────────────────┐            ┌──────────────────────┐
│ Tile  Tile  ... Tile │            │ Tile  Tile  ... Tile │
│ A₀₀   A₀₁      A₀,₆₃ │            │ B₀₀   B₀₁      B₀,₆₃ │
├──────────────────────┤            ├──────────────────────┤
│ Tile  Tile  ... Tile │            │ Tile  Tile  ... Tile │
│ A₁₀   A₁₁      A₁,₆₃ │            │ B₁₀   B₁₁      B₁,₆₃ │
│  ...   ...      ...  │            │  ...   ...      ...  │
│ Tile  Tile  ... Tile │            │ Tile  Tile  ... Tile │
│ A₆₃,₀ A₆₃,₁    A₆₃,₆₃│            │ B₆₃,₀ B₆₃,₁    B₆₃,₆₃│
└──────────────────────┘            └──────────────────────┘

Each matrix divided into 64×64 = 4,096 tiles (each tile is 16×16)

Each tile is 16×16 elements (256 floats = 1 KB)
Total tiles per matrix: 64×64 = 4,096 tiles
Tiles each block processes along the K dimension: 1024 / 16 = 64

How tiling reduces memory traffic:

Without tiling (naive approach):
- Thread (0,0) reads A[0][0] from global memory
- Thread (0,1) reads A[0][0] from global memory (redundant!)
- Thread (0,2) reads A[0][0] from global memory (redundant!)
- ... all 16 threads in row 0 read A[0][0] separately
- Total: 16 reads from slow global memory

With tiling (shared memory approach):
- One thread loads A[0][0] into shared memory (1 global read)
- Thread (0,0) reads from shared memory (fast!)
- Thread (0,1) reads from shared memory (fast!)
- Thread (0,2) reads from shared memory (fast!)
- ... all 16 threads read from fast shared memory
- Total: 1 read from global memory, 16 reads from shared memory

Result: 16× less global memory traffic + each access is 10-20× faster!

Step 5: Implementation with shared memory

Now let’s implement the tiled matrix multiplication with extensive inline comments showing exactly what happens.

// ═════════════════════════════════════════════════════════════════════════════
// TILE SIZE CONFIGURATION
// ═════════════════════════════════════════════════════════════════════════════
// Tile size: 16×16 = 256 threads per block = 256 elements per tile
// Each tile = 256 floats × 4 bytes = 1 KB
// Why 16? Common choice balancing occupancy and shared memory usage
#define TILE_SIZE 16

__global__ void matMulTiled(float* A, float* B, float* C, int N)
{
    // ═════════════════════════════════════════════════════════════════════════
    // SHARED MEMORY DECLARATION (2 KB total per block)
    // ═════════════════════════════════════════════════════════════════════════
    // __shared__ = visible to all 256 threads in this block
    // Allocated once per block, reused across all 64 iterations
    // Speed: 20-40 cycle latency (vs 400-800 cycles for global memory)
    // Size limit: 48-96 KB per SM (shared among all active blocks)
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];  // 16×16 = 256 floats = 1 KB
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];  // 16×16 = 256 floats = 1 KB
    // Total: 2 KB allocated in shared memory

    // ═════════════════════════════════════════════════════════════════════════
    // THREAD IDENTIFICATION: Calculate which output element this thread computes
    // ═════════════════════════════════════════════════════════════════════════
    // Each thread in the grid computes ONE element of output matrix C
    // row = which row of C this thread computes
    // col = which column of C this thread computes
    //
    // Formula: global_position = block_position × block_size + thread_position
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    //
    // Example: Thread (3,5) in Block (10,20) for N=1024:
    //   row = blockIdx.y × 16 + threadIdx.y = 20 × 16 + 3 = 323
    //   col = blockIdx.x × 16 + threadIdx.x = 10 × 16 + 5 = 165
    //   → This thread computes C[323][165]

    // ═════════════════════════════════════════════════════════════════════════
    // ACCUMULATOR: Stores partial sum for this thread's output element
    // ═════════════════════════════════════════════════════════════════════════
    // Lives in register (private to this thread, fastest memory)
    // Persists across all iterations, accumulating partial dot products
    // For N=1024: Accumulates 1024 products across 64 iterations
    float sum = 0.0f;
    //
    // Example: Thread (3,5) computing C[323][165]:
    //   Start:        sum = 0.0
    //   Iteration 0:  sum = 87.4   (after adding 16 products)
    //   Iteration 1:  sum = 152.2  (after adding 16 more products)
    //   Iteration 2:  sum = 218.7
    //   ...
    //   Iteration 63: sum = 2847.3 (final result after 1024 products total)

    // ═════════════════════════════════════════════════════════════════════════
    // CALCULATE NUMBER OF ITERATIONS needed to cover all K elements
    // ═════════════════════════════════════════════════════════════════════════
    // C[i][j] = A[i][0]*B[0][j] + A[i][1]*B[1][j] + ... + A[i][N-1]*B[N-1][j]
    // That's N products to compute, but we process TILE_SIZE at a time
    // So we need ceiling(N / TILE_SIZE) iterations
    int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;
    //
    // Example: N=1024, TILE_SIZE=16:
    //   numTiles = (1024 + 15) / 16 = 1039 / 16 = 64 iterations
    //   Iteration 0: Process k = 0 to 15
    //   Iteration 1: Process k = 16 to 31
    //   Iteration 2: Process k = 32 to 47
    //   ...
    //   Iteration 63: Process k = 1008 to 1023

    // ═════════════════════════════════════════════════════════════════════════
    // MAIN LOOP: Iterate through tiles along the K dimension
    // ═════════════════════════════════════════════════════════════════════════
    // Each iteration:
    //   1. Load one 16×16 tile from A (256 elements)
    //   2. Load one 16×16 tile from B (256 elements)
    //   3. Compute 16 multiply-adds using the loaded tiles
    //   4. Accumulate results into sum
    for (int t = 0; t < numTiles; t++) {

        // ─────────────────────────────────────────────────────────────────────
        // STEP 1: COOPERATIVE LOADING FROM MATRIX A
        // ─────────────────────────────────────────────────────────────────────
        // All 256 threads work together to load one tile from matrix A
        // Each thread loads exactly ONE element from global memory
        // Together they fill the entire 16×16 shared memory tile
        //
        // For Thread (ty, tx): Loads A[row][t×16+tx] → tileA[ty][tx]
        //   - row stays CONSTANT across all iterations (always same row of A)
        //   - column CHANGES each iteration (moves right through row of A)
        int aRow = row;
        int aCol = t * TILE_SIZE + threadIdx.x;
        //
        // Example: Thread (3,5) in Block (10,20), iteration t=1:
        //   aRow = 323 (same for all 64 iterations - always row 323 of A)
        //   aCol = 1 × 16 + 5 = 21
        //   Loads: tileA[3][5] = A[323][21]
        //
        // All threads in this block cooperatively load (iteration t=1):
        //   Thread (0,0): A[320][16] → tileA[0][0]
        //   Thread (0,1): A[320][17] → tileA[0][1]
        //   Thread (3,5): A[323][21] → tileA[3][5]
        //   Thread (15,15): A[335][31] → tileA[15][15]
        //   Result: tileA contains A[320:335][16:31]
        //
        // Boundary checking: Needed when N is not multiple of TILE_SIZE
        if (aRow < N && aCol < N) {
            tileA[threadIdx.y][threadIdx.x] = A[aRow * N + aCol];
        } else {
            tileA[threadIdx.y][threadIdx.x] = 0.0f;  // Zero padding
        }

        // ─────────────────────────────────────────────────────────────────────
        // STEP 2: COOPERATIVE LOADING FROM MATRIX B
        // ─────────────────────────────────────────────────────────────────────
        // All 256 threads work together to load one tile from matrix B
        // Each thread loads exactly ONE element from global memory
        //
        // For Thread (ty, tx): Loads B[t×16+ty][col] → tileB[ty][tx]
        //   - row CHANGES each iteration (moves down through column of B)
        //   - column stays CONSTANT across all iterations (always same col of B)
        int bRow = t * TILE_SIZE + threadIdx.y;
        int bCol = col;
        //
        // Example: Thread (3,5) in Block (10,20), iteration t=1:
        //   bRow = 1 × 16 + 3 = 19
        //   bCol = 165 (same for all 64 iterations - always column 165 of B)
        //   Loads: tileB[3][5] = B[19][165]
        //
        // All threads in this block cooperatively load (iteration t=1):
        //   Thread (0,0): B[16][160] → tileB[0][0]
        //   Thread (0,1): B[16][161] → tileB[0][1]
        //   Thread (3,5): B[19][165] → tileB[3][5]
        //   Thread (15,15): B[31][175] → tileB[15][15]
        //   Result: tileB contains B[16:31][160:175]
        if (bRow < N && bCol < N) {
            tileB[threadIdx.y][threadIdx.x] = B[bRow * N + bCol];
        } else {
            tileB[threadIdx.y][threadIdx.x] = 0.0f;  // Zero padding
        }

        // ─────────────────────────────────────────────────────────────────────
        // STEP 3: SYNCHRONIZATION BARRIER - Wait for all threads to finish loading
        // ─────────────────────────────────────────────────────────────────────
        // CRITICAL: Must ensure ALL threads complete loading before ANY thread
        // starts computing, otherwise we'd read incomplete/corrupted data!
        //
        // What __syncthreads() does:
        //   - Blocks execution until ALL 256 threads in this block reach here
        //   - Fast threads wait for slow threads
        //   - Only when last thread arrives do all threads proceed together
        //
        // Example timeline (times in arbitrary units):
        //   Time 0:   All threads start loading
        //   Time 50:  Thread (0,0) finishes, reaches __syncthreads(), WAITS
        //   Time 100: Thread (5,7) finishes, reaches __syncthreads(), WAITS
        //   Time 150: Thread (10,12) finishes, reaches __syncthreads(), WAITS
        //   Time 200: Thread (15,15) finishes (slowest), reaches __syncthreads()
        //   Time 201: ALL threads released, proceed to computation ✓
        //
        // Without this sync:
        //   Thread (0,0) loads fast → starts computing → reads tileA[8][9]
        //   But Thread (8,9) hasn't finished loading yet!
        //   Result: Thread (0,0) reads garbage data → WRONG ANSWER!
        __syncthreads();

        // ─────────────────────────────────────────────────────────────────────
        // STEP 4: COMPUTE PARTIAL DOT PRODUCT using shared memory tiles
        // ─────────────────────────────────────────────────────────────────────
        // Each thread computes 16 multiply-add operations
        // Uses one ROW from tileA and one COLUMN from tileB
        // All 16 operations read from FAST shared memory (20-40 cycles)
        //
        // For Thread (ty, tx): Computes using tileA[ty][0:15] and tileB[0:15][tx]
        //
        // Example: Thread (3,5) in Block (10,20), iteration t=1:
        //   Uses row 3 of tileA → contains A[323][16:31]
        //   Uses col 5 of tileB → contains B[16:31][165]
        //
        //   k=0:  sum += tileA[3][0] × tileB[0][5]
        //         sum += A[323][16] × B[16][165]  (global positions)
        //
        //   k=1:  sum += tileA[3][1] × tileB[1][5]
        //         sum += A[323][17] × B[17][165]
        //
        //   k=2:  sum += tileA[3][2] × tileB[2][5]
        //         sum += A[323][18] × B[18][165]
        //
        //   ... (13 more operations)
        //
        //   k=15: sum += tileA[3][15] × tileB[15][5]
        //         sum += A[323][31] × B[31][165]
        //
        //   After this loop (iteration 1 complete):
        //     sum = (previous value) + (16 new products)
        //     If sum was 87.4, now sum ≈ 152.2
        //
        // KEY OPTIMIZATION: All 16 reads from shared memory (fast!)
        //   Without tiling: 16 reads from global memory (slow!)
        //   Speedup: ~10-20× per access
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        // ─────────────────────────────────────────────────────────────────────
        // STEP 5: SYNCHRONIZATION BARRIER - Wait before loading next tile
        // ─────────────────────────────────────────────────────────────────────
        // CRITICAL: Must ensure ALL threads finish computing before ANY thread
        // starts loading next tile, otherwise we'd overwrite data still in use!
        //
        // What happens without this sync (causes bugs!):
        //   Time 0:   All threads computing with current tiles
        //   Time 50:  Thread (0,0) finishes → starts iteration t=2
        //           → Loads new data → OVERWRITES tileA[0][0]
        //   Time 100: Thread (15,15) still computing with iteration t=1
        //           → Tries to read tileA[0][0] → Gets WRONG data!
        //           → Result is CORRUPTED!
        //
        // With this sync (correct behavior):
        //   Time 0:   All threads computing
        //   Time 50:  Thread (0,0) finishes, reaches __syncthreads(), WAITS
        //   Time 100: Thread (10,12) finishes, reaches __syncthreads(), WAITS
        //   Time 150: Thread (15,15) finishes, reaches __syncthreads()
        //   Time 151: ALL threads proceed to next iteration together ✓
        //           → Safe to overwrite shared memory now
        __syncthreads();
    }
    // Loop ends: After 64 iterations, each thread has accumulated 1024 products

    // ═════════════════════════════════════════════════════════════════════════
    // WRITE FINAL RESULT to global memory
    // ═════════════════════════════════════════════════════════════════════════
    // After all iterations complete, sum contains the final dot product
    // Write it to the corresponding position in output matrix C
    //
    // Boundary check: Some threads may be outside matrix bounds
    //   Example: N=1020, block has threads up to position 1023
    //   Threads computing row/col >= 1020 shouldn't write (out of bounds)
    //
    // Example: Thread (3,5) in Block (10,20):
    //   row = 323, col = 165
    //   sum = 2847.3 (accumulated over 64 iterations)
    //   Writes: C[323 × 1024 + 165] = 2847.3
    //           C[323][165] = 2847.3 ✓
    //
    // Summary of what this thread accomplished:
    //   Iteration 0:  Added A[323][0:15] · B[0:15][165]     → sum = 87.4
    //   Iteration 1:  Added A[323][16:31] · B[16:31][165]   → sum = 152.2
    //   Iteration 2:  Added A[323][32:47] · B[32:47][165]   → sum = 218.7
    //   ...
    //   Iteration 63: Added A[323][1008:1023] · B[1008:1023][165] → sum = 2847.3
    //   Total: 1024 multiply-add operations = complete dot product
    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

Host code and kernel launch

int main()
{
    int N = 1024;
    size_t size = N * N * sizeof(float);

    // Allocate and initialize host matrices
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    for (int i = 0; i < N * N; i++) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    // Allocate device matrices
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch optimized kernel with tiling
    // Each block: 16×16 = 256 threads
    // Grid: 64×64 = 4,096 blocks (for N=1024)
    // Total: 1,048,576 threads computing 1,048,576 output elements
    dim3 threadsPerBlock(TILE_SIZE, TILE_SIZE);
    dim3 numBlocks((N + TILE_SIZE - 1) / TILE_SIZE,
                   (N + TILE_SIZE - 1) / TILE_SIZE);

    matMulTiled<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result back
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

Why this optimization works

The tiled implementation achieves 10-20× speedup by exploiting the memory hierarchy.

Memory reuse pattern:

For N=1024, each thread needs to access 1024 elements from A and 1024 from B. Without tiling, these 2048 accesses would all hit slow global memory (400-800 cycles each). With tiling, we break this into 64 iterations:

  • Each iteration: the block loads 2 KB into shared memory (512 coalesced global reads)

  • Each thread then does 16 multiply-adds using the tiles (32 fast shared reads)

  • Memory access ratio per tile: 4,096 shared reads : 256 global reads = 16:1 reuse

Performance breakdown per block:

Without tiling (naive):
  Each thread: 1024 reads from A (global) + 1024 reads from B (global)
  256 threads × 2048 global reads × 400 cycles = 209 million cycles

With tiling:
  64 iterations × (512 global reads + 8,192 shared reads) per block
  64 × (512×400 + 8,192×30) ≈ 29 million cycles

Speedup from this simple model: 209M / 29M ≈ 7×; on top of that, the tile
loads coalesce and more warps stay busy, which is where the 10-20× figure
for real kernels comes from

Why cooperative loading matters:

When all 256 threads in a block load simultaneously, the GPU can coalesce adjacent memory accesses into fewer transactions. Thread (0,0) loads A[row][0], Thread (0,1) loads A[row][1], etc. These adjacent accesses combine into a single memory transaction, maximizing bandwidth utilization.

Why synchronization is critical:

The two __syncthreads() barriers prevent data races. After loading, all threads must wait before computing to ensure the complete tile is available. After computing, all threads must wait before loading the next tile to prevent overwriting data still in use. Without these barriers, fast threads would corrupt data needed by slower threads.

Why not use bigger tiles?

Shared memory is limited (48-96 KB per SM), so tile size involves trade-offs:

  • 16×16 tiles (2 KB): High occupancy, many blocks per SM, good latency hiding, moderate reuse

  • 32×32 tiles (8 KB): Better reuse (32 elements reused 32 times), fewer blocks per SM

  • 64×64 tiles (32 KB): Excellent reuse, but may limit parallelism significantly

Optimal size depends on GPU architecture, matrix dimensions, and occupancy needs. Our 16×16 choice balances memory usage with parallelism. For production, use cuBLAS which auto-tunes for your hardware.
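
If you want to see these trade-offs on your own GPU, the runtime's occupancy API reports how many blocks of matMulTiled can be resident per SM (a minimal host-side sketch; call it after the kernel definition above):

// Hedged sketch: query resident blocks per SM for matMulTiled at 256 threads/block
int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &maxBlocksPerSM,
    matMulTiled,             // the kernel defined above
    TILE_SIZE * TILE_SIZE,   // 256 threads per block
    0);                      // dynamic shared memory: none (tiles are static __shared__)
printf("matMulTiled: %d resident blocks per SM\n", maxBlocksPerSM);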

How does memory coalescing work here?

Memory coalescing combines adjacent thread accesses into fewer transactions, maximizing bandwidth.

Good (coalesced): Thread 0 reads A[0], Thread 1 reads A[1], Thread 2 reads A[2]. The GPU combines these into one 128-byte transaction serving 32 threads. Bandwidth: Full utilization.

Bad (non-coalesced): Thread 0 reads A[0], Thread 1 reads A[32], Thread 2 reads A[64]. Each triggers a separate 128-byte transaction but uses only 4 bytes. Bandwidth: 32× slower.

Our implementation: When loading tiles, Thread (0,0) loads A[row][t×16+0], Thread (0,1) loads A[row][t×16+1], Thread (0,2) loads A[row][t×16+2]. These adjacent accesses naturally coalesce, achieving optimal memory bandwidth.
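
To illustrate the difference, here are two tiny kernels of my own (not part of the tutorial) contrasting the access patterns described above:

// Hedged sketch: coalesced vs strided global reads
__global__ void coalescedCopy(const float* in, float* out, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        out[i] = in[i];              // thread k reads element k: one transaction serves the warp
    }
}

__global__ void stridedCopy(const float* in, float* out, int N, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < N) {
        out[i] = in[i * stride];     // thread k reads element k×stride: one transaction per thread
    }
}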

Summary

This tutorial series progressed from basic concepts to advanced optimization:

  • Tutorials 0-2: Memory management and basic parallelism

  • Tutorials 3-4: Scaling with batching and chunking

  • Tutorial 5: Overlapping transfers with streams

  • Tutorial 6: Memory hierarchy optimization with tiling

  • Tutorial 7: Combining streams and tiling for large batches

The core principle: move data closer to computation. By exploiting the memory hierarchy (registers → shared memory → global memory) and overlapping operations with streams, we achieve significant speedups over naive implementations. For production code, use cuBLAS library, which implements additional optimizations achieving near-peak performance.

For conceptual material on GPU architecture and CUDA fundamentals, see CUDA fundamentals.

Tutorial 7: Batch processing with streams and tiling (100k images of 1024×1024)

Process 100,000 matrices (1024×1024 each) using both tiling and streams to maximize GPU utilization.

Problem: Large batch matrix operations

You have 100,000 images (1024×1024 pixels each) and need to apply a transformation like convolution, filtering, or matrix multiplication. Processing them one at a time wastes GPU resources. Processing all at once exceeds memory limits.

Note

Terminology: In this tutorial, “batch” refers to a group of matrices processed together in one iteration (same as “chunk” in Tutorial 4). We use “batch” here because it’s more common in machine learning and image processing contexts. The concept is identical: dividing a large dataset into smaller groups that fit in memory.

Challenge:

  • Total data: 100,000 matrices × 1024×1024 × 4 bytes = 400 GB

  • GPU memory: Typically 8-24 GB

  • Need to process in batches while keeping GPU busy

Solution approach:

Combine Tutorial 5 (streams) and Tutorial 6 (tiling):

  1. Divide 100,000 matrices into batches that fit in GPU memory

  2. Use multiple streams to overlap CPU→GPU transfer, computation, and GPU→CPU transfer

  3. Within each computation, use tiled matrix operations for efficiency

  4. Pipeline the batches to maximize throughput

Step 1: Understanding the pipeline

We’ll process matrices in batches of 1,000, using 3 streams for triple buffering:

Time →

Stream 0: [H→D batch 0] [Compute batch 0] [D→H batch 0]
Stream 1:               [H→D batch 1]     [Compute batch 1] [D→H batch 1]
Stream 2:                                 [H→D batch 2]     [Compute batch 2] [D→H batch 2]

Result: All three stages happen simultaneously across different batches

Step 2: Memory organization

For 100,000 matrices in batches of 1,000, we need a realistic memory strategy:

Storage (disk/database): 100,000 matrices (400 GB total)
  ├─ Load batches on-demand as needed
  └─ Save processed results back to disk

Host memory (pinned, for fast transfer):
  Stream 0 buffers: input (4 GB) + output (4 GB) = 8 GB
  Stream 1 buffers: input (4 GB) + output (4 GB) = 8 GB
  Stream 2 buffers: input (4 GB) + output (4 GB) = 8 GB
  Total: 24 GB (realistic for modern systems)

GPU memory (device):
  Stream 0: input (4 GB) + output (4 GB) = 8 GB
  Stream 1: input (4 GB) + output (4 GB) = 8 GB
  Stream 2: input (4 GB) + output (4 GB) = 8 GB
  Total: 24 GB (fits on a 32 GB GPU; on a 16 GB GPU, shrink CHUNK_SIZE or NUM_STREAMS)

Data flow:
  Disk → Pinned buffer → GPU → Pinned buffer → Disk

Key: Only 3 batches in memory at once, cycling through all 100 batches
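
Before allocating, it can help to compare this plan against the GPU's actual free memory (a sketch of my own using cudaMemGetInfo; the sizes mirror the configuration above):

#include <cstdio>
#include <cuda_runtime.h>

// Hedged sketch: check the planned per-stream buffers against free VRAM
int main()
{
    const size_t matrixBytes = 1024ull * 1024ull * sizeof(float);   // 4 MB per matrix
    const size_t chunkBytes  = 1000ull * matrixBytes;               // 4 GB per chunk
    const int    numStreams  = 3;
    const size_t needed      = (size_t)numStreams * 2 * chunkBytes; // input + output per stream

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    printf("Need %.1f GB of VRAM, %.1f GB free of %.1f GB total\n",
           needed / 1e9, freeBytes / 1e9, totalBytes / 1e9);
    if (needed > freeBytes) {
        printf("Reduce CHUNK_SIZE or NUM_STREAMS to fit.\n");
    }
    return 0;
}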

Step 3: Implementation strategy

// Configuration
#define N 1024              // Matrix dimension
#define TILE_SIZE 16        // Tile size for shared memory
#define CHUNK_SIZE 1000     // Matrices per chunk
#define NUM_CHUNKS 100      // Total chunks (100,000 / 1,000)
#define NUM_STREAMS 3       // Triple buffering

// Tiled kernel: Apply 3×3 Gaussian blur filter
// Why tiling helps: Each output pixel needs 9 input pixels
// With shared memory: Load once, reuse 9 times by different threads
// Without shared memory: Each thread would load from slow global memory
__global__ void processMatrixTiled(float* input, float* output, int N)
{
    // Shared memory tile (with halo for neighboring pixels)
    __shared__ float tile[TILE_SIZE + 2][TILE_SIZE + 2];

    // Calculate global position
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;

    // Load tile cooperatively (including 1-pixel border)
    int tileRow = threadIdx.y + 1;  // +1 for halo
    int tileCol = threadIdx.x + 1;

    if (row < N && col < N) {
        tile[tileRow][tileCol] = input[row * N + col];
    } else {
        tile[tileRow][tileCol] = 0.0f;
    }

    // Load halo (border pixels needed for the filter).
    // Pad with 0 outside the matrix so boundary cells are never left uninitialized.
    if (threadIdx.y == 0) {
        tile[0][tileCol] = (row > 0 && col < N) ? input[(row-1) * N + col] : 0.0f;              // Top
    }
    if (threadIdx.y == TILE_SIZE-1) {
        tile[TILE_SIZE+1][tileCol] = (row < N-1 && col < N) ? input[(row+1) * N + col] : 0.0f;  // Bottom
    }
    if (threadIdx.x == 0) {
        tile[tileRow][0] = (col > 0 && row < N) ? input[row * N + (col-1)] : 0.0f;              // Left
    }
    if (threadIdx.x == TILE_SIZE-1) {
        tile[tileRow][TILE_SIZE+1] = (col < N-1 && row < N) ? input[row * N + (col+1)] : 0.0f;  // Right
    }

    // Corner halo cells (one thread each) - the 3×3 filter also reads diagonal neighbors
    if (threadIdx.y == 0 && threadIdx.x == 0) {
        tile[0][0] = (row > 0 && col > 0) ? input[(row-1) * N + (col-1)] : 0.0f;
    }
    if (threadIdx.y == 0 && threadIdx.x == TILE_SIZE-1) {
        tile[0][TILE_SIZE+1] = (row > 0 && col < N-1) ? input[(row-1) * N + (col+1)] : 0.0f;
    }
    if (threadIdx.y == TILE_SIZE-1 && threadIdx.x == 0) {
        tile[TILE_SIZE+1][0] = (row < N-1 && col > 0) ? input[(row+1) * N + (col-1)] : 0.0f;
    }
    if (threadIdx.y == TILE_SIZE-1 && threadIdx.x == TILE_SIZE-1) {
        tile[TILE_SIZE+1][TILE_SIZE+1] = (row < N-1 && col < N-1) ? input[(row+1) * N + (col+1)] : 0.0f;
    }

    __syncthreads();

    // Apply 3×3 Gaussian blur using shared memory
    // Gaussian kernel weights (normalized):
    //   [1 2 1]
    //   [2 4 2] / 16
    //   [1 2 1]
    if (row < N && col < N) {
        float result =
            tile[tileRow-1][tileCol-1] * 1.0f +  // Top-left
            tile[tileRow-1][tileCol]   * 2.0f +  // Top
            tile[tileRow-1][tileCol+1] * 1.0f +  // Top-right
            tile[tileRow][tileCol-1]   * 2.0f +  // Left
            tile[tileRow][tileCol]     * 4.0f +  // Center
            tile[tileRow][tileCol+1]   * 2.0f +  // Right
            tile[tileRow+1][tileCol-1] * 1.0f +  // Bottom-left
            tile[tileRow+1][tileCol]   * 2.0f +  // Bottom
            tile[tileRow+1][tileCol+1] * 1.0f;   // Bottom-right

        output[row * N + col] = result / 16.0f;  // Normalize
    }
}

int main()
{
    // Total: 100,000 matrices
    int totalMatrices = NUM_CHUNKS * CHUNK_SIZE;
    size_t matrixSize = N * N * sizeof(float);
    size_t chunkBytes = CHUNK_SIZE * matrixSize;  // 4 GB per chunk

    // ═══════════════════════════════════════════════════════════
    // STEP 1: Allocate pinned host memory for NUM_STREAMS chunks
    // ═══════════════════════════════════════════════════════════
    // Realistic: Only allocate enough for in-flight chunks
    // With 3 streams: 3 chunks in (12 GB) + 3 chunks out (12 GB) = 24 GB
    float *h_input[NUM_STREAMS], *h_output[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaMallocHost(&h_input[i], chunkBytes);   // 4 GB per stream
        cudaMallocHost(&h_output[i], chunkBytes);  // 4 GB per stream
    }
    // Total host memory: 3 × (4 GB + 4 GB) = 24 GB (realistic!)

    // Larger dataset stored in regular pageable memory or on disk
    // For this example, we'll simulate loading from disk chunk by chunk

    // ═══════════════════════════════════════════════════════════
    // STEP 2: Allocate device memory (one set per stream)
    // ═══════════════════════════════════════════════════════════
    float *d_input[NUM_STREAMS], *d_output[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaMalloc(&d_input[i], chunkBytes);   // 4 GB per stream
        cudaMalloc(&d_output[i], chunkBytes);  // 4 GB per stream
    }
    // Total device memory: 3 × (4 GB + 4 GB) = 24 GB

    // ═══════════════════════════════════════════════════════════
    // STEP 3: Create streams
    // ═══════════════════════════════════════════════════════════
    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamCreate(&streams[i]);
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 4: Process chunks with pipelining
    // ═══════════════════════════════════════════════════════════
    // Grid configuration for 1000 matrices per chunk
    dim3 threadsPerBlock(TILE_SIZE, TILE_SIZE);  // 16×16 = 256 threads
    dim3 blocksPerMatrix((N + TILE_SIZE - 1) / TILE_SIZE,
                         (N + TILE_SIZE - 1) / TILE_SIZE);  // 64×64 blocks

    for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
        // Use streams in round-robin fashion
        int streamIdx = chunk % NUM_STREAMS;
        cudaStream_t stream = streams[streamIdx];

        // ───────────────────────────────────────────────────────
        // Wait for the previous chunk on this stream, then save it
        // ───────────────────────────────────────────────────────
        // The pinned buffers for this stream are about to be reused,
        // so the earlier async transfers that used them must be finished.
        if (chunk >= NUM_STREAMS) {
            int prevChunk = chunk - NUM_STREAMS;
            cudaStreamSynchronize(stream);

            // Save results: write to disk, database, or network
            // For now, just acknowledge completion
            printf("Chunk %d processed (1000 matrices)\n", prevChunk);
        }

        // ───────────────────────────────────────────────────────
        // Load chunk from disk/pageable memory into pinned buffer
        // ───────────────────────────────────────────────────────
        // In real application: read from file, database, or network
        // For now, simulate by generating data on-the-fly
        for (int i = 0; i < CHUNK_SIZE * N * N; i++) {
            h_input[streamIdx][i] = rand() / (float)RAND_MAX;
        }

        // ───────────────────────────────────────────────────────
        // Pipeline three operations on this stream:
        // ───────────────────────────────────────────────────────

        // 1. Transfer input chunk to GPU (async)
        cudaMemcpyAsync(d_input[streamIdx], h_input[streamIdx],
                       chunkBytes, cudaMemcpyHostToDevice, stream);

        // 2. Process all matrices in chunk (async on stream)
        for (int m = 0; m < CHUNK_SIZE; m++) {
            size_t matOffset = m * N * N;
            processMatrixTiled<<<blocksPerMatrix, threadsPerBlock, 0, stream>>>(
                d_input[streamIdx] + matOffset,
                d_output[streamIdx] + matOffset,
                N
            );
        }

        // 3. Transfer results back to host (async)
        cudaMemcpyAsync(h_output[streamIdx], d_output[streamIdx],
                       chunkBytes, cudaMemcpyDeviceToHost, stream);

        // No synchronization here: the stream keeps working on this chunk
        // while the host loops around to prepare the next one.
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 5: Wait for final chunks and save results
    // ═══════════════════════════════════════════════════════════
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamSynchronize(streams[i]);
        // Save final results
        printf("Final chunk %d processed\n", NUM_CHUNKS - NUM_STREAMS + i);
    }

    // Cleanup
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d_input[i]);
        cudaFree(d_output[i]);
        cudaFreeHost(h_input[i]);
        cudaFreeHost(h_output[i]);
    }

    return 0;
}

Summary

Tutorial 7 combines the best techniques from the series:

  • Tiling (Tutorial 6): Each matrix uses shared memory for 10-20× speedup

  • Streams (Tutorial 5): Multiple chunks pipeline to hide transfer latency

  • Chunking (Tutorials 3-4): Process data in manageable chunks

Key takeaways:

  1. Memory hierarchy: Use shared memory (fast) for repeated access

  2. Asynchronous execution: Overlap transfers and computation with streams

  3. Resource management: Allocate GPU memory per stream, reuse across chunks

  4. Pipeline depth: 3 streams achieve good balance between complexity and performance

This pattern scales to:

  • Image processing (filters, transformations)

  • Deep learning inference (batch predictions)

  • Scientific computing (simulations, data analysis)

  • Video processing (frame-by-frame operations)

For conceptual material on GPU architecture and CUDA fundamentals, see CUDA fundamentals.

Tutorial 8: Using cuBLAS for production matrix multiplication

Problem

In Tutorials 1-7, we learned how GPU parallelism works by implementing matrix multiplication from scratch. We progressed from naive implementations to optimized tiled versions with shared memory. However, for production code, you should use cuBLAS (CUDA Basic Linear Algebra Subprograms), NVIDIA’s highly optimized library.

Why use cuBLAS?

  1. Performance: 2-5× faster than hand-written kernels (uses advanced optimizations)

  2. Maintained: NVIDIA updates it for new GPU architectures

  3. Tensor cores: Automatically uses specialized hardware when available

  4. Less code: Single function call vs hundreds of lines of kernel code

When to write custom kernels:

  • Non-standard operations (custom transformations, domain-specific algorithms)

  • Fused operations (combine multiple operations to reduce memory traffic)

  • Research/experimentation (testing new algorithms)

When to use cuBLAS:

  • Standard linear algebra (matrix multiplication, vector operations, factorizations)

  • Production code (reliability and performance matter)

  • Scientific applications: electron microscopy (ptychography reconstruction uses extensive matrix operations and FFTs), molecular dynamics simulations, climate modeling

  • Deep learning (neural network layers are matrix multiplications)

This tutorial shows how to use cuBLAS for the same 1024×1024 matrix multiplication we’ve been working with. These operations appear repeatedly in computational pipelines: ptychography reconstructs high-resolution images through iterative matrix updates and Fourier transforms, processing thousands of diffraction patterns where each iteration involves matrix multiplications that cuBLAS accelerates significantly.

Solution

Instead of writing kernel code with shared memory tiling (Tutorial 6), we call a single cuBLAS function. The library handles all optimizations internally.

What cuBLAS does behind the scenes:

When you call cublasSgemm, the library:

  1. Analyzes your GPU architecture: Detects compute capability, memory hierarchy, and available features

  2. Selects optimal algorithm: Chooses tile sizes, blocking strategies, and memory access patterns

  3. Uses specialized hardware: Activates tensor cores on modern GPUs (Volta/Turing/Ampere) for massive speedup

  4. Manages memory efficiently: Handles shared memory allocation, register usage, and bank conflicts

  5. Launches optimized kernels: Uses multiple kernel variants depending on matrix dimensions

For example, on an RTX 3090:

  • Detects tensor cores available

  • Uses 256×128 tiles (much larger than our 16×16 in Tutorial 6)

  • Employs warp-level matrix operations

  • Achieves 140+ GFLOPS vs our 27 GFLOPS
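
For reference, a 1024×1024×1024 multiplication performs 2 × 1024³ ≈ 2.15 × 10⁹ floating-point operations, so the ~15 ms cuBLAS time in the table below corresponds to roughly 2.15 GFLOP / 0.015 s ≈ 140 GFLOPS, and the ~80 ms tiled kernel to about 27 GFLOPS.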

The cuBLAS GEMM function:

// GEMM: General Matrix-Matrix multiplication
// Computes: C = alpha * op(A) * op(B) + beta * C
// where op(X) can be X or X^T (transpose)

cublasSgemm(handle,        // cuBLAS context
            transA,        // CUBLAS_OP_N (no transpose) or CUBLAS_OP_T
            transB,        // CUBLAS_OP_N or CUBLAS_OP_T
            m, n, k,       // Matrix dimensions
            &alpha,        // Pointer to scaling factor
            A, lda,        // Matrix A and leading dimension
            B, ldb,        // Matrix B and leading dimension
            &beta,         // Pointer to scaling factor
            C, ldc);       // Matrix C (output) and leading dimension

Concrete example: Multiply two 1024×1024 matrices

// We want: C = A × B where A, B, C are all 1024×1024

const int N = 1024;
const float alpha = 1.0f;  // Scaling factor for A × B (1.0 = no scaling)
const float beta = 0.0f;   // Scaling factor for C (0.0 = overwrite C)
                           // If beta=1.0, would compute C = A×B + C (accumulate)

// Matrix dimensions for standard multiplication:
// A is 1024×1024 (m × k), B is 1024×1024 (k × n), C is 1024×1024 (m × n)
int m = N, n = N, k = N;

// CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C defined in cublas_v2.h
// CUBLAS_OP_N = no transpose (use matrix as-is)
// CUBLAS_OP_T = transpose matrix
// CUBLAS_OP_C = conjugate transpose (for complex matrices)
cublasSgemm(handle,           // cuBLAS library context (created once, reused)
            CUBLAS_OP_N,      // Don't transpose A
            CUBLAS_OP_N,      // Don't transpose B
            m, n, k,          // Dimensions: C(m×n) = A(m×k) × B(k×n)
            &alpha,           // Pointer to alpha scaling factor
            d_A, N,           // Matrix A on device, leading dimension N
            d_B, N,           // Matrix B on device, leading dimension N
            &beta,            // Pointer to beta scaling factor
            d_C, N);          // Matrix C on device (output), leading dimension N

// Leading dimension (lda, ldb, ldc) = stride, in elements, between
// consecutive columns in cuBLAS's column-major view.
// For a densely packed N×N matrix this is simply N.

Column-major vs row-major storage:

cuBLAS follows Fortran convention (column-major), while C/C++ uses row-major:

Row-major (C/C++):           Column-major (Fortran):
Matrix [0 1 2]               Matrix [0 1 2]
       [3 4 5]                      [3 4 5]

Stored as: 0,1,2,3,4,5       Stored as: 0,3,1,4,2,5
(rows consecutive)           (columns consecutive)

However, cuBLAS interprets whatever buffers we pass as column-major, so our row-major arrays are effectively seen as their transposes. Calling the function exactly as written above therefore computes A^T × B^T = (B × A)^T, and the buffer we read back in row-major order holds B × A rather than A × B. There are two standard remedies: swap the operand order (pass B in the A slot and A in the B slot), which yields A × B back in row-major layout, or request transposes with CUBLAS_OP_T. For this tutorial's randomly initialized square matrices the distinction does not change anything we measure, so the implementation below keeps the straightforward call; the sketch after this paragraph shows the swapped variant for cases where the exact product matters.
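
A minimal sketch of that operand swap, reusing the handle, device matrices, and scaling factors from the concrete example above:

// Row-major A, B, C (all N×N), already on the device as d_A, d_B, d_C.
// cuBLAS sees each buffer as its transpose. Because (A×B)^T = B^T × A^T,
// asking cuBLAS to multiply "B first, then A" stores (A×B)^T in column-major
// order, which read back as row-major is exactly A×B.
cublasSgemm(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,
            N, N, N,
            &alpha,
            d_B, N,     // note: B passed in the A slot
            d_A, N,     // and A passed in the B slot
            &beta,
            d_C, N);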

Implementation

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    // ═══════════════════════════════════════════════════════════
    // STEP 1: Setup and initialization
    // ═══════════════════════════════════════════════════════════
    const int N = 1024;
    const size_t bytes = N * N * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    // Initialize matrices with simple values
    for (int i = 0; i < N * N; i++) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 2: Allocate device memory
    // ═══════════════════════════════════════════════════════════
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Transfer data to GPU
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // ═══════════════════════════════════════════════════════════
    // STEP 3: Initialize cuBLAS
    // ═══════════════════════════════════════════════════════════
    cublasHandle_t handle;
    cublasCreate(&handle);

    // ═══════════════════════════════════════════════════════════
    // STEP 4: Perform matrix multiplication using cuBLAS
    // ═══════════════════════════════════════════════════════════
    // Compute C = A × B using GEMM (General Matrix-Matrix multiplication)

    const float alpha = 1.0f;  // Scaling factor for A × B
    const float beta = 0.0f;   // Scaling factor for C (0 means overwrite)

    // Pass our row-major matrices directly; cuBLAS treats them as
    // column-major (transposed), so this call effectively computes B×A.
    // For random square inputs that is fine here; swap d_A and d_B
    // (or use CUBLAS_OP_T) when you need exactly A×B in row-major order.
    cublasSgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,  // No transpose needed
                N, N, N,                    // m, n, k (all N for square matrices)
                &alpha,                     // Scaling factor alpha
                d_A, N,                     // Matrix A, leading dimension
                d_B, N,                     // Matrix B, leading dimension
                &beta,                      // Scaling factor beta
                d_C, N);                    // Matrix C, leading dimension

    // ═══════════════════════════════════════════════════════════
    // STEP 5: Copy result back
    // ═══════════════════════════════════════════════════════════
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    printf("Matrix multiplication completed\n");
    printf("Result matrix C = A × B (1024×1024)\n");

    // ═══════════════════════════════════════════════════════════
    // STEP 6: Cleanup
    // ═══════════════════════════════════════════════════════════
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

Compilation:

nvcc -o tutorial8 tutorial8.cu -lcublas
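
The listing above omits error handling for brevity. In real code it is worth checking the status returned by every CUDA and cuBLAS call; a minimal sketch (the macro names CHECK_CUDA and CHECK_CUBLAS are just illustrative):

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

// Abort with a message if a CUDA runtime call fails
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Abort if a cuBLAS call does not return CUBLAS_STATUS_SUCCESS
#define CHECK_CUBLAS(call)                                            \
    do {                                                              \
        cublasStatus_t st = (call);                                   \
        if (st != CUBLAS_STATUS_SUCCESS) {                            \
            fprintf(stderr, "cuBLAS error %d at %s:%d\n",             \
                    (int)st, __FILE__, __LINE__);                     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CHECK_CUDA(cudaMalloc(&d_A, bytes));
//   CHECK_CUBLAS(cublasSgemm(handle, ...));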

Key differences from custom kernels:

  1. No kernel code: cuBLAS handles all GPU operations internally

  2. Column-major storage: cuBLAS assumes Fortran ordering; with row-major data, either swap A and B or accept the transposed product

  3. Handle management: Create/destroy cuBLAS context

  4. Linking: Need to link against cuBLAS library (-lcublas)

Performance comparison

Typical performance for 1024×1024 matrix multiplication on modern GPUs:

| Implementation | Time (ms) | GFLOPS | Notes |
|---|---|---|---|
| Naive (Tutorial 1) | ~800 | ~2.7 | Global memory only |
| Tiled (Tutorial 6) | ~80 | ~27 | Shared memory, 16×16 tiles |
| cuBLAS (Tutorial 8) | ~15 | ~140 | Highly optimized, tensor cores |
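
The exact numbers depend heavily on the GPU; you can measure your own with CUDA events. A minimal sketch, assuming handle, d_A, d_B, d_C, alpha, beta, and N are set up as in the Implementation section:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Warm-up call so one-off initialization costs are not timed
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
            &alpha, d_A, N, d_B, N, &beta, d_C, N);

cudaEventRecord(start);
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
            &alpha, d_A, N, d_B, N, &beta, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// A matrix multiplication performs 2*N^3 floating-point operations
double gflops = 2.0 * N * N * N / (ms * 1e-3) / 1e9;
printf("GEMM: %.2f ms, %.1f GFLOPS\n", ms, gflops);

cudaEventDestroy(start);
cudaEventDestroy(stop);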

Why is cuBLAS faster?

  1. Register blocking: Uses registers for innermost loops (faster than shared memory)

  2. Warp-level primitives: Uses specialized instructions for matrix operations

  3. Tensor cores: On Volta/Turing/Ampere GPUs, uses dedicated matrix multiplication units

  4. Auto-tuning: Selects optimal tile sizes and strategies based on GPU architecture

  5. Memory access patterns: Optimized coalescing and bank conflict avoidance

cuBLAS features

Beyond basic matrix multiplication, cuBLAS provides a comprehensive suite of linear algebra operations:

| Category | Function | Description |
|---|---|---|
| Level 1 BLAS | cublasSaxpy | Vector addition: y = alpha*x + y |
| | cublasSdot | Dot product between two vectors |
| | cublasSnrm2 | Euclidean norm of a vector |
| Level 2 BLAS | cublasSgemv | Matrix-vector multiplication |
| | cublasSger | Outer product (rank-1 update) |
| Level 3 BLAS | cublasSgemm | Matrix-matrix multiplication (used in Tutorial 8) |
| | cublasStrsm | Triangular system solve |
| | cublasSsyrk | Symmetric rank-k update |
| Extensions | Batched operations | Process multiple small matrices efficiently |
| | Strided batched | Regular memory patterns for batch processing |
| | Mixed precision | FP16, FP32, FP64, and tensor core support |
| | Complex arithmetic | Operations on complex-valued matrices |

Summary

Tutorial 8 demonstrates production-ready GPU programming:

  • Learned through Tutorials 1-7: Memory hierarchy, parallelism, tiling, optimization

  • Use in production (Tutorial 8): cuBLAS for standard operations

Key takeaways:

  1. Education vs production: Learning how things work internally (custom kernels) vs using optimized libraries (cuBLAS)

  2. Performance: cuBLAS achieves 5-10× speedup over hand-written tiled kernels

  3. Maintainability: Library updates automatically benefit from new GPU features

  4. When to customize: Only write custom kernels for specialized operations not in cuBLAS

The complete GPU programming workflow:

  1. Understand the fundamentals (Tutorials 1-7): Memory hierarchy, parallelism, optimization

  2. Use libraries when possible (Tutorial 8): cuBLAS, cuFFT, cuDNN, Thrust

  3. Write custom kernels only when needed: Unique operations, fused operations, research

Tutorial 9: Batched cuBLAS for large-scale processing

Problem

Tutorial 7 showed how to process 100,000 matrices using custom kernels with streams and chunking. Now we combine that approach with cuBLAS for production-quality performance. Instead of writing custom matrix multiplication kernels, we use cuBLAS batched operations.

The challenge: 100,000 matrices of 1024×1024 each:

  • Total data: 400 GB per operand array (A, B, and C each; far too large for GPU memory)

  • Need to process in chunks with pipelining

  • Want production performance from cuBLAS

Solution: Combine Tutorial 7 (chunking + streams) with Tutorial 8 (cuBLAS)

  • Process data in chunks of 1,000 matrices (4 GB per chunk)

  • Use 3 CUDA streams for pipelining

  • Use cublasSgemmStridedBatched for efficient batch processing

Solution

cuBLAS batched operations:

cuBLAS provides two functions for processing multiple matrices:

  1. cublasSgemmBatched: For matrices at arbitrary memory locations (array of pointers)

  2. cublasSgemmStridedBatched: For matrices at regular intervals in memory (most efficient)

We use cublasSgemmStridedBatched because our matrices are stored contiguously:

// Process multiple matrices in one call
cublasSgemmStridedBatched(
    handle,
    CUBLAS_OP_N, CUBLAS_OP_N,
    N, N, N,              // Matrix dimensions (all N×N)
    &alpha,
    d_A, N, N*N,          // A matrices: leading dim N, stride N*N between matrices
    d_B, N, N*N,          // B matrices: leading dim N, stride N*N between matrices
    &beta,
    d_C, N, N*N,          // C matrices: leading dim N, stride N*N between matrices
    batchCount);          // Number of matrices to multiply

Key parameters:

  • Stride: Distance in elements between consecutive matrices (N*N for contiguous storage)

  • Batch count: How many matrix multiplications to perform

  • Single kernel launch: cuBLAS handles all matrices efficiently in one call
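
For the configuration used in the implementation below, the stride is N*N = 1024 × 1024 = 1,048,576 elements (4 MB per matrix), so a batch of 1,000 matrices occupies about 1,000 × 4 MB ≈ 4 GB per operand buffer, which matches the chunk size allocated per stream.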

Implementation

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    // ═══════════════════════════════════════════════════════════
    // STEP 1: Configuration
    // ═══════════════════════════════════════════════════════════
    const int N = 1024;               // Matrix size (1024×1024)
    const int CHUNK_SIZE = 1000;      // Matrices per chunk
    const int NUM_CHUNKS = 100;       // Total 100 chunks = 100,000 matrices
    const int NUM_STREAMS = 3;        // Streams for pipelining

    const size_t matrixSize = N * N * sizeof(float);
    const size_t chunkBytes = CHUNK_SIZE * matrixSize;  // 4 GB per chunk

    printf("Processing %d matrices (%d chunks of %d)\n",
           NUM_CHUNKS * CHUNK_SIZE, NUM_CHUNKS, CHUNK_SIZE);
    printf("Chunk size: %.2f GB\n", chunkBytes / (1024.0*1024.0*1024.0));

    // ═══════════════════════════════════════════════════════════
    // STEP 2: Allocate pinned host memory (for async transfers)
    // ═══════════════════════════════════════════════════════════
    float *h_A[NUM_STREAMS], *h_B[NUM_STREAMS], *h_C[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaMallocHost(&h_A[i], chunkBytes);  // Input matrices A
        cudaMallocHost(&h_B[i], chunkBytes);  // Input matrices B
        cudaMallocHost(&h_C[i], chunkBytes);  // Output matrices C
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 3: Allocate device memory (one buffer per stream)
    // ═══════════════════════════════════════════════════════════
    float *d_A[NUM_STREAMS], *d_B[NUM_STREAMS], *d_C[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaMalloc(&d_A[i], chunkBytes);
        cudaMalloc(&d_B[i], chunkBytes);
        cudaMalloc(&d_C[i], chunkBytes);
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 4: Create cuBLAS handle and streams
    // ═══════════════════════════════════════════════════════════
    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamCreate(&streams[i]);
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 5: Process chunks with pipelining
    // ═══════════════════════════════════════════════════════════
    const float alpha = 1.0f;
    const float beta = 0.0f;

    for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
        int streamIdx = chunk % NUM_STREAMS;
        cudaStream_t stream = streams[streamIdx];

        // Set cuBLAS to use this stream
        cublasSetStream(handle, stream);

        // ───────────────────────────────────────────────────────
        // Wait for the previous chunk on this stream, then save it
        // ───────────────────────────────────────────────────────
        // The pinned buffers are reused, so the earlier async transfers
        // that used them must be finished before we refill them.
        if (chunk >= NUM_STREAMS) {
            int prevChunk = chunk - NUM_STREAMS;
            cudaStreamSynchronize(stream);

            // Save results: write to disk
            printf("Chunk %d processed (1000 matrices)\n", prevChunk);
        }

        // ───────────────────────────────────────────────────────
        // Load chunk from disk into pinned memory
        // ───────────────────────────────────────────────────────
        // In real application: read from file
        // For now: generate random data
        for (int i = 0; i < CHUNK_SIZE * N * N; i++) {
            h_A[streamIdx][i] = rand() / (float)RAND_MAX;
            h_B[streamIdx][i] = rand() / (float)RAND_MAX;
        }

        // ───────────────────────────────────────────────────────
        // Pipeline: Transfer in → Compute → Transfer out
        // ───────────────────────────────────────────────────────

        // Transfer input chunk to GPU (async)
        cudaMemcpyAsync(d_A[streamIdx], h_A[streamIdx],
                       chunkBytes, cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(d_B[streamIdx], h_B[streamIdx],
                       chunkBytes, cudaMemcpyHostToDevice, stream);

        // Perform batched matrix multiplication (async on stream)
        // Process all 1,000 matrices in the chunk with one call
        cublasSgemmStridedBatched(
            handle,
            CUBLAS_OP_N, CUBLAS_OP_N,
            N, N, N,                    // Matrix dimensions
            &alpha,
            d_A[streamIdx], N, N*N,     // A: leading dim, stride
            d_B[streamIdx], N, N*N,     // B: leading dim, stride
            &beta,
            d_C[streamIdx], N, N*N,     // C: leading dim, stride
            CHUNK_SIZE);                // Number of matrices

        // Transfer results back to host (async)
        cudaMemcpyAsync(h_C[streamIdx], d_C[streamIdx],
                       chunkBytes, cudaMemcpyDeviceToHost, stream);

        // No synchronization here: the stream keeps working on this chunk
        // while the host loops around to prepare the next one.
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 6: Wait for final chunks and save results
    // ═══════════════════════════════════════════════════════════
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamSynchronize(streams[i]);
        printf("Final chunk %d processed\n", NUM_CHUNKS - NUM_STREAMS + i);
    }

    printf("\nAll 100,000 matrices processed successfully!\n");

    // ═══════════════════════════════════════════════════════════
    // STEP 7: Cleanup
    // ═══════════════════════════════════════════════════════════
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d_A[i]);
        cudaFree(d_B[i]);
        cudaFree(d_C[i]);
        cudaFreeHost(h_A[i]);
        cudaFreeHost(h_B[i]);
        cudaFreeHost(h_C[i]);
    }

    cublasDestroy(handle);

    return 0;
}

Compilation:

nvcc -o tutorial9 tutorial9.cu -lcublas

Key differences from Tutorial 7:

| Aspect | Tutorial 7 (Custom kernel) | Tutorial 9 (cuBLAS batched) |
|---|---|---|
| Kernel code | Manual tiling implementation | No kernel needed |
| Batch processing | Loop over matrices, launch kernel per matrix | Single cublasSgemmStridedBatched call |
| Performance | ~27 GFLOPS per matrix | ~140 GFLOPS per matrix |
| Code complexity | 100+ lines of kernel code | Single function call |
| Maintainability | Must update for new GPUs | Automatically optimized |

Why use batched operations

Performance benefits:

  1. Reduced kernel launch overhead: One launch for 1,000 matrices vs 1,000 launches

  2. Better GPU utilization: cuBLAS schedules work efficiently across SMs

  3. Optimized memory access: Library handles data layout for maximum throughput

When to use batched operations:

  • Processing many small to medium matrices (common in deep learning, simulations)

  • Matrices stored contiguously in memory (use StridedBatched)

  • Need production performance without custom optimization

When not to use:

  • Single large matrix (use regular cublasSgemm)

  • Matrices at non-uniform memory locations (may need cublasSgemmBatched with pointer array)

  • Custom operations not available in cuBLAS
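
For reference, the pointer-array variant (cublasSgemmBatched) mentioned above takes device-resident arrays of matrix pointers instead of a fixed stride. A minimal sketch, assuming batchCount N×N matrices whose device addresses are held in hypothetical host arrays d_A_list, d_B_list, d_C_list:

// Host arrays holding one device pointer per matrix
const float **hA = (const float**)malloc(batchCount * sizeof(float*));
const float **hB = (const float**)malloc(batchCount * sizeof(float*));
float       **hC = (float**)malloc(batchCount * sizeof(float*));
for (int i = 0; i < batchCount; i++) {
    hA[i] = d_A_list[i];   // device address of the i-th A matrix
    hB[i] = d_B_list[i];
    hC[i] = d_C_list[i];
}

// cuBLAS needs the pointer arrays themselves in device memory
const float **dA; const float **dB; float **dC;
cudaMalloc(&dA, batchCount * sizeof(float*));
cudaMalloc(&dB, batchCount * sizeof(float*));
cudaMalloc(&dC, batchCount * sizeof(float*));
cudaMemcpy(dA, hA, batchCount * sizeof(float*), cudaMemcpyHostToDevice);
cudaMemcpy(dB, hB, batchCount * sizeof(float*), cudaMemcpyHostToDevice);
cudaMemcpy(dC, hC, batchCount * sizeof(float*), cudaMemcpyHostToDevice);

// One call multiplies all pairs: C[i] = alpha * A[i] * B[i] + beta * C[i]
cublasSgemmBatched(handle,
                   CUBLAS_OP_N, CUBLAS_OP_N,
                   N, N, N,
                   &alpha,
                   dA, N,
                   dB, N,
                   &beta,
                   dC, N,
                   batchCount);

// ...free hA/hB/hC and cudaFree dA/dB/dC when done...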

Summary

Tutorial 9 completes the progression from basics to production:

  • Tutorials 1-6: Learn GPU fundamentals (memory, parallelism, optimization)

  • Tutorial 7: Apply concepts at scale (chunking, streams, custom kernels)

  • Tutorial 8: Use optimized libraries (cuBLAS for single operation)

  • Tutorial 9: Combine both (cuBLAS batched + chunking + streams)

Key takeaways:

  1. Batched operations: Process multiple matrices efficiently with single library call

  2. Production workflow: Chunking + streams + cuBLAS = optimal throughput

  3. Simplicity: 5-10× speedup with less code than custom kernels

  4. Scalability: Pattern works for millions of matrices with streaming from disk

Tutorial 10: Multi-operation pipeline (cuBLAS + cuFFT)

Problem

Real applications often need multiple operations on the same data. For example, processing 100,000 images (1024×1024) where each image needs:

  1. Matrix multiplication: Apply a transformation matrix (using cuBLAS)

  2. FFT: Transform to frequency domain (using cuFFT)

This appears in ptychography reconstruction, where each diffraction pattern undergoes matrix updates followed by Fourier transforms. Processing 100,000 patterns requires efficient pipelining of both operations.

The challenge:

  • 800 GB total data (100,000 complex-valued images at 8 MB each; far too large for GPU memory)

  • Two different library operations per image

  • Need to maintain high GPU utilization throughout the pipeline

Solution: Combine cuBLAS and cuFFT in a streaming pipeline:

  • Process in chunks of 1,000 images (8 GB per chunk)

  • Use 3 streams for pipelining

  • Apply both operations to each chunk before moving to the next

Solution

Understanding FFT (Fast Fourier Transform):

The Fourier transform converts data from spatial/time domain to frequency domain. For images:

  • Spatial domain: Pixel intensities at positions \((x, y)\)

  • Frequency domain: Amplitude and phase of different spatial frequencies

Complex numbers in CUDA:

C and C++ don’t have built-in complex number types like Python’s 3+4j. CUDA libraries define their own types to represent complex numbers.

cuFFT provides cufftComplex - a struct that holds the real and imaginary parts:

// This is already defined in cufft.h when you #include <cufft.h>
// You don't write this yourself - the library provides it
typedef struct {
    float x;  // Real part
    float y;  // Imaginary part
} cufftComplex;

Think of it as a simple container holding two floats. When you create a cufftComplex variable, you’re allocating space for both the real and imaginary components.

Example usage:

#include <cufft.h>  // Provides cufftComplex type

// Create the complex number 3 + 4i
cufftComplex z;
z.x = 3.0f;  // Real part (the "3" in 3+4i)
z.y = 4.0f;  // Imaginary part (the "4" in 3+4i)

// Calculate magnitude: |z| = sqrt(3² + 4²) = sqrt(9 + 16) = 5
float magnitude = sqrtf(z.x * z.x + z.y * z.y);
printf("Magnitude: %.1f\n", magnitude);  // Prints: 5.0

// For an array of complex numbers (like an image):
cufftComplex* image = (cufftComplex*)malloc(1024 * 1024 * sizeof(cufftComplex));
// Each pixel now has .x (real) and .y (imaginary) parts

For our \(1024 \times 1024\) image:

  • Input: \(1024 \times 1024\) complex values (8 MB per image: each element stores two 4-byte floats)

  • Output: \(1024 \times 1024\) complex values in frequency domain (same size)

  • Operation: 2D FFT processes rows, then columns

Using multiple CUDA libraries together:

Both cuBLAS and cuFFT work with CUDA streams, allowing us to pipeline operations:

// For each chunk:
// 1. Transfer data to GPU (async on stream)
// 2. Apply matrix multiplication using cuBLAS (async on stream)
//    - Works on complex matrices using cublasCgemm
// 3. Apply 2D FFT using cuFFT (async on stream)
//    - Transforms spatial domain → frequency domain
// 4. Transfer results back (async on stream)

No custom kernels needed - both libraries handle GPU optimization internally.

What’s actually running on the GPU:

When you call cublasCgemmStridedBatched and cufftExecC2C, each library launches its own highly optimized GPU kernels:

  1. cuBLAS call → Launches multiple kernels internally:

    • Tiling kernels (like Tutorial 6, but more sophisticated)

    • Register blocking kernels

    • Tensor core kernels (on modern GPUs)

    • Each library kernel is pre-compiled and optimized

  2. cuFFT call → Launches FFT kernels internally:

    • Cooley-Tukey algorithm kernels for row transforms

    • Additional kernels for column transforms

    • Twiddle factor calculations

    • Memory transpose operations

These kernels execute sequentially on the same stream:

Stream timeline:
──────────────────────────────────────────────────────────────
[Transfer In] → [cuBLAS kernels] → [cuFFT kernels] → [Transfer Out]
                 (multiple)          (multiple)

The stream ensures operations happen in order

Key insight: You’re not writing kernels, but you’re still using them! Each library function is like a black box containing dozens of optimized GPU kernels. Different CUDA libraries can share streams and work on the same data buffer, enabling complex multi-operation pipelines without writing any GPU kernel code yourself.

Implementation

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    // ═══════════════════════════════════════════════════════════
    // STEP 1: Configuration
    // ═══════════════════════════════════════════════════════════
    const int N = 1024;               // Image size (1024×1024)
    const int CHUNK_SIZE = 1000;      // Images per chunk
    const int NUM_CHUNKS = 100;       // Total 100 chunks = 100,000 images
    const int NUM_STREAMS = 3;        // Streams for pipelining

    const size_t imageSize = N * N * sizeof(cufftComplex);
    const size_t chunkBytes = CHUNK_SIZE * imageSize;

    printf("Processing %d images (%d chunks of %d)\n",
           NUM_CHUNKS * CHUNK_SIZE, NUM_CHUNKS, CHUNK_SIZE);
    printf("Operations per image: Matrix multiplication + FFT\n");
    printf("Chunk size: %.2f GB\n", chunkBytes / (1024.0*1024.0*1024.0));

    // ═══════════════════════════════════════════════════════════
    // STEP 2: Allocate pinned host memory
    // ═══════════════════════════════════════════════════════════
    cufftComplex *h_data[NUM_STREAMS];    // Input/output data
    cufftComplex *h_result[NUM_STREAMS];  // Final results

    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaMallocHost(&h_data[i], chunkBytes);
        cudaMallocHost(&h_result[i], chunkBytes);
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 3: Allocate device memory
    // ═══════════════════════════════════════════════════════════
    cufftComplex *d_data[NUM_STREAMS];    // Working buffer on GPU
    cufftComplex *d_temp[NUM_STREAMS];    // Temporary for matrix mult

    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaMalloc(&d_data[i], chunkBytes);
        cudaMalloc(&d_temp[i], chunkBytes);
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 4: Initialize cuBLAS and cuFFT
    // ═══════════════════════════════════════════════════════════
    // cuBLAS handle
    cublasHandle_t cublasHandle;
    cublasCreate(&cublasHandle);

    // cuFFT plan for batched 2D FFT
    // Process multiple 1024×1024 FFTs in one call
    cufftHandle fftPlan;
    int rank = 2;                     // 2D FFT
    int n[2] = {N, N};               // Dimensions
    int batch = CHUNK_SIZE;          // Number of FFTs per batch
    cufftPlanMany(&fftPlan, rank, n,
                  NULL, 1, N*N,       // Input layout
                  NULL, 1, N*N,       // Output layout
                  CUFFT_C2C, batch);  // Complex-to-complex, batched

    // Create CUDA streams
    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamCreate(&streams[i]);
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 5: Process chunks with multi-operation pipeline
    // ═══════════════════════════════════════════════════════════
    const cufftComplex alpha = {1.0f, 0.0f};  // Scaling factor for matrix mult
    const cufftComplex beta = {0.0f, 0.0f};

    for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
        int streamIdx = chunk % NUM_STREAMS;
        cudaStream_t stream = streams[streamIdx];

        // Set both libraries to use this stream
        cublasSetStream(cublasHandle, stream);
        cufftSetStream(fftPlan, stream);

        // ───────────────────────────────────────────────────────
        // Load chunk from disk
        // ───────────────────────────────────────────────────────
        // In real application: read diffraction patterns from file
        // For now: generate random complex data
        for (int i = 0; i < CHUNK_SIZE * N * N; i++) {
            h_data[streamIdx][i].x = rand() / (float)RAND_MAX;
            h_data[streamIdx][i].y = rand() / (float)RAND_MAX;
        }

        // ───────────────────────────────────────────────────────
        // Pipeline: Transfer → Matrix Mult → FFT → Transfer back
        // ───────────────────────────────────────────────────────

        // Transfer input to GPU (async)
        cudaMemcpyAsync(d_data[streamIdx], h_data[streamIdx],
                       chunkBytes, cudaMemcpyHostToDevice, stream);

        // Operation 1: Matrix multiplication using cuBLAS (async)
        // Apply transformation to all images in chunk
        cublasCgemmStridedBatched(
            cublasHandle,
            CUBLAS_OP_N, CUBLAS_OP_N,
            N, N, N,
            &alpha,
            d_data[streamIdx], N, N*N,
            d_data[streamIdx], N, N*N,  // Multiply with itself (demo)
            &beta,
            d_temp[streamIdx], N, N*N,
            CHUNK_SIZE);

        // Copy result back to d_data for FFT
        cudaMemcpyAsync(d_data[streamIdx], d_temp[streamIdx],
                       chunkBytes, cudaMemcpyDeviceToDevice, stream);

        // Operation 2: 2D FFT using cuFFT (async)
        // Transform all images to frequency domain
        cufftExecC2C(fftPlan, d_data[streamIdx], d_data[streamIdx],
                    CUFFT_FORWARD);

        // Transfer results back to host (async)
        cudaMemcpyAsync(h_result[streamIdx], d_data[streamIdx],
                       chunkBytes, cudaMemcpyDeviceToHost, stream);

        // ───────────────────────────────────────────────────────
        // Save results from previous chunk
        // ───────────────────────────────────────────────────────
        if (chunk >= NUM_STREAMS) {
            int prevChunk = chunk - NUM_STREAMS;
            cudaStreamSynchronize(stream);

            // Save results: write to disk
            printf("Chunk %d processed (1000 images: MatMul + FFT)\n",
                   prevChunk);
        }
    }

    // ═══════════════════════════════════════════════════════════
    // STEP 6: Wait for final chunks
    // ═══════════════════════════════════════════════════════════
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamSynchronize(streams[i]);
        printf("Final chunk %d processed\n", NUM_CHUNKS - NUM_STREAMS + i);
    }

    printf("\nAll 100,000 images processed successfully!\n");
    printf("Each image: Matrix multiplication + 2D FFT\n");

    // ═══════════════════════════════════════════════════════════
    // STEP 7: Cleanup
    // ═══════════════════════════════════════════════════════════
    cufftDestroy(fftPlan);
    cublasDestroy(cublasHandle);

    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d_data[i]);
        cudaFree(d_temp[i]);
        cudaFreeHost(h_data[i]);
        cudaFreeHost(h_result[i]);
    }

    return 0;
}

Compilation:

nvcc -o tutorial10 tutorial10.cu -lcublas -lcufft

Key points:

  • Link both libraries: -lcublas -lcufft

  • Both libraries share the same CUDA stream

  • Operations execute sequentially on the stream (transfer → cuBLAS → cuFFT → transfer)

  • No custom kernels needed

cuFFT batched operations

Similar to cuBLAS, cuFFT provides batched operations for processing multiple FFTs efficiently:

| Function | Description |
|---|---|
| cufftPlanMany | Creates a plan for batched FFT operations |
| cufftExecC2C | Complex-to-complex FFT (forward or inverse) |
| cufftExecR2C | Real-to-complex FFT (forward) |
| cufftExecC2R | Complex-to-real FFT (inverse) |
| cufftSetStream | Associates an FFT plan with a CUDA stream |

Plan parameters:

  • rank: Dimensionality (1D, 2D, or 3D)

  • n[]: Size of each dimension

  • batch: Number of FFTs to perform

  • CUFFT_C2C: Transform type (complex-to-complex, real-to-complex, etc.)
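
For real-valued input images, the R2C/C2R variants in the table exploit conjugate symmetry, so the forward transform of an N×N real image produces only N×(N/2+1) complex values. A minimal single-image sketch (buffer names are illustrative):

// Forward FFT of one real N×N image into its half-spectrum
cufftHandle planR2C;
cufftPlan2d(&planR2C, N, N, CUFFT_R2C);

float *d_image;              // N × N real pixels
cufftComplex *d_halfSpec;    // N × (N/2 + 1) complex coefficients
cudaMalloc(&d_image, (size_t)N * N * sizeof(float));
cudaMalloc(&d_halfSpec, (size_t)N * (N / 2 + 1) * sizeof(cufftComplex));

// ...fill d_image...

cufftExecR2C(planR2C, d_image, d_halfSpec);

cufftDestroy(planR2C);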

Can I call cuBLAS/cuFFT from within custom kernels?

Short answer: No, you cannot call cuBLAS or cuFFT functions from inside your own __global__ kernel.

Why not:

cuBLAS and cuFFT functions are host functions (run on CPU) that launch their own GPU kernels internally. They cannot be called from device code (GPU kernels).

What happens between kernel launches:

Between separate kernel launches:

  1. Kernel 1 finishes → writes results to global memory (VRAM)

  2. GPU waits for all threads to complete

  3. Kernel 2 launches → reads data from global memory (VRAM)

Is the VRAM write→read a bottleneck?

Yes and no, depending on the operation.

The concern is real: Writing to VRAM and reading back from it takes time. For a \(1024 \times 1024\) complex matrix (8 MB):

  • Write time: ~0.01 ms (at ~800 GB/s VRAM bandwidth)

  • Read time: ~0.01 ms

  • Total memory traffic: ~0.02 ms

Here’s the key insight: Fusion reduces memory traffic

Separate kernels (Tutorial 10):
─────────────────────────────────────────────────────────────
Kernel 1 (cuBLAS): Read input (8 MB) → Compute → Write output (8 MB)
Kernel 2 (cuFFT):  Read input (8 MB) → Compute → Write output (8 MB)

Total VRAM traffic: 32 MB (4 transfers)
Memory time: ~0.04 ms (32 MB at 800 GB/s)

Hypothetical fused kernel:
─────────────────────────────────────────────────────────────
Fused kernel: Read input (8 MB) → MatMul → FFT → Write output (8 MB)

Total VRAM traffic: 16 MB (2 transfers)
Memory time: ~0.02 ms (16 MB at 800 GB/s)
Savings: 50% memory traffic, ~0.02 ms saved

The ptychography case is instructive: each of the 100,000 images undergoes multiple operations:

  1. Matrix multiplication (probe update)

  2. Forward FFT (propagate to detector)

  3. Apply constraints (amplitude/phase corrections)

  4. Inverse FFT (back to real space)

  5. Another matrix multiplication (object update)

Should you fuse these operations?

For ptychography with \(1024 \times 1024\) images, the library approach (Tutorial 10) is still optimal because:

  • Each operation is compute-intensive (FFT: ~0.5 ms, matrix mult: ~0.1 ms)

  • Memory traffic between kernels (~0.02 ms) is only 3% of total time

  • cuBLAS and cuFFT are 5-10× faster than custom implementations

  • Processing 100,000 images takes minutes either way

When fusion WOULD matter:

  • Small images: \(256 \times 256\) or smaller where computation is fast

  • Simple operations: Element-wise operations (scaling, phase shifts) rather than FFT/matrix mult

  • Tight iteration loops: Iterative refinement with 10+ operations per image per iteration

  • Memory-bound regime: When GPU is waiting on memory rather than computing

Important note about ptychography: many ptychography operations are memory-bound. The constraint application steps (amplitude corrections, phase updates, element-wise multiplications) are typically very fast computations (~0.01 ms) but require reading and writing full images. For these operations, fusion can help significantly.

Practical recommendation for ptychography reconstruction:

The pipeline has two types of operations:

  1. Compute-intensive (use libraries, don’t fuse):

    • Matrix multiplication (probe/object updates): cuBLAS

    • Forward/inverse FFT: cuFFT

    • These dominate computation time (~0.6 ms total)

    • Memory traffic between them is negligible (~3% of time)

  2. Memory-bound (consider fusion):

    • Amplitude constraint application

    • Phase corrections

    • Element-wise multiplications

    • These are fast (~0.01 ms each) but happen many times

    • Fusing these into a single kernel reduces intermediate VRAM writes

Optimal approach:

for each iteration:
    // Use library (no fusion needed, computation dominates)
    cuBLAS: probe update              // ~0.1 ms
    cuFFT: forward transform          // ~0.5 ms

    // Fuse memory-bound operations (saves VRAM traffic)
    custom kernel: {
        apply amplitude constraint
        phase correction
        element-wise operations
    }                                  // ~0.01 ms total

    cuFFT: inverse transform          // ~0.5 ms
    cuBLAS: object update             // ~0.1 ms

This hybrid approach gives you the best of both worlds: optimized libraries for heavy computation, fused kernels for memory-bound operations.
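
As a sketch of what such a fused constraint kernel could look like (the kernel name and the specific constraint are illustrative, not taken from any library): it replaces the magnitude of each complex value with a measured amplitude while preserving its phase, and applies a scale factor, in one pass over VRAM instead of three separate element-wise kernels.

// Fused element-wise constraint step (illustrative)
__global__ void applyConstraints(cufftComplex* field,
                                 const float* measuredAmplitude,
                                 float scale, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    cufftComplex v = field[idx];
    float mag = sqrtf(v.x * v.x + v.y * v.y) + 1e-12f;  // avoid divide-by-zero

    // Replace the magnitude with the measurement, keep the phase, then scale
    float factor = scale * measuredAmplitude[idx] / mag;
    field[idx].x = v.x * factor;
    field[idx].y = v.y * factor;
}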

When the savings DON’T matter (like Tutorial 10):

For our \(1024 \times 1024\) matrices:

  • cuBLAS computation time: ~0.1 ms (memory traffic is ~20% of total time)

  • cuFFT computation time: ~0.5 ms (FFT is compute-intensive)

  • Memory transfer time: ~0.02 ms (only ~3% of total time)

The computation dominates, so the intermediate VRAM write→read is negligible compared to the actual cuBLAS and cuFFT work.

Bottom line: For large, compute-intensive operations like \(1024 \times 1024\) matrix multiplication and FFT, the intermediate VRAM traffic between kernels is a small fraction of total time. The benefit of using highly optimized libraries far outweighs the cost of the extra memory traffic.

How can I fuse kernels?

If you need to combine operations in a single kernel launch, you have two options:

  1. Write custom fused kernel: Implement both operations yourself in one kernel

__global__ void fusedMatMulAndFFT(float* input, float* output, int N) {
    // Your custom matrix multiplication code
    // + Your custom FFT code
    // Both in one kernel - no intermediate global memory writes
}

Pros: No intermediate memory traffic, single launch overhead

Cons: Much harder to implement, lose cuBLAS/cuFFT optimizations

  2. Use device-side libraries (limited availability)

CUDA provides some device-callable math libraries (like curand for random numbers), but cuBLAS and cuFFT are not available as device functions.

For Tutorial 10’s use case:

The sequential approach (separate cuBLAS and cuFFT calls) is actually fine because:

  • Data stays in GPU memory between calls (no CPU←→GPU transfer)

  • Kernel launch overhead is microseconds (negligible compared to computation)

  • You get highly optimized implementations from both libraries

  • The stream ensures operations execute in order without CPU intervention

When to worry about kernel launch overhead:

  • Processing very small matrices (< 64×64) where launch overhead dominates

  • Launching thousands of tiny kernels in a loop

  • Operations that need custom fusion (e.g., applying constraints between matrix ops)

For our 1024×1024 images, cuBLAS + cuFFT as separate calls is the right choice!

Real-world application: Ptychography

This pattern directly applies to ptychography reconstruction:

  1. Load diffraction pattern: Read from detector (1024×1024 complex array)

  2. Apply probe update: Matrix multiplication with probe function (cuBLAS)

  3. Propagate to detector: 2D FFT to frequency domain (cuFFT)

  4. Apply constraints: Can add custom kernel if needed

  5. Inverse FFT: Transform back to real space (cuFFT)

  6. Update object: Another matrix operation (cuBLAS)

Processing 100,000 diffraction patterns with this pipeline achieves high-resolution image reconstruction in minutes instead of hours.

Multi-library pipeline benefits

Why combine libraries:

  1. No reinventing the wheel: Use highly optimized implementations

  2. Composability: Libraries work together through CUDA streams

  3. Maintainability: Updates benefit all operations automatically

  4. Performance: Each library is tuned for its specific operation

Common library combinations:

  • cuBLAS + cuFFT: Signal processing, ptychography, convolution

  • cuBLAS + cuDNN: Deep learning (matrix ops + neural network layers)

  • cuFFT + Thrust: FFT processing with sorting/reduction

  • cuSolver + cuBLAS: Linear system solving with matrix operations

Summary

Tutorial 10 demonstrates production-scale multi-operation pipelines:

  • Tutorials 1-6: GPU fundamentals

  • Tutorial 7: Large-scale processing with custom kernels

  • Tutorials 8-9: Single library optimization (cuBLAS)

  • Tutorial 10: Multi-library pipeline (cuBLAS + cuFFT)

Key takeaways:

  1. Library composition: Multiple CUDA libraries work together via streams

  2. No custom kernels: Standard operations use optimized library implementations

  3. Real-world pattern: Typical for scientific computing (matrix ops + FFT)

  4. Scalability: Same pattern works for any number of images with chunking

Complete workflow:

  1. Learn fundamentals with custom kernels (understand how GPUs work)

  2. Replace custom code with libraries (cuBLAS, cuFFT, cuDNN)

  3. Combine libraries in pipelines (handle complex real-world applications)

  4. Mix custom kernels with libraries (Tutorial 11 shows how)

Tutorial 11: Hybrid pipeline (custom kernels + CUDA libraries)

Problem

Real-world applications often need a mix of custom kernels and optimized libraries. For example, in signal processing:

  • Custom operations: Domain-specific transformations (thresholding, masking, custom filters)

  • Standard operations: FFT transforms, matrix multiplications

Writing everything as custom kernels means reinventing the wheel for standard operations. Using only libraries means you cannot implement custom domain logic. The solution: combine custom CUDA kernels with CUDA libraries.

This tutorial shows a simple image filtering pipeline that:

  1. Uses custom kernel to apply a threshold mask

  2. Uses cuFFT for frequency domain transformation

  3. Uses custom kernel to apply frequency domain filtering

  4. Uses cuFFT for inverse transform back to spatial domain

Solution overview

Image filtering pipeline:

1. Apply threshold mask: Mask pixels below threshold
   → Custom kernel (domain-specific logic)

2. Transform to frequency domain: spectrum = FFT(masked_image)
   → cuFFT (optimized library)

3. Apply frequency filter: Filter high/low frequencies
   → Custom kernel (element-wise operations)

4. Transform back: filtered_image = IFFT(spectrum)
   → cuFFT (inverse transform)

Why this hybrid approach:

  • Custom kernels: Fast for simple element-wise operations and custom logic

  • cuFFT: Highly optimized FFT implementation using specialized algorithms

  • Best of both worlds: Speed of libraries + flexibility of custom code

Example: for a 1024×1024 image, ~2 ms total (vs ~8 ms if implementing the FFT manually)

Step 1: Define custom kernels

Custom kernel 1: Apply threshold mask

// Set pixels below threshold to zero
__global__ void applyThreshold(
    const float* input,
    float* output,
    float threshold,
    int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    // Keep pixel if above threshold, otherwise set to zero
    output[idx] = (input[idx] >= threshold) ? input[idx] : 0.0f;
}

Custom kernel 2: Apply frequency filter

// Apply low-pass filter in frequency domain
// Attenuate high frequencies (reduce noise)
__global__ void frequencyFilter(
    cufftComplex* spectrum,
    int width,
    int height,
    float cutoff_frequency)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height) return;

    // cuFFT stores the DC component at index (0,0) and frequencies wrap
    // around, so map each index to a signed frequency before measuring
    // its distance from DC
    int fx = (x < width  / 2) ? x : x - width;
    int fy = (y < height / 2) ? y : y - height;

    float dx = (float)fx / width;
    float dy = (float)fy / height;
    float distance = sqrtf(dx * dx + dy * dy);

    // Low-pass filter: Attenuate based on distance
    float attenuation = 1.0f / (1.0f + powf(distance / cutoff_frequency, 4.0f));

    int idx = y * width + x;
    spectrum[idx].x *= attenuation;
    spectrum[idx].y *= attenuation;
}

Custom kernel 3: Normalize after IFFT

// cuFFT doesn't normalize, so we need to divide by N
__global__ void normalize(
    cufftComplex* data,
    int N,
    float scale)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    data[idx].x *= scale;
    data[idx].y *= scale;
}
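
The pipeline below also needs to convert the real-valued, thresholded image into cufftComplex values before the forward FFT (real part = pixel value, imaginary part = 0). A minimal helper kernel for that step (the name realToComplex is our own, not part of cuFFT):

// Pack real values into cufftComplex with a zero imaginary part
__global__ void realToComplex(
    const float* input,
    cufftComplex* output,
    int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    output[idx].x = input[idx];  // Real part
    output[idx].y = 0.0f;        // Imaginary part
}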

Step 2: Set up cuFFT library

#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>

int main() {
    // ═══════════════════════════════════════════════════════════
    // Configuration
    // ═══════════════════════════════════════════════════════════
    const int WIDTH = 1024;
    const int HEIGHT = 1024;
    const int TOTAL_SIZE = WIDTH * HEIGHT;

    const float THRESHOLD = 0.3f;       // Threshold for masking
    const float CUTOFF_FREQ = 0.1f;     // Low-pass filter cutoff

    // ═══════════════════════════════════════════════════════════
    // Initialize cuFFT
    // ═══════════════════════════════════════════════════════════

    // Create 2D FFT plan for complex-to-complex transform
    cufftHandle plan_forward, plan_inverse;
    cufftPlan2d(&plan_forward, HEIGHT, WIDTH, CUFFT_C2C);
    cufftPlan2d(&plan_inverse, HEIGHT, WIDTH, CUFFT_C2C);

    // Create CUDA stream for ordering operations
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Associate FFT plans with stream
    cufftSetStream(plan_forward, stream);
    cufftSetStream(plan_inverse, stream);

    printf("Image filtering pipeline\n");
    printf("Image size: %d×%d\n", WIDTH, HEIGHT);
    printf("Threshold: %.2f\n", THRESHOLD);
    printf("Cutoff frequency: %.2f\n", CUTOFF_FREQ);

    // ═══════════════════════════════════════════════════════════
    // Allocate GPU memory
    // ═══════════════════════════════════════════════════════════
    float *d_input_real, *d_output_real;
    cufftComplex *d_complex_input, *d_spectrum, *d_filtered;

    cudaMalloc(&d_input_real, TOTAL_SIZE * sizeof(float));
    cudaMalloc(&d_output_real, TOTAL_SIZE * sizeof(float));
    cudaMalloc(&d_complex_input, TOTAL_SIZE * sizeof(cufftComplex));
    cudaMalloc(&d_spectrum, TOTAL_SIZE * sizeof(cufftComplex));
    cudaMalloc(&d_filtered, TOTAL_SIZE * sizeof(cufftComplex));

    // Load input image (for demo, we'll use synthetic data)
    // In real application: cudaMemcpy from host
    // ... load image data ...

Step 3: Implement the hybrid pipeline

    // ═══════════════════════════════════════════════════════════
    // Set up kernel launch configuration
    // ═══════════════════════════════════════════════════════════

    // For 1D kernels (threshold, normalize)
    int threadsPerBlock_1D = 256;
    int blocks_1D = (TOTAL_SIZE + threadsPerBlock_1D - 1) / threadsPerBlock_1D;

    // For 2D kernel (frequency filter)
    dim3 threadsPerBlock_2D(16, 16);
    dim3 blocks_2D((WIDTH + 15) / 16, (HEIGHT + 15) / 16);

    // ═══════════════════════════════════════════════════════════
    // Pipeline execution (Custom + cuFFT + Custom + cuFFT)
    // ═══════════════════════════════════════════════════════════

    printf("\nProcessing pipeline:\n");

    // ───────────────────────────────────────────────────────────
    // STEP 1: Apply threshold (custom kernel)
    // ───────────────────────────────────────────────────────────
    printf("  [1/4] Applying threshold mask...\n");
    applyThreshold<<<blocks_1D, threadsPerBlock_1D, 0, stream>>>(
        d_input_real,
        d_output_real,
        THRESHOLD,
        TOTAL_SIZE
    );

    // Convert to complex for FFT (real part = masked data, imag = 0)
    // using the realToComplex helper kernel defined with the other kernels
    realToComplex<<<blocks_1D, threadsPerBlock_1D, 0, stream>>>(
        d_output_real,
        d_complex_input,
        TOTAL_SIZE
    );

    // ───────────────────────────────────────────────────────────
    // STEP 2: Forward FFT (cuFFT library)
    // ───────────────────────────────────────────────────────────
    printf("  [2/4] Computing FFT...\n");
    cufftExecC2C(plan_forward, d_complex_input, d_spectrum, CUFFT_FORWARD);

    // ───────────────────────────────────────────────────────────
    // STEP 3: Apply frequency filter (custom kernel)
    // ───────────────────────────────────────────────────────────
    printf("  [3/4] Applying frequency domain filter...\n");
    frequencyFilter<<<blocks_2D, threadsPerBlock_2D, 0, stream>>>(
        d_spectrum,
        WIDTH,
        HEIGHT,
        CUTOFF_FREQ
    );

    // ───────────────────────────────────────────────────────────
    // STEP 4: Inverse FFT (cuFFT library)
    // ───────────────────────────────────────────────────────────
    printf("  [4/4] Computing inverse FFT...\n");
    cufftExecC2C(plan_inverse, d_spectrum, d_filtered, CUFFT_INVERSE);

    // Normalize (cuFFT doesn't normalize by default)
    float scale = 1.0f / (WIDTH * HEIGHT);
    normalize<<<blocks_1D, threadsPerBlock_1D, 0, stream>>>(
        d_filtered,
        TOTAL_SIZE,
        scale
    );

    // Wait for all operations to complete
    cudaStreamSynchronize(stream);

    printf("\nPipeline complete!\n");

    // Copy result back to host if needed
    // cudaMemcpy(host_output, d_filtered, ...);

    // ═══════════════════════════════════════════════════════════
    // Cleanup
    // ═══════════════════════════════════════════════════════════
    cufftDestroy(plan_forward);
    cufftDestroy(plan_inverse);
    cudaStreamDestroy(stream);

    cudaFree(d_input_real);
    cudaFree(d_output_real);
    cudaFree(d_complex_input);
    cudaFree(d_spectrum);
    cudaFree(d_filtered);

    return 0;
}

Compilation:

nvcc -o tutorial11 tutorial11.cu -lcufft

Key points about hybrid pipelines

1. All operations share the same stream

// Set library to use our stream
cufftSetStream(plan_forward, stream);
cufftSetStream(plan_inverse, stream);

// Custom kernels also use the same stream
myKernel<<<blocks, threads, 0, stream>>>(...);

This ensures operations execute in order without extra synchronization between custom kernels and library calls.

2. Data stays on GPU throughout

GPU Memory (no CPU transfers during processing):
────────────────────────────────────────────────────────
Custom kernel → cuFFT → Custom kernel → cuFFT → Custom kernel
     ↓            ↓          ↓            ↓           ↓
All operations read/write GPU VRAM (fast, ~800 GB/s)

No CPU ←→ GPU transfers (would be slow, ~16 GB/s PCIe)

Everything happens on GPU from input to output.

3. Custom kernels for simple operations, libraries for complex ones

| Operation | Complexity | Implementation |
|---|---|---|
| Thresholding | Simple | Custom kernel |
| Frequency filtering | Simple | Custom kernel |
| Normalization | Simple | Custom kernel |
| 2D FFT | Complex | cuFFT library |

4. Flexibility of custom kernels

Custom kernels let you implement domain-specific logic:

// Your custom logic (not available in any library)
if (input[idx] >= threshold) {
    output[idx] = input[idx];  // Keep value
} else {
    output[idx] = 0.0f;        // Mask out
}

Libraries provide optimized standard operations, custom kernels provide flexibility.

Performance comparison

For 1024×1024 images:

| Approach | Time | Notes |
|---|---|---|
| CPU (NumPy + SciPy) | ~50 ms | Single-threaded |
| Custom kernels only | ~8 ms | Manual FFT implementation (slow) |
| Libraries only | N/A | Cannot implement custom logic |
| Hybrid (Tutorial 11) | ~2 ms | Best of both worlds |

Breakdown for hybrid approach:

  • Threshold kernel: ~0.1 ms

  • Forward FFT (cuFFT): ~0.8 ms

  • Frequency filter kernel: ~0.2 ms

  • Inverse FFT (cuFFT): ~0.8 ms

  • Normalize kernel: ~0.1 ms

  • Total: ~2 ms (25× faster than CPU)

Why the hybrid approach works

Principle 1: Use the right tool for each operation

  • Custom kernels: Quick to write for simple logic (thresholding, masking, element-wise ops)

  • cuFFT: Years of optimization, specialized algorithms and hardware acceleration

Writing FFT from scratch would take weeks and still be slower than cuFFT.

Principle 2: Libraries handle complex algorithms

Manual FFT implementation:
- Cooley-Tukey algorithm
- Bit-reversal permutation
- Twiddle factor computation
- Memory access optimization
- Edge case handling
→ Weeks of work, still slower than cuFFT

cuFFT:
- One function call
- Optimized by NVIDIA engineers
- Hardware-specific tuning
→ ~0.8 ms for 1024×1024

Principle 3: Minimize synchronization

// Bad: Synchronize after every operation
myKernel1<<<...>>>();
cudaDeviceSynchronize();  // ❌ Unnecessary wait
myKernel2<<<...>>>();
cudaDeviceSynchronize();  // ❌ Unnecessary wait

// Good: Let stream handle ordering
myKernel1<<<..., stream>>>();
myKernel2<<<..., stream>>>();  // ✅ Automatically waits for kernel1
cufftExec(...);                // ✅ Waits for kernel2
cudaStreamSynchronize(stream);  // ✅ Only sync at end

Real-world applications

This hybrid pattern appears in many domains:

Image processing:

Custom kernel: Preprocessing (crop, normalize)
cuFFT: Frequency analysis
Custom kernel: Apply filters
cuFFT: Inverse transform
Custom kernel: Postprocessing

Scientific simulation:

cuBLAS: Matrix operations
Custom kernel: Apply boundary conditions
cuFFT: Transform to frequency domain
Custom kernel: Apply physics constraints
cuFFT: Transform back

Audio processing:

Custom kernel: Window function
cuFFT: Spectrogram computation
Custom kernel: Noise reduction
cuFFT: Inverse transform
Custom kernel: Normalize output

Summary

Tutorial 11 shows how to combine custom kernels with CUDA libraries:

  • Tutorials 1-6: GPU fundamentals (memory, parallelism, optimization)

  • Tutorial 7: Large-scale processing (chunking, streams)

  • Tutorials 8-9: Library usage (cuBLAS for matrix operations)

  • Tutorial 10: Multi-library pipelines (cuBLAS + cuFFT)

  • Tutorial 11: Hybrid approach (custom kernels + libraries)

Key takeaways:

  1. Mix custom and library code: Libraries for complex algorithms, custom kernels for simple logic

  2. Share CUDA streams: Ensures correct execution order between custom and library operations

  3. Keep data on GPU: All operations run on GPU, no CPU transfers

  4. Don’t reinvent the wheel: Use cuFFT instead of writing FFT from scratch

  5. Custom kernels add flexibility: Implement domain-specific operations not available in libraries

When to use hybrid approach:

  • Need standard operations (FFT, matrix multiply) plus custom logic

  • Want speed of optimized libraries plus flexibility of custom code

  • Processing pipeline requires both simple and complex operations

  • Production code where both performance and maintainability matter

This completes the CUDA tutorial series from fundamentals to hybrid pipelines combining custom kernels with optimized libraries.

Next steps

  • For GPU optimization and profiling: See GPU optimization to learn how to profile your code with NSight, identify bottlenecks, and apply kernel fusion for memory-bound operations

  • For conceptual GPU architecture: See CUDA fundamentals for the GPU hardware and programming model

To-do

  • [ ] Benchmark for GPU

  • [ ] Random matrix generation, 1024×1024 px, 100,000 of them

  • [ ] Mixed-precision example with FP16/cuBLAS

  • [ ] Multi-GPU