.. _cuda-tutorials:

CUDA tutorials
==============

This page contains progressive tutorials for learning CUDA C++ programming through practical examples. Each tutorial builds on the previous one, from basic array addition to production-ready pipelined processing with CUDA streams.

For fundamental CUDA concepts, architecture, and memory model, see :ref:`cuda`.

Tutorial progression
--------------------

.. note::
   
   These tutorials progress from simple to realistic production scenarios:
   
   0. **Memory basics (0/6)**: 1D array addition with memory management fundamentals
   1. **Single block (1/6)**: Basic 2D matrix addition with 16×16 matrix using one block
   2. **Multiple blocks (2/6)**: Scale to 256×256 matrix using 256 blocks
   3. **Batch processing (3/6)**: Process 1,000 matrices simultaneously (65.5M elements)
   4. **Chunked processing (4/6)**: Handle 100,000 matrices with limited VRAM
   5. **Pipelined streams (5/6)**: Overlap data transfer and computation for maximum performance
   6. **Shared memory (6/6)**: Matrix multiplication with tiling for 10-20× speedup

Tutorial 0: Memory management basics
-------------------------------------

Understanding how to allocate and transfer data between CPU (host) and GPU (device) is fundamental to CUDA programming. This tutorial shows a complete example of 1D array addition.

  **Memory workflow visualization**::

      Computer with CPU and GPU
      ┌──────────────────────────────────────────────────────┐
      │ CPU Side                    GPU Side                 │
      │                                                      │
      │ RAM (System Memory)         VRAM (GPU Memory)        │
      │ ┌─────────────────┐         ┌──────────────────┐     │
      │ │ h_A[1.0, 2.0]   │────────▶│ d_A[1.0, 2.0]    │     │
      │ │ h_B[3.0, 4.0]   │ memcpy  │ d_B[3.0, 4.0]    │     │
      │ └─────────────────┘         └──────────────────┘     │
      │                                      │               │
      │                              Kernel accesses         │
      │                              VRAM using d_A, d_B     │
      │                                      ▼               │
      │                             ┌──────────────────┐     │
      │                             │ 1000 GPU threads │     │
      │                             │ compute in VRAM  │     │
      │                             └──────────────────┘     │
      └──────────────────────────────────────────────────────┘
      
      Workflow:
      1. Allocate h_A in RAM (CPU memory)
      2. Allocate d_A in VRAM (GPU memory) using cudaMalloc
      3. Copy h_A → d_A using cudaMemcpy (RAM to VRAM, happens once)
      4. Launch kernel with d_A pointer (just passing an address, not copying data)
      5. Kernel threads access d_A directly in VRAM (fast, no copying)

Here is a pedagogical example of adding two arrays:

  .. code-block:: cpp

      // Problem: We want 1000 GPU threads to add two arrays
      // Arrays must be in VRAM, and threads need to know where to find them
      
      __global__ void addArrays(float* A, float* B, float* C, int N)
      {
          int i = threadIdx.x;
          if (i < N) {
              C[i] = A[i] + B[i];  // Accesses data in VRAM
          }
      }

      int main()
      {
          int N = 1000;
          size_t size = N * sizeof(float);  // 1000 × 4 bytes = 4000 bytes
          
          // Step 1: Allocate and initialize arrays in CPU RAM
          float *h_A = (float*)malloc(size);  // h_ = host (CPU)
          float *h_B = (float*)malloc(size);
          float *h_C = (float*)malloc(size);
          
          // malloc(size) behind the scenes:
          // 1. Find 4000 bytes of free RAM
          // 2. Return starting address of the allocated memory (e.g., 0x5000)
          // 3. Store that address in h_A
          
          for (int i = 0; i < N; i++) {
              h_A[i] = 1.0f;
              h_B[i] = 2.0f;
          }
          
          // Step 2: Allocate arrays in GPU VRAM
          float *d_A, *d_B, *d_C;  // d_ = device (GPU)
          
          cudaMalloc(&d_A, size);  // Behind the scenes:
          // 1. Find 4000 bytes of free VRAM
          // 2. Get starting address of that VRAM block (e.g., 0x7f8a4b000000)
          // 3. Write that starting address into d_A (on CPU)
          // Result: d_A points to first element of the array in VRAM
          //
          // Summary:
          // d_A is a pointer variable defined as `float* d_A`
          // => d_A    = VRAM address (where array starts in GPU memory)
          // => *d_A   = data at that VRAM address (first element value)
          // => &d_A   = CPU RAM address (where d_A variable itself lives)
          
          cudaMalloc(&d_B, size);
          cudaMalloc(&d_C, size);
          
          // Step 3: Copy data from RAM to VRAM (once)
          cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
          cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
          
          // Step 4: Launch kernel with VRAM pointers
          addArrays<<<1, N>>>(d_A, d_B, d_C, N);
          // Threads access data directly in VRAM
          
          // Step 5: Copy result back to RAM
          cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
          
          // Step 6: Cleanup
          cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
          free(h_A); free(h_B); free(h_C);
      }

Here is a visual:

    ::

      CPU Memory (RAM)              GPU Memory (VRAM)
      ┌────────────────┐            ┌──────────────────┐
      │ Address:0x1000 │            │ Address:0x7f8a...│
      │ ┌────────────┐ │            │ ┌──────────────┐ │
      │ │ d_A        │ │            │ │ [1.5, 2.3...]│ │
      │ │ = 0x7f8a...│─┼─points to─▶│ └──────────────┘ │
      │ └────────────┘ │            │        ↑         │
      │       ↑        │            │     *d_A = 1.5   │
      │    &d_A        │            └──────────────────┘
      └────────────────┘
      
      &d_A = 0x1000 (address of pointer on CPU)
      d_A  = 0x7f8a... (VRAM address stored in d_A)
      *d_A = 1.5 (data in VRAM)


.. dropdown:: Why pass pointers to kernels?
  :icon: info
  :color: info

  The problem is that GPUs have separate memory (VRAM) from CPUs (RAM). Your data must be in GPU memory before a kernel can process it. But how do you tell the kernel where to find that data in VRAM? If you tried to pass entire arrays by value, the program would copy gigabytes every kernel launch, which defeats the purpose of having data in fast VRAM.
  
  The solution: First copy data from CPU RAM to GPU VRAM once using ``cudaMemcpy``. Then pass pointers (GPU memory addresses in VRAM) to kernels. A pointer is just a small number (8 bytes) that tells threads where to find the array in VRAM. Thousands of threads can then work on the same dataset in VRAM without any copying. The naming convention uses ``d_`` prefix for device (GPU/VRAM) pointers and ``h_`` prefix for host (CPU/RAM) pointers.


.. dropdown:: Why doesn't the kernel just know the VRAM addresses automatically?
  :icon: info
  :color: info

  Kernels are reusable functions, not hardcoded to specific data.
  
  If addresses were hardcoded, you could only use each kernel once. By passing pointers as parameters, one kernel can process unlimited different datasets - same reason C++ functions take parameters instead of hardcoding values.

  .. code-block:: cpp

      // Same kernel, different data!
      __global__ void addArrays(float* A, float* B, float* C, int N) {
          int i = threadIdx.x;
          C[i] = A[i] + B[i];  // Works on ANY arrays you pass
      }
      
      int main() {
          // Scenario 1: Physics simulation
          float *d_physics_A, *d_physics_B, *d_physics_C;
          cudaMalloc(&d_physics_A, size);
          cudaMalloc(&d_physics_B, size);
          cudaMalloc(&d_physics_C, size);
          addArrays<<<1, N>>>(d_physics_A, d_physics_B, d_physics_C, N);
          
          // Scenario 2: Same kernel, different data for ML
          float *d_ml_A, *d_ml_B, *d_ml_C;
          cudaMalloc(&d_ml_A, size);
          cudaMalloc(&d_ml_B, size);
          cudaMalloc(&d_ml_C, size);
          addArrays<<<1, N>>>(d_ml_A, d_ml_B, d_ml_C, N);
          // ▲ Same kernel code, different VRAM addresses!
          
          // Scenario 3: Use outputs from previous kernels as inputs
          addArrays<<<1, N>>>(d_physics_C, d_ml_C, d_final, N);
      }


Tutorial 1: 2D matrix addition with single block (16×16)
--------------------------------------------------------

Now we progress from 1D arrays to 2D matrices. We organize threads in a 2D layout (16×16) within a single block, where each thread computes one matrix element.

Here is the complete working example:

  .. code-block:: cpp

      #include <cuda_runtime.h>
      #include <stdio.h>

      __global__ void matAdd(float* A, float* B, float* C, int N)
      {
          int i = threadIdx.x;
          int j = threadIdx.y;
          int index = i * N + j;
          
          if (i < N && j < N) {
              C[index] = A[index] + B[index];
          }
      }

      int main()
      {
          int N = 16;
          size_t size = N * N * sizeof(float);
          
          // Host allocation and initialization
          float *h_A = (float*)malloc(size);
          float *h_B = (float*)malloc(size);
          float *h_C = (float*)malloc(size);
          
          for (int i = 0; i < N * N; i++) {
              h_A[i] = 1.0f;
              h_B[i] = 2.0f;
          }
          
          // Device allocation
          float *d_A, *d_B, *d_C;
          cudaMalloc(&d_A, size);
          cudaMalloc(&d_B, size);
          cudaMalloc(&d_C, size);
          
          // Copy to GPU
          cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
          cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
          
          // Launch kernel
          dim3 threadsPerBlock(16, 16);
          dim3 numBlocks(1, 1);
          matAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
          
          // Copy result back
          cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
          
          // Cleanup
          cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
          free(h_A); free(h_B); free(h_C);
          
          return 0;
      }

  **Thread organization (1 block with 16×16 threads)**::

      Single block with 16×16 = 256 threads
      ┌────────────────────────────────────────┐
      │ Thread  Thread  Thread  ...  Thread    │
      │ (0,0)   (1,0)   (2,0)        (15,0)    │
      │                                        │
      │ Thread  Thread  Thread  ...  Thread    │
      │ (0,1)   (1,1)   (2,1)        (15,1)    │
      │                                        │
      │   ...     ...     ...    ...   ...     │
      │                                        │
      │ Thread  Thread  Thread  ...  Thread    │
      │ (0,15)  (1,15)  (2,15)       (15,15)   │
      └────────────────────────────────────────┘
      
      Each thread (i,j) computes: C[i][j] = A[i][j] + B[i][j]
      
      Key insight: Thread layout matches data layout!
      - Thread (0,0) handles matrix element [0][0]
      - Thread (5,7) handles matrix element [5][7]
      - Thread (15,15) handles matrix element [15][15]

  **The kernel function**:

  .. code-block:: cpp

      __global__ void matAdd(float* A, float* B, float* C, int N)
      {
          // Get thread position in 2D block
          int i = threadIdx.x;  // row index (0-15)
          int j = threadIdx.y;  // column index (0-15)
          
          // Convert 2D (i,j) to 1D memory index
          int index = i * N + j;
          // Why? Memory is 1D, but we think in 2D
          // Example: Element [2][3] is at position 2×16 + 3 = 35
          
          // Bounds check (important for general code)
          if (i < N && j < N) {
              C[index] = A[index] + B[index];
          }
      }

  **What each thread does**::

      Thread (0,0): index = 0×16 + 0 = 0   → C[0]  = A[0]  + B[0]   (element [0][0])
      Thread (0,1): index = 0×16 + 1 = 1   → C[1]  = A[1]  + B[1]   (element [0][1])
      Thread (1,0): index = 1×16 + 0 = 16  → C[16] = A[16] + B[16]  (element [1][0])
      Thread (2,3): index = 2×16 + 3 = 35  → C[35] = A[35] + B[35]  (element [2][3])
      
      All 256 threads execute this same code in parallel!
      Each thread just has different threadIdx values.

  **Host code setup**:

  .. code-block:: cpp

      int main()
      {
          // Define problem size
          int N = 16;  // 16×16 matrix
          size_t size = N * N * sizeof(float);  // 16×16×4 = 1,024 bytes
          
          // Allocate host (CPU) memory
          float *h_A = (float*)malloc(size);  // h_ prefix = host memory
          float *h_B = (float*)malloc(size);
          float *h_C = (float*)malloc(size);
          
          // Initialize matrices with sample values
          for (int i = 0; i < N * N; i++) {
              h_A[i] = 1.0f;  // Matrix A filled with 1.0
              h_B[i] = 2.0f;  // Matrix B filled with 2.0
          }
          // Expected result: C should be filled with 3.0

  **Memory visualization after initialization**::

      CPU RAM:
      h_A: [1.0, 1.0, 1.0, ..., 1.0]  (256 elements × 4 bytes = 1,024 bytes)
      h_B: [2.0, 2.0, 2.0, ..., 2.0]  (256 elements × 4 bytes = 1,024 bytes)
      h_C: [?, ?, ?, ..., ?]           (uninitialized, will hold results)

  **GPU memory allocation**:

  .. code-block:: cpp

          // Declare device (GPU) pointers
          float *d_A, *d_B, *d_C;  // d_ prefix = device memory
          
          // Allocate GPU memory
          cudaMalloc(&d_A, size);  // Allocate 1,024 bytes in VRAM for A
          cudaMalloc(&d_B, size);  // Allocate 1,024 bytes in VRAM for B
          cudaMalloc(&d_C, size);  // Allocate 1,024 bytes in VRAM for C

  **What happens with cudaMalloc**::

      CPU side:                          GPU side:
      ┌─────────────────┐                ┌──────────────────┐
      │ d_A (pointer)   │───points to───▶│ VRAM block       │
      │ value: 0x7f8a...│                │ [empty, 1KB]     │
      └─────────────────┘                └──────────────────┘
      
      cudaMalloc(&d_A, 1024):
      1. Find 1,024 free bytes in GPU VRAM
      2. Return VRAM address (e.g., 0x7f8a4b000000)
      3. Store that address in CPU variable d_A
      
      Result: d_A contains a VRAM address (lives on CPU, points to GPU)

  **Data transfer to GPU**:

  .. code-block:: cpp

          // Copy data from CPU RAM to GPU VRAM
          cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
          cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

  **Data flow visualization**::

      Before cudaMemcpy:
      CPU RAM                     GPU VRAM
      h_A: [1.0, 1.0, ...]        d_A: [?, ?, ...]  (empty)
      h_B: [2.0, 2.0, ...]        d_B: [?, ?, ...]  (empty)
      
      After cudaMemcpy:
      CPU RAM                     GPU VRAM
      h_A: [1.0, 1.0, ...]        d_A: [1.0, 1.0, ...]  (copied!)
      h_B: [2.0, 2.0, ...]        d_B: [2.0, 2.0, ...]  (copied!)
      
      This is a synchronous operation: CPU waits until copy completes
      Time taken: ~1-10 microseconds for 1 KB over PCIe

  **Launching the kernel**:

  .. code-block:: cpp

          // Configure launch parameters
          dim3 threadsPerBlock(16, 16);  // 16×16 = 256 threads per block
          dim3 numBlocks(1, 1);           // Just 1 block total
          
          // Launch kernel
          matAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

  **Understanding dim3**::

      dim3 is a CUDA type for 3D dimensions (x, y, z)
      
      dim3 threadsPerBlock(16, 16);
      - Creates 16×16 = 256 threads in a 2D grid within one block
      - Equivalent to: dim3 threadsPerBlock(16, 16, 1);
      - Each thread gets unique (threadIdx.x, threadIdx.y)
      
      dim3 numBlocks(1, 1);
      - Creates just 1 block
      - For now, we only need 1 block (256 threads enough for 16×16 matrix)

  **What happens during kernel launch**::

      GPU Execution:
      ┌────────────────────────────────────────┐
      │ Block (0,0) with 256 threads           │
      │                                        │
      │ Thread(0,0)  Thread(1,0)  ...         │
      │    ↓            ↓           ↓          │
      │ C[0]=A[0]+B[0]  C[1]=...   ...        │
      │                                        │
      │ All execute simultaneously!            │
      │ Time: ~1 microsecond for 256 adds     │
      └────────────────────────────────────────┘
      
      CPU: Continues immediately (kernel launch is asynchronous)
      GPU: Schedules 256 threads, executes addition, writes to d_C

  **Getting results back**:

  .. code-block:: cpp

          // Copy result from GPU VRAM to CPU RAM
          cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

  **Data flow**::

      Before this line:
      GPU VRAM                    CPU RAM
      d_C: [3.0, 3.0, ...]        h_C: [?, ?, ...]
      
      After cudaMemcpy:
      GPU VRAM                    CPU RAM
      d_C: [3.0, 3.0, ...]        h_C: [3.0, 3.0, ...]  (copied!)
      
      This is synchronous: CPU waits for:
      1. Kernel to finish computing
      2. Data transfer to complete
      
      Now h_C contains the results, we can use them on CPU!

  **Cleanup**:

  .. code-block:: cpp

          // Free GPU memory
          cudaFree(d_A);
          cudaFree(d_B);
          cudaFree(d_C);
          
          // Free CPU memory
          free(h_A);
          free(h_B);
          free(h_C);
      }

  **Why cleanup is important**::

      cudaFree(d_A):
      - Releases 1,024 bytes of GPU VRAM
      - Makes that memory available for other GPU operations
      - GPU memory is limited (typically 4-24 GB)
      
      free(h_A):
      - Releases 1,024 bytes of CPU RAM
      - Standard C memory management
      
      Without cleanup: Memory leak! Resources stay allocated until program ends.


.. dropdown:: 2D to 1D memory layout visualization
  :icon: light-bulb
  :color: success

  ::

      Consider a 2D matrix B[3][4]:
      Row 0: B[0][0]  B[0][1]  B[0][2]  B[0][3]
      Row 1: B[1][0]  B[1][1]  B[1][2]  B[1][3]
      Row 2: B[2][0]  B[2][1]  B[2][2]  B[2][3]

      Stored in 1D memory:
      Index:  0   1   2   3   4   5   6   7   8   9  10  11
            ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
      A[]   │0,0│0,1│0,2│0,3│1,0│1,1│1,2│1,3│2,0│2,1│2,2│2,3│
            └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
            └───────Row 0───────┘└───────Row 1───────┘└───────Row 2───────┘

      Formula: A[i * N + j] where N = 4 (width)

      Example mappings:
      B[0][1]  →  A[0 × 4 + 1]  =  A[1]   (row 0, column 1)
      B[1][0]  →  A[1 × 4 + 0]  =  A[4]   (row 1, column 0)
      B[2][3]  →  A[2 × 4 + 3]  =  A[11]  (row 2, column 3)

      How it works:
      i × N      →  skip i complete rows (each row is N elements)
      + j        →  move j positions within the current row


.. dropdown:: Why do you need boundary checks in this 2D matrix addition kernel?
  :icon: info
  :color: success

  The problem is that you must launch whole blocks, not individual threads.

  **Example with N=100 and block size 16×16**::

      Problem: Need 100×100 threads (one per matrix element)
      Block size: 16×16 threads per block
      
      Calculate blocks needed:
      blocks_x = ⌈100 / 16⌉ = ⌈6.25⌉ = 7 blocks
      blocks_y = ⌈100 / 16⌉ = ⌈6.25⌉ = 7 blocks
      
      Threads launched:
      total_x = 7 blocks × 16 threads = 112 threads
      total_y = 7 blocks × 16 threads = 112 threads
      Total = 112 × 112 = 12,544 threads
      
      But your matrix is only 100×100 = 10,000 elements!
      Extra threads = 12,544 - 10,000 = 2,544 threads doing nothing

  **Visual representation of the thread grid**::

      Matrix (100×100)              Thread grid (112×112)
      ┌──────────────┐              ┌──────────────┬──┐
      │              │              │              │XX│ ← Extra threads
      │   Valid      │              │   Valid      │XX│   (threads 100-111)
      │   region     │              │   region     │XX│
      │   100×100    │              │   100×100    │XX│
      │              │              │              │XX│
      │              │              │              │XX│
      └──────────────┘              ├──────────────┼──┤
                                    │XXXXXXXXXXXXXX│XX│ ← Extra row of threads
                                    └──────────────┴──┘   (rows 100-111)

      Without boundary check: Threads with i≥100 or j≥100 access invalid memory
      With boundary check: if (i < N && j < N) prevents out-of-bounds access

  **Code pattern**::

      int i = blockIdx.x * blockDim.x + threadIdx.x;  // Could be 0-111
      int j = blockIdx.y * blockDim.y + threadIdx.y;  // Could be 0-111
      
      if (i < N && j < N)  // Only process if within 0-99 range
          C[i * N + j] = A[i * N + j] + B[i * N + j];

  When you do not need boundary checks: When problem size is guaranteed to be a multiple of block size (for example, N=256 with 16×16 blocks gives exactly 16×16 blocks with no remainder).

What's next?
  We are using a single block containing 256 threads to add two 16×16 matrices. What if we want to add larger matrices, like 256×256? See the next tutorial on using multiple blocks.

Tutorial 2: 2D matrix addition with multiple blocks (256x256)
-------------------------------------------------------------

Recall block refers to a group of threads that can cooperate via shared memory. A grid is a collection of blocks launched by a single kernel.

For larger matrices like 256×256, we divide the matrix into smaller 16×16 chunks, where each chunk is processed by one block. This means we need 16 blocks in the X direction and 16 blocks in the Y direction, giving us 16 × 16 = 256 blocks total. Each block contains 16 × 16 = 256 threads that process the elements in that chunk.

  **2D block organization visualization**::

      Grid (multiple blocks arranged in 2D)
      ┌─────────┬─────────┬─────────┐
      │ Block   │ Block   │ Block   │
      │ (0,0)   │ (1,0)   │ (2,0)   │
      ├─────────┼─────────┼─────────┤
      │ Block   │ Block   │ Block   │
      │ (0,1)   │ (1,1)   │ (2,1)   │
      ├─────────┼─────────┼─────────┤
      │ Block   │ Block   │ Block   │
      │ (0,2)   │ (1,2)   │ (2,2)   │
      └─────────┴─────────┴─────────┘

      Each block contains 16×16 threads:
      Block (0,0) expanded:
      ┌──────────────────────────────┐
      │ T(0,0) T(1,0) ... T(15,0)    │
      │ T(0,1) T(1,1) ... T(15,1)    │
      │   ...    ...   ...   ...     │
      │ T(0,15) ...      T(15,15)    │
      └──────────────────────────────┘
      
      256 threads per block

Here is a sample implementation:

  .. code-block:: cpp

      // Kernel definition
      __global__ void MatAdd(float* d_A, float* d_B, float* d_C, int N)
      {
          // Calculate global thread indices
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          int j = blockIdx.y * blockDim.y + threadIdx.y;
          
          // Boundary check (prevents out-of-bounds access when N not divisible by block size)
          if (i < N && j < N) {
              int index = i * N + j;  // Convert 2D to 1D index
              d_C[index] = d_A[index] + d_B[index];
          }
      }

      int main()
      {
          ...
          // 2D thread organization
          dim3 threadsPerBlock(16, 16);     // x=16, y=16, z=1
          dim3 numBlocks(N/16, N/16, 1);    // Explicit z dimension
          MatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
          ...
      }

.. dropdown:: Why use multiple blocks for large matrices?
  :icon: flame
  :color: danger

  GPUs achieve parallelism by distributing blocks across multiple streaming multiprocessors (SMs).

  **Comparison for 256×256 matrix (65,536 elements)**::

      Option 1: One block with 1024 threads (maximum per block)
      - Launch: 1 block × 1024 threads = 1,024 threads
      - Problem: Need 65,536 threads (one per element)
      - Result: Must run kernel 64 times sequentially or use complex indexing
      - Parallelism: Only 1 SM utilized (out of 82 for RTX 3090, 108 for A100)
      
      Why only 1 SM?
      • 1 block = assigned to exactly 1 SM (blocks never split across SMs)
      • That SM has 32 CUDA cores working
      • Other 81 SMs sit completely idle
      • Massive waste of GPU resources
      
      Option 2: 256 blocks with 256 threads each
      - Launch: 256 blocks × 256 threads = 65,536 threads
      - All elements processed in parallel
      - Parallelism: All 82 SMs working (256 blocks distributed across 82 SMs)
      
      How blocks distribute:
      • Each SM gets multiple blocks: 256 blocks ÷ 82 SMs ≈ 3 blocks per SM
      • Each SM processes its assigned blocks' warps concurrently
      • All SMs work simultaneously → ~82× more parallelism
      
      Result: Option 2 is dramatically faster (~50-80× in practice)

Tutorial 3: Batch processing multiple matrices
-----------------------------------------------

Process 1,000 matrices of 256×256 elements simultaneously using a 3D grid configuration where the Z dimension represents the batch index.

  **Problem setup**::

      Batch size: 1,000 matrices
      Matrix size: 256×256 elements each
      Total elements: 1,000 × 256 × 256 = 65,536,000 elements
      
      Memory layout (flattened):
      Matrix 0: [elements 0 to 65,535]
      Matrix 1: [elements 65,536 to 131,071]
      Matrix 2: [elements 131,072 to 196,607]
      ...
      Matrix 999: [elements 65,470,464 to 65,535,999]

  **3D grid organization**::

      Grid dimensions:
      - X dimension: 16 blocks (for 256 columns ÷ 16)
      - Y dimension: 16 blocks (for 256 rows ÷ 16)
      - Z dimension: 1,000 blocks (one per matrix in batch)
      
      Total blocks = gridDim.x × gridDim.y × gridDim.z
                   = 16 × 16 × 1,000
                   = 256,000 blocks
      
      Threads per block = blockDim.x × blockDim.y × blockDim.z
                        = 16 × 16 × 1
                        = 256 threads
      
      Total threads = Total blocks × Threads per block
                    = 256,000 × 256
                    = 65,536,000 threads

  **Warp organization inside each block**::

      Since each block has 256 threads and warps contain 32 threads:
      
      Warps per block = 256 threads ÷ 32 threads/warp = 8 warps
      
      Total warps in grid = 256,000 blocks × 8 warps/block
                          = 2,048,000 warps
      
      These 2,048,000 warps get distributed across the GPU's streaming 
      multiprocessors (SMs). Each SM executes warps in groups, scheduling 
      them to hide memory latency. On an A100 GPU with 108 SMs:
      
      Warps per SM ≈ 2,048,000 warps ÷ 108 SMs ≈ 18,963 warps/SM
      
      The SM's warp scheduler executes these warps over time, switching 
      between warps whenever one stalls on memory access.

  **Thread-to-element mapping**
  
  Each thread computes one element in one matrix. Think of each block as 
  processing a 16×16 chunk of elements within a specific matrix::

      Single matrix (256×256) divided into 16×16 chunks:
      
      Matrix 0:                    ← blockIdx.z = 0
      ┌────────┬────────┬─────┬────────┐
      │ Block  │ Block  │ ... │ Block  │  ← blockIdx.y = 0
      │ (0,0)  │ (1,0)  │     │ (15,0) │
      │ 16×16  │ 16×16  │     │ 16×16  │
      ├────────┼────────┼─────┼────────┤
      │ Block  │ Block  │ ... │ Block  │  ← blockIdx.y = 1
      │ (0,1)  │ (1,1)  │     │ (15,1) │
      ├────────┼────────┼─────┼────────┤
      │  ...   │  ...   │ ... │  ...   │
      ├────────┼────────┼─────┼────────┤
      │ Block  │ Block  │ ... │ Block  │  ← blockIdx.y = 15
      │ (0,15) │ (1,15) │     │(15,15) │
      └────────┴────────┴─────┴────────┘
         ↑        ↑              ↑
      bx=0     bx=1          bx=15

      Inside each block, 16×16 threads process the elements:
      
      Block (3, 5, 0):  ← Processing chunk at column 3, row 5, matrix 0
      ┌──┬──┬──┬───┬──┐
      │T │T │T │...│T │  ← threadIdx.y = 0
      │00│10│20│   │F0│
      ├──┼──┼──┼───┼──┤
      │T │T │T │...│T │  ← threadIdx.y = 1
      │01│11│21│   │F1│
      ├──┼──┼──┼───┼──┤
      │..│..│..│...│..│
      ├──┼──┼──┼───┼──┤
      │T │T │T │...│T │  ← threadIdx.y = 15
      │0F│1F│2F│   │FF│
      └──┴──┴──┴───┴──┘
       ↑  ↑  ↑      ↑
      tx=0 1  2    15
      
      Important: threadIdx always ranges from 0 to (blockDim - 1)
      - threadIdx.x ranges from 0 to 15 (always resets for each block)
      - threadIdx.y ranges from 0 to 15 (always resets for each block)
      - Every block has its own local thread numbering starting at (0,0)
      - To get the global position, we combine blockIdx and threadIdx

  **Coordinate calculation**::

      For a thread at (threadIdx.x, threadIdx.y) in block (blockIdx.x, blockIdx.y, blockIdx.z):
      
      // Which matrix in the batch
      int batch_idx = blockIdx.z;          // Range: 0 to 999
      
      // Position within that matrix
      int row = blockIdx.y * 16 + threadIdx.y;  // Range: 0 to 255
      int col = blockIdx.x * 16 + threadIdx.x;  // Range: 0 to 255
      
      Example: Thread at threadIdx(5, 7) in block(3, 10, 42):
      - Processing matrix 42
      - Row = 10 × 16 + 7 = 167
      - Col = 3 × 16 + 5 = 53
      - Computing element at position (167, 53) in matrix 42

Here is the complete implementation:

  .. code-block:: cpp

      // Kernel for batch matrix addition
      __global__ void batchMatAdd(float* d_A, float* d_B, float* d_C, 
                                   int N, int batch_size)
      {
          // Which matrix in the batch
          int batch_idx = blockIdx.z;
          
          // Position within this matrix
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          
          // Boundary check
          if (row < N && col < N && batch_idx < batch_size) {
              // Calculate flat index: skip to correct matrix, then to element
              int matrix_offset = batch_idx * N * N;
              int element_offset = row * N + col;
              int index = matrix_offset + element_offset;
              
              d_C[index] = d_A[index] + d_B[index];
          }
      }

      int main()
      {
          const int N = 256;              // Matrix dimension
          const int batch_size = 1000;    // Number of matrices
          
          // Total elements for all matrices
          size_t total_size = batch_size * N * N * sizeof(float);
          
          // Allocate host memory
          float* h_A = (float*)malloc(total_size);
          float* h_B = (float*)malloc(total_size);
          float* h_C = (float*)malloc(total_size);
          
          // Initialize with sample data
          for (int i = 0; i < batch_size * N * N; i++) {
              h_A[i] = 1.0f;
              h_B[i] = 2.0f;
          }
          
          // Allocate device memory
          float *d_A, *d_B, *d_C;
          cudaMalloc(&d_A, total_size);
          cudaMalloc(&d_B, total_size);
          cudaMalloc(&d_C, total_size);
          
          // Copy data to device
          cudaMemcpy(d_A, h_A, total_size, cudaMemcpyHostToDevice);
          cudaMemcpy(d_B, h_B, total_size, cudaMemcpyHostToDevice);
          
          // Configure 3D grid: 16×16×1000 blocks, each with 16×16 threads
          dim3 threadsPerBlock(16, 16);
          dim3 numBlocks(N/16, N/16, batch_size);
          
          // Launch kernel
          batchMatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N, batch_size);
          
          // Copy results back
          cudaMemcpy(h_C, d_C, total_size, cudaMemcpyDeviceToHost);
          
          // Verify (check first and last matrix)
          printf("First element: %f (expected 3.0)\n", h_C[0]);
          printf("Last element: %f (expected 3.0)\n", h_C[batch_size * N * N - 1]);
          
          // Cleanup
          cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
          free(h_A); free(h_B); free(h_C);
      }

.. dropdown:: How does GPU handle 256,000 blocks?
  :icon: info
  :color: info

  **Block assignment to SMs**::

      GPU: NVIDIA A100 (108 SMs)
      Total blocks: 256,000 blocks
      Total warps: 2,048,000 warps (each block has 8 warps)
      
      Block assignment (initial wave):
      256,000 blocks ÷ 108 SMs ≈ 2,370 blocks per SM
      
      Block to SM mapping:
      SM 0: Assigned blocks 0, 108, 216, ... (2,370 blocks total)
      SM 1: Assigned blocks 1, 109, 217, ... (2,370 blocks total)
      ...
      SM 107: Assigned blocks 107, 215, 323, ...

  **How SMs actually execute: warp-level scheduling**::

      Important: SMs don't execute "blocks" - they execute warps!
      
      Each block (256 threads) contains 8 warps (32 threads each).
      
      Inside SM 0 (each SM operates independently):
      ┌─────────────────────────────────────────┐
      │ SM 0 Warp Scheduler                     │
      │                                         │
      │ Warps assigned to THIS SM:              │
      │ - Block 0, Warp 0   (ready)   ← execute │
      │ - Block 0, Warp 1   (stalled) ← skip    │ 
      │ - Block 0, Warp 2   (ready)   ← execute │
      │ - ...                                   │
      │ - Block 108, Warp 0 (ready)   ← execute │
      │ - Block 108, Warp 1 (ready)   ← execute │
      │ - ...                                   │
      └─────────────────────────────────────────┘
      
      Each SM has its own warp scheduler that:
      1. Manages only the warps from blocks assigned to THIS SM
      2. Every few clock cycles, picks ready warps (not stalled on memory)
      3. Executes those warps on this SM's CUDA cores
      4. When a warp stalls (memory access), switches to another ready warp
      
      Key point: Each SM independently switches between its own warps.
      The 108 SMs on A100 all do this in parallel, each managing their 
      own set of blocks and warps.
      
  **Timeline**::

      The GPU doesn't process blocks sequentially. Instead:
      
      - Blocks are assigned to SMs as resources become available
      - Within each SM, warps from multiple blocks execute concurrently
      - Warp scheduler switches between warps every few cycles
      - When a block finishes, SM gets assigned a new block
      
      This happens automatically—no explicit management needed!


Tutorial 4: Processing large batches with limited VRAM
-------------------------------------------------------

The previous example processed 1,000 matrices. What if you need to process 100,000 matrices but your GPU VRAM can only hold 1,000 at a time? You process them in chunks using a loop.

  **Problem setup**::

      Given:
      - Matrix size: 256×256 elements per matrix
      - Total matrices: 100,000 matrices (entire dataset)
      - Chunk size: 1,000 matrices (GPU VRAM limit per chunk)
      
      Calculate:
      - Total elements: 100,000 matrices × 256 × 256 = 6,553,600,000 elements
      - Elements per chunk: 1,000 matrices × 256 × 256 = 65,536,000 elements
      - Number of chunks: 100,000 matrices ÷ 1,000 matrices/chunk = 100 chunks
      
      Solution: Process in 100 chunks of 1,000 matrices each
      
      Chunk 0: Matrices 0-999
      Chunk 1: Matrices 1,000-1,999
      Chunk 2: Matrices 2,000-2,999
      ...
      Chunk 99: Matrices 99,000-99,999

  **Implementation strategy**::

      For each chunk:
      1. Copy 1,000 matrices from host (CPU) to device (GPU)
      2. Launch kernel to process these 1,000 matrices
      3. Copy results back from device to host
      4. Repeat for next chunk
      
      This is called "streaming" or "pipelined processing"

Here is the complete implementation:

  .. code-block:: cpp

      #include <cuda_runtime.h>
      #include <cstdio>

      // Same kernel as before (processes up to 1,000 matrices)
      __global__ void batchMatAdd(float* d_A, float* d_B, float* d_C, 
                                   int N, int batch_size)
      {
          int batch_idx = blockIdx.z;
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          
          if (row < N && col < N && batch_idx < batch_size) {
              int matrix_offset = batch_idx * N * N;
              int element_offset = row * N + col;
              int index = matrix_offset + element_offset;
              
              d_C[index] = d_A[index] + d_B[index];
          }
      }

      int main()
      {
          const int N = 256;                    // Matrix dimension
          const int total_matrices = 100000;    // Total dataset size
          const int chunk_size = 1000;          // GPU memory limit
          const int num_chunks = total_matrices / chunk_size;
          
          // Size for one chunk
          size_t chunk_bytes = chunk_size * N * N * sizeof(float);
          
          // Allocate host memory for entire dataset
          float* h_A_all = (float*)malloc(total_matrices * N * N * sizeof(float));
          float* h_B_all = (float*)malloc(total_matrices * N * N * sizeof(float));
          float* h_C_all = (float*)malloc(total_matrices * N * N * sizeof(float));
          
          // Initialize data (would typically load from file)
          for (int i = 0; i < total_matrices * N * N; i++) {
              h_A_all[i] = 1.0f;
              h_B_all[i] = 2.0f;
          }
          
          // Allocate device memory (only enough for one chunk)
          float *d_A, *d_B, *d_C;
          cudaMalloc(&d_A, chunk_bytes);
          cudaMalloc(&d_B, chunk_bytes);
          cudaMalloc(&d_C, chunk_bytes);
          
          // Grid configuration for 1,000 matrices
          dim3 threadsPerBlock(16, 16);
          dim3 numBlocks(N/16, N/16, chunk_size);
          
          printf("Processing %d matrices in %d chunks of %d\n", 
                 total_matrices, num_chunks, chunk_size);
          
          // Process each chunk
          for (int chunk = 0; chunk < num_chunks; chunk++) {
              // Calculate starting position in the full dataset
              // Chunk 0: offset = 0 * 1000 * 256 * 256 = 0 (start at matrix 0)
              // Chunk 1: offset = 1 * 1000 * 256 * 256 = 65,536,000 (start at matrix 1000)
              // Chunk 2: offset = 2 * 1000 * 256 * 256 = 131,072,000 (start at matrix 2000)
              int offset = chunk * chunk_size * N * N;
              
              printf("Processing chunk %d/%d (matrices %d-%d)\n",
                     chunk + 1, num_chunks, 
                     chunk * chunk_size, (chunk + 1) * chunk_size - 1);
              
              // Copy chunk from host to device
              cudaMemcpy(d_A, h_A_all + offset, chunk_bytes, cudaMemcpyHostToDevice);
              cudaMemcpy(d_B, h_B_all + offset, chunk_bytes, cudaMemcpyHostToDevice);
              
              // Process this chunk
              batchMatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N, chunk_size);
              
              // Copy results back
              cudaMemcpy(h_C_all + offset, d_C, chunk_bytes, cudaMemcpyDeviceToHost);
          }
          
          // Verify results
          bool success = true;
          for (int i = 0; i < total_matrices * N * N; i++) {
              if (h_C_all[i] != 3.0f) {
                  success = false;
                  break;
              }
          }
          printf("Result: %s\n", success ? "PASS" : "FAIL");
          
          // Cleanup
          cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
          free(h_A_all); free(h_B_all); free(h_C_all);
          
          return 0;
      }

Tutorial 5: Pipelined processing with CUDA streams
---------------------------------------------------

Tutorial 4 processes chunks **sequentially** - each chunk must fully complete (copy to GPU → compute → copy back) before the next chunk begins. This leaves the GPU idle during data transfers and the CPU waiting during computation.

Tutorial 5 uses **CUDA streams** to process multiple chunks **concurrently** - while one chunk computes on the GPU, another chunk's data is being copied in, and a third chunk's results are being copied out. This dramatically improves hardware utilization.

  **Key concept**::

      Without streams (previous example):
      ════════════════════════════════════════════════════════════
      Chunk 0: [H→D]---[Compute]---[D→H]
      Chunk 1:                           [H→D]---[Compute]---[D→H]
      Chunk 2:                                                     [H→D]---[Compute]---[D→H]
      
      GPU idle during copies! CPU waits for GPU!

      With streams (this example):
      ════════════════════════════════════════════════════════════
      Stream 0: [H→D]---[Compute]---[D→H]
      Stream 1:      [H→D]---[Compute]---[D→H]
      Stream 2:           [H→D]---[Compute]---[D→H]
      
      Operations overlap! Much better utilization!

  **Why asynchronous operations matter**::

      The regular ``cudaMemcpy`` is blocking - when you call it, your CPU waits 
      until the entire transfer completes before moving to the next line of code. 
      Similarly, kernel launches with the default stream wait for all previous 
      operations to finish.
      
      This creates idle time:
      - While data copies from CPU to GPU, the GPU compute units sit idle
      - While GPU computes, the CPU cannot prepare the next chunk
      - While results copy back, both CPU and GPU wait
      
      ``cudaMemcpyAsync`` is non-blocking - it returns immediately after queuing 
      the transfer, letting your CPU continue executing code. Combined with 
      streams, this allows multiple chunks to be processed concurrently. While 
      one chunk computes in stream 0, another chunk can copy data in via 
      stream 1, and a third chunk can copy results out via stream 2 - all 
      happening simultaneously.

  **Requirements for overlapping**::

      1. Use ``cudaMemcpyAsync`` instead of cudaMemcpy (non-blocking transfers)
      2. Use pinned (page-locked) host memory with cudaMallocHost (enables DMA)
      3. Use multiple CUDA streams to manage concurrent operations
      4. Launch kernels in different streams (allows parallel execution)

Complete implementation
~~~~~~~~~~~~~~~~~~~~~~~

Here is the full code for Tutorial 5. Read through it first to see the overall structure, then continue to the section-by-section breakdown below.

  **Key parameters**::

      Given (same dataset as Tutorial 4):
      - Matrix size: 256×256 elements per matrix
      - Total matrices: 100,000 matrices (entire dataset)
      - Chunk size: 1,000 matrices (amount processed per chunk)
      - Number of streams: 4 streams (new in Tutorial 5!)
      
      Calculate:
      - Total chunks: 100,000 ÷ 1,000 = 100 chunks
      - Chunks per stream: 100 chunks ÷ 4 streams = 25 chunks/stream
      - Concurrent chunks: 4 chunks processing simultaneously (one per stream)
      
      Key difference from Tutorial 4:
      Tutorial 4 processes 1 chunk at a time sequentially (100 iterations)
      Tutorial 5 processes 4 chunks at a time concurrently (still 100 iterations, 
      but chunks overlap in execution)

  .. code-block:: cpp

      #include <cuda_runtime.h>
      #include <cstdio>

      __global__ void batchMatAdd(float* d_A, float* d_B, float* d_C, 
                                   int N, int batch_size)
      {
          int batch_idx = blockIdx.z;
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          
          if (row < N && col < N && batch_idx < batch_size) {
              int matrix_offset = batch_idx * N * N;
              int element_offset = row * N + col;
              int index = matrix_offset + element_offset;
              
              d_C[index] = d_A[index] + d_B[index];
          }
      }

      int main()
      {
          const int N = 256;
          const int total_matrices = 100000;
          const int chunk_size = 1000;
          const int num_chunks = total_matrices / chunk_size;
          const int num_streams = 4;
          
          size_t chunk_bytes = chunk_size * N * N * sizeof(float);
          
          // Allocate pinned host memory
          float *h_A_all, *h_B_all, *h_C_all;
          cudaMallocHost(&h_A_all, total_matrices * N * N * sizeof(float));
          cudaMallocHost(&h_B_all, total_matrices * N * N * sizeof(float));
          cudaMallocHost(&h_C_all, total_matrices * N * N * sizeof(float));
          
          // Initialize data
          for (int i = 0; i < total_matrices * N * N; i++) {
              h_A_all[i] = 1.0f;
              h_B_all[i] = 2.0f;
          }
          
          // Allocate device memory for each stream
          float *d_A[num_streams], *d_B[num_streams], *d_C[num_streams];
          cudaStream_t streams[num_streams];
          
          for (int i = 0; i < num_streams; i++) {
              cudaMalloc(&d_A[i], chunk_bytes);
              cudaMalloc(&d_B[i], chunk_bytes);
              cudaMalloc(&d_C[i], chunk_bytes);
              cudaStreamCreate(&streams[i]);
          }
          
          dim3 threadsPerBlock(16, 16);
          dim3 numBlocks(N/16, N/16, chunk_size);
          
          // Process chunks using streams
          for (int chunk = 0; chunk < num_chunks; chunk++) {
              int stream_idx = chunk % num_streams;
              int offset = chunk * chunk_size * N * N;
              
              cudaMemcpyAsync(d_A[stream_idx], h_A_all + offset, chunk_bytes, 
                            cudaMemcpyHostToDevice, streams[stream_idx]);
              cudaMemcpyAsync(d_B[stream_idx], h_B_all + offset, chunk_bytes, 
                            cudaMemcpyHostToDevice, streams[stream_idx]);
              
              batchMatAdd<<<numBlocks, threadsPerBlock, 0, streams[stream_idx]>>>(
                  d_A[stream_idx], d_B[stream_idx], d_C[stream_idx], N, chunk_size);
              
              cudaMemcpyAsync(h_C_all + offset, d_C[stream_idx], chunk_bytes, 
                            cudaMemcpyDeviceToHost, streams[stream_idx]);
          }
          
          // Wait for all streams to complete
          for (int i = 0; i < num_streams; i++) {
              cudaStreamSynchronize(streams[i]);
          }
          
          // Verify results
          bool success = true;
          for (int i = 0; i < total_matrices * N * N; i++) {
              if (h_C_all[i] != 3.0f) {
                  success = false;
                  break;
              }
          }
          printf("Result: %s\n", success ? "PASS" : "FAIL");
          
          // Cleanup
          for (int i = 0; i < num_streams; i++) {
              cudaFree(d_A[i]); cudaFree(d_B[i]); cudaFree(d_C[i]);
              cudaStreamDestroy(streams[i]);
          }
          cudaFreeHost(h_A_all); cudaFreeHost(h_B_all); cudaFreeHost(h_C_all);
          
          return 0;
      }

Code breakdown
~~~~~~~~~~~~~~

Now let's examine each section in detail to understand how streams enable concurrent processing.

**Section 1: Allocate pinned memory**

  .. code-block:: cpp

      // Tutorial 4 used: malloc(total_matrices * N * N * sizeof(float))
      // Tutorial 5 uses: cudaMallocHost for pinned memory
      
      float *h_A_all, *h_B_all, *h_C_all;
      cudaMallocHost(&h_A_all, total_matrices * N * N * sizeof(float));
      cudaMallocHost(&h_B_all, total_matrices * N * N * sizeof(float));
      cudaMallocHost(&h_C_all, total_matrices * N * N * sizeof(float));

Here we use pinned memery. This is required for asynchronous.


.. dropdown:: What is pinned memory?
  :icon: question
  :color: info

  **Important: Pinned memory is HOST (CPU) memory, not GPU memory!**

  Pinned memory (also called page-locked memory) is a special type of CPU RAM 
  that the operating system guarantees will stay at a fixed physical location 
  and never be swapped to disk.

  **Memory location clarification**::

      Computer Architecture:
      ┌─────────────────────────────────────────────────────────┐
      │ CPU                              GPU                    │
      │                                                         │
      │ RAM (Host Memory)                VRAM (Device Memory)   │
      │ ┌─────────────────────┐          ┌──────────────────┐   │
      │ │ Regular malloc()    │          │ cudaMalloc()     │   │
      │ │ (pageable)          │          │ (always pinned)  │   │
      │ ├─────────────────────┤          └──────────────────┘   │
      │ │ cudaMallocHost()    │                ↑                │
      │ │ (pinned/page-locked)│────DMA────────┘                 │
      │ └─────────────────────┘       Fast async transfer       │
      │         ↑                                               │
      │    Lives in CPU RAM, not GPU VRAM!                      │
      └─────────────────────────────────────────────────────────┘

  **Regular host memory (pageable, malloc)**::

      float* data = (float*)malloc(1000 * sizeof(float));
      
      Where it lives: CPU RAM (pageable memory)
      
      What happens:
      - OS gives you virtual memory addresses in CPU RAM
      - Physical location in CPU RAM can change over time
      - OS can move pages to disk (swap) if RAM is low
      - When you access the data, OS might need to swap it back from disk
      
      This is good: Lets you use more memory than physically available
      This is bad: GPU cannot directly access memory that might move or swap

  **Pinned host memory (page-locked, cudaMallocHost)**::

      float* data;
      cudaMallocHost(&data, 1000 * sizeof(float));
      
      Where it lives: CPU RAM (pinned/page-locked memory)
      
      What happens:
      - Memory is allocated at a fixed physical CPU RAM address
      - OS promises this CPU RAM will never move to a different address
      - OS promises this CPU RAM will never swap to disk
      - GPU can directly read from this CPU RAM via DMA (Direct Memory Access)
      
      This is good: GPU can transfer from CPU RAM without OS intervention (fast, async)
      This is bad: Reduces CPU RAM available for other programs (limited resource)

.. dropdown:: What problem does non-pinned memory solve?
  :icon: question
  :color: info

  OS uses virtual memory to let programs use more memory than 
  physically available in RAM. The OS divides memory into fixed-size blocks 
  called pages (typically 4 KB each). When RAM gets full, the OS moves less-used 
  pages from RAM to disk storage. This process is called swapping or paging 
  (paging = moving pages between RAM and disk). This lets your computer 
  run many programs simultaneously even with limited RAM.

  When you access swapped-out memory, the OS transparently swaps it back from 
  disk to RAM. This is invisible to your program but creates unpredictable delays. 
  For pageable memory (memory divided into pages that can move), allocated with 
  malloc, the OS has full control and can swap pages to disk at any time based 
  on memory pressure.

  For GPU transfers, this creates problems because the GPU needs stable physical 
  RAM addresses. If memory is on disk, the GPU cannot access it until the OS 
  swaps it back. Pinned memory prevents swapping by telling the OS "keep these 
  pages in RAM permanently" - the OS locks those pages in physical RAM and never 
  moves them to disk, giving the GPU reliable direct access via DMA.

**Section 2: Create multiple device buffers**

  .. code-block:: cpp

      // Tutorial 4: One pointer (float *d_A) pointing to one GPU memory allocation
      // Tutorial 5: Array of pointers, each pointing to separate GPU memory
      
      // Declare arrays of pointers (with num_streams = 4, this creates 4-element arrays)
      // d_A is a 4-element array: d_A[0], d_A[1], d_A[2], d_A[3]
      // Each element is a pointer that will point to separate GPU memory
      float *d_A[num_streams], *d_B[num_streams], *d_C[num_streams];
      
      // Array to hold 4 stream objects (execution contexts)
      cudaStream_t streams[num_streams];
      
      // Create memory allocations and streams
      for (int i = 0; i < num_streams; i++) {
          // Allocate GPU memory for stream i's input A
          // &d_A[i] returns CPU RAM address where pointer d_A[i] is stored
          // cudaMalloc writes GPU VRAM address into d_A[i]
          cudaMalloc(&d_A[i], chunk_bytes);
        
          // Allocate GPU memory for stream i's input B
          cudaMalloc(&d_B[i], chunk_bytes);
  
          // Allocate GPU memory for stream i's output C
          cudaMalloc(&d_C[i], chunk_bytes);
          
          // Create execution stream i
          // Each stream is an independent queue of GPU operations
          cudaStreamCreate(&streams[i]);
      }

**Why multiple buffers?** If all streams shared one set of buffers, they would overwrite each other's data. Each stream needs its own workspace in GPU memory to process chunks independently. This multiplies VRAM usage by the number of streams, but enables true concurrency.

.. dropdown:: What is a buffer and why is it called that?
  :icon: question
  :color: info

  A buffer is just memory. We call it a "buffer" because it temporarily holds 
  data flowing between systems that operate at different speeds (CPU ↔ GPU).

  **Why buffers are critical for streaming**::

      Without multiple buffers:
      - Stream 0 copies data into d_A
      - Stream 1 tries to copy into d_A → OVERWRITES stream 0's data!
      - Result: Data corruption, wrong results
      
      With multiple buffers (one per stream):
      - Stream 0 uses d_A[0], d_B[0], d_C[0]
      - Stream 1 uses d_A[1], d_B[1], d_C[1]
      - Stream 2 uses d_A[2], d_B[2], d_C[2]
      - Stream 3 uses d_A[3], d_B[3], d_C[3]
      - Result: Each stream has its own workspace, no conflicts!

  This is why Tutorial 5 needs ``num_streams`` sets of buffers. Each stream 
  operates independently on its own data, enabling true concurrent processing. 
  Without separate buffers, streams cannot overlap their operations.

**Section 3: Process chunks asynchronously**

Recall what we're working with:

- **Total dataset**: 100,000 matrices (each 256×256 elements)
- **Chunk size**: 1,000 matrices per chunk
- **Total chunks**: 100 chunks to process (100,000 ÷ 1,000)
- **Number of streams**: 4 streams for concurrent processing

Terminology reminder:

- **Chunk**: A portion of data (1,000 matrices). What we're processing.
- **Stream**: An execution context on the GPU. How we process it concurrently.

The loop processes all 100 chunks by assigning each chunk to one of the 4 streams in round-robin fashion. Multiple streams process different chunks simultaneously.

  .. code-block:: cpp

      // Process all 100 chunks, assigning each to a stream in round-robin fashion
      for (int chunk = 0; chunk < num_chunks; chunk++) {
          // Determine which stream to use (cycles through 0, 1, 2, 3, 0, 1, ...)
          // Chunk 0→Stream 0, Chunk 1→Stream 1, Chunk 4→Stream 0 (reuse), etc.
          int stream_idx = chunk % num_streams;
          
          // Calculate starting element offset (not byte offset!) for this chunk
          // "offset" tells us where this chunk's data starts in the h_A_all array
          // Chunk 0: offset = 0 × 1000 × 256 × 256 = 0 (start at element 0)
          // Chunk 1: offset = 1 × 1000 × 256 × 256 = 65,536,000 (skip first 1000 matrices)
          // Chunk 2: offset = 2 × 1000 × 256 × 256 = 131,072,000 (skip first 2000 matrices)
          // This is an element index, not bytes. We use it with pointer arithmetic.
          int offset = chunk * chunk_size * N * N;
          
          // Copy input matrix A for this chunk from CPU RAM to GPU VRAM
          // Source: h_A_all + offset (pinned CPU RAM, starting at chunk's data)
          // Destination: d_A[stream_idx] (GPU VRAM buffer for this stream)
          // Returns immediately (async), CPU continues to next line
          cudaMemcpyAsync(d_A[stream_idx], h_A_all + offset, chunk_bytes, 
                        cudaMemcpyHostToDevice, streams[stream_idx]);
          
          // Copy input matrix B for this chunk from CPU RAM to GPU VRAM
          // Uses same stream, so this waits for previous operation in this stream
          cudaMemcpyAsync(d_B[stream_idx], h_B_all + offset, chunk_bytes, 
                        cudaMemcpyHostToDevice, streams[stream_idx]);
          
          // Launch kernel to process this chunk in GPU VRAM
          // Reads from: d_A[stream_idx], d_B[stream_idx] (inputs in VRAM)
          // Writes to: d_C[stream_idx] (output in VRAM)
          // Fourth argument (0) is shared memory size (not used here)
          // Last argument specifies which stream to use
          batchMatAdd<<<numBlocks, threadsPerBlock, 0, streams[stream_idx]>>>(
              d_A[stream_idx], d_B[stream_idx], d_C[stream_idx], N, chunk_size);
          
          // Copy result back from GPU VRAM to CPU RAM
          // Source: d_C[stream_idx] (GPU VRAM with computed results)
          // Destination: h_C_all + offset (pinned CPU RAM, write at chunk's location)
          // Returns immediately (async), loop continues to process next chunk
          cudaMemcpyAsync(h_C_all + offset, d_C[stream_idx], chunk_bytes, 
                        cudaMemcpyDeviceToHost, streams[stream_idx]);
      }

  **Offset and pointer arithmetic visualization**::

      h_A_all array (100,000 matrices in CPU RAM):
      ┌──────────────────┬──────────────────┬──────────────────┬─────┬──────────────────┐
      │  Chunk 0         │  Chunk 1         │  Chunk 2         │ ... │  Chunk 99        │
      │  1000 matrices   │  1000 matrices   │  1000 matrices   │     │  1000 matrices   │
      │  65,536,000 elem │  65,536,000 elem │  65,536,000 elem │     │  65,536,000 elem │
      └──────────────────┴──────────────────┴──────────────────┴─────┴──────────────────┘
       ↑                  ↑                  ↑                         ↑
       offset=0           offset=65,536,000  offset=131,072,000        offset=6,488,064,000
       h_A_all+0          h_A_all+65536000   h_A_all+131072000         h_A_all+6488064000
      
      When chunk=1:
      offset = 1 × 1000 × 256 × 256 = 65,536,000
      
      cudaMemcpyAsync(d_A[stream_idx], h_A_all + offset, chunk_bytes, ...)
                                       ↑
                                       Pointer arithmetic!
                                       h_A_all is float*, so +offset means:
                                       "skip 65,536,000 float elements"
                                       = skip 65,536,000 × 4 bytes
                                       = skip 262,144,000 bytes
                                       = start at chunk 1's data
      
      Key point: When you add to a pointer in C/C++, the compiler automatically 
      multiplies by sizeof(type). So h_A_all + offset means "start offset floats 
      into the array", not "start offset bytes into memory".

Here, ``cudaMemcpyAsync`` returns immediately, letting the loop continue. Operations within a stream execute in order (copy → compute → copy), but different streams execute concurrently. While stream 0 computes, stream 1 copies data in, and stream 2 copies results out.

  **What actually happens: CPU queues work, GPU executes it**::

      CPU behavior (very fast, microseconds per iteration):
      - Loop runs through all 100 chunks rapidly
      - Each iteration just QUEUES operations (doesn't wait for them)
      - `cudaMemcpyAsync` returns immediately (just adds to stream's queue)
      - Kernel launch returns immediately (just adds to stream's queue)
      - CPU finishes the entire loop in milliseconds
      - Then CPU continues to other work (or waits at cudaStreamSynchronize)
      
      GPU behavior (slow, processes when ready):
      - Receives queued operations in each stream
      - Executes operations in each stream sequentially (H→D, compute, D→H)
      - BUT executes different streams concurrently
      - Takes seconds to process all 100 chunks
      
      Timeline example (to scale):
      
      Time →    0ms          1ms                                    2000ms (2 seconds)
                ▼            ▼                                      ▼
      CPU:      |████| Loop finishes instantly! → [CPU free to do other work]
                     ↑
                     All 100 chunks queued
      
      GPU:      |─────────────────────────────────────────────────────────────────|
      Stream 0: |─H→D─|─Compute─|─D→H─|─H→D─|─Compute─|─D→H─|... (chunks 0,4,8,12...)
      Stream 1:     |─H→D─|─Compute─|─D→H─|─H→D─|─Compute─|─D→H─|... (chunks 1,5,9,13...)
      Stream 2:         |─H→D─|─Compute─|─D→H─|─H→D─|─Compute─|... (chunks 2,6,10,14...)
      Stream 3:             |─H→D─|─Compute─|─D→H─|─H→D─|─Compute| (chunks 3,7,11,15...)
                            ↑ GPU crunching numbers for ~2 seconds total
      
      What happens in practice:
      1. for loop runs (100 iterations): ~1 millisecond
         - Each iteration just queues 3 async operations per stream
         - No waiting, just scheduling work
         - Loop exits immediately after queuing all 100 chunks
      
      2. GPU processes queued work: ~2000 milliseconds (2 seconds)
         - Works through each stream's queue sequentially
         - Processes 4 streams concurrently
         - Stream 0 processes chunks 0→4→8→12→...→96 (25 chunks total)
         - Stream 1 processes chunks 1→5→9→13→...→97 (25 chunks total)
         - Stream 2 processes chunks 2→6→10→14→...→98 (25 chunks total)
         - Stream 3 processes chunks 3→7→11→15→...→99 (25 chunks total)

.. dropdown:: How to choose the number of streams?
  :icon: flame
  :color: danger

  The optimal number of streams depends on the balance between copy time and compute time for your workload.

  **Theoretical guideline (simplified rule of thumb)**::

      Goal: Keep GPU compute fully utilized (never idle waiting for data)
      
      To achieve full overlap, while GPU computes one chunk, you need enough 
      streams to be copying data for upcoming chunks.
      
      Derivation:r
      - During compute time T, GPU is busy computing
      - During that same time T, we need to finish copying for the next chunk
      - Copy time for one chunk = C (includes both H→D and D→H)
      - Number of chunks we can copy during T = T / C
      - We need at least T / C streams to keep pipeline full
      
      Formula: num_streams ≥ C / T (this keeps compute continuously fed)
      
      However, the formula num_streams ≥ 2 × (C / T) + 1 adds safety margin for:
      - Separate H→D and D→H copy engines (factor of 2)
      - One extra stream for buffer (the +1)
      
      Example:
      - Copy H→D: 10 ms
      - Compute:  50 ms  
      - Copy D→H: 10 ms
      - Total copy: C = 20 ms, Compute: T = 50 ms
      
      Basic formula: 20 / 50 = 0.4 → Need at least 1 stream
      Conservative formula: 2 × (20 / 50) + 1 = 1.8 → Round up to 2 streams
      
      Reality check: This is a rough guideline, not a law! 
      - Assumes perfect overlap (rarely happens in practice)
      - Ignores PCIe bandwidth limits, copy engine contention
      - Best approach: Start with 4 streams, then measure and tune

  **Practical considerations**::

      1. VRAM limitation (most important):
         - Each stream needs its own buffers (d_A, d_B, d_C)
         - More streams = more VRAM usage
         - Example: 4 streams × 3 buffers × 1000 matrices × 256² × 4 bytes = 3 GB
         - If VRAM is limited, reduce num_streams or chunk_size
      
      2. Compute-bound vs memory-bound:
         - Compute-bound (kernel time >> copy time): Use 2-4 streams
         - Memory-bound (copy time >> kernel time): Use 4-8 streams
         - Balanced workload: Use 3-4 streams
      
      3. Diminishing returns:
         - Beyond 4-8 streams, speedup plateaus
         - More streams = more synchronization overhead
         - GPU hardware has limited concurrent copy engines (typically 2)
      
      4. Hardware constraints:
         - PCIe lanes limit concurrent transfers
         - GPU copy engines (typically 2: one H→D, one D→H)
         - SM occupancy and scheduling capacity

  **Typical recommendations**::

      - Start with 4 streams (good balance for most workloads)
      - If compute-heavy: Try 2-3 streams
      - If memory-bound: Try 6-8 streams
      - Always measure performance with your specific workload
      - Watch VRAM usage: nvidia-smi to monitor memory consumption

  **How to test**::

      Run your code with different num_streams values (2, 4, 8, 16) and measure:
      - Total execution time (lower is better)
      - VRAM usage (must fit in available memory)
      - GPU utilization (nvidia-smi dmon -s u)
      
      Choose the smallest num_streams that gives near-maximum speedup.


.. dropdown:: What speedup can we expect with streaming?
  :icon: info
  :color: info

  **Ideal speedup calculation**::

      Assume for one chunk:
      - Copy H→D: 10 ms
      - Compute:  50 ms
      - Copy D→H: 10 ms
      - Total:    70 ms per chunk

      Without streams (100 chunks):
      Total time = 100 × 70 ms = 7,000 ms = 7 seconds

      With streams (4 concurrent):
      Pipeline fills up after first 4 chunks, then processes 4 at a time.
      
      After pipeline fills:
      - 4 chunks complete every ~50 ms (limited by slowest operation)
      - 100 chunks ÷ 4 streams × 50 ms ≈ 1,250 ms
      - Plus initial fill time: ~200 ms
      - Total: ~1,450 ms = 1.45 seconds
      
      Speedup: 7.0 / 1.45 ≈ 4.8× faster!

  **Real-world considerations**::

      - PCIe bandwidth limits (16 GB/s typical)
      - GPU compute capability and occupancy
      - Memory allocation overhead
      - Synchronization costs
      
      Typical real speedup: 2-3× (still very significant!)


**Section 4: Synchronize and verify**

  .. code-block:: cpp

      // Wait for all streams to finish before reading results
      for (int i = 0; i < num_streams; i++) {
          cudaStreamSynchronize(streams[i]);
      }
      
      // Now it's safe to verify results in h_C_all
      bool success = true;
      for (int i = 0; i < total_matrices * N * N; i++) {
          if (h_C_all[i] != 3.0f) {
              success = false;
              break;
          }
      }

Why synchronize? Because all operations are asynchronous, the loop completes before the GPU finishes processing. Without synchronization, you would read ``h_C_all`` while async copies are still writing to it, causing race conditions. cudaStreamSynchronize blocks the CPU until that stream completes all queued operations.

Tutorial 6: Matrix multiplication with shared memory (1024×1024)
-----------------------------------------------------------------

In previous tutorials, we did matrix addition where each thread computes `C[i][j] = A[i][j] + B[i][j]`. This is simple: read two values, add them, write one value. Matrix multiplication is very different and much more complex.

**Key difference: Each thread does MUCH more work!**

- **Matrix addition**: Thread reads 2 values, does 1 addition, writes 1 value
- **Matrix multiplication (N=1024)**: Thread reads 2048 values, does 1024 multiplications and 1024 additions, writes 1 value

For matrix multiplication, computing just ONE output element C[i][j] requires:
- Reading an entire row from A (1024 values)
- Reading an entire column from B (1024 values)  
- Computing 1024 multiplications and summing them up

This creates a performance bottleneck that we'll solve step by step using shared memory.

**Problem size**: We'll work with 1024×1024 matrices throughout this tutorial. This large size makes the performance issues obvious and demonstrates why shared memory optimization is essential.

Understanding matrix multiplication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's start with what matrix multiplication actually does:

  **Matrix multiplication formula**::

      C[i][j] = Σ(A[i][k] × B[k][j]) for k = 0 to N-1
      
      Example for N=4:
      
      Matrix A (4×4):        Matrix B (4×4):        Matrix C (4×4):
      ┌           ┐          ┌           ┐          ┌           ┐
      │ 1  2  3  4│          │ 5  6  7  8│          │ 70  ...   │
      │ 5  6  7  8│    ×     │ 9 10 11 12│    =     │ ...  ...  │
      │ 9 10 11 12│          │13 14 15 16│          │ ...  ...  │
      │13 14 15 16│          │17 18 19 20│          │ ...  ...  │
      └           ┘          └           ┘          └           ┘
      
      Computing C[0][0]:
      C[0][0] = A[0][0]×B[0][0] + A[0][1]×B[1][0] + A[0][2]×B[2][0] + A[0][3]×B[3][0]
              = 1×5 + 2×9 + 3×13 + 4×17
              = 5 + 18 + 39 + 68
              = 130
      
      Key observation: Computing one element requires reading an entire row of A 
      and an entire column of B (N elements each).

Step 1: Simple implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's start with the most straightforward way to implement matrix multiplication on GPU. Each thread computes one output element.

  .. code-block:: cpp

      // Naive matrix multiplication kernel (slow - for educational purposes)
      __global__ void matMulNaive(float* A, float* B, float* C, int N)
      {
          // Calculate which element this thread computes
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          
          if (row < N && col < N) {
              float sum = 0.0f;
              
              // Each thread reads entire row of A and column of B from global memory
              for (int k = 0; k < N; k++) {
                  // Every access goes to slow global memory (400-800 cycle latency!)
                  sum += A[row * N + k] * B[k * N + col];
              }
              
              C[row * N + col] = sum;
          }
      }

      int main()
      {
          int N = 1024;  // 1024×1024 matrices
          size_t size = N * N * sizeof(float);
          
          // Allocate and initialize host matrices
          float *h_A = (float*)malloc(size);
          float *h_B = (float*)malloc(size);
          float *h_C = (float*)malloc(size);
          
          // Initialize matrices (for example, with random values)
          for (int i = 0; i < N * N; i++) {
              h_A[i] = rand() / (float)RAND_MAX;
              h_B[i] = rand() / (float)RAND_MAX;
          }
          
          // Allocate device matrices
          float *d_A, *d_B, *d_C;
          cudaMalloc(&d_A, size);
          cudaMalloc(&d_B, size);
          cudaMalloc(&d_C, size);
          
          // Copy data to device
          cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
          cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
          
          // Launch kernel with 16×16 thread blocks
          dim3 threadsPerBlock(16, 16);
          dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
          // Why (N + 15) / 16 instead of N / 16?
          // This is ceiling division: ensures we have enough blocks for any N
          // Examples:
          //   N = 1024: (1024 + 15) / 16 = 1039 / 16 = 64 blocks ✓
          //   N = 1020: (1020 + 15) / 16 = 1035 / 16 = 64 blocks (need 64, not 63!)
          //   N = 1000: (1000 + 15) / 16 = 1015 / 16 = 63 blocks (need 63, not 62!)
          // Formula: (N + TILE_SIZE - 1) / TILE_SIZE always rounds UP
          
          matMulNaive<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
          
          // Copy result back
          cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
          
          // Cleanup
          cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
          free(h_A); free(h_B); free(h_C);
      }

  **What each thread does (visual explanation)**::

      Example: Thread computing C[5][3] (row 5, column 3)
      
      Matrix A (1024×1024):              Matrix B (1024×1024):
      Row 0                              Col 0  Col 3 (target)
      Row 1                              │      │
      Row 2                              │      │
      Row 3                              │      │
      Row 4                              │      │
      Row 5 (target) ►►►►►►►►►►►►►►►     │      ▼
      ┌─────────────────────────────┐    ┌──────▓▓▓────┐
      │                             │    │      ▓▓▓    │
      │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│    │      ▓▓▓    │
      │                             │    │      ▓▓▓    │
      │                             │    │      ▓▓▓    │
      │                             │    │      ▓▓▓    │
      │                             │    │      ▓▓▓    │
      └─────────────────────────────┘    └──────▓▓▓────┘
       1024 elements (shaded)              1024 elements (shaded)
       from A[5][0] to A[5][1023]          from B[0][3] to B[1023][3]
      
      Computation:
      C[5][3] = A[5][0]×B[0][3] + A[5][1]×B[1][3] + ... + A[5][1023]×B[1023][3]
              = (1024 multiplications) + (1024 additions)
              = Reads 2048 values from global memory
      
      Grid configuration:
      - 64×64 = 4,096 blocks (covers entire 1024×1024 output)
      - 16×16 = 256 threads per block
      - Total: 1,048,576 threads, each computing one output element

This works correctly and is similar to how we've written previous kernels. But there's a performance problem we need to understand.

Step 2: Understanding the performance problem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's analyze what happens when this kernel runs. Focus on one block computing a 16×16 chunk of the output matrix C.

  **Memory access pattern visualization**::

      Block computing C[0:16][0:16] (256 threads) for 1024×1024 matrices
      
      Thread (0,0) computing C[0][0]:
      - Reads A[0][0], A[0][1], A[0][2], ..., A[0][1023]     (1024 reads)
      - Reads B[0][0], B[1][0], B[2][0], ..., B[1023][0]     (1024 reads)
      - Total: 2048 reads from global memory
      
      Thread (0,1) computing C[0][1]:
      - Reads A[0][0], A[0][1], A[0][2], ..., A[0][1023]     (same 1024 reads!)
      - Reads B[0][1], B[1][1], B[2][1], ..., B[1023][1]     (1024 reads)
      - Total: 2048 reads from global memory
      
      Notice: Thread (0,0) and Thread (0,1) BOTH read the same A[0][:] row!
      
      All 16 threads in row 0 read the same A[0][:] data → 16× redundant reads
      All 16 threads in col 0 read the same B[:][0] data → 16× redundant reads

  **The redundancy problem**::

      For a 16×16 block (256 threads), N=1024 matrices:
      
      Each thread: 2048 global memory reads
      Total reads by all threads: 256 × 2048 = 524,288 reads
      
      But how much unique data do we actually need?
      - Block needs 16 rows from A: 16 × 1024 = 16,384 elements
      - Block needs 16 cols from B: 1024 × 16 = 16,384 elements
      - Total unique data: 32,768 elements
      
      Redundancy factor: 524,288 / 32,768 = 16×
      
      We're reading the same data 16 times! This is wasteful.

  **Why is this a problem?**::

      Global memory (VRAM) is slow:
      - Latency: ~400-800 cycles per access
      - For 2048 reads per thread: 2048 × 400 = 819,200 cycles just waiting for memory!
      
      Even with 1000 threads computing in parallel, memory is the bottleneck.

.. dropdown:: What do you mean by "cycle"?
  :icon: question
  :color: info

  A **cycle** (or clock cycle) is the basic unit of time in a processor. It's one tick of the GPU's internal clock.

  **Simple analogy**::

      Think of a cycle like a heartbeat:
      - Your heart beats at ~60 beats per minute
      - GPU's clock "beats" at 1-2 GHz (1-2 billion beats per second!)
      - Each beat (cycle) is when the GPU can do one basic operation
  
  **GPU clock speed**::

      Modern GPU: 1.5 GHz clock speed
      - 1.5 GHz = 1,500,000,000 cycles per second
      - Each cycle = 1/1,500,000,000 seconds ≈ 0.67 nanoseconds
      - Extremely fast, but memory is still much slower!
  
  **Why cycles matter for memory**::

      When we say "400 cycles latency":
      - GPU requests data from global memory
      - Must wait 400 clock cycles before data arrives
      - During those 400 cycles, GPU could do 400 other operations!
      - This is why memory latency is expensive
      
      Shared memory at 20 cycles:
      - Only wait 20 clock cycles for data
      - 20× faster than 400 cycles
      - GPU wastes less time waiting, does more useful work
  
  **Real-world timing**::

      At 1.5 GHz GPU clock:
      - 400 cycles = 400 × 0.67 ns ≈ 267 nanoseconds (global memory)
      - 20 cycles = 20 × 0.67 ns ≈ 13 nanoseconds (shared memory)
      
      Even though nanoseconds sound fast, in GPU time this is huge!
      A GPU can execute billions of operations per second, so every 
      nanosecond counts.

Step 3: Introducing shared memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To solve the redundancy problem, we need a faster place to store data that **multiple threads can reuse**. This is called **shared memory**.

  **What is shared memory?**::

      GPU memory hierarchy (fastest to slowest):
      
      1. Registers (per thread)
         - Fastest: ~1 cycle latency
         - Tiny: ~255 registers per thread
         - Private to each thread
      
      2. Shared memory (per block)
         - Fast: ~20-40 cycle latency (10-20× faster than global memory!)
         - Small: ~48-96 KB per SM
         - Shared by all threads in a block (NOT divided by warp!)
         - All threads in the block can access any part of it
         - Explicitly managed by programmer
      
      3. Global memory (all threads)
         - Slow: ~400-800 cycle latency
         - Large: GBs of VRAM
         - Accessible by all threads
         - Where cudaMalloc allocates

  **Key insight**::

      If we load data from global memory into shared memory once,
      then all 256 threads in the block can reuse it from shared memory!
      
      Instead of:
      - 256 threads each reading A[0][k] from global memory (256× redundant)
      
      We can:
      - One thread loads A[0][k] into shared memory once
      - 256 threads read from shared memory (16× faster per access!)

.. dropdown:: Is shared memory divided per warp?
  :icon: question
  :color: info

  No! Shared memory is shared by the **entire block**, not divided among warps.

  **How shared memory works**::

      Block with 256 threads (8 warps × 32 threads):
      - Total shared memory allocated: 2 KB (two 16×16 tiles)
      - Each warp does NOT get 2KB ÷ 8 = 256 bytes
      - ALL 256 threads can access the FULL 2 KB
      
      Think of it like a shared whiteboard:
      - NOT: Each warp gets their own section of the whiteboard
      - YES: All threads can read/write anywhere on the whiteboard
  
  **Memory access pattern**::

      __shared__ float tileA[16][16];  // 1 KB total
      
      Block has 256 threads (warps 0-7):
      - Thread in warp 0 can read tileA[15][15]
      - Thread in warp 7 can read tileA[0][0]
      - Thread in warp 3 can read tileA[8][12]
      - Any thread can access any element!
      
      Shared memory is a single pool accessible by all threads in the block.
  
  **Why this matters**::

      If shared memory WERE divided per warp:
      - Warp 0 could only see 1/8 of the data
      - Warp 7 could only see their 1/8 of the data
      - Defeats the purpose! (We want ALL threads to reuse the SAME data)
      
      Because it's shared across the entire block:
      - All 256 threads can reuse the same loaded tiles
      - Perfect for matrix multiplication (all threads need same rows/columns)
      - This is what makes shared memory so powerful!
  
  **Important note**::

      While shared memory is accessible by all threads in a block, execution 
      happens in warps (32 threads at a time). So when multiple warps access 
      shared memory, the GPU schedules them appropriately to avoid conflicts.

But there's a catch: shared memory is small (48-96 KB). We can't fit entire 1024×1024 matrices. The solution is **tiling**.

.. dropdown:: What are typical shared memory sizes?
  :icon: flame
  :color: danger

  Shared memory size varies by GPU architecture and is much smaller than global memory (VRAM).

  **Shared memory per Streaming Multiprocessor (SM)**::

      GPU Architecture         Shared Memory per SM
      ─────────────────────────────────────────────
      Kepler (GTX 700)         48 KB
      Maxwell (GTX 900)        48 KB or 96 KB (configurable)
      Pascal (GTX 1000)        64 KB or 96 KB (configurable)
      Volta (V100)             96 KB
      Turing (RTX 2000)        64 KB
      Ampere (RTX 3000, A100)  100 KB (A100), 100 KB (RTX 3090)
      Ada Lovelace (RTX 4000)  100 KB
      Hopper (H100)            228 KB
  
  **Why so small?**::

      Global memory (VRAM):     8-80 GB (billions of bytes)
      Shared memory per SM:     48-228 KB (thousands of bytes)
      
      Ratio: ~1,000,000:1 difference!
      
      Shared memory is:
      - On-chip (physically on the GPU die, not external VRAM chips)
      - Extremely fast (20-40 cycle latency vs 400-800 for global memory)
      - Limited by chip area (silicon real estate is expensive)
      - Shared among ALL blocks running on that SM
  
  **Practical implications**::

      If your GPU has 96 KB shared memory per SM:
      - One 1024×1024 float matrix = 1024 × 1024 × 4 bytes = 4 MB
      - Can't fit even a tiny fraction in shared memory!
      - Must process in smaller tiles (e.g., 16×16 = 1 KB per tile)
      
      If you try to allocate too much:
      - Kernel launch fails with "out of resources" error
      - Or fewer blocks can run per SM (lower occupancy = slower)
      
      This is why tiling is essential for large matrices!

Step 4: Tiling strategy
~~~~~~~~~~~~~~~~~~~~~~~~

Since shared memory is too small for full matrices, we break the computation into smaller pieces called tiles. Let's build this understanding step by step.

4.1 The problem: Shared memory is too small
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To compute one 16×16 output tile C[0:16][0:16] from 1024×1024 matrices::
      we need data from both A and B:
      
      Visual showing what data is needed::
      
                Matrix A (1024×1024)                    Matrix B (1024×1024)
                ┌──────────────────────────────┐        ┌──────────────────────────────┐
                │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        Rows    │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        0-15    │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        (16     │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
        rows)   │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
                │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
                │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
                │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
                │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
                │                              │        │      ...     │               │
                │                              │        │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│               │
                └──────────────────────────────┘        └──────────────────────────────┘
                    ALL 1024 columns                       Cols 0-15 (16 cols)
                        ↓                                  ALL 1024 rows
                                                                ↓
                                                      
                                    Output: C[0:16][0:16]
                                    ┌──────────────┐
                                    │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│  16×16 tile
                                    │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│  (256 elements)
                                    │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
                                    │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
                                    │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
                                    └──────────────┘
      
      The math: Each element C[i][j] = Σ(A[i][k] × B[k][j]) for k = 0 to 1023
      
      Key insight: We need ALL 1024 values from each strip to compute the final result!
      - For C[0][0]: Need A[0][0] through A[0][1023] AND B[0][0] through B[1023][0]
      - For C[0][1]: Need A[0][0] through A[0][1023] AND B[0][1] through B[1023][1]
      - Cannot skip any values - all 1024 terms are needed for the sum
      
      Data requirements:
      - Need 16 rows of A: 16 × 1024 elements = 16,384 floats = 65 KB
      - Need 16 cols of B: 1024 × 16 elements = 16,384 floats = 65 KB
      - Total: 130 KB needed to keep in shared memory simultaneously
      
      But shared memory available: Only 48-96 KB per SM!
      
      130 KB > 96 KB → Doesn't fit! We need a different approach.

4.2 The solution: Break into smaller tiles
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Instead of loading all 1024 elements at once, we break them into manageable chunks::
      
      The key idea: C[i][j] = Σ(A[i][k] × B[k][j]) for k = 0 to 1023
      
      Break the 1024-wide strips into 64 sequential chunks of width 16:
      
      Matrix A (one row, 1024 elements total):
      ┌──────┬──────┬──────┬─────────┬──────┐
      │Tile 0│Tile 1│Tile 2│   ...   │Tile63│
      │  0-15│16-31 │32-47 │         │1008+ │
      └──────┴──────┴──────┴─────────┴──────┘
       1 KB   1 KB   1 KB             1 KB
      
      Matrix B (one column, 1024 elements total):
      ┌──────┐
      │Tile 0│  ← elements 0-15
      │  0-15│
      ├──────┤
      │Tile 1│  ← elements 16-31
      │16-31 │
      ├──────┤
      │Tile 2│  ← elements 32-47
      │32-47 │
      ├──────┤
      │  ... │
      ├──────┤
      │Tile63│  ← elements 1008-1023
      │1008+ │
      └──────┘
       Each tile = 1 KB
      
      Sequential processing (one iteration at a time):
      
      Iteration 0:  Load A[i][0:15]    and B[0:15][j]    → Compute partial sum (2 KB)
      Iteration 1:  Load A[i][16:31]   and B[16:31][j]   → Add to sum (2 KB)
      Iteration 2:  Load A[i][32:47]   and B[32:47][j]   → Add to sum (2 KB)
      ...
      Iteration 63: Load A[i][1008:1023] and B[1008:1023][j] → Final sum (2 KB)

4.3 How tile sharing works between threads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section shows how all 256 threads in a block cooperate and share data through iterations.

In iteration 0, the ENTIRE BLOCK loads TWO 16×16 tiles into shared memory::
      
      Shared memory (accessible by all 256 threads in the block):
      ┌─────────────────────┐  ┌─────────────────────┐
      │     tileA (16×16)   │  │     tileB (16×16)   │
      │   A[0:15][0:15]     │  │   B[0:15][0:15]     │
      │                     │  │                     │
      │   ONE copy shared   │  │   ONE copy shared   │
      │   by 256 threads    │  │   by 256 threads    │
      └─────────────────────┘  └─────────────────────┘
           1 KB                      1 KB
      
      Now all 256 threads work IN PARALLEL using these shared tiles:
      
      Thread (0,0): Uses row 0 of tileA, col 0 of tileB
                    → Computes partial sum for C[0][0]
                    → (2.1×1.5) + (3.4×2.3) + ... = 87.4
      
      Thread (0,1): Uses row 0 of tileA, col 1 of tileB
                    → Computes partial sum for C[0][1]
                    → (2.1×4.2) + (3.4×1.8) + ... = 73.2
      
      Thread (5,7): Uses row 5 of tileA, col 7 of tileB
                    → Computes partial sum for C[5][7]
                    → (6.3×2.7) + (4.1×5.9) + ... = 95.8
      
      Thread (15,15): Uses row 15 of tileA, col 15 of tileB
                      → Computes partial sum for C[15][15]
                      → (3.8×7.2) + (9.1×2.4) + ... = 81.3
      
      The magic: All 256 threads read from the SAME tiles in shared memory!
      - Tiles loaded ONCE from global memory (slow)
      - Accessed MANY TIMES by different threads (fast - shared memory)
      - No redundant loading!

Iteration 1: Block loads NEW tiles into shared memory (overwrites old data) ::

      ┌─────────────────────┐  ┌─────────────────────┐
      │     tileA (16×16)   │  │     tileB (16×16)   │
      │  A[0:15][16:31]     │  │  B[16:31][0:15]     │
      │                     │  │                     │
      │   NEW data from     │  │   NEW data from     │
      │   columns 16-31     │  │   rows 16-31        │
      └─────────────────────┘  └─────────────────────┘
      
      Same threads, NEW tiles, ACCUMULATE to existing sums:
      
      Thread (0,0): Uses row 0 of NEW tileA, col 0 of NEW tileB
                    → Adds to sum: 87.4 + [(5.7×3.2) + (2.8×6.1) + ...] = 87.4 + 64.8
                    → Current sum for C[0][0] = 152.2
      
      Thread (0,1): Uses row 0 of NEW tileA, col 1 of NEW tileB
                    → Adds to sum: 73.2 + [(5.7×1.9) + (2.8×4.5) + ...] = 73.2 + 58.6
                    → Current sum for C[0][1] = 131.8
      
      Thread (5,7): Uses row 5 of NEW tileA, col 7 of NEW tileB
                    → Adds to sum: 95.8 + [(3.4×8.1) + (7.2×3.3) + ...] = 95.8 + 71.4
                    → Current sum for C[5][7] = 167.2
      
      Thread (15,15): Uses row 15 of NEW tileA, col 15 of NEW tileB
                      → Adds to sum: 81.3 + [(6.1×4.7) + (8.3×2.1) + ...] = 81.3 + 69.5
                      → Current sum for C[15][15] = 150.8
      
      Pattern continues for 62 more iterations (iteration 2, 3, ..., 63)
      Each iteration: Load 2 new tiles → All threads compute → Accumulate sums


      After iteration 63, all threads have computed their full 1024-term sum::

      - Thread (0,0) has final C[0][0] = 2,847.
      - Thread (0,1) has final C[0][1] = 3,012.5
      - Thread (5,7) has final C[5][7] = 2,691.8
      - Thread (15,15) has final C[15][15] = 2,934.1
      
      Result: Each iteration uses only 2 KB, well within 48-96 KB shared memory limit!
      After 64 iterations, all 256 output elements are complete.

Let's follow ONE thread computing ONE output element through the entire process::

      Thread (3,5) in Block (0,0) computes C[3][5].
      
      The formula: C[3][5] = A[3][0]×B[0][5] + A[3][1]×B[1][5] + ... + A[3][1023]×B[1023][5]
                           = 1024 multiplication-and-add operations
      
      Thread (3,5)'s private variable:
      float sum = 0.0f;  // Lives in this thread's register, NOT shared!
      
      ─────────────────────────────────────────────────────────────────
      ITERATION 0: Process elements k = 0 to 15
      ─────────────────────────────────────────────────────────────────
      
      Step 1: ALL threads cooperatively load tiles (each thread loads 2 elements total)
              Thread (3,5) loads TWO elements:
              - For tileA: tileA[3][5] = A[3][5]
              - For tileB: tileB[3][5] = B[3][5]
              
              Thread (0,0) loads TWO elements:
              - For tileA: tileA[0][0] = A[0][0]
              - For tileB: tileB[0][0] = B[0][0]
              
              Thread (7,2) loads TWO elements:
              - For tileA: tileA[7][2] = A[7][2]
              - For tileB: tileB[7][2] = B[7][2]
              
              ... all 256 threads load, filling both tiles (512 elements total)
              
              Result: Shared memory now contains:
              tileA[0:15][0:15] = A[0:15][0:15]
              tileB[0:15][0:15] = B[0:15][0:15]
      
      Visual: What Thread (3,5) accesses across iterations
      
      Matrix A (1024×1024) - Thread (3,5) always uses row 3:
      Row 0
      Row 1
      Row 2
      Row 3 ►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►►
      ┌─────┬─────┬─────┬───┬─────┐
      │▓▓▓▓▓│     │     │...│     │  ▓ = Iter 0: A[3][0:15]
      │ 0-15│16-31│32-47│   │1008+│    Loads into tileA[3][5] = A[3][5]
      └─────┴─────┴─────┴───┴─────┘
      │     │▓▓▓▓▓│     │...│     │  ▓ = Iter 1: A[3][16:31]
      │     │16-31│     │   │     │    Loads into tileA[3][5] = A[3][21]
      └─────┴─────┴─────┴───┴─────┘
      │     │     │▓▓▓▓▓│...│     │  ▓ = Iter 2: A[3][32:47]
      │     │     │32-47│   │     │    Loads into tileA[3][5] = A[3][37]
      └─────┴─────┴─────┴───┴─────┘
                    ... continues for 64 iterations
      │     │     │     │...│▓▓▓▓▓│  ▓ = Iter 63: A[3][1008:1023]
      │     │     │     │   │1008+│    Loads into tileA[3][5] = A[3][1013]
      └─────┴─────┴─────┴───┴─────┘
      
      Matrix B (1024×1024) - Thread (3,5) always uses column 5:
           Col 5 (target)
              │
              ▼
      ┌──────▓▓▓──────┐  ▓ = Iter 0: B[0:15][5]
      │      ▓▓▓      │    Loads into tileB[3][5] = B[3][5]
      │      0-15     │
      ├──────▓▓▓──────┤  ▓ = Iter 1: B[16:31][5]
      │      ▓▓▓      │    Loads into tileB[3][5] = B[19][5]
      │     16-31     │
      ├──────▓▓▓──────┤  ▓ = Iter 2: B[32:47][5]
      │      ▓▓▓      │    Loads into tileB[3][5] = B[35][5]
      │     32-47     │
      ├──────▓▓▓──────┤
      │      ...      │    ... continues for 64 iterations
      ├──────▓▓▓──────┤  ▓ = Iter 63: B[1008:1023][5]
      │      ▓▓▓      │    Loads into tileB[3][5] = B[1011][5]
      │    1008+      │
      └──────▓▓▓──────┘
      
      Key insight: Thread (3,5) accesses row 3 and column 5 progressively:
      - Same tile position [3][5] every iteration
      - Different global matrix positions each iteration
      - Marches through all 1024 elements across 64 chunks
      
      Step 2: Synchronize (__syncthreads()) - wait for all tiles to be loaded
      
      Step 3: Thread (3,5) computes using row 3 of tileA and column 5 of tileB:
              sum = 0.0f
              for (int k = 0; k < 16; k++)
                  sum += tileA[3][k] * tileB[k][5];
              
              Expands to:
              sum = A[3][0]×B[0][5] + A[3][1]×B[1][5] + ... + A[3][15]×B[15][5]
              sum = 3.2×1.5 + 2.7×4.3 + ... + 5.1×2.9
              sum = 87.4
      
      Thread (3,5)'s state: sum = 87.4 (stored in register, private to this thread)
      
      ─────────────────────────────────────────────────────────────────
      ITERATION 1: Process elements k = 16 to 31
      ─────────────────────────────────────────────────────────────────
      
      Now we need to compute the next 16 products (k = 16 to 31), so we need
      the next chunk of data from global matrices A and B.
      
      Step 1: ALL threads cooperatively load NEW tiles from global memory
              Thread (3,5) loads TWO elements (one for each tile):
              
              For tileA: tileA[3][5] = A[3][21]
              - Same tile position [3][5], but from column 21 of global matrix A
              - Formula: A[threadIdx.y][16 + threadIdx.x] = A[3][16 + 5] = A[3][21]
              
              For tileB: tileB[3][5] = B[19][5]
              - Same tile position [3][5], but from row 19 of global matrix B
              - Formula: B[16 + threadIdx.y][threadIdx.x] = B[16 + 3][5] = B[19][5]
              
              All 256 threads load their assigned elements...
              
              Result: Shared memory NOW contains:
              tileA[0:15][0:15] = A[0:15][16:31] (NEW data from global A, overwrites old!)
              tileB[0:15][0:15] = B[16:31][0:15] (NEW data from global B, overwrites old!)
      
      Step 2: Synchronize again
      
      Step 3: Thread (3,5) computes using NEW tiles, ACCUMULATES to existing sum:
              for (int k = 0; k < 16; k++)
                  sum += tileA[3][k] * tileB[k][5];
              
              Expands to:
              sum += A[3][16]×B[16][5] + A[3][17]×B[17][5] + ... + A[3][31]×B[31][5]
              sum = 87.4 + (4.1×3.2 + 6.3×1.8 + ... + 2.5×4.7)
              sum = 87.4 + 64.2 = 151.6
      
      Thread (3,5)'s state: sum = 151.6 (accumulating!)
      
      Key insight: Thread (3,5) uses row 3 from tileA and column 5 from tileB.
                   Same positions in the tiles, but tiles contain DIFFERENT data!
      
      ─────────────────────────────────────────────────────────────────
      ITERATIONS 2-62: Same pattern continues...
      ─────────────────────────────────────────────────────────────────
      
      Iteration 2: Load A[0:15][32:47] and B[32:47][0:15]
                   Thread (3,5) adds 16 more products, sum = 219.3
      
      Iteration 3: Load A[0:15][48:63] and B[48:63][0:15]
                   Thread (3,5) adds 16 more products, sum = 288.7
      
      ... (continues for 60 more iterations)
      
      Iteration 62: Load A[0:15][992:1007] and B[992:1007][0:15]
                    Thread (3,5) adds 16 more products, sum = 2,783.5
      
      ─────────────────────────────────────────────────────────────────
      ITERATION 63: Final iteration, k = 1008 to 1023
      ─────────────────────────────────────────────────────────────────
      
      Step 1: Load final tiles: A[0:15][1008:1023] and B[1008:1023][0:15]
      Step 2: Synchronize
      Step 3: Thread (3,5) adds final 16 products:
              sum += A[3][1008]×B[1008][5] + ... + A[3][1023]×B[1023][5]
              sum = 2,783.5 + 68.9 = 2,852.4
      
      Step 4: Write final result: C[3][5] = 2,852.4
      
      ─────────────────────────────────────────────────────────────────
      SUMMARY: What happened?
      ─────────────────────────────────────────────────────────────────
      
      Thread (3,5)'s perspective:
      - Started with sum = 0
      - In each of 64 iterations:
        * Helped load 1 element into shared memory
        * Waited for all threads to finish loading
        * Used its designated row/column from the shared tiles
        * Added 16 products to its private sum
      - After 64 iterations: sum = 2,852.4 = final C[3][5]
      
      Key points:
      1. Each thread has its OWN private sum (register variable)
      2. Shared tiles contain DIFFERENT data each iteration
      3. Each thread uses the SAME positions (row/col) but DIFFERENT data
      4. All threads work in lockstep through 64 iterations
      5. Shared memory is reused 64 times, but each thread's sum keeps growing

4.4 Putting it all together
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Workspace organization::
    - Grid: 64×64 = 4,096 blocks launched
    - Each block: 16×16 = 256 threads
    - Total threads: 1,048,576 threads (one per output element)
    
    Each block's job:
    - Computes one 16×16 output tile (256 elements of C)
    - Processes 128 input tiles sequentially (64 from A, 64 from B)
    - Uses 2 KB shared memory, reuses it 64 times
    
    Each thread's job:
    - Computes ONE output element
    - Participates in loading tiles (1 element per load)
    - Accumulates partial sums across 64 iterations
    
    Key advantage: 
    - 16× reduction in global memory reads (shared memory reuse)
    - 10-20× faster memory access (shared vs global memory)
    - Overall: 10-20× speedup!

.. dropdown:: Why use 16×16 tiles (2 KB) when we have 48-96 KB available?
  :icon: question
  :color: info

  With 48-96 KB available, couldn't we use 32×32 or 64×64 tiles? Yes, but there are trade-offs!

  **Tile size options**::

      Available shared memory: 48 KB (conservative estimate)
      
      16×16 tiles: 2 × 16 × 16 × 4 = 2 KB per block
      - Can fit: 48 KB / 2 KB = 24 blocks per SM
      - Threads per block: 16 × 16 = 256 threads
      - Total threads: 24 blocks × 256 = 6,144 threads per SM
      
      32×32 tiles: 2 × 32 × 32 × 4 = 8 KB per block
      - Can fit: 48 KB / 8 KB = 6 blocks per SM
      - Threads per block: 32 × 32 = 1,024 threads (maximum per block!)
      - Total threads: 6 blocks × 1,024 = 6,144 threads per SM
      
      64×64 tiles: 2 × 64 × 64 × 4 = 32 KB per block
      - Can fit: 48 KB / 32 KB = 1 block per SM
      - Threads per block: 64 × 64 = 4,096 threads (exceeds 1,024 limit!)
      - Result: Cannot launch! Maximum threads per block is 1,024
  
  **The trade-offs**::

      Smaller tiles (16×16):
      + More blocks per SM (better occupancy)
      + Better latency hiding (more warps to switch between)
      + More flexible (works on older GPUs)
      - More loop iterations (64 tiles for 1024×1024)
      - Less data reuse per tile
      
      Larger tiles (32×32):
      + Fewer loop iterations (better for large matrices)
      + More data reuse (each element used 32 times vs 16)
      - Fewer blocks per SM (lower occupancy)
      - Hits thread limit (1,024 threads per block)
      - Requires more shared memory
  
  **Why 16×16 is a good choice**::

      1. Pedagogical: Easy to understand and visualize
      2. Portable: Works on all modern GPUs (even those with 48 KB shared memory)
      3. Balanced: Good occupancy without hitting thread limits
      4. Production-ready: cuBLAS actually uses multiple tile sizes depending on:
         - Matrix dimensions
         - GPU architecture
         - Data types (float vs double)
         - Available shared memory
      
      For learning, 16×16 is perfect. For production, use cuBLAS (auto-tuned)!

  **Visual representation of tiling for 1024×1024**::
      
      Matrix A (1024×1024)              Matrix B (1024×1024)
      ┌──────────────────────┐            ┌──────────────────────┐
      │ Tile  Tile  ... Tile │            │ Tile  Tile  ... Tile │
      │ A₀₀   A₀₁      A₀,₆₃ │            │ B₀₀   B₀₁      B₀,₆₃ │
      ├──────────────────────┤            ├──────────────────────┤
      │ Tile  Tile  ... Tile │            │ Tile  Tile  ... Tile │
      │ A₁₀   A₁₁      A₁,₆₃ │            │ B₁₀   B₁₁      B₁,₆₃ │
      │  ...   ...      ...  │            │  ...   ...      ...  │
      │ Tile  Tile  ... Tile │            │ Tile  Tile  ... Tile │
      │ A₆₃,₀ A₆₃,₁    A₆₃,₆₃│            │ B₆₃,₀ B₆₃,₁    B₆₃,₆₃│
      └──────────────────────┘            └──────────────────────┘
      
      Each matrix divided into 64×64 = 4,096 tiles (each tile is 16×16)
      
      Each tile is 16×16 elements (256 floats = 1 KB)
      Total tiles per matrix: 64×64 = 4,096 tiles
      Total tiles to process: 256 / 16 = 16 tiles along the K dimension

  **How tiling reduces memory traffic**::

      Without tiling (naive approach):
      - Thread (0,0) reads A[0][0] from global memory
      - Thread (0,1) reads A[0][0] from global memory (redundant!)
      - Thread (0,2) reads A[0][0] from global memory (redundant!)
      - ... all 16 threads in row 0 read A[0][0] separately
      - Total: 16 reads from slow global memory
      
      With tiling (shared memory approach):
      - One thread loads A[0][0] into shared memory (1 global read)
      - Thread (0,0) reads from shared memory (fast!)
      - Thread (0,1) reads from shared memory (fast!)
      - Thread (0,2) reads from shared memory (fast!)
      - ... all 16 threads read from fast shared memory
      - Total: 1 read from global memory, 16 reads from shared memory
      
      Result: 16× less global memory traffic + each access is 10-20× faster!

Step 5: Implementation with shared memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now let's implement the tiled matrix multiplication with extensive inline comments showing exactly what happens.

.. code-block:: cpp

    // ═════════════════════════════════════════════════════════════════════════════
    // TILE SIZE CONFIGURATION
    // ═════════════════════════════════════════════════════════════════════════════
    // Tile size: 16×16 = 256 threads per block = 256 elements per tile
    // Each tile = 256 floats × 4 bytes = 1 KB
    // Why 16? Common choice balancing occupancy and shared memory usage
    #define TILE_SIZE 16

    __global__ void matMulTiled(float* A, float* B, float* C, int N)
    {
        // ═════════════════════════════════════════════════════════════════════════
        // SHARED MEMORY DECLARATION (2 KB total per block)
        // ═════════════════════════════════════════════════════════════════════════
        // __shared__ = visible to all 256 threads in this block
        // Allocated once per block, reused across all 64 iterations
        // Speed: 20-40 cycle latency (vs 400-800 cycles for global memory)
        // Size limit: 48-96 KB per SM (shared among all active blocks)
        __shared__ float tileA[TILE_SIZE][TILE_SIZE];  // 16×16 = 256 floats = 1 KB
        __shared__ float tileB[TILE_SIZE][TILE_SIZE];  // 16×16 = 256 floats = 1 KB
        // Total: 2 KB allocated in shared memory
        
        // ═════════════════════════════════════════════════════════════════════════
        // THREAD IDENTIFICATION: Calculate which output element this thread computes
        // ═════════════════════════════════════════════════════════════════════════
        // Each thread in the grid computes ONE element of output matrix C
        // row = which row of C this thread computes
        // col = which column of C this thread computes
        //
        // Formula: global_position = block_position × block_size + thread_position
        int row = blockIdx.y * TILE_SIZE + threadIdx.y;
        int col = blockIdx.x * TILE_SIZE + threadIdx.x;
        //
        // Example: Thread (3,5) in Block (10,20) for N=1024:
        //   row = blockIdx.y × 16 + threadIdx.y = 20 × 16 + 3 = 323
        //   col = blockIdx.x × 16 + threadIdx.x = 10 × 16 + 5 = 165
        //   → This thread computes C[323][165]
        
        // ═════════════════════════════════════════════════════════════════════════
        // ACCUMULATOR: Stores partial sum for this thread's output element
        // ═════════════════════════════════════════════════════════════════════════
        // Lives in register (private to this thread, fastest memory)
        // Persists across all iterations, accumulating partial dot products
        // For N=1024: Accumulates 1024 products across 64 iterations
        float sum = 0.0f;
        //
        // Example: Thread (3,5) computing C[323][165]:
        //   Start:        sum = 0.0
        //   Iteration 0:  sum = 87.4   (after adding 16 products)
        //   Iteration 1:  sum = 152.2  (after adding 16 more products)
        //   Iteration 2:  sum = 218.7
        //   ...
        //   Iteration 63: sum = 2847.3 (final result after 1024 products total)
        
        // ═════════════════════════════════════════════════════════════════════════
        // CALCULATE NUMBER OF ITERATIONS needed to cover all K elements
        // ═════════════════════════════════════════════════════════════════════════
        // C[i][j] = A[i][0]*B[0][j] + A[i][1]*B[1][j] + ... + A[i][N-1]*B[N-1][j]
        // That's N products to compute, but we process TILE_SIZE at a time
        // So we need ceiling(N / TILE_SIZE) iterations
        int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;
        //
        // Example: N=1024, TILE_SIZE=16:
        //   numTiles = (1024 + 15) / 16 = 1039 / 16 = 64 iterations
        //   Iteration 0: Process k = 0 to 15
        //   Iteration 1: Process k = 16 to 31
        //   Iteration 2: Process k = 32 to 47
        //   ...
        //   Iteration 63: Process k = 1008 to 1023
        
        // ═════════════════════════════════════════════════════════════════════════
        // MAIN LOOP: Iterate through tiles along the K dimension
        // ═════════════════════════════════════════════════════════════════════════
        // Each iteration:
        //   1. Load one 16×16 tile from A (256 elements)
        //   2. Load one 16×16 tile from B (256 elements)
        //   3. Compute 16 multiply-adds using the loaded tiles
        //   4. Accumulate results into sum
        for (int t = 0; t < numTiles; t++) {
            
            // ─────────────────────────────────────────────────────────────────────
            // STEP 1: COOPERATIVE LOADING FROM MATRIX A
            // ─────────────────────────────────────────────────────────────────────
            // All 256 threads work together to load one tile from matrix A
            // Each thread loads exactly ONE element from global memory
            // Together they fill the entire 16×16 shared memory tile
            //
            // For Thread (ty, tx): Loads A[row][t×16+tx] → tileA[ty][tx]
            //   - row stays CONSTANT across all iterations (always same row of A)
            //   - column CHANGES each iteration (moves right through row of A)
            int aRow = row;
            int aCol = t * TILE_SIZE + threadIdx.x;
            //
            // Example: Thread (3,5) in Block (10,20), iteration t=1:
            //   aRow = 323 (same for all 64 iterations - always row 323 of A)
            //   aCol = 1 × 16 + 5 = 21
            //   Loads: tileA[3][5] = A[323][21]
            //
            // All threads in this block cooperatively load (iteration t=1):
            //   Thread (0,0): A[320][16] → tileA[0][0]
            //   Thread (0,1): A[320][17] → tileA[0][1]
            //   Thread (3,5): A[323][21] → tileA[3][5]
            //   Thread (15,15): A[335][31] → tileA[15][15]
            //   Result: tileA contains A[320:335][16:31]
            //
            // Boundary checking: Needed when N is not multiple of TILE_SIZE
            if (aRow < N && aCol < N) {
                tileA[threadIdx.y][threadIdx.x] = A[aRow * N + aCol];
            } else {
                tileA[threadIdx.y][threadIdx.x] = 0.0f;  // Zero padding
            }
            
            // ─────────────────────────────────────────────────────────────────────
            // STEP 2: COOPERATIVE LOADING FROM MATRIX B
            // ─────────────────────────────────────────────────────────────────────
            // All 256 threads work together to load one tile from matrix B
            // Each thread loads exactly ONE element from global memory
            //
            // For Thread (ty, tx): Loads B[t×16+ty][col] → tileB[ty][tx]
            //   - row CHANGES each iteration (moves down through column of B)
            //   - column stays CONSTANT across all iterations (always same col of B)
            int bRow = t * TILE_SIZE + threadIdx.y;
            int bCol = col;
            //
            // Example: Thread (3,5) in Block (10,20), iteration t=1:
            //   bRow = 1 × 16 + 3 = 19
            //   bCol = 165 (same for all 64 iterations - always column 165 of B)
            //   Loads: tileB[3][5] = B[19][165]
            //
            // All threads in this block cooperatively load (iteration t=1):
            //   Thread (0,0): B[16][160] → tileB[0][0]
            //   Thread (0,1): B[16][161] → tileB[0][1]
            //   Thread (3,5): B[19][165] → tileB[3][5]
            //   Thread (15,15): B[31][175] → tileB[15][15]
            //   Result: tileB contains B[16:31][160:175]
            if (bRow < N && bCol < N) {
                tileB[threadIdx.y][threadIdx.x] = B[bRow * N + bCol];
            } else {
                tileB[threadIdx.y][threadIdx.x] = 0.0f;  // Zero padding
            }
            
            // ─────────────────────────────────────────────────────────────────────
            // STEP 3: SYNCHRONIZATION BARRIER - Wait for all threads to finish loading
            // ─────────────────────────────────────────────────────────────────────
            // CRITICAL: Must ensure ALL threads complete loading before ANY thread
            // starts computing, otherwise we'd read incomplete/corrupted data!
            //
            // What __syncthreads() does:
            //   - Blocks execution until ALL 256 threads in this block reach here
            //   - Fast threads wait for slow threads
            //   - Only when last thread arrives do all threads proceed together
            //
            // Example timeline (times in arbitrary units):
            //   Time 0:   All threads start loading
            //   Time 50:  Thread (0,0) finishes, reaches __syncthreads(), WAITS
            //   Time 100: Thread (5,7) finishes, reaches __syncthreads(), WAITS
            //   Time 150: Thread (10,12) finishes, reaches __syncthreads(), WAITS
            //   Time 200: Thread (15,15) finishes (slowest), reaches __syncthreads()
            //   Time 201: ALL threads released, proceed to computation ✓
            //
            // Without this sync:
            //   Thread (0,0) loads fast → starts computing → reads tileA[8][9]
            //   But Thread (8,9) hasn't finished loading yet!
            //   Result: Thread (0,0) reads garbage data → WRONG ANSWER!
            __syncthreads();
            
            // ─────────────────────────────────────────────────────────────────────
            // STEP 4: COMPUTE PARTIAL DOT PRODUCT using shared memory tiles
            // ─────────────────────────────────────────────────────────────────────
            // Each thread computes 16 multiply-add operations
            // Uses one ROW from tileA and one COLUMN from tileB
            // All 16 operations read from FAST shared memory (20-40 cycles)
            //
            // For Thread (ty, tx): Computes using tileA[ty][0:15] and tileB[0:15][tx]
            //
            // Example: Thread (3,5) in Block (10,20), iteration t=1:
            //   Uses row 3 of tileA → contains A[323][16:31]
            //   Uses col 5 of tileB → contains B[16:31][165]
            //
            //   k=0:  sum += tileA[3][0] × tileB[0][5]
            //         sum += A[323][16] × B[16][165]  (global positions)
            //
            //   k=1:  sum += tileA[3][1] × tileB[1][5]
            //         sum += A[323][17] × B[17][165]
            //
            //   k=2:  sum += tileA[3][2] × tileB[2][5]
            //         sum += A[323][18] × B[18][165]
            //
            //   ... (13 more operations)
            //
            //   k=15: sum += tileA[3][15] × tileB[15][5]
            //         sum += A[323][31] × B[31][165]
            //
            //   After this loop (iteration 1 complete):
            //     sum = (previous value) + (16 new products)
            //     If sum was 87.4, now sum ≈ 152.2
            //
            // KEY OPTIMIZATION: All 16 reads from shared memory (fast!)
            //   Without tiling: 16 reads from global memory (slow!)
            //   Speedup: ~10-20× per access
            for (int k = 0; k < TILE_SIZE; k++) {
                sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
            }
            
            // ─────────────────────────────────────────────────────────────────────
            // STEP 5: SYNCHRONIZATION BARRIER - Wait before loading next tile
            // ─────────────────────────────────────────────────────────────────────
            // CRITICAL: Must ensure ALL threads finish computing before ANY thread
            // starts loading next tile, otherwise we'd overwrite data still in use!
            //
            // What happens without this sync (causes bugs!):
            //   Time 0:   All threads computing with current tiles
            //   Time 50:  Thread (0,0) finishes → starts iteration t=2
            //           → Loads new data → OVERWRITES tileA[0][0]
            //   Time 100: Thread (15,15) still computing with iteration t=1
            //           → Tries to read tileA[0][0] → Gets WRONG data!
            //           → Result is CORRUPTED!
            //
            // With this sync (correct behavior):
            //   Time 0:   All threads computing
            //   Time 50:  Thread (0,0) finishes, reaches __syncthreads(), WAITS
            //   Time 100: Thread (10,12) finishes, reaches __syncthreads(), WAITS
            //   Time 150: Thread (15,15) finishes, reaches __syncthreads()
            //   Time 151: ALL threads proceed to next iteration together ✓
            //           → Safe to overwrite shared memory now
            __syncthreads();
        }
        // Loop ends: After 64 iterations, each thread has accumulated 1024 products
        
        // ═════════════════════════════════════════════════════════════════════════
        // WRITE FINAL RESULT to global memory
        // ═════════════════════════════════════════════════════════════════════════
        // After all iterations complete, sum contains the final dot product
        // Write it to the corresponding position in output matrix C
        //
        // Boundary check: Some threads may be outside matrix bounds
        //   Example: N=1020, block has threads up to position 1023
        //   Threads computing row/col >= 1020 shouldn't write (out of bounds)
        //
        // Example: Thread (3,5) in Block (10,20):
        //   row = 323, col = 165
        //   sum = 2847.3 (accumulated over 64 iterations)
        //   Writes: C[323 × 1024 + 165] = 2847.3
        //           C[323][165] = 2847.3 ✓
        //
        // Summary of what this thread accomplished:
        //   Iteration 0:  Added A[323][0:15] · B[0:15][165]     → sum = 87.4
        //   Iteration 1:  Added A[323][16:31] · B[16:31][165]   → sum = 152.2
        //   Iteration 2:  Added A[323][32:47] · B[32:47][165]   → sum = 218.7
        //   ...
        //   Iteration 63: Added A[323][1008:1023] · B[1008:1023][165] → sum = 2847.3
        //   Total: 1024 multiply-add operations = complete dot product
        if (row < N && col < N) {
            C[row * N + col] = sum;
        }
    }

Host code and kernel launch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

    int main()
    {
        int N = 1024;
        size_t size = N * N * sizeof(float);
        
        // Allocate and initialize host matrices
        float *h_A = (float*)malloc(size);
        float *h_B = (float*)malloc(size);
        float *h_C = (float*)malloc(size);
        
        for (int i = 0; i < N * N; i++) {
            h_A[i] = rand() / (float)RAND_MAX;
            h_B[i] = rand() / (float)RAND_MAX;
        }
        
        // Allocate device matrices
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, size);
        cudaMalloc(&d_B, size);
        cudaMalloc(&d_C, size);
        
        // Copy to device
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
        
        // Launch optimized kernel with tiling
        // Each block: 16×16 = 256 threads
        // Grid: 64×64 = 4,096 blocks (for N=1024)
        // Total: 1,048,576 threads computing 1,048,576 output elements
        dim3 threadsPerBlock(TILE_SIZE, TILE_SIZE);
        dim3 numBlocks((N + TILE_SIZE - 1) / TILE_SIZE, 
                       (N + TILE_SIZE - 1) / TILE_SIZE);
        
        matMulTiled<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
        
        // Copy result back
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
        
        // Cleanup
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(h_A); free(h_B); free(h_C);
        
        return 0;
    }

Why this optimization works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The tiled implementation achieves 10-20× speedup by exploiting the memory hierarchy.

**Memory reuse pattern:**

For N=1024, each thread needs to access 1024 elements from A and 1024 from B. Without tiling, these 2048 accesses would all hit slow global memory (400-800 cycles each). With tiling, we break this into 64 iterations:

- Each iteration: Load 2 KB into shared memory (one slow access)
- Then perform 16 operations using shared memory (16 fast accesses)
- Memory access ratio: 4,096 shared reads : 256 global reads = 16:1 reuse

**Performance breakdown per block:**

.. code-block:: text

    Without tiling (naive):
      Each thread: 1024 reads from A (global) + 1024 reads from B (global)
      256 threads × 2048 global reads × 400 cycles = 209 million cycles
    
    With tiling:
      64 iterations × (256 global reads + 4096 shared reads)
      64 × (256×400 + 4096×30) ≈ 14 million cycles
    
    Speedup: 209M / 14M ≈ 15× faster

**Why cooperative loading matters:**

When all 256 threads in a block load simultaneously, the GPU can coalesce adjacent memory accesses into fewer transactions. Thread (0,0) loads A[row][0], Thread (0,1) loads A[row][1], etc. These adjacent accesses combine into a single memory transaction, maximizing bandwidth utilization.

**Why synchronization is critical:**

The two ``__syncthreads()`` barriers prevent data races. After loading, all threads must wait before computing to ensure the complete tile is available. After computing, all threads must wait before loading the next tile to prevent overwriting data still in use. Without these barriers, fast threads would corrupt data needed by slower threads.

.. dropdown:: Why not use bigger tiles?
  :icon: question
  :color: info

  Shared memory is limited (48-96 KB per SM), so tile size involves trade-offs:

  - **16×16 tiles (2 KB)**: High occupancy, many blocks per SM, good latency hiding, moderate reuse
  - **32×32 tiles (8 KB)**: Better reuse (32 elements reused 32 times), fewer blocks per SM
  - **64×64 tiles (32 KB)**: Excellent reuse, but may limit parallelism significantly
  
  Optimal size depends on GPU architecture, matrix dimensions, and occupancy needs. Our 16×16 choice balances memory usage with parallelism. For production, use cuBLAS which auto-tunes for your hardware.

.. dropdown:: How does memory coalescing work here?
  :icon: flame 
  :color: warning

  Memory coalescing combines adjacent thread accesses into fewer transactions, maximizing bandwidth.

  **Good (coalesced)**: Thread 0 reads A[0], Thread 1 reads A[1], Thread 2 reads A[2]. The GPU combines these into one 128-byte transaction serving 32 threads. Bandwidth: Full utilization.

  **Bad (non-coalesced)**: Thread 0 reads A[0], Thread 1 reads A[32], Thread 2 reads A[64]. Each triggers a separate 128-byte transaction but uses only 4 bytes. Bandwidth: 32× slower.

  **Our implementation**: When loading tiles, Thread (0,0) loads A[row][t×16+0], Thread (0,1) loads A[row][t×16+1], Thread (0,2) loads A[row][t×16+2]. These adjacent accesses naturally coalesce, achieving optimal memory bandwidth.

Summary
~~~~~~~

This tutorial series progressed from basic concepts to advanced optimization:

- **Tutorials 0-2**: Memory management and basic parallelism
- **Tutorials 3-4**: Scaling with batching and chunking
- **Tutorial 5**: Overlapping transfers with streams
- **Tutorial 6**: Memory hierarchy optimization with tiling
- **Tutorial 7**: Combining streams and tiling for large batches

The core principle: move data closer to computation. By exploiting the memory hierarchy (registers → shared memory → global memory) and overlapping operations with streams, we achieve significant speedups over naive implementations. For production code, use cuBLAS library, which implements additional optimizations achieving near-peak performance.

For conceptual material on GPU architecture and CUDA fundamentals, see :ref:`cuda`.

Tutorial 7: Batch processing with streams and tiling (100k images of 1024×1024)
-------------------------------------------------------------------------------

Process 100,000 matrices (1024×1024 each) using both tiling and streams to maximize GPU utilization.

Problem: Large batch matrix operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You have 100,000 images (1024×1024 pixels each) and need to apply a transformation like convolution, filtering, or matrix multiplication. Processing them one at a time wastes GPU resources. Processing all at once exceeds memory limits.

.. note::
   **Terminology**: In this tutorial, "batch" refers to a group of matrices processed together in one iteration (same as "chunk" in Tutorial 4). We use "batch" here because it's more common in machine learning and image processing contexts. The concept is identical: dividing a large dataset into smaller groups that fit in memory.

**Challenge:**

- Total data: 100,000 matrices × 1024×1024 × 4 bytes = 400 GB
- GPU memory: Typically 8-24 GB
- Need to process in batches while keeping GPU busy

**Solution approach:**

Combine Tutorial 5 (streams) and Tutorial 6 (tiling):

1. Divide 100,000 matrices into batches that fit in GPU memory
2. Use multiple streams to overlap CPU→GPU transfer, computation, and GPU→CPU transfer
3. Within each computation, use tiled matrix operations for efficiency
4. Pipeline the batches to maximize throughput

Step 1: Understanding the pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We'll process matrices in batches of 1,000, using 3 streams for triple buffering:

.. code-block:: text

    Time →
    
    Stream 0: [H→D batch 0] [Compute batch 0] [D→H batch 0]
    Stream 1:               [H→D batch 1]     [Compute batch 1] [D→H batch 1]
    Stream 2:                                 [H→D batch 2]     [Compute batch 2] [D→H batch 2]
    
    Result: All three stages happen simultaneously across different batches
Step 2: Memory organization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For 100,000 matrices in batches of 1,000, we need a realistic memory strategy:

.. code-block:: text

    Storage (disk/database): 100,000 matrices (400 GB total)
      ├─ Load batches on-demand as needed
      └─ Save processed results back to disk
    
    Host memory (pinned, for fast transfer):
      Stream 0 buffers: input (4 GB) + output (4 GB) = 8 GB
      Stream 1 buffers: input (4 GB) + output (4 GB) = 8 GB
      Stream 2 buffers: input (4 GB) + output (4 GB) = 8 GB
      Total: 24 GB (realistic for modern systems)
    
    GPU memory (device):
      Stream 0: input (4 GB) + output (4 GB) = 8 GB
      Stream 1: input (4 GB) + output (4 GB) = 8 GB
      Stream 2: input (4 GB) + output (4 GB) = 8 GB
      Total: 24 GB (fits on 32 GB GPU, tight on 16 GB GPU)
    
    Data flow:
      Disk → Pinned buffer → GPU → Pinned buffer → Disk
      
    Key: Only 3 batches in memory at once, cycling through all 100 batches

Step 3: Implementation strategy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

    // Configuration
    #define N 1024              // Matrix dimension
    #define TILE_SIZE 16        // Tile size for shared memory
    #define CHUNK_SIZE 1000     // Matrices per chunk
    #define NUM_CHUNKS 100      // Total chunks (100,000 / 1,000)
    #define NUM_STREAMS 3       // Triple buffering
    
    // Tiled kernel: Apply 3×3 Gaussian blur filter
    // Why tiling helps: Each output pixel needs 9 input pixels
    // With shared memory: Load once, reuse 9 times by different threads
    // Without shared memory: Each thread would load from slow global memory
    __global__ void processMatrixTiled(float* input, float* output, int N)
    {
        // Shared memory tile (with halo for neighboring pixels)
        __shared__ float tile[TILE_SIZE + 2][TILE_SIZE + 2];
        
        // Calculate global position
        int row = blockIdx.y * TILE_SIZE + threadIdx.y;
        int col = blockIdx.x * TILE_SIZE + threadIdx.x;
        
        // Load tile cooperatively (including 1-pixel border)
        int tileRow = threadIdx.y + 1;  // +1 for halo
        int tileCol = threadIdx.x + 1;
        
        if (row < N && col < N) {
            tile[tileRow][tileCol] = input[row * N + col];
        } else {
            tile[tileRow][tileCol] = 0.0f;
        }
        
        // Load halo (border pixels needed for filter)
        if (threadIdx.y == 0 && row > 0) {
            tile[0][tileCol] = input[(row-1) * N + col];  // Top
        }
        if (threadIdx.y == TILE_SIZE-1 && row < N-1) {
            tile[TILE_SIZE+1][tileCol] = input[(row+1) * N + col];  // Bottom
        }
        if (threadIdx.x == 0 && col > 0) {
            tile[tileRow][0] = input[row * N + (col-1)];  // Left
        }
        if (threadIdx.x == TILE_SIZE-1 && col < N-1) {
            tile[tileRow][TILE_SIZE+1] = input[row * N + (col+1)];  // Right
        }
        
        __syncthreads();
        
        // Apply 3×3 Gaussian blur using shared memory
        // Gaussian kernel weights (normalized):
        //   [1 2 1]
        //   [2 4 2] / 16
        //   [1 2 1]
        if (row < N && col < N) {
            float result = 
                tile[tileRow-1][tileCol-1] * 1.0f +  // Top-left
                tile[tileRow-1][tileCol]   * 2.0f +  // Top
                tile[tileRow-1][tileCol+1] * 1.0f +  // Top-right
                tile[tileRow][tileCol-1]   * 2.0f +  // Left
                tile[tileRow][tileCol]     * 4.0f +  // Center
                tile[tileRow][tileCol+1]   * 2.0f +  // Right
                tile[tileRow+1][tileCol-1] * 1.0f +  // Bottom-left
                tile[tileRow+1][tileCol]   * 2.0f +  // Bottom
                tile[tileRow+1][tileCol+1] * 1.0f;   // Bottom-right
            
            output[row * N + col] = result / 16.0f;  // Normalize
        }
    }
    
    int main()
    {
        // Total: 100,000 matrices
        int totalMatrices = NUM_CHUNKS * CHUNK_SIZE;
        size_t matrixSize = N * N * sizeof(float);
        size_t chunkBytes = CHUNK_SIZE * matrixSize;  // 4 GB per chunk
        
        // ═══════════════════════════════════════════════════════════
        // STEP 1: Allocate pinned host memory for NUM_STREAMS chunks
        // ═══════════════════════════════════════════════════════════
        // Realistic: Only allocate enough for in-flight chunks
        // With 3 streams: 3 chunks in (12 GB) + 3 chunks out (12 GB) = 24 GB
        float *h_input[NUM_STREAMS], *h_output[NUM_STREAMS];
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaMallocHost(&h_input[i], chunkBytes);   // 4 GB per stream
            cudaMallocHost(&h_output[i], chunkBytes);  // 4 GB per stream
        }
        // Total host memory: 3 × (4 GB + 4 GB) = 24 GB (realistic!)
        
        // Larger dataset stored in regular pageable memory or on disk
        // For this example, we'll simulate loading from disk chunk by chunk
        
        // ═══════════════════════════════════════════════════════════
        // STEP 2: Allocate device memory (one set per stream)
        // ═══════════════════════════════════════════════════════════
        float *d_input[NUM_STREAMS], *d_output[NUM_STREAMS];
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaMalloc(&d_input[i], chunkBytes);   // 4 GB per stream
            cudaMalloc(&d_output[i], chunkBytes);  // 4 GB per stream
        }
        // Total device memory: 3 × (4 GB + 4 GB) = 24 GB
        
        // ═══════════════════════════════════════════════════════════
        // STEP 3: Create streams
        // ═══════════════════════════════════════════════════════════
        cudaStream_t streams[NUM_STREAMS];
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamCreate(&streams[i]);
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 4: Process chunks with pipelining
        // ═══════════════════════════════════════════════════════════
        // Grid configuration for 1000 matrices per chunk
        dim3 threadsPerBlock(TILE_SIZE, TILE_SIZE);  // 16×16 = 256 threads
        dim3 blocksPerMatrix((N + TILE_SIZE - 1) / TILE_SIZE,
                             (N + TILE_SIZE - 1) / TILE_SIZE);  // 64×64 blocks
        
        for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
            // Use streams in round-robin fashion
            int streamIdx = chunk % NUM_STREAMS;
            cudaStream_t stream = streams[streamIdx];
            
            // ───────────────────────────────────────────────────────
            // Load chunk from disk/pageable memory into pinned buffer
            // ───────────────────────────────────────────────────────
            // In real application: read from file, database, or network
            // For now, simulate by generating data on-the-fly
            for (int i = 0; i < CHUNK_SIZE * N * N; i++) {
                h_input[streamIdx][i] = rand() / (float)RAND_MAX;
            }
            
            // ───────────────────────────────────────────────────────
            // Pipeline three operations on this stream:
            // ───────────────────────────────────────────────────────
            
            // 1. Transfer input chunk to GPU (async)
            cudaMemcpyAsync(d_input[streamIdx], h_input[streamIdx], 
                           chunkBytes, cudaMemcpyHostToDevice, stream);
            
            // 2. Process all matrices in chunk (async on stream)
            for (int m = 0; m < CHUNK_SIZE; m++) {
                size_t matOffset = m * N * N;
                processMatrixTiled<<<blocksPerMatrix, threadsPerBlock, 0, stream>>>(
                    d_input[streamIdx] + matOffset,
                    d_output[streamIdx] + matOffset,
                    N
                );
            }
            
            // 3. Transfer results back to host (async)
            cudaMemcpyAsync(h_output[streamIdx], d_output[streamIdx],
                           chunkBytes, cudaMemcpyDeviceToHost, stream);
            
            // ───────────────────────────────────────────────────────
            // Save results from previous chunk (if available)
            // ───────────────────────────────────────────────────────
            if (chunk >= NUM_STREAMS) {
                // Previous chunk on this stream has completed
                int prevChunk = chunk - NUM_STREAMS;
                cudaStreamSynchronize(stream);
                
                // Save results: write to disk, database, or network
                // For now, just acknowledge completion
                printf("Chunk %d processed (1000 matrices)\n", prevChunk);
            }
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 5: Wait for final chunks and save results
        // ═══════════════════════════════════════════════════════════
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamSynchronize(streams[i]);
            // Save final results
            printf("Final chunk %d processed\n", NUM_CHUNKS - NUM_STREAMS + i);
        }
        
        // Cleanup
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamDestroy(streams[i]);
            cudaFree(d_input[i]);
            cudaFree(d_output[i]);
            cudaFreeHost(h_input[i]);
            cudaFreeHost(h_output[i]);
        }
        
        return 0;
    }

Summary
~~~~~~~

Tutorial 7 combines the best techniques from the series:

- **Tiling (Tutorial 6)**: Each matrix uses shared memory for 10-20× speedup
- **Streams (Tutorial 5)**: Multiple chunks pipeline to hide transfer latency
- **Chunking (Tutorials 3-4)**: Process data in manageable chunks

**Key takeaways:**

1. **Memory hierarchy**: Use shared memory (fast) for repeated access
2. **Asynchronous execution**: Overlap transfers and computation with streams
3. **Resource management**: Allocate GPU memory per stream, reuse across chunks
4. **Pipeline depth**: 3 streams achieve good balance between complexity and performance

This pattern scales to:
- Image processing (filters, transformations)
- Deep learning inference (batch predictions)
- Scientific computing (simulations, data analysis)
- Video processing (frame-by-frame operations)

For conceptual material on GPU architecture and CUDA fundamentals, see :ref:`cuda`.

Tutorial 8: Using cuBLAS for production matrix multiplication
--------------------------------------------------------------

Problem
~~~~~~~

In Tutorials 1-7, we learned how GPU parallelism works by implementing matrix multiplication from scratch. We progressed from naive implementations to optimized tiled versions with shared memory. However, for production code, you should use **cuBLAS** (CUDA Basic Linear Algebra Subprograms), NVIDIA's highly optimized library.

**Why use cuBLAS?**

1. **Performance**: 2-5× faster than hand-written kernels (uses advanced optimizations)
2. **Maintained**: NVIDIA updates it for new GPU architectures
3. **Tensor cores**: Automatically uses specialized hardware when available
4. **Less code**: Single function call vs hundreds of lines of kernel code

**When to write custom kernels:**

- Non-standard operations (custom transformations, domain-specific algorithms)
- Fused operations (combine multiple operations to reduce memory traffic)
- Research/experimentation (testing new algorithms)

**When to use cuBLAS:**

- Standard linear algebra (matrix multiplication, vector operations, factorizations)
- Production code (reliability and performance matter)
- Scientific applications: electron microscopy (ptychography reconstruction uses extensive matrix operations and FFTs), molecular dynamics simulations, climate modeling
- Deep learning (neural network layers are matrix multiplications)

This tutorial shows how to use cuBLAS for the same 1024×1024 matrix multiplication we've been working with. These operations appear repeatedly in computational pipelines: ptychography reconstructs high-resolution images through iterative matrix updates and Fourier transforms, processing thousands of diffraction patterns where each iteration involves matrix multiplications that cuBLAS accelerates significantly.

Solution
~~~~~~~~

Instead of writing kernel code with shared memory tiling (Tutorial 6), we call a single cuBLAS function. The library handles all optimizations internally.

**What cuBLAS does behind the scenes:**

When you call ``cublasSgemm``, the library:

1. **Analyzes your GPU architecture**: Detects compute capability, memory hierarchy, and available features
2. **Selects optimal algorithm**: Chooses tile sizes, blocking strategies, and memory access patterns
3. **Uses specialized hardware**: Activates tensor cores on modern GPUs (Volta/Turing/Ampere) for massive speedup
4. **Manages memory efficiently**: Handles shared memory allocation, register usage, and bank conflicts
5. **Launches optimized kernels**: Uses multiple kernel variants depending on matrix dimensions

For example, on an RTX 3090:

- Detects tensor cores available
- Uses 256×128 tiles (much larger than our 16×16 in Tutorial 6)
- Employs warp-level matrix operations
- Achieves 140+ GFLOPS vs our 27 GFLOPS

**The cuBLAS GEMM function:**

.. code-block:: cpp

    // GEMM: General Matrix-Matrix multiplication
    // Computes: C = alpha * op(A) * op(B) + beta * C
    // where op(X) can be X or X^T (transpose)
    
    cublasSgemm(handle,        // cuBLAS context
                transA,        // CUBLAS_OP_N (no transpose) or CUBLAS_OP_T
                transB,        // CUBLAS_OP_N or CUBLAS_OP_T
                m, n, k,       // Matrix dimensions
                &alpha,        // Pointer to scaling factor
                A, lda,        // Matrix A and leading dimension
                B, ldb,        // Matrix B and leading dimension
                &beta,         // Pointer to scaling factor
                C, ldc);       // Matrix C (output) and leading dimension

**Concrete example:** Multiply two 1024×1024 matrices

.. code-block:: cpp

    // We want: C = A × B where A, B, C are all 1024×1024
    
    const int N = 1024;
    const float alpha = 1.0f;  // Scaling factor for A × B (1.0 = no scaling)
    const float beta = 0.0f;   // Scaling factor for C (0.0 = overwrite C)
                               // If beta=1.0, would compute C = A×B + C (accumulate)
    
    // Matrix dimensions for standard multiplication:
    // A is 1024×1024 (m × k), B is 1024×1024 (k × n), C is 1024×1024 (m × n)
    int m = N, n = N, k = N;
    
    // CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C defined in cublas_v2.h
    // CUBLAS_OP_N = no transpose (use matrix as-is)
    // CUBLAS_OP_T = transpose matrix
    // CUBLAS_OP_C = conjugate transpose (for complex matrices)
    cublasSgemm(handle,           // cuBLAS library context (created once, reused)
                CUBLAS_OP_N,      // Don't transpose A
                CUBLAS_OP_N,      // Don't transpose B
                m, n, k,          // Dimensions: C(m×n) = A(m×k) × B(k×n)
                &alpha,           // Pointer to alpha scaling factor
                d_A, N,           // Matrix A on device, leading dimension N
                d_B, N,           // Matrix B on device, leading dimension N
                &beta,            // Pointer to beta scaling factor
                d_C, N);          // Matrix C on device (output), leading dimension N
    
    // Leading dimension (lda, ldb, ldc) = stride between rows in memory
    // For contiguous row-major matrices, this equals the row width (N)

**Column-major vs row-major storage:**

cuBLAS follows Fortran convention (column-major), while C/C++ uses row-major:

.. code-block:: text

    Row-major (C/C++):           Column-major (Fortran):
    Matrix [0 1 2]               Matrix [0 1 2]
           [3 4 5]                      [3 4 5]
    
    Stored as: 0,1,2,3,4,5       Stored as: 0,3,1,4,2,5
    (rows consecutive)           (columns consecutive)

However, for our purposes, we don't need to do anything special. When we pass row-major matrices to cuBLAS with `CUBLAS_OP_N` (no transpose), cuBLAS interprets them as column-major and the computation still works correctly. The matrix layout difference is handled automatically by the library as long as we're consistent with our storage format throughout the computation.

Implementation
~~~~~~~~~~~~~~

.. code-block:: cpp

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        // ═══════════════════════════════════════════════════════════
        // STEP 1: Setup and initialization
        // ═══════════════════════════════════════════════════════════
        const int N = 1024;
        const size_t bytes = N * N * sizeof(float);
        
        // Allocate host memory
        float *h_A = (float*)malloc(bytes);
        float *h_B = (float*)malloc(bytes);
        float *h_C = (float*)malloc(bytes);
        
        // Initialize matrices with simple values
        for (int i = 0; i < N * N; i++) {
            h_A[i] = rand() / (float)RAND_MAX;
            h_B[i] = rand() / (float)RAND_MAX;
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 2: Allocate device memory
        // ═══════════════════════════════════════════════════════════
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, bytes);
        cudaMalloc(&d_B, bytes);
        cudaMalloc(&d_C, bytes);
        
        // Transfer data to GPU
        cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
        
        // ═══════════════════════════════════════════════════════════
        // STEP 3: Initialize cuBLAS
        // ═══════════════════════════════════════════════════════════
        cublasHandle_t handle;
        cublasCreate(&handle);
        
        // ═══════════════════════════════════════════════════════════
        // STEP 4: Perform matrix multiplication using cuBLAS
        // ═══════════════════════════════════════════════════════════
        // Compute C = A × B using GEMM (General Matrix-Matrix multiplication)
        
        const float alpha = 1.0f;  // Scaling factor for A × B
        const float beta = 0.0f;   // Scaling factor for C (0 means overwrite)
        
        // Simply pass our row-major matrices to cuBLAS
        // cuBLAS handles the column-major/row-major difference internally
        cublasSgemm(handle,
                    CUBLAS_OP_N, CUBLAS_OP_N,  // No transpose needed
                    N, N, N,                    // m, n, k (all N for square matrices)
                    &alpha,                     // Scaling factor alpha
                    d_A, N,                     // Matrix A, leading dimension
                    d_B, N,                     // Matrix B, leading dimension
                    &beta,                      // Scaling factor beta
                    d_C, N);                    // Matrix C, leading dimension
        
        // ═══════════════════════════════════════════════════════════
        // STEP 5: Copy result back
        // ═══════════════════════════════════════════════════════════
        cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
        
        printf("Matrix multiplication completed\n");
        printf("Result matrix C = A × B (1024×1024)\n");
        
        // ═══════════════════════════════════════════════════════════
        // STEP 6: Cleanup
        // ═══════════════════════════════════════════════════════════
        cublasDestroy(handle);
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        free(h_A);
        free(h_B);
        free(h_C);
        
        return 0;
    }

**Compilation:**

.. code-block:: bash

    nvcc -o tutorial8 tutorial8.cu -lcublas

**Key differences from custom kernels:**

1. **No kernel code**: cuBLAS handles all GPU operations internally
2. **Column-major handling**: Swap A and B to account for storage order
3. **Handle management**: Create/destroy cuBLAS context
4. **Linking**: Need to link against cuBLAS library (``-lcublas``)

Performance comparison
~~~~~~~~~~~~~~~~~~~~~~

Typical performance for 1024×1024 matrix multiplication on modern GPUs:

.. list-table::
   :header-rows: 1
   :widths: 30 20 20 30

   * - Implementation
     - Time (ms)
     - GFLOPS
     - Notes
   * - Naive (Tutorial 1)
     - ~800 ms
     - ~2.7
     - Global memory only
   * - Tiled (Tutorial 6)
     - ~80 ms
     - ~27
     - Shared memory, 16×16 tiles
   * - cuBLAS (Tutorial 8)
     - ~15 ms
     - ~140
     - Highly optimized, tensor cores

**Why is cuBLAS faster?**

1. **Register blocking**: Uses registers for innermost loops (faster than shared memory)
2. **Warp-level primitives**: Uses specialized instructions for matrix operations
3. **Tensor cores**: On Volta/Turing/Ampere GPUs, uses dedicated matrix multiplication units
4. **Auto-tuning**: Selects optimal tile sizes and strategies based on GPU architecture
5. **Memory access patterns**: Optimized coalescing and bank conflict avoidance

cuBLAS features
~~~~~~~~~~~~~~~

Beyond basic matrix multiplication, cuBLAS provides a comprehensive suite of linear algebra operations:

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Category
     - Function
     - Description
   * - **Level 1 BLAS**
     - ``cublasSaxpy``
     - Vector addition: y = alpha*x + y
   * - 
     - ``cublasSdot``
     - Dot product between two vectors
   * - 
     - ``cublasSnrm2``
     - Euclidean norm of a vector
   * - **Level 2 BLAS**
     - ``cublasSgemv``
     - Matrix-vector multiplication
   * - 
     - ``cublasSger``
     - Outer product (rank-1 update)
   * - **Level 3 BLAS**
     - ``cublasSgemm``
     - Matrix-matrix multiplication (used in Tutorial 8)
   * - 
     - ``cublasStrsm``
     - Triangular system solve
   * - 
     - ``cublasSsyrk``
     - Symmetric rank-k update
   * - **Extensions**
     - Batched operations
     - Process multiple small matrices efficiently
   * - 
     - Strided batched
     - Regular memory patterns for batch processing
   * - 
     - Mixed precision
     - FP16, FP32, FP64, and tensor core support
   * - 
     - Complex arithmetic
     - Operations on complex-valued matrices

Summary
~~~~~~~

Tutorial 8 demonstrates production-ready GPU programming:

- **Learned through Tutorials 1-7**: Memory hierarchy, parallelism, tiling, optimization
- **Use in production (Tutorial 8)**: cuBLAS for standard operations

**Key takeaways:**

1. **Education vs production**: Learning how things work internally (custom kernels) vs using optimized libraries (cuBLAS)
2. **Performance**: cuBLAS achieves 5-10× speedup over hand-written tiled kernels
3. **Maintainability**: Library updates automatically benefit from new GPU features
4. **When to customize**: Only write custom kernels for specialized operations not in cuBLAS

**The complete GPU programming workflow:**

1. **Understand the fundamentals** (Tutorials 1-7): Memory hierarchy, parallelism, optimization
2. **Use libraries when possible** (Tutorial 8): cuBLAS, cuFFT, cuDNN, Thrust
3. **Write custom kernels only when needed**: Unique operations, fused operations, research

Tutorial 9: Batched cuBLAS for large-scale processing
------------------------------------------------------

Problem
~~~~~~~

Tutorial 7 showed how to process 100,000 matrices using custom kernels with streams and chunking. Now we combine that approach with cuBLAS for production-quality performance. Instead of writing custom matrix multiplication kernels, we use cuBLAS batched operations.

**The challenge:** 100,000 matrices of 1024×1024 each:

- Total data: 400 GB (too large for GPU memory)
- Need to process in chunks with pipelining
- Want production performance from cuBLAS

**Solution:** Combine Tutorial 7 (chunking + streams) with Tutorial 8 (cuBLAS)

- Process data in chunks of 1,000 matrices (4 GB per chunk)
- Use 3 CUDA streams for pipelining
- Use `cublasSgemmStridedBatched` for efficient batch processing

Solution
~~~~~~~~

**cuBLAS batched operations:**

cuBLAS provides two functions for processing multiple matrices:

1. **cublasSgemmBatched**: For matrices at arbitrary memory locations (array of pointers)
2. **cublasSgemmStridedBatched**: For matrices at regular intervals in memory (most efficient)

We use ``cublasSgemmStridedBatched`` because our matrices are stored contiguously:

.. code-block:: cpp

    // Process multiple matrices in one call
    cublasSgemmStridedBatched(
        handle,
        CUBLAS_OP_N, CUBLAS_OP_N,
        N, N, N,              // Matrix dimensions (all N×N)
        &alpha,
        d_A, N, N*N,          // A matrices: leading dim N, stride N*N between matrices
        d_B, N, N*N,          // B matrices: leading dim N, stride N*N between matrices
        &beta,
        d_C, N, N*N,          // C matrices: leading dim N, stride N*N between matrices
        batchCount);          // Number of matrices to multiply

**Key parameters:**

- **Stride**: Distance in elements between consecutive matrices (N*N for contiguous storage)
- **Batch count**: How many matrix multiplications to perform
- **Single kernel launch**: cuBLAS handles all matrices efficiently in one call

Implementation
~~~~~~~~~~~~~~

.. code-block:: cpp

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        // ═══════════════════════════════════════════════════════════
        // STEP 1: Configuration
        // ═══════════════════════════════════════════════════════════
        const int N = 1024;               // Matrix size (1024×1024)
        const int CHUNK_SIZE = 1000;      // Matrices per chunk
        const int NUM_CHUNKS = 100;       // Total 100 chunks = 100,000 matrices
        const int NUM_STREAMS = 3;        // Streams for pipelining
        
        const size_t matrixSize = N * N * sizeof(float);
        const size_t chunkBytes = CHUNK_SIZE * matrixSize;  // 4 GB per chunk
        
        printf("Processing %d matrices (%d chunks of %d)\n",
               NUM_CHUNKS * CHUNK_SIZE, NUM_CHUNKS, CHUNK_SIZE);
        printf("Chunk size: %.2f GB\n", chunkBytes / (1024.0*1024.0*1024.0));
        
        // ═══════════════════════════════════════════════════════════
        // STEP 2: Allocate pinned host memory (for async transfers)
        // ═══════════════════════════════════════════════════════════
        float *h_A[NUM_STREAMS], *h_B[NUM_STREAMS], *h_C[NUM_STREAMS];
        
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaMallocHost(&h_A[i], chunkBytes);  // Input matrices A
            cudaMallocHost(&h_B[i], chunkBytes);  // Input matrices B
            cudaMallocHost(&h_C[i], chunkBytes);  // Output matrices C
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 3: Allocate device memory (one buffer per stream)
        // ═══════════════════════════════════════════════════════════
        float *d_A[NUM_STREAMS], *d_B[NUM_STREAMS], *d_C[NUM_STREAMS];
        
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaMalloc(&d_A[i], chunkBytes);
            cudaMalloc(&d_B[i], chunkBytes);
            cudaMalloc(&d_C[i], chunkBytes);
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 4: Create cuBLAS handle and streams
        // ═══════════════════════════════════════════════════════════
        cublasHandle_t handle;
        cublasCreate(&handle);
        
        cudaStream_t streams[NUM_STREAMS];
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamCreate(&streams[i]);
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 5: Process chunks with pipelining
        // ═══════════════════════════════════════════════════════════
        const float alpha = 1.0f;
        const float beta = 0.0f;
        
        for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
            int streamIdx = chunk % NUM_STREAMS;
            cudaStream_t stream = streams[streamIdx];
            
            // Set cuBLAS to use this stream
            cublasSetStream(handle, stream);
            
            // ───────────────────────────────────────────────────────
            // Load chunk from disk into pinned memory
            // ───────────────────────────────────────────────────────
            // In real application: read from file
            // For now: generate random data
            for (int i = 0; i < CHUNK_SIZE * N * N; i++) {
                h_A[streamIdx][i] = rand() / (float)RAND_MAX;
                h_B[streamIdx][i] = rand() / (float)RAND_MAX;
            }
            
            // ───────────────────────────────────────────────────────
            // Pipeline: Transfer in → Compute → Transfer out
            // ───────────────────────────────────────────────────────
            
            // Transfer input chunk to GPU (async)
            cudaMemcpyAsync(d_A[streamIdx], h_A[streamIdx], 
                           chunkBytes, cudaMemcpyHostToDevice, stream);
            cudaMemcpyAsync(d_B[streamIdx], h_B[streamIdx], 
                           chunkBytes, cudaMemcpyHostToDevice, stream);
            
            // Perform batched matrix multiplication (async on stream)
            // Process all 1,000 matrices in the chunk with one call
            cublasSgemmStridedBatched(
                handle,
                CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N,                    // Matrix dimensions
                &alpha,
                d_A[streamIdx], N, N*N,     // A: leading dim, stride
                d_B[streamIdx], N, N*N,     // B: leading dim, stride
                &beta,
                d_C[streamIdx], N, N*N,     // C: leading dim, stride
                CHUNK_SIZE);                // Number of matrices
            
            // Transfer results back to host (async)
            cudaMemcpyAsync(h_C[streamIdx], d_C[streamIdx],
                           chunkBytes, cudaMemcpyDeviceToHost, stream);
            
            // ───────────────────────────────────────────────────────
            // Save results from previous chunk (if available)
            // ───────────────────────────────────────────────────────
            if (chunk >= NUM_STREAMS) {
                int prevChunk = chunk - NUM_STREAMS;
                cudaStreamSynchronize(stream);
                
                // Save results: write to disk
                printf("Chunk %d processed (1000 matrices)\n", prevChunk);
            }
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 6: Wait for final chunks and save results
        // ═══════════════════════════════════════════════════════════
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamSynchronize(streams[i]);
            printf("Final chunk %d processed\n", NUM_CHUNKS - NUM_STREAMS + i);
        }
        
        printf("\nAll 100,000 matrices processed successfully!\n");
        
        // ═══════════════════════════════════════════════════════════
        // STEP 7: Cleanup
        // ═══════════════════════════════════════════════════════════
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamDestroy(streams[i]);
            cudaFree(d_A[i]);
            cudaFree(d_B[i]);
            cudaFree(d_C[i]);
            cudaFreeHost(h_A[i]);
            cudaFreeHost(h_B[i]);
            cudaFreeHost(h_C[i]);
        }
        
        cublasDestroy(handle);
        
        return 0;
    }

**Compilation:**

.. code-block:: bash

    nvcc -o tutorial9 tutorial9.cu -lcublas

**Key differences from Tutorial 7:**

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Aspect
     - Tutorial 7 (Custom kernel)
     - Tutorial 9 (cuBLAS batched)
   * - Kernel code
     - Manual tiling implementation
     - No kernel needed
   * - Batch processing
     - Loop over matrices, launch kernel per matrix
     - Single `cublasSgemmStridedBatched` call
   * - Performance
     - ~27 GFLOPS per matrix
     - ~140 GFLOPS per matrix
   * - Code complexity
     - 100+ lines of kernel code
     - Single function call
   * - Maintainability
     - Must update for new GPUs
     - Automatically optimized

Why use batched operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Performance benefits:**

1. **Reduced kernel launch overhead**: One launch for 1,000 matrices vs 1,000 launches
2. **Better GPU utilization**: cuBLAS schedules work efficiently across SMs
3. **Optimized memory access**: Library handles data layout for maximum throughput

**When to use batched operations:**

- Processing many small to medium matrices (common in deep learning, simulations)
- Matrices stored contiguously in memory (use StridedBatched)
- Need production performance without custom optimization

**When not to use:**

- Single large matrix (use regular `cublasSgemm`)
- Matrices at non-uniform memory locations (may need `cublasSgemmBatched` with pointer array)
- Custom operations not available in cuBLAS

Summary
~~~~~~~

Tutorial 9 completes the progression from basics to production:

- **Tutorials 1-6**: Learn GPU fundamentals (memory, parallelism, optimization)
- **Tutorial 7**: Apply concepts at scale (chunking, streams, custom kernels)
- **Tutorial 8**: Use optimized libraries (cuBLAS for single operation)
- **Tutorial 9**: Combine both (cuBLAS batched + chunking + streams)

**Key takeaways:**

1. **Batched operations**: Process multiple matrices efficiently with single library call
2. **Production workflow**: Chunking + streams + cuBLAS = optimal throughput
3. **Simplicity**: 5-10× speedup with less code than custom kernels
4. **Scalability**: Pattern works for millions of matrices with streaming from disk

Tutorial 10: Multi-operation pipeline (cuBLAS + cuFFT)
-------------------------------------------------------

Problem
~~~~~~~

Real applications often need multiple operations on the same data. For example, processing 100,000 images (1024×1024) where each image needs:

1. **Matrix multiplication**: Apply a transformation matrix (using cuBLAS)
2. **FFT**: Transform to frequency domain (using cuFFT)

This appears in ptychography reconstruction, where each diffraction pattern undergoes matrix updates followed by Fourier transforms. Processing 100,000 patterns requires efficient pipelining of both operations.

**The challenge:**

- 400 GB total data (too large for GPU memory)
- Two different library operations per image
- Need to maintain high GPU utilization throughout the pipeline

**Solution:** Combine cuBLAS and cuFFT in a streaming pipeline:

- Process in chunks of 1,000 images (4 GB per chunk)
- Use 3 streams for pipelining
- Apply both operations to each chunk before moving to the next

Solution
~~~~~~~~

**Understanding FFT (Fast Fourier Transform):**

The Fourier transform converts data from spatial/time domain to frequency domain. For images:

- **Spatial domain**: Pixel intensities at positions :math:`(x, y)`
- **Frequency domain**: Amplitude and phase of different spatial frequencies

**Complex numbers in CUDA:**

C and C++ don't have built-in complex number types like Python's `3+4j`. CUDA libraries define their own types to represent complex numbers.

``cuFFT`` provides ``cufftComplex`` - a struct that holds the real and imaginary parts:

.. code-block:: cpp

    // This is already defined in cufft.h when you #include <cufft.h>
    // You don't write this yourself - the library provides it
    typedef struct {
        float x;  // Real part
        float y;  // Imaginary part
    } cufftComplex;

Think of it as a simple container holding two floats. When you create a ``cufftComplex`` variable, you're allocating space for both the real and imaginary components.

**Example usage:**

.. code-block:: cpp

    #include <cufft.h>  // Provides cufftComplex type
    
    // Create the complex number 3 + 4i
    cufftComplex z;
    z.x = 3.0f;  // Real part (the "3" in 3+4i)
    z.y = 4.0f;  // Imaginary part (the "4" in 3+4i)
    
    // Calculate magnitude: |z| = sqrt(3² + 4²) = sqrt(9 + 16) = 5
    float magnitude = sqrtf(z.x * z.x + z.y * z.y);
    printf("Magnitude: %.1f\n", magnitude);  // Prints: 5.0
    
    // For an array of complex numbers (like an image):
    cufftComplex* image = (cufftComplex*)malloc(1024 * 1024 * sizeof(cufftComplex));
    // Each pixel now has .x (real) and .y (imaginary) parts

For our :math:`1024 \times 1024` image:

- **Input**: :math:`1024 \times 1024` complex values (8 MB per image: 2 floats × 4 bytes each)
- **Output**: :math:`1024 \times 1024` complex values in frequency domain (same size)
- **Operation**: 2D FFT processes rows, then columns

**Using multiple CUDA libraries together:**

Both ``cuBLAS`` and ``cuFFT`` work with CUDA streams, allowing us to pipeline operations:

.. code-block:: cpp

    // For each chunk:
    // 1. Transfer data to GPU (async on stream)
    // 2. Apply matrix multiplication using cuBLAS (async on stream)
    //    - Works on complex matrices using cublasCgemm
    // 3. Apply 2D FFT using cuFFT (async on stream)
    //    - Transforms spatial domain → frequency domain
    // 4. Transfer results back (async on stream)

**No custom kernels needed** - both libraries handle GPU optimization internally.

**What's actually running on the GPU:**

When you call ``cublasCgemmStridedBatched`` and ``cufftExecC2C``, each library launches its own highly optimized GPU kernels:

1. **cuBLAS call** → Launches multiple kernels internally:

   - Tiling kernels (like Tutorial 6, but more sophisticated)
   - Register blocking kernels
   - Tensor core kernels (on modern GPUs)
   - Each library kernel is pre-compiled and optimized

2. **cuFFT call** → Launches FFT kernels internally:

   - Cooley-Tukey algorithm kernels for row transforms
   - Additional kernels for column transforms
   - Twiddle factor calculations
   - Memory transpose operations

These kernels execute **sequentially on the same stream**:

.. code-block:: text

    Stream timeline:
    ──────────────────────────────────────────────────────────────
    [Transfer In] → [cuBLAS kernels] → [cuFFT kernels] → [Transfer Out]
                     (multiple)          (multiple)
    
    The stream ensures operations happen in order

**Key insight:** You're not writing kernels, but you're still using them! Each library function is like a black box containing dozens of optimized GPU kernels. Different CUDA libraries can share streams and work on the same data buffer, enabling complex multi-operation pipelines without writing any GPU kernel code yourself.

Implementation
~~~~~~~~~~~~~~

.. code-block:: cpp

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <cufft.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        // ═══════════════════════════════════════════════════════════
        // STEP 1: Configuration
        // ═══════════════════════════════════════════════════════════
        const int N = 1024;               // Image size (1024×1024)
        const int CHUNK_SIZE = 1000;      // Images per chunk
        const int NUM_CHUNKS = 100;       // Total 100 chunks = 100,000 images
        const int NUM_STREAMS = 3;        // Streams for pipelining
        
        const size_t imageSize = N * N * sizeof(cufftComplex);
        const size_t chunkBytes = CHUNK_SIZE * imageSize;
        
        printf("Processing %d images (%d chunks of %d)\n",
               NUM_CHUNKS * CHUNK_SIZE, NUM_CHUNKS, CHUNK_SIZE);
        printf("Operations per image: Matrix multiplication + FFT\n");
        printf("Chunk size: %.2f GB\n", chunkBytes / (1024.0*1024.0*1024.0));
        
        // ═══════════════════════════════════════════════════════════
        // STEP 2: Allocate pinned host memory
        // ═══════════════════════════════════════════════════════════
        cufftComplex *h_data[NUM_STREAMS];    // Input/output data
        cufftComplex *h_result[NUM_STREAMS];  // Final results
        
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaMallocHost(&h_data[i], chunkBytes);
            cudaMallocHost(&h_result[i], chunkBytes);
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 3: Allocate device memory
        // ═══════════════════════════════════════════════════════════
        cufftComplex *d_data[NUM_STREAMS];    // Working buffer on GPU
        cufftComplex *d_temp[NUM_STREAMS];    // Temporary for matrix mult
        
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaMalloc(&d_data[i], chunkBytes);
            cudaMalloc(&d_temp[i], chunkBytes);
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 4: Initialize cuBLAS and cuFFT
        // ═══════════════════════════════════════════════════════════
        // cuBLAS handle
        cublasHandle_t cublasHandle;
        cublasCreate(&cublasHandle);
        
        // cuFFT plan for batched 2D FFT
        // Process multiple 1024×1024 FFTs in one call
        cufftHandle fftPlan;
        int rank = 2;                     // 2D FFT
        int n[2] = {N, N};               // Dimensions
        int batch = CHUNK_SIZE;          // Number of FFTs per batch
        cufftPlanMany(&fftPlan, rank, n,
                      NULL, 1, N*N,       // Input layout
                      NULL, 1, N*N,       // Output layout
                      CUFFT_C2C, batch);  // Complex-to-complex, batched
        
        // Create CUDA streams
        cudaStream_t streams[NUM_STREAMS];
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamCreate(&streams[i]);
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 5: Process chunks with multi-operation pipeline
        // ═══════════════════════════════════════════════════════════
        const cufftComplex alpha = {1.0f, 0.0f};  // Scaling factor for matrix mult
        const cufftComplex beta = {0.0f, 0.0f};
        
        for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
            int streamIdx = chunk % NUM_STREAMS;
            cudaStream_t stream = streams[streamIdx];
            
            // Set both libraries to use this stream
            cublasSetStream(cublasHandle, stream);
            cufftSetStream(fftPlan, stream);
            
            // ───────────────────────────────────────────────────────
            // Load chunk from disk
            // ───────────────────────────────────────────────────────
            // In real application: read diffraction patterns from file
            // For now: generate random complex data
            for (int i = 0; i < CHUNK_SIZE * N * N; i++) {
                h_data[streamIdx][i].x = rand() / (float)RAND_MAX;
                h_data[streamIdx][i].y = rand() / (float)RAND_MAX;
            }
            
            // ───────────────────────────────────────────────────────
            // Pipeline: Transfer → Matrix Mult → FFT → Transfer back
            // ───────────────────────────────────────────────────────
            
            // Transfer input to GPU (async)
            cudaMemcpyAsync(d_data[streamIdx], h_data[streamIdx],
                           chunkBytes, cudaMemcpyHostToDevice, stream);
            
            // Operation 1: Matrix multiplication using cuBLAS (async)
            // Apply transformation to all images in chunk
            cublasCgemmStridedBatched(
                cublasHandle,
                CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N,
                &alpha,
                d_data[streamIdx], N, N*N,
                d_data[streamIdx], N, N*N,  // Multiply with itself (demo)
                &beta,
                d_temp[streamIdx], N, N*N,
                CHUNK_SIZE);
            
            // Copy result back to d_data for FFT
            cudaMemcpyAsync(d_data[streamIdx], d_temp[streamIdx],
                           chunkBytes, cudaMemcpyDeviceToDevice, stream);
            
            // Operation 2: 2D FFT using cuFFT (async)
            // Transform all images to frequency domain
            cufftExecC2C(fftPlan, d_data[streamIdx], d_data[streamIdx],
                        CUFFT_FORWARD);
            
            // Transfer results back to host (async)
            cudaMemcpyAsync(h_result[streamIdx], d_data[streamIdx],
                           chunkBytes, cudaMemcpyDeviceToHost, stream);
            
            // ───────────────────────────────────────────────────────
            // Save results from previous chunk
            // ───────────────────────────────────────────────────────
            if (chunk >= NUM_STREAMS) {
                int prevChunk = chunk - NUM_STREAMS;
                cudaStreamSynchronize(stream);
                
                // Save results: write to disk
                printf("Chunk %d processed (1000 images: MatMul + FFT)\n", 
                       prevChunk);
            }
        }
        
        // ═══════════════════════════════════════════════════════════
        // STEP 6: Wait for final chunks
        // ═══════════════════════════════════════════════════════════
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamSynchronize(streams[i]);
            printf("Final chunk %d processed\n", NUM_CHUNKS - NUM_STREAMS + i);
        }
        
        printf("\nAll 100,000 images processed successfully!\n");
        printf("Each image: Matrix multiplication + 2D FFT\n");
        
        // ═══════════════════════════════════════════════════════════
        // STEP 7: Cleanup
        // ═══════════════════════════════════════════════════════════
        cufftDestroy(fftPlan);
        cublasDestroy(cublasHandle);
        
        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamDestroy(streams[i]);
            cudaFree(d_data[i]);
            cudaFree(d_temp[i]);
            cudaFreeHost(h_data[i]);
            cudaFreeHost(h_result[i]);
        }
        
        return 0;
    }

**Compilation:**

.. code-block:: bash

    nvcc -o tutorial10 tutorial10.cu -lcublas -lcufft

**Key points:**

- Link both libraries: ``-lcublas -lcufft``
- Both libraries share the same CUDA stream
- Operations execute sequentially on the stream (transfer → cuBLAS → cuFFT → transfer)
- No custom kernels needed

cuFFT batched operations
~~~~~~~~~~~~~~~~~~~~~~~~~

Similar to cuBLAS, cuFFT provides batched operations for processing multiple FFTs efficiently:

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Function
     - Description
   * - ``cufftPlanMany``
     - Creates a plan for batched FFT operations
   * - ``cufftExecC2C``
     - Complex-to-complex FFT (forward or inverse)
   * - ``cufftExecR2C``
     - Real-to-complex FFT (forward)
   * - ``cufftExecC2R``
     - Complex-to-real FFT (inverse)
   * - ``cufftSetStream``
     - Associates FFT plan with CUDA stream

**Plan parameters:**

- ``rank``: Dimensionality (1D, 2D, or 3D)
- ``n[]``: Size of each dimension
- ``batch``: Number of FFTs to perform
- ``CUFFT_C2C``: Transform type (complex-to-complex, real-to-complex, etc.)

.. dropdown:: Can I call cuBLAS/cuFFT from within custom kernels?
    :icon: info
    :color: danger

    **Short answer:** No, you cannot call ``cuBLAS`` or ``cuFFT`` functions from inside your own ``__global__`` kernel.

    **Why not:**

    ``cuBLAS`` and ``cuFFT`` functions are **host functions** (run on CPU) that launch their own GPU kernels internally. They cannot be called from device code (GPU kernels).

    **What happens between kernel launches:**

    Between separate kernel launches:

    1. **Kernel 1 finishes** → writes results to global memory (VRAM)
    2. **GPU waits** for all threads to complete
    3. **Kernel 2 launches** → reads data from global memory (VRAM)

    **Is the VRAM write→read a bottleneck?**

    Yes and no, depending on the operation.

    **The concern is real:** Writing to VRAM and reading back from it takes time. For a :math:`1024 \times 1024` complex matrix (8 MB):

    - **Write time**: ~0.01 ms (at ~800 GB/s VRAM bandwidth)
    - **Read time**: ~0.01 ms
    - **Total memory traffic**: ~0.02 ms

    **Here's the key insight:** Fusion reduces memory traffic

    .. code-block:: text

        Separate kernels (Tutorial 10):
        ─────────────────────────────────────────────────────────────
        Kernel 1 (cuBLAS): Read input (8 MB) → Compute → Write output (8 MB)
        Kernel 2 (cuFFT):  Read input (8 MB) → Compute → Write output (8 MB)
        
        Total VRAM traffic: 32 MB (4 transfers)
        Memory time: ~0.04 ms (32 MB at 800 GB/s)
        
        Hypothetical fused kernel:
        ─────────────────────────────────────────────────────────────
        Fused kernel: Read input (8 MB) → MatMul → FFT → Write output (8 MB)
        
        Total VRAM traffic: 16 MB (2 transfers)
        Memory time: ~0.02 ms (16 MB at 800 GB/s)
        Savings: 50% memory traffic, ~0.02 ms saved

    **The ptychography case is interesting:** You have 100,000 images where each image undergoes multiple operations:

    1. Matrix multiplication (probe update)
    2. Forward FFT (propagate to detector)
    3. Apply constraints (amplitude/phase corrections)
    4. Inverse FFT (back to real space)
    5. Another matrix multiplication (object update)

    **Should you fuse these operations?**

    For ptychography with :math:`1024 \times 1024` images, **the library approach (Tutorial 10) is still optimal** because:

    - Each operation is compute-intensive (FFT: ~0.5 ms, matrix mult: ~0.1 ms)
    - Memory traffic between kernels (~0.02 ms) is only 3% of total time
    - cuBLAS and cuFFT are 5-10× faster than custom implementations
    - Processing 100,000 images takes minutes either way

    **When fusion WOULD matter:**

    - **Small images**: :math:`256 \times 256` or smaller where computation is fast
    - **Simple operations**: Element-wise operations (scaling, phase shifts) rather than FFT/matrix mult
    - **Tight iteration loops**: Iterative refinement with 10+ operations per image per iteration
    - **Memory-bound regime**: When GPU is waiting on memory rather than computing

    **Important note about ptychography:** You're right that many ptychography operations are memory-bound! The constraint application steps (amplitude corrections, phase updates, element-wise multiplications) are typically **very fast computations** (~0.01 ms) but require reading and writing full images. For these operations, fusion can help significantly.

    **Practical recommendation for ptychography reconstruction:**

    The pipeline has two types of operations:

    1. **Compute-intensive** (use libraries, don't fuse):
    
       - Matrix multiplication (probe/object updates): cuBLAS
       - Forward/inverse FFT: cuFFT
       - These dominate computation time (~0.6 ms total)
       - Memory traffic between them is negligible (~3% of time)

    2. **Memory-bound** (consider fusion):
    
       - Amplitude constraint application
       - Phase corrections
       - Element-wise multiplications
       - These are fast (~0.01 ms each) but happen many times
       - **Fusing these into a single kernel reduces intermediate VRAM writes**

    **Optimal approach:**

    .. code-block:: cpp

        for each iteration:
            // Use library (no fusion needed, computation dominates)
            cuBLAS: probe update              // ~0.1 ms
            cuFFT: forward transform          // ~0.5 ms
            
            // Fuse memory-bound operations (saves VRAM traffic)
            custom kernel: {
                apply amplitude constraint
                phase correction  
                element-wise operations
            }                                  // ~0.01 ms total
            
            cuFFT: inverse transform          // ~0.5 ms
            cuBLAS: object update             // ~0.1 ms

    This hybrid approach gives you the best of both worlds: optimized libraries for heavy computation, fused kernels for memory-bound operations.

    **When the savings DON'T matter (like Tutorial 10):**

    For our :math:`1024 \times 1024` matrices:

    - **cuBLAS computation time**: ~0.1 ms (memory traffic is ~20% of total time)
    - **cuFFT computation time**: ~0.5 ms (FFT is compute-intensive)
    - **Memory transfer time**: ~0.02 ms (only ~3% of total time)

    The computation dominates, so the intermediate VRAM write→read is negligible compared to the actual cuBLAS and cuFFT work.

    **Bottom line:** For large, compute-intensive operations like :math:`1024 \times 1024` matrix multiplication and FFT, the intermediate VRAM traffic between kernels is a small fraction of total time. The benefit of using highly optimized libraries far outweighs the cost of the extra memory traffic.

.. dropdown:: How can I fuse kernels?
    :icon: info
    :color: info

    If you need to combine operations in a single kernel launch, you have two options:

    1. **Write custom fused kernel**: Implement both operations yourself in one kernel

    .. code-block:: cpp
    
        __global__ void fusedMatMulAndFFT(float* input, float* output, int N) {
            // Your custom matrix multiplication code
            // + Your custom FFT code
            // Both in one kernel - no intermediate global memory writes
        }
    
    **Pros**: No intermediate memory traffic, single launch overhead
    
    **Cons**: Much harder to implement, lose cuBLAS/cuFFT optimizations

    2. **Use device-side libraries** (limited availability)

    CUDA provides some device-callable math libraries (like ``curand`` for random numbers), but ``cuBLAS`` and ``cuFFT`` are not available as device functions.

**For Tutorial 10's use case:**

The sequential approach (separate cuBLAS and cuFFT calls) is actually fine because:

- Data stays in GPU memory between calls (no CPU←→GPU transfer)
- Kernel launch overhead is microseconds (negligible compared to computation)
- You get highly optimized implementations from both libraries
- The stream ensures operations execute in order without CPU intervention

**When to worry about kernel launch overhead:**

- Processing very small matrices (< 64×64) where launch overhead dominates
- Launching thousands of tiny kernels in a loop
- Operations that need custom fusion (e.g., applying constraints between matrix ops)

For our 1024×1024 images, cuBLAS + cuFFT as separate calls is the right choice!

Real-world application: Ptychography
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This pattern directly applies to ptychography reconstruction:

1. **Load diffraction pattern**: Read from detector (1024×1024 complex array)
2. **Apply probe update**: Matrix multiplication with probe function (cuBLAS)
3. **Propagate to detector**: 2D FFT to frequency domain (cuFFT)
4. **Apply constraints**: Can add custom kernel if needed
5. **Inverse FFT**: Transform back to real space (cuFFT)
6. **Update object**: Another matrix operation (cuBLAS)

Processing 100,000 diffraction patterns with this pipeline achieves high-resolution image reconstruction in minutes instead of hours.

Multi-library pipeline benefits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Why combine libraries:**

1. **No reinventing the wheel**: Use highly optimized implementations
2. **Composability**: Libraries work together through CUDA streams
3. **Maintainability**: Updates benefit all operations automatically
4. **Performance**: Each library is tuned for its specific operation

**Common library combinations:**

- **cuBLAS + cuFFT**: Signal processing, ptychography, convolution
- **cuBLAS + cuDNN**: Deep learning (matrix ops + neural network layers)
- **cuFFT + Thrust**: FFT processing with sorting/reduction
- **cuSolver + cuBLAS**: Linear system solving with matrix operations

Summary
~~~~~~~

Tutorial 10 demonstrates production-scale multi-operation pipelines:

- **Tutorials 1-6**: GPU fundamentals
- **Tutorial 7**: Large-scale processing with custom kernels
- **Tutorials 8-9**: Single library optimization (cuBLAS)
- **Tutorial 10**: Multi-library pipeline (cuBLAS + cuFFT)

**Key takeaways:**

1. **Library composition**: Multiple CUDA libraries work together via streams
2. **No custom kernels**: Standard operations use optimized library implementations
3. **Real-world pattern**: Typical for scientific computing (matrix ops + FFT)
4. **Scalability**: Same pattern works for any number of images with chunking

**Complete workflow:**

1. Learn fundamentals with custom kernels (understand how GPUs work)
2. Replace custom code with libraries (cuBLAS, cuFFT, cuDNN)
3. Combine libraries in pipelines (handle complex real-world applications)
4. **Mix custom kernels with libraries** (Tutorial 11 shows how)

Tutorial 11: Hybrid pipeline (custom kernels + CUDA libraries)
---------------------------------------------------------------

Problem
~~~~~~~

Real-world applications often need a mix of custom kernels and optimized libraries. For example, in signal processing:

- **Custom operations**: Domain-specific transformations (thresholding, masking, custom filters)
- **Standard operations**: FFT transforms, matrix multiplications

Writing everything as custom kernels means reinventing the wheel for standard operations. Using only libraries means you cannot implement custom domain logic. The solution: **combine custom CUDA kernels with CUDA libraries**.

This tutorial shows a simple image filtering pipeline that:

1. Uses **custom kernel** to apply a threshold mask
2. Uses **cuFFT** for frequency domain transformation
3. Uses **custom kernel** to apply frequency domain filtering
4. Uses **cuFFT** for inverse transform back to spatial domain

Solution overview
~~~~~~~~~~~~~~~~~

**Image filtering pipeline:**

.. code-block:: text

    1. Apply threshold mask: Mask pixels below threshold
       → Custom kernel (domain-specific logic)
    
    2. Transform to frequency domain: spectrum = FFT(masked_image)
       → cuFFT (optimized library)
    
    3. Apply frequency filter: Filter high/low frequencies
       → Custom kernel (element-wise operations)
    
    4. Transform back: filtered_image = IFFT(spectrum)
       → cuFFT (inverse transform)

**Why this hybrid approach:**

- **Custom kernels**: Fast for simple element-wise operations and custom logic
- **cuFFT**: Highly optimized FFT implementation using specialized algorithms
- **Best of both worlds**: Speed of libraries + flexibility of custom code

Example: For 1024×1024 image, ~2 ms total (vs ~8 ms if implementing FFT manually)

Step 1: Define custom kernels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Custom kernel 1: Apply threshold mask**

.. code-block:: cuda

    // Set pixels below threshold to zero
    __global__ void applyThreshold(
        const float* input,
        float* output,
        float threshold,
        int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= N) return;
        
        // Keep pixel if above threshold, otherwise set to zero
        output[idx] = (input[idx] >= threshold) ? input[idx] : 0.0f;
    }

**Custom kernel 2: Apply frequency filter**

.. code-block:: cuda

    // Apply low-pass filter in frequency domain
    // Attenuate high frequencies (reduce noise)
    __global__ void frequencyFilter(
        cufftComplex* spectrum,
        int width,
        int height,
        float cutoff_frequency)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        
        if (x >= width || y >= height) return;
        
        // Calculate distance from center (DC component)
        int center_x = width / 2;
        int center_y = height / 2;
        
        float dx = (float)(x - center_x) / width;
        float dy = (float)(y - center_y) / height;
        float distance = sqrtf(dx * dx + dy * dy);
        
        // Low-pass filter: Attenuate based on distance
        float attenuation = 1.0f / (1.0f + powf(distance / cutoff_frequency, 4.0f));
        
        int idx = y * width + x;
        spectrum[idx].x *= attenuation;
        spectrum[idx].y *= attenuation;
    }

**Custom kernel 3: Normalize after IFFT**

.. code-block:: cuda

    // cuFFT doesn't normalize, so we need to divide by N
    __global__ void normalize(
        cufftComplex* data,
        int N,
        float scale)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= N) return;
        
        data[idx].x *= scale;
        data[idx].y *= scale;
    }

Step 2: Set up cuFFT library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

    #include <cuda_runtime.h>
    #include <cufft.h>
    #include <stdio.h>

    int main() {
        // ═══════════════════════════════════════════════════════════
        // Configuration
        // ═══════════════════════════════════════════════════════════
        const int WIDTH = 1024;
        const int HEIGHT = 1024;
        const int TOTAL_SIZE = WIDTH * HEIGHT;
        
        const float THRESHOLD = 0.3f;       // Threshold for masking
        const float CUTOFF_FREQ = 0.1f;     // Low-pass filter cutoff
        
        // ═══════════════════════════════════════════════════════════
        // Initialize cuFFT
        // ═══════════════════════════════════════════════════════════
        
        // Create 2D FFT plan for complex-to-complex transform
        cufftHandle plan_forward, plan_inverse;
        cufftPlan2d(&plan_forward, HEIGHT, WIDTH, CUFFT_C2C);
        cufftPlan2d(&plan_inverse, HEIGHT, WIDTH, CUFFT_C2C);
        
        // Create CUDA stream for ordering operations
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        
        // Associate FFT plans with stream
        cufftSetStream(plan_forward, stream);
        cufftSetStream(plan_inverse, stream);
        
        printf("Image filtering pipeline\n");
        printf("Image size: %d×%d\n", WIDTH, HEIGHT);
        printf("Threshold: %.2f\n", THRESHOLD);
        printf("Cutoff frequency: %.2f\n", CUTOFF_FREQ);
        
        // ═══════════════════════════════════════════════════════════
        // Allocate GPU memory
        // ═══════════════════════════════════════════════════════════
        float *d_input_real, *d_output_real;
        cufftComplex *d_complex_input, *d_spectrum, *d_filtered;
        
        cudaMalloc(&d_input_real, TOTAL_SIZE * sizeof(float));
        cudaMalloc(&d_output_real, TOTAL_SIZE * sizeof(float));
        cudaMalloc(&d_complex_input, TOTAL_SIZE * sizeof(cufftComplex));
        cudaMalloc(&d_spectrum, TOTAL_SIZE * sizeof(cufftComplex));
        cudaMalloc(&d_filtered, TOTAL_SIZE * sizeof(cufftComplex));
        
        // Load input image (for demo, we'll use synthetic data)
        // In real application: cudaMemcpy from host
        // ... load image data ...

Step 3: Implement the hybrid pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

        // ═══════════════════════════════════════════════════════════
        // Set up kernel launch configuration
        // ═══════════════════════════════════════════════════════════
        
        // For 1D kernels (threshold, normalize)
        int threadsPerBlock_1D = 256;
        int blocks_1D = (TOTAL_SIZE + threadsPerBlock_1D - 1) / threadsPerBlock_1D;
        
        // For 2D kernel (frequency filter)
        dim3 threadsPerBlock_2D(16, 16);
        dim3 blocks_2D((WIDTH + 15) / 16, (HEIGHT + 15) / 16);
        
        // ═══════════════════════════════════════════════════════════
        // Pipeline execution (Custom + cuFFT + Custom + cuFFT)
        // ═══════════════════════════════════════════════════════════
        
        printf("\nProcessing pipeline:\n");
        
        // ───────────────────────────────────────────────────────────
        // STEP 1: Apply threshold (custom kernel)
        // ───────────────────────────────────────────────────────────
        printf("  [1/4] Applying threshold mask...\n");
        applyThreshold<<<blocks_1D, threadsPerBlock_1D, 0, stream>>>(
            d_input_real,
            d_output_real,
            THRESHOLD,
            TOTAL_SIZE
        );
        
        // Convert to complex for FFT (real part = masked data, imag = 0)
        // In practice, you'd have a kernel for this; simplified here
        cudaMemcpy(d_complex_input, d_output_real, 
                   TOTAL_SIZE * sizeof(float), cudaMemcpyDeviceToDevice);
        
        // ───────────────────────────────────────────────────────────
        // STEP 2: Forward FFT (cuFFT library)
        // ───────────────────────────────────────────────────────────
        printf("  [2/4] Computing FFT...\n");
        cufftExecC2C(plan_forward, d_complex_input, d_spectrum, CUFFT_FORWARD);
        
        // ───────────────────────────────────────────────────────────
        // STEP 3: Apply frequency filter (custom kernel)
        // ───────────────────────────────────────────────────────────
        printf("  [3/4] Applying frequency domain filter...\n");
        frequencyFilter<<<blocks_2D, threadsPerBlock_2D, 0, stream>>>(
            d_spectrum,
            WIDTH,
            HEIGHT,
            CUTOFF_FREQ
        );
        
        // ───────────────────────────────────────────────────────────
        // STEP 4: Inverse FFT (cuFFT library)
        // ───────────────────────────────────────────────────────────
        printf("  [4/4] Computing inverse FFT...\n");
        cufftExecC2C(plan_inverse, d_spectrum, d_filtered, CUFFT_INVERSE);
        
        // Normalize (cuFFT doesn't normalize by default)
        float scale = 1.0f / (WIDTH * HEIGHT);
        normalize<<<blocks_1D, threadsPerBlock_1D, 0, stream>>>(
            d_filtered,
            TOTAL_SIZE,
            scale
        );
        
        // Wait for all operations to complete
        cudaStreamSynchronize(stream);
        
        printf("\nPipeline complete!\n");
        
        // Copy result back to host if needed
        // cudaMemcpy(host_output, d_filtered, ...);
        
        // ═══════════════════════════════════════════════════════════
        // Cleanup
        // ═══════════════════════════════════════════════════════════
        cufftDestroy(plan_forward);
        cufftDestroy(plan_inverse);
        cudaStreamDestroy(stream);
        
        cudaFree(d_input_real);
        cudaFree(d_output_real);
        cudaFree(d_complex_input);
        cudaFree(d_spectrum);
        cudaFree(d_filtered);
        
        return 0;
    }

**Compilation:**

.. code-block:: bash

    nvcc -o tutorial11 tutorial11.cu -lcufft

Key points about hybrid pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**1. All operations share the same stream**

.. code-block:: cpp

    // Set library to use our stream
    cufftSetStream(plan_forward, stream);
    cufftSetStream(plan_inverse, stream);
    
    // Custom kernels also use the same stream
    myKernel<<<blocks, threads, 0, stream>>>(...);

This ensures operations execute in order without extra synchronization between custom kernels and library calls.

**2. Data stays on GPU throughout**

.. code-block:: text

    GPU Memory (no CPU transfers during processing):
    ────────────────────────────────────────────────────────
    Custom kernel → cuFFT → Custom kernel → cuFFT → Custom kernel
         ↓            ↓          ↓            ↓           ↓
    All operations read/write GPU VRAM (fast, ~800 GB/s)
    
    No CPU ←→ GPU transfers (would be slow, ~16 GB/s PCIe)

Everything happens on GPU from input to output.

**3. Custom kernels for simple operations, libraries for complex ones**

.. list-table::
   :header-rows: 1
   :widths: 40 30 30

   * - Operation
     - Complexity
     - Implementation
   * - Thresholding
     - Simple
     - Custom kernel
   * - Frequency filtering
     - Simple
     - Custom kernel
   * - Normalization
     - Simple
     - Custom kernel
   * - 2D FFT
     - Complex
     - cuFFT library

**4. Flexibility of custom kernels**

Custom kernels let you implement domain-specific logic:

.. code-block:: cuda

    // Your custom logic (not available in any library)
    if (input[idx] >= threshold) {
        output[idx] = input[idx];  // Keep value
    } else {
        output[idx] = 0.0f;        // Mask out
    }

Libraries provide optimized standard operations, custom kernels provide flexibility.

Performance comparison
~~~~~~~~~~~~~~~~~~~~~~

For 1024×1024 images:

.. list-table::
   :header-rows: 1
   :widths: 40 30 30

   * - Approach
     - Time
     - Notes
   * - CPU (NumPy + SciPy)
     - ~50 ms
     - Single-threaded
   * - Custom kernels only
     - ~8 ms
     - Manual FFT implementation (slow)
   * - Libraries only
     - N/A
     - Cannot implement custom logic
   * - **Hybrid (Tutorial 11)**
     - **~2 ms**
     - Best of both worlds

**Breakdown for hybrid approach:**

- Threshold kernel: ~0.1 ms
- Forward FFT (cuFFT): ~0.8 ms
- Frequency filter kernel: ~0.2 ms
- Inverse FFT (cuFFT): ~0.8 ms
- Normalize kernel: ~0.1 ms
- **Total: ~2 ms** (25× faster than CPU)

Why the hybrid approach works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Principle 1: Use the right tool for each operation**

- **Custom kernels**: Quick to write for simple logic (thresholding, masking, element-wise ops)
- **cuFFT**: Years of optimization, specialized algorithms and hardware acceleration

Writing FFT from scratch would take weeks and still be slower than cuFFT.

**Principle 2: Libraries handle complex algorithms**

.. code-block:: text

    Manual FFT implementation:
    - Cooley-Tukey algorithm
    - Bit-reversal permutation
    - Twiddle factor computation
    - Memory access optimization
    - Edge case handling
    → Weeks of work, still slower than cuFFT
    
    cuFFT:
    - One function call
    - Optimized by NVIDIA engineers
    - Hardware-specific tuning
    → ~0.8 ms for 1024×1024

**Principle 3: Minimize synchronization**

.. code-block:: cpp

    // Bad: Synchronize after every operation
    myKernel1<<<...>>>();
    cudaDeviceSynchronize();  // ❌ Unnecessary wait
    myKernel2<<<...>>>();
    cudaDeviceSynchronize();  // ❌ Unnecessary wait
    
    // Good: Let stream handle ordering
    myKernel1<<<..., stream>>>();
    myKernel2<<<..., stream>>>();  // ✅ Automatically waits for kernel1
    cufftExec(...);                // ✅ Waits for kernel2
    cudaStreamSynchronize(stream);  // ✅ Only sync at end

Real-world applications
~~~~~~~~~~~~~~~~~~~~~~~

This hybrid pattern appears in many domains:

**Image processing:**

.. code-block:: text

    Custom kernel: Preprocessing (crop, normalize)
    cuFFT: Frequency analysis
    Custom kernel: Apply filters
    cuFFT: Inverse transform
    Custom kernel: Postprocessing

**Scientific simulation:**

.. code-block:: text

    cuBLAS: Matrix operations
    Custom kernel: Apply boundary conditions
    cuFFT: Transform to frequency domain
    Custom kernel: Apply physics constraints
    cuFFT: Transform back

**Audio processing:**

.. code-block:: text

    Custom kernel: Window function
    cuFFT: Spectrogram computation
    Custom kernel: Noise reduction
    cuFFT: Inverse transform
    Custom kernel: Normalize output

Summary
~~~~~~~

Tutorial 11 shows how to combine custom kernels with CUDA libraries:

- **Tutorials 1-6**: GPU fundamentals (memory, parallelism, optimization)
- **Tutorial 7**: Large-scale processing (chunking, streams)
- **Tutorials 8-9**: Library usage (cuBLAS for matrix operations)
- **Tutorial 10**: Multi-library pipelines (cuBLAS + cuFFT)
- **Tutorial 11**: Hybrid approach (custom kernels + libraries)

**Key takeaways:**

1. **Mix custom and library code**: Libraries for complex algorithms, custom kernels for simple logic
2. **Share CUDA streams**: Ensures correct execution order between custom and library operations
3. **Keep data on GPU**: All operations run on GPU, no CPU transfers
4. **Don't reinvent the wheel**: Use cuFFT instead of writing FFT from scratch
5. **Custom kernels add flexibility**: Implement domain-specific operations not available in libraries

**When to use hybrid approach:**

- Need standard operations (FFT, matrix multiply) plus custom logic
- Want speed of optimized libraries plus flexibility of custom code
- Processing pipeline requires both simple and complex operations
- Production code where both performance and maintainability matter

This completes the CUDA tutorial series from fundamentals to hybrid pipelines combining custom kernels with optimized libraries.

Next steps
~~~~~~~~~~

- **For GPU optimization and profiling**: See :ref:`gpu-optimization` to learn how to profile your code with NSight, identify bottlenecks, and apply kernel fusion for memory-bound operations
- **For conceptual GPU architecture**: See :ref:`cuda` for fundamentals of GPU hardware and programming model

To-do
^^^^^

- [ ] Benchmark for GPU
- [ ] Random matrix generation, 1024-1024px, 100,000 of them
- [ ] Mixed-precision example with FP16/cuBLAS
- [ ] Multi-GPU