WebGPU
======
.. note::

   This page covers WebGPU, a modern web API for GPU programming in browsers. WebGPU enables high-performance graphics and compute applications that run across different platforms without native code installation.
Resources
---------

- WebGPU API (MDN)
- WebGPU Fundamentals
- WGSL Specification
- Your First WebGPU App (Google)
- Learn WebGPU
- GPU Compute in Chrome
- Particle Life Simulation

Inspirations
^^^^^^^^^^^^

- Softbodies WebGPU Demo
Fundamentals
------------
The problem
^^^^^^^^^^^
What limitations does WebGL have?
WebGL wraps OpenGL ES 2.0 and 3.0, mobile-oriented APIs designed in 2007 and 2012. It lacks compute shaders for general purpose computing, modern memory management for efficient data transfer, and explicit multithreading support for parallel workloads. Canvas 2D provides simple drawing methods like ``fillRect`` and ``stroke`` with built in transforms, but offers no GPU acceleration for custom operations.
Why is it difficult to use OS native GPU APIs for web developers?
Native GPU APIs like DirectX 12 (Windows), Vulkan (cross platform), Metal (Apple), and OpenGL require platform specific code and installation. Developers targeting multiple platforms must maintain separate implementations or use abstraction layers, and each API has different capabilities, syntax, and performance characteristics, making cross platform development difficult.
The solution
^^^^^^^^^^^^
How does WebGPU solve these problems?
WebGPU provides a modern unified API that exposes GPU capabilities across macOS, Windows, Linux, iOS, and Android through the browser, translating web API calls to native GPU interfaces. Developers write GPU code once and run it everywhere without platform specific conditionals. WebGPU brings compute shaders, modern graphics features, and high performance to web applications, matching capabilities previously available only in native code.
Getting started
^^^^^^^^^^^^^^^
How do you access GPU hardware in the browser?
WebGPU requires checking browser support and requesting hardware access asynchronously; modern browsers expose ``navigator.gpu`` if available. The adapter is a bridge between your application and the physical GPU: ``requestAdapter()`` queries available GPUs, while ``requestDevice()`` creates the programming interface for submitting work. You can request specific features or limits, such as texture formats or compute capabilities, when creating the device.
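
A minimal initialization sketch in JavaScript (``navigator.gpu``, ``requestAdapter()``, and ``requestDevice()`` are the entry points defined by the WebGPU specification; the error messages are illustrative):

.. code-block:: javascript

   // Hedged sketch: acquire the adapter and device.
   // In a browser you would pass navigator.gpu as the argument.
   async function initWebGPU(gpu) {
     if (!gpu) {
       throw new Error("WebGPU is not supported in this browser");
     }
     const adapter = await gpu.requestAdapter();
     if (!adapter) {
       throw new Error("No appropriate GPU adapter found");
     }
     const device = await adapter.requestDevice();
     return { adapter, device };
   }

Taking ``gpu`` as a parameter rather than reading ``navigator.gpu`` directly keeps the function usable outside a browser page, for example in tests.
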
How do you configure rendering output?
The GPU needs to know how color data is stored in memory; the texture format specifies how color channels are arranged. Use ``getPreferredCanvasFormat()`` to get the optimal format for your system, then ``configure()`` to connect your device to the canvas context, enabling rendering to the screen.
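
A sketch of the canvas setup, assuming a ``device`` obtained from ``requestDevice()`` (the ``alphaMode`` value shown is one choice among the spec's options):

.. code-block:: javascript

   // Hedged sketch: connect a device to a canvas for on screen rendering.
   // In a browser, gpu is navigator.gpu and canvas is an HTMLCanvasElement.
   function configureCanvas(canvas, device, gpu) {
     const context = canvas.getContext("webgpu");
     const format = gpu.getPreferredCanvasFormat(); // e.g. "bgra8unorm"
     context.configure({
       device: device,
       format: format,
       alphaMode: "opaque",
     });
     return { context, format };
   }
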
Shaders
^^^^^^^
Why do we need shaders?
CPUs process data sequentially while GPUs excel at parallel operations, running thousands of operations simultaneously. Shaders are programs written in WGSL (WebGPU Shading Language) that execute directly on GPU hardware, with thousands of invocations running in parallel on different data elements. The term shader comes from graphics programming, where shaders originally shaded pixels; they now handle physics simulations, mathematical computations, and general purpose parallel processing.
How do vertex and fragment shaders work together?
The GPU processes rendering in two stages. The ``@vertex`` shader runs once per vertex (corner point), transforming 3D positions to 2D screen coordinates. The ``@fragment`` shader runs once per pixel, determining color by calculating lighting, textures, and material properties. Between stages, the GPU rasterizes geometry, automatically interpolating vertex attributes across the surface.
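
A minimal WGSL pair sketching the two stages, drawing a hard coded triangle (the entry point names are arbitrary):

.. code-block:: wgsl

   @vertex
   fn vs_main(@builtin(vertex_index) i: u32) -> @builtin(position) vec4<f32> {
       // Three corner points in clip space; z = 0, w = 1
       var positions = array<vec2<f32>, 3>(
           vec2<f32>(0.0, 0.5),
           vec2<f32>(-0.5, -0.5),
           vec2<f32>(0.5, -0.5)
       );
       return vec4<f32>(positions[i], 0.0, 1.0);
   }

   @fragment
   fn fs_main() -> @location(0) vec4<f32> {
       return vec4<f32>(1.0, 0.0, 0.0, 1.0); // opaque red for every covered pixel
   }
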
How do compute shaders differ?
Compute shaders focus on general purpose parallel computation rather than graphics: they operate on arbitrary data structures like matrices, particle systems, or simulation state without producing pixels directly. For operations like matrix multiplication or physics simulation, shaders use storage buffers for both input and output, and each invocation processes a portion of the computation. Compute shaders are organized into workgroups, enabling efficient data sharing through local memory.
Memory
^^^^^^
Why can't shaders use JavaScript arrays directly?
GPUs have their own memory space (VRAM) optimized for high bandwidth parallel access, while JavaScript arrays live in CPU memory (RAM) with much slower transfer rates. Transferring data from CPU to GPU for every operation creates bottlenecks where the GPU sits idle waiting for data. Buffers are contiguous chunks of GPU memory storing arrays of data like vertex positions, transformation matrices, or computation results; you create them with ``device.createBuffer``, specifying a usage that determines how the hardware optimizes memory layout. You copy data into GPU memory once, and thousands of shader invocations can then access it simultaneously without CPU involvement.
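
A sketch of buffer creation and upload (``GPUBufferUsage`` is a browser global; the numeric fallback values here are the flag values from the WebGPU specification, included only so the snippet runs standalone):

.. code-block:: javascript

   // In a browser GPUBufferUsage is provided; these fallbacks mirror the spec.
   const BufferUsage = globalThis.GPUBufferUsage ??
     { STORAGE: 0x0080, COPY_DST: 0x0008 };

   // Hedged sketch: create a storage buffer and upload a typed array once
   function createStorageBuffer(device, data) {
     const buffer = device.createBuffer({
       size: data.byteLength, // size is in bytes, not element count
       usage: BufferUsage.STORAGE | BufferUsage.COPY_DST,
     });
     // One CPU (RAM) to GPU (VRAM) copy; shaders then read it in place
     device.queue.writeBuffer(buffer, 0, data);
     return buffer;
   }
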
When do you need different buffer types?
Many computations require shaders to both read input and write output: matrix multiplication reads two matrices and writes results, and physics simulations update particle positions. Storage buffers allow shaders to both read and write, enabling compute shader output and intermediate results while supporting arbitrarily sized arrays. Other data is shared by every invocation: all pixels need the camera position, or all particles need the timestep value. Uniform buffers remain constant for all shader invocations, efficiently storing shared data through hardware caching and broadcast mechanisms. Uniform buffers have a 64 KB limit but provide the fastest access for small shared data, while storage buffers support large, dynamic data with read write access.
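
In WGSL the two buffer kinds are declared with different address spaces; a sketch with hypothetical particle simulation bindings:

.. code-block:: wgsl

   // Sketch: hypothetical bindings for a particle simulation
   struct SimParams {
       dt: f32,             // timestep shared by every invocation
       particleCount: u32,
   };

   @group(0) @binding(0) var<uniform> params: SimParams;
   @group(0) @binding(1) var<storage, read> positionsIn: array<vec2<f32>>;
   @group(0) @binding(2) var<storage, read_write> positionsOut: array<vec2<f32>>;
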
Why do we need textures beyond buffers?
Images and spatial data benefit from specialized access patterns: sampling nearby pixels for filtering, accessing data at different resolutions through mipmaps, or wrapping data at boundaries. Textures are multidimensional grids of data, typically representing images; they store color information with specialized hardware for filtered sampling and mipmap generation, and the GPU can interpolate between texture values automatically, providing smooth gradients. Texture views provide different ways to access the same underlying texture data without copying, for example viewing a 3D texture as 2D slices or accessing specific mipmap levels. Samplers tell the GPU how to read texture data, controlling interpolation (linear or nearest) and edge behavior (repeat, clamp, mirror), separating the "how to read" from the actual texture data.
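
A sketch of texture and sampler creation (``GPUTextureUsage`` is a browser global; the numeric fallback values are the spec's flag values, included so the snippet runs standalone):

.. code-block:: javascript

   // In a browser GPUTextureUsage is provided; these fallbacks mirror the spec.
   const TextureUsage = globalThis.GPUTextureUsage ??
     { TEXTURE_BINDING: 0x04, COPY_DST: 0x02 };

   // Hedged sketch: a 2D texture plus a sampler describing how to read it
   function createImageTexture(device, width, height) {
     const texture = device.createTexture({
       size: [width, height],
       format: "rgba8unorm", // 8 bits per channel, read as [0, 1] floats
       usage: TextureUsage.TEXTURE_BINDING | TextureUsage.COPY_DST,
     });
     const sampler = device.createSampler({
       magFilter: "linear",    // interpolate between texels when magnified
       minFilter: "linear",
       addressModeU: "repeat", // wrap coordinates outside [0, 1]
       addressModeV: "repeat",
     });
     return { texture, sampler };
   }
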
Pipelines
^^^^^^^^^
How do you configure the rendering process?
A render pipeline describes the complete rendering process from input vertices to output pixels: you configure which shaders to use, how to interpret vertex data layout, what blend modes to apply, and depth testing behavior. You create the pipeline once during initialization and reuse it for multiple draw calls, avoiding reconfiguration overhead. Compute pipelines are simpler, lacking vertex input, rasterization, and blending stages; they specify a compute shader and resource bindings, organizing parallel computation threads into workgroups.
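
A compute pipeline sketch, assuming the WGSL source has an entry point named ``main`` (``layout: "auto"`` asks WebGPU to infer bind group layouts from the shader):

.. code-block:: javascript

   // Hedged sketch: build a compute pipeline from a WGSL source string
   function makeComputePipeline(device, wgslSource) {
     const module = device.createShaderModule({ code: wgslSource });
     return device.createComputePipeline({
       layout: "auto", // infer bind group layouts from the shader
       compute: { module: module, entryPoint: "main" },
     });
   }
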
Binding resources
^^^^^^^^^^^^^^^^^
How do shaders access data?
Bind groups bundle related resources (buffers, textures, samplers) that shaders need to access; they match the shader's ``@group`` declarations, organizing resources into logical sets. Group 0 might contain per frame data (camera, lights) while group 1 contains per object data (model matrix, material), enabling efficient partial rebinding: you change group 1 for each object without rebinding group 0. ``@group(0)`` indicates the first bind group, while ``@binding(0)`` specifies the first resource slot within that group; each binding corresponds to one buffer, texture, or sampler. JavaScript writes data to GPU buffers using ``device.queue.writeBuffer``, which copies typed array data from CPU memory to GPU memory, then binds the buffers to shaders using bind groups matching the shader's resource declarations.
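
A sketch binding an input and an output buffer to ``@group(0)``, matching ``@binding(0)`` and ``@binding(1)`` declarations in the shader (the function name is illustrative):

.. code-block:: javascript

   // Hedged sketch: bundle two buffers for a shader's @group(0) declarations
   function createIOBindGroup(device, pipeline, inputBuffer, outputBuffer) {
     return device.createBindGroup({
       layout: pipeline.getBindGroupLayout(0), // layout inferred for group 0
       entries: [
         { binding: 0, resource: { buffer: inputBuffer } },
         { binding: 1, resource: { buffer: outputBuffer } },
       ],
     });
   }
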
Workflow
^^^^^^^^
What is the complete workflow?
WebGPU rendering follows these steps:

1. Get the GPU adapter and device (hardware access).
2. Configure the canvas context (output target).
3. Create shaders (vertex and fragment programs).
4. Create buffers for vertex data (geometry in GPU memory).
5. Set up the render pipeline (rendering configuration).
6. Create a command encoder and render pass (record GPU commands).
7. Draw the geometry (execute shaders).
8. Submit commands to the GPU queue (start execution).

Compute workflows simplify this by skipping vertex input and rasterization: create the compute shader, create storage buffers for input and output data, set up the compute pipeline, create a bind group linking buffers to the shader, create a command encoder and compute pass, dispatch workgroups, then submit commands and read results. The pipeline coordinates execution by connecting shaders with resources: you set the pipeline, bind resource groups, and dispatch work, where ``dispatchWorkgroups`` launches thousands of shader threads organized into workgroups, enabling 3D parallelization.
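
The compute steps above can be sketched as one function, assuming the pipeline and bind group were created during initialization (reading results back would additionally require a mappable staging buffer, omitted here):

.. code-block:: javascript

   // Hedged sketch: record and submit one compute pass
   function runCompute(device, pipeline, bindGroup, workgroupCount) {
     const encoder = device.createCommandEncoder();
     const pass = encoder.beginComputePass();
     pass.setPipeline(pipeline);              // which shader configuration to run
     pass.setBindGroup(0, bindGroup);         // which resources the shader sees
     pass.dispatchWorkgroups(workgroupCount); // launch workgroupCount x 1 x 1
     pass.end();
     device.queue.submit([encoder.finish()]); // GPU starts executing here
   }
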
Examples
--------
Random number generation
^^^^^^^^^^^^^^^^^^^^^^^^^
How can you generate random numbers in parallel with GPU?
GPUs generate millions of random numbers simultaneously by running independent generators on each thread. Each thread uses a unique seed based on its thread ID, where ``global_id.y * 65535 * 256 + global_id.x`` converts 2D coordinates to a linear position. The PCG (Permuted Congruential Generator) algorithm produces high quality pseudo random numbers through multiplication, bit shifting, and XOR operations. Each thread adds its index to a base seed, applies the PCG hash, then converts to a float in [0.0, 1.0) by dividing by 2^32. Implementation details from Nathan Reed's blog.
.. code-block:: wgsl

   struct OutputBuffer {
       data: array<f32>
   };

   @group(0) @binding(0) var<storage, read_write> output: OutputBuffer;
   @group(0) @binding(1) var<uniform> params: vec2<u32>; // [totalElements, randomSeed]

   // PCG hash for high quality pseudo random numbers
   fn pcg_hash(input: u32) -> u32 {
       let state = input * 747796405u + 2891336453u;
       let word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
       return (word >> 22u) ^ word;
   }

   fn random_float(seed: u32) -> f32 {
       return f32(pcg_hash(seed)) / 4294967296.0;
   }

   @compute @workgroup_size(256, 1, 1)
   fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
       let index = global_id.y * 65535u * 256u + global_id.x;
       if (index >= params.x) {
           return;
       }
       let seed = params.y + index;
       output.data[index] = random_float(seed);
   }
What do the bit operations mean?
Bit shifting (``>>``) moves bits right, dividing by powers of 2. XOR (``^``) compares bits where differing bits give 1, matching bits give 0. The PCG algorithm uses these to scramble input bits: multiplying by large primes spreads patterns, shifting extracts high entropy bits, and XOR mixes bits together. This ensures nearby inputs like 100 and 101 produce completely different outputs.
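
The same hash can be mirrored in JavaScript to watch these bit operations at work (``Math.imul`` performs 32 bit integer multiplication and ``>>> 0`` keeps values unsigned, matching WGSL's ``u32`` wraparound):

.. code-block:: javascript

   // JavaScript mirror of the WGSL pcg_hash, using unsigned 32 bit arithmetic
   function pcgHash(input) {
     const state = (Math.imul(input, 747796405) + 2891336453) >>> 0;
     const word = Math.imul((state >>> ((state >>> 28) + 4)) ^ state, 277803737) >>> 0;
     return ((word >>> 22) ^ word) >>> 0;
   }

   // Map the 32 bit hash to a float in [0.0, 1.0)
   function randomFloat(seed) {
     return pcgHash(seed) / 4294967296;
   }

Because every step of the hash is invertible, the whole function is a bijection on 32 bit integers: distinct seeds always produce distinct hashes.
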
What happens in the nested operations?
The expression ``((state >> ((state >> 28u) + 4u)) ^ state)`` works inside out: ``state >> 28u`` extracts the highest 4 bits, adding 4 gives a shift amount between 4 and 19, then ``state >> (that amount)`` shifts by a variable amount depending on input, finally ``^ state`` XORs with the original. This creates avalanche effects where changing one input bit affects multiple output bits.
What does the ``u`` suffix mean?
The ``u`` suffix marks unsigned integers (0 to 4,294,967,295) versus signed integers (-2,147,483,648 to 2,147,483,647). PCG requires unsigned integers because bit operations work on the full 32 bit pattern. The suffix tells the compiler to use ``u32`` rather than ``i32`` ensuring consistent behavior across platforms.
How does the main compute function work?
The ``@compute @workgroup_size(256, 1, 1)`` attribute tells the GPU to run this function with 256 threads per workgroup, so each workgroup processes 256 numbers in parallel. The ``global_invocation_id`` gives each thread a unique 3D coordinate (x, y, z) identifying its position in the entire compute grid. The line ``index = global_id.y * 65535u * 256u + global_id.x`` converts the 2D thread position to a linear array index: WebGPU limits each dispatch dimension to 65535 workgroups, so 65535 workgroups times 256 threads per workgroup gives the stride for the y dimension. The bounds check ``if (index >= params.x)`` prevents threads from writing beyond the array size, since GPU dispatches must use whole workgroups: you might launch 256 threads but only need 250 numbers. Each thread generates a unique seed by adding its index to the base seed (``params.y + index``), ensuring different threads produce different random numbers, then calls ``random_float(seed)`` to generate and store the result at its position in the output array.
What is ``vec3<u32>`` and why use unsigned integers for coordinates?
The ``vec3<u32>`` type represents a 3D vector with three unsigned 32 bit integer components (x, y, z); ``vec3`` is a built in WGSL type for storing coordinates or directions. GPUs organize threads into 3D grids because many computations work on volumes, images (2D with batches), or 1D arrays (where y and z are 1). Thread coordinates must be unsigned integers because negative thread positions make no sense: thread 0 is the first thread and you count upward. Using ``u32`` allows up to 4 billion thread indices per dimension, matching GPU hardware limits; dispatches can be very large, like 65535 x 65535 x 65535 workgroups, requiring 32 bit indices. The components ``global_id.x``, ``global_id.y``, ``global_id.z`` provide convenient access to each dimension: for 1D arrays you only use x, for 2D you use x and y, and for 3D you use all three.
What does dispatch mean and why the 65535 limit?
Dispatch means launching a compute shader on the GPU: you specify how many workgroups to run, organized in a 3D grid. The three numbers in ``dispatchWorkgroups(100, 1, 1)`` are workgroup counts along the x, y, and z axes; this example launches 100 workgroups along x and 1 along each of y and z, for a total of 100 workgroups arranged in a line. Think of it like organizing workers: ``dispatchWorkgroups(10, 5, 1)`` creates a 10x5 grid of 50 workgroups, like rows and columns of workers, while ``dispatchWorkgroups(4, 4, 4)`` creates a 4x4x4 cube of 64 workgroups for 3D problems. WebGPU limits each axis to 65535 workgroups because this fits in 16 bits, allowing hardware to use compact counters. Typical dispatches use 10 to 1000 workgroups per axis: processing a 1920x1080 image with a 16x16 workgroup (256 threads) needs 120x68 workgroups, and a million element array with 256 threads per workgroup needs about 3,907 workgroups along x. Very large 1D problems can exceed the per axis limit, however: 100 million elements need about 390,625 workgroups, more than 65535, which is why the example shader spreads the dispatch across x and y and reconstructs a linear index. The maximum 65535 x 65535 x 65535 dispatch would create over 281 trillion workgroups, a theoretical hardware capability rather than practical usage.
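
The dispatch arithmetic can be sketched as a small helper (a hypothetical function, assuming a 1D problem with 256 threads per workgroup along x and the same x/y split as the shader's index formula):

.. code-block:: javascript

   // Hedged sketch: choose dispatchWorkgroups arguments for a 1D problem,
   // spilling into y when the 65535 per axis limit is exceeded
   function dispatchSize(elementCount, workgroupSize, maxPerAxis = 65535) {
     const total = Math.ceil(elementCount / workgroupSize); // whole workgroups only
     if (total <= maxPerAxis) {
       return [total, 1, 1];
     }
     return [maxPerAxis, Math.ceil(total / maxPerAxis), 1];
   }

With ``workgroupSize`` 256 and ``maxPerAxis`` 65535, a thread's linear position is recovered in the shader exactly as in the example: ``global_id.y * 65535u * 256u + global_id.x``.
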
Is it beneficial to have many workgroups dispatched?
Yes, GPUs achieve high performance through massive oversubscription where you dispatch far more workgroups than physical cores. A typical MacBook Pro M2 has about 10 GPU cores, but you should dispatch thousands to millions of workgroups. For example, processing a 1920x1080 image with 256 threads per workgroup needs 8x5 = 40 workgroups (8,192 threads total), while processing 10 million array elements needs about 40,000 workgroups (10,240,000 threads). The GPU rapidly switches between workgroups when some threads wait for memory, keeping compute units busy. This is called latency hiding where one workgroup waits for memory while another computes. Aim for at least 100-1000 workgroups to keep the GPU fully utilized where more workgroups give the scheduler flexibility to hide memory latency and maximize throughput.