GPU Wiki
Note
The content below is collected from various sources, including official documentation, large language models, and published literature. Literature sources are cited where available. Many topics in GPU computing and 4D-STEM continue to evolve rapidly, so some information may require updates as new hardware, software, and research emerge.
Quick Reference: Finding Answers by Task
- “My GPU code is slow”
→ Start: How do I benchmark my GPU performance for 4D-STEM? → Check: Why is my GPU code slower than expected? → Analyze: How do I truly maximize GPU power for ptychography with 4D-STEM data?
- “Out of memory error”
→ Start: What does “CUDA out of memory” error mean and how do I fix it? → Solutions: What if my 4D-STEM data doesn’t fit in GPU memory? → Strategies: How do I handle 4D-STEM data larger than GPU memory? → Advanced: memory-capacity-ptychography
- “Choosing a GPU for purchase”
→ Compare: What NVIDIA GPUs exist? → Requirements: gpu-specs-ptychography → Decide: When should I use single-GPU vs multi-GPU for ptychography? → Budget: What are purchasing priorities?
- “Installing and getting started”
→ Check: How do I check if my system has a compatible GPU? → Install: How do I install CuPy and test GPU functionality? → Choose: How do I choose between CuPy, PyTorch, and Numba for my workflow? → Learn: What is a kernel?
- “Optimizing existing GPU code”
→ Profile: What profiling tools help optimize 4D-STEM on GPU? → Common issues: What are common GPU mistakes? → Workflow: What is the recommended workflow for developing GPU code? → Validate: How do I validate GPU results against CPU?
- “Data transfer bottlenecks”
→ Disk: disk-io-challenge → Network: What detector speeds and data volumes are typical? → CPU-GPU: Why is CPU-to-GPU data transfer so slow? → Solutions: How do I minimize GPU↔CPU transfers in analysis pipeline?
- “Understanding GPU fundamentals”
→ Basics: Why are GPUs faster than CPUs for 4D-STEM? → Cores: How do GPU cores differ from CPU cores? → Memory: What is the memory hierarchy in modern GPUs? → Programming: Why do GPUs use kernels instead of regular functions?
Fundamentals
Data representation
- How do computers represent data at the hardware level?
Bit = 0 or 1, smallest data unit. Byte = 8 bits (e.g., `01001000`), representing 256 values. Bytes store integers (0-255), characters, or parts of multi-byte data. Eight bits became standard because 256 combinations efficiently cover alphabets, digits, and symbols.
- How are bits stored physically?
DRAM uses capacitor + transistor per bit. Charged capacitor = 1, uncharged = 0. Charge leaks naturally (milliseconds), requiring constant refresh. SRAM uses 6 transistors in flip-flop circuit. No capacitors, no refresh needed. Faster (1 ns vs 100 ns) but higher cost and lower density. Used for CPU cache.
- What is a clock cycle?
A clock cycle is the fundamental unit of time in computer processors: every CPU and GPU has an internal clock that ticks at a specific frequency, like a metronome keeping time. Modern GPUs run at about 2 GHz, meaning 2 billion cycles per second, so each cycle lasts 0.5 nanoseconds. When we say a memory access takes “30 cycles,” we mean 30 clock ticks, or about 15 nanoseconds at 2 GHz; register access at 1 cycle means 0.5 ns, while system RAM at 500 cycles means 250 ns. The GPU clock crystal is smaller than a grain of rice, doesn’t wear out, and vibrates roughly a hundred million times per second.
- Where does the clock cycle originate from?
The clock cycle originates from a physical crystal oscillator on the GPU circuit board: a tiny quartz crystal vibrates at a precise frequency when electricity passes through it due to the piezoelectric effect. The crystal vibrates at a base frequency such as 100 MHz, and the GPU’s phase-locked loop (PLL) circuit multiplies this to the target frequency such as 2 GHz, a 20 times multiplication. The crystal provides a stable timing reference accurate to within 10 to 100 parts per million, so a 100 MHz crystal drifts by only about 1 to 10 kHz, making it reliable for synchronization.
- Which property enables the quartz crystal to function as an oscillator?
The piezoelectric effect enables quartz crystals to function as oscillators where applying an electric field causes mechanical deformation and vice versa. When voltage is applied, the crystal vibrates at its natural resonant frequency determined by its cut and size. This vibration generates an alternating voltage that sustains oscillation. It then creates billions of pulses per second that serve as the timing signal for the GPU clock.
- Who controls the clock speed?
GPU manufacturers design the base clock frequency determined by transistor switching speed where smaller transistors at 4 nm can switch faster than 28 nm transistors enabling higher frequencies. Thermal and power limits constrain the maximum frequency where at 2.5 GHz the GPU consumes 450W generating heat requiring cooling while at 3 GHz it would exceed 600W causing thermal throttling. The GPU driver dynamically adjusts clock speed based on workload and temperature where during light tasks it runs at 300 MHz to save power while during gaming or computation it boosts to 2,500 MHz. Users can manually overclock by adjusting voltage and multiplier through tools like MSI Afterburner where increasing from 2.0 GHz to 2.2 GHz provides 10% more performance but requires better cooling and draws more power at risk of instability.
- Why measure performance in cycles rather than nanoseconds?
Cycles matter more than absolute time because different processors run at different speeds where a GPU at 2 GHz completes operations twice as fast as one at 1 GHz. By measuring in cycles rather than nanoseconds, we can compare hardware independently of clock speed. A memory access taking 300 cycles will take 300 cycles whether your GPU runs at 1.5 GHz or 2.5 GHz, though the actual time differs.
- Why is SRAM fast?
Three reasons: (1) No refresh overhead, (2) Simple access—flipping transistor states takes 1 ns, (3) Physical proximity—SRAM sits on CPU package millimeters from cores. DRAM is centimeters away on separate chips. At 30 cm/ns signal speed, distance matters. Cost: 6 transistors per bit makes SRAM 50-100x higher cost per byte.
- Why does cache hierarchy matter for GPU programming?
L1: 32-64 KB, 1 ns, closest to each core. L2: 256 KB-1 MB, 10 ns, per core. L3: 8-64 MB, 20 ns, shared across cores. RAM: 128-512 GB, 100 ns, separate chips. Larger cache needs more space, increasing distance and latency. Programs with 90% L1 hit rate average 10.9 ns/access. 50% hit rate: 50.5 ns—5x slower.
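As a quick check of those averages, expected access time is a weighted sum of hit and miss latencies; a minimal sketch using the approximate figures above (L1 ~1 ns, RAM ~100 ns):

```python
# Average memory access time = hit_rate * hit_latency + (1 - hit_rate) * miss_latency
def average_access_ns(l1_hit_rate, l1_ns=1.0, ram_ns=100.0):
    return l1_hit_rate * l1_ns + (1.0 - l1_hit_rate) * ram_ns

print(average_access_ns(0.90))  # 10.9 ns
print(average_access_ns(0.50))  # 50.5 ns, ~5x slower
```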
- How are numbers stored?
Integers use 4 bytes (`int32`): 32 bits = 2³² values from -2.1B to +2.1B. Floats use scientific notation: sign (1 bit) × mantissa × 2^exponent. `float32`: 8 exponent bits (range 10⁻³⁸ to 10³⁸) + 23 mantissa bits (7 decimal digits precision). `float64`: 11 + 52 bits = 15 digits, 10⁻³⁰⁸ to 10³⁰⁸ range.
- What problem does float32 solve compared to float64?
Float64 needs about 4 times more silicon area per core because it requires 64-bit datapaths, wider mantissa adders, and larger exponent logic; building a GPU with the same throughput in float64 would require 4 times more die area, translating to roughly 4 times higher cost. Late-2000s compute GPUs attempted float64-heavy designs: the NVIDIA Tesla C1060 from 2008 had 240 cores, cost about $1,200, and delivered 78 GFLOPS of float64 performance, proving too expensive for mainstream adoption. The market needed $200 to $400 cards for gaming to drive volume while still supporting scientific computing.
- How does float32 enable mass GPU production?
Float32 enables mass production by offering a different trade-off: the RTX 4090 packs 16,384 float32 cores delivering 83 TFLOPS alongside only 256 float64 cores delivering 1.3 TFLOPS in the same $1,600 package. This 64 times asymmetry maximizes gaming performance, which dominates the market, while keeping float64 available for critical needs. Data center GPUs like the A100 use a more balanced design, delivering 19.5 TFLOPS float32 versus 9.7 TFLOPS float64 (a 2:1 ratio) better suited for scientific workloads.
- When does precision matter for different applications?
Float32 provides about 7 decimal digits, sufficient for graphics, neural networks, and most physics simulations, while float64 provides about 15 digits needed for iterative algorithms and ill-conditioned problems. For 4D-STEM workflows, float32 works well for FFTs, convolutions, and single-pass operations; float64 becomes necessary mainly for phase retrieval with more than 100 iterations, where errors can accumulate from 1e-5 to 1e-3 in float32 and cause visible artifacts. Float64 is also required for solving systems like Ax=b with condition numbers exceeding 1e7. In practice, roughly 95% of GPU computations use float32 for speed and efficiency, while only about 5% use float64 when precision becomes critical.
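A minimal NumPy sketch (illustrative, not from the original text) showing how rounding error accumulates faster in float32 than float64 during long sequential accumulation:

```python
import numpy as np

# Ten million sequential additions of 0.1 (cumsum accumulates left to right,
# so rounding errors build up rather than being hidden by pairwise summation).
values32 = np.full(10_000_000, 0.1, dtype=np.float32)
values64 = np.full(10_000_000, 0.1, dtype=np.float64)

total32 = np.cumsum(values32)[-1]
total64 = np.cumsum(values64)[-1]

print(total32)  # drifts noticeably from the exact 1,000,000 in float32
print(total64)  # essentially exact in float64
```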
- Why does GPU hardware have separate cores for float32 vs float64?
Float32 and float64 require physically distinct hardware because they cannot share the same transistors where circuit complexity differs substantially. A float64 multiplier uses approximately 4 times more transistors than float32 multiplier due to 64-bit datapath versus 32-bit, wider mantissa adders, and larger exponent logic translating to 4 times more silicon area consuming roughly 2 times more power. Each CUDA core is either float32 or float64 not both where RTX 4090 design includes 16,384 float32 cores and 256 float64 cores as separate physical units processing different operations.
- How are float32 and float64 cores designed differently?
These are not switching modes between data types but completely different circuits. Think of a kitchen with 16 small ovens that bake cookies quickly and 1 large oven for slow-roasting turkeys: you cannot convert the small ovens to large ones because they have different physical construction. This design choice is market-driven: gaming requires massive float32 throughput for processing pixels and textures but rarely needs float64, while scientific computing requires float64 but represents a smaller market segment. NVIDIA maximizes die area for the dominant use case; data center GPUs like the A100 allocate proportionally more area to float64 (roughly a 2:1 float32-to-float64 throughput ratio versus 64:1 in consumer GPUs) because scientific workloads justify the additional cost.
- If float32 is preferred, why include float64 cores at all?
GPUs include float64 cores despite their rare usage because the scientific computing market requires this precision for weather simulation, molecular dynamics, and financial modeling where while this represents small market at approximately $2 billion per year it offers high margins. Competitive positioning also matters since AMD and Intel offer float64 support meaning NVIDIA must match these capabilities or risk losing research labs as customers.
- When does float64 become necessary for accuracy?
Error accumulation becomes problematic in iterative algorithms running more than 100 iterations where ptychography with 200 iterations sees float32 errors grow from 1e-5 to 1e-3 causing visible artifacts while float64 maintains 1e-13 precision throughout. Ill-conditioned matrices such as solving Ax=b with condition numbers exceeding 1e7 require float64 precision which is common in inverse problems. Validation workflows also benefit from occasionally running float32 production code with float64 to verify results where cost trade-off is minimal since 256 float64 cores represent only 1.5% of die area compared to 75% for float32 cores.
- Does each GPU core process only one data type?
Yes, each core is type-specific and cannot switch between data types where a float32 core has 32-bit registers, 32-bit ALU arithmetic logic unit, and 32-bit data paths unable to process float64 without completely different circuit. Each CUDA core is fixed-function unit where GPU runtime dispatches float32 operations to float32 cores and float64 operations to float64 cores. If you run float64 code on RTX 4090, only the 256 float64 cores become active while 16,384 float32 cores sit idle wasting silicon.
- Can code use mixed-precision computation?
Code can use mixed-precision computation by switching types between operations; for example, `output_fp32 = float32(compute_fp64(input))` uses a float64 core for the computation, then converts and stores the result as float32, enabling precision where needed and speed elsewhere. Modern GPUs also include tensor cores as a third type optimized for matrix multiplication with mixed precision, such as FP16 input with FP32 accumulation, again as separate physical units: the RTX 4090 contains 16,384 CUDA cores for float32, 256 FP64 cores, and 512 tensor cores, representing 3 distinct hardware types.
- How do I calculate memory requirements for 4D-STEM datasets?
256×256 scan × 128×128 detector × 4 bytes (float32) = 4.3 GB raw data. Adding workspace (FFT buffers, intermediate results) keeps the total well within an RTX 4090 (24 GB). Larger datasets: 512×512 scan × 256×256 detector ≈ 69 GB—requires A100/H100 (80 GB) with batch processing, or a multi-GPU setup. Memory scales with scan area × detector area: doubling the scan dimensions (or the detector dimensions) quadruples it.
- What if my 4D-STEM data doesn’t fit in GPU memory?
Five strategies help when datasets exceed VRAM capacity. Batch processing loads a subset of the data, such as 128×256 positions from a 512×512 scan, analyzes it, then loads the next batch; this is slower due to transfers but works on any GPU. Chunked loading uses HDF5 or Zarr to load chunks directly from disk and process each sequentially (see the sketch after the links below). Gradient accumulation works well for iterative methods like ptychography by computing on small batches and accumulating updates over multiple passes.
See also: How do I handle 4D-STEM data larger than GPU memory?, How do I truly maximize GPU power for ptychography with 4D-STEM data?
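A minimal sketch of the chunked-loading strategy. The file name "experiment.h5", the dataset name "datacube", its (scan_y, scan_x, det_y, det_x) layout, and the per-chunk virtual-image computation are all illustrative assumptions, not from the original text:

```python
import cupy as cp
import h5py
import numpy as np

# Process a 4D-STEM datacube in row-chunks that fit in GPU memory.
with h5py.File("experiment.h5", "r") as f:       # hypothetical file name
    datacube = f["datacube"]                      # shape: (scan_y, scan_x, det_y, det_x)
    scan_y, scan_x, det_y, det_x = datacube.shape
    virtual_image = np.zeros((scan_y, scan_x), dtype=np.float32)

    rows_per_chunk = 32                           # tune so a chunk fits in VRAM
    for y0 in range(0, scan_y, rows_per_chunk):
        y1 = min(y0 + rows_per_chunk, scan_y)
        chunk = cp.asarray(datacube[y0:y1])       # disk -> host -> GPU
        # Example per-chunk analysis: sum each pattern (a virtual bright-field-style image)
        virtual_image[y0:y1] = cp.asnumpy(chunk.sum(axis=(2, 3)))
```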
- What are multi-GPU and mixed precision strategies?
Multi-GPU data parallelism splits the scan across GPUs where each processes a different region, requiring NVLink for efficient boundary communication. Mixed precision uses float16 for storage and float32 for computation, halving memory usage with minimal accuracy loss. For example, a 256 GB dataset on a 24 GB GPU can be split into 11 chunks taking approximately 20 seconds each for a total of 220 seconds; while loading everything at once is impossible, chunked processing works fine.
See also: “When should I use single-GPU vs multi-GPU for ptychography?”, “What interconnect types exist for GPUs?”, “Multi-GPU processing” section.
- How do typical 4D-STEM dataset sizes scale?
Small datasets for teaching and quick tests use a 64×64 scan with a 128×128 detector, totaling about 256 MB (float32), fitting in any GPU. Medium datasets representing typical experiments use a 256×256 scan with a 128×128 detector for about 4.3 GB, easily handled by consumer GPUs like the RTX 4090 with 24 GB. Large high-resolution datasets use a 512×512 scan with a 256×256 detector for about 69 GB, requiring data-center GPUs, batch processing, or multi-GPU systems. Very large datasets from synchrotron facilities or automated acquisition use 1024×1024 scans with 512×512 detectors for about 1.1 TB, necessitating distributed computing.
- How do acquisition times scale with dataset size?
Acquisition times scale correspondingly: small datasets take about 1 minute, medium datasets about 30 minutes, large datasets around 4 hours, and very large datasets 24 hours or more. Memory requirements grow with the square of the scan and detector dimensions: doubling the scan dimensions (or the detector dimensions) quadruples the memory needed.
- How do I estimate 4D-STEM data size?
Formula: scan_x × scan_y × detector_x × detector_y × bytes_per_pixel. Data types: uint8 (1 byte), uint16 (2 bytes), float32 (4 bytes), float64 (8 bytes). Example: 512×512 scan × 256×256 detector × 2 bytes (uint16) = 34.4 GB. Converting to float32 doubles this to 68.7 GB. Rule of thumb: (scan pixels × detector pixels × 4 bytes) ÷ 10⁹ ≈ GB for float32. Disk storage is smaller: HDF5 gzip (2-5× compression), Zarr blosc (3-10× for sparse data). The GPU needs the uncompressed size in VRAM.
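A small helper implementing the formula above (a sketch; the function name is illustrative):

```python
import numpy as np

def dataset_size_gb(scan_x, scan_y, det_x, det_y, dtype=np.float32):
    """Uncompressed 4D-STEM dataset size in GB (decimal, 1e9 bytes)."""
    bytes_per_pixel = np.dtype(dtype).itemsize
    return scan_x * scan_y * det_x * det_y * bytes_per_pixel / 1e9

print(dataset_size_gb(512, 512, 256, 256, np.uint16))   # ~34.4 GB
print(dataset_size_gb(512, 512, 256, 256, np.float32))  # ~68.7 GB
```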
- What dataset sizes work in web browsers?
JavaScript ArrayBuffer limited ~2 GB (V8 engine). Maximum practical: 128×128 scan × 128×128 detector = 1 GB (float32) with workspace margin. Larger datasets require server-side GPU processing with web visualization—server streams results (virtual images, reconstructions) to browser for real-time display. Cannot perform on-device ptychography for typical datasets (>10 GB), but web clients can visualize remote computation progress.
Memory Architecture and Management
Memory hierarchy and hardware
- Why is RAM faster than SSDs but slower than cache?
RAM uses direct electrical sensing with capacitor charge detection, taking about 100 nanoseconds per access. This is much faster than SSDs, which use NAND flash requiring quantum tunneling: SSDs force electrons through oxide barriers at 15-20 V followed by threshold voltage measurement, taking 10-100 microseconds, or 100 to 1,000 times slower than RAM. Hard drives add mechanical movement with 5 ms seek time and 4 ms rotation for a total of 9 ms or 9,000,000 ns, making them roughly 90,000 times slower than RAM.
However, RAM is still slower than CPU cache, which uses SRAM at 1-20 ns, because DRAM needs refresh cycles and sits farther from the CPU cores. The storage hierarchy proceeds from cache at 1-20 ns to RAM at 100 ns to SSD at 10-100 μs to HDD at 5-10 ms. Each step down trades speed for capacity and cost.
- What determines bandwidth?
Bandwidth = bus width × clock speed × DDR multiplier. DDR5 RAM: 64-bit bus × 4,800 MHz = 8 bytes/cycle × 4.8B cycles/sec = 38.4 GB/s per channel [JEDEC DDR5 spec]. Dual channel: 76.8 GB/s. GPUs use wider buses: HBM3 has 1024-bit bus achieving 3,000 GB/s—40× faster than system RAM [JEDEC HBM3 spec].
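Plugging in the numbers, a quick sanity-check sketch (the GPU-side figures are illustrative aggregates, since data-center GPUs combine several 1024-bit HBM stacks):

```python
# Bandwidth = bus width (bytes) x transfer rate (transfers per second)
ddr5_channel = (64 / 8) * 4.8e9     # 8 bytes x 4.8 GT/s = 38.4 GB/s per channel
dual_channel = 2 * ddr5_channel     # 76.8 GB/s

# A GPU combines a much wider bus with faster signaling, e.g. a 5120-bit
# HBM interface (five 1024-bit stacks) at ~5 GT/s gives roughly 3 TB/s.
hbm_total = (5120 / 8) * 5e9        # ~3,200 GB/s

print(f"{ddr5_channel/1e9:.1f}, {dual_channel/1e9:.1f}, {hbm_total/1e9:.0f} GB/s")
```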
- Why do GPUs need separate VRAM?
Bandwidth demand from thousands of cores. RTX 4090 has 16,384 cores; each needs data. Theoretical demand: 16,384 cores × 4 bytes × 2 GHz = 131 TB/s. Actual: 1 TB/s. Still 10-30× more than CPU. Distance matters: VRAM sits millimeters from GPU. System RAM centimeters away. At 30 cm/ns, centimeters = nanoseconds.
- How is GPU memory organized?
GPU memory has several distinct layers. Global memory, also called VRAM, provides 24-80 GB at approximately 1 TB/s and holds complete datasets. Shared memory offers roughly 100 KB per block at approximately 10 TB/s and functions as a manually managed scratchpad. Registers provide about 256 KB per SM at approximately 20 TB/s and are private to each thread. PCIe transfers between CPU and GPU operate at 32 GB/s for PCIe 4.0 or 64 GB/s for PCIe 5.0, roughly 30 times slower than VRAM. This speed difference makes it important to keep data on the GPU throughout the analysis pipeline.
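A minimal CuPy sketch of the "keep data on the GPU" principle (the operations are illustrative, not from the original text):

```python
import cupy as cp
import numpy as np

data = np.random.rand(4096, 4096).astype(np.float32)

# Transfer once over PCIe, then chain operations in VRAM.
gpu = cp.asarray(data)                      # host -> device (slow: ~32-64 GB/s PCIe)
spectrum = cp.fft.fft2(gpu)                 # stays on GPU (~1 TB/s VRAM bandwidth)
power = cp.abs(spectrum) ** 2               # still on GPU
result = cp.asnumpy(power.sum(axis=0))      # device -> host only for the final result

# Anti-pattern: calling cp.asnumpy() after every step forces a PCIe round trip each time.
```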
- Why does memory bandwidth matter for GPU performance?
GPUs read 128-byte chunks. When 32 threads (a warp) access consecutive addresses, one transaction loads all of them (coalesced). Scattered addresses need 32 transactions—32x slower. Example: `array[threadId]` is coalesced; `array[threadId * 1000]` is uncoalesced. Always access sequentially when possible.
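A hedged Numba sketch contrasting the two access patterns (kernel names are illustrative):

```python
from numba import cuda

@cuda.jit
def coalesced_copy(src, dst):
    i = cuda.grid(1)
    if i < src.size:
        dst[i] = src[i]        # warp neighbors touch consecutive addresses: coalesced

@cuda.jit
def strided_copy(src, dst):
    i = cuda.grid(1)
    j = i * 1000               # warp neighbors touch addresses 1000 elements apart
    if j < src.size:
        dst[j] = src[j]        # scattered: up to 32 transactions per warp
```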
- What is the memory hierarchy in modern GPUs?
Modern GPUs organize memory in multiple levels to balance speed and capacity. Each streaming multiprocessor has a 256 KB register file accessible in roughly 1 cycle, backed by a 128 KB L1 cache with about 30 cycles latency. All SMs share a 40 MB L2 cache at around 200 cycles, which connects to 80 GB of HBM memory at roughly 300 cycles with 2000 GB/s bandwidth. System RAM adds another level at 500 cycles with only 50 GB/s PCIe bandwidth.
```mermaid
flowchart TD
    A["SM Register File<br/>256 KB per SM<br/>1 cycle latency"] --> B["L1 Cache<br/>128 KB per SM<br/>30 cycles"]
    B --> C["L2 Cache<br/>40 MB shared<br/>200 cycles"]
    C --> D["HBM Memory<br/>80 GB<br/>300 cycles<br/>2000 GB/s"]
    D --> E["System RAM<br/>Variable size<br/>500 cycles<br/>50 GB/s via PCIe"]
```
Note
Additional memory topics covered elsewhere:
Memory management strategies → See “Memory management” section
Out-of-memory solutions → See “How do I handle 4D-STEM data larger than GPU memory?”
Transfer optimization → See “Data movement and bottlenecks” section
Multi-GPU memory → See “Multi-GPU processing” section
GPU architecture
- Why are GPUs faster than CPUs for 4D-STEM?
GPU = processor designed for parallel computation. CPUs: 8-24 large cores (~10 mm² each) handling complex sequential tasks. GPUs: 16,384 simple cores (~0.01 mm² each) executing identical operations simultaneously. 1000× more cores fit in same chip space by sacrificing per-core complexity for massive parallelism.
- How do GPU cores differ from CPU cores?
CPU cores: independent, handle varied instructions, complex logic (branch prediction, out-of-order execution). GPU cores: groups of 32 (warp) execute identical instruction. Simple arithmetic units, minimal control. CPU excels at complex sequential tasks. GPU excels at repetitive parallel operations—perfect for processing thousands of diffraction patterns identically.
- Why are warps fundamental to GPU execution?
Prerequisites: See What is a kernel? to understand GPU threads first.
A group of 32 threads executes the same instruction simultaneously on NVIDIA GPUs [CUDA Programming Guide], while AMD calls this a wavefront and uses 64 threads [AMD GPU architecture]. The warp serves as the scheduling unit where GPU hardware issues one instruction to 32 threads at once. If threads diverge through if/else branches, the hardware serializes execution with some threads sitting idle while others execute their path. Think of 32 students in a classroom following the teacher’s instructions simultaneously where they all complete step 1 together, then step 2 together, but if some students take path A and others take path B, everyone waits while path A finishes, then waits again while path B finishes, creating inefficiency.
- How are GPU cores organized into SMs?
Prerequisites: See Why are warps fundamental to GPU execution? to understand the warp concept first.
GPUs organize thousands of CUDA cores into Streaming Multiprocessors (SMs) where each SM has approximately 128 cores and executes warps. RTX 4090: 16,384 cores organized as 128 SMs with 128 cores per SM [RTX 4090 architecture]. Each SM processes one warp of 32 threads at a time. With 128 cores per SM, four warps totaling 128 threads can use all 128 cores simultaneously. The SM is the physical hardware unit containing cores, shared memory, and registers. Warps are the scheduling abstraction. One SM switches between multiple warps to hide memory latency: when one warp waits for data, the SM executes another warp.
- Why organize cores into SMs instead of having one big pool?
Memory hierarchy: Each SM has fast shared memory (approximately 100 KB) and registers (approximately 256 KB) physically close to its cores. If all 16,384 cores shared one pool, memory would be over 16 MB and far from most cores, creating latency disaster. Control logic sharing: SMs share expensive control hardware (decoder, scheduler, scoreboard) across 128 cores. If every core had full control logic, chip would be 5 times larger, costing $8,000 versus $1,600. Latency hiding: While warp A waits for RAM (approximately 400 cycles), SM switches to warp B. With 8 to 16 warps per SM, chip stays busy.
- What are the chip area and scaling benefits of SM organization?
SMs add approximately 30% overhead but enable scaling to 16,384 cores. Without SMs, every core needs full control logic (approximately 100,000 transistors each). RTX 4090: 609 mm². If each of 16,384 cores had full control, chip would be approximately 1,500 mm², exceeding manufacturing limit of approximately 800 mm² and costing over $20,000.
- How do SMs enable better scaling than CPUs?
CPUs without hierarchical organization max at 64 to 128 cores (AMD EPYC). GPUs with SM hierarchy reach 16,384 cores on similar chip area. Early GPUs (2000s) without mature SM design achieved only 200 to 500 cores. Think: factory with 128 small assembly lines (SMs), each with 128 workers, versus one 16,384-worker line with impractical coordination. CUDA runtime distributes thread blocks across SMs automatically.
- What is “control logic” in GPU hardware?
Control logic manages instruction execution, like a factory supervisor. Four components: instruction decoder translates binary machine code to control signals (binary `5C 60...` becomes an “FADD” signal to arithmetic units), scheduler decides which instructions execute when and tracks which warps have data versus waiting, scoreboard tracks dependencies and prevents using data before it’s ready, and branch unit handles if/else divergence.
- Why do GPUs share control logic across cores instead of giving each core its own?
Cost savings: each decoder costs approximately 10,000 transistors. If every core had its own: 16,384 cores times 10,000 equals 164 million transistors just for decoding. SM-level sharing: one decoder per SM broadcasts to 32 threads in warp simultaneously, saving 31 times transistors. Total control logic per SM: approximately 100,000 transistors at approximately 0.1 mm². Full replication: 1.6 billion transistors equaling 2% of H100 chip that could be compute cores instead. Sharing works because of SIMD: Single Instruction Multiple Data means all 32 threads in warp execute identical instruction on different data, so decode once and broadcast to 32 cores.
- Why do GPUs organize work into threads, blocks, and grids?
Three-level hierarchy solves coordination at different scales. Thread: individual work unit processing one data point (lightweight, registers only). Block: groups of 256 to 1,024 threads sharing fast 100 KB scratchpad memory enabling data reuse (threads share probe data for overlapping scan positions). Grid: collection of independent blocks dispatched across SMs with automatic scheduling. Example: 256×256 scan launches 256 blocks with 256 threads each. Proper blocking can enable substantial speedup (often 10 to 100x) from shared memory versus global memory access.
- What is thread organization?
A thread represents a single execution of a kernel. A block groups 256 to 1,024 threads sharing fast memory. A grid contains all blocks for one kernel launch.
For example, a 256×256 scan processing 65,536 patterns launches 256 blocks with 256 threads each. Each thread processes one pattern by loading data, computing center of mass, and writing results. The GPU schedules work across SMs in warps of 32.
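A short sketch of the launch arithmetic (kernel name and arguments are placeholders):

```python
# Grid/block arithmetic for a 256x256 scan, one thread per diffraction pattern.
n_patterns = 256 * 256                    # 65,536 patterns
threads_per_block = 256
blocks = (n_patterns + threads_per_block - 1) // threads_per_block

print(blocks, threads_per_block)          # 256 blocks x 256 threads = 65,536 threads
# A launch would then look like: kernel[blocks, threads_per_block](args...)
```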
- How many transistors fit in 100 mm² chip area?
Modern chips using 4 nm process in 2024 pack approximately 250 million transistors per mm² totaling 25 billion transistors in 100 mm². Example chips demonstrate this density where NVIDIA H100 has 80 billion transistors in 814 mm² at 98 million per mm², Apple M3 has 25 billion in 147 mm² at 170 million per mm², and RTX 4090 has 76 billion transistors in 609 mm² at 125 million per mm².
- Why does transistor density matter?
Density matters because each transistor equals one switch where more transistors enable more compute cores, larger cache, and more control logic. The H100’s 80 billion transistors include 16,896 CUDA cores plus memory controllers, interconnects, and tensor cores.
- Why does chip manufacturing require photolithography?
Direct patterning proves impossible because you cannot physically draw or etch billions of nanometer-scale features where human hand steadies at best to approximately 100 microns or 100,000 nm while manufacturing needs 4 nm precision which is 25,000 times finer. Mechanical tools fail below 1 micron due to vibration, thermal expansion, and tool wear.
- What did historical manufacturing attempts teach us?
Historical attempts from the 1960s to 1970s include electron beam lithography, which writes patterns directly with a focused electron beam achieving precise results but operating serially, addressing one feature at a time; even at 1 microsecond per feature, writing 80 billion transistor positions would take roughly a day per chip (and far longer at realistic write rates), making it unusable for mass production. Contact printing stamps a rubber mask onto the wafer but damages the mask after about 10 uses and achieves only 100 nm resolution at best, too coarse for modern 4 nm.
- How does photolithography solve manufacturing challenges?
Photolithography provides solution by using light to project entire layer patterns simultaneously exposing billions of features in parallel where key advantages include parallel exposure completing entire wafer in 50 seconds versus years for electron beam, reusable masks where glass photomasks last over 1,000 wafers, and scalable resolution where shorter wavelength light creates finer features with progression running from UV at 365 nm to DUV at 193 nm to EUV at 13.5 nm achieving 4 nm features. Modern necessity arises from scale where H100 has 80 billion transistors in 814 mm² totaling 98 million per mm² with only photolithography achieving this density at production scale.
- How are chips manufactured using photolithography?
Chip manufacturing involves seven process steps, starting with ultra-pure silicon crystal sliced into 300 mm diameter wafers and polished to less than 0.5 nm roughness, where each wafer produces 40 to 50 GPU chips. Layering deposits thin films including 20 nm oxide, 5 nm metal, and 10 nm doping, stacking 15 to 30 layers, with the RTX 4090 using approximately 20 metal layers. Photolithography coats the wafer with photoresist and projects circuit patterns using EUV light at 13.5 nm wavelength, repeating over 50 times. Etching removes material with reactive plasma, creating 3D trenches. Doping implants ions like boron or phosphorus to alter conductivity. Metallization fills trenches with copper and polishes flat. Testing includes probe testing each chip, dicing the wafer, and packaging.
- What does “4 nm process node” actually mean?
The term “4 nm” serves as a marketing label, not an actual transistor size; real dimensions at the 4 nm node are roughly 20 nm physical gate length with gate pitches of several tens of nanometers. Historical progression shows the pace of advancement: industry moved from 180 nm in 1999 to 90 nm in 2004 to 45 nm in 2008 to 22 nm in 2012 to 7 nm in 2018 to 4 nm in 2022, with each halving doubling density following Moore’s Law.
- Why is chip node advancement slowing down?
Progress now slows as physical limits emerge where transition from 4 nm to 3 nm took 2 years while 3 nm to 2 nm takes 3 years. Quantum effects below 10 nm cause electron tunneling with solutions including 3D FinFET with vertical fins 50 nm high, high-k metal gates using hafnium oxide, and multi-patterning.
- What are the economics of chip fabrication?
Costs scale dramatically: 4 nm wafer costs $15,000 to $20,000 compared to $5,000 for 28 nm, mask set costs $5 to 10 million, and fab construction costs $10 to 20 billion per factory.
- What key technologies enable modern chip fabrication?
EUV lithography from ASML ($150 million machines) uses 13.5 nm light to enable 4 nm features, atomic layer deposition adds material one atom at a time, low-k dielectrics reduce crosstalk, and strained silicon makes electrons move 20% faster. Industry concentration: ASML dominates lithography, TSMC and Samsung handle fabrication, NVIDIA and AMD focus on design. Modern fabrication patterns features thousands of times narrower than a human hair.
Getting started with GPU computing
- Why use GPUs for 4D-STEM?
4D-STEM generates massive datasets requiring identical operations on thousands of diffraction patterns, where GPUs excel. Typical speedups for key operations: FFT (often 50-100x), convolution (typically 20-50x), pattern matching (commonly 30-80x), virtual imaging (frequently 100-500x). A 512×512 scan with a 256×256 detector produces about 17 billion data points. A CPU processes them sequentially (hours); a GPU processes thousands simultaneously (minutes).
- What speedup do GPUs provide for ptychography?
Ptychography iterates 100+ times over entire dataset. Reported GPU implementations can reduce 24 hour reconstructions to 30 minutes. Memory bandwidth: GPU HBM delivers approximately 2000 GB/s versus CPU RAM at approximately 80 GB/s (roughly 25x faster data access).
- How do I check if my system has a compatible GPU?
Run `nvidia-smi` (Linux/Windows) to show GPU model, CUDA version, memory, driver version, temperature, and utilization. On Mac: `system_profiler SPDisplaysDataType | grep Chipset` (CUDA unavailable on modern Macs; M-series uses Metal). For AMD: `rocm-smi` (Linux only).
- What should I look for in GPU detection output?
No output means no GPU, incompatible GPU, or driver not installed. Look for: NVIDIA GPU models (GeForce, RTX, Quadro, Tesla), CUDA version 12.0 or higher, memory 8 GB or more for small datasets and 24 GB or more for typical 4D-STEM.
- How do I install CuPy and test GPU functionality?
Install via pip matching your CUDA version: `pip install cupy-cuda12x` (CUDA 12.x) or `pip install cupy-cuda11x` (CUDA 11.x). The wrong version gives a “CUDA driver version insufficient” error. Verify installation:

```python
import cupy as cp
import numpy as np
import time

# Check GPU availability
print(f"GPU count: {cp.cuda.runtime.getDeviceCount()}")
print(f"GPU name: {cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
print(f"GPU memory: {cp.cuda.Device(0).mem_info[1] / 1e9:.1f} GB")

# Simple performance test
size = 10000
cpu_array = np.random.randn(size, size).astype(np.float32)
gpu_array = cp.asarray(cpu_array)

# CPU timing
start = time.time()
cpu_result = np.fft.fft2(cpu_array)
cpu_time = time.time() - start

# GPU warm-up (first call includes FFT plan creation)
cp.fft.fft2(gpu_array)
cp.cuda.Stream.null.synchronize()

# GPU timing
start = time.time()
gpu_result = cp.fft.fft2(gpu_array)
cp.cuda.Stream.null.synchronize()  # Wait for GPU to finish
gpu_time = time.time() - start

print(f"CPU: {cpu_time:.3f}s, GPU: {gpu_time:.3f}s, Speedup: {cpu_time/gpu_time:.1f}x")
# Expected: 20-100x speedup for FFT
```
Should show GPU info and 20-100× FFT speedup. If slower, check: CUDA installed correctly, GPU not throttling (temperature <85°C), sufficient PCIe bandwidth.
See also: How do I choose between CuPy, PyTorch, and Numba for my workflow? for selecting the right GPU framework.
- How do I choose between CuPy, PyTorch, and Numba for my workflow?
Choosing between GPU frameworks depends on your specific needs. For deep learning work, PyTorch provides the necessary infrastructure. For existing NumPy code wanting minimal changes, CuPy offers the easiest migration path. For custom algorithms with loops or conditionals, Numba enables GPU kernel development in Python. For maximum performance with willingness to write C++, CUDA C provides full control.
Prerequisites: See How do I install CuPy and test GPU functionality? to install and verify GPU functionality first.
What are the key differences between these GPU frameworks?
| Framework | Best For | Platforms | Learning Curve | When to Use |
|---|---|---|---|---|
| CuPy | NumPy replacement | NVIDIA only | Easy (if know NumPy) | FFT, array ops, virtual imaging (80% of needs) |
| PyTorch | Deep learning | NVIDIA, AMD, Mac | Medium | Neural networks, autodiff, Mac users, cross-platform |
| Numba | Custom kernels | NVIDIA, CPU | Hard (GPU) / Easy (CPU) | Custom logic, loops, conditionals not in CuPy |
| CUDA C | Maximum performance | NVIDIA only | Very hard | Squeeze last 10% performance, write libraries |
Detailed comparison:
CuPy provides drop-in NumPy replacement with typical 50-100× speedup for array operations through cuFFT and cuBLAS underneath, though it only supports NVIDIA and lacks autodiff. PyTorch offers cross-platform support for NVIDIA, AMD, and Mac with autodiff for optimization and deep learning ecosystem, though with more verbose syntax than NumPy and approximately 10% extra overhead. Numba lets you write GPU kernels in Python with the same code for CPU and GPU, working well for custom logic though debugging can be difficult and some NumPy features remain unsupported.
Recommendation for 4D-STEM workflows: Start with CuPy (covers 80% of needs), add PyTorch for machine learning, and add Numba for custom kernels only if needed.
Understanding GPU kernels
- Why do GPUs use kernels instead of regular functions?
CPU calls regular functions through stack: push args, jump to address, execute, return. This works for 1 to 8 cores but fails for 16,384 GPU cores where stack for all threads would need gigabytes.
- What is the function call overhead problem?
Traditional function calls require saving registers (approximately 20 instructions overhead), managing call stack, and handling return addresses. At 16,384 cores times 1,000 function calls equals 16 million stack operations creating massive slowdown. Early GPU programming (2000s) had no function calls so developers wrote entire computation inline, leading to code duplication and programs limited to approximately 500 lines.
- How do kernels solve the function call problem?
Kernels are special functions with four features: executes across ALL threads simultaneously launching 10,000 copies instantly, no call stack where kernel runs to completion without nesting, explicit thread ID parameter like
`cuda.grid(1)` so the thread knows which data to process, and optimized launch overhead with one kernel dispatch instruction. Performance: kernel launch overhead is about 5 microseconds to launch 1 million threads; dispatching the same million work items as individual calls at even a few microseconds each would take seconds, roughly a million times slower.
- What is a kernel?
A kernel is a function executing in parallel across thousands of GPU threads. Each thread runs identical code on different data through SIMD. Write code for ONE thread and GPU launches over 10,000 copies. Simple example doubling array values:
```python
# CPU version (sequential)
for i in range(1000000):
    output[i] = input[i] * 2

# GPU kernel (parallel) - Numba syntax
from numba import cuda

@cuda.jit
def double_kernel(input, output):
    i = cuda.grid(1)  # Get unique thread ID
    if i < len(input):
        output[i] = input[i] * 2

# Launch 1,000 blocks of 1,000 threads = 1,000,000 threads in parallel
double_kernel[1000, 1000](input_gpu, output_gpu)
```
Each of the 1,000,000 threads computes its own `i`, reads `input[i]`, and writes `output[i]` simultaneously. The CPU takes 10 ms sequentially while the GPU takes 0.1 ms in parallel, running 100 times faster. For 4D-STEM, replace `* 2` with a center-of-mass calculation where `i` indexes patterns, not array elements.
Hardware specifications
Memory technologies
- What memory types exist?
GDDR6/6X: 800-1,000 GB/s, consumer GPUs. Optimized for throughput. HBM2e/HBM3: 1,500-3,000 GB/s, data center GPUs. Stacked vertically on package, wider buses (1024-bit vs 512-bit), shorter distance. ECC (data center only): detects/corrects bit flips, critical for multi-day computations.
- What is the memory bandwidth wall problem?
GPU compute performance doubled every 18 months following Moore’s Law but memory bandwidth only improved 20% per year, leaving GPUs increasingly starved for data. NVIDIA Kepler (2012): 3 TFLOPS compute but only 288 GB per second bandwidth, causing GPU to sit idle 80% of the time waiting for data.
- Why can’t traditional GDDR memory solve the bandwidth problem?
Physical limitations constrain traditional GDDR memory. PCB traces connecting GPU to memory are 5 to 10 cm long, limiting bus width to 512 bit maximum before signal integrity fails from crosstalk and timing skew. Even at 8 GHz speed, 512 bit bus equals 512 GB per second theoretical max. Training GPT-3 scale models requires 1.5 TB per second bandwidth, but GDDR physically cannot deliver because traces are too long and buses are too narrow.
- What HBM alternatives failed before TSV technology?
Failed alternatives in the 2000s included external HBM modules which proved too expensive, wider GDDR buses where signal integrity collapsed beyond 512 bit, and faster GDDR clocks which hit power wall at 8 GHz consuming 10W per chip.
- How is HBM manufactured using TSV technology?
Manufacturing uses Through-Silicon Via (TSV) where vertical holes drilled through silicon chips get filled with copper. The process stacks 4 to 12 DRAM dies vertically, connects with over 10,000 TSVs per chip, and mounts on interposer substrate beside GPU.
- Who manufactures HBM and what market challenges exist?
Three vendors dominate: SK Hynix (50-60% market share) supplies NVIDIA H100/A100 and AMD MI300, Samsung (30-40%) supplies own products and some AMD, and Micron (5-10%) as emerging player.
- What is ASML’s role in HBM manufacturing?
ASML provides EUV lithography machines ($150 million each) to pattern DRAM chips. TSV drilling uses different equipment from Applied Materials and Lam Research. ASML is the printing press for circuit patterns while TSV equipment is the drill press for vertical connections.
- What performance does HBM achieve compared to GDDR?
Vertical stacking eliminates long traces, reducing distance from millimeters to micrometers. This makes 1024 bit bus possible compared to 512 bit max for planar, achieving 3,000 GB per second which is 10 times faster than GDDR6.
GPU products
- What NVIDIA GPUs exist?
Consumer GeForce RTX includes the 4090 with 24GB at 83 TFLOPS for $1,600 and the 4080 with 16GB at 49 TFLOPS for $1,200. These suit development and medium datasets. Data center options include the H100 with 80GB HBM3 at 30 TFLOPS FP64, 3,000 GB/s for $25,000 to $30,000 and the A100 with 40 to 80GB HBM2e at 9.7 to 19.5 TFLOPS FP64, 1,555 to 2,000 GB/s for $10,000 to $20,000. These provide full NVLink, ECC, and 24/7 operation. Workstation choice is the RTX 6000 Ada with 48GB for $6,000 as a professional workstation option.
GPU Product Comparison Table
| Model | Memory | Bandwidth | FP32 TFLOPS | FP64 TFLOPS | Price | Best For |
|---|---|---|---|---|---|---|
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 83 | 1.3 | $1,600 | Development, medium datasets, best value for FP32 |
| RTX 4080 | 16 GB GDDR6X | 720 GB/s | 49 | 0.8 | $1,200 | Budget builds, smaller datasets, single-user |
| RTX 6000 Ada | 48 GB GDDR6 | 960 GB/s | 91 | 1.4 | $6,000 | Professional workstations, large memory needs |
| A100 (40GB) | 40 GB HBM2e | 1,555 GB/s | 19.5 | 9.7 | $10,000 | Balanced FP64, NVLink, ECC, multi-GPU |
| H100 (80GB) | 80 GB HBM3 | 3,000 GB/s | 51 | 30 | $30,000 | Maximum performance, real-time acquisition |
| AMD MI300X | 192 GB HBM3 | 5,300 GB/s | 163 | 81 | ~$10,000 | Extreme memory capacity, alternative to NVIDIA |
| Intel Max 1550 | 128 GB HBM2e | 3,200 GB/s | 22 | 22 | $8,000 | Large memory, balanced FP32/FP64, emerging |
See also: gpu-specs-ptychography for detailed selection criteria, When should I use single-GPU vs multi-GPU for ptychography? for scaling considerations.
NVIDIA software
Programming and compilation
Compilation fundamentals
- Why do we need compilers at all?
Computers only understand binary machine code (1s and 0s), but humans cannot write or maintain programs in binary. Early computers (1940s-1950s) used punch cards, taking weeks for simple calculations. Assembly language provided the first abstraction with symbolic names like `mov` and `add`, but still required managing every memory address manually, so a simple loop took 20+ instructions. High-level languages (FORTRAN 1957, C 1972, Python 1991) let humans write readable code: `for i in range(1000): result += data[i]` replaces 50+ assembly instructions. Compilers bridge the gap, translating human code to efficient machine code, applying optimizations humans cannot track (register allocation, instruction reordering), and managing low-level details. Without compilers, modern software would be impossible.
- Why do compilers need multiple stages instead of directly translating code?
The direct translation problem becomes clear when you try converting `x = 5 + 3` directly to machine code in one step, which faces impossible challenges: machine code has no concept of variables, only memory addresses; it has no operators like `+`, only specific CPU instructions like `ADD`; it requires knowing exact register allocation; and, worse, every target including x86, ARM, and GPU needs completely different output, requiring separate translation logic for each. Early compilers in the 1950s tried this approach by writing a custom translator for each language-hardware pair; the result proved unmaintainable, buggy, and unable to optimize. Modern compilers solve this by breaking the work into specialized stages, each solving one specific problem.
- What is interpreted vs compiled vs JIT?
Interpreted (Python): compiles to bytecode at runtime, executes line-by-line, slow (10-100× slower) but flexible. Compiled (C++): converts to machine code before execution, fast but requires recompile for changes. JIT (Numba, PyTorch): compiles hotspots at first run, caches machine code for later runs—combines Python flexibility with compiled speed.
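A small Numba sketch of JIT behavior on the CPU (timings are indicative only):

```python
import time
import numpy as np
import numba

@numba.njit
def total(data):
    s = 0.0
    for x in data:       # explicit loop: slow in plain Python, fast once JIT-compiled
        s += x
    return s

data = np.random.rand(10_000_000)

t0 = time.time(); total(data); t1 = time.time()   # first call: compiles to machine code
t2 = time.time(); total(data); t3 = time.time()   # later calls: reuse cached machine code

print(f"first call (includes compile): {t1 - t0:.3f}s, second call: {t3 - t2:.4f}s")
```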
- Why do we need LLVM, PTX, SASS, and nvcc? How do they work together?
Four layers enable code portability. Layer 1 LLVM converts any language (Python, C++, Rust) to universal intermediate representation. Layer 2 PTX provides portable NVIDIA GPU assembly working across all generations (RTX 3090 code runs on RTX 4090 unmodified). Layer 3 SASS is final GPU-specific machine code optimized for exact architecture. Layer 4 nvcc orchestrates compilation splitting CUDA code into CPU and GPU parts, producing single executable with embedded PTX. Benefits: write once run on all NVIDIA GPUs, driver updates improve performance without recompiling, language-independent GPU support.
- What problem does LLVM solve for modern programming languages?
Before LLVM: N languages × M platforms = N×M compilers (hundreds). LLVM provides universal intermediate representation (IR) separating language frontend from hardware backend. Result: N+M instead of N×M. New language (Swift) instantly works on all platforms (x86, ARM, GPU). New hardware (RISC-V) instantly supports all LLVM languages.
- What are PTX and SASS?
PTX (Parallel Thread Execution) is portable NVIDIA GPU assembly that works across all GPU generations. Same PTX binary runs on RTX 4090, A100, and H100 unmodified. SASS is GPU-specific machine code optimized for exact architecture. NVIDIA driver JIT-compiles PTX to SASS at runtime, enabling portability and forward compatibility.
- What is nvcc?
NVIDIA CUDA Compiler that splits mixed CPU/GPU code into two compilation paths. Host code goes to g++/clang, device code becomes PTX. Links everything into single executable with embedded PTX. Usage:
`nvcc kernel.cu -o program`.
- What does machine code look like?
Binary instructions hardware executes. CPU example:
`add eax, ebx` becomes `01 D8` (2 bytes hex). GPU example: `FADD R0, R0, R1` becomes `5C 60 00 00 00 10 00 F0` (8 bytes). Tools: `objdump -d` for CPU, `cuobjdump` for GPU.
- What is the relationship between machine code and assembly?
Assembly provides human-readable mnemonics (`add`, `mov`) that map one-to-one with binary machine code. Example: `add rax, rbx` becomes binary `48 01 D8`. Humans write assembly because it’s readable; hardware executes binary because it’s efficient. Assemblers convert assembly to binary, disassemblers reverse it. Tools like Nsight Compute show SASS as assembly for readability.
- What problem does Numba solve?
Some algorithms need custom logic including nested loops, conditionals, and complex state management that NumPy or CuPy cannot express efficiently. Example: iterative phase retrieval with adaptive convergence checking requires loops with if/else branches. CuPy forces awkward vectorization, hand-written CUDA C requires C++ expertise taking days to write.
- How does Numba work?
Numba compiles Python directly to GPU code [Numba documentation]. Add
`@numba.cuda.jit` decorator to a function. Numba analyzes types and generates GPU instructions via LLVM. The first call takes about 5 seconds for compilation; later calls are instant from cached results. Write familiar Python with loops and if/else to get GPU speed.
- What are Numba’s trade-offs?
Simpler than CUDA C with Python syntax but slower than expert hand-written code. Perfect for prototyping custom algorithms. Anaconda Inc. created it as open-source with NVIDIA sponsorship to drive Python GPU adoption.
- How does Numba compare to vectorized code?
Consider center-of-mass calculation with adaptive thresholding. Vectorized approach using CuPy requires multiple array allocations, boolean masks, and scatter operations. Numba allows direct loop with threshold logic where each thread processes one diffraction pattern accumulating sum_x, sum_y, and sum_intensity directly. Code is clearer and avoids intermediate arrays.
- What does Numba code look like?
```python
import numba
from numba import cuda
import numpy as np

@cuda.jit
def center_of_mass_kernel(patterns, thresholds, output_x, output_y):
    idx = cuda.grid(1)
    if idx < patterns.shape[0]:
        pattern = patterns[idx]
        threshold = thresholds[idx]
        sum_x = 0.0
        sum_y = 0.0
        sum_intensity = 0.0
        for i in range(pattern.shape[0]):
            for j in range(pattern.shape[1]):
                intensity = pattern[i, j]
                if intensity > threshold:
                    sum_x += j * intensity
                    sum_y += i * intensity
                    sum_intensity += intensity
        if sum_intensity > 0:
            output_x[idx] = sum_x / sum_intensity
            output_y[idx] = sum_y / sum_intensity
```
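A hedged usage sketch for launching this kernel; the synthetic data, array sizes, and block size are illustrative assumptions:

```python
import numpy as np
from numba import cuda

# Continuing from the kernel above: one thread per diffraction pattern.
n_patterns, det_y, det_x = 1024, 128, 128
patterns = np.random.rand(n_patterns, det_y, det_x).astype(np.float32)
thresholds = np.full(n_patterns, 0.5, dtype=np.float32)

d_patterns = cuda.to_device(patterns)
d_thresholds = cuda.to_device(thresholds)
d_out_x = cuda.device_array(n_patterns, dtype=np.float32)
d_out_y = cuda.device_array(n_patterns, dtype=np.float32)

threads = 128                                   # threads per block
blocks = (n_patterns + threads - 1) // threads  # enough blocks to cover all patterns
center_of_mass_kernel[blocks, threads](d_patterns, d_thresholds, d_out_x, d_out_y)

com_x, com_y = d_out_x.copy_to_host(), d_out_y.copy_to_host()
```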
GPU programming
GPU platforms and ecosystems
- What problem does CUDA solve?
In 2006, GPUs existed for graphics but scientists couldn’t use them for computation. Graphics APIs like OpenGL and DirectX forced awkward workarounds requiring you to represent data as fake textures and computation as pixel shaders [Accelerator: using data parallelism to program GPUs for general-purpose uses]. NVIDIA created CUDA as first platform treating GPU as general-purpose parallel processor [CUDA Toolkit documentation], where you write C-like code, access memory directly, launch thousands of threads, and use optimized libraries like cuFFT, cuBLAS, and cuDNN. CUDA’s 15 year head start means NVIDIA GPUs dominate research with 70 to 80% market share. CUDA is standard for 4D-STEM because py4DSTEM, LiberTEM, CuPy, and PyTorch all use CUDA libraries where cuFFT remains fastest FFT implementation. Limitation: NVIDIA-only. Mac users post-2019 cannot use CUDA and must use PyTorch with Metal or remote NVIDIA server.
- What about OpenCL and WebGPU?
OpenCL (2009) supports all vendors (NVIDIA, AMD, Intel, Apple) [Khronos OpenCL] but typically runs 10 to 30% slower than CUDA on NVIDIA hardware and lacks mature libraries. Use when multi-vendor support required or avoiding NVIDIA lock-in matters. WebGPU (2021-2024) provides browser-based GPU access [W3C WebGPU spec] for interactive demos and teaching tools but generally runs 2 to 5 times slower with approximately 2 GB memory limit. Use when web deployment needed and datasets under 1 GB. For production 4D-STEM, CUDA remains best choice.
- What is Metal and why do platform-specific APIs exist?
GPU vendors design fundamentally different hardware. NVIDIA uses CUDA cores, AMD uses compute units, and Apple designs custom GPUs in M series chips. Each architecture has distinct memory layouts, threading models, and specialized features that cannot be abstracted without performance loss.
- Why did early universal GPU standards fail?
Early universal standards like OpenGL and OpenCL (2000s) sacrificed 20 to 40% performance compared to native APIs. Platform holders created optimized native APIs: Metal (Apple, macOS/iOS), DirectX 12 (Microsoft, Windows/Xbox), Vulkan (cross-platform but needs vendor tuning), and CUDA (NVIDIA only).
- Why does Metal not work on Windows or Linux?
Apple controls both hardware (M series GPU) and OS (macOS/iOS), enabling tight integration impossible with vendor-neutral APIs. Windows and NVIDIA systems lack Apple’s custom GPU hardware. CUDA dominates scientific computing because NVIDIA manufactures most datacenter GPUs and maintains mature libraries. Metal works for Mac-only applications but lacks scientific computing ecosystem.
- Can I use CUDA libraries on MacBooks with GPU?
No, CUDA requires NVIDIA hardware. MacBooks with M1/M2/M3/M4 have Apple GPUs that only support Metal. You cannot use cuFFT, cuBLAS, CuPy, or any CUDA-specific libraries on Mac.
- Can I use GPU acceleration on MacBook for 4D-STEM?
Use PyTorch or JAX with Metal backend, not CuPy. PyTorch MPS:
`device = torch.device("mps")` provides `torch.fft.fft2()` and `torch.matmul()` on the Apple GPU. JAX: experimental Metal backend. TensorFlow: `tensorflow-metal`. Works: FFT, matrix ops, convolutions, deep learning. Doesn’t work: CuPy (NVIDIA-only), py4DSTEM/LiberTEM GPU modes, custom CUDA kernels. Performance: M-series typically gives 2 to 10 times CPU speedup, generally slower than an NVIDIA discrete GPU but useful for medium datasets. Code changes require converting NumPy arrays to PyTorch tensors; for example, `np.fft.fft2(array)` becomes `torch.fft.fft2(torch.from_numpy(array).to("mps"))`.
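A short sketch with a device fallback (assumes a recent PyTorch; FFT support on the MPS backend depends on the PyTorch version, and the array size is illustrative):

```python
import numpy as np
import torch

# Use the Apple GPU (Metal) when available, otherwise fall back to CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

pattern = np.random.rand(2048, 2048).astype(np.float32)
tensor = torch.from_numpy(pattern).to(device)   # NumPy -> torch tensor on the chosen device

spectrum = torch.fft.fft2(tensor)               # runs via Metal on MPS (version permitting)
power = torch.abs(spectrum) ** 2
result = power.cpu().numpy()                    # back to NumPy on the host
```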
Python GPU libraries
- What problem does CuPy solve?
The problem shows scientists have years of NumPy code including FFT, array operations, and linear algebra running on CPU where rewriting everything in CUDA C would take months and require C++ expertise most researchers lack, creating need for GPU acceleration without abandoning existing codebase. The solution shows CuPy provides NumPy-compatible GPU library where you replace
import numpy as npwithimport cupy as cpand that’s it, with arrays automatically storing on GPU and operations executing on GPU, such asnp.fft.fft2(array)becomingcp.fft.fft2(array_gpu)with same syntax often running 50 to 100 times faster with minimal code changes giving substantial speedup. Why it works shows CuPy translates familiar NumPy calls to highly optimized NVIDIA libraries underneath including cuFFT and cuBLAS where you write Python and get CUDA performance, providing easiest path to GPU acceleration for array-based scientific computing.- Who develops CuPy and what does it use underneath?
- Who develops CuPy and what does it use underneath?
Developed by Preferred Networks, a Japanese AI company, CuPy is open-source with an MIT license, not created by NVIDIA but dependent on NVIDIA libraries. Under the hood, CuPy dispatches to multiple libraries: cuFFT for FFT operations (`cp.fft.fft2()` calls cuFFT), cuBLAS for matrix math including `cp.dot()` and `cp.matmul()`, cuRAND for random numbers, cuSPARSE for sparse matrices, Thrust for sorting and reduction, and cuDNN for deep learning if installed. This mirrors how NumPy calls Intel MKL as optimized CPU code: CuPy calls NVIDIA libraries as optimized GPU code. You write high-level NumPy-like code while the library handles low-level optimization. This is why CuPy is fast: it builds on over 10 years of NVIDIA engineering, not custom implementations.
- What matrix operations does CuPy accelerate with cuBLAS?
cuBLAS handles linear algebra operations. Example:
```python
import cupy as cp

# Create matrices on GPU
A = cp.random.randn(1000, 1000, dtype=cp.float32)
B = cp.random.randn(1000, 1000, dtype=cp.float32)

# Matrix multiply - calls cuBLAS under the hood
C = cp.matmul(A, B)  # or A @ B
# cuBLAS achieves ~10-20 TFLOPS on RTX 4090

# Decompositions and solvers (dispatch to cuSOLVER/cuBLAS):
inverse = cp.linalg.inv(A)               # Matrix inversion
eigenvals = cp.linalg.eigvalsh(A @ A.T)  # Eigenvalues (CuPy supports symmetric/Hermitian only)
Q, R = cp.linalg.qr(A)                   # QR decomposition
U, S, V = cp.linalg.svd(A)               # SVD decomposition

# Solve linear system Ax = b
b = cp.random.randn(1000, 1, dtype=cp.float32)
x = cp.linalg.solve(A, b)
```
- What problem does PyTorch solve for scientific computing?
PyTorch provides GPU arrays with automatic differentiation (autodiff) for optimization problems including minimizing loss functions, parameter fitting, and deep learning where CuPy handles pure array operations but lacks autodiff. PyTorch works cross-platform on NVIDIA with CUDA, Mac with Metal, and AMD with ROCm. For detailed comparison, PyTorch optimization techniques, and when to use PyTorch vs CuPy, see GPU optimization.
- Why does PyTorch work on Mac but CuPy doesn’t?
CuPy limitation shows hard-coded to call CUDA libraries including cuFFT and cuBLAS, working only on NVIDIA GPUs where Mac M-series uses Apple silicon with Metal API, not CUDA, and CuPy has no Metal backend failing immediately on Mac. PyTorch flexibility shows abstract backend layer that detects hardware at runtime and dispatches to appropriate library where NVIDIA GPU uses CUDA and cuFFT, Mac GPU uses Metal Performance Shaders, and AMD GPU uses ROCm, with same PyTorch code using different backends. Performance on Mac shows
torch.fft.fft2()calls Metal FFT, not cuFFT, where this typically runs 5 to 15 times faster than CPU but generally 2 to 5 times slower than NVIDIA due to less mature optimization since Metal is newer than CUDA. Bottom line shows Mac users must use PyTorch or JAX with Metal while NVIDIA users can choose where CuPy is simpler for array ops and PyTorch is better for deep learning and cross-platform.- What libraries help GPU computing?
cuFFT provides NVIDIA FFT library typically running 10 to 100 times faster than CPU. cuBLAS offers optimized matrix operations underlying PyTorch and TensorFlow. cuDNN includes deep learning primitives for convolution and pooling. NCCL handles multi-GPU communication for all-reduce and broadcast. Numba provides Python JIT to CUDA. PyTorch and TensorFlow offer automatic differentiation and GPU memory management. JAX provides NumPy-like API with autodiff and GPU support.
- When should I write custom kernels?
Write custom kernels when first, CuPy can't express the algorithm due to complex logic and conditionals; second, a fusion opportunity exists where you combine operations and eliminate intermediate arrays, such as (data * mask).sum() / mask.sum() creating 3 temporaries while a custom kernel does one pass, saving 3 memory round-trips and potentially running 2 to 5 times faster (see the sketch below); and third, profiling shows the operation is a bottleneck. The trade-off: custom kernels require CUDA C++, memory hierarchy knowledge, and parallel debugging.
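As a sketch of the fusion idea, the CuPy ReductionKernel below computes the masked sum in a single pass instead of materializing data * mask; the array sizes and kernel name are illustrative.

```python
import cupy as cp

# Fuse the multiply and the sum of (data * mask) into one pass over memory.
masked_sum = cp.ReductionKernel(
    'T data, T mask',   # inputs
    'T out',            # output
    'data * mask',      # map expression, applied per element
    'a + b',            # reduce expression, pairwise sum
    'out = a',          # post-reduction assignment
    '0',                # identity element
    'masked_sum'        # kernel name
)

data = cp.random.rand(4096, 4096, dtype=cp.float32)
mask = (cp.random.rand(4096, 4096, dtype=cp.float32) > 0.5).astype(cp.float32)

mean_masked = masked_sum(data, mask) / mask.sum()   # one fused pass plus one reduction
reference = (data * mask).sum() / mask.sum()        # creates temporaries, more memory traffic
assert cp.allclose(mean_masked, reference)
```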
- Why use Numba when PyTorch and CuPy exist?
Numba allows JIT compiling pure Python to GPU, good for custom algorithms without CUDA C. CuPy provides NumPy on GPU as easiest migration for array code. PyTorch offers deep learning with autodiff, not for general scientific computing. Decision tree shows standard operations including FFT and matrix multiply use CuPy or PyTorch calling optimized NVIDIA libraries. Custom algorithm that is simple uses Numba with Python syntax and GPU speed. Custom algorithm that is performance-critical uses CUDA C for maximum control. Deep learning uses PyTorch for ecosystem. Numba fills gap between “too complex for CuPy vectorization” and “not worth CUDA C investment.”
- When should I NOT use Numba?
Avoid Numba in three situations. First, standard operations including FFT, matrix multiply, and convolutions should use cuFFT, cuBLAS, and cuDNN via CuPy or PyTorch; these NVIDIA-optimized libraries, built with over 10 engineer-years of effort, typically run 2 to 10 times faster than custom code. Second, deep learning should use PyTorch or TensorFlow providing autodiff, optimizers, data loading, and model zoo; Numba has none of these. Third, production code needing maximum speed often gets 2 to 5 times improvement from CUDA C via fine control including shared memory tiling and warp intrinsics.
- What are the disadvantages of using Numba for GPU programming?
Six main limitations exist: first, debugging difficulty means you cannot use pdb or print inside GPU kernels easily where errors show cryptic LLVM messages instead of clear Python tracebacks and you must debug by commenting out code sections; second, limited NumPy support shows only subset of NumPy functions work in kernels with no fancy indexing, boolean masks, or advanced operations requiring manual loops.
- What performance challenges does Numba face?
Third, performance unpredictability shows JIT compilation timing varies from 5 to 10 seconds first run to less than 1 second later where type inference sometimes fails causing fallback to slow Python; fourth, no library ecosystem means cannot call cuFFT or cuBLAS directly from Numba kernels requiring return to Python, use CuPy, then call Numba again.
- When is Numba acceptable despite its limitations?
Fifth, manual memory management requires explicitly allocating device arrays, copying data, and freeing memory, which CuPy handles automatically; and sixth, Numba is slower than hand-optimized CUDA: it generates good code but misses optimizations including bank conflict avoidance, warp shuffle intrinsics, and tensor cores, running typically 2 to 5 times slower than expert CUDA C for complex kernels. Numba is acceptable for prototyping custom algorithms where you value simplicity over a 20% speed loss, educational use teaching GPU concepts without C++, and bridging gaps CuPy cannot express including complex conditionals and state machines.
- Who develops Numba and what is its relationship to NVIDIA?
Numba is developed by Anaconda, Inc., formerly Continuum Analytics. NVIDIA sponsors because Numba drives CUDA adoption in Python community, though Anaconda owns codebase and controls direction. Similar to CuPy developed by Preferred Networks as Japanese AI company and PyTorch developed by Meta, these are not NVIDIA products but depend on CUDA libraries. NVIDIA provides CUDA Toolkit including compiler, libraries, and drivers as foundation while third parties build Python interfaces including Numba for JIT kernels, CuPy for NumPy API, and PyTorch for deep learning. Think of NVIDIA as road builder providing CUDA infrastructure where Anaconda and others serve as car manufacturers creating Python tools driving on those roads. NVIDIA benefits from ecosystem growth where more users buy GPUs, so sponsors but doesn’t control.
- What is the relationship between Numba and CUDA/PTX?
Numba compiles Python kernels to PTX (GPU assembly) via LLVM backend, then CUDA driver converts PTX to SASS (machine code). CuPy wraps pre-compiled NVIDIA libraries (cuFFT, cuBLAS) already in SASS. Different paths: Numba = JIT compiler for custom code, CuPy = library wrapper for standard operations. Both run on GPU but compile independently.
- How do I combine Numba and CuPy in practice?
CuPy arrays pass directly to Numba kernels sharing same GPU memory. Use CuPy for standard operations (FFT, matrix multiply calling cuFFT/cuBLAS), Numba for custom logic (conditionals, loops). No CPU-GPU transfers between operations. Pattern: CuPy FFT, Numba custom kernel, CuPy inverse FFT. Example:
```python
import cupy as cp
from numba import cuda

data = cp.random.rand(256, 256, dtype=cp.float32)   # example input
data_fft = cp.fft.fft2(data)                         # CuPy calls cuFFT
result = cp.empty_like(data_fft)

@cuda.jit
def custom_filter(data, output, threshold):
    i, j = cuda.grid(2)
    if i < data.shape[0] and j < data.shape[1]:
        magnitude = abs(data[i, j])
        output[i, j] = data[i, j] * (1.5 if magnitude > threshold else 0.5)

threads = (16, 16)
blocks = ((data_fft.shape[0] + 15) // 16, (data_fft.shape[1] + 15) // 16)
custom_filter[blocks, threads](data_fft, result, 0.5)   # Numba custom kernel on CuPy arrays
final = cp.fft.ifft2(result)                             # Back to CuPy
```
- Does Numba translate Python to CUDA C code?
No, Numba never generates CUDA C.
The actual process flows through five stages. First comes the Python source. Second is type inference, where Numba analyzes that x is float32 and i is int32. Third produces LLVM IR, a language-independent intermediate representation. Fourth creates PTX assembly, NVIDIA's portable GPU assembly. Fifth generates SASS machine code, the GPU-specific binary created by the driver at runtime. The pipeline skips C entirely.
- How do I integrate GPU processing into my existing pipeline?
A six-step migration strategy guides the process: first, profile CPU code with cProfile to identify bottlenecks taking over 60% of the time; second, start small by converting one function like FFT or center of mass to CuPy, testing against NumPy with np.allclose(); third, minimize transfers by loading the full dataset to the GPU once using data_gpu = cp.asarray(data_cpu), keeping all operations GPU-side and transferring only final results.
- What batch processing strategies optimize GPU pipelines?
Fourth, batch operations by vectorizing as results = process_all_patterns(data_gpu) instead of looping over patterns to exploit parallelism; fifth, add error handling by wrapping GPU calls in try/except for out-of-memory errors with fallback to CPU or chunking; and sixth, iteratively expand by converting the next bottleneck and repeating.
- What does a typical 4D-STEM GPU pipeline look like?
Example 4D STEM pipeline shows load HDF5, transfer to GPU, compute virtual images on GPU, apply strain mapping on GPU, transfer results, and save where CPU time takes 30 minutes while GPU time takes 2 minutes keeping intermediate arrays on GPU throughout.
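A minimal sketch of such a pipeline under assumed names: the file scan.h5, the dataset key datacube, and the 20-pixel bright-field aperture are placeholders for illustration.

```python
import numpy as np
import cupy as cp
import h5py

with h5py.File('scan.h5', 'r') as f:
    data_cpu = f['datacube'][...]              # (scan_y, scan_x, k_y, k_x)

try:
    data_gpu = cp.asarray(data_cpu)             # single host-to-device transfer
except cp.cuda.memory.OutOfMemoryError:
    raise SystemExit('Dataset exceeds VRAM; fall back to chunked processing')

ky, kx = cp.indices(data_gpu.shape[-2:])
cy, cx = data_gpu.shape[-2] // 2, data_gpu.shape[-1] // 2
bf_mask = ((ky - cy) ** 2 + (kx - cx) ** 2) < 20 ** 2   # bright-field aperture

# Heavy math stays on the GPU; only the small result image comes back.
bf_image = (data_gpu * bf_mask).sum(axis=(-2, -1)).get()

# Validate a small crop against a NumPy reference before trusting the GPU path.
ref = (data_cpu[:4, :4] * bf_mask.get()).sum(axis=(-2, -1))
assert np.allclose(bf_image[:4, :4], ref)
```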
Practice
PyTorch and CUDA
- PyTorch is really fast but why would you use CUDA at all?
PyTorch provides high level operations but custom CUDA kernels give you fine grained control over memory access patterns, thread synchronization, and data layout where optimizations like memory tiling can achieve 3-8x speedups. Production systems like FlashAttention (GPT-4), xFormers (Llama), DeepSpeed (Microsoft), and vLLM use custom CUDA C++ because they need maximum performance and precise control over GPU resources that PyTorch’s general purpose kernels cannot provide. For specialized algorithms where memory access patterns matter more than ease of development, CUDA offers performance gains that justify the implementation complexity.
- Can NumPy run on GPUs, or is that just something PyTorch does?
NumPy runs only on CPUs where all computations execute in CPU memory using optimized BLAS libraries like OpenBLAS or Intel MKL. PyTorch and CuPy provide GPU accelerated alternatives where you explicitly move data to GPU memory using
.cuda() or cp.asarray(), then operations run on GPU hardware. NumPy’s design predates widespread GPU computing so it focuses on CPU performance rather than GPU support.
- What happens when PyTorch runs without a GPU does it automatically fall back to the CPU?
Yes, PyTorch tensors default to CPU where operations run on CPU unless you explicitly move them to GPU with
.to('cuda') or .cuda(). If you try to use GPU operations without available GPU hardware, PyTorch raises an error rather than silently falling back. You must handle device placement explicitly in your code using device = torch.device('cuda' if torch.cuda.is_available() else 'cpu').
- How does PyTorch on CPU compare to NumPy in performance?
PyTorch on CPU performs similarly to NumPy for basic operations since both use the same underlying BLAS libraries (MKL, OpenBLAS, or Accelerate). PyTorch adds overhead for autograd tracking and tensor metadata but for large matrix operations the BLAS calls dominate so performance differences are typically within 10-20%. NumPy may be slightly faster for simple operations without gradients while PyTorch excels when you need automatic differentiation or plan to move computations to GPU later.
- Do GPUs make FFTs (Fast Fourier Transforms) faster than CPUs?
Yes, GPUs can be 10-50x faster for large FFTs because the algorithm parallelizes well across thousands of threads where butterfly operations at each stage process independent data elements simultaneously. For small FFTs (under 1024 points) CPUs may be faster due to data transfer overhead and better memory hierarchy. The crossover point depends on FFT size where 2D FFTs on 1024x1024 or larger images benefit significantly from GPU acceleration.
- Are GPUs faster when running batches of FFTs instead of just one?
Yes, batched FFTs achieve much better GPU utilization where processing 100 images simultaneously keeps thousands of GPU threads busy compared to a single FFT that may underutilize available parallelism. Libraries like cuFFT offer batched operations that amortize memory transfer costs and kernel launch overhead across multiple transforms. For example computing FFTs on 100 images of size 512x512 takes only 2-3x longer than one image rather than 100x longer showing massive parallel efficiency.
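A minimal timing sketch of the loop-versus-batch difference; the stack size is arbitrary and the synchronize calls make sure the clock stops only after the GPU finishes.

```python
import time
import cupy as cp

stack = cp.random.rand(100, 512, 512, dtype=cp.float32)

def timed(fn):
    cp.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    fn()
    cp.cuda.Stream.null.synchronize()   # wait for the GPU before stopping the clock
    return time.perf_counter() - t0

t_loop = timed(lambda: [cp.fft.fft2(stack[i]) for i in range(stack.shape[0])])
t_batch = timed(lambda: cp.fft.fft2(stack, axes=(-2, -1)))   # cuFFT batches the 3D stack
print(f"loop: {t_loop * 1e3:.1f} ms, batched: {t_batch * 1e3:.1f} ms")
```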
- Does PyTorch automatically parallelize matrix operations on the GPU?
Yes, PyTorch automatically parallelizes all GPU operations where a single
torch.matmul(A, B) call launches thousands of threads that compute different output elements simultaneously. The parallelization happens transparently using optimized cuBLAS kernels for linear algebra and custom CUDA kernels for element wise operations. You write sequential looking code while PyTorch’s backend distributes work across GPU cores automatically.
- How does PyTorch actually manage GPU memory?
PyTorch uses a caching memory allocator that requests large blocks from CUDA then suballocates from these blocks to avoid repeated expensive CUDA malloc calls where allocated memory stays cached even after tensors are deleted. You can check memory usage with
torch.cuda.memory_allocated() and force release with torch.cuda.empty_cache() though this rarely helps since PyTorch reuses cached memory efficiently. The allocator reduces fragmentation and improves performance by maintaining a pool of reusable memory blocks.
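A minimal sketch of inspecting the allocator, assuming a CUDA device is present; the tensor size is arbitrary.

```python
import torch

x = torch.empty(1024, 1024, 256, device='cuda')    # roughly 1 GB of float32
print(torch.cuda.memory_allocated() / 1e9, 'GB in live tensors')
print(torch.cuda.memory_reserved() / 1e9, 'GB held by the caching allocator')

del x                                               # tensor freed, but blocks stay cached
print(torch.cuda.memory_allocated() / 1e9, 'GB allocated after del')
print(torch.cuda.memory_reserved() / 1e9, 'GB still cached for reuse')

torch.cuda.empty_cache()                            # return cached blocks to the driver
print(torch.cuda.memory_reserved() / 1e9, 'GB reserved after empty_cache()')
```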
- Do PyTorch developers write their own GPU kernels?
PyTorch core developers write custom CUDA kernels for operations not covered by cuBLAS or cuDNN where they implement element wise operations, reductions, indexing, and specialized functions. See GPU optimization for details on writing custom PyTorch extensions.
Simulations
- Are GPUs useful for DFT, molecular dynamics, and fluid dynamics simulations?
Yes, GPUs provide 10-50x speedups for molecular dynamics (GROMACS, LAMMPS) and fluid dynamics (lattice Boltzmann, CFD solvers) because force calculations and cell updates parallelize naturally. DFT benefits less from GPUs since many quantum chemistry algorithms have sequential dependencies though GPU accelerated codes like VASP-GPU and CP2K show 2-5x improvements for specific calculations. The speedup depends on system size where larger simulations (10,000+ atoms, fine meshes) benefit more from GPU parallelism than small systems.
- How do GPUs actually speed up molecular dynamics calculations?
GPUs assign each atom’s force calculation to separate threads where thousands of atoms compute their forces from neighbors simultaneously rather than sequentially. The GPU evaluates pairwise interactions in parallel where thread 0 computes forces on atom 0 while thread 1 handles atom 1 all happening at once. After force computation completes, another parallel pass updates all atom positions and velocities simultaneously where operations like
new_position = old_position + velocity * timestep run independently for each atom.
- After each step in molecular dynamics, does the GPU update atom positions immediately?
Yes, after computing forces on all atoms, the GPU runs a kernel that updates all positions simultaneously in a single parallel operation. Each thread reads its atom’s force, velocity, and current position then writes the new position back to memory where all atoms update together rather than one at a time. This happens entirely on GPU memory without transferring data to CPU until you need to save snapshots or visualize results.
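A minimal sketch of that per-atom update as vectorized GPU array math; the forces, mass, and timestep are placeholders rather than a real force field.

```python
import cupy as cp

n_atoms = 100_000
positions = cp.random.rand(n_atoms, 3, dtype=cp.float32)
velocities = cp.zeros((n_atoms, 3), dtype=cp.float32)
forces = cp.random.standard_normal((n_atoms, 3), dtype=cp.float32)  # placeholder forces
mass, dt = 1.0, 1e-3

# One vectorized statement updates every atom in parallel on the GPU.
velocities += (forces / mass) * dt
positions += velocities * dt
```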
- If CPUs are so fast, why are they still slower than GPUs for large simulations?
CPUs have 8-16 cores while GPUs have thousands where a CPU with 16 cores processes 16 atoms simultaneously but a GPU with 10,000 cores processes 10,000 atoms simultaneously. Even though each CPU core runs faster (3-5 GHz vs 1-2 GHz GPU), the sheer parallelism overwhelms the clock speed advantage. For 100,000 atom simulations, the CPU needs 6,250 iterations (100,000/16) while the GPU needs only 10 iterations (100,000/10,000) resulting in 10-50x speedups despite slower individual operations.
- Do small sequential delays on CPUs really add up to big slowdowns?
Yes, processing 100,000 atoms sequentially where each takes 1 microsecond totals 100 milliseconds per timestep, but processing them in parallel on 10,000 GPU threads takes only 10 microseconds per batch for 10x speedup. Memory latency compounds the problem where CPUs wait 100-200 ns per memory access but GPUs hide this latency by switching to other threads while waiting. Over millions of timesteps these microsecond differences accumulate to hours or days where a simulation taking 10 days on CPU might finish in 5 hours on GPU.
- In ptychography, do GPUs handle data in batches for faster processing compared to NumPy on CPUs?
Yes, GPUs process multiple diffraction patterns simultaneously where 16-64 patterns compute FFTs, apply constraints, and update in parallel compared to NumPy processing one pattern at a time. Each GPU thread handles different pixels across all patterns enabling batch processing of the entire reconstruction workflow. A GPU might process 32 patterns in the same time NumPy processes 1 pattern giving 20-30x speedups for iterative ptychography algorithms requiring hundreds of iterations.
- Would people at OpenAI use Numba?
Unlikely, because production systems require maximum performance where custom CUDA kernels outperform Numba’s JIT compiled code by 2-5x through hand optimized memory access and thread coordination. Numba works well for prototyping and research where fast iteration matters more than peak performance but production deployments use CUDA C++ for fine grained control. OpenAI uses custom CUDA kernels for critical operations like FlashAttention while PyTorch handles standard layers where the performance gain justifies development effort.
- What is BLAS?
BLAS (Basic Linear Algebra Subprograms) is a standardized API for vector and matrix operations like dot products, matrix multiplication, and linear solves where implementations like Intel MKL, OpenBLAS, and Apple Accelerate provide highly optimized code using assembly and SIMD instructions. NumPy uses BLAS under the hood where
A @ B calls the BLAS dgemm function that has been optimized over decades with cache aware algorithms and parallel threading. Different BLAS libraries offer different performance where Intel MKL excels on Intel CPUs while Apple Accelerate optimizes for M series chips, but all implement the same standard interface letting NumPy work with any BLAS backend.
Hardware understanding
- Would studying hardware, like Jensen Huang did, be a smart move for me?
Understanding hardware fundamentals gives you intuition for performance bottlenecks and helps you write faster code where knowing memory hierarchies explains why small matrices run slowly on GPUs or why tiling improves cache utilization. You don’t need to design chips but knowing how caches work, what SIMD means, and why memory bandwidth limits throughput makes you a better computational scientist. This knowledge pays dividends across your career as algorithms change but hardware principles remain constant.
- Which Stanford EE courses cover hardware and accelerated computing?
EE180 covers digital systems and processor design teaching how CPUs work at the hardware level. CS149 Parallel Computing teaches GPU programming, CUDA, and performance optimization for parallel architectures. EE282 Computer Systems Architecture explores modern processor design including caches, pipelining, and memory systems giving deep understanding of why hardware behaves as it does.
- If AI can write code, do I still need to understand hardware deeply?
Yes, because AI generates syntactically correct code but cannot optimize for your specific hardware constraints or debug performance problems where understanding cache behavior or memory bandwidth becomes critical. Knowing hardware lets you ask better questions, verify AI suggestions, and recognize when generated code will perform poorly. The combination of AI productivity with human hardware intuition creates better results than either alone where AI handles boilerplate while you optimize critical paths.
- Why do CPUs need caches if they already have fast RAM?
RAM takes 100-200 nanoseconds to access while CPUs execute instructions in 0.3 nanoseconds meaning each memory fetch wastes 300-600 CPU cycles where the processor sits idle. Caches provide memory access in 1-10 nanoseconds using expensive SRAM technology placed directly on the CPU chip reducing latency by 10-100x. Modern processors spend more transistors on cache than compute cores because memory latency, not computation speed, limits most program performance.
- Why are there multiple cache levels (L1, L2, L3) instead of just one big one?
Larger caches are physically farther from the CPU core taking more nanoseconds for signals to travel where a 32 KB L1 cache responds in 1 ns but a 32 MB cache would take 10+ ns due to wire length. Each level trades size for speed: L1 (32-64 KB, 1 ns), L2 (256-512 KB, 3-5 ns), L3 (8-32 MB, 10-20 ns) forming a hierarchy where small fast caches filter requests before reaching larger slower ones. This architecture gives you the latency of small caches with the hit rate of large caches where 95% of accesses hit L1/L2 providing fast access while L3 catches the remaining 5%.
- Would a CPU be faster if it were entirely made of L1 cache?
No, because making L1 cache larger increases access latency due to longer signal paths where a 1 MB L1 cache would be slower than today’s 32 KB version. SRAM used for cache requires 6 transistors per bit making it expensive and power hungry where filling a chip with cache leaves no room for compute cores that actually execute instructions. The current design balances cache size, speed, and compute capability where adding more cache beyond a certain point helps less than adding more CPU cores or improving other bottlenecks.
NumPy and PyTorch
- Will PyTorch eventually replace NumPy?
Unlikely, because NumPy remains the standard for CPU based scientific computing with decades of ecosystem integration where libraries like SciPy, pandas, and scikit-learn depend on NumPy arrays. PyTorch excels at GPU accelerated machine learning and automatic differentiation but adds overhead and complexity unnecessary for simple CPU computations. Most projects use both where NumPy handles data preprocessing and analysis while PyTorch trains neural networks with the two libraries interoperating seamlessly through
torch.from_numpy() and .numpy().
- How can NumPy and PyTorch complement each other in practice?
Use NumPy for data loading, preprocessing, and statistical analysis on CPU then convert to PyTorch tensors for GPU accelerated training or inference. PyTorch handles the compute intensive neural network operations while NumPy works with the broader scientific Python ecosystem for visualization (matplotlib), dataframes (pandas), and traditional algorithms (scipy). For example load data with NumPy, augment images with OpenCV (NumPy based), train with PyTorch on GPU, then use NumPy again for result analysis and plotting.
- Can PyTorch handle eigenvalue problems and optimization tasks like NumPy?
Yes, PyTorch provides
torch.linalg.eig() for eigenvalues and many optimization routines but NumPy/SciPy offer more comprehensive linear algebra and numerical methods. PyTorch focuses on operations useful for deep learning where automatic differentiation and GPU acceleration matter more than exhaustive numerical algorithms. For specialized tasks like sparse matrices, signal processing, or statistical distributions, SciPy remains more complete while PyTorch excels at iterative gradient based optimization on large datasets.
Practice questions
PyTorch and CUDA
- PyTorch is really fast but why would you use CUDA at all?
PyTorch provides high level operations but custom CUDA kernels give fine grained control over memory access patterns, thread synchronization, and data layout. Optimizations like memory tiling can potentially achieve 3-8x speedups. Production systems like FlashAttention (GPT-4), xFormers (Llama), DeepSpeed (Microsoft), and vLLM use custom CUDA C++ for maximum performance.
- When does custom CUDA justify the complexity?
For specialized algorithms where memory access patterns matter more than ease of development, CUDA offers performance gains that justify implementation complexity. PyTorch’s general purpose kernels cannot provide precise control over GPU resources that production systems need.
- Can NumPy run on GPUs, or is that just something PyTorch does?
NumPy runs only on CPUs where all computations execute in CPU memory using optimized BLAS libraries like OpenBLAS or Intel MKL. PyTorch and CuPy provide GPU accelerated alternatives where you explicitly move data to GPU using
.cuda() or cp.asarray(). NumPy’s design predates widespread GPU computing.
- What happens when PyTorch runs without a GPU does it automatically fall back to the CPU?
Yes, PyTorch tensors default to CPU where operations run on CPU unless you explicitly move them to GPU with
.to('cuda') or .cuda(). If you try to use GPU operations without available GPU hardware, PyTorch raises an error rather than silently falling back. Handle device placement explicitly: device = torch.device('cuda' if torch.cuda.is_available() else 'cpu').
- How does PyTorch on CPU compare to NumPy in performance?
PyTorch on CPU performs similarly to NumPy for basic operations since both use same underlying BLAS libraries (MKL, OpenBLAS, or Accelerate). PyTorch adds overhead for autograd tracking and tensor metadata but for large matrix operations the BLAS calls dominate. Performance differences typically within 10-20%. NumPy slightly faster for simple operations without gradients. PyTorch excels when you need automatic differentiation or plan to move computations to GPU later.
- Do GPUs make FFTs (Fast Fourier Transforms) faster than CPUs?
Yes, GPUs can be 10-50x faster for large FFTs because the algorithm parallelizes well across thousands of threads where butterfly operations at each stage process independent data elements simultaneously. For small FFTs (under 1024 points) CPUs may be faster due to data transfer overhead and better memory hierarchy. The crossover point depends on FFT size where 2D FFTs on 1024x1024 or larger images typically benefit significantly from GPU acceleration.
- Are GPUs faster when running batches of FFTs instead of just one?
Yes, batched FFTs achieve much better GPU utilization where processing 100 images simultaneously keeps thousands of GPU threads busy compared to a single FFT that may underutilize available parallelism. Libraries like cuFFT offer batched operations that amortize memory transfer costs and kernel launch overhead across multiple transforms. For example computing FFTs on 100 images of size 512x512 typically takes only 2-3x longer than one image rather than 100x longer showing massive parallel efficiency.
- Does PyTorch automatically parallelize matrix operations on the GPU?
Yes, PyTorch automatically parallelizes all GPU operations. A single
torch.matmul(A, B) call launches thousands of threads that compute different output elements simultaneously. Parallelization happens transparently using optimized cuBLAS kernels for linear algebra and custom CUDA kernels for element wise operations. You write sequential looking code while PyTorch distributes work across GPU cores automatically.
- How does PyTorch actually manage GPU memory?
PyTorch uses caching memory allocator that requests large blocks from CUDA then suballocates from these blocks to avoid repeated expensive CUDA malloc calls. Allocated memory stays cached even after tensors deleted. Check usage with
torch.cuda.memory_allocated() and force release with torch.cuda.empty_cache() (rarely helps since PyTorch reuses cached memory efficiently). Allocator reduces fragmentation by maintaining pool of reusable memory blocks.
- Do PyTorch developers write their own GPU kernels?
PyTorch core developers write custom CUDA kernels for operations not covered by cuBLAS or cuDNN implementing element wise operations, reductions, indexing, and specialized functions. See GPU optimization for details on writing custom PyTorch extensions and when to use them.
Hardware understanding
- Would studying hardware, like Jensen Huang did, be a smart move for me?
Understanding hardware fundamentals gives intuition for performance bottlenecks and helps write faster code. Knowing memory hierarchies explains why small matrices run slowly on GPUs or why tiling improves cache utilization. You don’t need to design chips but knowing how caches work, what SIMD means, and why memory bandwidth limits throughput makes you better computational scientist. Knowledge pays dividends as algorithms change but hardware principles remain constant.
- Which Stanford EE courses cover hardware and accelerated computing?
EE180 covers digital systems and processor design teaching how CPUs work at the hardware level. CS149 Parallel Computing teaches GPU programming, CUDA, and performance optimization for parallel architectures. EE282 Computer Systems Architecture explores modern processor design including caches, pipelining, and memory systems giving deep understanding of why hardware behaves as it does.
- If AI can write code, do I still need to understand hardware deeply?
Yes, because AI generates syntactically correct code but cannot optimize for specific hardware constraints or debug performance problems. Understanding cache behavior or memory bandwidth becomes critical. Knowing hardware lets you ask better questions, verify AI suggestions, and recognize when generated code will perform poorly. Combination of AI productivity with human hardware intuition creates better results. AI handles boilerplate while you optimize critical paths.
- Why do CPUs need caches if they already have fast RAM?
RAM takes 100-200 nanoseconds to access while CPUs execute instructions in 0.3 nanoseconds meaning each memory fetch wastes 300-600 CPU cycles where processor sits idle. Caches provide memory access in 1-10 nanoseconds using expensive SRAM technology placed directly on CPU chip reducing latency by 10-100x. Modern processors spend more transistors on cache than compute cores because memory latency, not computation speed, limits most program performance.
- Why are there multiple cache levels (L1, L2, L3) instead of just one big one?
Larger caches are physically farther from CPU core taking more nanoseconds for signals to travel. A 32 KB L1 cache responds in 1 ns but 32 MB cache would take 10+ ns due to wire length. Each level trades size for speed: L1 (32-64 KB, 1 ns), L2 (256-512 KB, 3-5 ns), L3 (8-32 MB, 10-20 ns).
- What advantage does cache hierarchy provide?
Hierarchy gives you latency of small caches with hit rate of large caches where 95% of accesses hit L1/L2 providing fast access while L3 catches remaining 5%.
- Would a CPU be faster if it were entirely made of L1 cache?
No, because making L1 cache larger increases access latency due to longer signal paths where 1 MB L1 cache would be slower than today’s 32 KB version. SRAM used for cache requires 6 transistors per bit making it expensive and power hungry. Filling chip with cache leaves no room for compute cores that actually execute instructions. Current design balances cache size, speed, and compute capability.
NumPy and PyTorch
- Will PyTorch eventually replace NumPy?
Unlikely. NumPy remains standard for CPU based scientific computing with decades of ecosystem integration where libraries like SciPy, pandas, and scikit-learn depend on NumPy arrays. PyTorch excels at GPU accelerated machine learning and automatic differentiation but adds overhead unnecessary for simple CPU computations. Most projects use both where NumPy handles data preprocessing and analysis while PyTorch trains neural networks. The two interoperate seamlessly through
torch.from_numpy() and .numpy().
- How can NumPy and PyTorch complement each other in practice?
Use NumPy for data loading, preprocessing, and statistical analysis on CPU then convert to PyTorch tensors for GPU accelerated training or inference. PyTorch handles compute intensive neural network operations while NumPy works with broader scientific Python ecosystem for visualization (matplotlib), dataframes (pandas), and traditional algorithms (scipy). Example: load data with NumPy, augment images with OpenCV (NumPy based), train with PyTorch on GPU, then use NumPy for result analysis and plotting.
- Can PyTorch handle eigenvalue problems and optimization tasks like NumPy?
Yes, PyTorch provides
torch.linalg.eig() for eigenvalues and many optimization routines but NumPy/SciPy offer more comprehensive linear algebra and numerical methods. PyTorch focuses on operations useful for deep learning where automatic differentiation and GPU acceleration matter more than exhaustive numerical algorithms. For specialized tasks like sparse matrices, signal processing, or statistical distributions, SciPy remains more complete. PyTorch excels at iterative gradient based optimization on large datasets.
Simulations
- Are GPUs useful for DFT, molecular dynamics, and fluid dynamics simulations?
Yes, GPUs can provide 10-50x speedups for molecular dynamics (GROMACS, LAMMPS) and fluid dynamics (lattice Boltzmann, CFD solvers) because force calculations and cell updates parallelize naturally. DFT benefits less since many quantum chemistry algorithms have sequential dependencies though GPU accelerated codes like VASP-GPU and CP2K often show 2-5x improvements for specific calculations. Speedup depends on system size where larger simulations (10,000+ atoms, fine meshes) benefit more.
- How do GPUs actually speed up molecular dynamics calculations?
GPUs assign each atom’s force calculation to separate threads where thousands of atoms compute their forces from neighbors simultaneously rather than sequentially. GPU evaluates pairwise interactions in parallel where thread 0 computes forces on atom 0 while thread 1 handles atom 1 all happening at once. After force computation completes, another parallel pass updates all atom positions and velocities simultaneously where operations like
new_position = old_position + velocity * timestep run independently for each atom.
- After each step in molecular dynamics, does the GPU update atom positions immediately?
Yes, after computing forces on all atoms, GPU runs kernel that updates all positions simultaneously in single parallel operation. Each thread reads its atom’s force, velocity, and current position then writes new position back to memory where all atoms update together rather than one at a time. This happens entirely on GPU memory without transferring data to CPU until you need to save snapshots or visualize results.
- If CPUs are so fast, why are they still slower than GPUs for large simulations?
CPUs have 8-16 cores while GPUs have thousands where a CPU with 16 cores processes 16 atoms simultaneously but a GPU with 10,000 cores processes 10,000 atoms simultaneously. Even though each CPU core runs faster (3-5 GHz vs 1-2 GHz GPU), the sheer parallelism overwhelms the clock speed advantage. For 100,000 atom simulations, the CPU needs 6,250 iterations (100,000/16) while the GPU needs only 10 iterations (100,000/10,000) potentially resulting in 10-50x speedups despite slower individual operations.
- Do small sequential delays on CPUs really add up to big slowdowns?
Yes, processing 100,000 atoms sequentially where each takes 1 microsecond totals 100 milliseconds per timestep, but processing them in parallel on 10,000 GPU threads takes only 10 microseconds per batch for potential 10x speedup. Memory latency compounds the problem where CPUs wait 100-200 ns per memory access but GPUs hide this latency by switching to other threads while waiting. Over millions of timesteps these microsecond differences accumulate to hours or days where a simulation taking 10 days on CPU might finish in 5 hours on GPU.
- In ptychography, do GPUs handle data in batches for faster processing compared to NumPy on CPUs?
Yes, GPUs process multiple diffraction patterns simultaneously where 16-64 patterns compute FFTs, apply constraints, and update in parallel compared to NumPy processing one pattern at a time. Each GPU thread handles different pixels across all patterns enabling batch processing of the entire reconstruction workflow. A GPU might process 32 patterns in the same time NumPy processes 1 pattern potentially giving 20-30x speedups for iterative ptychography algorithms requiring hundreds of iterations.
- Would people at OpenAI use Numba?
Unlikely, because production systems require maximum performance where custom CUDA kernels can outperform Numba’s JIT compiled code by 2-5x through hand optimized memory access and thread coordination. Numba works well for prototyping and research where fast iteration matters more than peak performance but production deployments typically use CUDA C++ for fine grained control. OpenAI uses custom CUDA kernels for critical operations like FlashAttention while PyTorch handles standard layers where the performance gain justifies development effort.
- What is BLAS?
BLAS (Basic Linear Algebra Subprograms) is a standardized API for vector and matrix operations like dot products, matrix multiplication, and linear solves where implementations like Intel MKL, OpenBLAS, and Apple Accelerate provide highly optimized code using assembly and SIMD instructions. NumPy uses BLAS under the hood where
A @ B calls the BLAS dgemm function that has been optimized over decades with cache aware algorithms and parallel threading. Different BLAS libraries offer different performance where Intel MKL excels on Intel CPUs while Apple Accelerate optimizes for M series chips, but all implement the same standard interface letting NumPy work with any BLAS backend.
Implementation for 4D-STEM
This section covers practical GPU implementation for 4D-STEM data analysis. Begin with Getting started for software setup and basic operations. Review Performance characteristics to understand when GPUs provide benefits. Explore Common 4D-STEM operations for specific techniques like virtual imaging and strain mapping. Understand Bottlenecks and challenges for data transfer and real-time issues. Scale up with Multi-GPU processing for large facilities. Optimize using Memory management and Optimization workflow sections. Handle Failure modes and recovery for production reliability.
Getting started
- What software supports GPUs for 4D-STEM?
py4DSTEM provides Python library with CuPy backend supporting virtual imaging, center of mass, strain mapping, and ptychography on single GPU with Jupyter integration. LiberTEM offers distributed framework with optional GPU per node handling live microscope streaming and TB-scale datasets.
- What other GPU-enabled tools exist?
HyperSpy provides partial GPU support via CuPy with lazy loading. pyxem enables crystallographic orientation mapping with GPU template matching. Prismatic supports multi-GPU STEM simulation. abTEM handles multislice simulation.
- What do national labs use?
National labs use SHARP for ptychography, riCOM for real-time DPC, and PtychoShelves for high-throughput processing.
- How do I enable GPU in Python?
Check availability first to prevent errors:
```python
import cupy as cp

try:
    test = cp.array([1, 2, 3])
    print(f"GPU available: {cp.cuda.runtime.getDeviceCount()} devices")
    print(f"GPU name: {cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
    print(f"GPU memory: {cp.cuda.Device(0).mem_info[1] / 1e9:.1f} GB")
except Exception as e:
    print(f"GPU not available: {e}")
```
- How do I convert NumPy to CuPy?
Replace
import numpy as np with import cupy as cp where np.array becomes cp.array and most operations have equivalents like np.sum becoming cp.sum and np.fft.fft2 becoming cp.fft.fft2.
- How do I transfer data to GPU?
Use
gpu_array = cp.asarray(cpu_array) for moving data to GPU and cpu_array = gpu_array.get() to retrieve results. Core algorithm stays identical with most changes being initial transfer and final retrieval steps.
- What file formats work best with GPU processing?
HDF5 is most common for 4D-STEM. Chunked storage enables partial loading where you load 64 by 64 scan tile without full dataset. Compression using gzip or lz4 reduces disk space 2 to 5 times.
h5py with chunks=(1,1,128,128) optimizes for pattern access.
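A minimal h5py sketch of chunked, compressed storage with one diffraction pattern per chunk; the file name and shapes are illustrative.

```python
import h5py
import numpy as np

with h5py.File('scan.h5', 'w') as f:
    dset = f.create_dataset(
        'datacube',
        shape=(256, 256, 128, 128),
        dtype=np.uint16,
        chunks=(1, 1, 128, 128),   # one diffraction pattern per chunk
        compression='gzip',
    )
    dset[0, 0] = np.random.randint(0, 1000, (128, 128), dtype=np.uint16)

with h5py.File('scan.h5', 'r') as f:
    tile = f['datacube'][0:64, 0:64]   # partial read: a 64x64 scan tile, not the full cube
```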
- What are alternatives to HDF5?
Zarr is cloud-optimized working better for distributed systems. Compression often performs better with blosc providing 3 to 10 times reduction for sparse data. NumPy .npy format works fast for small datasets under 10 GB but no compression.
- What about raw binary formats?
Raw binary offers fastest load but no metadata. Memory-map using
np.memmap('data.dat', shape=(256,256,128,128)) for efficient access.
HDF5 for general use provides tooling, compression, and metadata. Use Zarr for cloud or distributed systems. Use raw binary for maximum speed when metadata stored elsewhere. Avoid many small files where overhead dominates.
Performance characteristics
- When is GPU faster than CPU?
GPU faster when applying same operation to different data like center of mass on 65,536 patterns, running parallel operations like 100 virtual image detectors simultaneously, or doing repetitive calculations like FFT on each pattern.
CPU better for sequential tasks including file parsing and metadata reading, varied operations like building pipelines, or small datasets where GPU overhead dominates.
Overhead shows PCIe transfer plus kernel launch at approximately 0.5 ms. Worth it at hundreds to thousands of operations.
- What speedups do GPUs provide?

| Operation Type | Typical Speedup | Notes |
|---|---|---|
| Element-wise (center of mass, radial integration) | 50-100× | Memory-bound, highly parallel |
| Virtual aperture imaging | 100-500× | 100+ detectors simultaneously |
| FFT (cuFFT vs FFTW) | 20-50× | Depends on size, batching |
| Template matching | 30-80× | GPU orientation mapping [1] |
| Strain mapping | 10-30× | Sequential dependencies limit speedup |
| Deep learning (inference) | 50-200× | With tensor cores, batching |
| Deep learning (training) | 100-1000× | Batch size, tensor cores critical |
| Real-time reconstruction (riCOM) | Acquisition-speed | Impossible on CPU [2] |
Factors affecting speedup: Bandwidth utilization, algorithm parallelizability, transfer overhead, kernel optimization.
- What are representative GPU speedup ranges for 4D-STEM operations?
Element-wise operations including center of mass and radial integration achieve 50-100× speedup. Virtual imaging reaches 100-500× with over 100 detectors simultaneously. FFT runs 20-50× faster comparing cuFFT vs FFTW. Template matching for GPU orientation mapping runs 30-80× faster [1]. Strain mapping achieves 10-30× speedup.
- What about deep learning speedups?
Deep learning shows inference 50-200× and training 100-1000× with tensor cores plus batching. Real-time systems like riCOM achieve acquisition-speed reconstruction, impossible on CPU.
- When does GPU overhead matter?
Hypothetical example showing overhead compares single 128×128 pattern where CPU takes 0.1 ms while GPU takes 0.01 ms compute plus 0.5 ms transfer equaling 0.51 ms making CPU faster. Full 256×256 dataset shows CPU at 6.5 seconds while GPU uses 0.65 seconds plus 0.5 seconds transfer equaling 1.15 seconds making GPU 5.6 times faster.
Principle shows overhead amortizes over large datasets. Best practice keeps data on GPU between analyses to eliminate repeated transfers.
- How long does GPU analysis take?
Representative timing ranges will vary by implementation and hardware. For 256×256 scan with 128×128 detector at 17 GB on RTX 4090, center of mass takes 1 to 2 seconds, virtual imaging with 100 detectors takes 5 to 10 seconds, strain mapping takes 10 to 20 seconds, and phase reconstruction takes 2 to 5 minutes.
Same operations on CPU take 3 to 5 minutes, 10 to 30 minutes, 30 to 60 minutes, and hours respectively.
Large datasets with 512×512 scan and 256×256 detector at 256 GB on A100 take 5 to 10 times longer but still minutes to tens of minutes versus hours to days on CPU.
Actual performance depends on algorithm optimization, memory access patterns, and whether operations are memory-bound or compute-bound.
Common 4D-STEM operations
- Why does virtual aperture imaging take so long on CPUs?
The sequential bottleneck shows virtual aperture imaging requires computing 10 to 100 different detector geometries including bright-field center, multiple dark-field angles, and annular rings across entire 4D dataset where each detector means looping through all 65,536 patterns from 256 by 256 scan.
- What is the computational load for virtual aperture imaging?
Computational load shows 100 detectors times 65,536 patterns times 128 by 128 pixels equals roughly 107 billion mask and sum operations where CPU with 16 cores processes 65K patterns divided by 16 equals 4,096 patterns per core sequentially at 1 ms per pattern equaling 4 seconds per detector times 100 detectors equaling 400 seconds total.
- Why is memory access a problem for CPU virtual imaging?
Memory access pattern requires reading same 17 GB dataset 100 times, once per detector, from RAM where total data movement reaches 1.7 TB at 76 GB per second RAM bandwidth equaling 22 seconds just for memory reads. Nested loops problem shows outer loop for detectors contains inner loop for patterns contains pixel loop where you cannot vectorize across all dimensions simultaneously on CPU.
- What is virtual aperture imaging and how is it parallelized?
Virtual aperture imaging sums diffraction intensity within specific detector regions to form images where bright-field or BF uses center disk, dark-field or DF uses off-axis regions, and annular dark-field or ADF uses ring around center. GPU parallelization assigns each scan position as independent thread where thread loads one diffraction pattern, applies mask with 1 inside aperture and 0 outside, sums pixels, and writes result to output image, launching 100 times scan_positions threads for 100 virtual detectors where each thread computes all 100 detectors for one position. Memory pattern shows coalesced reads where consecutive threads load consecutive patterns and scattered writes where each detector image updates independently, achieving typical performance 100 to 500 times faster than CPU because 65K patterns times 100 detectors equals 6.5 million independent operations creating ideal GPU workload with code showing
output[detector][scan_y][scan_x] = sum(pattern * mask[detector]) in parallel across all scan positions.
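A minimal CuPy sketch of the same idea expressed as one batched contraction instead of an explicit kernel; the scan size, detector size, and circular masks are illustrative.

```python
import cupy as cp

data = cp.random.rand(64, 64, 128, 128, dtype=cp.float32)   # (scan_y, scan_x, k_y, k_x)
masks = cp.zeros((10, 128, 128), dtype=cp.float32)          # one mask per virtual detector
ky, kx = cp.indices((128, 128))
for d, radius in enumerate(range(5, 55, 5)):
    masks[d] = ((ky - 64) ** 2 + (kx - 64) ** 2 < radius ** 2).astype(cp.float32)

# Sum over the detector axes for every (mask, scan position) pair in one call.
images = cp.tensordot(masks, data, axes=([1, 2], [2, 3]))   # -> (detector, scan_y, scan_x)
print(images.shape)
```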
- Why is Bragg disk detection challenging on CPUs?
Variable computation per pattern shows crystalline samples produce 5 to 500 Bragg peaks per diffraction pattern depending on zone axis, sample thickness, and beam convergence where you cannot predict count ahead and need dynamic processing.
- What is the sequential bottleneck for Bragg disk detection?
Sequential bottleneck shows CPU processes one pattern at a time through six steps including first, load 256 by 256 pattern equaling 65K pixels, second, background subtraction with 65K operations, third, Gaussian blur with 5 times convolution totaling 325K operations, fourth, template matching scanning 256 positions times 256 positions equaling 65K correlations each with 100 pixel circular template totaling 6.5 million operations.
- How many operations does Bragg disk detection require?
Fifth, non maximum suppression finding local maxima with 65K comparisons, and sixth, subpixel refinement fitting 2D Gaussian per peak with 100 to 500 peaks times 50 iterations each equaling 25,000 operations per pattern where total per pattern reaches approximately 7 million operations. For 65,536 patterns, this equals 460 billion operations where single CPU core at 3 GHz takes approximately 460 seconds equaling 7.7 minutes for peak detection alone before strain analysis.
- How do I compute Bragg disk detection on GPU?
Bragg disk detection finds crystalline reflection peaks in diffraction patterns where GPU approach involves three phases including preprocessing handling each pattern in parallel with background subtraction and Gaussian blur operations running simultaneously across all patterns, peak finding convolving with circular template applying threshold and finding local maxima using shared memory for efficient coordinate computation, and subpixel refinement fitting 2D Gaussian per peak with each peak processed independently by its own thread. Challenge arises from variable peaks per pattern ranging from 5 to 500 depending on crystallography where solution uses atomic operations for global peak list or fixed-size buffers to handle this dynamic output. py4DSTEM library uses GPU via CuPy for these operations typically achieving 20 to 100 times CPU speedup where process is memory-bound due to loading 256×256 pattern rather than compute-bound.
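A generic sketch of the template-matching step via FFT cross-correlation (not py4DSTEM's exact implementation); the pattern, disk radius, and threshold rule are assumptions for illustration.

```python
import cupy as cp

pattern = cp.random.rand(256, 256, dtype=cp.float32)        # placeholder diffraction pattern

ky, kx = cp.indices((256, 256))
template = ((ky - 128) ** 2 + (kx - 128) ** 2 < 5 ** 2).astype(cp.float32)  # small disk

# Cross-correlation via the convolution theorem; ifftshift puts the template center at (0, 0).
corr = cp.fft.ifft2(cp.fft.fft2(pattern) * cp.conj(cp.fft.fft2(cp.fft.ifftshift(template)))).real

threshold = corr.mean() + 5 * corr.std()
peaks = cp.argwhere(corr > threshold)   # candidate disk centers, before subpixel refinement
```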
- What is the workflow for strain mapping on GPU?
Strain mapping involves three main steps where first, detect Bragg peaks using template matching and Gaussian fitting, second, index peaks to crystal lattice by matching positions to reciprocal lattice vectors, and third, compute strain tensor by comparing measured peak positions to unstrained reference positions.
- What GPU speedups does each strain mapping step achieve?
Peak detection gains 20 to 100 times speedup, indexing gains 50 times speedup from parallel matching operations, and strain tensor computation gains 100 times speedup from parallel linear algebra. For 256×256 scan with 65K patterns, these speedups enable efficient processing of large datasets.
- How much speedup does GPU provide for strain mapping?
GPU acceleration yields 20 to 100 times for peak detection by parallelizing template matching, 10 to 30 times for template matching through efficient correlations, and 50 to 100 times for element-wise math from massive parallel arithmetic.
- Why is end-to-end strain mapping speedup lower than individual steps?
Overall pipeline expects 10 to 30 times end-to-end speedup due to sequential dependencies where indexing must wait for detection and strain computation must wait for indexing.
- What optimization strategies improve strain mapping performance?
Batch processing multiple patterns together and keeping intermediate results on GPU avoids transfer overhead. py4DSTEM and LiberTEM support these workflows with built-in GPU acceleration.
Bottlenecks and challenges
- What detector speeds and data volumes are typical?
Modern detector speeds range from 400 Hz for older DE-64 detectors to 12.5 kHz for binary counting detectors. A 256×256 scan acquires 65,536 diffraction patterns, taking 65.5 seconds at 1 kHz or just 5.2 seconds at 12.5 kHz. Storage requirements grow accordingly. A single scan produces 17 GB of raw data, expanding to 2 to 3 TB after preprocessing. Real-time reconstruction means processing must keep pace with acquisition, but traditional transfer paths cannot deliver data fast enough for GPU processing speeds.
- How can I have data directly on GPU VRAM from the detector?
In a conventional setup, the detector sends its images to the microscope’s control computer. The CPU copies them into system memory, the operating system packages them into network packets, and the network interface card (NIC) sends them over the physical network link. On the receiving side, the workstation’s NIC hands the packets to its CPU, which copies the data into host RAM before the GPU can access them. Each copy and CPU interrupt slows the stream.
With RDMA, that middle layer disappears. Both the detector and the GPU workstation have special network cards called RNICs that can move data directly between their memories. The detector’s RNIC streams the data over the network link, and the RNIC on the workstation writes it straight into the GPU’s memory. The CPUs on both sides stay idle while the transfer happens entirely in hardware, creating a continuous, low-latency path from detector to GPU VRAM.
- What hardware does GPUDirect RDMA require?
CUDA-compatible network cards from Mellanox or NVIDIA, Linux RDMA drivers that map GPU memory into network address space, and detector firmware supporting RDMA writes. Most research groups already have 10 GbE or faster networks, so upgrading just the network card and enabling RDMA support costs under $2,000 per system. Real-time imaging systems like riCOM rely on this technology for live center of mass visualization during experiments.
- What network speeds support different detector rates?
Standard 10 GbE delivers 1.25 GB per second, supporting detectors up to about 10 kHz. National labs upgrade to 40 GbE at 5 GB per second, 100 GbE at 12.5 GB per second, or InfiniBand HDR at 25 GB per second for faster detectors. Network infrastructure costs $500 to $5,000 per endpoint depending on speed tier, requiring compatible switches, cables, and network cards on both detector and computer sides.
- What is on-detector processing and why is it useful?
Embedding GPU hardware directly in detector electronics enables processing before network transmission. Rather than streaming raw 256×256 patterns, the detector computes derived quantities like center of mass coordinates, Bragg peak positions, or deep learning features. A full diffraction pattern contains 256 KB of data. The center of mass reduces to just 8 bytes, creating 32,000 times compression. Even computing multiple features achieves 100 to 1,000 times reduction.
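A minimal sketch of that center-of-mass reduction on a single placeholder pattern; real on-detector firmware would perform the equivalent in hardware.

```python
import cupy as cp

pattern = cp.random.randint(0, 1000, (256, 256)).astype(cp.float32)  # placeholder pattern
ky, kx = cp.indices(pattern.shape, dtype=cp.float32)

total = pattern.sum()
com_y = (pattern * ky).sum() / total
com_x = (pattern * kx).sum() / total
# Full pattern in, 8 bytes (two float32 coordinates) out: the ~32,000x reduction quoted above.
```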
- How does on-detector processing shift the bottleneck?
From network bandwidth to detector computing power. A 256×256 detector at 10 kHz generates 6.5 GB per second of raw data, exceeding most university lab network capabilities. Extracting features on the detector electronics reduces this to 65 MB per second for center of mass or 650 MB per second for multiple derived quantities, fitting comfortably within 10 GbE bandwidth limits.
- What advanced on-detector processing exists?
Deep learning inference on detector classifies crystal phases, detects defects, or predicts material properties in real time. Binary counting mode records just 1 bit per pixel rather than 16 bits, reducing data by 8 to 16 times with minimal information loss for many applications. Detector electronics using NVIDIA Jetson or small datacenter GPUs cost $5,000 to $15,000 plus engineering effort for firmware integration, but enable experiments impossible with raw pattern streaming alone.
- What infrastructure challenges remain?
University labs typically have 10 GbE networks that saturate at 10 kHz detector speeds with raw pattern streaming. Next generation detectors operating above 10 kHz require on-detector compression as an essential component since network bandwidth cannot keep pace. Multiple detectors operating simultaneously or large-scale facilities with dozens of microscopes demand 100 GbE or InfiniBand infrastructure costing $10,000 to $50,000 per microscope station.
- What storage challenges exist?
Local NVMe SSDs provide 3 to 7 GB per second write speeds sufficient for single detectors but fill rapidly. A 256×256 scan produces 17 GB of raw data. Collecting 100 scans per day creates 1.7 TB daily, requiring 620 TB annually per microscope. Facilities with multiple microscopes need petabyte-scale storage with parallel filesystems like Lustre or GPFS, centralized data management, and network bandwidth to sustain simultaneous acquisition from dozens of detectors.
- What is the best upgrade path for research groups?
The fundamental tradeoff remains between real-time capability and infrastructure investment. GPUDirect RDMA for under $2,000 provides the largest performance improvement using existing networks. On-detector processing at $5,000 to $15,000 enables experiments beyond network bandwidth limits. Full infrastructure upgrades with 100 GbE networking and petabyte storage cost $50,000 to $500,000 per facility but support cutting-edge detector speeds and multi-user environments. Most research groups start with GPUDirect RDMA, add on-detector features as detector speeds increase, and upgrade infrastructure when scaling to facility-level operations.
Benchmarking and performance monitoring
- How do I benchmark my GPU performance for 4D-STEM?
Measure FFT throughput with 1000× 256×256 FFT targeting 50,000 to 100,000 patterns per second on RTX 4090. Test memory bandwidth with large array copy expecting 80 to 90 percent of 1000 GB per second. Run virtual imaging with 256×256 scan and 100 detectors in 5 to 10s.
- What are additional benchmarks?
Center of mass on 65,536 patterns should take under 2s. One ePIE ptychography iteration on 256×256 scan should target 2 to 5s.
- How should I time GPU operations?
Use
time.time() with cp.cuda.Stream.null.synchronize() to ensure the GPU finishes. Compare to literature values. Over 2 times slower indicates thermal throttling, PCIe bottleneck, memory fragmentation, or driver issue.
See also: How do I truly maximize GPU power for ptychography with 4D-STEM data? for performance optimization, gpu-specs-ptychography for expected performance by hardware.
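A minimal timing sketch using CUDA events instead of wall-clock calls; the workload matches the FFT benchmark above and the sizes are illustrative.

```python
import cupy as cp

data = cp.random.rand(1000, 256, 256, dtype=cp.float32)

start, end = cp.cuda.Event(), cp.cuda.Event()
start.record()
cp.fft.fft2(data, axes=(-2, -1))    # benchmark target: 1000 batched 256x256 FFTs
end.record()
end.synchronize()                   # block until the GPU reaches the end event

elapsed_ms = cp.cuda.get_elapsed_time(start, end)
print(f"{elapsed_ms:.1f} ms -> {1000 / (elapsed_ms / 1e3):.0f} patterns/s")
```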
- What metrics should I track to evaluate GPU efficiency?
Track utilization with
nvidia-smi dmon targeting 90 to 100 percent during processing. Under 50 percent indicates bottleneck elsewhere like CPU, disk I/O, or small batches. Memory bandwidth from Nsight Compute should exceed 80 percent of theoretical.
- What other efficiency metrics matter?
Occupancy showing active warps divided by max warps should exceed 50 percent. Under 25 percent indicates too few threads or register pressure. Speedup comparing GPU time to CPU time should match expectations where FFT gains 20 to 50 times, virtual imaging 100 to 500 times, deep learning 100 to 1,000 times.
- How should I monitor long runs?
Plot metrics during long runs. Sudden drops indicate thermal throttling, memory issues, or system contention. Goal is sustained high utilization with speedup matching literature values for your hardware.
- How do I test for real-time reliability?
Use synthetic data at real acquisition rates. For 10 kHz 256×256 patterns, generate data at 1.28 GB/s and process in a loop. Run for 1+ hour stress test to catch intermittent failures. Monitor for dropped frames, buffer overflows, latency spikes. Synthetic testing is safer than actual detector since you control rate and can repeat failures.
GPU for ptychography
- How are the kernels launched in PyTorch for my initial single-slice ptychography?
- Launch kernel A to gather patches
- Launch kernel B to multiply by the probe
- Launch cuFFT kernels for the FFT
- Launch kernel C for the amplitude constraint
- Launch cuFFT kernels for the inverse FFT
- Launch kernel D to compute updates
- Launch kernel E to scatter-add updates
That sequence repeats for every batch, every iteration.
This leads to six to ten round-trips to VRAM for every pixel per iteration, across 16,384 scan positions × 256² detector pixels. The math is relatively cheap compared to the memory traffic.
Each kernel launch adds 3-10 microseconds of overhead, which adds up with thousands of positions and several kernels per position. The workload is memory-bandwidth bound.
- How do you determine whether computation is “memory-bound” or “compute-bound”?
Compare the kernel's arithmetic intensity (FLOPs per byte of DRAM traffic) to the GPU's compute-to-bandwidth ratio. If L2/DRAM bandwidth saturates while the arithmetic units wait for data, the computation is memory-bound; if the arithmetic units saturate first, it is compute-bound.
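A back-of-the-envelope version of that check, with assumed peak numbers (illustrative RTX 4090 values) and the standard 5·N·log2(N) FFT operation count:

```python
# Roofline-style estimate with assumed, illustrative peak numbers (not measurements).
peak_tflops = 82.6             # assumed FP32 peak, e.g. RTX 4090 class
peak_bw_gbs = 1008             # assumed DRAM bandwidth in GB/s
machine_balance = peak_tflops * 1e12 / (peak_bw_gbs * 1e9)   # FLOPs per byte

# One 256x256 complex FFT: ~5*N*log2(N) FLOPs, one read and one write of complex64 data.
n = 256 * 256
flops = 5 * n * 16             # log2(65536) = 16
bytes_moved = 2 * n * 8        # read + write, 8 bytes per complex64 value
intensity = flops / bytes_moved

print(f"machine balance: {machine_balance:.0f} FLOPs/byte")
print(f"FFT intensity  : {intensity:.1f} FLOPs/byte")
print("memory-bound" if intensity < machine_balance else "compute-bound")
```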
- How can a persistent kernel help?
A persistent kernel stays resident on the GPU and, for each scan position, pulls a position index from a device-side loop/queue, loads the object tile + probe into shared memory/registers, does FFT → constraint → iFFT in-place (e.g., with cuFFTDx), computes the updates, overlap-adds to the global object, and loops to the next position (and possibly the next algorithm iteration) without returning to the CPU.
This means less DRAM traffic: don't spill exit waves/FFT intermediates to global memory between steps; they live in registers/shared memory.
- What does it mean to “spill exit waves between steps”?
Spilling means writing intermediate results out to slow global memory (DRAM) between steps; the goal is to keep them on chip instead.
- How is the GPU memory hierarchy used?
Registers (per thread) → shared memory (per block) → L1/L2 cache → global memory (VRAM). Registers and shared memory live on-chip.
- What is the relationship between VRAM and DRAM?
VRAM is the name for the DRAM (Dynamic Random Access Memory) mounted on the GPU board.
- How can we structure this?
Keep autodiff and Adam in PyTorch; put no optimizer math inside the CUDA kernel.
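A minimal PyTorch sketch of that structure, with made-up shapes and random stand-in data; the forward model is where custom or fused kernels would eventually go, while gradients and Adam stay in PyTorch:

```python
import torch

# Toy single-slice forward model with hypothetical shapes and random stand-in data.
# The optimizer step and autodiff stay in PyTorch; only the forward model would
# ever be replaced by custom/fused kernels.
device = "cuda" if torch.cuda.is_available() else "cpu"
n_pos, n, obj_size = 64, 64, 128
measured_amp = torch.rand(n_pos, n, n, device=device)     # stand-in measured amplitudes
corners = torch.randint(0, obj_size - n, (n_pos, 2))       # top-left patch corners

obj = torch.ones(obj_size, obj_size, dtype=torch.complex64,
                 device=device, requires_grad=True)
probe = torch.ones(n, n, dtype=torch.complex64,
                   device=device, requires_grad=True)
opt = torch.optim.Adam([obj, probe], lr=1e-2)

for it in range(10):
    opt.zero_grad()
    loss = torch.zeros((), device=device)
    for p in range(n_pos):
        y, x = corners[p].tolist()
        exit_wave = obj[y:y + n, x:x + n] * probe           # gather patch + multiply by probe
        farfield = torch.fft.fft2(exit_wave)                # propagate to the detector
        loss = loss + ((farfield.abs() - measured_amp[p]) ** 2).mean()
    loss.backward()                                         # autodiff builds the gradient
    opt.step()                                              # Adam stays in PyTorch
```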
- What is cuFFTDx?
cuFFTDx runs the FFT inside a CUDA kernel. Fusing the FFT with other operations decreases latency and avoids extra trips to global memory.
Computational demands and fundamentals
- What makes ptychography so computationally demanding?
Ptychography reconstructs both object and probe by iteratively processing thousands of diffraction patterns where each iteration involves propagating the probe to the detector using FFT, applying amplitude constraints where measured and calculated amplitudes must match, propagating back using inverse FFT, and updating object and probe estimates based on differences. This cycle repeats many times with typical reconstructions requiring 50 to 200 iterations where total computation grows as scan positions times iterations times operations per position, creating a significant compute requirement.
- How many FFTs are needed for a typical ptychography experiment?
For a 512 by 512 scan with 256 by 256 detector, there are 262,144 patterns total where each pattern requires two FFTs per iteration (forward and inverse) totaling 524,288 FFTs per iteration. At 100 iterations (typical for good quality), that totals 52.4 million FFT operations where GPUs accelerate these calculations by 10 to 50 times compared to CPUs.
- How does batched FFT processing accelerate ptychography?
Batched FFT lets GPU process many diffraction patterns at once using cuFFT batch API, achieving much higher throughput than sequential processing. Processing 1,000 patterns in single batch typically runs 10 to 50 times faster than looping individually. Performance comparison: single FFT approach where launch overhead (approximately 5 microseconds) plus compute (approximately 50 microseconds) totals 55 microseconds per FFT, versus batched 1,000 FFTs where one launch (approximately 5 microseconds) plus parallel compute (approximately 500 microseconds) totals 505 microseconds equaling 0.5 microseconds per FFT (potentially 100 times better efficiency). Batching amortizes overhead across many operations, improves memory locality, and maximizes occupancy keeping all streaming multiprocessors busy [2].
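A hedged CuPy sketch of the comparison (CuPy dispatches to cuFFT; sizes and counts are illustrative):

```python
import time
import cupy as cp

# Compare looped (one launch per pattern) against batched (one call for all patterns).
patterns = cp.random.random((1000, 256, 256)).astype(cp.complex64)

def timed(fn):
    cp.cuda.Stream.null.synchronize()
    t0 = time.time()
    fn()
    cp.cuda.Stream.null.synchronize()
    return time.time() - t0

# Looped: one kernel launch (and plan lookup) per pattern.
t_loop = timed(lambda: [cp.fft.fft2(patterns[i]) for i in range(patterns.shape[0])])

# Batched: a single call transforms all patterns along the last two axes.
t_batch = timed(lambda: cp.fft.fft2(patterns, axes=(-2, -1)))

print(f"looped {t_loop*1e3:.1f} ms, batched {t_batch*1e3:.1f} ms, "
      f"speedup {t_loop/t_batch:.1f}x")
```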
- How does overlapping probe position optimization work?
By loading adjacent scan positions together, GPU can reuse memory and reduce redundant access. Adjacent positions in ptychography scan typically have 60 to 80 percent probe overlap meaning significant shared field of view. This spatial overlap means less data movement since nearby object regions stay in cache and higher cache efficiency from locality of reference where GPU threads processing adjacent positions access nearby memory addresses improving coalescing. This can speed up reconstruction by 20 to 40 percent especially for large scans with significant probe overlap where memory bandwidth becomes limiting factor.
- Why use mixed precision in ptychography?
Mixed precision stores data as float16 but computes in float32. This halves memory usage from 4 bytes to 2 bytes allowing larger batches to fit in VRAM and faster processing from doubled memory bandwidth. Reconstruction quality is unaffected because critical computations use float32 precision while storage and transfer use float16. Mixed precision is standard in modern GPU workflows for typical speed gains of 1.5 to 2 times and memory savings of 50 percent.
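A small CuPy sketch of the storage-in-float16, compute-in-float32 pattern (shapes are illustrative):

```python
import cupy as cp

# Store the pattern stack as float16 in VRAM (~128 MB instead of ~256 MB here),
# upcast each batch to float32 only for the arithmetic.
stack_f16 = cp.random.random((1024, 256, 256)).astype(cp.float16)

batch = 256
for start in range(0, stack_f16.shape[0], batch):
    chunk = stack_f16[start:start + batch].astype(cp.float32)   # upcast for compute
    spectra = cp.fft.fft2(chunk)                                 # float32/complex64 math
    # ... apply constraints, accumulate updates in float32 ...
    del chunk, spectra
```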
- Why is CPU-to-GPU data transfer so slow?
Memory paging problem: regular RAM (pageable memory) can be swapped to disk by OS at any time. During transfer, OS must: check page tables to determine if data still in RAM or swapped (approximately 100 nanoseconds per page), lock pages temporarily to prevent swap mid-transfer, copy to GPU via DMA, and unlock pages after transfer completes.
See also: What are the benefits of using pinned memory? for the solution, How do I minimize GPU↔CPU transfers in analysis pipeline? for minimizing transfers, disk-io-challenge for storage bottlenecks.
- What is the overhead cost of memory paging?
For 1 GB transfer with 4 KB pages: 262,144 pages times 100 nanoseconds equals 26 milliseconds just for page management overhead. Actual transfer takes 1 GB divided by 32 GB per second PCIe bandwidth equals 31 milliseconds. Total becomes 57 milliseconds where 45 percent is overhead.
- How does context switching create additional delays?
Context switching creates additional delays where if OS needs RAM during transfer, it pauses GPU copy, moves pages to disk, and resumes adding 10 to 100 milliseconds latency spikes. Historical problem from 2000s showed early CUDA systems achieved only 10 to 15 GB per second effective compared to 32 GB per second theoretical PCIe bandwidth due to paging overhead where iterative algorithms required 100 plus transfers and paging overhead dominated runtime.
- How does pinned memory improve streaming?
Pinned memory also called page locked memory enables faster disk to GPU transfers using DMA or direct memory access where operating system cannot swap pinned pages to disk guaranteeing they remain in RAM at fixed addresses.
- What are the benefits of using pinned memory?
This reduces transfer latency by eliminating page table lookups and lock unlock overhead where GPU can directly access pinned memory without CPU intervention and transfer speed improves from 15 GB per second with pageable memory to 25 to 30 GB per second with pinned memory approaching theoretical PCIe bandwidth. This becomes especially important for real time streaming from fast storage devices where consistent low latency matters more than peak bandwidth, allowing GPU to start processing data sooner and reducing pipeline bubbles.
See also: Why is CPU-to-GPU data transfer so slow? for why transfers are slow without pinning.
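A minimal sketch of allocating a pinned staging buffer with CuPy; wrapping the pinned allocation in a NumPy array via np.frombuffer follows the commonly used CuPy recipe, and the buffer sizes are illustrative:

```python
import numpy as np
import cupy as cp

# Page-locked (pinned) host staging buffer: faster PCIe copies and async-capable.
def pinned_array(shape, dtype=np.float32):
    n = int(np.prod(shape))
    mem = cp.cuda.alloc_pinned_memory(n * np.dtype(dtype).itemsize)
    return np.frombuffer(mem, dtype, n).reshape(shape)

host_buf = pinned_array((256, 256, 256))          # pinned staging buffer (~64 MB)
host_buf[...] = np.random.random(host_buf.shape)  # fill with stand-in data

stream = cp.cuda.Stream(non_blocking=True)
gpu_buf = cp.empty(host_buf.shape, dtype=host_buf.dtype)
gpu_buf.set(host_buf, stream=stream)              # async copy is possible from pinned memory
stream.synchronize()
```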
- How do multi-GPU setups scale ptychography?
Multi-GPU scaling distributes scan positions across multiple GPUs increasing throughput nearly linearly with GPU count where each GPU reconstructs subset of scan positions independently processing its assigned patterns in parallel. Careful handling of overlapping regions is needed to avoid artifacts at boundaries since scan positions near partition boundaries share object information requiring communication between GPUs to maintain consistency. Implementation uses MPI for inter-GPU communication where boundary regions exchange object updates each iteration, with two to four GPU setup typical providing 1.8 to 3.5 times speedup and eight GPU systems at national labs achieving 6 to 7 times speedup with careful optimization.
Reference provides implementation details and scaling benchmarks [4].
Performance optimization and bottlenecks
- How do I truly maximize GPU power for ptychography with 4D-STEM data?
The maximization challenge shows ptychography iterates through scan positions repeatedly with 50 to 200 iterations where each requires FFTs, probe updates, and object updates. Naive implementation barely uses 10 to 20 percent of GPU potential. You see 1,000 GB per second theoretical bandwidth but achieve only 100 GB per second.
See also: What if my 4D-STEM data doesn’t fit in GPU memory? for general strategies, How do I handle 4D-STEM data larger than GPU memory? for chunked processing approaches.
- What is the real bottleneck in GPU ptychography?
Memory bandwidth, not compute cores. GPUs sit idle waiting for data. Six key strategies maximize performance: keep all data on GPU throughout iterations saving 83% time by eliminating CPU-GPU transfers, batch 64 to 256 patterns together to saturate memory bandwidth, use async streams to overlap data loading with computation achieving 90% GPU utilization, tile overlapping scan positions to exploit spatial locality, use mixed precision with float32 compute and float64 accumulation for 2x speedup, and use multi-GPU data parallelism for datasets exceeding 80 GB. Most critical: avoid transferring data every iteration and batch operations.
- What is the theoretical maximum speedup achievable?
H100 achieves 740 milliseconds for 100 iterations on 17 GB dataset. RTX 4090 takes 2,200 milliseconds. 8 H100s reach 92 milliseconds with NVLink. CPU baseline takes 340 seconds. This gives 154x speedup for single RTX 4090 and 3,700x for 8 H100s.
- What speedup do real codes achieve?
Real-world codes achieve 60 to 80% of theoretical: 100 to 120x for single GPU and 2,000 to 3,000x for 8-GPU clusters.
- What percentage of GPU power do real ptychography codes achieve?
Memory bandwidth utilization is key metric. SHARP, Ptychopy, and ssc-cdi achieve 70 to 85% of peak HBM bandwidth (excellent). py4DSTEM with CuPy achieves 40 to 60% (good with easy interface). Poor research codes achieve only 10 to 30%. Use
nvidia-smi dmon to monitor: over 70% means good performance, under 50% needs better batching.
- Why is compute utilization lower than memory utilization?
Compute utilization typically reaches 30 to 50% of peak FLOPS because ptychography is memory-bound, not compute-bound.
- Practical checklist to maximize GPU for your ptychography workflow
Load all data to GPU once, eliminating iteration transfers. Use batched cuFFT with 64 to 256 patterns per call. Enable async streams with 3-stage pipeline keeping GPU busy. Process spatially adjacent scans to exploit probe overlap. Use mixed precision for 2x speedup. Profile with Nsight Systems and monitor with
nvidia-smi dmon targeting over 70% memory utilization.
- What total speedup is possible from optimization?
Consider multi-GPU only if dataset exceeds 80 GB. Strategies 1-3 can provide 10 to 50x gains, strategies 4-6 may add another 2 to 5x: potentially 20 to 250x total speedup from optimization alone.
- When should I use single-GPU vs multi-GPU for ptychography?
Single GPU (RTX 4090 or A100) works when dataset fits memory (256×256 scan = 17 GB under 24 GB), time is not critical, budget is limited ($1.6K vs $10K+), or for development and testing. Multi-GPU works when dataset exceeds single GPU memory (512×512 scan = 270 GB needs 4-8 GPUs), time is critical (batch processing 100+ datasets overnight), or real-time requirements exceed 2 kHz acquisition.
- What is multi-GPU scaling efficiency for ptychography?
Scaling efficiency: 2 GPUs give 1.8-1.9x speedup, 4 GPUs give 3.4-3.7x, 8 GPUs give 6-7x showing diminishing returns. Start single GPU and scale when bottlenecks identified.
See also: What are multi-GPU and mixed precision strategies?, multi-gpu-interconnect, gpu-specs-ptychography for hardware selection.
- What role do tensor cores play in ptychography?
Tensor cores found in RTX and A100 GPUs accelerate matrix operations in iterative update steps. They provide significant speedups for deep learning-style computations and are increasingly used in advanced ptychography algorithms including neural network based phase retrieval and learned reconstruction methods.
These specialized units perform mixed-precision matrix multiplies with accumulate operations at very high throughput. Reference provides implementation details [1].
Ptychography software packages
- What is SHARP and what does it do?
SHARP from 2016 is a distributed GPU solver using CUDA, optimized for synchrotron facilities. It implements difference map and maximum likelihood algorithms for ptychography providing robust phase retrieval.
Designed for large-scale experiments, it supports multi-GPU clusters and high-throughput data processing. Handles datasets from synchrotron beamlines where continuous data streams require efficient processing. Reference provides technical details [4].
- What is Ptychopy and how is it used?
Ptychopy (2022) is a GPU framework written in CUDA-C, supporting ePIE, DM, and LSQML algorithms for real-time, terabyte-scale data processing. It is designed for speed and efficiency in modern ptychography experiments. Reference: [3].
- What is ssc-cdi and when is it needed?
ssc-cdi (2024) is a memory-efficient multi-GPU package for handling extreme datasets (>1 TB). It features automatic GPU memory management and is suitable for large-scale, high-resolution ptychography. Reference: [1].
- What are other notable GPU ptychography packages?
PtychoShelves (2020) is a high-level framework for ptychography. py4DSTEM is a Python library with GPU-accelerated phase retrieval. Most packages support both single-GPU workstations and multi-GPU clusters, making them flexible for different research environments.
- Do ptychography packages use libraries or custom kernels?
Most ptychography packages use a hybrid approach: cuFFT for forward/backward propagation (NVIDIA-optimized for 10+ years), cuBLAS for matrix operations, and Thrust for parallel primitives. These libraries handle memory-bandwidth-optimized FFTs and matrix math, providing high performance with minimal custom code.
- When do developers write custom kernels for ptychography?
Custom kernels are written for update steps (ePIE, DM equations), probe overlap handling, constraint application (positivity, support), and memory management. These operations require specialized logic not covered by standard libraries. Typically, 70% of compute is handled by cuFFT, 20% by custom update kernels, and 10% by data movement. Developers may write ~500 lines of CUDA for these custom parts, leveraging millions of lines in cuFFT.
- What does a ptychography update kernel look like?
Typical update kernel uses atomic operations for thread-safe array updates. Each thread processes a scan position, loads probe and exit wave, applies update step. Example from Ptychopy’s
ob_update_ML.cu:

```cpp
template <class T>
__device__ inline void atomicAdd(complex<T>* x, const complex<T>& y)
{
    auto xf = reinterpret_cast<T*>(x);
    atomicAdd(xf, y.real());
    atomicAdd(xf + 1, y.imag());
}

// ...kernel launch and address mapping omitted for brevity...

    for (int c = tx; c < C; c += blockDim.x) {
      complex<float> obj_val  = obj[b * I + c];
      complex<float> exit_val = exit_wave[b * C + c];
      // ML update: obj += fac * conj(probe) * exit_wave
      complex<float> update = conj(obj_val) * exit_val * fac;
      // Atomic add to handle overlapping probes
      atomicAdd(&obj[b * I + c], update);
    }
  }
}
```
- What makes ptychography kernels perform well on GPUs?
Atomic operations handle overlapping scan positions where multiple probes updating same object pixel need thread-safe accumulation. Address indirection through
the addr array maps scan positions to object, probe, and exit wave locations. Strided loops like b += blockDim.y ensure coalesced memory access where neighboring threads access consecutive addresses for a 10 times bandwidth improvement.
- How is time distributed in ptychography GPU kernels?
Kernels execute after FFT back-propagates exit wave from detector to sample plane: 70% of time spent in FFT, 20% in kernels, 10% in data transfer. Memory-bound nature means loading 3 complex arrays and writing 1. GPU achieves 500 to 1,000 GB per second versus CPU’s 80 GB per second. Ptychopy has similar kernels for probe update (
pr_update_ML.cu), difference map (ob_update.cu), and WASP algorithm (ob_update_wasp.cu).
- How does real-time GPU ptychography work?
Streaming architecture: detector acquires at 1,000-10,000 Hz, streams over 10 GbE (1.25 GB/s) to GPU server, reconstructs on-the-fly with <1s latency [5]. At 1 kHz with 256×256 detector = 256 MB/s (within 10 GbE capacity). Colin Ophus’s riCOM achieves real-time center-of-mass at acquisition speed [6].
- What enables higher acquisition rates?
Binary 4D-STEM enables 12.5 kHz by recording 0/1 counts (8-32× smaller data) [7]. Deep learning at edge compresses before transmission [5]. Critical: asynchronous pipeline—acquire next pattern while processing current.
- What is the complete 4D-STEM data pipeline from detector to storage?
Modern 4D-STEM systems handle massive data rates through careful pipeline design. Detectors acquire at 100,000 frames per second generating 3.5 GB/s of raw data. This connects either directly to GPU memory through GPUDirect RDMA bypassing CPU entirely, or traditionally through system RAM with PCIe bottleneck overhead. GPU processing applies parallel kernels for reconstruction, analysis, and compression. Results flow to local NVMe storage at 7 GB/s, then asynchronously transfer to network storage.
- What advantage does GPUDirect RDMA provide?
GPUDirect RDMA eliminates traditional bottleneck of copying through system memory, enabling sustained high-speed acquisition without data loss.
- Why use NVIDIA Holoscan instead of direct GPU programming?
Holoscan = streaming pipeline framework handling detector, network, GPU, and display plumbing. Provides: (1) Zero-copy GPUDirect RDMA (critical at 10 GB/s), (2) pipeline scheduling (overlap acquisition/transfer/compute/visualization), (3) hardware abstraction, (4) pre-optimized operators (FFT, filtering, display). Direct CUDA = computation only. Holoscan = complete microscope-to-visualization system. For facilities processing TB/hour, saves months of infrastructure development.
- What are typical ptychography GPU speedups?
Single GPU (RTX 3090) versus 16-core CPU achieves 20 to 50 times speedup for forward and backward propagation. Multi-GPU systems show near-linear scaling for 4 to 8 GPUs with NVLink [2]. Early implementations (2010s) achieved order-of-magnitude speedups while modern implementations (2020s) add 2 to 5 times additional gains through tensor cores and optimized FFT libraries [2].
- What about real-time streaming latency?
Real-time streaming achieves less than 1 second latency from acquisition to reconstruction [5]. Performance depends on scan size, detector size, algorithm, and convergence criteria.
Ptychography algorithms and implementation
- What ptychography algorithms are GPU-friendly?
Different ptychography algorithms have varying GPU compatibility. ePIE uses sequential updates with moderate GPU utilization and converges reliably. DM uses parallel updates across all positions with excellent GPU efficiency, running faster than ePIE but less robust to noise. LSQML provides highest quality and is most intensive, benefiting most from GPU acceleration. Gradient-based methods like Adam use automatic differentiation where tensor cores accelerate backpropagation. The choice depends on your needs where DM works for speed with clean data, ePIE handles noisy data well, and LSQML delivers quality when critical. GPU batch processing favors parallel updates like DM and gradient descent over sequential approaches like ePIE.
- How does GPU parallelize ePIE across scan positions?
The extended Ptychographic Iterative Engine traditionally updates scan positions sequentially moving through position 1, then 2, then 3, which limits GPU utilization. GPU parallelization uses several strategies to improve performance. Batch updates group non-overlapping positions into independent batches, processing batch 1 with positions having no probe overlap in parallel, then batch 2, then batch 3. A typical 512 by 512 scan with 50% overlap requires 4 to 8 batches where each batch processes over 1,000 positions simultaneously on GPU, often running 50 to 200 times faster than CPU sequential processing. Position ordering reorders the scan to maximize batch size, converting spiral scan into checkerboard pattern where every other position can update together safely. Momentum methods replace sequential updates with parallel gradient accumulation, averaging updates from all positions and applying simultaneously, offering less stability but full parallelization. The trade-offs show pure sequential ePIE remains robust but slow with GPU idle 90% of time, batched ePIE typically runs 10 to 50 times faster with slight convergence slowdown of 5 to 10% more iterations, and fully parallel approaches may run 50 to 100 times faster but may diverge on noisy data. Using batched ePIE with 4 to 8 batches provides good balance, with reference implementations in Ptychopy and py4DSTEM.
- How does DM algorithm achieve better GPU parallelization than ePIE?
Difference Map updates all scan positions simultaneously, making it inherently parallel. DM is GPU-friendly because all positions get processed in one kernel launch with no sequential dependency. Every position performs identical operations including FFT, constraint, and inverse FFT, creating uniform workload. The algorithm loads all exit waves together, processes them, and writes back with coalesced memory access. High occupancy means 10,000 plus threads run active simultaneously compared to ePIE’s 100 to 1,000 in batched mode. Performance metrics show DM achieves 80 to 95% GPU utilization versus ePIE’s 30 to 60%. On 512 by 512 scan, DM iteration takes 2 seconds, batched ePIE takes 5 seconds, and sequential ePIE takes 300 seconds. DM isn’t always preferred because it proves less robust to noise, illumination variation, and incomplete data where ePIE converges successfully. A hybrid approach starts with DM for fast initial convergence then switches to ePIE for refinement, with GPU efficiently handling both by dispatching different kernels based on algorithm stage.
- When should I use GPU for ptychography?
Use GPU when: iterative reconstruction time exceeds acceptable wait (>30 min on CPU), real-time or near-real-time feedback needed during acquisition [4], large datasets (>100K diffraction patterns), exploring parameter space (testing multiple initialization, regularization values), batch processing many datasets. CPU sufficient for: small scans (<64×64), quick preview reconstructions with relaxed convergence, educational demonstrations, algorithm prototyping where development speed matters more than execution speed.
- What are convergence criteria for ptychography and how do I monitor them on GPU?
Ptychography uses four main convergence metrics to track reconstruction quality where R-factor or residual error calculates
sqrt(sum((I_measured - I_calculated)²) / sum(I_measured²)) and should decrease monotonically each iteration, with typical targets below 0.1 for good data and below 0.05 for excellent data; stagnation indicates convergence or algorithmic failure. Object update norm measures sum(|object_new - object_old|²), which should decrease and stabilize; large oscillations indicate divergence and a step size that is too large. Probe power stability monitors sum(|probe|²), which should remain constant within 1 to 5%; large drift indicates beam instability or failing probe constraints. Phase gradient smoothness tracks local phase variance; noisy phase indicates insufficient iterations or regularization. GPU implementation computes metrics in parallel: the R-factor reduces 65,000 patterns to a single number using a parallel reduction kernel in less than 1 millisecond, while the update norm is computed during the update kernel with zero overhead. Plot every iteration and log to file for monitoring; early stopping halts when the R-factor change drops below 0.001 for 10 iterations (convergence) or when the R-factor increases for 3 consecutive iterations (divergence), saving compute time and preventing overfitting. For example, a 512 by 512 scan with a 100 iteration target sees the R-factor drop from 0.45 to 0.08 by iteration 60 with changes below 0.001 afterwards, allowing a stop at iteration 70 and saving 30 iterations, about 1 minute.
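A minimal CuPy sketch of these metrics as single-number reductions, with a simple early-stopping check; array names and the stopping thresholds mirror the values above but are otherwise illustrative:

```python
import cupy as cp

# Convergence metrics as parallel reductions: only one scalar per metric is
# copied back to the host each iteration.
def r_factor(I_measured, I_calculated):
    num = cp.sum((I_measured - I_calculated) ** 2)
    den = cp.sum(I_measured ** 2)
    return float(cp.sqrt(num / den))

def update_norm(obj_new, obj_old):
    return float(cp.sum(cp.abs(obj_new - obj_old) ** 2))

# Early-stopping bookkeeping stays on the CPU (a few floats per iteration).
history = []
def should_stop(r, patience=10, tol=1e-3):
    history.append(r)
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    return max(recent) - min(recent) < tol   # R-factor has stagnated
```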
- How do I handle partial coherence in GPU ptychography?
Partial coherence from finite source size, energy spread, and vibrations requires additional probe modes. The multi-mode approach represents the probe as a sum of orthogonal modes using
probe_total = Σ probe_i × sqrt(eigenvalue_i) with typically 3 to 5 modes.
- How does GPU implement multi-mode ptychography?
GPU implementation stores all modes in VRAM where 5 modes times 256 by 256 times 8 bytes equals 2.6 MB (negligible). Forward propagates all modes independently with 5 times FFTs in parallel each iteration. Sums intensities as
I_total = Σ|FFT(probe_i × object)|². Backward propagates each mode with a shared gradient and orthogonalizes modes using Gram-Schmidt in parallel.
- What is the performance impact of multi-mode?
5 modes equals 5 times FFTs but fully parallel, adding 20 to 30% time versus single mode. Memory scales linearly: 5 modes on 512 by 512 scan equals 3.3 GB additional (fits RTX 4090). Convergence with multi-mode proves more robust needing fewer iterations (60 versus 100), providing net time savings. Use when working with synchrotron data, noisy data, or poor single-mode convergence.
- What regularization strategies work well on GPU for ptychography?
Five common methods: Total Variation penalizes
|∇object| to encourage smooth phase (GPU computes the gradient with finite differences, 1-2 ms per iteration, useful for noisy data). Tikhonov penalizes |object|² to prevent overfitting (GPU performs element-wise multiply and reduction, useful for ill-conditioned problems). Support constraint enforces object = 0 outside a known region (GPU performs a binary mask multiply in under 1 ms, useful for isolated samples).
- What other regularization constraints are useful?
Positivity enforces real(object) ≥ 0 or phase bounds (GPU performs element-wise clamp like
max(object, 0), useful when physical constraints are known). Probe power constraint fixes sum(|probe|²) = constant (GPU uses parallel reduction and normalization in 2 ms; recommended to always use).
- How do I combine regularization methods?
Apply sequentially each iteration: object update, then TV, then support, then positivity with total overhead below 5 milliseconds. Hyperparameters like TV weight (0.001-0.01) get tuned by monitoring R-factor versus smoothness. GPU enables rapid parameter sweeps running 10 configurations in 30 minutes versus 5 hours on CPU.
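A hedged CuPy sketch of applying regularizers sequentially after the object update; the TV step here is a cheap smoothing surrogate rather than a full TV proximal operator, and the weights, mask, and shapes are illustrative:

```python
import cupy as cp

def apply_regularization(obj, support_mask, tv_weight=0.005):
    # Smoothing surrogate for a TV step: nudge the object toward its local average.
    smoothed = 0.25 * (cp.roll(obj, 1, 0) + cp.roll(obj, -1, 0) +
                       cp.roll(obj, 1, 1) + cp.roll(obj, -1, 1))
    obj = (1 - tv_weight) * obj + tv_weight * smoothed

    # Support constraint: zero outside the known sample region.
    obj = obj * support_mask

    # Positivity on the real part (element-wise clamp), imaginary part unchanged.
    obj = cp.maximum(obj.real, 0) + 1j * obj.imag
    return obj

obj = cp.ones((512, 512), dtype=cp.complex64)
mask = cp.ones((512, 512), dtype=cp.float32)   # illustrative all-ones support
obj = apply_regularization(obj, mask)
```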
- How do I initialize probe and object for GPU ptychography?
Probe initialization: Fourier transform of bright-field averages all patterns and applies inverse FFT (best with coherent illumination). Gaussian uses
exp(-r²/2σ²) × exp(i×phase) with known beam size from the FWHM (simple but effective). Measured empty acquires a pattern with no sample and takes the square root of intensity (most accurate).
- What about object initialization?
Object initialization: uniform transmission setting
object = 1 + 0i lets the algorithm find features robustly. Random phase uses object = 1 × exp(i×random) to avoid local minima. Low-resolution reconstruction runs quick ePIE with relaxed convergence (10 iterations) then refines (fast preview).
- What GPU advantage exists for initialization?
GPU allows trying multiple initializations in parallel by launching 4 GPU streams each with different probe and object, selecting best R-factor after 20 iterations. Takes 3 times single run time but finds better solution. Recommended: probe from FT of bright-field or Gaussian (1-2 minutes setup), object as ones (zero cost), avoiding manual tuning unless convergence fails.
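A minimal CuPy sketch of the recommended cheap initializations (Gaussian probe from an assumed beam FWHM, uniform object):

```python
import cupy as cp

# Gaussian probe from an assumed beam size, plus a uniform-transmission object.
det = 256
fwhm_px = 40                          # illustrative beam FWHM in pixels
sigma = fwhm_px / 2.355               # FWHM -> standard deviation

y, x = cp.mgrid[-det // 2:det // 2, -det // 2:det // 2]
r2 = x.astype(cp.float32) ** 2 + y.astype(cp.float32) ** 2
probe = cp.exp(-r2 / (2 * sigma ** 2)).astype(cp.complex64)   # flat phase to start
probe /= cp.sqrt(cp.sum(cp.abs(probe) ** 2))                  # normalize probe power

obj = cp.ones((1024, 1024), dtype=cp.complex64)               # uniform transmission
```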
- What are common ptychography artifacts and their causes?
Ringing: circular bands around features from insufficient probe overlap below 60%. Streaking: parallel lines from position errors, vibrations. Low contrast: washed out from under-iteration and poor convergence. Checkerboard: alternating artifacts from NaN propagation. Phase wrapping: discontinuous jumps from phase above π. Blurring: loss of detail from over-smoothing. Probe contamination: sample features in probe from mode mixing.
- How does GPU help detect and fix artifacts?
Ringing: GPU FFT of phase checks power spectrum for high-frequency oscillations. Fix: increase overlap or apodize probe. Streaking: directional gradient analysis checks
|∂φ/∂x| versus |∂φ/∂y| imbalance. Fix: position refinement. Low contrast: R-factor above 0.15. Fix: more iterations. Checkerboard: cp.isnan(object).any(). Fix: check numerics. Phase wrapping: phase gradient above π. Fix: unwrapping. Blurring: decreased high-frequency FFT power. Fix: reduce TV weight.
- How should I monitor reconstruction progress?
GPU monitoring computes metrics every 10 iterations with overhead below 10 milliseconds, flags anomalies, and stops if diverging. Display intermediate results with
cp.asnumpy(object[::4,::4]) for a quick visual check without a full transfer.
- Can I run a full ptychography reconstruction in web browsers?
Full reconstruction unlikely since typical datasets exceeding 10 GB surpass browser memory limits around 2 GB ArrayBuffer.
- Can I visualize GPU ptychography in web browsers?
Web visualization of server computation works well. Server runs GPU reconstruction using CUDA and Python, streams results via WebSocket (phase image, residual, convergence metrics). Browser displays real-time progress using WebGL and Canvas2D. User adjusts parameters triggering server restart. Architecture: microscope → server GPU → web interface, enabling remote monitoring, collaborative viewing, no local installation. Examples: Jupyter notebooks with live widgets, custom Three.js dashboards.
- What acquisition strategies optimize GPU ptychography?
Sparse sampling: randomly samples 10 to 30% of scan positions where GPU reconstructs with compressed sensing, reducing acquisition time 3 to 10 times while maintaining quality. Binary detection [7]: records 0 or 1 counts at 12.5 kHz versus 1 kHz conventional, reducing data transfer 8 to 32 times. Adaptive scanning: GPU provides real-time feedback where microscope adjusts scan based on reconstruction quality.
- What other data reduction strategies exist?
Event-driven approach: detector sends only electron hit positions instead of full frames, reducing bandwidth 100 times with GPU processing sparse events. Subsampling: Colin Ophus demonstrated 100 times faster acquisition with 10% sampling using phase retrieval algorithms. Strategy involves balancing acquisition speed, data size, and GPU processing capacity.
- How do position errors affect ptychography and how does GPU help refinement?
Position errors from stage drift, vibrations, calibration cause misalignment between scan grid and actual probe positions. Errors >1 pixel severely degrade reconstruction (phase artifacts, reduced resolution). GPU position refinement: (1) Gradient-based = compute
∂R/∂position for each scan point, optimize with gradient descent. GPU parallelizes the gradient calculation across all positions (<100 ms per iteration). Typical: refine every 10-20 ePIE iterations. (2) Cross-correlation = correlate measured vs calculated patterns, find the shift maximizing overlap. GPU FFT-based correlation (1-2 ms per pattern). (3) Annealing method = test small position perturbations (±2 pixels); GPU runs a 5×5 grid in parallel (25 variants) and selects the best R-factor. Performance: position refinement adds 10-20% overhead but improves the final R-factor by 2-3×. Essential for drifting stages (thermal, mechanical). Implementation: store positions as a GPU array (scan_positions[N,2]), update them during reconstruction, and use the refined positions for FFT interpolation. Monitoring: plot position corrections as a vector field; systematic drift shows a consistent direction, random errors show scatter.
- How does GPU acceleration help scan position jitter correction in real-time?
Scan jitter (high-frequency position variation from stage electronics, vibrations) requires per-frame correction. CPU: Cross-correlation takes 10-50 ms per pattern (too slow for real-time at 1 kHz). GPU: Batched cross-correlation handles 1,000 patterns in <100 ms (1,000× faster in batch). Real-time pipeline: (1) Acquire pattern from detector → (2) GPU computes cross-correlation against reference → (3) Extract position offset (±5 pixels typical) → (4) Interpolate pattern to corrected grid → (5) Feed to reconstruction. Steps 2-4 take <1 ms on GPU. Batched approach: Buffer 100 patterns (100 ms at 1 kHz), GPU processes batch in 10 ms while next batch acquires—overlap hides latency. Accuracy: Subpixel refinement via phase gradient in Fourier space (GPU interpolation, <0.1 pixel precision). Memory: 100-frame buffer = 100 × 256×256 × 4 bytes = 25 MB (negligible). Use cases: Fast detectors (>1 kHz), unstable stages, synchrotron vibrations.
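A minimal CuPy sketch of batched FFT cross-correlation against a reference pattern, returning integer-pixel offsets; subpixel refinement and grid interpolation are omitted, and shapes are illustrative:

```python
import cupy as cp

def batched_shifts(patterns, reference):
    # patterns: (N, H, W) float32, reference: (H, W) float32 -- illustrative shapes
    F_pat = cp.fft.fft2(patterns, axes=(-2, -1))
    F_ref = cp.conj(cp.fft.fft2(reference))
    corr = cp.fft.ifft2(F_pat * F_ref, axes=(-2, -1)).real
    flat = corr.reshape(corr.shape[0], -1).argmax(axis=1)
    dy, dx = cp.unravel_index(flat, corr.shape[-2:])
    # Wrap shifts into [-H/2, H/2) so drift can be negative.
    h, w = corr.shape[-2:]
    dy = (dy + h // 2) % h - h // 2
    dx = (dx + w // 2) % w - w // 2
    return cp.stack([dy, dx], axis=1)

ref = cp.random.random((256, 256)).astype(cp.float32)
batch = cp.random.random((100, 256, 256)).astype(cp.float32)
offsets = batched_shifts(batch, ref)        # (100, 2) integer pixel offsets
```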
- Can I do ptychography parameter tuning on GPU faster than CPU?
Yes, GPU enables exhaustive parameter search where key parameters include step size for object update, probe power constraint weight, TV regularization strength, and number of iterations. Grid search on GPU launches 4 to 8 reconstructions simultaneously on consumer GPU where RTX 4090 can fit 4 times 256 by 256 scans in 24 GB memory with each using different parameters where all run in parallel streams taking 1.2 times time of single run showing slight memory bandwidth contention compared to 4 times time on CPU. Example shows testing 4 step sizes at 0.5, 0.7, 0.9, and 1.1 for 256 by 256 scan with 50 iterations each where GPU requires 6 minutes total for 4 parallel runs while CPU requires 2 hours for sequential processing finding optimal parameters in one experiment runtime. Bayesian optimization approach uses GPU running forward model reconstruction while CPU updates parameter distribution using Gaussian process converging in 10 to 20 trials compared to 100 plus for grid search, while coarse to fine strategy first does 64 by 64 downsampled scan providing 16 times faster processing to identify promising region then refines at full resolution. Practical usage shows most users settle on standard parameters with step size 0.9 and TV weight 0.005 where tuning becomes worth effort for challenging samples, new detector, novel algorithm, or publication-quality results.
- How does data flow from detector to GPU in real-time?
Five-stage physical path moves data from acquisition to processing where first stage at detector captures electrons and digitizes to 256 by 256 pixels creating 128 KB per frame where 1 kHz equals 128 MB per second, second stage at detector computer buffers frames in RAM and packetizes data for transmission, third stage using network where 10 Gigabit Ethernet connects detector PC to GPU server where each packet contains chunk typically 1,500 bytes or 9,000 bytes for jumbo frames and each frame splits into approximately 85 to 1,400 packets depending on packet size, fourth stage at GPU server NIC receives packets and reassembles frames where with GPUDirect RDMA, NIC writes directly to GPU VRAM bypassing CPU reducing latency from approximately 1 millisecond to approximately 50 microseconds, and fifth stage at GPU VRAM receives frames arriving in ring buffer where GPU processes immediately enabling continuous data flow.
- What memory layout optimizations help ptychography GPU performance?
Scan order: Store patterns in acquisition order (row-major typical). Coalesced reads occur when processing sequentially. For batched ePIE, reorder to batch-order grouping non-overlapping positions—preprocessing overhead is small but saves time per iteration. Complex numbers: Use
float32[N,2] for real and imaginary components rather than Python complex. GPU has specialized complex multiply providing faster execution. Array layout: struct-of-arrays [probe_real..., probe_imag...] is better than array-of-structs [[re0,im0],[re1,im1],...] for coalesced memory access. Padding: extend dimensions to multiples of 32 (warp size) or 64 (cache line). Example: 250×250 padded to 256×256 improves performance; 256×256 is already optimal. Pinned memory: allocate host buffers with cp.cuda.alloc_pinned_memory() for 2-3× faster PCIe transfers. Typical layout: patterns[scan_y, scan_x, det_y, det_x, 2] as float32, where the first two indices vary slowly (scan) and the last three vary fast (detector, real/imaginary). Tradeoff: the optimal layout costs extra memory for padding but improves compute speed.
- How does asynchronous execution improve GPU ptychography throughput?
Synchronous (naive): Load data CPU→GPU, wait, launch kernel, wait, copy result GPU→CPU, repeat. GPU sits idle during transfers, CPU sits idle during compute—fully sequential. Asynchronous (optimized): Use CUDA streams to overlap operations. Three concurrent streams: Stream A loads chunk 2 while Stream B processes chunk 1 while Stream C saves chunk 0—operations in flight simultaneously. Implementation: Create streams
stream1 = cp.cuda.Stream(); stream2 = cp.cuda.Stream(), then use stream1.use(); data_gpu.set(data_cpu); kernel(data_gpu). No explicit synchronization is needed between the enqueued operations; the GPU scheduler overlaps work across streams (synchronize only before reading results back). Speedup: a multi-stream pipeline achieves significant speedup (less than the number of streams due to memory bandwidth sharing). Ptychography-specific: five stages, where Stream 1 loads patterns, Stream 2 runs the forward FFT, Stream 3 runs the update kernels, Stream 4 runs the backward FFT, and Stream 5 saves results. Hides latencies effectively. VRAM requirement: need enough memory for 2-3 chunks in flight simultaneously. Smaller datasets fit easily; larger datasets need chunking.
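A minimal two-stream sketch of this copy/compute overlap with CuPy; the pinned-buffer helper, chunk sizes, and the per-chunk "processing" step are illustrative:

```python
import numpy as np
import cupy as cp

# While one stream processes a chunk, the other stream's host-to-device copy is in flight.
def pinned_like(shape, dtype=np.float32):
    n = int(np.prod(shape))
    mem = cp.cuda.alloc_pinned_memory(n * np.dtype(dtype).itemsize)
    return np.frombuffer(mem, dtype, n).reshape(shape)

chunks = [np.random.random((256, 256, 256)).astype(np.float32) for _ in range(4)]
staging = [pinned_like(chunks[0].shape) for _ in range(2)]      # page-locked host buffers
device = [cp.empty(chunks[0].shape, dtype=cp.float32) for _ in range(2)]
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]
results = []

for i, chunk in enumerate(chunks):
    s = i % 2
    streams[s].synchronize()                         # buffer pair must be free to reuse
    staging[s][...] = chunk                          # CPU fill of the pinned staging buffer
    device[s].set(staging[s], stream=streams[s])     # async host-to-device copy
    with streams[s]:                                 # enqueue compute on the same stream
        results.append(cp.abs(cp.fft.fft2(device[s])).mean(axis=(1, 2)))

for st in streams:
    st.synchronize()
print(len(results), results[0].shape)
```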
- What are best practices for GPU memory management in iterative ptychography?
Pre-allocate everything: before the loop starts, allocate object, probe, exit_wave, FFT workspace, and positions. This prevents fragmentation and ensures predictable memory usage. Reuse buffers: a single
exit_wave buffer reused for all positions; avoid allocating new memory each iteration. In-place operations: object *= factor rather than object = object * factor. The former reuses memory, the latter allocates new. CuPy's allocator usually handles this but being explicit is safer. Memory pools: pool = cp.cuda.MemoryPool(); cp.cuda.set_allocator(pool.malloc). Reuses freed blocks faster than cudaMalloc. Monitor usage: cp.cuda.Device().mem_info each iteration; log peak usage to catch leaks early. Explicit free: del large_array; pool.free_all_blocks() after major phases to reclaim memory. Mixed precision storage: store patterns as float16 (half the memory), convert to float32 for compute; GPU casting is fast. Common pitfall: creating temporary arrays in the loop (temp = array.copy()) causes fragmentation. Use pre-allocated buffers instead.
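A minimal CuPy sketch of the pre-allocate, reuse, and monitor pattern; shapes and the loop body are placeholders:

```python
import cupy as cp

# Pre-allocate once, reuse buffers in place, and monitor VRAM every few iterations.
pool = cp.cuda.MemoryPool()
cp.cuda.set_allocator(pool.malloc)

obj = cp.ones((1024, 1024), dtype=cp.complex64)
probe = cp.ones((256, 256), dtype=cp.complex64)
exit_wave = cp.empty((256, 256, 256), dtype=cp.complex64)   # reused every iteration

for it in range(100):
    exit_wave[...] = probe[None, :, :]          # fill the reused buffer in place
    cp.fft.fft2(exit_wave, axes=(-2, -1))       # ... forward model, constraints, updates ...
    obj *= 0.999                                # in-place update, no new allocation

    if it % 10 == 0:
        free_b, total_b = cp.cuda.Device().mem_info
        print(f"iter {it}: {(total_b - free_b) / 1e9:.2f} GB in use, "
              f"pool holds {pool.total_bytes() / 1e9:.2f} GB")

pool.free_all_blocks()                          # reclaim after the major phase
```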
- What are the bottlenecks and timing in detector-to-GPU data flow?
Typical timing breakdown (example with a fast detector and 256×256 frames): frame capture ~1 ms, network transfer ~1 ms, GPU processing varies by operation. Three main bottlenecks: (1) Network: 10 GbE provides ~1.25 GB/s maximum and saturates at high frame rates. (2) Detector readout: some detectors have maximum frame rate limits. (3) GPU processing: complex reconstruction may exceed the available time per frame. GPUDirect benefit: at lower frame rates producing moderate data rates, the CPU RAM path is adequate. At high frame rates producing GB/s data streams, the CPU becomes the bottleneck. GPUDirect allows NIC-to-GPU transfers via PCIe directly with no CPU involvement, reducing latency significantly.
Multi-GPU processing
- What interconnect types exist for GPUs?
PCIe provides 32 to 64 GB per second with approximately 5 microseconds latency offering universal CPU to GPU connection available on every motherboard but slow for GPU to GPU communication.
- What about high-speed GPU interconnects?
NVLink provides 300 to 600 GB per second with approximately 1 microsecond latency offering NVIDIA-only GPU to GPU connection 10 times faster than PCIe but requires expensive server boards costing over $2K and limits to 8 to 16 GPUs.
- What about multi-node interconnects?
InfiniBand provides 12.5 to 50 GB per second with under 1 microsecond latency serving multi-node HPC with 100 plus servers offering ultra-low latency RDMA but very expensive at $5K to $20K per server. Ethernet provides 0.125 to 12.5 GB per second with approximately 100 microseconds latency offering standard networking at low cost but higher latency.
See also: multi-gpu-interconnect for why this matters in ptychography, When should I use single-GPU vs multi-GPU for ptychography? for scaling decisions.
Interconnect Technology Comparison Table
| Technology | Bandwidth | Latency | Topology | Cost/Port | Scale Limit | Best Use Case |
|---|---|---|---|---|---|---|
| PCIe 5.0 | 64 GB/s | ~5 µs | Point-to-point | $0 (built-in) | CPU to 1-4 GPUs | Latest workstations, 2× PCIe 4.0 speed |
| NVLink 3.0 | 300 GB/s | ~1 µs | All-to-all mesh | $2,000/server | 8 GPUs/server | Multi-GPU ptychography, DGX A100 systems |
| NVLink 4.0 | 600 GB/s | ~1 µs | All-to-all mesh | $3,000/server | 8 GPUs/server | H100 systems, maximum GPU-GPU bandwidth |
| InfiniBand NDR | 50 GB/s | <0.5 µs | Switch fabric | $10,000/server | 1,000s nodes | Next-gen supercomputers, TB-scale data |
Decision matrix: For single workstation with 1-2 GPUs, PCIe sufficient (built-in, no extra cost). For 4-8 GPU server with coupled workloads (ptychography overlapping scans), NVLink required (10× faster GPU-GPU). For multi-node facility with 100+ GPUs, InfiniBand needed (ultra-low latency RDMA). For budget multi-node with independent datasets (embarrassingly parallel), 10-25 GbE adequate.
Real-world example: Synchrotron facility processing 1,000 datasets per day uses 8 H100s with NVLink 4.0 inside server (600 GB/s GPU-GPU) connected to storage via 100 GbE and multiple servers via InfiniBand NDR (50 GB/s cross-server).
See also: How do I choose interconnects for 4D-STEM? for detailed 4D-STEM recommendations.
- How do I choose interconnects for 4D-STEM?
Single workstation with 1 to 2 GPUs finds PCIe sufficient for independent datasets. Small cluster with 4 to 8 GPUs in one server requires NVLink for multi-GPU ptychography providing 300 to 600 GB per second GPU communication.
- What about large-scale facilities?
Large facility with 100 plus GPUs across servers needs InfiniBand for TB-scale batch processing using internal NVLink with cross-server InfiniBand. Budget multi-node setup uses 10 to 25 GbE Ethernet working if processing independent datasets with embarrassingly parallel workload.
- Why do different interconnects exist?
PCIe from 1990s pre-dated multi-GPU computing and was meant for single GPU plus peripherals but insufficient for modern multi-GPU workloads. NVLink from 2016 was created when AI training hit PCIe bottleneck where systems couldn’t transfer model weights fast enough.
- What about network interconnects?
InfiniBand from 1999 predates GPU computing and was designed for supercomputer MPI later adopted for multi-node GPU clusters where RDMA enables direct GPU communication across network. Ethernet serves general-purpose needs working well for loosely coupled tasks like independent datasets.
- Why is GPU-to-GPU communication so difficult in distributed computing?
Training large models on 8 GPUs requires synchronizing 175 billion parameters totaling 700 GB every training step. Naive approach has each GPU send gradients to GPU 0 where GPU 0 averages then broadcasts result back creating bottleneck.
- What happens at the bottleneck?
GPU 0 must receive on the order of 700 GB of gradients from the other seven GPUs, overwhelming even NVLink at 300 GB per second and taking about 2.3 seconds per step while GPUs 1 through 7 sit idle, wasting compute resources.
- What makes topology and routing complex in multi-GPU systems?
Data center topology has 8-GPU nodes connected by NVLink at 300 GB per second, multiple nodes connected by InfiniBand at 200 GB per second, and some by Ethernet at 10 to 100 Gb per second creating complex routing paths.
- How does message size affect routing?
Small messages at 1 MB need tree algorithm providing logarithmic time while large messages at 1 GB need ring algorithm avoiding bottlenecks where wrong choice causes 10 times slower performance.
- What is NCCL?
NVIDIA Collective Communications Library optimizes all-reduce, broadcast, and gather for multi-GPU. Automatically uses fastest path from NVLink to PCIe to InfiniBand to Ethernet handling topology discovery and algorithm selection. PyTorch and TensorFlow use NCCL under hood achieving 280 GB per second on 8×A100 (87 percent of theoretical).
- What is all-reduce?
A distributed operation that combines values from all nodes, such as summing gradients, and broadcasts the result. The ring algorithm has each GPU send to the next and receive from the previous; after N-1 steps all GPUs have the result, using about 6 GB of traffic versus 24 GB for the naive approach with its GPU 0 bottleneck, and all network links are used simultaneously.
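A purely pedagogical NumPy simulation of the ring pattern (this is not NCCL; it only illustrates the reduce-scatter plus all-gather communication schedule on in-memory arrays):

```python
import numpy as np

# Simulate ring all-reduce with NumPy "GPUs": N-1 reduce-scatter steps followed
# by N-1 all-gather steps, every link carrying one chunk per step.
n_gpus, chunk = 4, 5
grads = [np.random.random(n_gpus * chunk) for _ in range(n_gpus)]   # one gradient per GPU
bufs = [g.copy() for g in grads]

def seg(i):                      # index range of chunk i
    return slice(i * chunk, (i + 1) * chunk)

# Reduce-scatter: after N-1 steps, GPU r owns the fully summed chunk (r+1) % N.
for step in range(n_gpus - 1):
    for r in range(n_gpus):
        send = (r - step) % n_gpus
        bufs[(r + 1) % n_gpus][seg(send)] += bufs[r][seg(send)]

# All-gather: circulate the completed chunks until every GPU holds the full sum.
for step in range(n_gpus - 1):
    for r in range(n_gpus):
        done = (r + 1 - step) % n_gpus
        bufs[(r + 1) % n_gpus][seg(done)] = bufs[r][seg(done)]

expected = np.sum(grads, axis=0)
assert all(np.allclose(b, expected) for b in bufs)
```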
Memory management
- How do multiple users share one GPU server?
Time slicing schedules one user at a time via queue like SLURM or PBS where GPU idle during transitions but simple. Multi-Instance GPU (MIG) hardware partitions A100 or H100 into 7 isolated instances giving each user dedicated slice with 10 GB memory and 1/7 compute. Virtualization using NVIDIA vGPU allows software multiplexing and oversubscription.
- How do I handle 4D-STEM data larger than GPU memory?
Split scan into tiles like 512×512 becoming 4×256×256 processing each by load, analyze, save, clear VRAM, then move to next implementing
for chunk in dataset.chunks(256, 256): gpu_process(chunk). CUDA streams overlap the next load with the current processing, up to 2 times faster. For a 256 GB dataset on a 24 GB GPU, expect 11 chunks times 20 s, about 220 s total.
See also: What if my 4D-STEM data doesn’t fit in GPU memory? for quick strategies, How do I truly maximize GPU power for ptychography with 4D-STEM data? for ptychography-specific approaches.
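A minimal CuPy sketch of tile-by-tile processing; the memory-mapped file name, dataset layout, and the virtual-detector "analysis" are illustrative:

```python
import numpy as np
import cupy as cp

# Process a hypothetical (512, 512, 256, 256) float32 scan in 64x64-position tiles
# (~1 GB per tile on the GPU), keeping VRAM use bounded.
scan = np.lib.format.open_memmap("scan_4d.npy", mode="r")   # hypothetical file
tile = 64

virtual_image = np.empty(scan.shape[:2], dtype=np.float32)
for y0 in range(0, scan.shape[0], tile):
    for x0 in range(0, scan.shape[1], tile):
        chunk = cp.asarray(scan[y0:y0 + tile, x0:x0 + tile])  # load one tile to the GPU
        # Example analysis: a virtual detector formed by summing all detector pixels.
        virtual_image[y0:y0 + tile, x0:x0 + tile] = cp.asnumpy(chunk.sum(axis=(2, 3)))
        del chunk
        cp.get_default_memory_pool().free_all_blocks()        # keep VRAM use bounded
```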
- What strategies for out-of-core GPU processing?
Out-of-core processes data larger than GPU memory by keeping most on disk or RAM loading pieces as needed for 256 GB dataset on 24 GB GPU. Memory-mapped files using HDF5 or Zarr like
data = h5py.File('data.h5')['dataset'] look like an array where the OS pages from disk, with slow random access but fast sequential access. Manual chunking chunk = np.load(f'chunk_{i}.npy'); gpu_data = cp.asarray(chunk) provides full control, best for custom access patterns. Dask arrays da.from_array(dataset, chunks=(64,64,128,128)) auto-schedule work, good for standard operations but with overhead for custom ones, while streaming processes data during acquisition, saving only compressed results.
No, use explicit transfers because 4D-STEM has predictable access patterns like load entire chunk, process, done where explicit
cp.asarray(chunk) gives full control and best performance. Unified memory is useful for irregular access like graph algorithms, rapid prototyping where performance matters less, or occasionally exceeding VRAM where slower execution beats a crash; unified memory on integrated GPUs like a MacBook actually uses shared physical memory with no migration and works well.
- How do I minimize GPU↔CPU transfers in analysis pipeline?
Keep intermediate results on GPU by chaining operations without
.get() like result = cp.fft.fft2(cp.mean(data_gpu, axis=0)), staying all on GPU with one final transfer. Batch operations, doing all_results = process_batch(data_gpu).get() once instead of looping. Use pinned memory cpu_buffer = cp.cuda.alloc_pinned_memory(size) for 2 to 3 times faster transfers, and async transfers with streams to overlap transfer with compute.
See also: Why is CPU-to-GPU data transfer so slow? for why transfers are slow, What are the benefits of using pinned memory? for faster transfers, What is the recommended workflow for developing GPU code? for best practices.
- What are additional strategies to reduce data transfer overhead?
Transfer float16 instead of float32 when precision allows cutting bandwidth in half. For sparse data compress on CPU, transfer, then decompress on GPU trading CPU time for bandwidth. Typical 4D-STEM pipeline loads full dataset once with all operations on GPU then transfers only final results showing 17 GB input plus 5 MB output versus 17 GB plus 17 GB plus 17 GB if transferring intermediates.
- What are common GPU mistakes?
Frequent CPU to GPU transfers severely impact performance where you should transfer once at start and keep on GPU throughout. Branch divergence occurs when threads in warp take different paths serializing execution requiring minimizing conditionals. Uncoalesced memory shows scattered accesses needing many transactions solved by accessing consecutive elements. Too many small kernels accumulate launch overhead requiring batching operations together. Ignoring profiling leads to optimizing wrong part where using Nsight Systems identifies actual bottlenecks.
See also: How do I minimize GPU↔CPU transfers in analysis pipeline? for transfer optimization, Why are warps fundamental to GPU execution? for branch divergence details, What profiling tools help optimize 4D-STEM on GPU? for profiling.
Optimization workflow
- How do I optimize GPU code?
Start with working CPU code and verify correctness then profile CPU using cProfile to find bottlenecks. Port to GPU using CuPy first since easiest then verify GPU results match CPU exactly using numerical comparison. Profile GPU using Nsight Systems for new bottlenecks then optimize memory patterns for coalescing and adjust thread blocks for occupancy. Fuse operations to reduce kernel launches then profile again iterating until acceptable performance reached.
- What is the optimization sequence?
Minimize CPU to GPU transfers keeping data on GPU throughout then ensure memory coalescing through consecutive access patterns. Maximize occupancy by adjusting threads per block trying 128, 256, or 512 then use shared memory for reused data reducing global memory access. Fuse kernels to eliminate intermediate arrays, apply mixed precision where appropriate, and overlap computation and communication for multi-GPU systems where profiling shows what matters most.
- What causes bank conflicts in FFT shared memory access?
Shared memory divides into 32 banks where stride patterns naturally cause conflicts with multiple threads accessing same bank serializing execution. Example shows threads 0, 16, 32, and 48 all hit bank 0 running 32 times slower where solutions require padding or bit-reversal permutations.
- Why does FFT algorithm selection depend on array size?
Different algorithms optimize for different sizes including Cooley-Tukey for power-of-2, Bluestein for any size, and mixed-radix for factors of 2, 3, 5, and 7 where 256 by 256 uses one code path while 250 by 250 uses another. cuFFT implements approximately 50 variants.
- What makes transpose operations challenging in 2D FFT?
2D FFT requires row FFT then transpose then column FFT where transpose writes columns as rows creating worst possible GPU pattern with non-coalesced access. Optimized transpose via shared memory tiling achieves approximately 60 percent bandwidth requiring careful corner handling for non-square matrices.
- How do twiddle factors affect FFT performance?
Twiddle factors multiply by complex exponentials creating trade-off where you can precompute all for fast execution using 512 KB memory, compute on-the-fly saving memory but with slow sin or cos operations, or use hybrid approach where strategy varies by size.
- Why does work decomposition matter for FFT parallelization?
Work decomposition decides how to split FFT across 10,000 plus threads where options include one thread per output showing poor occupancy, one block per FFT working well for batching, or decompose single FFT across blocks creating synchronization overhead where each requires different kernel.
- Why does FFT require architecture-specific tuning?
Architecture-specific tuning adapts to hardware where Volta has 128 KB shared memory, Ampere has 164 KB, and Hopper has 228 KB leading to different optimal tiling. Warp shuffles on Volta plus enable new patterns where tensor cores accelerate some decompositions requiring separate code for each generation where NVIDIA employs 10 plus engineers on cuFFT full-time since 2006.
- Why is FFT optimization so difficult on GPUs?
FFT is fundamentally memory-bound rather than compute-bound where bottleneck comes from loading data from VRAM rather than arithmetic operations. Complex memory access patterns show FFT uses butterfly operations accessing distant elements with stride patterns like 512, 256, 128, and 64 positions apart creating non-coalesced access where each thread fetches different cache lines running 512 times slower than adjacent reads. Naive FFT wastes 90 percent on memory and only 10 percent on arithmetic where cuFFT achieves less than 5 TFLOPS equaling 6 percent of peak while matrix multiply achieves 60 to 70 TFLOPS equaling 70 to 90 percent of peak because it reuses data 1000 times in cache.
- What does “memory-bound” mean for FFT performance?
Memory-bound means GPU cores sit idle waiting for data to arrive from VRAM, not that data transfer between CPU and GPU is slow. This happens AFTER data already lives on GPU where compute cores can perform math 60 times faster than memory bus delivers numbers to work on. The bottleneck is internal GPU memory bandwidth from VRAM to streaming multiprocessors, not external PCIe transfer from system RAM.
Concrete example using NVIDIA L40s GPU shows internal bandwidth bottleneck clearly. Our research group has GPU server with 4 L40s GPUs where each L40s has 48 GB VRAM with 864 GB per second internal bandwidth and 90 TFLOPS FP32 compute capability. For 4D STEM ptychography with 512 by 512 scan positions and 256 by 256 detector requiring 67 million complex numbers at 1 GB total size already loaded into VRAM, each FFT iteration reads 1 GB from VRAM into GPU cores, performs FFT butterfly operations requiring approximately 2 billion floating point operations, then writes 1 GB results back to VRAM. Memory read time equals 1 GB divided by 864 GB per second equals 1.2 milliseconds while compute time for 2 GFLOPS divided by 90 TFLOPS equals 0.02 milliseconds showing internal memory movement takes 60 times longer than arithmetic. GPU utilization reaches only 1.5 percent where compute cores spend 98.5 percent time idle waiting for next batch of numbers to arrive from VRAM, not doing math. This internal bandwidth bottleneck means FFT optimization focuses on memory access patterns including coalescing reads where 32 threads grab consecutive addresses in one transaction, batching operations to amortize overhead, and keeping intermediate results in fast 10 TB per second shared memory rather than slow 864 GB per second VRAM.
Connection to in-situ 4D STEM streaming matters because memory-bound limitation applies regardless of data source. Whether data arrives from detector through direct memory access at 10 GB per second, loads from NVMe SSD at 7 GB per second, or transfers from CPU RAM through PCIe at 32 GB per second, once data sits in GPU VRAM the same internal 864 GB per second bandwidth bottleneck applies during FFT computation. In-situ streaming with detector generating 100 patterns per second means GPU receives new data every 10 milliseconds but spends only 1.2 milliseconds reading existing data from VRAM for FFT, so streaming rate does not make memory-bound problem worse. The 864 GB per second internal bottleneck dominates regardless of input rate. For comparison, external transfer paths show detector to GPU at 10 GB per second, NVMe to GPU at 7 GB per second, and CPU to GPU at 32 GB per second, all 27 times to 86 times slower than internal 864 GB per second VRAM bandwidth, meaning once data reaches VRAM the internal memory system becomes limiting factor not external data arrival. Similar pattern appears on other GPUs including RTX 4090 with 1000 GB per second internal bandwidth reaching 2 percent utilization and H100 with 3000 GB per second internal bandwidth reaching 5 percent utilization, all fundamentally limited by internal memory bandwidth not external streaming rate or compute capability.
Troubleshooting common errors
- What does “CUDA out of memory” error mean and how do I fix it?
Error:
RuntimeError: CUDA out of memory. Tried to allocate X GB (GPU 0; Y GB total capacity). Cause: attempting to allocate an array exceeding available VRAM. Happens when: dataset too large, memory leak (allocate without freeing), fragmentation, multiple programs using the GPU. Solutions: (1) Free memory: cp.get_default_memory_pool().free_all_blocks() after operations. (2) Reduce batch size: process 64 patterns instead of 256. (3) Mixed precision: use float16 (halves memory). (4) Chunked processing: load/process/free in a loop. (5) Check for leaks: nvidia-smi shows memory usage, which should decrease after free_all_blocks(). (6) Kill other processes: close browsers and other Python scripts. (7) Upgrade GPU if the dataset is fundamentally too large. Prevention: calculate memory before allocating: data_size_gb = scan_x * scan_y * det_x * det_y * 4 / 1e9. Add a 20% margin for workspace.
See also: What if my 4D-STEM data doesn’t fit in GPU memory?, How do I handle 4D-STEM data larger than GPU memory?, memory-capacity-ptychography.
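A minimal CuPy sketch of the prevention step, checking the estimate against free VRAM and emptying the memory pool before giving up; scan and detector sizes are examples:

```python
import cupy as cp

# Estimate the allocation up front and compare against free VRAM, freeing cached
# pool blocks before concluding the dataset will not fit.
scan_x = scan_y = 256
det_x = det_y = 256
data_size_gb = scan_x * scan_y * det_x * det_y * 4 / 1e9     # float32 stack
needed_gb = data_size_gb * 1.2                                # ~20% margin for workspace

free_b, total_b = cp.cuda.Device().mem_info
print(f"need ~{needed_gb:.1f} GB, free {free_b / 1e9:.1f} GB of {total_b / 1e9:.1f} GB")

if needed_gb * 1e9 > free_b:
    cp.get_default_memory_pool().free_all_blocks()            # return cached blocks first
    free_b, _ = cp.cuda.Device().mem_info
    if needed_gb * 1e9 > free_b:
        raise MemoryError("dataset will not fit; chunk it or reduce the batch size")

data = cp.empty((scan_y, scan_x, det_y, det_x), dtype=cp.float32)
```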
- What causes “illegal memory access” errors on GPU?
Error:
`RuntimeError: CUDA error: an illegal memory access was encountered.` Causes: (1) Out-of-bounds indexing—`array[i]` where `i >= len(array)`. (2) Invalid pointer—accessing freed memory. (3) Misaligned access—reading from an unaligned address. (4) Race condition—threads simultaneously write the same location without atomics. Debug: Run with `compute-sanitizer python script.py`—shows the exact line and thread causing the error (slow but precise). Common 4D-STEM bugs: Forgetting the `if i < N` check in a kernel, wrong dimensions (256×256 vs 128×128), integer overflow in index calculation (`i * j` exceeds 2³¹). Prevention: Always bounds-check: `if idx < array.size: array[idx] = value`. Validate dimensions: `assert input.shape == expected_shape` before kernels. Use `.get()` to inspect intermediate arrays.
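A minimal bounds-checked `RawKernel` sketch illustrating the `if (i < n)` guard; the kernel and names are illustrative, not taken from any specific 4D-STEM code:

```python
import cupy as cp

# Illustrative RawKernel with an explicit bounds check so the last, partially
# filled block cannot write past the end of the array.
scale_kernel = cp.RawKernel(r'''
extern "C" __global__
void scale(const float* x, float* y, float factor, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {              // the guard that prevents illegal memory access
        y[i] = factor * x[i];
    }
}
''', 'scale')

x = cp.arange(1000, dtype=cp.float32)
y = cp.empty_like(x)
threads = 256
blocks = (x.size + threads - 1) // threads          # 4 blocks -> 1024 threads for 1000 elements
scale_kernel((blocks,), (threads,), (x, y, cp.float32(2.0), cp.int32(x.size)))
assert cp.allclose(y, 2 * x)
```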
- How do I interpret cuFFT error codes?
CUFFT_INVALID_PLAN: Plan not created or already destroyed. Recreate the plan. CUFFT_ALLOC_FAILED: Insufficient memory for the FFT workspace. Reduce batch size or use a smaller FFT. CUFFT_INVALID_VALUE: Wrong parameters (negative size, NULL pointer). Check dimensions: `assert data.shape[0] > 0`. CUFFT_EXEC_FAILED: Execution error during FFT, often NaN/Inf in input data. Check: `assert not cp.isnan(data).any()`. CUFFT_SETUP_FAILED: Old GPU or incompatible CUDA version. Update drivers. CUFFT_INVALID_SIZE: Dimensions exceed limits (max 2D FFT size ~32768×32768). Split larger transforms. Debug: Print the shape before the FFT: `print(f"FFT input: {data.shape}, dtype: {data.dtype}")`. Ensure contiguity: `data = cp.ascontiguousarray(data)` before the FFT. Most issues: NaN in data (check upstream preprocessing) or wrong dimensions.
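These checks can be collected into a small guard function; this is an illustrative sketch, not part of cuFFT or CuPy:

```python
import cupy as cp

def checked_fft2(data):
    """Illustrative guard around a batched 2D FFT using the checks above."""
    assert data.ndim >= 2 and data.shape[-1] > 0 and data.shape[-2] > 0, data.shape
    assert not cp.isnan(data).any() and not cp.isinf(data).any(), "NaN/Inf in FFT input"
    data = cp.ascontiguousarray(data)   # contiguity avoids layout surprises before cuFFT
    print(f"FFT input: {data.shape}, dtype: {data.dtype}")
    return cp.fft.fft2(data)
```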
- Why is my GPU code slower than expected?
Ten common causes: (1) CPU↔GPU transfers—moving data between every operation. Keep it on the GPU. (2) Small batches—launching 10 kernels for 10 patterns. Batch into 1 kernel for 10 patterns. (3) Uncoalesced memory—scattered access patterns. Ensure consecutive threads access consecutive memory. (4) Thermal throttling—temperature >85°C reduces clocks 30-50%. Improve cooling. (5) PCIe bottleneck—PCIe 3.0 instead of 4.0 (half the bandwidth). Check: `lspci -vv | grep LnkSta`. (6) Wrong precision—float64 on a consumer GPU (64× slower). Use float32. (7) Synchronous operations—`.get()` blocks the pipeline. Use async. (8) CPU preprocessing—NumPy operations before the GPU. Move them to the GPU. (9) Memory fragmentation—restart the GPU process. (10) Shared GPU—another user consuming resources. Check `nvidia-smi`. Profile first: `nsys profile python script.py` identifies the actual bottleneck. Don’t guess. See also: How do I benchmark my GPU performance for 4D-STEM? for performance baselines, How do I truly maximize GPU power for ptychography with 4D-STEM data? for optimization strategies.
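To make cause (1) concrete, here is an illustrative anti-pattern and its fix for a simple virtual-image calculation (function names are hypothetical):

```python
import cupy as cp

# Illustrative anti-pattern vs. fix: the slow version round-trips through the CPU
# for every pattern; the fast version uploads once and keeps the batch in VRAM.

def slow_virtual_image(patterns_cpu, mask_cpu):
    out = []
    for p in patterns_cpu:                     # one upload + one download per pattern
        out.append(float(cp.sum(cp.asarray(p) * cp.asarray(mask_cpu)).get()))
    return out

def fast_virtual_image(patterns_cpu, mask_cpu):
    patterns = cp.asarray(patterns_cpu, dtype=cp.float32)   # single upload
    mask = cp.asarray(mask_cpu, dtype=cp.float32)
    return cp.sum(patterns * mask, axis=(-2, -1)).get()     # one batched reduction, one download
```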
- What profiling tools help optimize 4D-STEM on GPU?
NVIDIA Nsight Systems provides a timeline profiler showing GPU utilization, kernel launches, and memory transfers; it is best for finding where time goes and launches with `nsys profile python script.py`. NVIDIA Nsight Compute offers a deep dive into a single kernel, showing occupancy, memory throughput, and warp stalls; launch with `ncu --set full python script.py`. The CuPy profiler (`cupy.prof.profile()`) gives quick checks without a GUI, while the PyTorch profiler (`torch.profiler.profile()`) captures CPU plus GPU activity and exports to Chrome trace format. For 4D-STEM, use Nsight Systems first to find the bottleneck, then Nsight Compute to optimize the specific kernel. See also: How do I validate GPU results against CPU? for verifying correctness, What is the recommended workflow for developing GPU code? for the complete development process.
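For quick checks from Python, recent CuPy versions also ship `cupyx.profiler.benchmark`; a minimal sketch (array sizes and the workload are arbitrary):

```python
import cupy as cp
from cupyx.profiler import benchmark

patterns = cp.random.random((256, 256, 256)).astype(cp.float32)

def step(x):
    # Stand-in workload: per-pattern power spectrum.
    return cp.abs(cp.fft.fft2(x)) ** 2

# benchmark() synchronizes the device and reports CPU and GPU times separately,
# avoiding the classic mistake of timing only the asynchronous kernel launch.
print(benchmark(step, (patterns,), n_repeat=20))
```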
- How do I validate GPU results against CPU?
Use element-wise comparison with `assert np.allclose(cpu_result, gpu_result.get(), rtol=1e-5, atol=1e-8)`, setting a relative tolerance of 1e-5 for float32. Compare mean, std, min, and max, accepting under 0.01 percent difference. Display outputs side-by-side to spot systematic shifts or artifacts, and check physical constraints like positivity, conservation, and symmetry (e.g., virtual images must be real and positive). See also: What causes differences between CPU and GPU results? for understanding numerical discrepancies, What precision (float32 vs float64) for different operations? for precision considerations.
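These checks can be wrapped in a small helper; an illustrative sketch (the physical-constraint check at the end applies to virtual images and should be dropped for signed quantities):

```python
import numpy as np
import cupy as cp

def validate(cpu_result, gpu_result, rtol=1e-5, atol=1e-8):
    """Illustrative helper: compare a CPU reference against a GPU result."""
    g = cp.asnumpy(gpu_result)
    assert np.allclose(cpu_result, g, rtol=rtol, atol=atol), "element-wise mismatch"
    for name, f in [("mean", np.mean), ("std", np.std), ("min", np.min), ("max", np.max)]:
        c, v = f(cpu_result), f(g)
        rel = abs(c - v) / (abs(c) + 1e-12)
        assert rel < 1e-4, f"{name} differs by {rel:.2e}"
    # Example physical-constraint check for a virtual image; drop for signed quantities.
    assert g.min() >= 0, "virtual image should be real and non-negative"
```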
- What causes differences between CPU and GPU results?
Floating-point arithmetic is order dependent: `(a+b)+c ≠ a+(b+c)` due to rounding, and the GPU reorders operations, normally causing differences around 1e-7. cuFFT and FFTW use different algorithms, causing differences around 1e-6. Atomic operations accumulate in non-deterministic order, causing small differences. Use the same precision (float32 vs float64) for a fair comparison. Best practice: validate on a small dataset first (e.g., an 8×8 scan), then trust the GPU for production.
- What precision (float32 vs float64) for different operations?
Use float32 (7 significant digits) for center of mass, virtual imaging, FFT, and template matching, where the GPU runs 4 to 64 times faster. Use float64 (15 significant digits) for ptychography, strain mapping, tomography, and large matrices. Default to float32 and switch to float64 if an iterative algorithm diverges, you are computing small differences, or the condition number of a large matrix exceeds 1e6. Mixed precision stores data in float32 but computes critical steps in float64.
See also: float64-precision-ptychography for when precision matters in ptychography.
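A minimal sketch of the mixed-precision pattern, keeping bulk data in float32 while accumulating in float64 (sizes and names are illustrative):

```python
import cupy as cp

# Keep the bulky stack in float32, but accumulate in float64 where rounding piles up.
patterns = cp.random.random((1024, 256, 256)).astype(cp.float32)

naive   = patterns.sum(dtype=cp.float32)   # single-precision accumulation
careful = patterns.sum(dtype=cp.float64)   # double-precision accumulation of float32 data

print(float(naive), float(careful))        # any difference comes from accumulation precision
```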
- Should I batch process many datasets or process one large one?
Batch many small datasets (under 100 MB each, more than 100 of them) for better GPU utilization and memory locality; pipeline overlap helps and the per-dataset launch overhead (under 50 μs) remains manageable. Process one large dataset (over 10 GB) as a single launch with minimal transfers and simpler code. In practice, load 10 to 100 small datasets to the GPU at once, while for large datasets chunk into 1 to 2 GB pieces and pipeline.
- What is the recommended workflow for developing GPU code?
Start with a working NumPy version and verify correctness on a small test case (e.g., an 8×8 scan), then profile the CPU with `python -m cProfile script.py` to identify bottlenecks. Convert only the bottleneck function, replacing `np` with `cp`, then validate with `assert np.allclose(cpu_result, gpu_result.get(), rtol=1e-5)`, requiring agreement within tolerance. Benchmark both versions and calculate the speedup, expecting 10 to 50 times for FFT and array operations, then profile the GPU with `nsys profile` to see where GPU time goes. Optimize memory first (eliminate transfers, ensure coalescing), then optimize compute (adjust block size, use shared memory if needed). Validating at every step prevents debugging nightmares. Prerequisites: See How do I install CuPy and test GPU functionality? for GPU setup first. See also: How do I validate GPU results against CPU? for the validation step, What profiling tools help optimize 4D-STEM on GPU? for profiling tools.
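A compressed sketch of steps 1-4 using a hypothetical center-of-mass function; the point is the structure (NumPy reference, `np` to `cp` swap, small-scan validation), not the specific analysis:

```python
import numpy as np
import cupy as cp

def com_cpu(patterns):
    # Step 1: working NumPy reference (hypothetical center-of-mass analysis).
    iy, ix = np.indices(patterns.shape[-2:])
    tot = patterns.sum(axis=(-2, -1))
    return ((patterns * ix).sum(axis=(-2, -1)) / tot,
            (patterns * iy).sum(axis=(-2, -1)) / tot)

def com_gpu(patterns):
    # Step 3: same code with np -> cp; data stays on the GPU.
    iy, ix = cp.indices(patterns.shape[-2:])
    tot = patterns.sum(axis=(-2, -1))
    return ((patterns * ix).sum(axis=(-2, -1)) / tot,
            (patterns * iy).sum(axis=(-2, -1)) / tot)

# Step 4: validate on a small 8x8 scan before scaling up.
test = np.random.random((8, 8, 64, 64)).astype(np.float32)
cx_cpu, cy_cpu = com_cpu(test)
cx_gpu, cy_gpu = com_gpu(cp.asarray(test))
assert np.allclose(cx_cpu, cx_gpu.get(), rtol=1e-5)
assert np.allclose(cy_cpu, cy_gpu.get(), rtol=1e-5)
```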
Error handling and validation
- How do I check for numerical errors in GPU calculations?
Check for NaN and Inf using `assert not cp.isnan(result).any()` and `assert not cp.isinf(result).any()`, catching division by zero, overflow, and invalid operations. Validate physical constraints: diffraction intensities must be non-negative and phase should lie between −π and π. Compare norms, where `sum(input) ≈ sum(output)` if conservation is expected; a large difference indicates an error. For 4D-STEM the most common issues are NaN from `data / 0`, Inf from exponential overflow, and wrong dimensions.
- What about NaN/Inf handling in diffraction analysis?
Sources of NaN include division by zero, fixed with the `data / (mask + 1e-10)` pattern; `sqrt` of negative numbers, fixed with `cp.sqrt(cp.maximum(data, 0))`; and `log` of zero, fixed with `cp.log(data + 1e-10)`. Sources of Inf include overflow, fixed with `cp.exp(cp.minimum(data, 50))`, and division by a tiny number, fixed by adding an epsilon to the denominator. The GPU silently propagates NaN and Inf through calculations without warning, so add epsilons defensively and assert after each operation.
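The fixes above, collected into small helper functions (names are illustrative):

```python
import cupy as cp

EPS = 1e-10

def safe_normalize(data, mask):
    return data / (mask + EPS)               # avoids NaN from 0/0

def safe_sqrt(data):
    return cp.sqrt(cp.maximum(data, 0))      # avoids NaN from sqrt of negatives

def safe_log(data):
    return cp.log(data + EPS)                # avoids -Inf from log(0)

def safe_exp(data):
    return cp.exp(cp.minimum(data, 50))      # avoids Inf from overflow

def assert_finite(x, name="array"):
    assert not cp.isnan(x).any() and not cp.isinf(x).any(), f"{name} contains NaN/Inf"
```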
- How do I ensure reproducibility with GPU?
Sources of non-determinism include atomic operations (threads update the same location in arbitrary order, creating differences around 1e-7), floating-point associativity (`(a+b)+c ≠ a+(b+c)` due to rounding, and the GPU reorders operations), cuFFT algorithm selection varying by problem size, thread scheduling causing race conditions if not synchronized, and random number generation giving different results with different seeds.
- What steps ensure reproducible GPU results?
Set random seeds using `cp.random.seed(42)` or `torch.manual_seed(42)` for consistency. Avoid atomics or accept the non-determinism: atomics in ptychography cause approximately 1e-7 variance, which should be documented. Use deterministic algorithms where available (PyTorch offers `torch.use_deterministic_algorithms(True)`, though it is slower) and fix the cuFFT plan by reusing the same FFT plan for a consistent algorithm. For 4D-STEM, most operations including FFT, center of mass, and virtual imaging are deterministic; ptychography with overlaps varies by approximately 1e-7, which is acceptable.
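A minimal sketch of these settings; note that PyTorch's deterministic mode can additionally require the CUBLAS_WORKSPACE_CONFIG environment variable for some CUDA operations:

```python
import os
import cupy as cp
import torch

cp.random.seed(42)                            # CuPy RNG
torch.manual_seed(42)                         # PyTorch CPU and GPU RNGs

# Some CUDA operations need this workspace setting before deterministic mode works.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
torch.use_deterministic_algorithms(True)      # raise an error on non-deterministic ops (slower)

# Atomics-based accumulation (e.g. overlapping ptychography updates) can still
# vary at the ~1e-7 level; document that tolerance rather than chase it.
```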
- Where can I learn more about writing custom CuPy kernels?
Resources include the CuPy User Guide section on writing kernels (covering `RawKernel`, `ElementwiseKernel`, and `ReductionKernel`), the NVIDIA CUDA C++ Programming Guide (thread hierarchy and memory model), and the py4DSTEM or ptychopy source code for real implementations. Start simple with `ElementwiseKernel` for pixel-wise operations, then graduate to `RawKernel` for shared memory or atomics. The learning curve is about 1 to 2 weeks for a first kernel, which can be 2 to 10 times faster than NumPy-style CuPy when optimized correctly.
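As a starting point, a hypothetical `ElementwiseKernel` for per-pixel detector correction; CuPy generates the CUDA code and launch configuration, and broadcasting over the batch dimension works as in NumPy:

```python
import cupy as cp

# Hypothetical per-pixel detector correction: dark subtraction, gain, and clipping at zero.
correct = cp.ElementwiseKernel(
    'float32 count, float32 dark, float32 gain',   # inputs (broadcast like NumPy)
    'float32 corrected',                           # output
    'corrected = fmaxf((count - dark) * gain, 0.0f);',
    'detector_correction')

patterns = cp.random.random((64, 256, 256)).astype(cp.float32)
dark = cp.full((256, 256), 0.01, dtype=cp.float32)
gain = cp.ones((256, 256), dtype=cp.float32)
fixed = correct(patterns, dark, gain)              # one fused kernel over the whole batch
```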
- Why does PyTorch use CUDA FFT behind the scenes? How does it work?
`torch.fft.fft2()` calls ATen (PyTorch's C++ library), which routes to cuFFT on NVIDIA GPUs, MKL on Intel CPUs, or Metal Performance Shaders on Mac. cuFFT represents approximately 100 engineer-years of optimization across algorithms (Cooley-Tukey, Bluestein), memory patterns (coalescing, shared-memory tiling), hardware utilization (tensor cores, warp shuffles), architecture-specific tuning (separate code paths for Volta, Ampere, and Hopper), and multi-GPU strategies (NVLink decomposition). PyTorch's overhead is approximately 1 to 5 microseconds, negligible for FFTs larger than 64 by 64.
- How does the ptychography reconstruction algorithm iterate?
Ptychography reconstruction alternates between real space and reciprocal space using iterative projections. The algorithm initializes both probe and object estimates, then for each scan position computes the exit wave by multiplying probe and object. An FFT transforms this to reciprocal space, where the measured diffraction intensities constrain the amplitude while the phase is preserved; an inverse FFT returns to real space, where gradient descent updates both object and probe. Most computation happens in the FFT steps, highly optimized through cuFFT, while the update steps require custom CUDA kernels with atomic operations to handle overlapping scan positions safely.
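A heavily simplified, serial sketch of one ePIE-style sweep to make the loop structure concrete; production GPU codes batch positions, reuse FFT plans, and use atomic updates for the overlapping object regions, and all names and constants here are illustrative:

```python
import cupy as cp

def epie_sweep(obj, probe, positions, measured_amp, alpha=0.1, beta=0.1):
    """One ePIE-style sweep over all scan positions (simplified, serial sketch)."""
    n = probe.shape[0]
    for k, (y, x) in enumerate(positions):
        roi = obj[y:y + n, x:x + n].copy()                   # object patch under the probe
        exit_wave = probe * roi                              # real-space exit wave
        psi = cp.fft.fft2(exit_wave)                         # to reciprocal space
        psi = measured_amp[k] * cp.exp(1j * cp.angle(psi))   # impose measured amplitude, keep phase
        revised = cp.fft.ifft2(psi)                          # back to real space
        diff = revised - exit_wave
        obj[y:y + n, x:x + n] += alpha * cp.conj(probe) * diff / (cp.abs(probe) ** 2).max()
        probe += beta * cp.conj(roi) * diff / (cp.abs(roi) ** 2).max()
    return obj, probe
```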
- How do I debug GPU code for diffraction analysis?
Use print debugging with `printf` in CUDA, limiting output to avoid buffer overflows, or cuda-gdb, NVIDIA’s GPU debugger with kernel breakpoints but a steep learning curve. Compute Sanitizer detects memory errors and races when run with `compute-sanitizer python script.py`, but runs 10 to 100 times slower. Debug on smaller test cases (an 8 by 8 scan instead of 256 by 256) for faster iteration and compare GPU output against NumPy using `np.allclose(cpu, gpu.get())` for numerical accuracy. Most 4D-STEM bugs involve indexing errors, memory corruption from races, or numerical issues like NaN from division by zero.
Resources
GPU hardware
NVIDIA
Product pages: GeForce RTX 40 series, H100, H100 architecture whitepaper, A100, RTX 6000 Ada
Documentation: CUDA Toolkit, CUDA Programming Guide, cuFFT library, cuBLAS library, CuPy documentation, Nsight Systems profiler, Nsight Compute
AMD
Product pages: Radeon RX 7000 series, Instinct MI300, MI250X
Documentation: ROCm documentation, ROCm installation, HIP (CUDA alternative)
Intel
Product pages: GPU Max series, Arc graphics
Documentation: OneAPI toolkit, DPC++ compiler
Cloud providers
AWS: GPU instance types
GCP: GPU documentation
Azure: GPU VMs
Lambda Labs: GPU cloud
Software
4D-STEM analysis
py4DSTEM: Python library with GPU support via CuPy
LiberTEM: Distributed processing with live streams
HyperSpy: Multi-dimensional analysis, partial GPU support
pyxem: Crystallographic analysis
Deep learning
PyTorch: Dynamic graphs, research-focused
TensorFlow: Production-focused, multi-platform
JAX: Research library with autodiff
Tutorials
CUDA by Example: Introductory CUDA
GPU Gems: Advanced GPU techniques
Fast.ai: Practical deep learning with GPU practices
WebGPU resources
WebGPU API (MDN): Official documentation
WebGPU Fundamentals: Comprehensive tutorial
WGSL Specification: Shading language spec
Your First WebGPU App (Google): Getting started guide
Learn WebGPU: Detailed tutorials
GPU Compute in Chrome: Browser support
Softbodies WebGPU Demo: Interactive example
Glossary
Hardware
- Bit
0 or 1, smallest unit of data.
- Byte
8 bits. Basic memory addressing unit.
- RAM
Random Access Memory. System memory using DRAM. Volatile, loses data at power off. Typical 128-512 GB for workstations.
- DRAM
Dynamic RAM uses 1 transistor plus 1 capacitor per bit. Cheaper and denser than SRAM but slower at 100 nanoseconds access due to constant refresh requirements. See SRAM for comparison and HBM for GPU bandwidth solution.
- SRAM
Static RAM uses 6 transistors per bit with no refresh needed. Access time 1 nanosecond matching CPU speed but 50 to 100 times more expensive than DRAM. Used in CPU caches (L1, L2, L3) where speed matters most.
- VRAM
Video RAM mounted directly on GPU circuit board providing high bandwidth (1,000 to 3,000 GB per second). Consumer GPUs use GDDR6 achieving 1,000 GB per second. Data center GPUs use HBM reaching 1,500 to 3,000 GB per second. See HBM for bandwidth solution.
- HBM
High Bandwidth Memory stacks 4 to 12 DRAM chips vertically using Through-Silicon Via technology. Delivers 1,500 to 3,000 GB per second bandwidth through 1024 to 4096 bit buses. Used in data center GPUs (A100, H100) where bandwidth justifies higher cost. See “Why was HBM developed?” for detailed memory wall explanation.
- PCIe
Peripheral Component Interconnect Express connects CPU to GPU [PCI-SIG specifications]. Version 4.0 with x16 lanes provides 32 GB per second. Version 5.0 with x16 lanes provides 64 GB per second. Often becomes bottleneck for transfers.
- NVLink
NVIDIA high-speed GPU interconnect [NVIDIA NVLink]. Third generation (A100) provides 600 GB per second per GPU; fourth generation (H100) provides 900 GB per second. Enables efficient multi-GPU on the same node.
- Core
Processing unit. CPU cores: complex, powerful, independent. GPU cores: simple, specialized, execute identical operations in groups.
- Thread
Single execution path. One worker doing one calculation.
- Warp
Group of 32 threads (NVIDIA) executing same instruction. Fundamental unit. Called wavefront on AMD (64 threads).
- Block
Group of 256-1024 threads sharing fast memory. Also called workgroup.
- Grid
Collection of all blocks in one kernel launch.
- Kernel
Function executing in parallel across thousands of threads. Each thread runs identical code on different data.
- SM
Streaming Multiprocessor. Basic GPU building block. Contains CUDA cores, shared memory, registers. GPU has many SMs working in parallel.
- Occupancy
Ratio of active warps to the maximum warps an SM can host. High (>50%) = busy GPU. Low (<25%) = underutilized. Limited by register usage and shared memory.
- InfiniBand
High-speed HPC network. 100-400 Gb/s (12.5-50 GB/s), <1 μs latency. Used for multi-node GPU clusters. Much faster than Ethernet for distributed computing.
- SSD
Solid State Drive. NAND flash storage. 500-7,000 MB/s, faster than hard drives but slower than RAM. Uses quantum tunneling, microsecond latencies.
Architectures
- Ampere
NVIDIA architecture (2020). A100, RTX 3000 series. First FP64 tensor cores.
- Ada Lovelace
NVIDIA architecture (2022). RTX 4000 series. Fourth-gen tensor cores with FP8.
- Hopper
NVIDIA architecture (2022). H100. Fourth-gen tensor cores, FP8 precision. 60% faster training than A100.
- RDNA 3
AMD architecture (2022). Radeon RX 7000. Chiplet design.
- CDNA 3
AMD data center (2023). Instinct MI300. CPU-GPU chiplet combination.
- Ponte Vecchio
Intel architecture (2022). GPU Max 1550. Tile-based multi-chip.
Software
- CUDA
NVIDIA parallel computing platform. Most mature ecosystem for scientific computing. NVIDIA GPUs only.
- ROCm
AMD platform for GPU computing. Open source CUDA alternative. Includes HIP (CUDA translation), ROCm libraries.
- HIP
Heterogeneous Interface for Portability. AMD API similar to CUDA. Many CUDA programs auto-convert. Enables NVIDIA/AMD portability.
- OpenCL
Open Computing Language. Cross-vendor standard. Works on all GPUs and CPUs. More portable than CUDA but generally less optimized.
- WebGPU
Modern web standard for browser GPU access. Cross-platform JavaScript. 2-5x slower than native. Suitable for visualization, not production.
- cuFFT
CUDA FFT library. NVIDIA optimized, 10-100x faster than CPU FFT.
- cuBLAS
CUDA linear algebra library. Optimized matrix operations. Underlies PyTorch/TensorFlow.
- cuDNN
CUDA deep neural network library. Optimized convolution, pooling, normalization. Required for efficient deep learning.
- NCCL
NVIDIA Collective Communications Library. Optimized multi-GPU communication. All-reduce, broadcast, gather. Critical for distributed training.
- CuPy
NumPy-compatible GPU library. Drop-in replacement, arrays stored in GPU memory, operations on GPU.
- Numba
Python JIT compiler using LLVM. Compiles Python functions to machine code.
`@jit` for CPU, `@cuda.jit` for GPU kernels.
- JAX
Google research library. NumPy-like API with automatic differentiation and GPU/TPU support. Functional programming style.
- PyTorch
Deep learning framework. Dynamic computation graphs. Python-first. Research-focused. Extensive ecosystem.
- TensorFlow
Google deep learning framework. Multi-platform. Production-focused. TensorFlow Lite for mobile.
- LLVM
Low Level Virtual Machine. Compiler infrastructure converting any programming language to intermediate representation (IR). Separates language frontend (Python, C++, Rust) from machine backend (x86, ARM, GPU). Python compiler outputs LLVM IR, NVIDIA backend converts IR to PTX. Enables language-independent GPU support. Used by Numba, Swift, Rust, Julia. See “Why do we need LLVM, PTX, SASS, and nvcc?” for complete compilation flow explanation.
- PTX
Parallel Thread Execution. NVIDIA portable GPU assembly language that provides architecture-independent intermediate representation between LLVM IR and machine code. See “Why do we need LLVM, PTX, SASS, and nvcc?” for detailed explanation.
- SASS
GPU-specific machine code optimized for exact architecture. Generated from PTX at runtime or compile time. See “Why do we need LLVM, PTX, SASS, and nvcc?” for detailed explanation.
- nvcc
NVIDIA CUDA Compiler that orchestrates GPU compilation by splitting CUDA code into host and device parts. See “Why do we need LLVM, PTX, SASS, and nvcc?” for detailed workflow.
Performance
- FLOPS
Floating Point Operations Per Second. Performance metric measuring arithmetic calculation rate (GFLOPS equals billion per second, TFLOPS equals trillion per second). Relevant for compute-bound operations like matrix multiply.
- Tensor Core
Specialized hardware for matrix multiply-accumulate. 5-20x speedup for deep learning over regular cores.
- Mixed precision
Lower precision (FP16, FP8) computation with higher precision (FP32) master copy. Reduces memory, increases speed, minimal accuracy loss.
- Batch size
Samples processed together in one pass. Larger batches improve GPU utilization but need more memory. Typical 16-256 (training), 64-1024 (inference).
- Bandwidth
Data transfer rate (GB/s). Memory bandwidth determines how fast data moves. DDR5: ~80 GB/s. GDDR6: ~800 GB/s. HBM3: ~3,000 GB/s.
- Latency
Time delay between request and response. L1 cache: 1 ns. DRAM: 100 ns. SSD: 100,000 ns. Lower = faster response.
- Coalescing
Memory access pattern where threads access consecutive addresses. Enables efficient transfers. Uncoalesced (scattered) = many slow transactions.
- Stream
Sequence of operations on GPU. Multiple streams execute concurrently, enabling overlap of computation and transfer.
- Pinned memory
Page-locked host memory. Enables faster DMA transfers to GPU (2-3x speedup). Required for async transfers.
- All-reduce
Distributed operation combining values from all nodes and broadcasting result. Critical for gradient averaging in training.
- Profiling
Measuring where time spent. Nsight Systems: timeline view. Nsight Compute: kernel metrics. Profile before optimizing.
- Bottleneck
Slowest component limiting performance. Could be memory bandwidth (most common), compute, PCIe, or CPU. Identify via profiling, optimize first.
References
S. Kandel, C. Jacobsen, and S. Vogt. Ssc-cdi: a memory-efficient, multi-gpu package for ptychography with extreme data. Journal of Imaging, 10(11):286, 2024. URL: https://doi.org/10.3390/jimaging10110286, doi:10.3390/jimaging10110286.
Saugat Kandel, Chris Jacobsen, and Stefan Vogt. Scalable and accurate multi-gpu-based image reconstruction of large-scale ptychography data. Scientific Reports, 12:3571, 2022. URL: https://doi.org/10.1038/s41598-022-09430-3, doi:10.1038/s41598-022-09430-3.
A. Konečná, L. H. G. Tizei, O. Stéphan, and M. Kociak. Ptychopy: gpu framework for ptychographic data analysis. arXiv preprint, 2022. URL: https://arxiv.org/abs/2202.03144, doi:10.48550/arXiv.2202.03144.
S. Marchesini, H. Krishnan, B. J. Daurer, D. A. Shapiro, T. Perciano, J. A. Sethian, and K. H. Downing. Sharp: a distributed gpu-based ptychographic solver. Journal of Applied Crystallography, 49(4):1245–1252, 2016. URL: https://doi.org/10.1107/S1600576716008074, doi:10.1107/S1600576716008074.
S. B. Mehta, Y. S. G. Nashed, T. Peterka, and C. Jacobsen. Deep learning at the edge enables real-time streaming ptychographic imaging. Nature Communications, 14:5257, 2023. URL: https://doi.org/10.1038/s41467-023-41496-z, doi:10.1038/s41467-023-41496-z.
C. Ophus, J. Ciston, and C. T. Nelson. Real-time integration center of mass (ricom) reconstruction for 4d-stem. Ultramicroscopy, 231:113101, 2021. URL: https://doi.org/10.1016/j.ultramic.2021.113101, doi:10.1016/j.ultramic.2021.113101.
P. M. Pelz, C. Groschner, and M. C. Scott. Phase reconstruction using fast binary 4d stem data. Applied Physics Letters, 116:024101, 2020. URL: https://doi.org/10.1063/1.5143213, doi:10.1063/1.5143213.
S. E. Zeltmann, A. Müller, K. C. Bustillo, B. Savitzky, L. Hughes, A. M. Minor, and C. Ophus. Free, flexible and fast: orientation mapping using the multi-core and gpu-accelerated template matching capabilities in the python-based open source 4d-stem analysis toolbox. Ultramicroscopy, 237:113517, 2022. URL: https://doi.org/10.1016/j.ultramic.2022.113517, doi:10.1016/j.ultramic.2022.113517.