# Performance & Optimization
Performance documentation for the quantem.widget data pipeline, focused on 4D-STEM and 5D-STEM workflows on Apple Silicon.
## Data Pipeline Architecture
### Raw Float32 Pipeline
All widgets send raw float32 data to JavaScript. Normalization, log scale, auto-contrast percentile clipping, histogram computation, and colormap LUT application all happen in JS for instant interactivity. Python never pre-renders colormapped images — it only sends the raw numerical data.
Show4DSTEM uses PyTorch MPS for virtual imaging computation (tensordot/sparse indexing for BF/ADF/custom mask integration), keeping the heavy math on GPU while JS handles display.
### 5D-STEM Eager Loading
Show4DSTEM loads the full 5D tensor to GPU at init time. This trades slower initialization for instant frame switching during interactive work.
#### How It Works
The full `(n_frames, scan_r, scan_c, det_r, det_c)` tensor is copied to MPS at widget creation. `_data[frame_idx]` returns a GPU tensor view (0 ms), not a copy, so frame switching is instantaneous. Virtual imaging (`tensordot` with the ROI mask) runs entirely on GPU, so BF/ADF/custom ROI updates during drag are real-time.
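The zero-copy view behavior can be demonstrated with plain PyTorch (a CPU tensor and toy shapes are used here for portability):

```python
import torch

# Stand-in for the (n_frames, scan_r, scan_c, det_r, det_c) tensor
data = torch.zeros(4, 8, 8, 6, 6)
frame = data[2]                  # basic indexing -> a view, no copy

# The view shares storage with the full tensor...
assert frame.untyped_storage().data_ptr() == data.untyped_storage().data_ptr()

# ...so writing through the view is visible in the original
frame[0, 0, 0, 0] = 1.0
assert float(data[2, 0, 0, 0, 0]) == 1.0
```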
### MPS INT_MAX Fallback
PyTorch MPS has a hard limit of INT_MAX (2^31 - 1 = 2,147,483,647) elements per tensor. When the total element count exceeds this, Show4DSTEM automatically falls back to CPU torch tensors.
Real-world 5D dataset sizes:
| Config | Shape | Elements | Memory | Backend |
|---|---|---|---|---|
| det_bin=8, 10 files | 10 x 256 x 256 x 24 x 24 | 377M | ~1.5 GB | MPS |
| det_bin=4, 10 files | 10 x 256 x 256 x 48 x 48 | 1.5B | ~6 GB | MPS |
| det_bin=2, 10 files | 10 x 256 x 256 x 96 x 96 | 6.0B | ~24 GB | CPU (exceeds INT_MAX) |
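A minimal sketch of the fallback decision; `pick_backend` is a hypothetical helper, and the actual logic lives inside Show4DSTEM:

```python
import numpy as np
import torch

INT_MAX = 2**31 - 1  # MPS per-tensor element limit

def pick_backend(shape) -> str:
    """Sketch of the automatic CPU fallback (hypothetical helper name)."""
    n_elements = int(np.prod(shape))
    if n_elements > INT_MAX:
        return "cpu"  # exceeds the MPS limit, use CPU torch tensors
    return "mps" if torch.backends.mps.is_available() else "cpu"

pick_backend((10, 256, 256, 96, 96))  # -> "cpu": 6.0B elements exceed INT_MAX
```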
### Init and Frame Switching Benchmarks
Measured with synthetic 5D data on Apple M5 (24 GB):

| Config | Init (numpy→MPS) | Global min/max | Frame switch | |
|---|---|---|---|---|
| det_bin=8 (1.4 GB) | 253 ms | 137 ms | 7 µs | 31 ms |
| det_bin=4 (5.6 GB) | 7.7 s | 177 ms | 8 µs | 75 ms |
Frame switching is a tensor view (7–8 µs) — effectively instant. The init cost scales with tensor size, but this is a one-time cost at widget creation.
Comparison with the previous lazy-loading approach:

| Strategy | Latency per frame switch | Notes |
|---|---|---|
| Eager GPU (current) | 7 µs | Tensor view, no copy |
| Lazy NumPy→MPS copy | 28 ms | |
| Lazy with contiguous copy | 96 ms | Unnecessary contiguous copy overhead |
Eager loading eliminates per-frame latency entirely, making 5D time/tilt series exploration feel instantaneous.
## Virtual Imaging Performance
Show4DSTEM computes virtual images by integrating diffraction patterns over a mask (BF disk, ADF annulus, or custom ROI). The implementation uses tensordot with sparse indexing for small masks.
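A hedged sketch of the two code paths; `virtual_image` and the `SMALL_MASK_PX` cutoff are illustrative names and values, not the package's actual API:

```python
import torch

SMALL_MASK_PX = 512  # illustrative cutoff; the real heuristic may differ

def virtual_image(data: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sketch of mask-integrated virtual imaging (hypothetical helper).

    data: (scan_r, scan_c, det_r, det_c) float32 diffraction stack
    mask: (det_r, det_c) bool ROI (BF disk, ADF annulus, or custom)
    """
    if int(mask.sum()) < SMALL_MASK_PX:
        # Small masks: gather only the masked detector pixels and sum
        rows, cols = mask.nonzero(as_tuple=True)
        return data[..., rows, cols].sum(dim=-1)
    # Larger masks: dense tensordot over both detector dimensions
    return torch.tensordot(data, mask.to(data.dtype), dims=([2, 3], [0, 1]))
```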
### 256×256×96×96 (det_bin=2)

| Method | MPS | CPU torch | NumPy | Notes |
|---|---|---|---|---|
| tensordot (BF, 317 px) | 22 ms | 34 ms | 34 ms | Default path |
| sparse sum (BF, 317 px) | 5 ms | 21 ms | 64 ms | Used for small masks |
| tensordot (ADF, 952 px) | 23 ms | 34 ms | 34 ms | |
| elementwise | 127 ms | — | 222 ms | Avoided |
### 256×256×192×192 (no binning, 9.2 GB)

| Method | CPU torch | NumPy | Notes |
|---|---|---|---|
| tensordot (BF) | 139 ms | 1,918 ms | MPS unavailable (>INT_MAX) |
| sparse sum (BF, 1257 px) | 85 ms | 374 ms | |
At det_bin=2, the MPS sparse sum at 5 ms gives ~200 fps during ROI drag. Even at full detector resolution, where MPS is unavailable, CPU torch sparse at 85 ms remains usable. No debouncing is needed: the user sees real-time virtual image updates as they move the detector ROI.
## IO.arina_file GPU Pipeline
IO.arina_file reads bitshuffle+LZ4 compressed 4D-STEM data using Metal GPU decompression on Apple Silicon.
### Double-Buffered Architecture
The pipeline uses double buffering to overlap IO and decompression:

1. CPU reads compressed chunk N+1 from disk
2. GPU decompresses chunk N via Metal compute shaders

These run concurrently, so disk IO is fully hidden behind GPU work.
The bottleneck is GPU decompression, not disk IO: decompressing 262k frames of bitshuffle+LZ4 takes ~1.5s on M5, while the 1.7 GB disk read at 8.2 GB/s SSD throughput completes in ~0.2s.
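The overlap can be sketched with a one-worker thread pool; `read_chunk` and `gpu_decompress` are stand-ins for the disk reader and the Metal decompressor described above:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(n_chunks, read_chunk, gpu_decompress):
    """Sketch of the double-buffered loop: the IO thread prefetches
    chunk N+1 while the caller decompresses chunk N."""
    out = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_chunk, 0)              # prefetch chunk 0
        for n in range(n_chunks):
            chunk = pending.result()                    # wait for chunk n
            if n + 1 < n_chunks:
                pending = io.submit(read_chunk, n + 1)  # read n+1 in background
            out.append(gpu_decompress(chunk))           # while decompressing n
    return out
```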
### Buffer Sizing
Compressed buffer allocation uses a conservative formula:

```
buffer_size = max(256 MB, max_frames * frame_bytes // 4)
```
The worst observed compression ratio is ~7:1. Using `// 4` (a 4:1 ratio) provides headroom for poorly compressing datasets.
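As a sketch of the arithmetic (the 4096-frame chunk and uint16 frames below are hypothetical, chosen only to exercise the 256 MB floor):

```python
def buffer_size(max_frames: int, frame_bytes: int) -> int:
    """Compressed-buffer sizing per the formula above: a 256 MB floor
    plus a conservative 4:1 compression assumption."""
    return max(256 * 2**20, max_frames * frame_bytes // 4)

# Hypothetical chunk: 4096 frames of 192x192 uint16 -> the floor applies,
# since 4096 * 73728 // 4 ≈ 72 MB is below 256 MB
buffer_size(4096, 192 * 192 * 2)  # -> 268435456 (the 256 MB floor)
```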
### Early Validation
The pipeline checks file existence before starting the GPU pipeline, failing fast for incomplete datasets rather than discovering missing chunks mid-decompression.
## Benchmarks (Apple M5, 24 GB)
### IO.arina_file Single File
SnMoS2 dataset: 262,144 frames, 192 x 192 detector pixels.

| Config | Output Shape | Memory | Load Time |
|---|---|---|---|
| det_bin=2 | 512 x 512 x 96 x 96 | 9.0 GB | 1.8 s |
| det_bin=4 | 512 x 512 x 48 x 48 | 2.3 GB | 1.7 s |
| det_bin=8 | 512 x 512 x 24 x 24 | 0.6 GB | 1.8 s |
Load time is dominated by GPU decompression and is nearly constant across bin factors — binning happens after decompression.
### IO.arina_folder (Multi-File 5D)
Korean sample: 12 files, ~65k frames each.

| Config | Output Shape | Memory | Load | +Show4DSTEM |
|---|---|---|---|---|
| det_bin=8 (10 files) | 10 x 256 x 256 x 24 x 24 | 1.5 GB | 9.5 s | 11.0 s |
| det_bin=4 (10 files) | 10 x 256 x 256 x 48 x 48 | 6.0 GB | 10.8 s | 16.3 s |
The “+Show4DSTEM” column includes widget initialization (MPS tensor copy + initial virtual image computation).
## Memory Guidelines
### Estimating Memory Usage
Memory for a 4D-STEM dataset in float32:
```
memory_bytes = scan_r * scan_c * (det_r / bin)^2 * 4
```
For 5D datasets (time/tilt series), multiply by n_frames.
Examples (512 x 512 scan):

| det_bin | Detector | Per-Frame 4D | 10-Frame 5D |
|---|---|---|---|
| 8 | 24 x 24 | 0.6 GB | 6.0 GB |
| 4 | 48 x 48 | 2.3 GB | 23 GB |
| 2 | 96 x 96 | 9.0 GB | 90 GB |
| 1 | 192 x 192 | 36 GB | 360 GB |
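The table values can be reproduced with a short helper (illustrative, not part of the package; results are in binary GB):

```python
def mem_gib(scan_r, scan_c, det, det_bin, n_frames=1):
    """float32 memory estimate per the formula above, in GiB."""
    return n_frames * scan_r * scan_c * (det // det_bin) ** 2 * 4 / 2**30

round(mem_gib(512, 512, 192, 2), 1)  # -> 9.0, matching the det_bin=2 row
```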
### Automatic Bin Selection
`det_bin="auto"` picks the smallest bin factor that fits in available RAM, balancing resolution against memory constraints.
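A minimal sketch of what such auto-selection could look like; `auto_det_bin` and its signature are assumptions, not the real implementation:

```python
def auto_det_bin(scan_r, scan_c, det, n_frames, avail_bytes,
                 candidates=(1, 2, 4, 8)):
    """Sketch of det_bin="auto": the smallest bin factor whose float32
    tensor fits within the available RAM budget."""
    for b in sorted(candidates):  # try full resolution first
        if n_frames * scan_r * scan_c * (det // b) ** 2 * 4 <= avail_bytes:
            return b
    return max(candidates)  # fall back to the coarsest binning

auto_det_bin(512, 512, 192, 1, avail_bytes=16 * 2**30)  # -> 2 (9.0 GiB fits)
```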
### Scan Binning
`scan_bin=2` halves scan resolution in each dimension, reducing the 4D data size by 4x. This is useful for quick survey analysis before committing to full-resolution processing.
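For illustration, 2x scan binning can be written as a block reduction over the scan grid; whether the package averages or sums the blocks is an assumption here:

```python
import numpy as np

def scan_bin2(data: np.ndarray) -> np.ndarray:
    """Illustrative 2x2 scan binning by block-averaging the scan grid
    (the package may sum instead of average; this is an assumption)."""
    r, c, dr, dc = data.shape
    return data.reshape(r // 2, 2, c // 2, 2, dr, dc).mean(axis=(1, 3))

scan_bin2(np.zeros((8, 8, 4, 4), dtype=np.float32)).shape  # -> (4, 4, 4, 4)
```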
## Tips for Users
- **Quick survey**: use `det_bin=8` for fast loading and exploration. Switch to `det_bin=2` for publication-quality analysis once you have identified regions of interest.
- **Scan binning**: `scan_bin=2` quarters the 4D data size at the cost of halved spatial resolution. Useful for large-area surveys.
- **Large datasets (>16 GB)**: use `det_bin=4` or higher to stay within MPS memory limits. The `det_bin="auto"` option handles this automatically.
- **5D frame switching is instant**: the full tensor is on GPU, so scrubbing through time/tilt frames has zero latency. There is no need to precompute virtual images for each frame.
- **Auto-center detection**: sums all diffraction frames across all scan positions (and all time/tilt frames for 5D). This may take a few seconds for large datasets but only runs once at initialization.
- **Hot pixel filtering**: adds negligible overhead (~1% of load time). Leave it enabled unless you have a specific reason to skip it.
## Memory Management
4D-STEM datasets are large (1–10+ GB). When switching between datasets in a notebook session, you must explicitly free both the widget's GPU tensor and the source numpy array. Python's garbage collector alone is not enough: the MPS allocator caches freed buffers and does not return them to the system until `torch.mps.empty_cache()` is called.
### Using free()
Show4DSTEM provides a `free()` convenience method that handles the full cleanup:

```python
w.free()     # deletes MPS tensor, runs gc, flushes MPS cache
del result   # free the source numpy IOResult array
```
### Manual cleanup
If you need fine-grained control:

```python
import gc, torch

del w                     # delete widget (releases reference to MPS tensor)
del result                # delete IOResult (releases numpy array)
gc.collect()              # trigger Python garbage collection
torch.mps.empty_cache()   # flush MPS allocator cache back to the system
```
### Why `del` alone isn't enough
The MPS allocator maintains an internal free list of GPU buffers. When a PyTorch tensor is deleted, the underlying Metal buffer is returned to this free list, not to the operating system. Subsequent torch allocations reuse these cached buffers (which is fast), but if you load a new dataset with a different size, the cached buffers are useless and simply waste memory. `torch.mps.empty_cache()` drains this free list, making the memory available for new allocations.
### Monitoring memory

```python
torch.mps.current_allocated_memory() / 1e9  # GB currently in use
torch.mps.driver_allocated_memory() / 1e9   # GB allocated by the Metal driver
```