# Performance & Optimization

Performance documentation for the quantem.widget data pipeline, focused on 4D-STEM and 5D-STEM workflows on Apple Silicon.

## Data Pipeline Architecture

### Metal → NumPy → PyTorch MPS (the unavoidable copy)

The end-to-end data path on Apple Silicon:

```
Disk (bitshuffle+LZ4)
  → Metal compute shader decompresses into MTLBuffer (StorageModeShared)
    → np.frombuffer() wraps the Metal buffer as a numpy view (zero-copy, same pointer)
      → torch.from_numpy().to("mps") copies into PyTorch's own MTLBuffer (memcpy!)
```

The last step is a **memcpy within unified DRAM** — both the Metal decompressor's output buffer and PyTorch's MPS buffer sit in the same physical memory, but PyTorch MPS cannot adopt external Metal buffers. There is no `torch.from_metal_buffer()` API. This copy is the primary init bottleneck:

| Data size | memcpy time | Notes |
|-----------|-------------|-------|
| 0.6 GB (det_bin=8) | ~30 ms | Negligible |
| 2.3 GB (det_bin=4) | ~110 ms | Acceptable |
| 9.0 GB (det_bin=2) | ~1.2 s | Main bottleneck |
| 5.6 GB (5D, det_bin=4) | ~7.7 s | Significant but one-time |
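
The zero-copy wrap and the final copy in the path above can be illustrated with plain NumPy. This is a minimal sketch, not the pipeline's code: a `bytearray` stands in for the shared Metal buffer and `.copy()` stands in for the transfer into PyTorch's own allocation (both are assumptions made for illustration).

```python
import numpy as np

# Stand-in for the decompressor's output buffer (hypothetical; the real one is an MTLBuffer).
raw = bytearray(4 * 6)  # room for six float32 values

# Zero-copy: np.frombuffer exposes the same bytes as a float32 array.
view = np.frombuffer(raw, dtype=np.float32)
second_view = np.frombuffer(raw, dtype=np.float32)
views_share = np.shares_memory(view, second_view)

# Copy: analogous to the transfer into a separately allocated tensor.
copied = view.copy()
copy_shares = np.shares_memory(view, copied)

# A write through one view is visible through the other, but not in the copy.
view[0] = 42.0
seen_by_second_view = float(second_view[0])
seen_by_copy = float(copied[0])
```

Only the copy costs time proportional to the data size; the wrap is constant-time regardless of how large the buffer is.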

This copy is worthwhile because MPS virtual imaging is faster than the CPU alternatives at sizes that fit on the GPU: at 256×256×96×96 (det_bin=2), `tensordot` takes 22 ms on MPS versus 34 ms for both NumPy and CPU torch, and sparse sum takes 5 ms on MPS versus 21 ms on CPU torch. At the unbinned 256×256×192×192 size the tensor exceeds the MPS `INT_MAX` element limit, so the comparison there is CPU-only: 139 ms for CPU torch versus 1,918 ms for NumPy.

**Why not skip PyTorch and stay on numpy?** For binned detectors, numpy `tensordot` is usable (34 ms at det_bin=2). But at full detector resolution (192×192), numpy `tensordot` takes 1.9 seconds, which is unusable for interactive work, while CPU torch finishes in 139 ms. A PyTorch backend is essential, and MPS is the fastest option whenever the tensor fits.

**Why not write directly to a PyTorch MPS buffer?** PyTorch MPS tensors are allocated through PyTorch's internal Metal allocator. Our Metal compute shaders use their own `MTLBuffer` allocations. PyTorch has no API to wrap an external Metal buffer as an MPS tensor (`torch.from_dlpack` does not support MPS). Until PyTorch adds this, the memcpy is unavoidable.

### Raw Float32 Pipeline

All widgets send **raw float32** data to JavaScript. Normalization, log scale, auto-contrast percentile clipping, histogram computation, and colormap LUT application all happen in JS for instant interactivity. Python never pre-renders colormapped images — it only sends the raw numerical data.

Show4DSTEM uses PyTorch MPS for virtual imaging computation (`tensordot`/sparse indexing for BF/ADF/custom mask integration), keeping the heavy math on GPU while JS handles display.

## 5D-STEM Eager Loading

Show4DSTEM loads the full 5D tensor to GPU at init time. This trades slower initialization for instant frame switching during interactive work.

### How It Works

- The full `(n_frames, scan_r, scan_c, det_r, det_c)` tensor is copied to MPS at widget creation.
- `_data[frame_idx]` returns a **GPU tensor view**, not a copy, so frame switching costs only a few microseconds (7–8 µs measured).
- Virtual imaging (`tensordot` with the ROI mask) runs entirely on GPU, so BF/ADF/custom ROI updates during drag are real-time.

### MPS INT_MAX Fallback

PyTorch MPS has a hard limit of `INT_MAX` (2^31 - 1 = 2,147,483,647) elements per tensor. When the total element count exceeds this, Show4DSTEM automatically falls back to CPU torch tensors.

**Real-world 5D dataset sizes:**

| Config | Shape | Elements | Memory | Backend |
|--------|-------|----------|--------|---------|
| det_bin=8, 10 files | 10 x 256 x 256 x 24 x 24 | 377M | ~1.5 GB | MPS |
| det_bin=4, 10 files | 10 x 256 x 256 x 48 x 48 | 1.5B | ~6 GB | MPS |
| det_bin=2, 10 files | 10 x 256 x 256 x 96 x 96 | 6.0B | ~24 GB | CPU (exceeds INT_MAX) |
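
The backend column above follows directly from the element count. The sketch below reproduces that decision; `pick_backend` is a hypothetical helper written for this page, not the function Show4DSTEM uses internally.

```python
import math

INT_MAX = 2**31 - 1  # 2,147,483,647, the per-tensor element limit quoted above


def pick_backend(shape):
    """Return "mps" if the element count fits under INT_MAX, otherwise "cpu"."""
    n_elements = math.prod(shape)
    return "mps" if n_elements <= INT_MAX else "cpu"


# The three configurations from the table above.
backends = {
    "det_bin=8": pick_backend((10, 256, 256, 24, 24)),
    "det_bin=4": pick_backend((10, 256, 256, 48, 48)),
    "det_bin=2": pick_backend((10, 256, 256, 96, 96)),
}
```

Note that the limit is on element count, not bytes, so the dtype does not change which backend is chosen.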

### Init and Frame Switching Benchmarks

Measured with synthetic 5D data on Apple M5 (24 GB):

| Config | Init (numpy→MPS) | Global min/max | Frame switch | `auto_detect_center` sum |
|--------|-------------------|----------------|--------------|--------------------------|
| det_bin=8 (1.4 GB) | 253 ms | 137 ms | **7 µs** | 31 ms |
| det_bin=4 (5.6 GB) | 7.7 s | 177 ms | **8 µs** | 75 ms |

Frame switching is a tensor view (7–8 µs) — effectively instant. The init cost scales with tensor size, but this is a one-time cost at widget creation.

**Comparison with previous lazy loading approach:**

| Strategy | Latency per frame switch | Notes |
|----------|--------------------------|-------|
| Eager GPU (current) | **7 µs** | Tensor view, no copy |
| Lazy NumPy→MPS copy | 28 ms | `torch.from_numpy().to("mps")` per frame |
| Lazy with `.copy()` | 96 ms | Unnecessary contiguous copy overhead |

Eager loading eliminates per-frame latency entirely, making 5D time/tilt series exploration feel instantaneous.

## Virtual Imaging Performance

Show4DSTEM computes virtual images by integrating diffraction patterns over a mask (BF disk, ADF annulus, or custom ROI). The default path is a dense `tensordot` against the mask; for small masks it switches to sparse indexing, which sums only the pixels inside the mask.
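
The two integration strategies give the same image and differ only in how much data they touch. The following is a small NumPy sketch of both on synthetic data; the array sizes and mask radius are arbitrary choices for illustration, and the widget itself runs the equivalent operations through PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic 4D dataset: (scan_r, scan_c, det_r, det_c).
data = rng.random((8, 8, 16, 16), dtype=np.float32)

# Circular bright-field mask centred on the detector.
rr, cc = np.mgrid[:16, :16]
mask = (rr - 8) ** 2 + (cc - 8) ** 2 <= 3**2

# Dense path: contract both detector axes against the full mask.
vi_dense = np.tensordot(data, mask.astype(np.float32), axes=([2, 3], [0, 1]))

# Sparse path: gather only the pixels inside the mask, then sum them.
rows, cols = np.nonzero(mask)
vi_sparse = data[:, :, rows, cols].sum(axis=-1)
```

The dense path reads every detector pixel, while the sparse path reads only the masked ones, which is why it tends to win when the mask covers a small fraction of the detector.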

### 256×256×96×96 (det_bin=2)

| Method | MPS | CPU torch | NumPy | Notes |
|--------|-----|-----------|-------|-------|
| tensordot (BF, 317 px) | **22 ms** | 34 ms | 34 ms | Default path |
| sparse sum (BF, 317 px) | **5 ms** | 21 ms | 64 ms | Used for small masks |
| tensordot (ADF, 952 px) | **23 ms** | 34 ms | 34 ms | |
| elementwise | 127 ms | — | 222 ms | Avoided |

### 256×256×192×192 (no binning, 9.2 GB)

| Method | CPU torch | NumPy | Notes |
|--------|-----------|-------|-------|
| tensordot (BF) | 139 ms | **1,918 ms** | MPS unavailable (>INT_MAX) |
| sparse sum (BF, 1257 px) | **85 ms** | 374 ms | |

At det_bin=2, MPS sparse sum at 5ms gives ~200fps during ROI drag. Even at full detector resolution where MPS is unavailable, CPU torch sparse at 85ms is usable. **No debouncing is needed** — the user sees real-time virtual image updates as they move the detector ROI.

## IO.arina_file GPU Pipeline

`IO.arina_file` reads bitshuffle+LZ4 compressed 4D-STEM data using Metal GPU decompression on Apple Silicon.

### Double-Buffered Architecture

The pipeline uses double buffering to overlap IO and decompression:

1. **CPU** reads compressed chunk N+1 from disk
2. **GPU** decompresses chunk N via Metal compute shaders
3. These run concurrently — disk IO is fully hidden behind GPU work

The bottleneck is GPU decompression, not disk IO: decompressing 262k frames of bitshuffle+LZ4 takes ~1.5s on M5, while the 1.7 GB disk read at 8.2 GB/s SSD throughput completes in ~0.2s.
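
The overlap can be sketched with a single background reader thread. This is an illustration of the double-buffering pattern only: `read_chunk` and `decompress` are hypothetical stand-ins, and `zlib` replaces the Metal bitshuffle+LZ4 kernel so the example runs anywhere.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def read_chunk(chunks, i):
    """Stand-in for reading compressed chunk i from disk."""
    return chunks[i]


def decompress(blob):
    """Stand-in for the GPU decompressor (zlib here, not bitshuffle+LZ4)."""
    return zlib.decompress(blob)


def load_double_buffered(chunks):
    """Decompress chunk N while chunk N+1 is being read in the background."""
    if not chunks:
        return []
    out = []
    with ThreadPoolExecutor(max_workers=1) as reader:
        pending = reader.submit(read_chunk, chunks, 0)
        for i in range(len(chunks)):
            blob = pending.result()  # wait for chunk i to arrive
            if i + 1 < len(chunks):
                # Start reading chunk i+1 before decompressing chunk i.
                pending = reader.submit(read_chunk, chunks, i + 1)
            out.append(decompress(blob))
    return out


originals = [bytes([i]) * 1024 for i in range(4)]
compressed = [zlib.compress(c) for c in originals]
restored = load_double_buffered(compressed)
```

With this structure the total time approaches the slower of the two stages rather than their sum, which matches the observation that the disk read is hidden behind decompression.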

### Buffer Sizing

Compressed buffer allocation uses a conservative formula:

```
buffer_size = max(256 MB, max_frames * frame_bytes // 4)
```

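The worst observed compression ratio is ~7:1, meaning the compressed data needs at least one seventh of the raw size. Sizing with `// 4` (an assumed 4:1 ratio) allocates about 1.75× that amount, which provides headroom for poorly-compressing datasets.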
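
As a worked example of the formula, the sketch below evaluates it for a short acquisition and for a 262,144-frame one. The 2 bytes per pixel and the binary interpretation of "256 MB" are assumptions made here for illustration; the source does not state either.

```python
MIN_BUFFER_BYTES = 256 * 1024 * 1024  # 256 MB floor (binary megabytes assumed)


def compressed_buffer_size(max_frames, frame_bytes):
    """Buffer size for compressed data, assuming at worst 4:1 compression."""
    return max(MIN_BUFFER_BYTES, max_frames * frame_bytes // 4)


# 192 x 192 detector frames; 2 bytes per pixel is an assumed raw dtype.
frame_bytes = 192 * 192 * 2

small = compressed_buffer_size(1_000, frame_bytes)    # floor applies
large = compressed_buffer_size(262_144, frame_bytes)  # formula applies
```

The floor keeps short acquisitions from allocating a buffer too small to be useful, while long acquisitions scale with the raw data size.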

### Early Validation

The pipeline checks file existence before starting the GPU pipeline, failing fast for incomplete datasets rather than discovering missing chunks mid-decompression.
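
A fail-fast check of this kind might look like the sketch below. `find_missing` and `validate_inputs` are hypothetical names, and the file names are placeholders; the actual check in `IO.arina_file` may be structured differently.

```python
import tempfile
from pathlib import Path


def find_missing(paths):
    """Return the paths that do not exist as regular files."""
    return [Path(p) for p in paths if not Path(p).is_file()]


def validate_inputs(paths):
    """Raise before any GPU work starts if an input file is missing."""
    missing = find_missing(paths)
    if missing:
        names = ", ".join(str(p) for p in missing)
        raise FileNotFoundError(f"missing input files: {names}")


# Demonstration with one present and one absent placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "chunk_000.h5"
    absent = Path(tmp) / "chunk_001.h5"
    present.write_bytes(b"")
    missing = find_missing([present, absent])
    try:
        validate_inputs([present, absent])
        failed_fast = False
    except FileNotFoundError:
        failed_fast = True
```

Reporting every missing file at once, rather than stopping at the first, saves a round trip when several chunks are absent.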

## Benchmarks (Apple M5, 24 GB)

### IO.arina_file Single File

SnMoS2 dataset: 262,144 frames, 192 x 192 detector pixels.

| Config | Output Shape | Memory | Load Time |
|--------|-------------|--------|-----------|
| det_bin=2 | 512 x 512 x 96 x 96 | 9.0 GB | 1.8 s |
| det_bin=4 | 512 x 512 x 48 x 48 | 2.3 GB | 1.7 s |
| det_bin=8 | 512 x 512 x 24 x 24 | 0.6 GB | 1.8 s |

Load time is dominated by GPU decompression and is nearly constant across bin factors — binning happens after decompression.

### IO.arina_folder (Multi-File 5D)

Korean sample: 12 files, ~65k frames each.

| Config | Output Shape | Memory | Load | +Show4DSTEM |
|--------|-------------|--------|------|-------------|
| det_bin=8 (10 files) | 10 x 256 x 256 x 24 x 24 | 1.5 GB | 9.5 s | 11.0 s |
| det_bin=4 (10 files) | 10 x 256 x 256 x 48 x 48 | 6.0 GB | 10.8 s | 16.3 s |

The "+Show4DSTEM" column includes widget initialization (MPS tensor copy + initial virtual image computation).

## Memory Guidelines

### Estimating Memory Usage

Memory for a 4D-STEM dataset in float32:

```
memory_bytes = scan_r * scan_c * (det_r / bin) * (det_c / bin) * 4
```

For 5D datasets (time/tilt series), multiply by `n_frames`.

**Examples (512 x 512 scan):**

| det_bin | Detector | Per-Frame 4D | 10-Frame 5D |
|---------|----------|-------------|-------------|
| 8 | 24 x 24 | 0.6 GB | 6.0 GB |
| 4 | 48 x 48 | 2.3 GB | 23 GB |
| 2 | 96 x 96 | 9.0 GB | 90 GB |
| 1 | 192 x 192 | 36 GB | 360 GB |
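
The table can be reproduced with a few lines of arithmetic. In this sketch, `estimate_memory_bytes` is a hypothetical helper, and the results are expressed in binary gigabytes (1024³ bytes), which is the unit the table's figures appear to use: det_bin=2 and det_bin=1 come out at exactly 9 and 36 in that unit.

```python
GIB = 1024**3  # binary gigabyte


def estimate_memory_bytes(scan_r, scan_c, det_r, det_c, det_bin=1, n_frames=1):
    """float32 footprint in bytes of a detector-binned 4D (or 5D) dataset."""
    binned_pixels = (det_r // det_bin) * (det_c // det_bin)
    return n_frames * scan_r * scan_c * binned_pixels * 4


# 512 x 512 scan with a 192 x 192 detector, as in the table above.
per_frame = {
    b: estimate_memory_bytes(512, 512, 192, 192, det_bin=b) / GIB
    for b in (8, 4, 2, 1)
}
ten_frames_bin8 = estimate_memory_bytes(512, 512, 192, 192, det_bin=8, n_frames=10)
```

Each halving of the bin factor quadruples the footprint, so moving from det_bin=8 to det_bin=1 costs a factor of 64.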

### Automatic Bin Selection

`det_bin="auto"` picks the smallest bin factor that fits in available RAM, balancing resolution against memory constraints.
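
The selection rule can be sketched as a loop over candidate bin factors from finest to coarsest. `auto_det_bin`, its candidate list, and the exact fit criterion are assumptions for illustration; the real option may reserve headroom or query available memory differently.

```python
def auto_det_bin(scan_shape, det_shape, available_bytes,
                 candidates=(1, 2, 4, 8), n_frames=1):
    """Smallest bin factor whose float32 footprint fits in available_bytes."""
    scan_r, scan_c = scan_shape
    det_r, det_c = det_shape
    for b in sorted(candidates):
        need = n_frames * scan_r * scan_c * (det_r // b) * (det_c // b) * 4
        if need <= available_bytes:
            return b
    raise MemoryError("dataset does not fit even at the largest bin factor")


GIB = 1024**3
with_24_gib = auto_det_bin((512, 512), (192, 192), 24 * GIB)
with_1_gib = auto_det_bin((512, 512), (192, 192), 1 * GIB)
```

Trying the finest binning first means resolution is only given up when memory actually requires it.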

### Scan Binning

`scan_bin=2` halves scan resolution in each dimension, reducing the 4D data size by 4x. This is useful for quick survey analysis before committing to full-resolution processing.
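
The 4x reduction comes from merging each 2×2 block of scan positions into one. Below is a minimal NumPy sketch of that operation; `bin_scan` is a hypothetical helper, and summing the block (rather than averaging it) is an assumption about how the binning combines patterns.

```python
import numpy as np


def bin_scan(data, scan_bin=2):
    """Combine scan_bin x scan_bin blocks of scan positions by summation."""
    r, c, det_r, det_c = data.shape
    r2, c2 = r // scan_bin, c // scan_bin
    # Drop any trailing rows/columns that do not fill a complete block.
    trimmed = data[: r2 * scan_bin, : c2 * scan_bin]
    blocks = trimmed.reshape(r2, scan_bin, c2, scan_bin, det_r, det_c)
    return blocks.sum(axis=(1, 3))


data = np.ones((8, 8, 4, 4), dtype=np.float32)
binned = bin_scan(data, scan_bin=2)
size_ratio = data.nbytes // binned.nbytes
```

The detector axes are untouched, so each output position still holds a full diffraction pattern.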

## Tips for Users

1. **Quick survey**: use `det_bin=8` for fast loading and exploration. Switch to `det_bin=2` for publication-quality analysis once you have identified regions of interest.

2. **Scan binning**: `scan_bin=2` quarters the 4D data size at the cost of halved spatial resolution. Useful for large-area surveys.

3. **Large datasets (>16 GB)**: use `det_bin=4` or higher to stay within MPS memory limits. The `det_bin="auto"` option handles this automatically.

4. **5D frame switching is instant**: the full tensor is on GPU, so scrubbing through time/tilt frames costs only microseconds per switch. There is no need to precompute virtual images for each frame.

5. **Auto-center detection**: sums all diffraction frames across all scan positions (and all time/tilt frames for 5D). This may take a few seconds for large datasets but only runs once at initialization.

6. **Hot pixel filtering**: adds negligible overhead (~1% of load time). Leave it enabled unless you have a specific reason to skip it.

## Memory Management

4D-STEM datasets are large (1–10+ GB). When switching between datasets in a notebook session, you must explicitly free both the widget's GPU tensor and the source numpy array. Python's garbage collector alone is not enough — the MPS allocator caches freed buffers and does not return them to the system until `torch.mps.empty_cache()` is called.

### Using `free()`

Show4DSTEM provides a `free()` convenience method that handles the full cleanup:

```python
w.free()          # deletes MPS tensor, runs gc, flushes MPS cache
del result        # free the source numpy IOResult array
```

### Manual cleanup

If you need fine-grained control:

```python
import gc

import torch

del w             # delete widget (releases reference to MPS tensor)
del result        # delete IOResult (releases numpy array)
gc.collect()      # trigger Python garbage collection
torch.mps.empty_cache()  # flush MPS allocator cache back to system
```

### Why `del` alone isn't enough

The MPS allocator maintains an internal free-list of GPU buffers. When a PyTorch tensor is deleted, the underlying Metal buffer is returned to this free-list — not to the operating system. Subsequent `torch` allocations reuse these cached buffers (which is fast), but if you're loading a new dataset with a different size, the cached buffers are useless and just waste memory. `torch.mps.empty_cache()` drains this free-list, making the memory available for new allocations.

### Monitoring memory

```python
import torch

print(torch.mps.current_allocated_memory() / 1e9)  # GB currently occupied by tensors
print(torch.mps.driver_allocated_memory() / 1e9)   # GB allocated by the Metal driver, including cached buffers
```