# Uploading datasets [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bobleesj/quantem.data/blob/main/notebooks/upload.ipynb)

All datasets in quantem.data are hosted on [Hugging Face Hub](https://huggingface.co/datasets/bobleesj/quantem-data). Uploads create a **Pull Request** by default — the data is reviewed before merging into the public catalog.

## Prerequisites

1. Create a free [Hugging Face account](https://huggingface.co/join)
2. Create an access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (needs write access)
3. Log in from your terminal:

```bash
huggingface-cli login
```

## Install

```bash
pip install --pre -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ quantem-data
```

## Naming convention

Dataset names follow a **material-first** convention: `{material}_{descriptor}`.

| Rule | Example | Bad example |
|------|---------|-------------|
| Lowercase, underscores only | `srtio3_lamella` | `SrTiO3-Lamella` |
| Material first | `gold_nanoparticle` | `nanoparticle_gold` |
| Descriptor second (morphology, orientation) | `silicon_110` | `110_silicon` |
| Lab suffix only to disambiguate | `srtio3_lamella_ncem` | `ncem_srtio3` |
| No resolution, binning, or year in name | `graphene_monolayer` | `graphene_256x256_2024` |

Resolution, binning, year, and instrument details go in the JSON metadata — not the name.
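The mechanical part of these rules (lowercase, underscore-separated words) can be sketched as a simple regex check. This is illustrative only — the real validation happens in `preview_upload()`, and a regex cannot verify semantic rules like "material first":

```python
import re

# Rough first-pass check for the naming convention above: two or more
# lowercase alphanumeric words joined by single underscores.
# (Illustrative sketch; preview_upload() performs the real validation.)
NAME_RE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")

def looks_valid(name: str) -> bool:
    return NAME_RE.fullmatch(name) is not None

print(looks_valid("srtio3_lamella"))   # True: lowercase, underscores
print(looks_valid("SrTiO3-Lamella"))   # False: uppercase and hyphen
print(looks_valid("silicon_110"))      # True: material first, descriptor second
```

Note that a name like `graphene_256x256_2024` would pass this pattern but still violates the convention, since resolution and year belong in the metadata.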
## Upload from Python

```python
import numpy as np

from quantem.data import upload

# Your data (NumPy array)
data = np.load("my_hrtem_image.npy")

# Upload — creates a PR on Hugging Face Hub
upload(
    data,
    name="silicon_110_hrtem",
    technique="hrtem",
    description="Silicon [110] zone axis, HRTEM at 200 kV",
    contributor="Jane Doe",
)
```

Output:

```
Created PR to add silicon_110_hrtem (0.2 MB)
Review: https://huggingface.co/datasets/bobleesj/quantem-data/discussions/1
```

The PR link takes you to the Hugging Face discussion page where the maintainer can review your data and metadata, then merge it.

## Preview before uploading

Use `preview_upload()` to validate naming and metadata and to check for duplicates before submitting:

```python
from quantem.data import preview_upload

errors = preview_upload(
    data,
    name="silicon_110_hrtem",
    technique="hrtem",
    description="Silicon [110] zone axis, HRTEM at 200 kV",
    contributor="Jane Doe",
)
if errors:
    for e in errors:
        print(f"  - {e}")
else:
    print("Ready to upload!")
```

`preview_upload()` checks:

- Naming convention (lowercase, underscores, material-first)
- Valid technique folder
- Metadata schema compliance
- Array shape consistency
- Duplicate name detection on HF Hub

Fix any errors before calling `upload()`.

## Upload from the command line

```bash
quantem-data upload my_data.npy \
  --name silicon_110_hrtem \
  --technique hrtem \
  --description "Silicon [110] zone axis" \
  --contributor "Jane Doe"
```

By default this creates a PR. Add `--direct` to commit directly (requires write access to the repo).
## Valid techniques

Each dataset belongs to a technique folder:

| technique | data type | widget |
|-----------|-----------|--------|
| `4dstem` | 4D-STEM diffraction | Show4DSTEM, Show4D |
| `hrtem` | high-resolution TEM | Show2D, Mark2D |
| `eels` | electron energy loss | Show1D |
| `tomo` | tomography | Show3DVolume |
| `diffraction` | diffraction patterns | Show2D |
| `image` | virtual/derived images | Show2D, Mark2D |
| `complex` | ptychography | ShowComplex2D |

## Metadata schema

Every uploaded dataset gets a paired `.json` sidecar with metadata.

**Required fields:**

| field | description |
|-------|-------------|
| `schema_version` | current: `"1.0"` |
| `name` | must match the dataset name |
| `technique` | must be one of the valid techniques above |
| `description` | one-line human description |
| `data.shape` | must match the actual array shape |
| `data.dtype` | e.g. `"float32"` |
| `attribution.contributor` | who uploaded the data |
| `attribution.license` | must be open, e.g. `"CC-BY-4.0"` |

**Optional fields:**

| field | description |
|-------|-------------|
| `instrument.microscope` | e.g. `"JEOL JEM-2100F"` |
| `instrument.voltage_kv` | accelerating voltage |
| `instrument.detector` | e.g. `"Gatan OneView"` |
| `calibration.pixel_size` | with `pixel_size_unit` |
| `processing.source` | provenance info |
| `attribution.institution` | lab or university |
| `attribution.date` | upload date |
| `attribution.doi` | publication DOI if applicable |

## Custom metadata

By default, `upload()` generates a metadata template.
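For reference, a filled-in sidecar covering every required field (plus a couple of optional ones) might look like the dict below. All values here are hypothetical; in practice the sidecar is produced for you by `upload()` or `make_template()`:

```python
import json

# Hypothetical sidecar for a 512x512 HRTEM image. Field names follow the
# schema tables above; the values are illustrative.
sidecar = {
    "schema_version": "1.0",
    "name": "silicon_110_hrtem",
    "technique": "hrtem",
    "description": "Silicon [110] zone axis, HRTEM at 200 kV",
    "data": {"shape": [512, 512], "dtype": "float32"},
    "attribution": {
        "contributor": "Jane Doe",
        "license": "CC-BY-4.0",
        "institution": "Example University",  # optional
    },
    "instrument": {"microscope": "JEOL JEM-2100F", "voltage_kv": 200},  # optional
}

print(json.dumps(sidecar, indent=2))
```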
For full control, pass your own metadata dict or JSON file:

```python
from quantem.data import upload, make_template

# Generate a template and customize it
meta = make_template(
    name="silicon_110_hrtem",
    technique="hrtem",
    shape=(512, 512),
    description="Silicon [110] zone axis, HRTEM at 200 kV",
    contributor="Jane Doe",
)
meta["instrument"] = {
    "microscope": "JEOL JEM-2100F",
    "voltage_kv": 200,
    "detector": "Gatan OneView",
}
meta["calibration"] = {
    "pixel_size": 0.15,
    "pixel_size_unit": "angstrom",
}

upload(data, name="silicon_110_hrtem", technique="hrtem", metadata=meta)
```

Or from a JSON file:

```python
upload(data, name="silicon_110_hrtem", technique="hrtem", metadata="metadata.json")
```

## Validate before uploading

```python
from quantem.data import validate, make_template

meta = make_template(name="test", technique="hrtem", shape=(256, 256))
errors = validate(meta)
if errors:
    for e in errors:
        print(f"  - {e}")
else:
    print("Valid!")
```

## Update existing metadata

To update metadata for a dataset that's already uploaded:

```python
from quantem.data import update_metadata

update_metadata("silicon_110_hrtem", {
    "calibration": {"pixel_size": 0.148, "pixel_size_unit": "angstrom"},
})
```

This also creates a PR by default.

## Direct commits (maintainers only)

If you have write access to the repo, you can skip the PR:

```python
upload(data, name="...", technique="...", create_pr=False)
update_metadata("...", {...}, create_pr=False)
```

```bash
quantem-data upload data.npy --name ... --technique ... --direct
```