# Benchmark Ownership

`ufp.benchmarks` is an expert benchmark API for speed gates, benchmark
automation, and performance investigations. Benchmarks are not part of default
CI, but they are the acceptance layer for refactors that may affect runtime
behavior. Smoke tests in
`tests/speed/test_benchmarking.py` protect entry points and result shape; benchmark
runs protect relative timing.

## Quick Smoke Checks

Run these after changing benchmark modules or public benchmark exports:

```sh
python -m pytest tests/speed/test_benchmarking.py
```

Run speed gates after touching three-body evaluation, least-squares assembly,
block matrices, or cache warming:

```sh
python -m pytest tests/speed
tox -e speed
```

## Area To Benchmark Map

| Refactor area | Required checks |
| --- | --- |
| Pair and two-body term evaluation | Two-body tests, training tests that use pair terms, and speed gates. |
| Three-body bucketing or evaluator dispatch | `tests/terms/test_threebody_*.py`, `tests/leastsquares/test_periodic_assembly.py`, speed gates, and a dynamic three-body benchmark comparison. |
| Three-body feature caches or memmap caches | Three-body cache reuse tests, training cache tests, speed gates, and a three-body cache benchmark comparison. |
| Least-squares assembly or block matrices | Least-squares periodic tests, alchemical tests, speed gates, and a least-squares-vs-training benchmark comparison. |
| Training batch caching | Training tests, workflow example tests, and speed gates. |
| Runtime backend option parsing | Three-body tests, least-squares periodic tests, benchmark smoke tests, and explicit environment override tests. |
| Examples and docs only | Docs build or targeted example tests; no benchmark is required unless executable workflow code changes. |

## Benchmark Commands

Least-squares versus training toy benchmark:

```sh
python -m ufp.benchmarks._leastsquares_vs_training --scenario triangle_pair_threebody --device cpu --dtype float64 --training-epochs 4 --cg-checkpoints 1,2,3,4
```

Named A/B checkpoints for least-squares and training:

```sh
python -m ufp.benchmarks._leastsquares_vs_training --scenario pair_only --checkpoint baseline --checkpoint cached_neighbor_lists --device cpu
```

Three-body dynamic and cache benchmarks currently expose Python entry points.
Use a short script when comparing backends or refactors:

```python
from ufp.benchmarks import (
    run_threebody_cache_benchmark,
    run_threebody_dynamic_breakdown_benchmark,
)

print(
    run_threebody_dynamic_breakdown_benchmark(
        scenario="ternary_alloy",
        backend="torch",
        device="cpu",
        dtype="float64",
        repeats=20,
        warmup=5,
    )
)
print(
    run_threebody_cache_benchmark(
        scenario="ternary_alloy",
        backend="torch",
        device="cpu",
        dtype="float64",
        repeats=20,
        warmup=5,
    )
)
```
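The `repeats` and `warmup` arguments above follow the usual timing-loop convention: run the workload a few times untimed, then record a fixed number of timed runs. The sketch below illustrates that convention with a hypothetical helper, `time_callable`. It is not the implementation used inside `ufp.benchmarks`, whose internals and return values are not described here.

```python
import statistics
import time


def time_callable(fn, repeats=20, warmup=5):
    """Time `fn` with untimed warmup runs followed by timed repeats.

    Hypothetical helper for illustration only; it does not mirror the
    internals of the ufp benchmark entry points.
    """
    # Warmup runs absorb one-off costs such as lazy imports or cache fills.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to scheduler noise than the mean.
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "samples": samples,
    }


stats = time_callable(lambda: sum(range(10_000)), repeats=20, warmup=5)
```

Keeping `repeats` and `warmup` identical between the baseline and the refactored run is what makes the two sets of samples comparable.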

If native C++ or CUDA kernels are part of the change, build the extension first,
then rerun the relevant benchmarks with `backend="native"`, or on a CUDA device
where one is available.
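One way to repeat a benchmark across backends without editing the script by hand is sketched below. It is a minimal example under stated assumptions: `NATIVE_MODULE` is a placeholder name (the real extension module name is not given in this document), and `available_backends` and `run_for_backends` are hypothetical helpers, not part of `ufp.benchmarks`.

```python
import importlib.util

# Placeholder: replace with the real import name of the compiled extension.
NATIVE_MODULE = "ufp_native_extension_placeholder"


def available_backends(native_module=NATIVE_MODULE):
    """Return the backends to benchmark on this machine.

    "native" is included only when the extension module can be located.
    """
    backends = ["torch"]
    try:
        found = importlib.util.find_spec(native_module) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        found = False
    if found:
        backends.append("native")
    return backends


def run_for_backends(benchmark, backends, **common):
    """Call `benchmark` once per backend with otherwise identical settings."""
    return {name: benchmark(backend=name, **common) for name in backends}


def fake_benchmark(**kwargs):
    """Stand-in for a real benchmark entry point; echoes its arguments."""
    return kwargs


results = run_for_backends(
    fake_benchmark,
    available_backends(),
    scenario="ternary_alloy",
    device="cpu",
    dtype="float64",
    repeats=20,
    warmup=5,
)
```

In a real comparison, pass `run_threebody_cache_benchmark` or `run_threebody_dynamic_breakdown_benchmark` in place of `fake_benchmark`. Passing the shared settings once keeps the scenario, device, dtype, and repeat counts identical across backends.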

## Acceptance Rules

- Compare against a baseline measured on the same machine and with the same
  device, dtype, scenario, and repeat count.
- Keep correctness tests as the first gate. A faster run with changed numerical
  behavior is not accepted.
- Treat `tests/speed/` as a protected contract. Do not relax a gate
  as part of a refactor unless the team has agreed that the measured workload or
  hardware assumption changed.
- For CPU-only refactors, CPU benchmark parity is sufficient unless the code
  also changes CUDA dispatch or tensor-device movement.
- For backend dispatch refactors, verify both available and unavailable native
  extension paths so fallback behavior remains explicit.
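The first two rules can be expressed as a small check, sketched below. Everything in it is hypothetical: `RunConfig`, `check_acceptance`, and the 5% `max_slowdown` tolerance are illustrative assumptions, not values or types defined by `ufp` or by `tests/speed/`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings that must match before two timings are compared."""

    machine: str
    device: str
    dtype: str
    scenario: str
    repeats: int


def check_acceptance(
    baseline_cfg,
    candidate_cfg,
    baseline_seconds,
    candidate_seconds,
    correctness_passed,
    max_slowdown=1.05,  # assumed tolerance; pick one that suits the workload
):
    """Return (accepted, reason) for a candidate run against a baseline."""
    # Correctness is the first gate: speed never compensates for wrong numbers.
    if not correctness_passed:
        return False, "correctness tests failed"
    # Timings from different machines or settings are not comparable.
    if baseline_cfg != candidate_cfg:
        return False, "baseline and candidate configurations differ"
    ratio = candidate_seconds / baseline_seconds
    if ratio > max_slowdown:
        return False, f"candidate is {ratio:.2f}x the baseline time"
    return True, f"candidate is {ratio:.2f}x the baseline time"


cfg = RunConfig("host-a", "cpu", "float64", "ternary_alloy", 20)
accepted, reason = check_acceptance(cfg, cfg, 1.00, 1.02, correctness_passed=True)
```

A check like this only supplements the gates in `tests/speed/`; it does not replace or relax them.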