Benchmark Ownership

ufp.benchmarks is an expert benchmark API for speed gates, benchmark automation, and performance investigations. Benchmarks are not part of default CI, but they are the acceptance layer for refactors that may affect runtime behavior. Smoke tests in tests/speed/test_benchmarking.py protect entry points and result shape; benchmark runs protect relative timing.

Quick Smoke Checks

Run the smoke tests after changing benchmark modules or public benchmark exports:

python -m pytest tests/speed/test_benchmarking.py

Run speed gates after touching three-body evaluation, least-squares assembly, block matrices, or cache warming:

python -m pytest tests/speed
tox -e speed

Area To Benchmark Map

Refactor area: Required checks

  • Pair and two-body term evaluation: Two-body tests, training tests that use pair terms, and speed gates.

  • Three-body bucketing or evaluator dispatch: tests/terms/test_threebody_*.py, tests/leastsquares/test_periodic_assembly.py, speed gates, and a dynamic three-body benchmark comparison.

  • Three-body feature caches or memmap caches: Three-body cache reuse tests, training cache tests, speed gates, and a three-body cache benchmark comparison.

  • Least-squares assembly or block matrices: Least-squares periodic tests, alchemical tests, speed gates, and a least-squares-vs-training benchmark comparison.

  • Training batch caching: Training tests, workflow example tests, and speed gates.

  • Runtime backend option parsing: Three-body tests, least-squares periodic tests, benchmark smoke tests, and explicit environment override tests.

  • Examples and docs only: Docs build or targeted example tests; no benchmark is required unless executable workflow code changes.
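
When a refactor touches several areas at once, the required checks are the union of the matching rows. The sketch below shows one way to compute that union. The dictionary keys and the checks_for helper are hypothetical names chosen for illustration, not part of ufp; the check descriptions are transcribed from the map above.

```python
# Hypothetical lookup table transcribed from the area-to-benchmark map.
# Keys and helper names are illustrative and do not exist in ufp.
REQUIRED_CHECKS = {
    "pair_terms": [
        "two-body tests",
        "training tests that use pair terms",
        "speed gates",
    ],
    "threebody_dispatch": [
        "tests/terms/test_threebody_*.py",
        "tests/leastsquares/test_periodic_assembly.py",
        "speed gates",
        "dynamic three-body benchmark comparison",
    ],
    "threebody_caches": [
        "three-body cache reuse tests",
        "training cache tests",
        "speed gates",
        "three-body cache benchmark comparison",
    ],
    "leastsquares_assembly": [
        "least-squares periodic tests",
        "alchemical tests",
        "speed gates",
        "least-squares-vs-training benchmark comparison",
    ],
    "training_batch_caching": [
        "training tests",
        "workflow example tests",
        "speed gates",
    ],
    "backend_option_parsing": [
        "three-body tests",
        "least-squares periodic tests",
        "benchmark smoke tests",
        "explicit environment override tests",
    ],
    "docs_only": ["docs build or targeted example tests"],
}


def checks_for(areas):
    """Return the de-duplicated union of required checks, in first-seen order."""
    seen = []
    for area in areas:
        for check in REQUIRED_CHECKS[area]:
            if check not in seen:
                seen.append(check)
    return seen
```

For example, checks_for(["pair_terms", "training_batch_caching"]) lists the speed gates once even though both rows require them.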

Benchmark Commands

Least-squares versus training toy benchmark:

python -m ufp.benchmarks._leastsquares_vs_training --scenario triangle_pair_threebody --device cpu --dtype float64 --training-epochs 4 --cg-checkpoints 1,2,3,4

Named A/B checkpoints for least-squares and training:

python -m ufp.benchmarks._leastsquares_vs_training --scenario pair_only --checkpoint baseline --checkpoint cached_neighbor_lists --device cpu
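
For benchmark automation, the A/B command above can be assembled programmatically. This is a minimal sketch: build_ab_command is a hypothetical helper, and it uses only the flags that appear in the command above (--scenario, repeated --checkpoint, and --device).

```python
import sys

# Module path taken from the command above.
MODULE = "ufp.benchmarks._leastsquares_vs_training"


def build_ab_command(scenario, checkpoints, device="cpu"):
    """Build the argv list for one A/B run (hypothetical helper)."""
    argv = [sys.executable, "-m", MODULE, "--scenario", scenario]
    for name in checkpoints:
        # One --checkpoint flag per named checkpoint, as in the command above.
        argv += ["--checkpoint", name]
    argv += ["--device", device]
    return argv
```

Passing the returned list to subprocess.run(argv, check=True) runs the benchmark with the current interpreter and raises if the process exits with a non-zero status.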

Three-body dynamic and cache benchmarks currently expose Python entry points. Use a short script when comparing backends or refactors:

from ufp.benchmarks import (
    run_threebody_cache_benchmark,
    run_threebody_dynamic_breakdown_benchmark,
)

# Share one settings dict so both runs use identical parameters.
settings = dict(
    scenario="ternary_alloy",
    backend="torch",
    device="cpu",
    dtype="float64",
    repeats=20,
    warmup=5,
)

# Dynamic three-body benchmark.
print(run_threebody_dynamic_breakdown_benchmark(**settings))
# Three-body cache benchmark.
print(run_threebody_cache_benchmark(**settings))

If native C++ or CUDA kernels are part of the change, build the extension first and repeat the relevant commands with backend="native" or CUDA devices where available.
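
Repeating a benchmark across backends can be wrapped in a small loop. The sketch below is one possible shape, not part of ufp: compare_backends is a hypothetical helper that accepts any benchmark callable taking a backend keyword. It assumes an unavailable backend raises an exception, which is not confirmed by this document, so it records the error rather than hiding it.

```python
def compare_backends(benchmark_fn, backends, **common_kwargs):
    """Run benchmark_fn once per backend with otherwise identical arguments.

    Returns a dict mapping backend name to ("ok", result) or ("error", text).
    Hypothetical helper; assumes a missing backend raises an exception.
    """
    results = {}
    for backend in backends:
        try:
            results[backend] = ("ok", benchmark_fn(backend=backend, **common_kwargs))
        except Exception as exc:
            # Record the failure so a missing native extension stays visible.
            results[backend] = ("error", repr(exc))
    return results
```

A call might pass run_threebody_dynamic_breakdown_benchmark as benchmark_fn, ["torch", "native"] as backends, and the same scenario, device, dtype, repeats, and warmup values used in the script above.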

Acceptance Rules

  • Compare runs only when they share the same machine, device, dtype, scenario, and repeat counts.

  • Keep correctness tests as the first gate. A faster run with changed numerical behavior is not accepted.

  • Treat tests/speed/ as a protected contract. Do not relax a gate as part of a refactor unless the team has agreed that the measured workload or hardware assumption changed.

  • For CPU-only refactors, CPU benchmark parity is sufficient unless the code also changes CUDA dispatch or tensor-device movement.

  • For backend dispatch refactors, verify both available and unavailable native extension paths so fallback behavior remains explicit.
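
The first two rules can be expressed as a small check. This is a minimal sketch under stated assumptions: RunConfig, check_acceptance, and the max_slowdown threshold are hypothetical and are not defined by ufp or by this guide, and the timings are assumed to be wall-clock seconds for the same workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings that must match between baseline and candidate runs."""
    machine: str
    device: str
    dtype: str
    scenario: str
    repeats: int


def check_acceptance(baseline_cfg, candidate_cfg, baseline_seconds,
                     candidate_seconds, correctness_passed, max_slowdown=1.05):
    """Return (accepted, reason). max_slowdown is an illustrative threshold."""
    # Correctness is the first gate: a faster but wrong run is rejected.
    if not correctness_passed:
        return False, "correctness tests failed"
    # Timings are only comparable under identical run settings.
    if baseline_cfg != candidate_cfg:
        return False, "run configurations differ"
    if candidate_seconds > baseline_seconds * max_slowdown:
        return False, "candidate is slower than the allowed threshold"
    return True, "accepted"
```

The order of the checks mirrors the rules: correctness is evaluated before any timing comparison, and mismatched configurations are rejected before the timings are read.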