# Workflow Management

This page collects the non-hot-path orchestration pieces used by examples and
small studies. These helpers make staged experiments reproducible; they are not
alternate model-evaluation kernels.

## Stages

`ufp.workflows` exposes small stage objects for explicit scripts:

- `ProjectStage` runs an offline projection helper and stores its result in the
  context mapping;
- `LinearFitStage` wraps `LinearFitter`;
- `ResidualizeStage` materializes residual labels;
- `TrainStage` runs gradient training, optionally with coefficient freezes;
- `ValidateStage` records evaluation metrics.

Stages declare required inputs, produced outputs, and metadata, but the calling
script still decides the order in which stages run and when each result is
merged into the context:

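```python
from ufp.workflows import LinearFitStage, workflow_stage_metadata

# `model` and `samples` are assumed to be built earlier in the script.
context = {"model": model, "fit_samples": samples}
stage = LinearFitStage(fit_kwargs={"batch_size": 16})

# The caller merges the stage outputs into the context explicitly.
result = stage.run(context)
result.update_context(context)

# Collect per-stage metadata, for example to store in a workflow checkpoint.
stage_metadata = workflow_stage_metadata([result], name="pair-refit")
```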
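The division of labor can be illustrated without `ufp` at all. The following is a minimal sketch built from hypothetical stand-ins (`ToyStage`, `ToyResult`, `run_stages`), not the library's classes. It assumes only what the text above states: a stage declares its required inputs and produced outputs, and the caller decides the order and when to merge results.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToyResult:
    """Hypothetical stand-in for a stage result."""
    name: str
    outputs: dict[str, Any]
    metadata: dict[str, Any] = field(default_factory=dict)

    def update_context(self, context: dict[str, Any]) -> None:
        # Nothing is merged implicitly; the caller chooses when to call this.
        context.update(self.outputs)


@dataclass
class ToyStage:
    """Hypothetical stage that declares what it needs and what it produces."""
    name: str
    requires: tuple[str, ...]
    produces: tuple[str, ...]
    fn: Callable[[dict[str, Any]], dict[str, Any]]

    def run(self, context: dict[str, Any]) -> ToyResult:
        missing = [key for key in self.requires if key not in context]
        if missing:
            raise KeyError(f"{self.name} is missing inputs: {missing}")
        outputs = self.fn(context)
        undeclared = set(outputs) - set(self.produces)
        if undeclared:
            raise ValueError(f"{self.name} produced undeclared keys: {sorted(undeclared)}")
        return ToyResult(self.name, outputs, {"requires": list(self.requires)})


def run_stages(stages, context):
    """Run stages in the caller's order, merging each result before the next."""
    results = []
    for stage in stages:
        result = stage.run(context)
        result.update_context(context)
        results.append(result)
    return results


fit = ToyStage("fit", ("samples",), ("coefficients",),
               lambda ctx: {"coefficients": [sum(ctx["samples"]) / len(ctx["samples"])]})
validate = ToyStage("validate", ("coefficients",), ("metrics",),
                    lambda ctx: {"metrics": {"n_coefficients": len(ctx["coefficients"])}})

context = {"samples": [1.0, 2.0, 3.0]}
results = run_stages([fit, validate], context)
```

Because each stage checks its declared inputs before running, a stage placed too early in the sequence fails immediately with a named missing key instead of producing a partial context.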

## Checkpoints

Use workflow checkpoints when a staged script needs enough metadata to validate
that a later reload matches the model layout:

```python
from ufp.workflows import save_workflow_checkpoint

save_workflow_checkpoint(
    "workflow.pt",
    model,
    fit_blocks=fit_selectors,
    freeze_blocks=freeze_selectors,
    stage_metadata=stage_metadata,
    validation_metrics=metrics,
)
```

Checkpoints include package version, model and term metadata, coefficient
layout, selector metadata, fixed-coefficient hashes, stage metadata, projection
diagnostics, validation metrics, user metadata, and the model `state_dict`.
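To make the reload check concrete, here is a small sketch of how layout and fixed-coefficient metadata could be compared. It is not the `ufp` checkpoint format: `coefficient_hash`, `layout_record`, and `check_reload` are hypothetical helpers, and hashing the raw float64 bytes is an assumption chosen for illustration.

```python
import hashlib
import json
import struct


def coefficient_hash(values):
    """Hash coefficients by their exact float64 bytes (hypothetical helper)."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def layout_record(layout, fixed):
    """Build the metadata a checkpoint would need for later validation.

    layout: mapping of block name -> number of coefficients
    fixed:  mapping of block name -> list of frozen coefficient values
    """
    return {
        "layout": dict(sorted(layout.items())),
        "fixed_hashes": {name: coefficient_hash(vals) for name, vals in sorted(fixed.items())},
    }


def check_reload(saved, current):
    """Return a list of human-readable mismatches; empty means compatible."""
    problems = []
    if saved["layout"] != current["layout"]:
        problems.append("coefficient layout differs")
    for name, digest in saved["fixed_hashes"].items():
        if current["fixed_hashes"].get(name) != digest:
            problems.append(f"fixed block {name!r} changed")
    return problems


saved = layout_record({"pair": 4, "onebody": 1}, {"onebody": [-3.5]})
# Round-trip through JSON to mimic metadata stored next to a state_dict.
saved = json.loads(json.dumps(saved))

same = layout_record({"onebody": 1, "pair": 4}, {"onebody": [-3.5]})
drifted = layout_record({"onebody": 1, "pair": 4}, {"onebody": [-3.4]})
```

Here `check_reload(saved, same)` returns an empty list, while `check_reload(saved, drifted)` reports the changed `onebody` block.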

## Residualization

`materialize_residual_dataset()` writes residual energy, force, or stress labels
into an `ASEAtomsDataset`. Use it when a longer training run should subtract
frozen priors or fixed spline blocks once, then optimize on the residual labels.
Residual metadata records selectors, target weights, units, frozen-term state
hashes, and optional projection metadata so stale residual data can be rejected.
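The idea of subtracting once and rejecting stale data can be sketched in plain Python. This is an illustration under assumptions, not the behavior of `materialize_residual_dataset()`: `materialize_residuals`, `require_fresh`, and the JSON-based `state_hash` are hypothetical, and the frozen prior is a toy per-atom reference energy.

```python
import hashlib
import json


def state_hash(state):
    """Hash a frozen term's parameters via canonical JSON (hypothetical scheme)."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def materialize_residuals(energies, frozen_predict, frozen_state):
    """Subtract the frozen contribution once and record what was subtracted."""
    residuals = [energy - frozen_predict(i) for i, energy in enumerate(energies)]
    metadata = {"target": "energy", "frozen_state_hash": state_hash(frozen_state)}
    return residuals, metadata


def require_fresh(metadata, frozen_state):
    """Reject residual labels computed against a different frozen state."""
    if metadata["frozen_state_hash"] != state_hash(frozen_state):
        raise ValueError("residual labels are stale: frozen term state changed")


# Toy frozen prior: a constant per-atom reference energy.
frozen_state = {"per_atom_energy": -1.5}
n_atoms = [2, 3, 4]
energies = [-3.25, -4.0, -6.5]


def frozen_predict(index):
    return frozen_state["per_atom_energy"] * n_atoms[index]


residuals, metadata = materialize_residuals(energies, frozen_predict, frozen_state)
require_fresh(metadata, frozen_state)  # passes: the frozen state is unchanged
```

Training on `residuals` then avoids re-evaluating the frozen term at every step, and `require_fresh` raises as soon as the frozen parameters no longer match the ones that were subtracted.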

## Prepared Geometry

`ufp.workflows.prepared` can materialize tensorized geometry, neighbor lists,
pair categories, optional triplet-cache metadata, and strict source signatures.
It is intentionally imported directly from `ufp.workflows.prepared` rather than
exported from top-level `ufp.workflows`.

Prepared geometry is useful for cache-reuse experiments and workflow validation.
It is not a runtime input path for model evaluation, and it should not acquire
hot-path checks or tensor transformations that belong inside terms.

## Caching

Large least-squares or three-body studies can write assembled batches,
normal-equation components, CG checkpoints, and dense feature caches. Cache
manifests include enough metadata to reject incompatible sample sets, target
weights, dtypes, layouts, coefficient selections, fixed-coefficient values, and
regularization semantics.

Use `ufp.cache` for settings-addressed cache identities and human-readable
cache summaries. Top-level `ufp` convenience exports expose the same common
helpers for scripts. `ufp.workflows.cache` is a compatibility alias for older
workflow code; it is not the owner of cache identity policy.
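As an illustration of what a settings-addressed identity means, the sketch below derives a key from a settings mapping so that equal settings always map to the same cache entry. It does not use or reproduce the `ufp.cache` API; `cache_identity` and `manifest_mismatches` are hypothetical names, and canonical JSON plus SHA-256 is an assumed scheme.

```python
import hashlib
import json


def cache_identity(settings):
    """Derive a stable identifier from settings (hypothetical scheme)."""
    # Sorted keys and fixed separators make the text independent of dict order.
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def manifest_mismatches(manifest, requested):
    """List the settings keys on which a cached manifest disagrees."""
    keys = sorted(set(manifest) | set(requested))
    return [key for key in keys if manifest.get(key) != requested.get(key)]


base = {
    "dtype": "float64",
    "sample_ids": [0, 1, 2],
    "target_weights": {"energy": 1.0, "forces": 0.1},
}
reordered = {
    "target_weights": {"forces": 0.1, "energy": 1.0},
    "sample_ids": [0, 1, 2],
    "dtype": "float64",
}
changed = dict(base, dtype="float32")
```

`base` and `reordered` produce the same identity, `changed` produces a different one, and `manifest_mismatches(base, changed)` names `dtype` as the reason, which is the kind of human-readable summary that makes a rejected cache easy to diagnose.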

Use disk-backed caches for repeated solves over fixed geometries. Prefer
ordinary in-memory assembly for small models, early debugging, and one-off
experiments.

## Regularization Tuning

`ufp.workflows.regularization` adds a reusable layer for choosing linear
least-squares ridge weights. For each coefficient group $g$, it first estimates
a scale from that group's block $\mathbf A_g$ of the weighted design matrix,

$$
\lambda_g = \alpha \frac{\operatorname{trace}(\mathbf A_g^\mathsf T \mathbf A_g)}
{n_g},
$$

then searches log-spaced candidates for `ridge`, `onebody_ridge`,
`twobody_ridge`, and `threebody_ridge`. Pair and triplet counts are useful
diagnostics, but the default is based on design-block scale because that is what
sets the data curvature seen by each coefficient group.

```python
from ufp.workflows import (
    RegularizationSearchConfig,
    save_workflow_checkpoint,
    tune_linear_regularization,
    workflow_stage_metadata,
)

search = tune_linear_regularization(
    make_model,
    dataset,
    config=RegularizationSearchConfig(
        stage_subset_sizes=(64, 256),
        cache_directory="regularization-cache",
        refit_full=True,
    ),
    fitter_kwargs={
        "fit_energy": True,
        "fit_forces": True,
        "solver": "normal_equation_direct",
        "dtype": dtype,
    },
    fit_kwargs={"batch_size": 64},
)

stage_metadata = workflow_stage_metadata(
    [search.metadata],
    name="regularization-search",
)
save_workflow_checkpoint(
    "regularized-workflow.pt",
    search.final_model,
    stage_metadata=stage_metadata,
    validation_metrics=search.metadata,
)
```
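The scale formula above can be checked by hand on a small block. The sketch below is not the library's implementation: `block_scale` and `log_spaced` are hypothetical helpers, the value of `alpha` is arbitrary, and taking $n_g$ to be the number of columns in the block is an assumption made for this example. It relies on the identity that $\operatorname{trace}(\mathbf A^\mathsf T \mathbf A)$ equals the sum of squared entries of $\mathbf A$.

```python
def block_scale(block, alpha=1e-6):
    """Return alpha * trace(A^T A) / n for one design block.

    `block` is a list of rows. trace(A^T A) equals the sum of squared entries,
    so A^T A never has to be formed. Using the column count for n is an
    assumption made for this sketch.
    """
    n_columns = len(block[0])
    trace = sum(value * value for row in block for value in row)
    return alpha * trace / n_columns


def log_spaced(center, decades=2, points_per_decade=1):
    """Candidates spanning `decades` orders of magnitude on either side of center."""
    steps = decades * points_per_decade
    return [center * 10.0 ** (k / points_per_decade) for k in range(-steps, steps + 1)]


pair_block = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]
scale = block_scale(pair_block, alpha=0.5)   # 0.5 * 31 / 2
candidates = log_spaced(scale, decades=1)    # one decade below, at, and above the scale
```

Centering the log-spaced grid on the block scale keeps the candidates comparable to the data curvature of that group, whatever the number of pairs or triplets that produced it.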

When no validation split is present, tuning carves a deterministic validation
subset from the training indices and leaves holdout indices untouched. Candidate
fits use isolated models built by `model_factory` (`make_model` in the example
above), so search trials do not mutate a caller-owned model.
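A deterministic carve of this kind can be sketched with a privately seeded random generator. The function below is a hypothetical illustration, not the splitting rule used by `tune_linear_regularization`; the `fraction` and `seed` defaults are assumptions.

```python
import random


def carve_validation(train_indices, fraction=0.2, seed=0):
    """Split training indices into (train, validation) reproducibly.

    The fraction and seed defaults are assumptions for this sketch.
    """
    ordered = sorted(train_indices)      # result does not depend on input order
    rng = random.Random(seed)            # private RNG; global random state untouched
    n_validation = max(1, round(fraction * len(ordered)))
    validation = sorted(rng.sample(ordered, n_validation))
    chosen = set(validation)
    train = [index for index in ordered if index not in chosen]
    return train, validation


train_indices = list(range(0, 20))
holdout_indices = list(range(20, 25))

train, validation = carve_validation(train_indices)
again = carve_validation(list(reversed(train_indices)))
```

Only `train_indices` is passed to the function, so the holdout set cannot leak into either part, and repeating the call with the same seed returns the same split even if the input order changes.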