Workflow Management

This page collects the orchestration helpers that sit outside the model-evaluation hot path and are used by examples and small studies. These helpers make staged experiments reproducible; they are not alternate model-evaluation kernels.

Stages

ufp.workflows exposes small stage objects for explicit scripts:

  • ProjectStage runs an offline projection helper and stores its result in the context mapping;

  • LinearFitStage wraps LinearFitter;

  • ResidualizeStage materializes residual labels;

  • TrainStage runs gradient training, optionally with coefficient freezes;

  • ValidateStage records evaluation metrics.

Stages declare required inputs, produced outputs, and metadata, but users still own the sequence and context updates:

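from ufp.workflows import LinearFitStage, workflow_stage_metadata

# model and samples are assumed to be defined earlier in the script.
context = {"model": model, "fit_samples": samples}
stage = LinearFitStage(fit_kwargs={"batch_size": 16})
result = stage.run(context)
# Merge the stage outputs back into the context explicitly.
result.update_context(context)

# Keep the collected metadata, e.g. for a later workflow checkpoint.
stage_metadata = workflow_stage_metadata([result], name="pair-refit")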
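
The contract described above (declared inputs, declared outputs, caller-owned context) can be illustrated with a small stand-alone sketch. It does not use ufp.workflows: SketchScaleStage, SketchResult, and the requires/produces attribute names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class SketchResult:
    """Hypothetical result: holds outputs until the caller merges them."""
    stage_name: str
    outputs: Dict[str, Any] = field(default_factory=dict)

    def update_context(self, context: Dict[str, Any]) -> None:
        # The caller decides when produced outputs enter the shared context.
        context.update(self.outputs)


class SketchScaleStage:
    """Hypothetical stage that declares its inputs and outputs up front."""
    requires: Tuple[str, ...] = ("values",)
    produces: Tuple[str, ...] = ("scaled_values",)

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def run(self, context: Dict[str, Any]) -> SketchResult:
        # Fail early if a declared input is absent from the context.
        missing = [key for key in self.requires if key not in context]
        if missing:
            raise KeyError(f"missing required inputs: {missing}")
        scaled = [self.factor * v for v in context["values"]]
        return SketchResult("scale", {"scaled_values": scaled})


context = {"values": [1.0, 2.0, 3.0]}
result = SketchScaleStage(factor=2.0).run(context)
context_untouched_by_run = "scaled_values" not in context
result.update_context(context)
```

In this sketch, run() never writes to the context; outputs only appear after the explicit update_context() call, which keeps the sequencing visible in the script.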

Checkpoints

Use workflow checkpoints when a staged script needs enough metadata to validate that a later reload matches the model layout:

from ufp.workflows import save_workflow_checkpoint

save_workflow_checkpoint(
    "workflow.pt",
    model,
    fit_blocks=fit_selectors,
    freeze_blocks=freeze_selectors,
    stage_metadata=stage_metadata,
    validation_metrics=metrics,
)

Checkpoints include package version, model and term metadata, coefficient layout, selector metadata, fixed-coefficient hashes, stage metadata, projection diagnostics, validation metrics, user metadata, and the model state_dict.
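
As a rough illustration of how stored metadata can be compared against a reloaded model, the sketch below hashes fixed coefficient values and reports which metadata keys differ. It is a minimal stand-alone example, not the checkpoint format itself: coefficient_hash, layout_mismatches, and the "layout" and "fixed_hash" keys are hypothetical names.

```python
import hashlib
import struct
from typing import Dict, List, Sequence


def coefficient_hash(values: Sequence[float]) -> str:
    """Hash coefficient values via their little-endian float64 bytes."""
    payload = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(payload).hexdigest()


def layout_mismatches(saved: Dict[str, object], current: Dict[str, object]) -> List[str]:
    """Return the metadata keys whose saved and current values differ."""
    keys = saved.keys() | current.keys()
    return sorted(k for k in keys if saved.get(k) != current.get(k))


fixed = [0.5, -1.25, 3.0]
saved_meta = {"layout": {"pair": 3}, "fixed_hash": coefficient_hash(fixed)}
same_meta = {"layout": {"pair": 3}, "fixed_hash": coefficient_hash(fixed)}
# One fixed coefficient changed after the checkpoint was written.
drifted_meta = {"layout": {"pair": 3}, "fixed_hash": coefficient_hash([0.5, -1.25, 3.1])}
```

A reload check along these lines would accept same_meta and flag "fixed_hash" for drifted_meta.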

Residualization

materialize_residual_dataset() writes residual energy, force, or stress labels into an ASEAtomsDataset. Use it when a longer training run should subtract frozen priors or fixed spline blocks once, then optimize on the residual labels. Residual metadata records selectors, target weights, units, frozen-term state hashes, and optional projection metadata so stale residual data can be rejected.
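
The arithmetic behind residual labels is a plain subtraction of the frozen contributions from the reference labels. The sketch below shows it for energies and per-atom forces with ordinary Python lists; residual_energy and residual_forces are hypothetical helpers, not part of materialize_residual_dataset().

```python
from typing import List, Sequence

Vector = Sequence[float]


def residual_energy(label: float, frozen_energies: Sequence[float]) -> float:
    """Subtract every frozen term's energy from the reference label."""
    return label - sum(frozen_energies)


def residual_forces(
    label: Sequence[Vector],
    frozen_forces: Sequence[Sequence[Vector]],
) -> List[List[float]]:
    """Subtract frozen per-atom forces component by component."""
    residual = [list(atom) for atom in label]
    for term in frozen_forces:
        for i, atom in enumerate(term):
            for axis, value in enumerate(atom):
                residual[i][axis] -= value
    return residual


# One structure with two atoms and two frozen energy terms.
energy = residual_energy(-10.0, [-7.5, -2.0])
forces = residual_forces(
    [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]],
    [[[0.25, 0.0, -0.5], [0.5, 0.0, 0.0]]],
)
```

Because the frozen terms do not change during the later optimization, this subtraction only needs to happen once, which is the motivation for materializing the residual dataset.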

Prepared Geometry

ufp.workflows.prepared can materialize tensorized geometry, neighbor lists, pair categories, optional triplet-cache metadata, and strict source signatures. It is intentionally imported directly from ufp.workflows.prepared rather than exported from top-level ufp.workflows.

Prepared geometry is useful for cache-reuse experiments and workflow validation. It is not a runtime input path for model evaluation, and it should not acquire hot-path checks or tensor transformations that belong inside terms.

Caching

Large least-squares or three-body studies can write assembled batches, normal-equation components, CG checkpoints, and dense feature caches. Cache manifests include enough metadata to reject incompatible sample sets, target weights, dtypes, layouts, coefficient selections, fixed-coefficient values, and regularization semantics.

Use ufp.cache for settings-addressed cache identities and human-readable cache summaries. Top-level ufp convenience exports expose the same common helpers for scripts. ufp.workflows.cache is a compatibility alias for older workflow code; it is not the owner of cache identity policy.
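
One common way to build a settings-addressed identity is to hash a canonical serialization of the settings, so that any change in a relevant setting yields a different key. The sketch below assumes JSON-serializable settings; cache_key and is_compatible are hypothetical names and do not describe the ufp.cache API.

```python
import hashlib
import json
from typing import Any, Dict


def cache_key(settings: Dict[str, Any]) -> str:
    """Derive a stable identity from settings, independent of key order."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def is_compatible(manifest: Dict[str, Any], settings: Dict[str, Any]) -> bool:
    """Accept a cached artifact only if its manifest key matches exactly."""
    return manifest.get("key") == cache_key(settings)


settings = {"dtype": "float64", "target_weights": {"energy": 1.0, "forces": 0.1}}
manifest = {"key": cache_key(settings)}
# Same settings in a different insertion order map to the same key.
reordered = {"target_weights": {"forces": 0.1, "energy": 1.0}, "dtype": "float64"}
# A changed dtype must not reuse the cached artifact.
changed = dict(settings, dtype="float32")
```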

Use disk-backed caches for repeated solves over fixed geometries. Prefer ordinary in-memory assembly for small models, early debugging, and one-off experiments.

Regularization Tuning

ufp.workflows.regularization adds a reusable layer for choosing linear least-squares ridge weights. It first estimates a scale from the weighted design matrix,

\[ \lambda_g = \alpha \frac{\operatorname{trace}(\mathbf A_g^\mathsf T \mathbf A_g)} {n_g}, \]

then searches log-spaced candidates for ridge, onebody_ridge, twobody_ridge, and threebody_ridge. Pair and triplet counts are useful diagnostics, but the default is based on design-block scale because that is what sets the data curvature seen by each coefficient group.
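
The scale estimate and a log-spaced candidate grid can be sketched in plain Python. Note that trace(AᵀA) equals the sum of squared entries of A. The sketch makes two assumptions that the text above does not confirm: n_g is taken to be the number of columns (coefficients) in the block, and candidates are spaced by powers of ten around the scale. block_scale and log_spaced are hypothetical names.

```python
from typing import List, Sequence


def block_scale(block: Sequence[Sequence[float]], alpha: float) -> float:
    """alpha * trace(A^T A) / n, with n assumed to be the column count."""
    if not block or not block[0]:
        raise ValueError("design block must be non-empty")
    n_columns = len(block[0])
    # trace(A^T A) is the sum of squared entries of A.
    trace = sum(value * value for row in block for value in row)
    return alpha * trace / n_columns


def log_spaced(center: float, decades: int, per_decade: int = 1) -> List[float]:
    """Candidates center * 10**(k / per_decade) spanning +/- decades."""
    steps = decades * per_decade
    return [center * 10.0 ** (k / per_decade) for k in range(-steps, steps + 1)]


design_block = [[1.0, 2.0], [3.0, 4.0]]  # trace(A^T A) = 1 + 4 + 9 + 16 = 30
scale = block_scale(design_block, alpha=0.5)
candidates = log_spaced(scale, decades=1)
```

Centering the grid on the block scale means the same relative candidates remain meaningful when the design matrix is rescaled or when the sample count changes.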

from ufp.workflows import (
    RegularizationSearchConfig,
    save_workflow_checkpoint,
    tune_linear_regularization,
    workflow_stage_metadata,
)

search = tune_linear_regularization(
    make_model,
    dataset,
    config=RegularizationSearchConfig(
        stage_subset_sizes=(64, 256),
        cache_directory="regularization-cache",
        refit_full=True,
    ),
    fitter_kwargs={
        "fit_energy": True,
        "fit_forces": True,
        "solver": "normal_equation_direct",
        "dtype": dtype,
    },
    fit_kwargs={"batch_size": 64},
)

stage_metadata = workflow_stage_metadata(
    [search.metadata],
    name="regularization-search",
)
save_workflow_checkpoint(
    "regularized-workflow.pt",
    search.final_model,
    stage_metadata=stage_metadata,
    validation_metrics=search.metadata,
)

When no validation split is present, tuning carves a deterministic validation subset from the training indices and leaves holdout indices untouched. Each candidate fit uses a fresh model built by model_factory (make_model in the example above), so search trials do not mutate a caller-owned model.
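
A deterministic carve of this kind can be sketched with a seeded shuffle over the sorted training indices. This is an illustration under stated assumptions, not the selection rule used by tune_linear_regularization: carve_validation, the fraction parameter, and the seed parameter are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def carve_validation(
    train_indices: Sequence[int],
    fraction: float,
    seed: int,
) -> Tuple[List[int], List[int]]:
    """Split training indices into (fit, validation) reproducibly."""
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(train_indices)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(round(fraction * len(shuffled))))
    validation = sorted(shuffled[:n_val])
    fit = sorted(shuffled[n_val:])
    return fit, validation


train = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
holdout = [10, 11, 12]  # never passed in, so it cannot be selected
fit_a, val_a = carve_validation(train, fraction=0.2, seed=0)
fit_b, val_b = carve_validation(list(reversed(train)), fraction=0.2, seed=0)
```

Both calls return the same split because the indices are sorted before the seeded shuffle, and the holdout indices are never candidates for the validation subset.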