Eval Harness & Metrics

run_benchmark executes a model over a sequence of BenchmarkFixture objects and returns a BenchmarkReport whose metrics dict contains the standard OM-018 metric bundle.

Metric Bundle

| Metric | Path | Gating? | Description |
| --- | --- | --- | --- |
| Latency p50 | `latency.p50_ms` | No | Median steady-state fixture latency in ms. |
| Latency p95 | `latency.p95_ms` | No | 95th-percentile steady-state fixture latency in ms. |
| Latency count | `latency.count` | No | Number of steady-state fixtures (excludes cold start). |
| Cold-start latency | `latency.cold_start_ms` | No | Wall-clock latency of the first fixture call in ms. |
| Peak RSS | `resources.peak_rss_bytes` | No | Peak resident set size in bytes during the run. |
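
The Path column reads naturally as a dotted path into the nested metrics dict. The sketch below shows one way to resolve such a path. It is illustrative only: `get_metric` and `sample_metrics` are hypothetical names and values that are not part of the harness, and the nesting of `resources` is assumed to mirror that of `latency`.

```python
from functools import reduce


def get_metric(metrics, path):
    """Resolve a dotted path such as 'latency.p50_ms' in a nested dict."""
    return reduce(lambda node, key: node[key], path.split("."), metrics)


# Hypothetical values, shaped like the metric bundle described in the table.
sample_metrics = {
    "latency": {"p50_ms": 12.0, "p95_ms": 14.0, "count": 5, "cold_start_ms": 850.0},
    "resources": {"peak_rss_bytes": 512 * 1024 * 1024},
}

for path in (
    "latency.p50_ms",
    "latency.p95_ms",
    "latency.count",
    "latency.cold_start_ms",
    "resources.peak_rss_bytes",
):
    print(f"{path} = {get_metric(sample_metrics, path)}")
```

A missing key raises `KeyError`, which is usually preferable to silently returning a default when a dashboard or script depends on a metric being present.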

Edge Metrics

cold_start_ms

The harness records the wall-clock latency of the first fixture call separately. The default runner keeps a shared model loader for the duration of the benchmark run, so that first call includes model and tokenizer loading plus the first forward pass. Later fixture calls reuse the warmed loader and feed the steady-state latency summary. The value is surfaced at:

report.metrics['latency']['cold_start_ms']

It is excluded from the steady-state p50_ms, p95_ms, and count values.
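
As a rough illustration of that split, the sketch below treats the first per-fixture latency as the cold start and summarizes the rest. It is not the harness's implementation: `summarize_latencies` is a hypothetical helper, and the nearest-rank rule used for p95 is an assumption, since the percentile method the harness uses is not stated here.

```python
import statistics


def summarize_latencies(latencies_ms):
    """Split per-fixture latencies into cold-start and steady-state parts."""
    if not latencies_ms:
        raise ValueError("need at least one fixture latency")
    # The first call is the cold start; everything after it is steady state.
    cold_start_ms = latencies_ms[0]
    steady = sorted(latencies_ms[1:])
    summary = {"cold_start_ms": cold_start_ms, "count": len(steady)}
    if steady:
        summary["p50_ms"] = statistics.median(steady)
        # Nearest-rank p95 (assumed method): ceil(0.95 * n) in integer math.
        rank = (95 * len(steady) + 99) // 100
        summary["p95_ms"] = steady[rank - 1]
    return summary


# Hypothetical latencies: one slow first call, then five warm calls.
latency = summarize_latencies([850.0, 12.0, 10.0, 11.0, 14.0, 13.0])
print(latency)
```

With these made-up inputs, the 850 ms first call appears only under `cold_start_ms`, and `count` is 5 rather than 6.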

Reported, not gating

cold_start_ms does not participate in any release gate. It is an observability metric intended to track model-load overhead over time, not a pass/fail criterion.

report = run_benchmark(fixtures, suite="my-suite", model_name="my-model", runner=runner)
cold_ms = report.metrics["latency"]["cold_start_ms"]
print(f"Cold-start latency: {cold_ms:.1f} ms")
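
To see why the first call carries the load overhead, the toy sketch below times calls against a lazily initialized, shared loader. Everything in it is hypothetical (`LazyLoader`, `timed_calls`, and `slow_load` are stand-ins, and the 50 ms sleep simulates model and tokenizer loading); it does not reproduce the default runner's internals.

```python
import time


class LazyLoader:
    """Loads the model on first use and reuses it afterwards."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._model = None
        self.load_calls = 0

    def get(self):
        if self._model is None:
            self.load_calls += 1
            self._model = self._load_fn()
        return self._model


def timed_calls(loader, inputs):
    """Time each call, including whatever loading happens inside it."""
    latencies_ms = []
    for item in inputs:
        start = time.perf_counter()
        model = loader.get()  # triggers loading on the first call only
        model(item)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def slow_load():
    time.sleep(0.05)  # stand-in for model and tokenizer loading
    return lambda text: text.upper()


loader = LazyLoader(slow_load)
latencies_ms = timed_calls(loader, ["a", "b", "c"])
print([f"{ms:.2f}" for ms in latencies_ms])
```

The first entry absorbs the simulated load time while later entries reuse the warmed loader, which is the behavior that `cold_start_ms` is meant to isolate.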