Comparing runs

The diff view shows what changed between two runs of the same experiment, item by item, so you can see what a change affected before it ships.

It is the same comparison the CI gate and regression alerting act on, shown as a table you can read.

Comparing two runs: the pass-rate movement, improved and regressed counts, a significance verdict, and a per-case delta of every evaluator score

Get a diff in one call

Compare a candidate run against a baseline run:

curl 'http://localhost:8080/api/v1/experiments/{experimentId}/runs/{candidateRunId}/diff?baselineRunId={baselineRunId}'

You get back a summary (the headline movement) and a page of cases (one row per item).

Two roles matter here:

The candidate is the run under review (the new side).
The baseline is what you compare against (the old side, usually the previous successful run).

baselineRunId is required. Both runs must be in a terminal state. Comparing an in-flight run would be misleading, so the API returns HTTP 409 (Conflict) if either run has not finished.
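
The same call can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function names are hypothetical, and it assumes the endpoint returns a JSON body and needs no authentication, neither of which is stated above.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080/api/v1"  # host and port from the curl example


def diff_url(experiment_id: str, candidate_run_id: str, baseline_run_id: str) -> str:
    """Build the diff URL. baselineRunId is required, so it is always sent."""
    path = "/experiments/{}/runs/{}/diff".format(
        urllib.parse.quote(experiment_id, safe=""),
        urllib.parse.quote(candidate_run_id, safe=""),
    )
    query = urllib.parse.urlencode({"baselineRunId": baseline_run_id})
    return f"{BASE_URL}{path}?{query}"


def fetch_diff(experiment_id: str, candidate_run_id: str, baseline_run_id: str):
    """Return the parsed diff, or None when the API answers 409."""
    url = diff_url(experiment_id, candidate_run_id, baseline_run_id)
    try:
        with urllib.request.urlopen(url) as response:
            return json.load(response)  # assumption: the body is JSON
    except urllib.error.HTTPError as err:
        if err.code == 409:
            return None  # at least one run is not terminal yet
        raise
```

Returning None on 409 lets a caller wait and retry once both runs have finished, while any other HTTP error still surfaces as an exception.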

Open the diff in the UI

From a run, open the comparison against its baseline. You land on this page in the web UI:

/experiments/{experimentId}/runs/{candidateRunId}/diff

The candidate is the run you opened. The baseline is the run you compare it against.

Filter the case list

By default the case list returns every item. Add the status parameter to narrow it:

curl 'http://localhost:8080/api/v1/experiments/{experimentId}/runs/{candidateRunId}/diff?baselineRunId={baselineRunId}&status=REGRESSED'

`status` value	Returns
`ALL` (default)	Every item
`REGRESSED`	Items that got worse
`IMPROVED`	Items that got better
`CHANGED`	Items that regressed or improved

The case list is pageable. Use the standard page and size query parameters.
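
As an illustration, the query string for a filtered, paged request could be assembled like this. The helper name is hypothetical, and the sketch deliberately leaves page and size out unless you pass them, because the default page size and whether pages start at 0 or 1 are not stated here.

```python
from typing import Optional
from urllib.parse import urlencode

# The four status values listed in the table above.
VALID_STATUSES = ("ALL", "REGRESSED", "IMPROVED", "CHANGED")


def diff_query(
    baseline_run_id: str,
    status: str = "ALL",
    page: Optional[int] = None,
    size: Optional[int] = None,
) -> str:
    """Build the query string for the diff endpoint."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}, got {status!r}")
    params = {"baselineRunId": baseline_run_id, "status": status}
    if page is not None:
        params["page"] = page
    if size is not None:
        params["size"] = size
    return urlencode(params)
```

Rejecting an unknown status on the client side avoids a round trip for a typo such as WORSE.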

Read the summary

The summary reports the whole-run movement.

Field	Meaning
`baselinePassRate`, `candidatePassRate`, `passRateDelta`	Pass rate on each side, and candidate minus baseline
`significant`	Whether the pass-rate change is statistically significant rather than noise
`improvedCount`, `regressedCount`, `unchangedCount`	How items moved between the runs
`addedCount`, `removedCount`	Items present in only one of the two runs
`pairing`	How items were matched: `dataset_item_id` (matched one to one by id) or `positional` (matched by position)
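
To show how these fields fit together, here is a small sketch that turns a summary into a one-line report. It assumes the summary arrives as a dictionary keyed by the field names in the table; the function name is hypothetical.

```python
def describe_summary(summary: dict) -> str:
    """Render the headline movement of a diff summary as one line."""
    delta = summary["passRateDelta"]  # candidate minus baseline
    if delta > 0:
        direction = "up"
    elif delta < 0:
        direction = "down"
    else:
        direction = "flat"
    verdict = "significant" if summary["significant"] else "not significant"
    line = (
        f"pass rate {summary['baselinePassRate']} -> {summary['candidatePassRate']} "
        f"({direction}, {verdict}); "
        f"{summary['regressedCount']} regressed, "
        f"{summary['improvedCount']} improved, "
        f"{summary['unchangedCount']} unchanged"
    )
    # Items present in only one run are worth calling out separately.
    if summary["addedCount"] or summary["removedCount"]:
        line += f"; {summary['addedCount']} added, {summary['removedCount']} removed"
    if summary["pairing"] == "positional":
        line += "; items matched by position"
    return line
```

The report reads the significant flag as returned and does not recompute it, so it always agrees with the verdict the API gave.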

Read a case

Each case is one item compared across the two runs. A case carries:

status: REGRESSED, IMPROVED, UNCHANGED, ADDED, or REMOVED.
passFlip: true when the item flipped between pass and fail.
input: the item's input text.
evaluators: the per-evaluator deltas, so you can see which evaluator moved.

Each entry in evaluators has the evaluator name, its baselineMean and candidateMean, the delta (candidate minus baseline), a per-evaluator status (IMPROVED, REGRESSED, or UNCHANGED), and a significant flag for that evaluator's change.
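
A short sketch of reading one case follows. It assumes the case is a dictionary and that the evaluator name is stored under a name key, which is a guess at the key spelling; the sample values are made up for illustration.

```python
def evaluators_that_moved(case: dict) -> list:
    """Return (name, delta) for evaluators with a significant change, largest move first."""
    moved = [
        (entry["name"], entry["delta"])  # "name" is an assumed key
        for entry in case["evaluators"]
        if entry["significant"] and entry["status"] != "UNCHANGED"
    ]
    return sorted(moved, key=lambda pair: abs(pair[1]), reverse=True)


# Made-up example case, shaped after the fields described above.
example_case = {
    "status": "REGRESSED",
    "passFlip": True,
    "input": "What is the refund window?",
    "evaluators": [
        {"name": "correctness", "baselineMean": 0.9, "candidateMean": 0.4,
         "delta": -0.5, "status": "REGRESSED", "significant": True},
        {"name": "tone", "baselineMean": 0.8, "candidateMean": 0.8,
         "delta": 0.0, "status": "UNCHANGED", "significant": False},
    ],
}

for name, delta in evaluators_that_moved(example_case):
    print(f"{name}: {delta:+.2f}")
```

Sorting by absolute delta puts the evaluator responsible for the largest move at the top, whichever direction it moved.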

How significance gating works

A change counts as a regression only when it clears two bars:

It is beyond a small epsilon (not a rounding wobble).
It is statistically significant.
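
The two bars can be sketched as a single predicate. The threshold values below are placeholders, since the actual epsilon and significance level are not given here, and the sketch assumes a higher score is better, so a regression is a negative delta.

```python
def is_regression(
    delta: float,
    p_value: float,
    epsilon: float = 0.01,  # placeholder: the real epsilon is not stated
    alpha: float = 0.05,    # placeholder: the real significance level is not stated
) -> bool:
    """True only when the drop exceeds epsilon AND is statistically significant."""
    beyond_epsilon = delta < -epsilon  # delta is candidate minus baseline
    significant = p_value < alpha
    return beyond_epsilon and significant
```

Both conditions must hold: a large drop with a high p-value does not count, and neither does a significant but tiny drop.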

The test depends on the data:

McNemar's test for single-run pass/fail flips.
A paired permutation test with a bootstrap interval otherwise.
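
For readers unfamiliar with the two tests, here is a minimal sketch of each. It uses the exact (binomial) form of McNemar's test and a sign-flip paired permutation test. Which variants the product runs, the number of permutation rounds, and the function names are all assumptions, and the bootstrap interval is left out for brevity.

```python
import random
from math import comb


def mcnemar_exact(pass_to_fail: int, fail_to_pass: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = pass_to_fail + fail_to_pass
    if n == 0:
        return 1.0  # no item flipped, so there is nothing to test
    k = min(pass_to_fail, fail_to_pass)
    # Probability of k or fewer flips in one direction if both were equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_permutation_p(baseline, candidate, rounds: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the mean paired difference, by random sign flips."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must pair one to one")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    hits = 0
    for _ in range(rounds):
        # Under the null hypothesis each difference is equally likely to be + or -.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (rounds + 1)  # add-one correction avoids p = 0
```

Both functions only look at items that appear in both runs. McNemar's test ignores items whose pass/fail result did not change, because only the flips carry information about a shift.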

A noisy judge nudging one item does not register as a regression. That is what keeps the gate and alerts from flaking. The significant flag in the summary is that same gate, surfaced so you can tell a real move from sampling noise.

Next steps

CI regression gate: turn this comparison into a build pass or fail.
Regression alerting: get a webhook when a comparison regresses.
