Skip to main content

Comparing runs

The diff view shows you what changed between two runs of the same experiment, item by item, so you can see what a change moved before it ships.

It is the same comparison the CI gate and regression alerting act on, shown as a table you can read.

Comparing two runs: the pass-rate movement, improved and regressed counts, a significance verdict, and a per-case delta of every evaluator score

Get a diff in one call

Compare a candidate run against a baseline run:

curl 'http://localhost:8080/api/v1/experiments/{experimentId}/runs/{candidateRunId}/diff?baselineRunId={baselineRunId}'

You get back a summary (the headline movement) and a page of cases (one row per item).

Two roles matter here:

  • The candidate is the run under review (the new side).
  • The baseline is what you compare against (the old side, usually the previous successful run).

baselineRunId is required. Both runs must be terminal. Comparing an in-flight run would be misleading, so the API returns 409 if either run has not finished.

Open the diff in the UI

From a run, open the comparison against its baseline. You land on this page in the web UI:

/experiments/{experimentId}/runs/{candidateRunId}/diff

The candidate is the run you opened. The baseline is the run you compare it against.

Filter the case list

By default the case list returns every item. Add the status parameter to narrow it:

curl 'http://localhost:8080/api/v1/experiments/{experimentId}/runs/{candidateRunId}/diff?baselineRunId={baselineRunId}&status=REGRESSED'
status valueReturns
ALL (default)Every item
REGRESSEDItems that got worse
IMPROVEDItems that got better
CHANGEDItems that regressed or improved

The case list is pageable. Use the standard page and size query parameters.

Read the summary

The summary reports the whole-run movement.

FieldMeaning
baselinePassRate, candidatePassRate, passRateDeltaPass rate on each side, and candidate minus baseline
significantWhether the pass-rate change is statistically significant, not noise
improvedCount, regressedCount, unchangedCountHow items moved between the runs
addedCount, removedCountItems present in only one of the two runs
pairingHow items were matched: dataset_item_id (matched one to one by id) or positional (matched by position)

Read a case

Each case is one item compared across the two runs. A case carries:

  • status: REGRESSED, IMPROVED, UNCHANGED, ADDED, or REMOVED.
  • passFlip: true when the item flipped between pass and fail.
  • input: the item's input text.
  • evaluators: the per-evaluator deltas, so you can see which evaluator moved.

Each entry in evaluators has the evaluator name, its baselineMean and candidateMean, the delta (candidate minus baseline), a per-evaluator status (IMPROVED, REGRESSED, or UNCHANGED), and a significant flag for that evaluator's change.

How significance gating works

A change counts as a regression only when it clears two bars:

  1. It is beyond a small epsilon (not a rounding wobble).
  2. It is statistically significant.

The test depends on the data:

  • McNemar's test for single-run pass/fail flips.
  • A paired permutation test with a bootstrap interval otherwise.

A noisy judge nudging one item does not register as a regression. That is what keeps the gate and alerts from flaking. The significant flag in the summary is that same gate, surfaced so you can tell a real move from sampling noise.

Next steps

For AI agentsView as Markdown