CI regression gate

Fail a build when an eval run scores worse than a baseline run. You call one endpoint with the run you just ingested, and the server returns a single passed boolean your pipeline can branch on.

The gate only fails on a real regression. A change counts as a regression only when it exceeds a small epsilon and also passes a significance test: McNemar's test for single-run pass/fail outcomes, and a paired permutation test with a bootstrap interval otherwise. Judge noise alone should not flake your pipeline.
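To make the significance check concrete, here is a minimal sketch of an exact two-sided McNemar test on paired pass/fail outcomes. It is an illustration, not the server's implementation: the function name `mcnemar_exact_p`, the choice of the exact two-sided variant, and the 0.05 threshold are all assumptions, since this page does not state which variant, alpha, or epsilon the server uses.

```python
from math import comb


def mcnemar_exact_p(improved: int, regressed: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Only items whose pass/fail outcome flipped between the two runs carry
    information; items that stayed the same are ignored.
    """
    n = improved + regressed
    if n == 0:
        return 1.0  # no flips at all, so no evidence of a change
    k = min(improved, regressed)
    # Under the null hypothesis each flip is equally likely in either
    # direction, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


ALPHA = 0.05  # hypothetical threshold, not taken from the server

# One item improved and twelve regressed: unlikely to be judge noise.
print(mcnemar_exact_p(1, 12) < ALPHA)  # True
# Four flips each way: consistent with noise.
print(mcnemar_exact_p(4, 4) < ALPHA)   # False
```

The point of the test is that a handful of flips in both directions does not count as a regression, even if the raw pass rate dipped slightly.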

Call the endpoint

POST /api/v1/experiments/{experimentId}/gate

Send the run you want to check:

{
  "candidateRunId": "<run you just ingested>",
  "baselineRunId": "<optional explicit baseline>",
  "baselineBranch": "<optional, e.g. master>"
}

candidateRunId is the only required field. The run must be terminal (SUCCESS or FAILED).

Leave baselineRunId out and the server picks one for you: the most recent successful run of the same experiment on the same dataset version. Set baselineBranch to limit that search to one branch.

When no baseline exists, status is NO_BASELINE and passed is true, because a first run has nothing to regress against.
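The default baseline lookup can be pictured with the sketch below. It is only an illustration under assumptions: the `Run` record and its field names (`dataset_version`, `branch`, `finished_at`) are hypothetical, excluding the candidate itself is a guess, and the server's actual query is not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    # Hypothetical run record; field names are illustrative only.
    id: str
    experiment_id: str
    dataset_version: int
    branch: str
    status: str       # e.g. "SUCCESS" or "FAILED"
    finished_at: int  # any sortable timestamp


def pick_baseline(candidate: Run, runs: list[Run],
                  baseline_branch: Optional[str] = None) -> Optional[Run]:
    """Return the most recent successful comparable run, or None."""
    eligible = [
        r for r in runs
        if r.id != candidate.id
        and r.experiment_id == candidate.experiment_id
        and r.dataset_version == candidate.dataset_version
        and r.status == "SUCCESS"
        and (baseline_branch is None or r.branch == baseline_branch)
    ]
    # A None result corresponds to the NO_BASELINE outcome.
    return max(eligible, key=lambda r: r.finished_at, default=None)
```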

The gate is a POST, so it needs a write-capable API key when the server has DOKIMOS_API_KEY set.
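As a rough sketch of the call, the snippet below builds the request with Python's standard library. The host name, the ids, and the helper name `build_gate_request` are placeholders, and the `Authorization: Bearer` header is an assumption: this page does not say how the API key is transmitted, so check your server's configuration before relying on it.

```python
import json
import urllib.request
from typing import Optional


def build_gate_request(server_url: str, experiment_id: str,
                       candidate_run_id: str,
                       baseline_branch: Optional[str] = None,
                       api_key: Optional[str] = None) -> urllib.request.Request:
    """Build the POST request for the gate endpoint without sending it."""
    payload = {"candidateRunId": candidate_run_id}  # the only required field
    if baseline_branch is not None:
        payload["baselineBranch"] = baseline_branch
    headers = {"Content-Type": "application/json"}
    if api_key:
        # ASSUMPTION: bearer-token auth. The real header may differ.
        headers["Authorization"] = f"Bearer {api_key}"
    url = f"{server_url.rstrip('/')}/api/v1/experiments/{experiment_id}/gate"
    return urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers=headers, method="POST",
    )


req = build_gate_request("https://dokimos.example.com", "exp-123", "run-456",
                         baseline_branch="master")
print(req.get_method(), req.full_url)
# To send it: result = json.load(urllib.request.urlopen(req))
```

Building the request separately from sending it keeps the payload easy to inspect in a dry run before the pipeline depends on it.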

Read the response

The response is a flat GateResult. Branch your build on passed:

{
  "status": "PASS | FAIL | NO_BASELINE",
  "passed": true,
  "candidateRunId": "...",
  "baselineRunId": "...",
  "pairing": "dataset_item_id | positional | none",
  "baselinePassRate": 0.88,
  "candidatePassRate": 0.82,
  "passRateDelta": -0.06,
  "significant": true,
  "improvedCount": 3,
  "regressedCount": 5,
  "unchangedCount": 40,
  "addedCount": 0,
  "removedCount": 0,
  "regressedEvaluators": [
    {
      "evaluator": "faithfulness",
      "baselineMean": 0.91,
      "candidateMean": 0.70,
      "delta": -0.21,
      "pValue": 0.011
    }
  ],
  "cases": [
    {
      "datasetItemId": "...",
      "index": "...",
      "evaluatorDrops": [
        {
          "evaluator": "faithfulness",
          "baselineMean": 1.0,
          "candidateMean": 0.0,
          "delta": -1.0
        }
      ]
    }
  ],
  "casesTruncated": false
}

What the key fields mean:

| Field | Meaning |
| --- | --- |
| `passed` | The single boolean CI branches on. `false` only when `status` is `FAIL`. |
| `status` | `PASS`, `FAIL`, or `NO_BASELINE`. |
| `pairing` | How items were matched: `dataset_item_id`, `positional`, or `none` (for `NO_BASELINE`). |
| `passRateDelta` | Candidate pass rate minus baseline pass rate. |
| `significant` | Whether the pass-rate change passed the significance test. |
| `regressedCount` | The authoritative count of significantly regressed items. |
| `regressedEvaluators` | Every evaluator flagged as a significant regression. |
| `cases` | Up to 50 regressed items with their per-evaluator score drops. |
| `casesTruncated` | `true` when `regressedCount` is larger than the returned `cases` list. |

Cases pair by dataset_item_id when both runs ran against the same dataset version and every item is linked. Otherwise pairing falls back to position. The cases list is capped at 50, so read regressedCount for the real total and check casesTruncated to know whether the cap was hit.
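One way a CI script might consume the response is sketched below. The helper name `summarize_gate` and the summary wording are illustrative choices; only the field names come from the GateResult shape shown on this page.

```python
def summarize_gate(result: dict) -> tuple[int, str]:
    """Map a GateResult dict to a process exit code and a short summary."""
    status = result["status"]
    if status == "NO_BASELINE":
        return 0, "NO_BASELINE: nothing to compare against, gate passes"
    lines = [
        f"{status}: pass rate {result['baselinePassRate']:.2f} -> "
        f"{result['candidatePassRate']:.2f} "
        f"({result['regressedCount']} regressed items)"
    ]
    for ev in result.get("regressedEvaluators", []):
        lines.append(f"  {ev['evaluator']}: {ev['delta']:+.2f} (p={ev['pValue']})")
    if result.get("casesTruncated"):
        # The cases list is capped, so report the authoritative total too.
        shown = len(result.get("cases", []))
        lines.append(f"  showing {shown} of {result['regressedCount']} regressed cases")
    # Branch on passed, not on a re-derived rule.
    return (0 if result["passed"] else 1), "\n".join(lines)


example = {
    "status": "FAIL", "passed": False,
    "baselinePassRate": 0.88, "candidatePassRate": 0.82,
    "regressedCount": 5, "casesTruncated": False, "cases": [],
    "regressedEvaluators": [
        {"evaluator": "faithfulness", "delta": -0.21, "pValue": 0.011},
    ],
}
exit_code, summary = summarize_gate(example)
print(summary)
```

In a real pipeline the script would finish with `sys.exit(exit_code)` so the build step fails exactly when `passed` is `false`.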

Run it from GitHub Actions

A composite action under .github/actions/eval-gate calls the endpoint for you. It writes a job summary, posts a sticky pull-request comment, and fails the step on a FAIL verdict.

- name: Eval gate
  uses: dokimos-dev/dokimos/.github/actions/eval-gate@v0
  with:
    server-url: ${{ secrets.DOKIMOS_SERVER_URL }}
    api-key: ${{ secrets.DOKIMOS_API_KEY }}
    experiment-id: ${{ env.EXPERIMENT_ID }}
    candidate-run-id: ${{ env.RUN_ID }}
    baseline-branch: master

candidate-run-id is the run id you get back when your test job reports results through DokimosServerReporter.

Two optional inputs relax the action's default behavior:

Set fail-on-regression: "false" to post the comment without blocking the merge.
Set comment: "false" to skip the PR comment.

This page covers the server-based gate. If you would rather keep the baseline in git and run the gate as an ordinary test with no server, see Regression gate (server-free).

Next steps

Comparing runs: read the same comparison item by item in the web UI
Regression alerting: get a webhook on the same regression the gate fails on
Server datasets: pin a run to a dataset version so the gate compares like for like
