MCP Server

Run Dokimos evaluations straight from a chat with your AI agent, no code and no build.

The Dokimos MCP server exposes the evaluation framework as tools for LLM agents. Connect it to any Model Context Protocol client (Claude Desktop, Claude Code, Cursor, and others). Then you can run evaluations, list past runs, compare runs, and inspect failures by asking in plain language.

Run with Docker

The published image ships everything the server needs. You do not need a JDK or a local build. Add this block to your MCP client config:

{
  "mcpServers": {
    "dokimos": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "OPENAI_API_KEY",
        "-v", "dokimos-mcp:/home/dokimos/.dokimos",
        "-v", "/absolute/path/to/datasets:/data:ro",
        "ghcr.io/dokimos-dev/dokimos-mcp-server:latest"
      ],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Replace two values:

  • OPENAI_API_KEY: replace the sk-... placeholder in the env block with your OpenAI key. The run_evaluation tool calls OpenAI and needs it.
  • /absolute/path/to/datasets: the folder on your machine that holds your dataset files.

Three flags do the work:

  • -i keeps stdin open. The server speaks JSON-RPC over stdin and stdout, so this flag is required.
  • -v dokimos-mcp:/home/dokimos/.dokimos mounts a named volume that persists run results across restarts. This keeps list_experiments and compare_runs working.
  • -v /absolute/path/to/datasets:/data:ro makes your dataset files visible inside the container, read-only. Inside the container they live under /data, so pass dataset_path as an in-container path, for example /data/qa-pairs.json.
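Because the server only sees the mounted folder, every dataset_path has to be rewritten from a host path to its /data equivalent. A minimal sketch of that translation, assuming the mount shown in the config above; the helper name to_container_path and the example host paths are hypothetical and not part of Dokimos:

```python
from pathlib import PurePosixPath


def to_container_path(host_file: str, host_root: str, mount_point: str = "/data") -> str:
    """Translate a host dataset path into the path visible inside the container.

    host_root is the folder mounted with -v host_root:/data:ro (an assumption
    taken from the config above). Raises ValueError if host_file is outside it.
    """
    # relative_to raises ValueError when host_file is not under host_root
    relative = PurePosixPath(host_file).relative_to(PurePosixPath(host_root))
    return str(PurePosixPath(mount_point) / relative)


# Hypothetical host layout, for illustration only
print(to_container_path("/home/me/datasets/qa-pairs.json", "/home/me/datasets"))
```

A file outside the mounted folder has no in-container path at all, which is why the helper raises rather than guessing.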

Restart your MCP client after editing the config. The four Dokimos tools then show up in the client.

No Docker?

Build a self-contained JAR from source and run it with java -jar. See the module README.

Tools

The server provides four tools. Each one maps to one thing you ask for in chat.

run_evaluation

Loads a dataset, calls the model for each example, evaluates the outputs, and returns summary metrics plus a run ID. Save the run ID. You pass it to the other tools.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_path | string | yes | | Path to the dataset file (.json, .csv, or .jsonl) |
| model | string | no | gpt-5.5 | OpenAI model name |
| temperature | number | no | model default | Sampling temperature, 0.0 to 2.0. Omitted when unset, so the model uses its own default |
| evaluator | string | no | exact_match | exact_match or llm_judge |
| criteria | string | no | | Evaluation criteria. Used by the llm_judge evaluator |
| threshold | number | no | 0.7 | Score threshold for pass/fail |
| experiment_name | string | no | mcp-evaluation | Name for this experiment |
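Your MCP client normally builds the tool call for you, but it can help to see what one looks like on the wire. The sketch below assembles a JSON-RPC tools/call request for run_evaluation using the parameters listed above. The tools/call envelope follows the Model Context Protocol; the helper name and the client-side checks are illustrative assumptions, and a real session would first complete the MCP initialize handshake.

```python
import json

# Evaluator names taken from the parameter list above
ALLOWED_EVALUATORS = {"exact_match", "llm_judge"}


def build_run_evaluation_request(request_id: int, dataset_path: str, **options) -> dict:
    """Build a JSON-RPC 2.0 tools/call message for the run_evaluation tool.

    Hypothetical helper: the validation mirrors the documented ranges but is
    not the server's own validation logic.
    """
    evaluator = options.get("evaluator")
    if evaluator is not None and evaluator not in ALLOWED_EVALUATORS:
        raise ValueError(f"unknown evaluator: {evaluator}")

    temperature = options.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")

    # Leave unset options out so the server applies its documented defaults
    arguments = {"dataset_path": dataset_path}
    arguments.update({key: value for key, value in options.items() if value is not None})

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "run_evaluation", "arguments": arguments},
    }


request = build_run_evaluation_request(
    1, "/data/qa-pairs.json", evaluator="llm_judge", criteria="Answer is factually correct"
)
print(json.dumps(request, indent=2))
```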

list_experiments

Lists past evaluation runs with their run IDs, timestamps, dataset names, and summary metrics. Filter by dataset name when you only want one dataset's history.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| limit | integer | no | 20 | Maximum number of runs to return |
| dataset_name | string | no | | Filter to runs that used this dataset name |

compare_runs

Compares two runs side by side. Reports metric deltas and flags regressions. Treat run_id_a as the baseline and run_id_b as the new run.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| run_id_a | string | yes | | First run ID (baseline) |
| run_id_b | string | yes | | Second run ID (comparison) |
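To make "metric deltas" and "regressions" concrete, here is a rough sketch of one way such a comparison could work. It is not the server's implementation: it assumes each run's summary is a flat mapping of metric name to number where higher is better, and the metric names are made up for illustration.

```python
def compare_metrics(baseline: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    """Compute candidate minus baseline for every metric both runs share.

    Assumption: higher values are better, so a drop larger than `tolerance`
    is flagged as a regression. Metrics present in only one run are skipped.
    """
    report = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        report[name] = {
            "baseline": baseline[name],
            "candidate": candidate[name],
            "delta": delta,
            "regression": delta < -tolerance,
        }
    return report


# Hypothetical summaries for run_id_a (baseline) and run_id_b (new run)
run_a = {"pass_rate": 0.75, "mean_score": 0.5}
run_b = {"pass_rate": 0.5, "mean_score": 0.75}
for metric, row in compare_metrics(run_a, run_b).items():
    print(metric, row["delta"], "REGRESSION" if row["regression"] else "ok")
```

The direction of the subtraction is why the order of the two IDs matters: swapping them turns every improvement into an apparent regression.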

get_failing_queries

Returns the examples from a run whose evaluator scores fell below a threshold. Each result includes the input, expected output, actual output, and per-evaluator detail.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| run_id | string | yes | | Run ID to inspect |
| threshold | number | no | 0.5 | Score below which a query counts as failing |
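The threshold rule can be pictured as a simple filter. The sketch below is only an illustration under assumed data: the record layout (input, expected, actual, and a scores mapping keyed by evaluator name) is hypothetical, since the stored result format is not described here.

```python
def failing_queries(results: list, threshold: float = 0.5) -> list:
    """Return the examples where any evaluator score is strictly below threshold.

    Assumed record shape (not the documented format):
    {"input": ..., "expected": ..., "actual": ..., "scores": {evaluator: score}}
    """
    failing = []
    for result in results:
        low = {name: score for name, score in result["scores"].items() if score < threshold}
        if low:
            # Keep the full record and note which evaluators fell short
            failing.append({**result, "failed_evaluators": sorted(low)})
    return failing


# Made-up examples for illustration
results = [
    {"input": "Capital of France?", "expected": "Paris", "actual": "Paris",
     "scores": {"exact_match": 1.0}},
    {"input": "2 + 2?", "expected": "4", "actual": "5",
     "scores": {"exact_match": 0.0}},
]
for item in failing_queries(results):
    print(item["input"], item["failed_evaluators"])
```

Note that this threshold (default 0.5) is separate from the pass/fail threshold of run_evaluation (default 0.7), so an example can fail its run yet not appear in this list.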

Storage

Runs persist to ~/.dokimos/mcp-results.json. Inside the container, the ~/.dokimos directory is /home/dokimos/.dokimos. The named volume in the Docker config mounts at that directory, so run history survives container restarts.

Example session

Once connected, drive evaluations by asking. A typical flow:

> Run an evaluation on /data/qa-pairs.json using gpt-5.5 with the llm_judge evaluator
> Show me the failing queries from that run
> Now compare it with run abc123

The first message runs run_evaluation and returns a run ID. The second runs get_failing_queries on that run. The third runs compare_runs against an earlier run.
