Changelog

What shipped and when. For exact version diffs, see the GitHub releases.

v0.23.0

Latest · June 2026

Tool calls on conversation turns: an assistant message now carries the tool calls it made, so a multi-turn conversation can feed the agent tool evaluators turn by turn without an LLM call. The trajectory judge can also reason over tool usage.

Added

Tool calls on assistant turns. Message now carries a typed List<ToolCall>, set via Message.assistant(content, toolCalls) or the builder's assistantMessage(content, toolCalls) (and the Kotlin DSL's assistant(content, toolCalls)). A tool-free turn is unchanged, and the three-argument Message constructor still resolves, so existing code compiles and runs as before.
Per-turn tool evaluation. ConversationTrajectory.toolCallsByTurn() returns one tool-call list per assistant turn, in order, so each turn can be scored with the deterministic agent evaluators, and toolCalls() flattens them into one list. toTestCase() and toTestCase(tools) build a deterministic case whose input is the last user message, while toTestCase(tools, tasks) builds the judge case for TaskCompletionEvaluator and ToolArgumentHallucinationEvaluator over the whole conversation, with tool calls rendered name-only so their arguments stay out of the hallucination grounding. toAgentTrace() and toAgentOutputs() are also available for the standard agent output map.
Tool calls in the trajectory transcript. toText() and toJson() render each turn's tool calls, and TrajectoryEvaluator.includeToolCalls(true) adds them to the judge prompt. Both are off by default for tool-free conversations, whose rendered output stays byte-identical to prior versions.
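
To make the per-turn shape concrete, here is a minimal sketch of grouping tool calls by assistant turn and then flattening them. It does not use the Dokimos classes: Msg, Call, byTurn, and flatten are hypothetical stand-ins, and the choice to emit an empty list for a tool-free assistant turn is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class PerTurnToolCallsSketch {
    // Hypothetical stand-ins, NOT the Dokimos Message / ToolCall types.
    record Call(String name) {}
    record Msg(String role, String content, List<Call> toolCalls) {}

    // One list per assistant turn, in conversation order.
    // Assumption: a tool-free assistant turn contributes an empty list.
    static List<List<Call>> byTurn(List<Msg> conversation) {
        List<List<Call>> perTurn = new ArrayList<>();
        for (Msg m : conversation) {
            if ("assistant".equals(m.role())) {
                perTurn.add(List.copyOf(m.toolCalls()));
            }
        }
        return perTurn;
    }

    // Flattens the per-turn lists into one list, preserving order.
    static List<Call> flatten(List<List<Call>> perTurn) {
        List<Call> all = new ArrayList<>();
        perTurn.forEach(all::addAll);
        return all;
    }

    public static void main(String[] args) {
        List<Msg> conversation = List.of(
                new Msg("user", "Find a peated whisky", List.of()),
                new Msg("assistant", "Searching", List.of(new Call("search"), new Call("rank"))),
                new Msg("user", "Thanks", List.of()),
                new Msg("assistant", "You're welcome", List.of()));
        List<List<Call>> perTurn = byTurn(conversation);
        System.out.println(perTurn);
        System.out.println(flatten(perTurn));
    }
}
```

Keeping the grouped form is what lets a deterministic evaluator score each turn separately, while the flattened form suits a whole-conversation check.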

v0.22.0

June 2026

A server-free regression gate: commit a baseline next to your test and fail the build when quality drops, with the same verdict locally and in CI. No server, account, or API key for the gate itself.

Added

Server-free regression gate. Assertions.assertNoRegression(result, "name") compares a fresh experiment result against a committed baseline at src/test/resources/dokimos/baselines/<name>.json and throws on a real regression. Either of two guards can trigger the failure: a significance test on the aggregate pass rate and per-evaluator means (quiet on judge noise), and a localized-severity check that catches a single item breaking hard. The first local run writes the baseline and fails once so that you can review and commit it.
Kotlin assertNoRegression. ExperimentResult.assertNoRegression(...) is available as a Kotlin extension so the gate reads naturally from a Kotlin test.
CI report action. The eval-gate-report composite action renders each per-baseline verdict JSON under target/dokimos into the job summary and a sticky PR comment, and fails the step on a regression. Pair it with if: always() so the comment posts even after a failing build.
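
The two-guard idea behind the gate can be sketched as follows. The changelog does not say which significance test or thresholds Dokimos uses, so everything here is an assumption for illustration: a one-sided two-proportion z-test stands in for the significance guard, a fixed per-item score drop stands in for the localized-severity guard, and all names are hypothetical.

```java
final class RegressionGateSketch {
    // Guard 1 (assumed test): one-sided two-proportion z-test.
    // True when the fresh pass rate is significantly below the baseline pass rate.
    static boolean passRateDropped(int basePass, int baseN, int freshPass, int freshN, double zCritical) {
        if (baseN == 0 || freshN == 0) {
            return false; // nothing to compare
        }
        double baseRate = (double) basePass / baseN;
        double freshRate = (double) freshPass / freshN;
        double pooled = (double) (basePass + freshPass) / (baseN + freshN);
        double standardError = Math.sqrt(pooled * (1 - pooled) * (1.0 / baseN + 1.0 / freshN));
        if (standardError == 0.0) {
            return false; // both runs all-pass or all-fail: no evidence of a drop
        }
        return (baseRate - freshRate) / standardError > zCritical;
    }

    // Guard 2 (assumed rule): any single item whose score fell by at least maxDrop.
    static boolean anyItemBrokeHard(double[] baseScores, double[] freshScores, double maxDrop) {
        int n = Math.min(baseScores.length, freshScores.length);
        for (int i = 0; i < n; i++) {
            if (baseScores[i] - freshScores[i] >= maxDrop) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A two-point dip on 100 items is treated as noise; a twenty-point dip is not.
        System.out.println(passRateDropped(90, 100, 88, 100, 1.645));
        System.out.println(passRateDropped(90, 100, 70, 100, 1.645));
        // One item collapsing trips the second guard even when the aggregate barely moves.
        System.out.println(anyItemBrokeHard(new double[] {1.0, 0.9}, new double[] {1.0, 0.2}, 0.5));
    }
}
```

Pairing the guards covers both failure shapes: a broad but small drop that only statistics can separate from judge noise, and a single hard break that an aggregate would hide.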

v0.21.0

June 2026

Cost, token, and latency metrics across all five framework adapters with a pluggable pricing seam, plus two new agent integrations (Embabel and Spring AI Alibaba) that capture an agent run as an AgentTrace for the agent evaluators.

Added

Cost, token & latency metrics. Capture per-call tokensIn/tokensOut, costUsd, and latencyMs across all five adapters via measured tasks (measuredTask/measuredAsyncTask/measuredTextTask, and EmbabelTraceCollector.callMetrics). Cost is composed at capture time through a pluggable PriceTable seam in dokimos-core. Dokimos ships no price data; you supply the map. The run detail rolls up Total Tokens, Total Cost, and Avg Latency.
Partial cost-coverage signal. When a run mixes priced and unpriced items, the run-detail Total Cost card shows an N/M items priced subtitle so a partial total is never mistaken for a complete one. Computed at read time on RunDetails — no new column, no migration.
Embabel integration. dokimos-embabel captures an Embabel agent run as an AgentTrace through an AgenticEventListener: call EmbabelSupport.attach(...), run the agent, then read collector.trace(). Requires Java 21, since Embabel ships Java 21 bytecode; the rest of Dokimos stays on Java 17.
Spring AI Alibaba integration. dokimos-spring-ai-alibaba folds a graph run's OverAllState messages into one AgentTrace with per-turn result windowing, reusing the Spring AI message extraction. The entry point is SpringAiAlibabaSupport.toAgentTrace(...).
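
As an illustration of the cost and partial-coverage entries above, the sketch below prices calls from a caller-supplied map and reports how many items were priced. It is not the Dokimos PriceTable API: the Price, Call, and Rollup types, the per-million-token pricing unit, and the model names and prices are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;

final class CostCoverageSketch {
    // Hypothetical shapes; prices are in USD per one million tokens (an assumed unit).
    record Price(double usdPerMillionIn, double usdPerMillionOut) {}
    record Call(String model, long tokensIn, long tokensOut) {}
    record Rollup(double totalUsd, int pricedItems, int totalItems) {}

    // Empty when the caller-supplied table has no entry for the model.
    static OptionalDouble costUsd(Call call, Map<String, Price> table) {
        Price price = table.get(call.model());
        if (price == null) {
            return OptionalDouble.empty();
        }
        double in = call.tokensIn() * price.usdPerMillionIn() / 1_000_000.0;
        double out = call.tokensOut() * price.usdPerMillionOut() / 1_000_000.0;
        return OptionalDouble.of(in + out);
    }

    // Sums only priced calls and keeps the priced/total counts next to the total.
    static Rollup rollUp(List<Call> calls, Map<String, Price> table) {
        double total = 0.0;
        int priced = 0;
        for (Call call : calls) {
            OptionalDouble cost = costUsd(call, table);
            if (cost.isPresent()) {
                total += cost.getAsDouble();
                priced++;
            }
        }
        return new Rollup(total, priced, calls.size());
    }

    public static void main(String[] args) {
        Map<String, Price> table = Map.of("model-a", new Price(2.0, 4.0)); // made-up prices
        List<Call> calls = List.of(
                new Call("model-a", 1_000_000, 500_000),
                new Call("model-b", 10_000, 2_000)); // no price entry, so it stays unpriced
        Rollup rollup = rollUp(calls, table);
        System.out.println(rollup.totalUsd() + " USD, "
                + rollup.pricedItems() + "/" + rollup.totalItems() + " items priced");
    }
}
```

Carrying the priced count alongside the sum is what allows a reader to tell a partial total from a complete one.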

v0.20.0

June 2026

Typed, structured outputs end to end and non-blocking async task execution: return a POJO from a task, match it structurally, and drive an experiment from suspend or reactive code without a thread per example.

Added

Typed structured output. Task.typed(fn) lets a task return a record, list, or other POJO under the "output" key, and EvalTestCase.actualOutputAs(...) / expectedOutputAs(...) read it back type-safely via a Class<T> or an OutputType<T> super-type token for generics like List<Whisky>. A failed conversion throws DokimosTypeConversionException.
StructuralMatchEvaluator. Compares an expected structure against the actual one. STRICT requires the exact field set and array order; LENIENT allows extra fields and ignores array order. Both compare numbers by value. Scores the fraction of matching leaf paths by default, or call binary() for a 1.0/0.0 all-or-nothing score.
Async task execution. AsyncTask returns a CompletableFuture<TaskResult>, and Experiment.builder().asyncTask(...) runs it through a bounded async path that caps in-flight invocations with parallelism(int) — no thread parked per example.
Async and reactive adapters. Spring AI adds asyncTask(...) and reactiveTask(...), LangChain4j adds asyncTask(...) and asyncRagTask(...), and Koog adds asTask(...) / asTextTask(...). Each has an overload that takes an Executor so calls run on a pool you control.
Kotlin task DSL and ToolCall.resultJson. Kotlin adds typedTask<T> { ... } for returning a POJO directly and suspendTask { ... } for a suspend body, and ToolCall.Builder.resultJson(Object) serializes a structured tool result to compact JSON.
Spring AI tool-eval example. A runnable Spring AI whisky-agent example that exercises the agent tool evaluators end to end.
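
The "fraction of matching leaf paths" score described for StructuralMatchEvaluator can be illustrated with a small standalone sketch over plain Map and List values. This is not the evaluator's implementation. The path syntax, the choice of denominator (all paths from both sides when extra fields are disallowed, expected paths only when they are allowed), and the omission of order-insensitive array matching are assumptions of the sketch.

```java
import java.math.BigDecimal;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

final class LeafPathMatchSketch {
    // Flattens nested Map/List values into path -> scalar leaf entries.
    static Map<String, Object> leaves(Object node) {
        Map<String, Object> out = new LinkedHashMap<>();
        collect("$", node, out);
        return out;
    }

    private static void collect(String path, Object node, Map<String, Object> out) {
        if (node instanceof Map<?, ?> map) {
            map.forEach((key, value) -> collect(path + "." + key, value, out));
        } else if (node instanceof List<?> list) {
            for (int i = 0; i < list.size(); i++) {
                collect(path + "[" + i + "]", list.get(i), out);
            }
        } else {
            out.put(path, node);
        }
    }

    // Numbers compare by value (16 equals 16.0); assumes finite numeric values.
    private static boolean sameLeaf(Object a, Object b) {
        if (a instanceof Number x && b instanceof Number y) {
            return new BigDecimal(x.toString()).compareTo(new BigDecimal(y.toString())) == 0;
        }
        return Objects.equals(a, b);
    }

    // Fraction of leaf paths that match. Extra actual fields count against the
    // score only when allowExtraFields is false.
    static double score(Object expected, Object actual, boolean allowExtraFields) {
        Map<String, Object> e = leaves(expected);
        Map<String, Object> a = leaves(actual);
        Set<String> paths = new HashSet<>(e.keySet());
        if (!allowExtraFields) {
            paths.addAll(a.keySet());
        }
        if (paths.isEmpty()) {
            return 1.0;
        }
        long matched = paths.stream()
                .filter(p -> e.containsKey(p) && a.containsKey(p) && sameLeaf(e.get(p), a.get(p)))
                .count();
        return (double) matched / paths.size();
    }

    public static void main(String[] args) {
        Object expected = Map.of("name", "Lagavulin", "age", 16, "tags", List.of("peat", "islay"));
        Object actual = Map.of("name", "Lagavulin", "age", 16.0, "tags", List.of("peat", "smoke"), "extra", true);
        System.out.println(score(expected, actual, true));
        System.out.println(score(expected, actual, false));
    }
}
```

A binary variant would simply report 1.0 when this fraction is exactly 1.0 and 0.0 otherwise.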

Changed

Judge renders structured output as JSON. LLMJudgeEvaluator renders a non-String output as pretty-printed JSON so the judge sees a parseable structured value; String and primitive output is rendered verbatim as before.
task and asyncTask are mutually exclusive. Experiment.builder().build() now rejects configuring both a synchronous task/measuredTask and an asyncTask, instead of silently running the async path and ignoring the sync one.
Consistent null handling in LangChain4j RAG tasks. ragTask and asyncRagTask coerce a null model response to an empty string under the output key, matching simpleTask and asyncTask.
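
For readers unfamiliar with bounded async execution, which the asyncTask path in this release relies on, here is one common way to cap in-flight CompletableFuture work using a Semaphore. It is a generic sketch under stated assumptions, not the Dokimos executor: runAll and its signature are hypothetical, and only the single submitting thread ever waits for a permit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

final class BoundedAsyncSketch {
    // Starts each async task only once a permit is free, so at most
    // `parallelism` invocations are in flight at any time.
    static <I, O> List<O> runAll(List<I> inputs, Function<I, CompletableFuture<O>> task, int parallelism) {
        Semaphore permits = new Semaphore(parallelism);
        List<CompletableFuture<O>> futures = new ArrayList<>();
        for (I input : inputs) {
            permits.acquireUninterruptibly(); // only the submitting thread waits here
            CompletableFuture<O> started;
            try {
                started = task.apply(input);
            } catch (RuntimeException e) {
                permits.release(); // the task never started, so give the permit back
                throw e;
            }
            // Release the permit when the task finishes, whether it succeeded or failed.
            futures.add(started.whenComplete((result, error) -> permits.release()));
        }
        List<O> results = new ArrayList<>();
        for (CompletableFuture<O> future : futures) {
            results.add(future.join()); // results come back in input order
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> doubled = runAll(
                List.of(1, 2, 3, 4),
                i -> CompletableFuture.supplyAsync(() -> i * 2),
                2);
        System.out.println(doubled);
    }
}
```

The design choice is that concurrency is limited by permits rather than by pool size, so a non-blocking task does not need a dedicated thread per example.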

Fixed

Parallel executor shutdown. The parallel experiment executor shuts down forcibly when a run fails, so worker threads no longer leak.

v0.19.0

June 2, 2026

Hardens the core: per-item failure isolation, RFC 4180 CSV, prose-tolerant judges, retry and observability for the server reporter, and a run of correctness fixes.

Added

Dataset.load. Dataset.load(uriOrPath) resolves a dataset from a local path or a URI through the same resolver registry the SDK uses, so a plain path and a dataset://name@version URI load the same way.
Measured tasks. Experiment.builder().measuredTask(...) takes a MeasuredTask that returns outputs plus optional CallMetrics, carried through to each ItemResult so cost, tokens, and latency land next to the score.
Server reporter failure visibility. DokimosServerReporter exposes getFailedItemCount() and an onItemDeliveryFailure(...) callback, plus an opt-in spoolDirectory(...) that appends permanently undelivered batches to a durable file.
JUnit recorder and typed metadata. A test method can take a DatasetItemRecorder parameter to record actual outputs and eval results per invocation, and @MetadataEntry(key, value) replaces the alternating-string metadata form with a typed pair.
Kotlin and LangChain4j helpers. Kotlin adds an evalCase(input, actualOutput, expectedOutput) factory and a metadata(Map) DSL form; LangChain4j adds simpleTask(model, outputKey) to name the output key, and AgentEvalCase.builder() gives agent test cases a typed builder.

Changed

Per-item failure isolation. An experiment isolates a failing example so one bad item records its error and the run continues instead of aborting.
RFC 4180 CSV. Dataset CSV loading parses quoted fields per RFC 4180, so a value may contain the delimiter, a newline, or a doubled quote.
Prose-tolerant judges. LLM judge replies are parsed by extracting the JSON, so a judge may wrap its verdict in preamble or trailing prose, and LLMJudgeEvaluator normalizes a custom scoreRange onto 0..1.
Trajectory compares arguments. ToolTrajectoryEvaluator now defaults to a tolerant argument matcher; pass ArgumentMatcher.of(ArgMatchMode.IGNORE) to compare tool names and order only.
Reporter retries and stricter builder. The server reporter retries an HTTP 429 with its Retry-After hint, and Experiment.builder() rejects an empty dataset or zero evaluators.
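
The RFC 4180 quoting rules mentioned above are easy to get wrong, so here is a compact standalone parser that shows them in action. It is an illustration of the rules only, not the Dokimos CSV loader. The class and method names are hypothetical, and accepting a bare LF or CR as a record separator in addition to CRLF is a leniency of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class Rfc4180Sketch {
    // Parses CSV text into records. Inside a quoted field, commas and line breaks
    // are literal, and a doubled quote ("") stands for one quote character.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    field.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\r' || c == '\n') {
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++; // treat CRLF as one separator
                }
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else {
                field.append(c);
            }
        }
        // Flush a final record that has no trailing line break.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "name,note\r\n\"Ardbeg, 10\",\"said \"\"hi\"\"\nbye\"\r\n";
        System.out.println(parse(csv));
    }
}
```

A naive split on commas and newlines would break the second record of this input into several fields and two lines, which is the failure mode quoted-field parsing avoids.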

Fixed

Spring AI score. EvaluationResponse.getScore() returns the real evaluation score instead of leaving the field unset.
Null responses. LangChain4j simpleTask no longer throws on a null model response, and HallucinationEvaluator reports a missing verdict instead of a raw NullPointerException.
Tool-call validity numerics. Tool-call validity accepts whole-number doubles for integer parameters and matches numeric enums by value.
MCP store and client. The MCP result store writes atomically, and the per-call OpenAI client is closed after each run.
JUnit reported example. The example reported by the JUnit extension is tied to the actual invocation.
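
The tool-call validity fix above concerns a familiar problem: some JSON parsers hand back every number as a double, so an integer argument can arrive as 3.0. A minimal sketch of the two checks follows. The method names are hypothetical and this is not the evaluator's code.

```java
import java.math.BigDecimal;
import java.util.List;

final class NumericValiditySketch {
    // Accepts integral types, plus doubles that carry a whole-number value such as 3.0.
    static boolean fitsIntegerParam(Object value) {
        if (value instanceof Byte || value instanceof Short
                || value instanceof Integer || value instanceof Long) {
            return true;
        }
        if (value instanceof Double d) {
            return !d.isNaN() && !d.isInfinite() && d == Math.rint(d);
        }
        return false;
    }

    // Matches a numeric enum by value, so 2.0 matches an allowed 2.
    // Assumes finite values on both sides.
    static boolean matchesNumericEnum(Number value, List<? extends Number> allowed) {
        BigDecimal candidate = new BigDecimal(value.toString());
        for (Number option : allowed) {
            if (candidate.compareTo(new BigDecimal(option.toString())) == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(fitsIntegerParam(3.0));
        System.out.println(fitsIntegerParam(3.5));
        System.out.println(matchesNumericEnum(2.0, List.of(1, 2, 3)));
    }
}
```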

v0.17.0

May 31, 2026

Closes the production evaluation loop end to end, with full multi-tenant data isolation and standards-based OTLP trace ingestion.

Added

Tenant data isolation. A scoped API key can carry a tenant and then reads and writes only its own tenant's data plus shared rows. Tenant repositories expose only scoped finders, so an unscoped load does not compile, and a keyless read sees shared rows only. No-key and legacy single-key deployments are unchanged.
Protobuf OTLP traces. POST /api/v1/traces accepts the application/x-protobuf encoding alongside JSON, so a standard OpenTelemetry SDK or collector works without reconfiguring its exporter.
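
For a hand-rolled client, a protobuf-encoded request to the traces endpoint might be built as sketched below with the JDK HTTP client. The server address is a placeholder, the helper name is hypothetical, and the payload is assumed to be an already serialized OTLP trace export message. Any authentication header your deployment needs is left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class OtlpProtobufRequestSketch {
    // Builds (but does not send) a POST to /api/v1/traces with the protobuf content type.
    static HttpRequest tracesRequest(URI serverBase, byte[] protobufPayload) {
        return HttpRequest.newBuilder(serverBase.resolve("/api/v1/traces"))
                .header("Content-Type", "application/x-protobuf")
                .POST(HttpRequest.BodyPublishers.ofByteArray(protobufPayload))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder address and an empty payload, for illustration only.
        HttpRequest request = tracesRequest(URI.create("http://localhost:8080"), new byte[0]);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

In practice a standard OpenTelemetry exporter produces this request for you; the sketch only shows what changes on the wire compared with the JSON encoding.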

v0.16.0

May 30, 2026

The server grows from a results viewer into a production evaluation platform: server-owned datasets, a CI regression gate with run diffing, a server-side LLM judge, and trace ingestion.

Added

Server datasets. Hold datasets on the server, versioned and shared, and pin a test to an exact version with a dataset://name@version URI. The SDK resolver caches offline, so a pinned version still resolves when the server is briefly unreachable.
CI regression gate and run diff. The server fails your build when a run regresses against its baseline, significance-gated so a noisy judge does not flake the pipeline, with an item-by-item diff view and a reusable GitHub Action.
Server LLM judge. Score runs and traces on the server with a stored connection that speaks the vendor-neutral Open Responses API (Chat Completions as fallback), plus a judge-vs-human alignment metric.
Production traces and online evals. Ingest OTLP traces from your running app and score matching spans as they arrive, using the same judge as offline experiments.
Regression alerting. Get a signed webhook when a run regresses, on the same comparison the CI gate acts on.
Review and curation. Review the items evaluators got wrong, annotate them, and promote them into a new dataset version.
Role-scoped API keys. Issue VIEWER / EDITOR / ADMIN keys alongside the single-key mode. Reads stay open, writes need EDITOR, key management needs ADMIN.
Per-item cost, token, and latency metrics. Track spend and speed next to quality on every item result.
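
The role rules for scoped API keys can be read as a small permission table. The sketch below encodes them with hypothetical Role and Action enums. It assumes that ADMIN also satisfies the EDITOR requirement for writes, which the entry above implies but does not state, and it models a request without a key as a null role.

```java
final class RoleScopeSketch {
    enum Role { VIEWER, EDITOR, ADMIN } // declared in ascending order of privilege
    enum Action { READ, WRITE, MANAGE_KEYS }

    // role may be null for a request that carries no key.
    static boolean allowed(Role role, Action action) {
        return switch (action) {
            case READ -> true; // reads stay open
            case WRITE -> role != null && role.compareTo(Role.EDITOR) >= 0; // assumed: ADMIN may write too
            case MANAGE_KEYS -> role == Role.ADMIN;
        };
    }

    public static void main(String[] args) {
        System.out.println(allowed(null, Action.READ));
        System.out.println(allowed(Role.VIEWER, Action.WRITE));
        System.out.println(allowed(Role.EDITOR, Action.MANAGE_KEYS));
    }
}
```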
