Changelog
What shipped and when. For exact version diffs, see the GitHub releases.
v0.23.0
June 2026 (latest)
Tool calls on conversation turns: an assistant message now carries the tool calls it made, so a multi-turn conversation feeds the agent tool evaluators per turn, with no LLM. The trajectory judge can also reason over tool usage.
- Tool calls on assistant turns. `Message` now carries a typed `List<ToolCall>`, set via `Message.assistant(content, toolCalls)` or the builder's `assistantMessage(content, toolCalls)` (and the Kotlin DSL's `assistant(content, toolCalls)`). A tool-free turn is unchanged, and the three-argument `Message` constructor still resolves, so existing code compiles and runs as before.
- Per-turn tool evaluation. `ConversationTrajectory.toolCallsByTurn()` returns one tool-call list per assistant turn (in order) to score each turn with the deterministic agent evaluators, and `toolCalls()` flattens them into one list. `toTestCase()` / `toTestCase(tools)` build a deterministic case (input is the last user message), while `toTestCase(tools, tasks)` builds the judge case for `TaskCompletionEvaluator` and `ToolArgumentHallucinationEvaluator` over the whole conversation, with tool calls rendered name-only so their arguments stay out of the hallucination grounding. Also `toAgentTrace()` / `toAgentOutputs()` for the standard agent output map.
- Tool calls in the trajectory transcript. `toText()` and `toJson()` render each turn's tool calls, and `TrajectoryEvaluator.includeToolCalls(true)` adds them to the judge prompt. Both are off by default for tool-free conversations, whose rendered output stays byte-identical to prior versions.
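The per-turn grouping described above can be pictured with a small self-contained model. This is only a sketch: `TurnToolCallsSketch` and its nested `Role`, `ToolCall`, and `Message` types are hypothetical stand-ins written for illustration, not the Dokimos classes, and the real `ConversationTrajectory` may differ in shape and behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the Dokimos types.
final class TurnToolCallsSketch {

    enum Role { USER, ASSISTANT }

    record ToolCall(String name, Map<String, Object> arguments) {}

    record Message(Role role, String content, List<ToolCall> toolCalls) {
        static Message user(String content) {
            return new Message(Role.USER, content, List.of());
        }

        static Message assistant(String content, List<ToolCall> toolCalls) {
            return new Message(Role.ASSISTANT, content, List.copyOf(toolCalls));
        }
    }

    // One list per assistant turn, in conversation order.
    // A tool-free assistant turn contributes an empty list.
    static List<List<ToolCall>> toolCallsByTurn(List<Message> conversation) {
        List<List<ToolCall>> perTurn = new ArrayList<>();
        for (Message message : conversation) {
            if (message.role() == Role.ASSISTANT) {
                perTurn.add(message.toolCalls());
            }
        }
        return perTurn;
    }

    // Flattens the per-turn lists into a single list, preserving order.
    static List<ToolCall> toolCalls(List<Message> conversation) {
        List<ToolCall> flat = new ArrayList<>();
        for (List<ToolCall> turn : toolCallsByTurn(conversation)) {
            flat.addAll(turn);
        }
        return flat;
    }
}
```

Keeping an empty list for a tool-free turn, rather than skipping the turn, keeps list positions aligned with assistant turns, which is what a per-turn evaluator would need in this model.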
v0.22.0
June 2026
A server-free regression gate: commit a baseline next to your test and fail the build when quality drops, with the same verdict locally and in CI. No server, account, or API key for the gate itself.
- Server-free regression gate. `Assertions.assertNoRegression(result, "name")` compares a fresh experiment result against a committed baseline at `src/test/resources/dokimos/baselines/<name>.json` and throws on a real regression. Two guards fire it: a significance test on the aggregate pass rate and per-evaluator means (quiet on judge noise), and a localized-severity check that catches a single item breaking hard. The first local run writes the baseline and fails once so you review and commit it.
- Kotlin assertNoRegression. `ExperimentResult.assertNoRegression(...)` is available as a Kotlin extension so the gate reads naturally from a Kotlin test.
- CI report action. The `eval-gate-report` composite action renders each per-baseline verdict JSON under `target/dokimos` into the job summary and a sticky PR comment, and fails the step on a regression. Pair it with `if: always()` so the comment posts even after a failing build.
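To make the two guards concrete, here is a rough sketch of how such a gate could be structured. The changelog does not say which significance test or thresholds Dokimos uses, so everything here is an assumption: the one-sided two-proportion z-test, the 1.645 critical value (alpha = 0.05), the `maxDrop` parameter, and the `RegressionGateSketch` class itself are all hypothetical.

```java
// Hypothetical sketch of a two-guard regression check; not the Dokimos implementation.
final class RegressionGateSketch {

    // Guard 1 (assumed form): one-sided two-proportion z-test on the pass rate.
    // Returns true only when the drop from baseline to current is unlikely to be noise.
    static boolean passRateDropIsSignificant(int basePass, int baseTotal,
                                             int curPass, int curTotal) {
        double baseRate = (double) basePass / baseTotal;
        double curRate = (double) curPass / curTotal;
        double pooled = (double) (basePass + curPass) / (baseTotal + curTotal);
        double standardError =
                Math.sqrt(pooled * (1 - pooled) * (1.0 / baseTotal + 1.0 / curTotal));
        if (standardError == 0.0) {
            return false; // both runs are all-pass or all-fail, so there is no drop
        }
        double z = (baseRate - curRate) / standardError;
        return z > 1.645; // assumed critical value for a one-sided test at alpha = 0.05
    }

    // Guard 2 (assumed form): any single item whose score fell by more than maxDrop.
    // Scores are paired by index: baseline[i] and current[i] describe the same item.
    static boolean anyItemBrokeHard(double[] baseline, double[] current, double maxDrop) {
        int n = Math.min(baseline.length, current.length);
        for (int i = 0; i < n; i++) {
            if (baseline[i] - current[i] > maxDrop) {
                return true;
            }
        }
        return false;
    }

    // The gate fires when either guard fires.
    static boolean regressed(int basePass, int baseTotal, int curPass, int curTotal,
                             double[] baseline, double[] current, double maxDrop) {
        return passRateDropIsSignificant(basePass, baseTotal, curPass, curTotal)
                || anyItemBrokeHard(baseline, current, maxDrop);
    }
}
```

In this sketch a drop from 90/100 to 88/100 passes the first guard quietly, while a drop to 70/100 trips it. The second guard exists because one item collapsing from 1.0 to 0.2 barely moves an aggregate but is still worth failing on.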
v0.21.0
June 2026
Cost, token, and latency metrics across all five framework adapters with a pluggable pricing seam, plus two new agent integrations (Embabel and Spring AI Alibaba) that capture an agent run as an AgentTrace for the agent evaluators.
- Cost, token & latency metrics. Capture per-call `tokensIn` / `tokensOut`, `costUsd`, and `latencyMs` across all five adapters via measured tasks (`measuredTask` / `measuredAsyncTask` / `measuredTextTask`, and `EmbabelTraceCollector.callMetrics`). Cost is composed at capture time through a pluggable `PriceTable` seam in `dokimos-core`. Dokimos ships no price data; you supply the map. The run detail rolls up Total Tokens, Total Cost, and Avg Latency.
- Partial cost-coverage signal. When a run mixes priced and unpriced items, the run-detail Total Cost card shows an `N/M items priced` subtitle so a partial total is never mistaken for a complete one. Computed at read time on `RunDetails`, with no new column and no migration.
- Embabel integration. `dokimos-embabel` captures an Embabel agent run as an `AgentTrace` through an `AgenticEventListener`: call `EmbabelSupport.attach(...)`, run the agent, then call `collector.trace()`. Requires Java 21, since Embabel ships Java 21 bytecode; the rest of Dokimos stays on Java 17.
- Spring AI Alibaba integration. `dokimos-spring-ai-alibaba` folds a graph run's `OverAllState` messages into one `AgentTrace` with per-turn result windowing, reusing the Spring AI message extraction. Call `SpringAiAlibabaSupport.toAgentTrace(...)`.
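The pricing seam and the `N/M items priced` signal can be illustrated with a small model. This is a sketch under stated assumptions: the `Price`, `CallMetrics`, and `Rollup` records and the per-million-token price shape are hypothetical, and the real `PriceTable` interface in `dokimos-core` may look different. The prices used with it are placeholders, not real vendor rates.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;

// Hypothetical model of a caller-supplied price map; not the dokimos-core PriceTable.
final class PricingSketch {

    // Assumed price shape: USD per one million input and output tokens.
    record Price(double inputPerMTok, double outputPerMTok) {}

    record CallMetrics(String model, long tokensIn, long tokensOut) {}

    // Total cost over priced items, plus how many items had a price at all.
    record Rollup(double totalCostUsd, int pricedItems, int totalItems) {}

    // Empty when the caller's map has no entry for the model:
    // an unpriced call is "unknown cost", not "zero cost".
    static OptionalDouble costUsd(CallMetrics metrics, Map<String, Price> prices) {
        Price price = prices.get(metrics.model());
        if (price == null) {
            return OptionalDouble.empty();
        }
        double cost = metrics.tokensIn() / 1_000_000.0 * price.inputPerMTok()
                + metrics.tokensOut() / 1_000_000.0 * price.outputPerMTok();
        return OptionalDouble.of(cost);
    }

    // Sums only the priced items and reports coverage so a partial total is visible.
    static Rollup rollUp(List<CallMetrics> items, Map<String, Price> prices) {
        double total = 0.0;
        int priced = 0;
        for (CallMetrics item : items) {
            OptionalDouble cost = costUsd(item, prices);
            if (cost.isPresent()) {
                total += cost.getAsDouble();
                priced++;
            }
        }
        return new Rollup(total, priced, items.size());
    }
}
```

Returning an empty value for an unpriced model, instead of 0.0, is what makes a coverage count possible: the rollup can tell "free" apart from "no price supplied".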
v0.20.0
June 2026
Typed, structured outputs end to end and non-blocking async task execution: return a POJO from a task, match it structurally, and drive an experiment from suspend or reactive code without a thread per example.
- Typed structured output. `Task.typed(fn)` lets a task return a record, list, or other POJO under the `"output"` key, and `EvalTestCase.actualOutputAs(...)` / `expectedOutputAs(...)` read it back type-safely via a `Class<T>` or an `OutputType<T>` super-type token for generics like `List<Whisky>`. A failed conversion throws `DokimosTypeConversionException`.
- StructuralMatchEvaluator. Compares an expected structure against the actual one. `STRICT` requires the exact field set and array order; `LENIENT` allows extra fields and ignores array order. Both compare numbers by value. Scores the fraction of matching leaf paths by default, or call `binary()` for a 1.0/0.0 all-or-nothing score.
- Async task execution. `AsyncTask` returns a `CompletableFuture<TaskResult>`, and `Experiment.builder().asyncTask(...)` runs it through a bounded async path that caps in-flight invocations with `parallelism(int)`, so no thread is parked per example.
- Async and reactive adapters. Spring AI adds `asyncTask(...)` and `reactiveTask(...)`, LangChain4j adds `asyncTask(...)` and `asyncRagTask(...)`, and Koog adds `asTask(...)` / `asTextTask(...)`. Each has an overload that takes an `Executor` so calls run on a pool you control.
- Kotlin task DSL and ToolCall.resultJson. Kotlin adds `typedTask<T> { ... }` for returning a POJO directly and `suspendTask { ... }` for a suspend body, and `ToolCall.Builder.resultJson(Object)` serializes a structured tool result to compact JSON.
- Spring AI tool-eval example. A runnable Spring AI whisky-agent example that exercises the agent tool evaluators end to end.
- Judge renders structured output as JSON. `LLMJudgeEvaluator` renders a non-String output as pretty-printed JSON so the judge sees a parseable structured value; String and primitive output is rendered verbatim as before.
- task and asyncTask are mutually exclusive. `Experiment.builder().build()` now rejects configuring both a synchronous `task` / `measuredTask` and an `asyncTask`, instead of silently running the async path and ignoring the sync one.
- Consistent null handling in LangChain4j RAG tasks. `ragTask` and `asyncRagTask` coerce a null model response to an empty string under the output key, matching `simpleTask` and `asyncTask`.
- Parallel executor shutdown. The parallel experiment executor shuts down forcibly when a run fails, so worker threads no longer leak.
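The difference between the `STRICT` and `LENIENT` modes of the structural match described in this release is easier to see on plain `Map` / `List` values. The following is a minimal sketch, not the `StructuralMatchEvaluator` implementation: the class name, the way extra fields and extra array elements are counted as mismatches in strict mode, the greedy pairing for unordered arrays, and the `double`-based numeric comparison are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of leaf-path structural matching; not the Dokimos evaluator.
final class StructuralMatchSketch {

    enum Mode { STRICT, LENIENT }

    // Fraction of matching leaves; an expected structure with no leaves scores 1.0.
    static double score(Object expected, Object actual, Mode mode) {
        int[] result = compare(expected, actual, mode);
        return result[1] == 0 ? 1.0 : (double) result[0] / result[1];
    }

    // Returns {matchedLeaves, totalLeaves}.
    static int[] compare(Object expected, Object actual, Mode mode) {
        if (expected instanceof Map<?, ?> expMap) {
            Map<?, ?> actMap = actual instanceof Map<?, ?> m ? m : Map.of();
            int matched = 0;
            int total = 0;
            for (Map.Entry<?, ?> entry : expMap.entrySet()) {
                int[] r = compare(entry.getValue(), actMap.get(entry.getKey()), mode);
                matched += r[0];
                total += r[1];
            }
            if (mode == Mode.STRICT) {
                // Assumption: each unexpected field counts as one mismatching leaf.
                for (Object key : actMap.keySet()) {
                    if (!expMap.containsKey(key)) total++;
                }
            }
            return new int[] {matched, total};
        }
        if (expected instanceof List<?> expList) {
            List<?> actList = actual instanceof List<?> l ? l : List.of();
            return mode == Mode.STRICT
                    ? compareOrdered(expList, actList, mode)
                    : compareUnordered(expList, actList, mode);
        }
        return new int[] {leafEquals(expected, actual) ? 1 : 0, 1};
    }

    // STRICT arrays: position by position; surplus actual elements count as mismatches.
    static int[] compareOrdered(List<?> expected, List<?> actual, Mode mode) {
        int matched = 0;
        int total = 0;
        for (int i = 0; i < expected.size(); i++) {
            int[] r = compare(expected.get(i), i < actual.size() ? actual.get(i) : null, mode);
            matched += r[0];
            total += r[1];
        }
        total += Math.max(0, actual.size() - expected.size());
        return new int[] {matched, total};
    }

    // LENIENT arrays: greedily pair each expected element with the best unused actual one.
    static int[] compareUnordered(List<?> expected, List<?> actual, Mode mode) {
        List<Object> remaining = new ArrayList<>(actual);
        int matched = 0;
        int total = 0;
        for (Object element : expected) {
            int bestIndex = -1;
            int[] best = compare(element, null, mode); // result if nothing pairs with it
            for (int i = 0; i < remaining.size(); i++) {
                int[] r = compare(element, remaining.get(i), mode);
                if (r[0] > best[0]) {
                    best = r;
                    bestIndex = i;
                }
            }
            if (bestIndex >= 0) remaining.remove(bestIndex);
            matched += best[0];
            total += best[1];
        }
        return new int[] {matched, total};
    }

    // Numbers compare by value (16 equals 16.0); very large longs may lose precision here.
    static boolean leafEquals(Object expected, Object actual) {
        if (expected instanceof Number a && actual instanceof Number b) {
            return Double.compare(a.doubleValue(), b.doubleValue()) == 0;
        }
        return Objects.equals(expected, actual);
    }
}
```

Under these assumptions, an actual value that reorders an array and adds one extra field still scores 1.0 in `LENIENT` but loses points in `STRICT`. Greedy pairing is simple but not guaranteed optimal for nested elements that partially match.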
v0.19.0
June 2, 2026
Hardens the core: per-item failure isolation, RFC 4180 CSV, prose-tolerant judges, retry and observability for the server reporter, and a run of correctness fixes.
- Dataset.load. `Dataset.load(uriOrPath)` resolves a dataset from a local path or a URI through the same resolver registry the SDK uses, so a plain path and a `dataset://name@version` URI load the same way.
- Measured tasks. `Experiment.builder().measuredTask(...)` takes a `MeasuredTask` that returns outputs plus optional `CallMetrics`, carried through to each `ItemResult` so cost, tokens, and latency land next to the score.
- Server reporter failure visibility. `DokimosServerReporter` exposes `getFailedItemCount()` and an `onItemDeliveryFailure(...)` callback, plus an opt-in `spoolDirectory(...)` that appends permanently undelivered batches to a durable file.
- JUnit recorder and typed metadata. A test method can take a `DatasetItemRecorder` parameter to record actual outputs and eval results per invocation, and `@MetadataEntry(key, value)` replaces the alternating-string metadata form with a typed pair.
- Kotlin and LangChain4j helpers. Kotlin adds an `evalCase(input, actualOutput, expectedOutput)` factory and a `metadata(Map)` DSL form; LangChain4j adds `simpleTask(model, outputKey)` to name the output key, and `AgentEvalCase.builder()` gives agent test cases a typed builder.
- Per-item failure isolation. An experiment isolates a failing example so one bad item records its error and the run continues instead of aborting.
- RFC 4180 CSV. Dataset CSV loading parses quoted fields per RFC 4180, so a value may contain the delimiter, a newline, or a doubled quote.
- Prose-tolerant judges. LLM judge replies are parsed by extracting the JSON, so a judge may wrap its verdict in preamble or trailing prose, and `LLMJudgeEvaluator` normalizes a custom `scoreRange` onto 0..1.
- Trajectory compares arguments. `ToolTrajectoryEvaluator` now defaults to a tolerant argument matcher; pass `ArgumentMatcher.of(ArgMatchMode.IGNORE)` to compare tool names and order only.
- Reporter retries and stricter builder. The server reporter retries an HTTP 429 with its `Retry-After` hint, and `Experiment.builder()` rejects an empty dataset or zero evaluators.
- Spring AI score. `EvaluationResponse.getScore()` returns the real evaluation score instead of leaving the field unset.
- Null responses. LangChain4j `simpleTask` no longer throws on a null model response, and `HallucinationEvaluator` reports a missing verdict instead of a raw NullPointerException.
- Tool-call validity numerics. Tool-call validity accepts whole-number doubles for integer parameters and matches numeric enums by value.
- MCP store and client. The MCP result store writes atomically, and the per-call OpenAI client is closed after each run.
- JUnit reported example. The example reported by the JUnit extension is tied to the actual invocation.
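The RFC 4180 quoting rules mentioned in this release (a quoted field may contain the delimiter, a line break, or a doubled quote) can be sketched with a small character-level parser. This is an illustrative sketch, not the Dokimos CSV loader: the `Rfc4180Sketch` class is hypothetical, it assumes a comma delimiter, and it does no error reporting for malformed input such as an unterminated quote.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal RFC 4180-style parser; not the Dokimos CSV loader.
final class Rfc4180Sketch {

    // Parses the whole text at once, because a quoted field may span several lines.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas and line breaks are data here
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\r' || c == '\n') {
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;                   // treat CRLF as one line break
                }
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else {
                field.append(c);
            }
        }
        // Flush a final record that is not terminated by a line break.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }
}
```

The key design point is that record boundaries are decided by the quote state, not by splitting on line breaks first, which is why a line-by-line reader cannot handle multi-line fields correctly.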
v0.17.0
May 31, 2026
Closes the production evaluation loop end to end, with full multi-tenant data isolation and standards-based OTLP trace ingestion.
- Tenant data isolation. A scoped API key can carry a tenant and then reads and writes only its own tenant's data plus shared rows. Tenant repositories expose only scoped finders, so an unscoped load does not compile, and a keyless read sees shared rows only. No-key and legacy single-key deployments are unchanged.
- Protobuf OTLP traces. `POST /api/v1/traces` accepts the `application/x-protobuf` encoding alongside JSON, so a standard OpenTelemetry SDK or collector works without reconfiguring its exporter.
v0.16.0
May 30, 2026
The server grows from a results viewer into a production evaluation platform: server-owned datasets, a CI regression gate with run diffing, a server-side LLM judge, and trace ingestion.
- Server datasets. Hold datasets on the server, versioned and shared, and pin a test to an exact version with a `dataset://name@version` URI. The SDK resolver caches offline, so a pinned version still resolves when the server is briefly unreachable.
- CI regression gate and run diff. The server fails your build when a run regresses against its baseline, significance-gated so a noisy judge does not flake the pipeline, with an item-by-item diff view and a reusable GitHub Action.
- Server LLM judge. Score runs and traces on the server with a stored connection that speaks the vendor-neutral Open Responses API (Chat Completions as fallback), plus a judge-vs-human alignment metric.
- Production traces and online evals. Ingest OTLP traces from your running app and score matching spans as they arrive, using the same judge as offline experiments.
- Regression alerting. Get a signed webhook when a run regresses, on the same comparison the CI gate acts on.
- Review and curation. Review the items evaluators got wrong, annotate them, and promote them into a new dataset version.
- Role-scoped API keys. Issue VIEWER / EDITOR / ADMIN keys alongside the single-key mode. Reads stay open, writes need EDITOR, key management needs ADMIN.
- Per-item cost, token, and latency metrics. Track spend and speed next to quality on every item result.
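The role rule for scoped API keys ("reads stay open, writes need EDITOR, key management needs ADMIN") can be written down as a small permission check. This is a sketch only: the `RoleScopeSketch` class and the `Action` enum are hypothetical, and it assumes that ADMIN also carries EDITOR rights, which the changelog implies but does not state.

```java
// Hypothetical permission check for role-scoped keys; not the Dokimos server code.
final class RoleScopeSketch {

    enum Role { VIEWER, EDITOR, ADMIN }

    enum Action { READ, WRITE, MANAGE_KEYS }

    // role may be null to model a request that carries no key.
    static boolean allowed(Role role, Action action) {
        return switch (action) {
            case READ -> true;                                        // reads stay open
            case WRITE -> role == Role.EDITOR || role == Role.ADMIN;  // assumed: ADMIN can write
            case MANAGE_KEYS -> role == Role.ADMIN;
        };
    }
}
```

Switching on the action rather than the role keeps the rule readable as one line per requirement, and the compiler flags a missing case if a new action is added.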