Dokimos Overview

Dokimos lets you score, track, and regression-test the responses of your LLM application in Java and Kotlin, so you know when a prompt or model change made things better or worse.

It is an open-source evaluation framework. It works with Spring AI, Spring AI Alibaba, LangChain4j, Koog, Embabel, or plain Java, and it helps you:

  1. Build and manage datasets in code, from files, or with custom sources
  2. Run experiments with built-in evaluators, or your own custom evaluators
  3. Evaluate AI agents, including their tool calls and execution traces
  4. Capture per-call cost, tokens, and latency and roll them up per run
  5. Work with typed, structured data end to end, from task output to evaluator
  6. Run evals in a test-driven way with JUnit parameterized tests
  7. Track experiment results over time with an optional server and web UI

Dokimos is framework agnostic. The core depends on no AI framework, so it works with any LLM client, or none at all. The Spring AI, Spring AI Alibaba, LangChain4j, Koog, and Embabel modules are thin, optional bridges that capture a run in one line. You never need them to use Dokimos.

Dokimos brings to the Java ecosystem the kind of evaluation tooling that Python developers already have.

See it run

Here is a complete experiment. It runs your LLM application against three examples, scores each answer with an LLM judge, and prints the pass rate.

import dev.dokimos.core.*;
import dev.dokimos.core.evaluators.LLMJudgeEvaluator;
import java.util.List;
import java.util.Map;

// 1. Build a dataset of (input, expected output) pairs.
Dataset dataset = Dataset.builder()
    .name("Product Support Questions")
    .addExample(Example.of(
        "How do I reset my password?",
        "Click 'Forgot Password' on the login page and follow the email instructions"
    ))
    .addExample(Example.of(
        "Where can I track my order?",
        "Go to your account dashboard and click on 'Order History'"
    ))
    .addExample(Example.of(
        "What payment methods do you accept?",
        "We accept credit cards, PayPal, and bank transfers"
    ))
    .build();

// 2. Define the task that calls your application.
//    customerSupportBot stands for your own application object (not shown here).
Task task = example -> {
    String answer = customerSupportBot.generateAnswer(example.input());
    return Map.of("output", answer);
};

// 3. Pick evaluators.
//    judge is the LLM judge, configured elsewhere (not shown here).
List<Evaluator> evaluators = List.of(
    LLMJudgeEvaluator.builder()
        .name("Answer Quality")
        .criteria("Is the answer helpful and accurate?")
        .judge(judge)
        .threshold(0.8)
        .build()
);

// 4. Run the experiment.
ExperimentResult result = Experiment.builder()
    .name("QA Evaluation")
    .dataset(dataset)
    .task(task)
    .evaluators(evaluators)
    .build()
    .run();

// 5. Read the results.
System.out.println("Pass rate: " + String.format("%.2f%%", result.passRate() * 100));
System.out.println("Total examples: " + result.totalCount());
System.out.println("Passed: " + result.passCount());
System.out.println("Failed: " + result.failCount());
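
To make the numbers in step 5 concrete, here is a minimal sketch of how a pass rate can be derived from per-example scores and a threshold such as the 0.8 used above. It is plain Java and does not call the Dokimos API. The class `PassRateSketch`, its methods, and the rule that a score passes when it is at or above the threshold are assumptions made for illustration, not a description of how Dokimos computes `passRate()`.

```java
import java.util.List;
import java.util.Locale;

/** Illustrative only: hypothetical helper, not part of Dokimos. */
final class PassRateSketch {

    private PassRateSketch() {}

    /** Assumed rule: a score passes when it is at or above the threshold. */
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    /** Fraction of scores that pass, between 0 and 1; 0 for an empty list. */
    static double passRate(List<Double> scores, double threshold) {
        if (scores.isEmpty()) {
            return 0.0;
        }
        long passed = scores.stream()
            .filter(score -> passes(score, threshold))
            .count();
        return (double) passed / scores.size();
    }

    /** Formats a rate between 0 and 1 as a percentage with two decimals. */
    static String format(double rate) {
        // Locale.ROOT keeps the decimal separator a dot on every machine.
        return String.format(Locale.ROOT, "%.2f%%", rate * 100);
    }
}
```

With the scores 0.9, 0.8, and 0.5 and a threshold of 0.8, two of three examples pass under this rule, so `format(passRate(...))` yields `66.67%`.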

Want the full walkthrough, from adding the dependency to running this in a test? Read the Getting Started guide.

To see what you can build, explore the examples module.

Using a coding agent?

Paste this prompt to get a first eval written against your own code.

Start with a coding agent
I want to evaluate my LLM or AI-agent code with Dokimos, the LLM evaluation framework for Java and Kotlin.

Read https://dokimos.dev/llms-full.txt for the full API, then:
1. Add the dev.dokimos:dokimos-junit test dependency (version 0.23.0) to my build.
2. Look at what my app does (RAG, chatbot, or a tool-using agent) and pick the right evaluators.
3. Write a JUnit test that runs my code and asserts quality with Dokimos, using @DatasetSource if I have a dataset.
4. For a tool-using agent, capture the run as an AgentTrace and check the tool calls.

Keep it to one working test I can run with my existing build, and tell me how to run it.

Structured and typed data

A task can return a real domain object, such as a record, a POJO, or a list. Dokimos compares it structurally: numbers compare by value, and formatting and key order do not count. You read it back type-safely in custom evaluators, LLM judges, tool-call results, and metadata. See the Structured and Typed Data guide for the whole pipeline in one place.
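
As a rough picture of what structural comparison means, the sketch below compares nested values in plain Java: numbers by numeric value, maps regardless of key order, and lists element by element. It is not the Dokimos implementation and does not use its API. The class `StructuralEquality` and the method `structurallyEqual` are hypothetical names, and the sketch assumes finite numbers and non-null map keys.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of Dokimos. */
final class StructuralEquality {

    private StructuralEquality() {}

    /**
     * Compares two values by structure. Assumes finite numbers and
     * non-null map keys. Any other type falls back to equals(), which
     * is already structural for records.
     */
    static boolean structurallyEqual(Object a, Object b) {
        if (a == null || b == null) {
            return a == b;
        }
        if (a instanceof Number && b instanceof Number) {
            // compareTo ignores scale, so 1, 1L, and 1.0 compare as equal.
            BigDecimal x = new BigDecimal(a.toString());
            BigDecimal y = new BigDecimal(b.toString());
            return x.compareTo(y) == 0;
        }
        if (a instanceof Map && b instanceof Map) {
            Map<?, ?> left = (Map<?, ?>) a;
            Map<?, ?> right = (Map<?, ?>) b;
            if (left.size() != right.size()) {
                return false;
            }
            // Lookup by key makes the comparison independent of key order.
            for (Map.Entry<?, ?> entry : left.entrySet()) {
                if (!right.containsKey(entry.getKey())
                        || !structurallyEqual(entry.getValue(), right.get(entry.getKey()))) {
                    return false;
                }
            }
            return true;
        }
        if (a instanceof List && b instanceof List) {
            List<?> left = (List<?>) a;
            List<?> right = (List<?>) b;
            if (left.size() != right.size()) {
                return false;
            }
            // List order is part of the structure, so compare position by position.
            for (int i = 0; i < left.size(); i++) {
                if (!structurallyEqual(left.get(i), right.get(i))) {
                    return false;
                }
            }
            return true;
        }
        return a.equals(b);
    }
}
```

Under these assumptions, `Map.of("a", 1, "b", 2.50)` and `Map.of("b", 2.5, "a", 1L)` compare as equal, while `List.of(1, 2)` and `List.of(2, 1)` do not.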

The production eval loop

The optional server closes the loop from a single run to a system that holds quality steady over time.

See the server overview for how the pieces fit together.

For AI agents

Point a coding agent at the machine-readable docs. llms.txt indexes the documentation, and llms-full.txt is the whole thing in one file. Every page also has a Markdown version, linked from its footer under "For AI agents".

What's next

We are expanding Dokimos with features that make evaluation in Java easier:

  • More built-in evaluators: Additional evaluators for common patterns such as misuse detection.
  • Test Data Generation: Use LLMs to generate synthetic test datasets for evaluation.
  • SPI (Service Provider Interface): Plug in custom implementations for storage, metrics, and reporting.
  • CLI: Command-line tools for running experiments, managing datasets, and generating reports.

Want to see something else? Open an issue or contribute!
