
Evaluation Overview

This page shows you how Dokimos scores the output of your LLM application, so you can measure quality, catch regressions, and compare changes with numbers instead of guesses.

Run your first evaluation

Here is a complete example. It builds a small dataset, runs your application against it, scores the answers with an LLM judge, and prints a pass rate. The names customerSupportBot and judge are placeholders: to run the example, replace them with your own application and judge.

import dev.dokimos.core.*;
import dev.dokimos.core.evaluators.LLMJudgeEvaluator;
import java.util.List;
import java.util.Map;

// 1. Build a dataset: inputs paired with the expected answers.
Dataset dataset = Dataset.builder()
    .name("Product Support Questions")
    .addExample(Example.of(
        "How do I reset my password?",
        "Click 'Forgot Password' on the login page and follow the email instructions"
    ))
    .addExample(Example.of(
        "Where can I track my order?",
        "Go to your account dashboard and click on 'Order History'"
    ))
    .build();

// 2. Define the task: this calls your application for each example.
Task task = example -> {
    String answer = customerSupportBot.generateAnswer(example.input());
    return Map.of("output", answer);
};

// 3. Pick an evaluator to score each output.
List<Evaluator> evaluators = List.of(
    LLMJudgeEvaluator.builder()
        .name("Answer Quality")
        .criteria("Is the answer helpful and accurate?")
        .judge(judge)
        .threshold(0.8)
        .build()
);

// 4. Run the experiment and read the results.
ExperimentResult result = Experiment.builder()
    .name("QA Evaluation")
    .dataset(dataset)
    .task(task)
    .evaluators(evaluators)
    .build()
    .run();

System.out.println("Pass rate: " + String.format("%.2f%%", result.passRate() * 100));
System.out.println("Passed: " + result.passCount());
System.out.println("Failed: " + result.failCount());

That is the whole loop: a dataset goes in, a scored result comes out. The rest of this page explains the pieces.
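
To make the pass rate concrete, here is a minimal plain-Java sketch of the arithmetic. It is not Dokimos code. It assumes each example receives a score between 0 and 1 and passes when that score is at or above the threshold; this page does not say how Dokimos treats a score exactly equal to the threshold, so the `>=` comparison is an assumption. `PassRateSketch` and `passRate` are hypothetical names used only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the Dokimos API.
final class PassRateSketch {

    // Fraction of scores at or above the threshold (assumed pass rule: score >= threshold).
    static double passRate(double[] scores, double threshold) {
        if (scores.length == 0) {
            return 0.0; // No examples: report 0 rather than dividing by zero.
        }
        long passed = Arrays.stream(scores).filter(s -> s >= threshold).count();
        return (double) passed / scores.length;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.75, 0.8, 0.4}; // made-up judge scores
        double rate = passRate(scores, 0.8);
        System.out.println("Pass rate: " + String.format("%.2f%%", rate * 100));
    }
}
```

With these made-up scores and a threshold of 0.8, two of the four examples pass, so the rate is 0.5.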

What evaluation gives you

Evaluation scores the responses of an AI application against metrics that fit your use case. You run it to:

  • Find where your application is strong and where it is weak.
  • Check that outputs match what users expect.
  • Reduce the risk of shipping bad or unsafe responses.
  • Decide which model, prompt, or retrieval setup to ship.

Scores turn "this feels better" into a number you can track over time.

Core concepts

Dokimos evaluates LLM applications in Java and Kotlin. It runs offline evaluation: you score your application against a curated dataset. This fits benchmarking and regression testing during development, and it works well inside a CI/CD pipeline, where each run measures current performance and flags regressions.

Four concepts make up the framework:

  • Datasets: A collection of data points used for evaluation. Load them programmatically, from files, or from a custom source. (In the example above, that is Dataset.builder().)
  • Examples: One data point in a dataset. Each example holds an input (such as a prompt) and an expected output (the correct response). (That is Example.of(...).)
  • Evaluators: The code that scores how well your application did. Dokimos ships built-in evaluators for common tasks, and you can write your own. (That is LLMJudgeEvaluator above.)
  • Experiments: One run of an evaluation: a dataset plus a task plus evaluators. You can run experiments from your test suite, often as parameterized tests. (That is Experiment.builder().)
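
The way the four concepts fit together can be sketched in plain Java. This is an illustrative model only: `ConceptSketch`, `QaExample`, `Scorer`, `EXACT_MATCH`, and `run` are hypothetical stand-ins, not the Dokimos types or their signatures, and the exact-match scorer is one assumed example of what a custom evaluator might check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java stand-ins for illustration; these are not the Dokimos types.
final class ConceptSketch {

    // Example: one data point with an input and the expected output.
    record QaExample(String input, String expected) {}

    // Evaluator: turns an example and the application's output into a score.
    interface Scorer {
        double score(QaExample example, String output);
    }

    // Scores 1.0 on a case-insensitive match (surrounding whitespace ignored), else 0.0.
    static final Scorer EXACT_MATCH = (example, output) ->
            example.expected().trim().equalsIgnoreCase(output.trim()) ? 1.0 : 0.0;

    // Experiment: call the task on every example in the dataset and score each output.
    static List<Double> run(List<QaExample> dataset, Function<String, String> task, Scorer scorer) {
        List<Double> scores = new ArrayList<>();
        for (QaExample example : dataset) {
            String output = task.apply(example.input());
            scores.add(scorer.score(example, output));
        }
        return scores;
    }

    public static void main(String[] args) {
        // Dataset: a small list of examples.
        List<QaExample> dataset = List.of(
                new QaExample("2 + 2", "4"),
                new QaExample("Capital of France?", "Paris"));
        // Stand-in for the application under test: it always answers "4".
        Function<String, String> task = input -> "4";
        System.out.println(run(dataset, task, EXACT_MATCH)); // prints [1.0, 0.0]
    }
}
```

Keeping the task and the scorer as separate arguments mirrors the split described above: the task produces outputs, and the evaluator only judges them.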

Experiments

An experiment is the unit you run. It ties a dataset to a task and a set of evaluators, then produces scored results. Experiments plug into testing frameworks like JUnit, so you can run evaluation as part of your normal development workflow.

For useful experiments:

  • Use datasets that reflect real-world inputs.
  • Pick evaluators that match what you care about (accuracy, helpfulness, format, and so on).
  • Read the results to find what to improve.
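
One way to act on the results in a CI pipeline is to compare the current pass rate against a stored baseline. The sketch below is plain Java, not a Dokimos feature: `RegressionGateSketch`, `isRegression`, the baseline value, and the tolerance are all hypothetical, and in practice the current value would come from something like result.passRate() in the example at the top of this page.

```java
// Hypothetical CI gate for illustration; not part of the Dokimos API.
final class RegressionGateSketch {

    // True when the current pass rate has dropped more than `tolerance` below the baseline.
    static boolean isRegression(double baselinePassRate, double currentPassRate, double tolerance) {
        return currentPassRate < baselinePassRate - tolerance;
    }

    public static void main(String[] args) {
        double baseline = 0.90; // pass rate recorded from the last accepted run (made up)
        double current = 0.84;  // stand-in for the pass rate of a new run
        if (isRegression(baseline, current, 0.02)) {
            System.err.println("Pass rate regressed: " + current + " < " + baseline);
            System.exit(1); // A non-zero exit code fails the CI job.
        }
        System.out.println("No regression detected.");
    }
}
```

The tolerance is there because LLM judge scores can vary between runs; a small allowance avoids failing the build on noise, at the cost of missing very small regressions.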

Next steps

Now go deeper on each piece:
