Evaluators

An evaluator scores one of your LLM's outputs and tells you if it passes. Each evaluator returns a score from 0.0 to 1.0 and compares it against a threshold you set. Use this page to pick a built-in evaluator, configure it, and read its result.

Start with a built-in evaluator for common checks (exact matches, regex patterns, LLM judging, RAG grounding, retrieval quality). Write a custom one when none of them fit.

The Evaluator interface

Every evaluator implements Evaluator. It has three methods: score a test case, report its name, and report its threshold.

public interface Evaluator {
EvalResult evaluate(EvalTestCase testCase);
String name();
double threshold();
}

Evaluators that extend BaseEvaluator can also run asynchronously. Call evaluateAsync to get a CompletableFuture.

// Async using common fork-join pool
CompletableFuture<EvalResult> future = evaluator.evaluateAsync(testCase);

// Async with custom executor
ExecutorService executor = Executors.newFixedThreadPool(4);
CompletableFuture<EvalResult> pooledFuture = evaluator.evaluateAsync(testCase, executor);
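
A common reason to use the async variant is to score several outputs concurrently and collect the results together. The sketch below shows that fan-out pattern with the JDK only. `scoreAsync` is a hypothetical stand-in for `evaluator.evaluateAsync(testCase, executor)`, and it returns a plain `Double` rather than an `EvalResult`.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncFanOutSketch {

    // Hypothetical stand-in for evaluator.evaluateAsync(testCase, executor):
    // scores an output as 1.0 when it is non-blank, 0.0 otherwise.
    static CompletableFuture<Double> scoreAsync(String output, ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> output.isBlank() ? 0.0 : 1.0, executor);
    }

    // Starts one future per output, waits for all of them,
    // and returns the scores in input order.
    static List<Double> scoreAll(List<String> outputs) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<Double>> futures = outputs.stream()
                    .map(output -> scoreAsync(output, executor))
                    .toList();
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            executor.shutdown(); // always release the pool's threads
        }
    }

    public static void main(String[] args) {
        System.out.println(scoreAll(List.of("Paris", "", "Berlin"))); // [1.0, 0.0, 1.0]
    }
}
```

Shutting the executor down in a `finally` block keeps a failed evaluation from leaking the pool's threads.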

Every call returns an EvalResult. It holds:

  • score: numeric score (0.0 to 1.0)
  • success: whether the score meets the threshold
  • reason: explanation of the score
  • metadata: extra evaluation data

Built-in evaluators

ExactMatchEvaluator

Checks if the output matches the expected result exactly. Use it when there is one correct answer.

Evaluator evaluator = ExactMatchEvaluator.builder()
.name("Exact Match")
.threshold(1.0)
.build();

Returns 1.0 if the strings match, 0.0 otherwise.

When to use: math calculations, code generation, or any case where the output is a string that should come back exactly as expected.

note

ExactMatchEvaluator compares the string forms of the outputs (toString()). For a structured output (a record, Map, or list) use StructuralMatchEvaluator instead. It compares the values structurally and ignores formatting and numeric representation (5 vs 5.0).

RegexEvaluator

Checks if the output matches a pattern. Use it to validate format when the exact content can vary.

Evaluator dateFormat = RegexEvaluator.builder()
.name("Date Format")
.pattern("\\d{4}-\\d{2}-\\d{2}") // YYYY-MM-DD
.threshold(1.0)
.build();

Evaluator emailFormat = RegexEvaluator.builder()
.name("Email Format")
.pattern("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")
.ignoreCase(true)
.threshold(1.0)
.build();

When to use: validating dates, emails, phone numbers, IDs, or URLs, where the exact value varies but the pattern stays the same.
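
This page does not say whether RegexEvaluator requires the whole output to match the pattern or only part of it, so it helps to know how the two behave in `java.util.regex`. The sketch below uses the JDK only and makes no claim about which one the evaluator applies. If you need a whole-output match either way, anchoring the pattern with `^` and `$` removes the ambiguity.

```java
import java.util.regex.Pattern;

class RegexMatchSketch {

    static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // True only when the entire text is a date.
    static boolean wholeMatch(String text) {
        return DATE.matcher(text).matches();
    }

    // True when a date appears anywhere in the text.
    static boolean partialMatch(String text) {
        return DATE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("2024-03-15"));           // true
        System.out.println(wholeMatch("Due on 2024-03-15."));   // false
        System.out.println(partialMatch("Due on 2024-03-15.")); // true
    }
}
```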

LLMJudgeEvaluator

Uses a second LLM to score outputs against criteria you write in plain language. Use it for quality checks that rules cannot capture.

JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator helpfulness = LLMJudgeEvaluator.builder()
.name("Helpfulness")
.criteria("Is the answer helpful and complete? Does it actually solve the user's problem?")
.evaluationParams(List.of(
EvalTestCaseParam.INPUT,
EvalTestCaseParam.ACTUAL_OUTPUT
))
.threshold(0.8)
.judge(judge)
.build();

The evaluator sends your criteria and the test case to the judge model, which returns a score between 0 and 1. The reply is parsed leniently. A one-sentence preamble or trailing prose around the JSON is dropped, so a usable judgment is not lost to a formatting quirk.
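
To illustrate what lenient parsing can look like, here is a minimal sketch that drops prose around a JSON object and reads a score from it. The reply shape (a JSON object with a numeric `"score"` field) and every name in the sketch are assumptions made for the example. The library's actual parser is not shown on this page and may work differently.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LenientJudgeReplySketch {

    // Assumed reply shape: a JSON object with a numeric "score" field.
    private static final Pattern SCORE =
            Pattern.compile("\"score\"\\s*:\\s*(-?\\d+(?:\\.\\d+)?)");

    // Drops any prose before the first '{' and after the last '}',
    // then reads the score from what remains.
    static OptionalDouble parseScore(String reply) {
        int start = reply.indexOf('{');
        int end = reply.lastIndexOf('}');
        if (start < 0 || end < start) {
            return OptionalDouble.empty(); // no JSON object in the reply
        }
        Matcher matcher = SCORE.matcher(reply.substring(start, end + 1));
        return matcher.find()
                ? OptionalDouble.of(Double.parseDouble(matcher.group(1)))
                : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        String reply = "Here is my evaluation: {\"score\": 0.8, \"reason\": \"Complete.\"} Hope that helps!";
        System.out.println(parseScore(reply)); // OptionalDouble[0.8]
    }
}
```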

A structured output (a record, Map, or list) is rendered to the judge as pretty-printed JSON, so you can judge a structured value directly. String and primitive output is passed through verbatim.

By default the judge scores on a 0..1 scale. To let it work on a different range, set scoreRange(min, max). The reported score is normalized back to 0..1, so your threshold always stays on the 0..1 scale.

Evaluator helpfulness = LLMJudgeEvaluator.builder()
.name("Helpfulness")
.criteria("Rate the answer's helpfulness.")
.evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
.scoreRange(1, 5) // judge replies 1..5; score is normalized to 0..1
.threshold(0.8)
.judge(judge)
.build();
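
The arithmetic behind normalization can be sketched as follows, assuming a linear min-max mapping. The page says the score is normalized back to 0..1 but does not give the formula, so both the linear mapping and the clamping of out-of-range replies are assumptions here.

```java
class ScoreRangeSketch {

    // Maps a raw judge score in [min, max] onto [0.0, 1.0].
    // Out-of-range replies are clamped first (an assumption of this sketch).
    static double normalize(double raw, double min, double max) {
        if (max <= min) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        double clamped = Math.max(min, Math.min(max, raw));
        return (clamped - min) / (max - min);
    }

    public static void main(String[] args) {
        System.out.println(normalize(4, 1, 5)); // 0.75
        System.out.println(normalize(5, 1, 5)); // 1.0
    }
}
```

Under this linear assumption, scoreRange(1, 5) combined with threshold(0.8) passes only a raw score of 5, because a raw 4 normalizes to 0.75. Keep that in mind when you choose a threshold for a coarse scale.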

When to use: semantic correctness, helpfulness, tone, clarity, or any quality you can describe in words more easily than in code.

StructuralMatchEvaluator

Compares the actual output against the expected output as JSON structures, not as opaque strings. Both sides are normalized to a JSON tree first, so a record, a Map, or a JSON string all compare object against object. This is the right tool for structured output (extraction results, function-call arguments, typed POJOs) where reformatting, key ordering, or numeric representation should not count as a difference.

Numbers compare by value, not representation: 5 equals 5.0, and 1.0 equals 1.00, in both modes. Plain string equality of the serialized form would flag those as mismatches. Structural comparison does not.

record Invoice(String id, double total, List<String> items) {}

Evaluator structural = StructuralMatchEvaluator.builder()
.name("Invoice Match")
.threshold(1.0)
.build(); // STRICT mode, outputKey "output", partial scoring

var testCase = EvalTestCase.builder()
.expectedOutput("output", new Invoice("INV-1", 42.0, List.of("a", "b")))
.actualOutput("output", new Invoice("INV-1", 42.00, List.of("a", "b")))
.build();

EvalResult result = structural.evaluate(testCase);
// result.score() == 1.0 because 42.0 and 42.00 are value-equal
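
The difference between comparing by representation and comparing by value is easy to see with the JDK alone. This sketch is independent of the library and only illustrates the idea: `BigDecimal.compareTo` ignores scale, while string equality does not.

```java
import java.math.BigDecimal;

class NumericEqualitySketch {

    // Representation-sensitive: compares the text forms.
    static boolean sameText(String a, String b) {
        return a.equals(b);
    }

    // Value-based: compareTo ignores scale, so 5 and 5.0 are equal.
    static boolean sameValue(String a, String b) {
        return new BigDecimal(a).compareTo(new BigDecimal(b)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameText("5", "5.0"));     // false
        System.out.println(sameValue("5", "5.0"));    // true
        System.out.println(sameValue("1.0", "1.00")); // true
    }
}
```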

Comparison modes

Set the mode with .mode(...) using StructuralMatchMode:

  • STRICT (the default) requires the exact field set and exact array order. An extra field in the actual output is a mismatch and lowers the score. A null value is distinct from a missing field.
  • LENIENT allows extra actual fields (the actual object may be a superset of the expected one) and ignores array order, comparing arrays as multisets: [1, 2, 1] matches [1, 1, 2], but [1, 1, 2] does not match [1, 2] because the element counts differ. A null value and a missing field are treated as equal.

Evaluator lenient = StructuralMatchEvaluator.builder()
.name("Extraction Match")
.mode(StructuralMatchMode.LENIENT) // tolerate extra fields, ignore array order
.threshold(0.9)
.build();
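
Multiset comparison, as LENIENT mode is described above, can be sketched by counting occurrences: order is ignored, but how often each element appears still matters. The class and method names are hypothetical and this is not the library's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultisetCompareSketch {

    // Counts how many times each element occurs.
    static <T> Map<T, Integer> counts(List<T> items) {
        Map<T, Integer> counts = new HashMap<>();
        for (T item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    // Two lists are the same multiset when every element has the same count in both.
    static <T> boolean sameMultiset(List<T> expected, List<T> actual) {
        return counts(expected).equals(counts(actual));
    }

    public static void main(String[] args) {
        System.out.println(sameMultiset(List.of(1, 2, 1), List.of(1, 1, 2))); // true
        System.out.println(sameMultiset(List.of(1, 1, 2), List.of(1, 2)));    // false
    }
}
```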

Scoring

By default the score is the fraction of matching leaf paths in [0.0, 1.0], so one wrong field on a large object is a partial miss, not a total failure. In STRICT the denominator is the union of expected and actual leaf paths (extra fields lower the score). In LENIENT the denominator is the expected leaf paths only.
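
The two denominators can be sketched on flattened leaf paths. The example assumes each structure has already been reduced to a map from leaf path to value. That flattening step, the class name, and the choice to score two empty structures as 1.0 are assumptions of the sketch, not documented behavior.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LeafPathScoreSketch {

    // Counts expected leaf paths whose value is identical in the actual output.
    static long matching(Map<String, String> expected, Map<String, String> actual) {
        return expected.entrySet().stream()
                .filter(e -> e.getValue().equals(actual.get(e.getKey())))
                .count();
    }

    // STRICT: denominator is the union of expected and actual leaf paths.
    static double strictScore(Map<String, String> expected, Map<String, String> actual) {
        Set<String> union = new HashSet<>(expected.keySet());
        union.addAll(actual.keySet());
        return union.isEmpty() ? 1.0 : (double) matching(expected, actual) / union.size();
    }

    // LENIENT: denominator is the expected leaf paths only.
    static double lenientScore(Map<String, String> expected, Map<String, String> actual) {
        return expected.isEmpty() ? 1.0 : (double) matching(expected, actual) / expected.size();
    }

    public static void main(String[] args) {
        Map<String, String> expected = Map.of(
                "id", "INV-1", "total", "42", "currency", "EUR", "status", "paid");
        Map<String, String> actual = Map.of(
                "id", "INV-1", "total", "42", "currency", "USD", "status", "paid", "note", "rush");
        System.out.println(strictScore(expected, actual));  // 0.6  (3 of 5 paths)
        System.out.println(lenientScore(expected, actual)); // 0.75 (3 of 4 paths)
    }
}
```

One wrong field and one extra field cost more in STRICT than in LENIENT because the extra path enlarges the STRICT denominator.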

Call .binary() for an exact-contract gate. The score collapses to 1.0 when the structures match completely and 0.0 when anything differs. Pair it with threshold(1.0) when the output contract must be satisfied exactly.

Evaluator contract = StructuralMatchEvaluator.builder()
.name("Schema Contract")
.binary() // 1.0 if everything matches, 0.0 otherwise
.threshold(1.0)
.build();

By default the evaluator reads both sides from the "output" key of the expected and actual output maps. Use .outputKey(...) to read from a different key. The expected value is required. If it is absent, the evaluator throws.

tip

This evaluator pairs with the typed output accessors on EvalTestCase (actualOutputAs(...) and expectedOutputAs(...)). Store your structured result under a map key as a record or Map, compare it structurally here, and read it back as a typed object elsewhere. See the Structured & Typed Data hub for the whole pipeline end to end.

When to use: structured or JSON output (extraction results, tool-call arguments, typed response objects) where you care about the data, not its textual formatting, and where numeric representation differences (5 vs 5.0) should never count as a regression.

FaithfulnessEvaluator

Checks if the output is grounded in the provided context. Use it in RAG systems to make sure the LLM is not making things up.

JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator faithfulness = FaithfulnessEvaluator.builder()
.threshold(0.8)
.judge(judge)
.contextKey("retrievedContext") // Where to find the context in outputs
.includeReason(true)
.build();

The evaluator:

  1. Breaks the output into individual claims.
  2. Checks each claim against the retrieved context.
  3. Calculates score = (supported claims) / (total claims).
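
Step 3 above can be sketched as a ratio over per-claim verdicts. In the real evaluator the judge model produces those verdicts. Here they are passed in as booleans, and returning 1.0 for an output with no claims is an assumption of the sketch.

```java
import java.util.List;

class FaithfulnessScoreSketch {

    // Each entry is one claim's verdict: true when the context supports it.
    static double score(List<Boolean> claimSupported) {
        if (claimSupported.isEmpty()) {
            return 1.0; // assumption: no claims means nothing is unsupported
        }
        long supported = claimSupported.stream().filter(Boolean::booleanValue).count();
        return (double) supported / claimSupported.size();
    }

    public static void main(String[] args) {
        // Three of four claims are supported by the context.
        System.out.println(score(List.of(true, true, false, true))); // 0.75
    }
}
```

With the threshold of 0.8 used in the builder example, a score of 0.75 would fail.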

When to use: any RAG system where accuracy matters. If your LLM answers from retrieved documents, use this to catch hallucinations.

HallucinationEvaluator

Detects output that the context does not support. Where FaithfulnessEvaluator measures the share of the output that is grounded, this evaluator measures the share that is hallucinated.

JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator hallucination = HallucinationEvaluator.builder()
.threshold(0.3) // Allow at most 30% hallucinated content
.judge(judge)
.contextKey("context")
.includeReason(true)
.build();

The evaluator:

  1. Breaks the output into individual statements.
  2. Checks if the context supports each statement.
  3. Calculates score = (unsupported statements) / (total statements).

Important: for this evaluator, lower scores are better (0.0 means no hallucinations). Success is score <= threshold.
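
Because the pass direction is inverted, it is worth checking the arithmetic once by hand. A minimal sketch of the score and the `score <= threshold` rule described above, with hypothetical names:

```java
class HallucinationPassSketch {

    // Share of statements the context does not support.
    static double score(int unsupported, int total) {
        if (total <= 0 || unsupported < 0 || unsupported > total) {
            throw new IllegalArgumentException("need 0 <= unsupported <= total and total > 0");
        }
        return (double) unsupported / total;
    }

    // Inverted check: the result passes when the score is at or below the threshold.
    static boolean passes(double score, double threshold) {
        return score <= threshold;
    }

    public static void main(String[] args) {
        System.out.println(passes(score(1, 5), 0.3)); // true:  0.2 <= 0.3
        System.out.println(passes(score(2, 5), 0.3)); // false: 0.4 >  0.3
    }
}
```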

When to use: when you need to measure and cap the hallucination rate, especially in high-stakes applications where any fabricated information is a problem.

ContextualRelevanceEvaluator

Measures how relevant the retrieved context chunks are to the user's query. Use it to evaluate retrieval quality in RAG systems.

JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator relevance = ContextualRelevanceEvaluator.builder()
.threshold(0.5)
.judge(judge)
.retrievalContextKey("retrievalContext")
.includeReason(true)
.strictMode(false) // Set to true for threshold of 1.0
.build();

The evaluator:

  1. Scores each context chunk on its own (0.0 to 1.0) for relevance to the query.
  2. Calculates the final score as the mean of all chunk scores.
  3. Stores the individual chunk scores in the result metadata.

var testCase = EvalTestCase.builder()
.input("What are symptoms of dehydration?")
.actualOutput("retrievalContext", List.of(
"Dehydration symptoms include thirst and fatigue.", // Highly relevant
"The Pacific Ocean is the largest ocean.", // Irrelevant
"Severe dehydration can cause dizziness." // Highly relevant
))
.build();

EvalResult result = relevance.evaluate(testCase);
// result.score() ≈ 0.63 (average of individual scores)
// result.metadata().get("contextScores") contains per-chunk details
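
To see how a mean near 0.63 can arise, suppose the judge gave the three chunks scores of 0.95, 0.0, and 0.95. Those per-chunk values are hypothetical, chosen only to illustrate the averaging step. Real scores depend on the judge model.

```java
import java.util.List;
import java.util.Locale;

class ChunkMeanSketch {

    // Final score is the arithmetic mean of the per-chunk scores (0.0 for no chunks).
    static double mean(List<Double> chunkScores) {
        return chunkScores.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        // Hypothetical judge scores for the three chunks in the example above.
        double score = mean(List.of(0.95, 0.0, 0.95));
        System.out.println(String.format(Locale.ROOT, "%.2f", score)); // 0.63
    }
}
```

A single irrelevant chunk pulls the mean down sharply, which is why a threshold of 0.5 still passes here while a stricter one would not.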

When to use: evaluating retrieval quality in RAG pipelines. It tells you when your retriever returns irrelevant documents that could confuse the LLM or dilute the answer.

PrecisionEvaluator

Measures what fraction of retrieved items are actually relevant. Needs ground truth labels.

Evaluator precision = PrecisionEvaluator.builder()
.name("retrieval-precision")
.retrievedKey("retrievedDocs") // Key in actualOutputs
.expectedKey("relevantDocs") // Key in expectedOutputs (ground truth)
.matchingStrategy(MatchingStrategy.byEquality())
.threshold(0.8)
.build();

Formula: precision = |relevant ∩ retrieved| / |retrieved|

A precision of 1.0 means every retrieved item was relevant (no false positives).

When to use: when you need to minimize noise in retrieved results. High precision matters when downstream processing is expensive or when irrelevant items could mislead the LLM.

RecallEvaluator

Measures what fraction of relevant items were actually retrieved. Needs ground truth labels.

Evaluator recall = RecallEvaluator.builder()
.name("retrieval-recall")
.retrievedKey("retrievedDocs")
.expectedKey("relevantDocs")
.matchingStrategy(MatchingStrategy.byEquality())
.threshold(0.8)
.build();

Formula: recall = |relevant ∩ retrieved| / |relevant|

A recall of 1.0 means all relevant items were found (no false negatives).
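
Both formulas can be checked by hand with plain sets. The sketch below uses equality matching on string IDs and treats duplicates as one item. Returning 0.0 for an empty denominator is an assumption of the sketch, not documented behavior.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PrecisionRecallSketch {

    // Items that are both retrieved and relevant.
    static <T> Set<T> intersection(List<T> retrieved, List<T> relevant) {
        Set<T> hits = new HashSet<>(retrieved);
        hits.retainAll(new HashSet<>(relevant));
        return hits;
    }

    // precision = |relevant ∩ retrieved| / |retrieved|
    static <T> double precision(List<T> retrieved, List<T> relevant) {
        Set<T> distinctRetrieved = new HashSet<>(retrieved);
        if (distinctRetrieved.isEmpty()) {
            return 0.0; // assumption: nothing retrieved scores 0.0
        }
        return (double) intersection(retrieved, relevant).size() / distinctRetrieved.size();
    }

    // recall = |relevant ∩ retrieved| / |relevant|
    static <T> double recall(List<T> retrieved, List<T> relevant) {
        Set<T> distinctRelevant = new HashSet<>(relevant);
        if (distinctRelevant.isEmpty()) {
            return 0.0; // assumption: no ground truth scores 0.0
        }
        return (double) intersection(retrieved, relevant).size() / distinctRelevant.size();
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of("doc1", "doc2", "doc3", "doc4");
        List<String> relevant = List.of("doc1", "doc3", "doc5");
        System.out.println(precision(retrieved, relevant)); // 0.5 (2 of 4 retrieved are relevant)
        System.out.println(recall(retrieved, relevant));    // 2 of 3 relevant were retrieved
    }
}
```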

When to use: when missing relevant information is costly. High recall matters for complete answers or when the user expects full coverage.

Matching strategies

Both PrecisionEvaluator and RecallEvaluator support several strategies for matching retrieved items to ground truth.

// Simple equality (default, for string IDs)
MatchingStrategy.byEquality()

// Case-insensitive string matching
MatchingStrategy.caseInsensitive()

// Match by a specific field (for Map/JSON objects)
MatchingStrategy.byField("id")

// Match by multiple fields (e.g., knowledge graph triples)
MatchingStrategy.byFields("subject", "predicate", "object")

// Substring containment matching
MatchingStrategy.byContainment(true) // normalized

// LLM-based semantic matching (most flexible, most expensive)
MatchingStrategy.llmBased(judge)

// Combine strategies
MatchingStrategy.anyOf(strategy1, strategy2) // OR
MatchingStrategy.allOf(strategy1, strategy2) // AND
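
One way to picture how strategies combine is to model each one as a predicate over a retrieved item and an expected item. The sketch below does that with `BiPredicate`. It is a conceptual model with hypothetical names, not the MatchingStrategy API.

```java
import java.util.Arrays;
import java.util.function.BiPredicate;

class StrategyComposeSketch {

    static BiPredicate<String, String> byEquality() {
        return String::equals;
    }

    static BiPredicate<String, String> caseInsensitive() {
        return String::equalsIgnoreCase;
    }

    // OR: a pair matches when at least one strategy accepts it.
    @SafeVarargs
    static <A, B> BiPredicate<A, B> anyOf(BiPredicate<A, B>... strategies) {
        return (a, b) -> Arrays.stream(strategies).anyMatch(s -> s.test(a, b));
    }

    // AND: a pair matches only when every strategy accepts it.
    @SafeVarargs
    static <A, B> BiPredicate<A, B> allOf(BiPredicate<A, B>... strategies) {
        return (a, b) -> Arrays.stream(strategies).allMatch(s -> s.test(a, b));
    }

    public static void main(String[] args) {
        System.out.println(anyOf(byEquality(), caseInsensitive()).test("DOC-1", "doc-1")); // true
        System.out.println(allOf(byEquality(), caseInsensitive()).test("DOC-1", "doc-1")); // false
    }
}
```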

Example with knowledge graph triples:

var precision = PrecisionEvaluator.builder()
.retrievedKey("triples")
.expectedKey("relevantTriples")
.matchingStrategy(MatchingStrategy.byFields("subject", "predicate", "object"))
.build();

var testCase = EvalTestCase.builder()
.input("Who founded Microsoft?")
.actualOutput("triples", List.of(
Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft")
))
.expectedOutput("relevantTriples", List.of(
Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft"),
Map.of("subject", "Paul Allen", "predicate", "co-founded", "object", "Microsoft")
))
.build();
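
For the test case above, the one retrieved triple matches one of the two relevant triples, so precision works out to 1/1 = 1.0 and recall to 1/2 = 0.5. The sketch below reproduces that count with a hand-written field comparison. The helper names are hypothetical and this is not how `MatchingStrategy.byFields` is implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

class FieldMatchSketch {

    // Two items match when they agree on every listed field.
    static boolean matchesOnFields(Map<String, String> a, Map<String, String> b, String... fields) {
        for (String field : fields) {
            if (!Objects.equals(a.get(field), b.get(field))) {
                return false;
            }
        }
        return true;
    }

    // Number of expected items matched by at least one retrieved item.
    static long matchedExpected(List<Map<String, String>> retrieved,
                                List<Map<String, String>> expected,
                                String... fields) {
        return expected.stream()
                .filter(e -> retrieved.stream().anyMatch(r -> matchesOnFields(r, e, fields)))
                .count();
    }

    public static void main(String[] args) {
        var retrieved = List.of(
                Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft"));
        var expected = List.of(
                Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft"),
                Map.of("subject", "Paul Allen", "predicate", "co-founded", "object", "Microsoft"));

        long hits = matchedExpected(retrieved, expected, "subject", "predicate", "object");
        System.out.println((double) hits / retrieved.size()); // precision: 1.0
        System.out.println((double) hits / expected.size());  // recall: 0.5
    }
}
```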

Agent evaluators

Dokimos ships specialized evaluators for AI agents that use tools. They cover task completion, tool call validation, argument hallucination detection, and tool definition quality.

See the dedicated Agent Evaluation guide for full documentation.

Common configuration

Every evaluator supports these settings.

Name sets how the evaluator shows up in results.

.name("Answer Quality")

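Threshold sets the minimum score needed to pass. HallucinationEvaluator is the exception: its threshold is a maximum, because lower scores are better there.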

.threshold(0.8)  // Needs 80% or higher

Evaluation parameters set which fields the evaluator reads.

.evaluationParams(List.of(
EvalTestCaseParam.INPUT, // The user's question
EvalTestCaseParam.EXPECTED_OUTPUT, // What you expect
EvalTestCaseParam.ACTUAL_OUTPUT // What the LLM actually said
))

Creating custom evaluators

When no built-in evaluator fits, write your own by extending BaseEvaluator. Override runEvaluation and return an EvalResult.

public class ResponseLengthEvaluator extends BaseEvaluator {

private final int minLength;
private final int maxLength;

public ResponseLengthEvaluator(String name, int minLength, int maxLength) {
super(name, 1.0, List.of(EvalTestCaseParam.ACTUAL_OUTPUT));
this.minLength = minLength;
this.maxLength = maxLength;
}

@Override
protected EvalResult runEvaluation(EvalTestCase testCase) {
String output = testCase.actualOutput();
int length = output.length();

boolean withinBounds = length >= minLength && length <= maxLength;
double score = withinBounds ? 1.0 : 0.0;
String reason = String.format("Output length %d (expected %d-%d)",
length, minLength, maxLength);

return EvalResult.builder()
.name(name())
.score(score)
.threshold(threshold())
.reason(reason)
.build();
}
}

// Usage
Evaluator lengthCheck = new ResponseLengthEvaluator("Length Check", 50, 200);

For very simple checks, implement the Evaluator interface directly.

Combining multiple evaluators

Most applications need to pass several quality checks. Put the evaluators in a list and run them together.

List<Evaluator> evaluators = List.of(
// Check if the answer is correct
LLMJudgeEvaluator.builder()
.name("Correctness")
.criteria("Is the answer factually correct?")
.threshold(0.85)
.judge(judge)
.build(),

// Check if it's grounded in retrieved docs (RAG)
FaithfulnessEvaluator.builder()
.threshold(0.80)
.judge(judge)
.contextKey("retrievedContext")
.build(),

// Check if it follows the required format
RegexEvaluator.builder()
.name("Format Check")
.pattern("^[A-Z].*\\.$") // Must start with capital and end with period
.threshold(1.0)
.build()
);

An output passes only if it meets all the thresholds. This lets you enforce several quality dimensions at once.
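
The all-must-pass rule can be sketched as follows. `Check` is a hypothetical stand-in for an `EvalResult` carrying a name, a score, and a threshold, and it assumes the usual higher-is-better comparison.

```java
import java.util.List;

class AllPassSketch {

    // Hypothetical stand-in for an EvalResult.
    record Check(String name, double score, double threshold) {
        boolean passed() {
            return score >= threshold;
        }
    }

    // The output passes only when every check meets its own threshold.
    static boolean allPass(List<Check> checks) {
        return checks.stream().allMatch(Check::passed);
    }

    // Names of the checks that failed, in order.
    static List<String> failures(List<Check> checks) {
        return checks.stream().filter(c -> !c.passed()).map(Check::name).toList();
    }

    public static void main(String[] args) {
        List<Check> checks = List.of(
                new Check("Correctness", 0.90, 0.85),
                new Check("Faithfulness", 0.75, 0.80),
                new Check("Format Check", 1.0, 1.0));
        System.out.println(allPass(checks));  // false
        System.out.println(failures(checks)); // [Faithfulness]
    }
}
```

Listing the failing checks by name, not just the overall verdict, points at the quality dimension that needs work.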

Best practices

Pick the right evaluator for the job

  • Use ExactMatch when there is only one correct answer and it is a plain string (math results, a single extracted value).
  • Use Regex for format validation (dates, emails, IDs).
  • Use StructuralMatch for structured or JSON output where formatting and numeric representation should not count as differences (see the Structured & Typed Data hub).
  • Use LLMJudge for semantic quality (helpfulness, clarity, tone).
  • Use Faithfulness for RAG systems to measure how grounded the output is.
  • Use Hallucination to measure and cap fabricated content.
  • Use ContextualRelevance to evaluate retrieval quality without ground truth.
  • Use Precision/Recall when you have ground truth labels for relevant items.
  • Use Agent evaluators to evaluate AI agents that use tools (task completion, tool validity, argument hallucination, tool reliability).
  • Build custom evaluators for domain-specific requirements.

Start with looser thresholds

Do not aim for perfection right away. Start around 0.7 to 0.8 and tighten as your system improves. A threshold of 1.0 fails on any imperfection.

Write specific criteria for LLM judges

Be clear about what you are scoring.

// Good (specific and measurable)
.criteria("Does the answer correctly explain the refund process and mention the 30-day policy?")

// Bad (too vague)
.criteria("Is this good?")

Use multiple evaluators for important outputs

Check each aspect on its own: correctness, format, grounding, tone. This shows you exactly where things go wrong.

Test your evaluators

Confirm your evaluators behave on known examples before you rely on them.

@Test
void faithfulnessEvaluatorShouldCatchHallucination() {
var testCase = EvalTestCase.builder()
.actualOutput("The product costs $500") // Made up
.metadata(Map.of("context", List.of("The product costs $100")))
.build();

var result = faithfulnessEvaluator.evaluate(testCase);

// Should fail because claim isn't in context
assertFalse(result.success());
}

Using evaluator results

evaluate returns an EvalResult with the score, the pass status, and an explanation. Read them directly.

EvalResult result = evaluator.evaluate(testCase);

System.out.println("Score: " + result.score());
System.out.println("Passed: " + result.success());
System.out.println("Reason: " + result.reason());

In experiments, analyze results across all examples.

ExperimentResult experimentResult = experiment.run();

// Average scores per evaluator
double avgCorrectness = experimentResult.averageScore("Correctness");
double avgFaithfulness = experimentResult.averageScore("Faithfulness");

// Dig into individual results
for (ItemResult item : experimentResult.itemResults()) {
for (EvalResult eval : item.evalResults()) {
if (!eval.success()) {
System.out.println("Failed: " + eval.name() + " (" + eval.reason() + ")");
}
}
}

In JUnit tests, a failing evaluator fails the test.

@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void shouldProduceQualityAnswers(Example example) {
String answer = aiService.generate(example.input());
var testCase = example.toTestCase(answer);

// Fails test if evaluators don't pass
Assertions.assertEval(testCase, evaluators);
}