# Evaluators

> An evaluator scores one of your LLM's outputs and tells you if it passes. Each evaluator returns a score from 0.0 to 1.0 and compares it against a threshold you set. Use this page to pick a built-in evaluator, configure it, and read its result.

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

Start with a built-in evaluator for common checks (exact matches, regex patterns, LLM judging, RAG grounding, retrieval quality). Write a custom one when none of them fit.

## The Evaluator interface

Every evaluator implements `Evaluator`. It has three methods: score a test case, report its name, and report its threshold.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
public interface Evaluator {
    EvalResult evaluate(EvalTestCase testCase);
    String name();
    double threshold();
}
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
interface Evaluator {
    fun evaluate(testCase: EvalTestCase): EvalResult
    fun name(): String
    fun threshold(): Double
}
```

  </TabItem>
</Tabs>

Evaluators that extend `BaseEvaluator` can also run asynchronously. Call `evaluateAsync` to get a `CompletableFuture`.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
// Async using common fork-join pool
CompletableFuture<EvalResult> future = evaluator.evaluateAsync(testCase);

// Async with custom executor (a distinct variable name, so both calls compile in one scope)
ExecutorService executor = Executors.newFixedThreadPool(4);
CompletableFuture<EvalResult> futureOnExecutor = evaluator.evaluateAsync(testCase, executor);
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
// Async using common fork-join pool
// await() comes from kotlinx.coroutines.future and must be called from a coroutine
val evalResult = evaluator.evaluateAsync(testCase).await()

// Async with custom executor
val executor = Executors.newFixedThreadPool(4)
val evalResult2 = evaluator.evaluateAsync(testCase, executor).await()
```

  </TabItem>
</Tabs>
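When you fan out many evaluations on a custom executor, start all the futures before waiting on any of them, and shut the executor down afterwards. The sketch below uses only `java.util.concurrent`. `scoreOf` is a hypothetical stand-in for an evaluator call, and `CompletableFuture.supplyAsync` stands in for `evaluateAsync`; neither is Dokimos API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class AsyncFanOutSketch {

    // Hypothetical stand-in for evaluator.evaluate(testCase): scores an output by its length.
    static double scoreOf(String output) {
        return Math.min(1.0, output.length() / 10.0);
    }

    // Starts one future per output on the executor, then waits for all of them in order.
    static List<Double> scoreAll(List<String> outputs, ExecutorService executor) {
        List<CompletableFuture<Double>> futures = outputs.stream()
            .map(output -> CompletableFuture.supplyAsync(() -> scoreOf(output), executor))
            .collect(Collectors.toList()); // collect first so every task is submitted before any join
        return futures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            System.out.println(scoreAll(List.of("short", "a much longer output"), executor));
        } finally {
            executor.shutdown(); // non-daemon pool threads would otherwise keep the JVM alive
        }
    }
}
```

Collecting the futures into a list before joining is what makes the work run in parallel. Joining inside the first stream would evaluate one item at a time.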

Every call returns an `EvalResult`. It holds:

- **score**: numeric score (0.0 to 1.0)
- **success**: whether the score meets the threshold (for `HallucinationEvaluator`, whether it stays at or below the threshold)
- **reason**: explanation of the score
- **metadata**: extra evaluation data

## Built-in evaluators

### ExactMatchEvaluator

Checks if the output matches the expected result exactly. Use it when there is one correct answer.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
Evaluator evaluator = ExactMatchEvaluator.builder()
    .name("Exact Match")
    .threshold(1.0)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val evaluator = exactMatch {
    name = "Exact Match"
    threshold = 1.0
}
```

  </TabItem>
</Tabs>

Returns `1.0` if the strings match, `0.0` otherwise.

**When to use:** math calculations, code generation, or any case where the output is a string that should come back exactly as expected.

:::note
`ExactMatchEvaluator` compares the **string forms** of the outputs (`toString()`). For a structured output (a record, `Map`, or list) use [`StructuralMatchEvaluator`](#structuralmatchevaluator) instead. It compares the values structurally and ignores formatting and numeric representation (`5` vs `5.0`).
:::

### RegexEvaluator

Checks if the output matches a pattern. Use it to validate format when the exact content can vary.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
Evaluator dateFormat = RegexEvaluator.builder()
    .name("Date Format")
    .pattern("\\d{4}-\\d{2}-\\d{2}")  // YYYY-MM-DD
    .threshold(1.0)
    .build();

Evaluator emailFormat = RegexEvaluator.builder()
    .name("Email Format")
    .pattern("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")
    .ignoreCase(true)
    .threshold(1.0)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val dateFormat = regex {
    name = "Date Format"
    pattern = "\\d{4}-\\d{2}-\\d{2}"  // YYYY-MM-DD
    threshold = 1.0
}

val emailFormat = regex {
    name = "Email Format"
    pattern = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
    ignoreCase = true
    threshold = 1.0
}
```

  </TabItem>
</Tabs>

**When to use:** validating dates, emails, phone numbers, IDs, or URLs, where the exact value varies but the pattern stays the same.

### LLMJudgeEvaluator

Uses a second LLM to score outputs against criteria you write in plain language. Use it for quality checks that rules cannot capture.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator helpfulness = LLMJudgeEvaluator.builder()
    .name("Helpfulness")
    .criteria("Is the answer helpful and complete? Does it actually solve the user's problem?")
    .evaluationParams(List.of(
        EvalTestCaseParam.INPUT,
        EvalTestCaseParam.ACTUAL_OUTPUT
    ))
    .threshold(0.8)
    .judge(judge)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }

val helpfulness: Evaluator = llmJudge(judge) {
    name = "Helpfulness"
    criteria = "Is the answer helpful and complete? Does it actually solve the user's problem?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
    threshold = 0.8
}
```

  </TabItem>
</Tabs>

The evaluator sends your criteria and the test case to the judge model, which returns a score between 0 and 1. The reply is parsed leniently. A one-sentence preamble or trailing prose around the JSON is dropped, so a usable judgment is not lost to a formatting quirk.

A structured output (a record, `Map`, or list) is rendered to the judge as pretty-printed JSON, so you can judge a structured value directly. String and primitive output is passed through verbatim.

By default the judge scores on a 0..1 scale. To let it work on a different range, set `scoreRange(min, max)`. The reported score is normalized back to 0..1, so your `threshold` always stays on the 0..1 scale.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
Evaluator helpfulness = LLMJudgeEvaluator.builder()
    .name("Helpfulness")
    .criteria("Rate the answer's helpfulness.")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .scoreRange(1, 5)  // judge replies 1..5; score is normalized to 0..1
    .threshold(0.8)
    .judge(judge)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val helpfulness: Evaluator = llmJudge(judge) {
    name = "Helpfulness"
    criteria = "Rate the answer's helpfulness."
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
    scoreRange(1.0, 5.0)  // judge replies 1..5; score is normalized to 0..1
    threshold = 0.8
}
```

  </TabItem>
</Tabs>
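It helps to work out what a threshold means on the judge's own scale before you pick a range. The sketch below assumes a linear min-max normalization. The text above only says the score is normalized back to 0..1, so treat the formula and the `normalize` helper as assumptions, not as the library's implementation.

```java
final class ScoreRangeSketch {

    // Assumed linear min-max normalization: maps [min, max] onto [0, 1].
    static double normalize(double raw, double min, double max) {
        if (max <= min) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        return (raw - min) / (max - min);
    }

    public static void main(String[] args) {
        double threshold = 0.8;
        // Print the normalized score and pass status for each judge reply on a 1..5 scale.
        for (int raw = 1; raw <= 5; raw++) {
            double score = normalize(raw, 1, 5);
            System.out.printf("judge=%d -> score=%.2f pass=%b%n", raw, score, score >= threshold);
        }
    }
}
```

Under this assumption a judge reply of 4 on a 1..5 scale normalizes to 0.75, so a threshold of 0.8 lets only a 5 pass. If a 4 should pass, lower the threshold to 0.75.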

**When to use:** semantic correctness, helpfulness, tone, clarity, or any quality you can describe in words more easily than in code.

### StructuralMatchEvaluator

Compares the actual output against the expected output as **JSON structures**, not as opaque strings. Both sides are normalized to a JSON tree first, so a record, a `Map`, or a JSON string all compare object against object. This is the right tool for structured output (extraction results, function-call arguments, typed POJOs) where reformatting, key ordering, or numeric representation should not count as a difference.

Numbers compare **by value, not representation**: `5` equals `5.0`, and `1.0` equals `1.00`, in both modes. Plain string equality of the serialized form would flag those as mismatches. Structural comparison does not.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
record Invoice(String id, double total, List<String> items) {}

Evaluator structural = StructuralMatchEvaluator.builder()
    .name("Invoice Match")
    .threshold(1.0)
    .build();  // STRICT mode, outputKey "output", partial scoring

var testCase = EvalTestCase.builder()
    .expectedOutput("output", new Invoice("INV-1", 42.0, List.of("a", "b")))
    .actualOutput("output", new Invoice("INV-1", 42.00, List.of("a", "b")))
    .build();

EvalResult result = structural.evaluate(testCase);
// result.score() == 1.0 because 42.0 and 42.00 are value-equal
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
data class Invoice(val id: String, val total: Double, val items: List<String>)

val structural: Evaluator = StructuralMatchEvaluator.builder()
    .name("Invoice Match")
    .threshold(1.0)
    .build()  // STRICT mode, outputKey "output", partial scoring

val testCase = EvalTestCase.builder()
    .expectedOutput("output", Invoice("INV-1", 42.0, listOf("a", "b")))
    .actualOutput("output", Invoice("INV-1", 42.00, listOf("a", "b")))
    .build()

val result = structural.evaluate(testCase)
// result.score() == 1.0 because 42.0 and 42.00 are value-equal
```

  </TabItem>
</Tabs>

#### Comparison modes

Set the mode with `.mode(...)` using `StructuralMatchMode`:

- **`STRICT`** (the default) requires the **exact field set** and **exact array order**. An extra field in the actual output is a mismatch and lowers the score. A `null` value is distinct from a missing field.
- **`LENIENT`** allows **extra actual fields** (the actual object may be a superset of the expected one) and ignores array order, comparing arrays as **multisets**. `[1, 1, 2]` does not match `[1, 2]`, but order does not matter. A `null` value and a missing field are treated as equal.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
Evaluator lenient = StructuralMatchEvaluator.builder()
    .name("Extraction Match")
    .mode(StructuralMatchMode.LENIENT)  // tolerate extra fields, ignore array order
    .threshold(0.9)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val lenient: Evaluator = StructuralMatchEvaluator.builder()
    .name("Extraction Match")
    .mode(StructuralMatchMode.LENIENT)  // tolerate extra fields, ignore array order
    .threshold(0.9)
    .build()
```

  </TabItem>
</Tabs>
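The difference between ordered and multiset array comparison is easiest to see on small lists. This is a plain-Java sketch of the two rules described above, not the evaluator's code. It compares elements with `equals`, so it does not reproduce the by-value number comparison or nested structures.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ArrayModeSketch {

    // STRICT-style rule: same elements in the same order.
    static boolean strictEquals(List<?> expected, List<?> actual) {
        return expected.equals(actual);
    }

    // LENIENT-style rule: same elements with the same multiplicities, in any order.
    static boolean multisetEquals(List<?> expected, List<?> actual) {
        return counts(expected).equals(counts(actual));
    }

    // Counts how often each element occurs.
    private static Map<Object, Integer> counts(List<?> items) {
        Map<Object, Integer> counts = new HashMap<>();
        for (Object item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(strictEquals(List.of(1, 2), List.of(2, 1)));      // false
        System.out.println(multisetEquals(List.of(1, 2), List.of(2, 1)));    // true
        System.out.println(multisetEquals(List.of(1, 2), List.of(1, 1, 2))); // false
    }
}
```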

#### Scoring

By default the score is the **fraction of matching leaf paths** in `[0.0, 1.0]`, so one wrong field on a large object is a partial miss, not a total failure. In `STRICT` the denominator is the union of expected and actual leaf paths (extra fields lower the score). In `LENIENT` the denominator is the expected leaf paths only.
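To see how the two denominators change the score, here is a minimal sketch that works on already-flattened leaf paths. It is one plausible reading of the rule above, not the library's algorithm: the `score` helper, the flattened-map input, and the handling of an empty comparison are all assumptions.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

final class LeafPathScoreSketch {

    // Inputs are already flattened: leaf path (for example "items[0]") -> leaf value.
    static double score(Map<String, ?> expected, Map<String, ?> actual, boolean strict) {
        Set<String> paths = new HashSet<>(expected.keySet());
        if (strict) {
            paths.addAll(actual.keySet()); // extra actual fields enlarge the denominator
        }
        if (paths.isEmpty()) {
            return 1.0; // nothing to compare; this sketch treats that as a match
        }
        long matching = paths.stream()
            .filter(p -> expected.containsKey(p) && actual.containsKey(p))
            .filter(p -> Objects.equals(expected.get(p), actual.get(p)))
            .count();
        return (double) matching / paths.size();
    }

    public static void main(String[] args) {
        Map<String, Object> expected = Map.of("id", "INV-1", "total", 42.0);
        Map<String, Object> actual = Map.of("id", "INV-1", "total", 42.0, "note", "rush");
        System.out.println(score(expected, actual, true));  // 2 of 3 paths match
        System.out.println(score(expected, actual, false)); // 2 of 2 expected paths match
    }
}
```

With one extra field in the actual output, the strict variant scores two thirds while the lenient variant scores 1.0, which is why the same threshold can pass in one mode and fail in the other.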

Call `.binary()` for an **exact-contract gate**. The score collapses to `1.0` when the structures match completely and `0.0` when anything differs. Pair it with `threshold(1.0)` when the output contract must be satisfied exactly.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
Evaluator contract = StructuralMatchEvaluator.builder()
    .name("Schema Contract")
    .binary()          // 1.0 if everything matches, 0.0 otherwise
    .threshold(1.0)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val contract: Evaluator = StructuralMatchEvaluator.builder()
    .name("Schema Contract")
    .binary()          // 1.0 if everything matches, 0.0 otherwise
    .threshold(1.0)
    .build()
```

  </TabItem>
</Tabs>

By default the evaluator reads both sides from the `"output"` key of the expected and actual output maps. Use `.outputKey(...)` to read from a different key. The expected value is required. If it is absent, the evaluator throws.

:::tip
This evaluator pairs with the typed output accessors on `EvalTestCase` (`actualOutputAs(...)` and `expectedOutputAs(...)`). Store your structured result under a map key as a record or `Map`, compare it structurally here, and read it back as a typed object elsewhere. See the [Structured & Typed Data](./structured-typed-data.md) hub for the whole pipeline end to end.
:::

**When to use:** structured or JSON output (extraction results, tool-call arguments, typed response objects) where you care about the data, not its textual formatting, and where numeric representation differences (`5` vs `5.0`) should never count as a regression.

### FaithfulnessEvaluator

Checks if the output is grounded in the provided context. Use it in RAG systems to make sure the LLM is not making things up.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator faithfulness = FaithfulnessEvaluator.builder()
    .threshold(0.8)
    .judge(judge)
    .contextKey("retrievedContext")  // Where to find the context in outputs
    .includeReason(true)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }

val faithfulness: Evaluator = faithfulness(judge) {
    threshold = 0.8
    contextKey = "retrievedContext"  // Where to find the context in outputs
    includeReason = true
}
```

  </TabItem>
</Tabs>

The evaluator:

1. Breaks the output into individual claims.
2. Checks each claim against the retrieved context.
3. Calculates score = (supported claims) / (total claims).

**When to use:** any RAG system where accuracy matters. If your LLM answers from retrieved documents, use this to catch hallucinations.

### HallucinationEvaluator

Detects output that the context does not support. `FaithfulnessEvaluator` measures how much is grounded. This evaluator measures the share of content that is hallucinated.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator hallucination = HallucinationEvaluator.builder()
    .threshold(0.3)  // Allow at most 30% hallucinated content
    .judge(judge)
    .contextKey("context")
    .includeReason(true)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }

val hallucination: Evaluator = hallucination(judge) {
    threshold = 0.3  // Allow at most 30% hallucinated content
    contextKey = "context"
    includeReason = true
}
```

  </TabItem>
</Tabs>

The evaluator:

1. Breaks the output into individual statements.
2. Checks if the context supports each statement.
3. Calculates score = (unsupported statements) / (total statements).

**Important:** for this evaluator, **lower scores are better** (0.0 means no hallucinations). Success is `score <= threshold`.
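The opposite pass directions are easy to mix up, so here is a small sketch that computes both ratios from the same verdicts. The list of per-statement verdicts is hypothetical. In the real evaluators the judge model produces them, and the two evaluators may split the output differently.

```java
import java.util.List;

final class GroundingScoreSketch {

    // Share of statements the context supports (higher is better). Assumes a non-empty list.
    static double faithfulness(List<Boolean> supported) {
        long ok = supported.stream().filter(Boolean::booleanValue).count();
        return (double) ok / supported.size();
    }

    // Share of statements the context does not support (lower is better). Assumes a non-empty list.
    static double hallucination(List<Boolean> supported) {
        long bad = supported.stream().filter(s -> !s).count();
        return (double) bad / supported.size();
    }

    public static void main(String[] args) {
        // Hypothetical verdicts for four statements: three supported, one not.
        List<Boolean> verdicts = List.of(true, true, false, true);
        double f = faithfulness(verdicts);  // 0.75
        double h = hallucination(verdicts); // 0.25
        System.out.println("faithfulness passes at 0.8:  " + (f >= 0.8)); // false
        System.out.println("hallucination passes at 0.3: " + (h <= 0.3)); // true
    }
}
```

The same output fails a faithfulness threshold of 0.8 and passes a hallucination threshold of 0.3, because the first is a floor and the second is a ceiling.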

**When to use:** when you need to measure and cap the hallucination rate, especially in high-stakes applications where any fabricated information is a problem.

### ContextualRelevanceEvaluator

Measures how relevant the retrieved context chunks are to the user's query. Use it to evaluate retrieval quality in RAG systems.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator relevance = ContextualRelevanceEvaluator.builder()
    .threshold(0.5)
    .judge(judge)
    .retrievalContextKey("retrievalContext")
    .includeReason(true)
    .strictMode(false)  // Set to true for threshold of 1.0
    .build();
```

The evaluator:

1. Scores each context chunk on its own (0.0 to 1.0) for relevance to the query.
2. Calculates the final score as the mean of all chunk scores.
3. Stores the individual chunk scores in the result metadata.

```java
var testCase = EvalTestCase.builder()
    .input("What are symptoms of dehydration?")
    .actualOutput("retrievalContext", List.of(
        "Dehydration symptoms include thirst and fatigue.",  // Highly relevant
        "The Pacific Ocean is the largest ocean.",           // Irrelevant
        "Severe dehydration can cause dizziness."            // Highly relevant
    ))
    .build();

EvalResult result = relevance.evaluate(testCase);
// result.score() ≈ 0.63 (average of individual scores)
// result.metadata().get("contextScores") contains per-chunk details
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }

val relevance: Evaluator = contextualRelevance(judge) {
    threshold = 0.5
    retrievalContextKey = "retrievalContext"
    includeReason = true
    strictMode = false  // Set to true for threshold of 1.0
}
```

The evaluator:

1. Scores each context chunk on its own (0.0 to 1.0) for relevance to the query.
2. Calculates the final score as the mean of all chunk scores.
3. Stores the individual chunk scores in the result metadata.

```kotlin
val testCase = EvalTestCase(
    input = "What are symptoms of dehydration?",
    actualOutputs = mapOf("retrievalContext" to listOf(
        "Dehydration symptoms include thirst and fatigue.",  // Highly relevant
        "The Pacific Ocean is the largest ocean.",           // Irrelevant
        "Severe dehydration can cause dizziness."            // Highly relevant
    )))

val result = relevance.evaluate(testCase)
// result.score() ≈ 0.63 (average of individual scores)
// result.metadata()["contextScores"] contains per-chunk details
```

  </TabItem>
</Tabs>
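The averaging step is simple enough to check by hand. In the sketch below the three per-chunk scores are hypothetical values chosen for illustration. A real judge returns its own numbers, so your score for the example above will differ.

```java
import java.util.List;

final class ChunkMeanSketch {

    // Final score = arithmetic mean of the per-chunk relevance scores.
    static double meanScore(List<Double> chunkScores) {
        return chunkScores.stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .orElse(0.0); // no chunks: this sketch reports 0.0
    }

    public static void main(String[] args) {
        // Hypothetical judge scores: relevant, irrelevant, relevant.
        List<Double> chunkScores = List.of(0.9, 0.1, 0.9);
        System.out.printf("%.2f%n", meanScore(chunkScores));
    }
}
```

One irrelevant chunk out of three pulls the mean down to roughly 0.63, which still passes the 0.5 threshold used above but would fail in strict mode.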

**When to use:** evaluating retrieval quality in RAG pipelines. It tells you when your retriever returns irrelevant documents that could confuse the LLM or dilute the answer.

### PrecisionEvaluator

Measures what fraction of retrieved items are actually relevant. Needs ground truth labels.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
Evaluator precision = PrecisionEvaluator.builder()
    .name("retrieval-precision")
    .retrievedKey("retrievedDocs")   // Key in actualOutputs
    .expectedKey("relevantDocs")     // Key in expectedOutputs (ground truth)
    .matchingStrategy(MatchingStrategy.byEquality())
    .threshold(0.8)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val precision: Evaluator = precision {
    name = "retrieval-precision"
    retrievedKey = "retrievedDocs"   // Key in actualOutputs
    expectedKey = "relevantDocs"     // Key in expectedOutputs (ground truth)
    matchingStrategy = MatchingStrategy.byEquality()
    threshold = 0.8
}
```

  </TabItem>
</Tabs>

**Formula:** `precision = |relevant ∩ retrieved| / |retrieved|`

A precision of 1.0 means every retrieved item was relevant (no false positives).

**When to use:** when you need to minimize noise in retrieved results. High precision matters when downstream processing is expensive or when irrelevant items could mislead the LLM.

### RecallEvaluator

Measures what fraction of relevant items were actually retrieved. Needs ground truth labels.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
Evaluator recall = RecallEvaluator.builder()
    .name("retrieval-recall")
    .retrievedKey("retrievedDocs")
    .expectedKey("relevantDocs")
    .matchingStrategy(MatchingStrategy.byEquality())
    .threshold(0.8)
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val recall: Evaluator = recall {
    name = "retrieval-recall"
    retrievedKey = "retrievedDocs"
    expectedKey = "relevantDocs"
    matchingStrategy = MatchingStrategy.byEquality()
    threshold = 0.8
}
```

  </TabItem>
</Tabs>

**Formula:** `recall = |relevant ∩ retrieved| / |relevant|`

A recall of 1.0 means all relevant items were found (no false negatives).

**When to use:** when missing relevant information is costly. High recall matters for complete answers or when the user expects full coverage.
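The precision and recall formulas differ only in the denominator. The following sketch applies both to plain string IDs using set intersection. The document IDs are made up, and returning 0.0 for an empty denominator is a choice made for this sketch, not documented library behavior.

```java
import java.util.HashSet;
import java.util.Set;

final class PrecisionRecallSketch {

    // |relevant ∩ retrieved| using plain equality on the items.
    static int hits(Set<String> retrieved, Set<String> relevant) {
        Set<String> common = new HashSet<>(retrieved);
        common.retainAll(relevant);
        return common.size();
    }

    // Fraction of retrieved items that are relevant.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        return retrieved.isEmpty() ? 0.0 : (double) hits(retrieved, relevant) / retrieved.size();
    }

    // Fraction of relevant items that were retrieved.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = Set.of("doc1", "doc2", "doc3", "doc4");
        Set<String> relevant = Set.of("doc1", "doc2", "doc5");
        System.out.println(precision(retrieved, relevant)); // 2 of 4 retrieved are relevant: 0.5
        System.out.println(recall(retrieved, relevant));    // 2 of 3 relevant were retrieved
    }
}
```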

### Matching strategies

Both `PrecisionEvaluator` and `RecallEvaluator` support several strategies for matching retrieved items to ground truth.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
// Simple equality (default, for string IDs)
MatchingStrategy.byEquality()

// Case-insensitive string matching
MatchingStrategy.caseInsensitive()

// Match by a specific field (for Map/JSON objects)
MatchingStrategy.byField("id")

// Match by multiple fields (e.g., knowledge graph triples)
MatchingStrategy.byFields("subject", "predicate", "object")

// Substring containment matching
MatchingStrategy.byContainment(true)  // normalized

// LLM-based semantic matching (most flexible, most expensive)
MatchingStrategy.llmBased(judge)

// Combine strategies
MatchingStrategy.anyOf(strategy1, strategy2)  // OR
MatchingStrategy.allOf(strategy1, strategy2)  // AND
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
// Simple equality (default, for string IDs)
MatchingStrategy.byEquality()

// Case-insensitive string matching
MatchingStrategy.caseInsensitive()

// Match by a specific field (for Map/JSON objects)
MatchingStrategy.byField("id")

// Match by multiple fields (e.g., knowledge graph triples)
MatchingStrategy.byFields("subject", "predicate", "object")

// Substring containment matching
MatchingStrategy.byContainment(normalize = true)

// LLM-based semantic matching (most flexible, most expensive)
MatchingStrategy.llmBased(judge)

// Combine strategies
MatchingStrategy.anyOf(strategy1, strategy2)  // OR
MatchingStrategy.allOf(strategy1, strategy2)  // AND
```

  </TabItem>
</Tabs>

**Example with knowledge graph triples:**

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
var precision = PrecisionEvaluator.builder()
    .retrievedKey("triples")
    .expectedKey("relevantTriples")
    .matchingStrategy(MatchingStrategy.byFields("subject", "predicate", "object"))
    .build();

var testCase = EvalTestCase.builder()
    .input("Who founded Microsoft?")
    .actualOutput("triples", List.of(
        Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft")
    ))
    .expectedOutput("relevantTriples", List.of(
        Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft"),
        Map.of("subject", "Paul Allen", "predicate", "co-founded", "object", "Microsoft")
    ))
    .build();
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val precision = precision {
    retrievedKey = "triples"
    expectedKey = "relevantTriples"
    matchingStrategy = MatchingStrategy.byFields("subject", "predicate", "object")
}

val testCase = EvalTestCase(
    input = "Who founded Microsoft?",
    actualOutputs = mapOf("triples" to listOf(
      mapOf("subject" to "Bill Gates", "predicate" to "founded", "object" to "Microsoft")
    )),
    expectedOutputs = mapOf("relevantTriples" to listOf(
      mapOf("subject" to "Bill Gates", "predicate" to "founded", "object" to "Microsoft"),
      mapOf("subject" to "Paul Allen", "predicate" to "co-founded", "object" to "Microsoft")
    )))
```

  </TabItem>
</Tabs>
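Field-based matching can be pictured as projecting each item onto the chosen fields and comparing the projections. The sketch below illustrates that idea on the triples above. The `key` and `matches` helpers and the extra `source` field are hypothetical, and this is not how `MatchingStrategy.byFields` is necessarily implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class FieldMatchSketch {

    // Projects an item onto the chosen fields, in order.
    static List<String> key(Map<String, String> item, List<String> fields) {
        return fields.stream().map(item::get).collect(Collectors.toList());
    }

    // Counts retrieved items whose projection equals the projection of some relevant item.
    static long matches(List<Map<String, String>> retrieved,
                        List<Map<String, String>> relevant,
                        List<String> fields) {
        Set<List<String>> relevantKeys = relevant.stream()
            .map(item -> key(item, fields))
            .collect(Collectors.toSet());
        return retrieved.stream()
            .map(item -> key(item, fields))
            .filter(relevantKeys::contains)
            .count();
    }

    public static void main(String[] args) {
        List<String> fields = List.of("subject", "predicate", "object");
        List<Map<String, String>> retrieved = List.of(
            // "source" is a hypothetical extra field; the projection ignores it
            Map.of("subject", "Bill Gates", "predicate", "founded",
                   "object", "Microsoft", "source", "kb-1"));
        List<Map<String, String>> relevant = List.of(
            Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft"),
            Map.of("subject", "Paul Allen", "predicate", "co-founded", "object", "Microsoft"));

        long hits = matches(retrieved, relevant, fields);
        System.out.println("precision = " + (double) hits / retrieved.size()); // 1 of 1 retrieved
        System.out.println("recall    = " + (double) hits / relevant.size());  // 1 of 2 relevant
    }
}
```

Under this reading the single retrieved triple is a hit, so precision is 1.0, while recall is 0.5 because the Paul Allen triple was not retrieved.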

### Agent evaluators

Dokimos ships specialized evaluators for AI agents that use tools. They cover task completion, tool call validation, argument hallucination detection, and tool definition quality.

See the dedicated **[Agent Evaluation](./agent-evaluation)** guide for full documentation.

## Common configuration

Every evaluator supports these settings.

**Name** sets how the evaluator shows up in results.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
.name("Answer Quality")
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
name = "Answer Quality"
```

  </TabItem>
</Tabs>

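**Threshold** sets the minimum score needed to pass. The exception is [`HallucinationEvaluator`](#hallucinationevaluator), where the threshold is a maximum.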

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
.threshold(0.8)  // Needs 80% or higher
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
threshold = 0.8  // Needs 80% or higher
```

  </TabItem>
</Tabs>

**Evaluation parameters** set which fields the evaluator reads.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
.evaluationParams(List.of(
    EvalTestCaseParam.INPUT,           // The user's question
    EvalTestCaseParam.EXPECTED_OUTPUT, // What you expect
    EvalTestCaseParam.ACTUAL_OUTPUT    // What the LLM actually said (no trailing comma in Java)
))
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
params(
    EvalTestCaseParam.INPUT,           // The user's question
    EvalTestCaseParam.EXPECTED_OUTPUT, // What you expect
    EvalTestCaseParam.ACTUAL_OUTPUT,   // What the LLM actually said
)
```

  </TabItem>
</Tabs>

## Creating custom evaluators

When no built-in evaluator fits, write your own by extending `BaseEvaluator`. Override `runEvaluation` and return an `EvalResult`.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
public class ResponseLengthEvaluator extends BaseEvaluator {
    
    private final int minLength;
    private final int maxLength;
    
    public ResponseLengthEvaluator(String name, int minLength, int maxLength) {
        super(name, 1.0, List.of(EvalTestCaseParam.ACTUAL_OUTPUT));
        this.minLength = minLength;
        this.maxLength = maxLength;
    }
    
    @Override
    protected EvalResult runEvaluation(EvalTestCase testCase) {
        String output = testCase.actualOutput();
        int length = output.length();
        
        boolean withinBounds = length >= minLength && length <= maxLength;
        double score = withinBounds ? 1.0 : 0.0;
        String reason = String.format("Output length %d (expected %d-%d)",
            length, minLength, maxLength);
        
        return EvalResult.builder()
            .name(name())
            .score(score)
            .threshold(threshold())
            .reason(reason)
            .build();
    }
}

// Usage
Evaluator lengthCheck = new ResponseLengthEvaluator("Length Check", 50, 200);
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
class ResponseLengthEvaluator(
    private val minLength: Int,
    private val maxLength: Int,
    private val evaluatorName: String = "Length Check"
) : BaseEvaluator(evaluatorName, 1.0, listOf(EvalTestCaseParam.ACTUAL_OUTPUT)) {

    override fun runEvaluation(testCase: EvalTestCase): EvalResult {
        val output = testCase.actualOutput()
        val length = output.length

        val withinBounds = length in minLength..maxLength
        val score = if (withinBounds) 1.0 else 0.0
        val reason = "Output length $length (expected $minLength-$maxLength)"

        return EvalResult(
          name = name(),
          score = score,
          threshold = threshold(),
          reason = reason,
        )
    }
}

// Usage
val lengthCheck: Evaluator = ResponseLengthEvaluator(50, 200)
```

  </TabItem>
</Tabs>

For very simple checks, implement the `Evaluator` interface directly.

## Combining multiple evaluators

Most applications need to pass several quality checks. Put the evaluators in a list and run them together.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
List<Evaluator> evaluators = List.of(
    // Check if the answer is correct
    LLMJudgeEvaluator.builder()
        .name("Correctness")
        .criteria("Is the answer factually correct?")
        .threshold(0.85)
        .judge(judge)
        .build(),
    
    // Check if it's grounded in retrieved docs (RAG)
    FaithfulnessEvaluator.builder()
        .threshold(0.80)
        .judge(judge)
        .contextKey("retrievedContext")
        .build(),
    
    // Check if it follows the required format
    RegexEvaluator.builder()
        .name("Format Check")
        .pattern("^[A-Z].*\\.$")  // Must start with capital and end with period
        .threshold(1.0)
        .build()
);
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val evaluators: List<Evaluator> = evaluators {
    // Check if the answer is correct
    llmJudge(judge) {
        name = "Correctness"
        criteria = "Is the answer factually correct?"
        threshold = 0.85
    }

    // Check if it's grounded in retrieved docs (RAG)
    faithfulness(judge) {
        threshold = 0.80
        contextKey = "retrievedContext"
    }

    // Check if it follows the required format
    regex {
        name = "Format Check"
        pattern = "^[A-Z].*\\.$"  // Must start with capital and end with period
        threshold = 1.0
    }
}
```

  </TabItem>
</Tabs>

An output passes only if it meets **all** the thresholds. This lets you enforce several quality dimensions at once.
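The all-must-pass rule amounts to a logical AND over each evaluator's own pass flag. Here is a minimal sketch, assuming you have already collected one success flag per evaluator name. The map and both helper methods are illustrative and not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class AllMustPassSketch {

    // Overall verdict: every evaluator's own success flag must be true.
    static boolean passes(Map<String, Boolean> successByEvaluator) {
        return successByEvaluator.values().stream().allMatch(Boolean::booleanValue);
    }

    // Names of the evaluators that failed, in insertion order.
    static List<String> failures(Map<String, Boolean> successByEvaluator) {
        return successByEvaluator.entrySet().stream()
            .filter(entry -> !entry.getValue())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Boolean> outcomes = new LinkedHashMap<>();
        outcomes.put("Correctness", true);
        outcomes.put("Faithfulness", true);
        outcomes.put("Format Check", false);
        System.out.println(passes(outcomes));   // false
        System.out.println(failures(outcomes)); // [Format Check]
    }
}
```

Combining the success flags, not the raw scores, keeps the rule correct for evaluators such as `HallucinationEvaluator` where a lower score is better.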

## Best practices

### Pick the right evaluator for the job

- Use **ExactMatch** when there is only one correct answer and it is a plain string (math results, generated code).
- Use **Regex** for format validation (dates, emails, IDs).
- Use **StructuralMatch** for structured or JSON output where formatting and numeric representation should not count as differences (see the [Structured & Typed Data](./structured-typed-data.md) hub).
- Use **LLMJudge** for semantic quality (helpfulness, clarity, tone).
- Use **Faithfulness** for RAG systems to measure how grounded the output is.
- Use **Hallucination** to measure and cap fabricated content.
- Use **ContextualRelevance** to evaluate retrieval quality without ground truth.
- Use **Precision/Recall** when you have ground truth labels for relevant items.
- Use **[Agent evaluators](./agent-evaluation)** to evaluate AI agents that use tools (task completion, tool validity, argument hallucination, tool reliability).
- Build **custom evaluators** for domain-specific requirements.

### Start with looser thresholds

Do not aim for perfection right away. Start around 0.7 to 0.8 and tighten as your system improves. A threshold of 1.0 fails on any imperfection.

### Write specific criteria for LLM judges

Be clear about what you are scoring.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
// Good (specific and measurable)
.criteria("Does the answer correctly explain the refund process and mention the 30-day policy?")

// Bad (too vague)
.criteria("Is this good?")
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
// Good (specific and measurable)
criteria = "Does the answer correctly explain the refund process and mention the 30-day policy?"

// Bad (too vague)
criteria = "Is this good?"
```

  </TabItem>
</Tabs>

### Use multiple evaluators for important outputs

Check each aspect on its own: correctness, format, grounding, tone. This shows you exactly where things go wrong.

### Test your evaluators

Confirm your evaluators behave on known examples before you rely on them.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
@Test
void faithfulnessEvaluatorShouldCatchHallucination() {
    var testCase = EvalTestCase.builder()
        .actualOutput("The product costs $500")  // Made up
        .metadata(Map.of("context", List.of("The product costs $100")))
        .build();
    
    var result = faithfulnessEvaluator.evaluate(testCase);
    
    // Should fail because claim isn't in context
    assertFalse(result.success());
}
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
@Test
fun faithfulnessEvaluatorShouldCatchHallucination() {
    val testCase = EvalTestCase.builder()
        .actualOutput("The product costs $500")  // Made up
        .metadata(mapOf("context" to listOf("The product costs $100")))
        .build()

    val result = faithfulnessEvaluator.evaluate(testCase)

    // Should fail because claim isn't in context
    assertFalse(result.success())
}
```

  </TabItem>
</Tabs>

## Using evaluator results

`evaluate` returns an `EvalResult` with the score, the pass status, and an explanation. Read them directly.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
EvalResult result = evaluator.evaluate(testCase);

System.out.println("Score: " + result.score());
System.out.println("Passed: " + result.success());
System.out.println("Reason: " + result.reason());
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val result = evaluator.evaluate(testCase)

println("Score: ${result.score()}")
println("Passed: ${result.success()}")
println("Reason: ${result.reason()}")
```

  </TabItem>
</Tabs>

In experiments, analyze results across all examples.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
ExperimentResult experimentResult = experiment.run();

// Average scores per evaluator
double avgCorrectness = experimentResult.averageScore("Correctness");
double avgFaithfulness = experimentResult.averageScore("Faithfulness");

// Dig into individual results
for (ItemResult item : experimentResult.itemResults()) {
    for (EvalResult eval : item.evalResults()) {
        if (!eval.success()) {
            System.out.println("Failed: " + eval.name() + " (" + eval.reason() + ")");
        }
    }
}
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
val experimentResult = experiment.run()

// Average scores per evaluator
val avgCorrectness = experimentResult.averageScore("Correctness")
val avgFaithfulness = experimentResult.averageScore("Faithfulness")

// Dig into individual results
experimentResult.itemResults().forEach { item ->
    item.evalResults()
        .filterNot { eval -> eval.success() }
        .forEach { eval ->
            println("Failed: ${eval.name()} (${eval.reason()})")
        }
}
```

  </TabItem>
</Tabs>

In JUnit tests, a failing evaluator fails the test.

<Tabs groupId="lang" defaultValue="java">
  <TabItem value="java" label="Java">

```java
@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void shouldProduceQualityAnswers(Example example) {
    String answer = aiService.generate(example.input());
    var testCase = example.toTestCase(answer);
    
    // Fails test if evaluators don't pass
    Assertions.assertEval(testCase, evaluators);
}
```

  </TabItem>
  <TabItem value="kotlin" label="Kotlin">

```kotlin
@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
fun shouldProduceQualityAnswers(example: Example) {
    val answer = aiService.generate(example.input())
    val testCase = example.toTestCase(answer)

    // Fails test if evaluators don't pass
    Assertions.assertEval(testCase, evaluators)
}
```

  </TabItem>
</Tabs>
