# Evaluators

> An evaluator scores one of your LLM's outputs and tells you if it passes. Each evaluator returns a score from 0.0 to 1.0 and compares it against a threshold you set. Use this page to pick a built-in evaluator, configure it, and read its result.

Start with a built-in evaluator for common checks (exact matches, regex patterns, LLM judging, RAG grounding, retrieval quality). Write a custom one when none of them fit.

## The Evaluator interface

Every evaluator implements `Evaluator`. It has three methods: score a test case, report its name, and report its threshold.

```java
public interface Evaluator {
    EvalResult evaluate(EvalTestCase testCase);
    String name();
    double threshold();
}
```

```kotlin
interface Evaluator {
    fun evaluate(testCase: EvalTestCase): EvalResult
    fun name(): String
    fun threshold(): Double
}
```

Evaluators that extend `BaseEvaluator` can also run asynchronously. Call `evaluateAsync` to get a `CompletableFuture`.

```java
// Async using common fork-join pool
CompletableFuture<EvalResult> future = evaluator.evaluateAsync(testCase);

// Async with custom executor
ExecutorService executor = Executors.newFixedThreadPool(4);
CompletableFuture<EvalResult> futureOnPool = evaluator.evaluateAsync(testCase, executor);
```

```kotlin
// Async using common fork-join pool
val evalResult = evaluator.evaluateAsync(testCase).await()

// Async with custom executor
val executor = Executors.newFixedThreadPool(4)
val evalResult2 = evaluator.evaluateAsync(testCase, executor).await()
```

Every call returns an `EvalResult`.
It holds:

- **score**: numeric score (0.0 to 1.0)
- **success**: whether the score meets the threshold
- **reason**: explanation of the score
- **metadata**: extra evaluation data

## Built-in evaluators

### ExactMatchEvaluator

Checks if the output matches the expected result exactly. Use it when there is one correct answer.

```java
Evaluator evaluator = ExactMatchEvaluator.builder()
    .name("Exact Match")
    .threshold(1.0)
    .build();
```

```kotlin
val evaluator = exactMatch {
    name = "Exact Match"
    threshold = 1.0
}
```

Returns `1.0` if the strings match, `0.0` otherwise.

**When to use:** math calculations, code generation, or any case where the output is a string that should come back exactly as expected.

:::note
`ExactMatchEvaluator` compares the **string forms** of the outputs (`toString()`). For a structured output (a record, `Map`, or list) use [`StructuralMatchEvaluator`](#structuralmatchevaluator) instead. It compares the values structurally and ignores formatting and numeric representation (`5` vs `5.0`).
:::

### RegexEvaluator

Checks if the output matches a pattern. Use it to validate format when the exact content can vary.

```java
Evaluator dateFormat = RegexEvaluator.builder()
    .name("Date Format")
    .pattern("\\d{4}-\\d{2}-\\d{2}") // YYYY-MM-DD
    .threshold(1.0)
    .build();

Evaluator emailFormat = RegexEvaluator.builder()
    .name("Email Format")
    .pattern("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")
    .ignoreCase(true)
    .threshold(1.0)
    .build();
```

```kotlin
val dateFormat = regex {
    name = "Date Format"
    pattern = "\\d{4}-\\d{2}-\\d{2}" // YYYY-MM-DD
    threshold = 1.0
}

val emailFormat = regex {
    name = "Email Format"
    pattern = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
    ignoreCase = true
    threshold = 1.0
}
```

**When to use:** validating dates, emails, phone numbers, IDs, or URLs, where the exact value varies but the pattern stays the same.

### LLMJudgeEvaluator

Uses a second LLM to score outputs against criteria you write in plain language.
Use it for quality checks that rules cannot capture.

```java
JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator helpfulness = LLMJudgeEvaluator.builder()
    .name("Helpfulness")
    .criteria("Is the answer helpful and complete? Does it actually solve the user's problem?")
    .evaluationParams(List.of(
        EvalTestCaseParam.INPUT,
        EvalTestCaseParam.ACTUAL_OUTPUT
    ))
    .threshold(0.8)
    .judge(judge)
    .build();
```

```kotlin
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }

val helpfulness: Evaluator = llmJudge(judge) {
    name = "Helpfulness"
    criteria = "Is the answer helpful and complete? Does it actually solve the user's problem?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
    threshold = 0.8
}
```

The evaluator sends your criteria and the test case to the judge model, which returns a score between 0 and 1. The reply is parsed leniently. A one-sentence preamble or trailing prose around the JSON is dropped, so a usable judgment is not lost to a formatting quirk.

A structured output (a record, `Map`, or list) is rendered to the judge as pretty-printed JSON, so you can judge a structured value directly. String and primitive output is passed through verbatim.

By default the judge scores on a 0..1 scale. To let it work on a different range, set `scoreRange(min, max)`. The reported score is normalized back to 0..1, so your `threshold` always stays on the 0..1 scale.

```java
Evaluator helpfulness = LLMJudgeEvaluator.builder()
    .name("Helpfulness")
    .criteria("Rate the answer's helpfulness.")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .scoreRange(1, 5) // judge replies 1..5; score is normalized to 0..1
    .threshold(0.8)
    .judge(judge)
    .build();
```

```kotlin
val helpfulness: Evaluator = llmJudge(judge) {
    name = "Helpfulness"
    criteria = "Rate the answer's helpfulness."
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
    scoreRange(1.0, 5.0) // judge replies 1..5; score is normalized to 0..1
    threshold = 0.8
}
```

**When to use:** semantic correctness, helpfulness, tone, clarity, or any quality you can describe in words more easily than in code.

### StructuralMatchEvaluator

Compares the actual output against the expected output as **JSON structures**, not as opaque strings. Both sides are normalized to a JSON tree first, so a record, a `Map`, or a JSON string all compare object against object. This is the right tool for structured output (extraction results, function-call arguments, typed POJOs) where reformatting, key ordering, or numeric representation should not count as a difference.

Numbers compare **by value, not representation**: `5` equals `5.0`, and `1.0` equals `1.00`, in both modes. Plain string equality of the serialized form would flag those as mismatches. Structural comparison does not.

```java
record Invoice(String id, double total, List<String> items) {}

Evaluator structural = StructuralMatchEvaluator.builder()
    .name("Invoice Match")
    .threshold(1.0)
    .build(); // defaults: STRICT mode, outputKey "output", partial scoring

var testCase = EvalTestCase.builder()
    .expectedOutput("output", new Invoice("INV-1", 42.0, List.of("a", "b")))
    .actualOutput("output", new Invoice("INV-1", 42.00, List.of("a", "b")))
    .build();

EvalResult result = structural.evaluate(testCase);
// result.score() == 1.0 because 42.0 and 42.00 are value-equal
```

```kotlin
data class Invoice(val id: String, val total: Double, val items: List<String>)

val structural: Evaluator = StructuralMatchEvaluator.builder()
    .name("Invoice Match")
    .threshold(1.0)
    .build() // defaults: STRICT mode, outputKey "output", partial scoring

val testCase = EvalTestCase.builder()
    .expectedOutput("output", Invoice("INV-1", 42.0, listOf("a", "b")))
    .actualOutput("output", Invoice("INV-1", 42.00, listOf("a", "b")))
    .build()

val result = structural.evaluate(testCase)
// result.score() == 1.0 because 42.0 and 42.00 are value-equal
```

#### Comparison modes

Set the mode with `.mode(...)` using `StructuralMatchMode`:

- **`STRICT`** (the default) requires the **exact field set** and **exact array order**. An extra field in the actual output is a mismatch and lowers the score. A `null` value is distinct from a missing field.
- **`LENIENT`** allows **extra actual fields** (the actual object may be a superset of the expected one) and ignores array order, comparing arrays as **multisets**. `[1, 1, 2]` does not match `[1, 2]`, but order does not matter. A `null` value and a missing field are treated as equal.

```java
Evaluator lenient = StructuralMatchEvaluator.builder()
    .name("Extraction Match")
    .mode(StructuralMatchMode.LENIENT) // tolerate extra fields, ignore array order
    .threshold(0.9)
    .build();
```

```kotlin
val lenient: Evaluator = StructuralMatchEvaluator.builder()
    .name("Extraction Match")
    .mode(StructuralMatchMode.LENIENT) // tolerate extra fields, ignore array order
    .threshold(0.9)
    .build()
```

#### Scoring

By default the score is the **fraction of matching leaf paths** in `[0.0, 1.0]`, so one wrong field on a large object is a partial miss, not a total failure. In `STRICT` the denominator is the union of expected and actual leaf paths (extra fields lower the score). In `LENIENT` the denominator is the expected leaf paths only.

Call `.binary()` for an **exact-contract gate**. The score collapses to `1.0` when the structures match completely and `0.0` when anything differs. Pair it with `threshold(1.0)` when the output contract must be satisfied exactly.

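The partial score described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Dokimos's implementation: `LeafPathScore`, `leaves`, `sameLeaf`, and `score` are hypothetical names, structures are modeled as nested `Map` and `List` values, and array elements are compared by index, so the multiset behavior of `LENIENT` is not modeled.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Illustrative only: a stand-in for the scoring rule, not Dokimos code. */
class LeafPathScore {

    /** Flattens nested maps and lists into "path -> leaf value" entries. */
    static Map<String, Object> leaves(Object root) {
        Map<String, Object> out = new LinkedHashMap<>();
        collect("$", root, out);
        return out;
    }

    private static void collect(String path, Object node, Map<String, Object> out) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                collect(path + "." + e.getKey(), e.getValue(), out);
            }
        } else if (node instanceof List) {
            List<?> list = (List<?>) node;
            for (int i = 0; i < list.size(); i++) {
                collect(path + "[" + i + "]", list.get(i), out);
            }
        } else {
            out.put(path, node); // anything else is treated as a leaf
        }
    }

    /** Numbers compare by value (5 equals 5.0); other leaves use equals(). */
    static boolean sameLeaf(Object a, Object b) {
        if (a instanceof Number && b instanceof Number) {
            // Assumes finite numbers; NaN and Infinity are out of scope here.
            return new BigDecimal(a.toString()).compareTo(new BigDecimal(b.toString())) == 0;
        }
        return Objects.equals(a, b);
    }

    /** Fraction of matching leaf paths; strict widens the denominator to the union. */
    static double score(Object expected, Object actual, boolean strict) {
        Map<String, Object> exp = leaves(expected);
        Map<String, Object> act = leaves(actual);

        Set<String> denominator = new LinkedHashSet<>(exp.keySet());
        if (strict) {
            denominator.addAll(act.keySet()); // extra actual fields lower the score
        }
        if (denominator.isEmpty()) {
            return 1.0; // nothing to compare; treated as a match in this sketch
        }
        long matching = denominator.stream()
                .filter(p -> exp.containsKey(p) && act.containsKey(p)
                        && sameLeaf(exp.get(p), act.get(p)))
                .count();
        return (double) matching / denominator.size();
    }

    public static void main(String[] args) {
        Map<String, Object> expected =
                Map.of("id", "INV-1", "total", 42, "items", List.of("a", "b"));
        Map<String, Object> actual =
                Map.of("id", "INV-1", "total", 42.0, "items", List.of("a", "c"), "note", "extra");

        System.out.println(score(expected, actual, true));  // 0.6  (3 of 5 paths in the union)
        System.out.println(score(expected, actual, false)); // 0.75 (3 of 4 expected paths)
    }
}
```

Here one mismatched array element and one extra field leave three of five leaf paths matching in the strict calculation and three of four in the lenient one. The `.binary()` gate described earlier is configured as follows.
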
```java
Evaluator contract = StructuralMatchEvaluator.builder()
    .name("Schema Contract")
    .binary() // 1.0 if everything matches, 0.0 otherwise
    .threshold(1.0)
    .build();
```

```kotlin
val contract: Evaluator = StructuralMatchEvaluator.builder()
    .name("Schema Contract")
    .binary() // 1.0 if everything matches, 0.0 otherwise
    .threshold(1.0)
    .build()
```

By default the evaluator reads both sides from the `"output"` key of the expected and actual output maps. Use `.outputKey(...)` to read from a different key. The expected value is required. If it is absent, the evaluator throws.

:::tip
This evaluator pairs with the typed output accessors on `EvalTestCase` (`actualOutputAs(...)` and `expectedOutputAs(...)`). Store your structured result under a map key as a record or `Map`, compare it structurally here, and read it back as a typed object elsewhere. See the [Structured & Typed Data](./structured-typed-data.md) hub for the whole pipeline end to end.
:::

**When to use:** structured or JSON output (extraction results, tool-call arguments, typed response objects) where you care about the data, not its textual formatting, and where numeric representation differences (`5` vs `5.0`) should never count as a regression.

### FaithfulnessEvaluator

Checks if the output is grounded in the provided context. Use it in RAG systems to make sure the LLM is not making things up.

```java
JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator faithfulness = FaithfulnessEvaluator.builder()
    .threshold(0.8)
    .judge(judge)
    .contextKey("retrievedContext") // Where to find the context in outputs
    .includeReason(true)
    .build();
```

```kotlin
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }

val faithfulness: Evaluator = faithfulness(judge) {
    threshold = 0.8
    contextKey = "retrievedContext" // Where to find the context in outputs
    includeReason = true
}
```

The evaluator:

1. Breaks the output into individual claims.
2. Checks each claim against the retrieved context.
3. Calculates score = (supported claims) / (total claims).

**When to use:** any RAG system where accuracy matters. If your LLM answers from retrieved documents, use this to catch hallucinations.

### HallucinationEvaluator

Detects output that the context does not support. `FaithfulnessEvaluator` measures how much is grounded. This evaluator measures the share of content that is hallucinated.

```java
JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator hallucination = HallucinationEvaluator.builder()
    .threshold(0.3) // Allow at most 30% hallucinated content
    .judge(judge)
    .contextKey("context")
    .includeReason(true)
    .build();
```

```kotlin
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }

val hallucination: Evaluator = hallucination(judge) {
    threshold = 0.3 // Allow at most 30% hallucinated content
    contextKey = "context"
    includeReason = true
}
```

The evaluator:

1. Breaks the output into individual statements.
2. Checks if the context supports each statement.
3. Calculates score = (unsupported statements) / (total statements).

**Important:** for this evaluator, **lower scores are better** (0.0 means no hallucinations). Success is `score <= threshold`.

**When to use:** when you need to measure and cap the hallucination rate, especially in high-stakes applications where any fabricated information is a problem.

### ContextualRelevanceEvaluator

Measures how relevant the retrieved context chunks are to the user's query. Use it to evaluate retrieval quality in RAG systems.

```java
JudgeLM judge = prompt -> judgeModel.generate(prompt);

Evaluator relevance = ContextualRelevanceEvaluator.builder()
    .threshold(0.5)
    .judge(judge)
    .retrievalContextKey("retrievalContext")
    .includeReason(true)
    .strictMode(false) // Set to true for threshold of 1.0
    .build();
```

```kotlin
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }

val relevance: Evaluator = contextualRelevance(judge) {
    threshold = 0.5
    retrievalContextKey = "retrievalContext"
    includeReason = true
    strictMode = false // Set to true for threshold of 1.0
}
```

The evaluator:

1. Scores each context chunk on its own (0.0 to 1.0) for relevance to the query.
2. Calculates the final score as the mean of all chunk scores.
3. Stores the individual chunk scores in the result metadata.

```java
var testCase = EvalTestCase.builder()
    .input("What are symptoms of dehydration?")
    .actualOutput("retrievalContext", List.of(
        "Dehydration symptoms include thirst and fatigue.", // Highly relevant
        "The Pacific Ocean is the largest ocean.",          // Irrelevant
        "Severe dehydration can cause dizziness."           // Highly relevant
    ))
    .build();

EvalResult result = relevance.evaluate(testCase);
// result.score() ≈ 0.63 (average of individual scores)
// result.metadata().get("contextScores") contains per-chunk details
```

```kotlin
val testCase = EvalTestCase(
    input = "What are symptoms of dehydration?",
    actualOutputs = mapOf("retrievalContext" to listOf(
        "Dehydration symptoms include thirst and fatigue.", // Highly relevant
        "The Pacific Ocean is the largest ocean.",          // Irrelevant
        "Severe dehydration can cause dizziness."           // Highly relevant
    )))

val result = relevance.evaluate(testCase)
// result.score() ≈ 0.63 (average of individual scores)
// result.metadata()["contextScores"] contains per-chunk details
```

**When to use:** evaluating retrieval quality in RAG pipelines. It tells you when your retriever returns irrelevant documents that could confuse the LLM or dilute the answer.

### PrecisionEvaluator

Measures what fraction of retrieved items are actually relevant. Needs ground truth labels.

```java
Evaluator precision = PrecisionEvaluator.builder()
    .name("retrieval-precision")
    .retrievedKey("retrievedDocs") // Key in actualOutputs
    .expectedKey("relevantDocs")   // Key in expectedOutputs (ground truth)
    .matchingStrategy(MatchingStrategy.byEquality())
    .threshold(0.8)
    .build();
```

```kotlin
val precision: Evaluator = precision {
    name = "retrieval-precision"
    retrievedKey = "retrievedDocs" // Key in actualOutputs
    expectedKey = "relevantDocs"   // Key in expectedOutputs (ground truth)
    matchingStrategy = MatchingStrategy.byEquality()
    threshold = 0.8
}
```

**Formula:** `precision = |relevant ∩ retrieved| / |retrieved|`

A precision of 1.0 means every retrieved item was relevant (no false positives).

**When to use:** when you need to minimize noise in retrieved results. High precision matters when downstream processing is expensive or when irrelevant items could mislead the LLM.

### RecallEvaluator

Measures what fraction of relevant items were actually retrieved. Needs ground truth labels.

```java
Evaluator recall = RecallEvaluator.builder()
    .name("retrieval-recall")
    .retrievedKey("retrievedDocs")
    .expectedKey("relevantDocs")
    .matchingStrategy(MatchingStrategy.byEquality())
    .threshold(0.8)
    .build();
```

```kotlin
val recall: Evaluator = recall {
    name = "retrieval-recall"
    retrievedKey = "retrievedDocs"
    expectedKey = "relevantDocs"
    matchingStrategy = MatchingStrategy.byEquality()
    threshold = 0.8
}
```

**Formula:** `recall = |relevant ∩ retrieved| / |relevant|`

A recall of 1.0 means all relevant items were found (no false negatives).

**When to use:** when missing relevant information is costly. High recall matters for complete answers or when the user expects full coverage.

### Matching strategies

Both `PrecisionEvaluator` and `RecallEvaluator` support several strategies for matching retrieved items to ground truth.

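To see why the choice of strategy changes the numbers, the two formulas can be computed with a pluggable match predicate. The sketch below is an illustration only and does not use Dokimos: `RetrievalMetrics` and its methods are hypothetical, a plain `BiPredicate` stands in for a matching strategy, and returning `0.0` for an empty list is a convention chosen for this example.

```java
import java.util.List;
import java.util.function.BiPredicate;

/** Illustrative only: hypothetical helper, not part of Dokimos. */
class RetrievalMetrics {

    /** precision = |relevant ∩ retrieved| / |retrieved| */
    static <T> double precision(List<T> retrieved, List<T> relevant, BiPredicate<T, T> matches) {
        if (retrieved.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        long hits = retrieved.stream()
                .filter(r -> relevant.stream().anyMatch(g -> matches.test(r, g)))
                .count();
        return (double) hits / retrieved.size();
    }

    /** recall = |relevant ∩ retrieved| / |relevant| */
    static <T> double recall(List<T> retrieved, List<T> relevant, BiPredicate<T, T> matches) {
        if (relevant.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        long found = relevant.stream()
                .filter(g -> retrieved.stream().anyMatch(r -> matches.test(r, g)))
                .count();
        return (double) found / relevant.size();
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of("doc-1", "DOC-2", "doc-9");
        List<String> relevant = List.of("doc-1", "doc-2", "doc-3", "doc-4");

        BiPredicate<String, String> byEquality = String::equals;
        BiPredicate<String, String> caseInsensitive = String::equalsIgnoreCase;

        // Exact equality: only "doc-1" matches (1 of 3 retrieved, 1 of 4 relevant).
        System.out.println(precision(retrieved, relevant, byEquality));
        System.out.println(recall(retrieved, relevant, byEquality));

        // Ignoring case: "DOC-2" now matches too (2 of 3 retrieved, 2 of 4 relevant).
        System.out.println(precision(retrieved, relevant, caseInsensitive));
        System.out.println(recall(retrieved, relevant, caseInsensitive));
    }
}
```

The same retrieval run scores differently depending on how two items are judged to be "the same", which is what a matching strategy decides. The strategies Dokimos provides are listed next.
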
```java
// Simple equality (default, for string IDs)
MatchingStrategy.byEquality()

// Case-insensitive string matching
MatchingStrategy.caseInsensitive()

// Match by a specific field (for Map/JSON objects)
MatchingStrategy.byField("id")

// Match by multiple fields (e.g., knowledge graph triples)
MatchingStrategy.byFields("subject", "predicate", "object")

// Substring containment matching
MatchingStrategy.byContainment(true) // normalized

// LLM-based semantic matching (most flexible, most expensive)
MatchingStrategy.llmBased(judge)

// Combine strategies
MatchingStrategy.anyOf(strategy1, strategy2) // OR
MatchingStrategy.allOf(strategy1, strategy2) // AND
```

```kotlin
// Simple equality (default, for string IDs)
MatchingStrategy.byEquality()

// Case-insensitive string matching
MatchingStrategy.caseInsensitive()

// Match by a specific field (for Map/JSON objects)
MatchingStrategy.byField("id")

// Match by multiple fields (e.g., knowledge graph triples)
MatchingStrategy.byFields("subject", "predicate", "object")

// Substring containment matching
MatchingStrategy.byContainment(normalize = true)

// LLM-based semantic matching (most flexible, most expensive)
MatchingStrategy.llmBased(judge)

// Combine strategies
MatchingStrategy.anyOf(strategy1, strategy2) // OR
MatchingStrategy.allOf(strategy1, strategy2) // AND
```

**Example with knowledge graph triples:**

```java
var precision = PrecisionEvaluator.builder()
    .retrievedKey("triples")
    .expectedKey("relevantTriples")
    .matchingStrategy(MatchingStrategy.byFields("subject", "predicate", "object"))
    .build();

var testCase = EvalTestCase.builder()
    .input("Who founded Microsoft?")
    .actualOutput("triples", List.of(
        Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft")
    ))
    .expectedOutput("relevantTriples", List.of(
        Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft"),
        Map.of("subject", "Paul Allen", "predicate", "co-founded", "object", "Microsoft")
    ))
    .build();
```

```kotlin
val precision = precision {
    retrievedKey = "triples"
    expectedKey = "relevantTriples"
    matchingStrategy = MatchingStrategy.byFields("subject", "predicate", "object")
}

val testCase = EvalTestCase(
    input = "Who founded Microsoft?",
    actualOutputs = mapOf("triples" to listOf(
        mapOf("subject" to "Bill Gates", "predicate" to "founded", "object" to "Microsoft")
    )),
    expectedOutputs = mapOf("relevantTriples" to listOf(
        mapOf("subject" to "Bill Gates", "predicate" to "founded", "object" to "Microsoft"),
        mapOf("subject" to "Paul Allen", "predicate" to "co-founded", "object" to "Microsoft")
    )))
```

### Agent evaluators

Dokimos ships specialized evaluators for AI agents that use tools. They cover task completion, tool call validation, argument hallucination detection, and tool definition quality.

See the dedicated **[Agent Evaluation](./agent-evaluation)** guide for full documentation.

## Common configuration

Every evaluator supports these settings.

**Name** sets how the evaluator shows up in results.

```java
.name("Answer Quality")
```

```kotlin
name = "Answer Quality"
```

**Threshold** sets the minimum score needed to pass.

```java
.threshold(0.8) // Needs 80% or higher
```

```kotlin
threshold = 0.8 // Needs 80% or higher
```

**Evaluation parameters** set which fields the evaluator reads.

```java
.evaluationParams(List.of(
    EvalTestCaseParam.INPUT,           // The user's question
    EvalTestCaseParam.EXPECTED_OUTPUT, // What you expect
    EvalTestCaseParam.ACTUAL_OUTPUT    // What the LLM actually said
))
```

```kotlin
params(
    EvalTestCaseParam.INPUT,           // The user's question
    EvalTestCaseParam.EXPECTED_OUTPUT, // What you expect
    EvalTestCaseParam.ACTUAL_OUTPUT,   // What the LLM actually said
)
```

## Creating custom evaluators

When no built-in evaluator fits, write your own by extending `BaseEvaluator`. Override `runEvaluation` and return an `EvalResult`.

```java
public class ResponseLengthEvaluator extends BaseEvaluator {
    private final int minLength;
    private final int maxLength;

    public ResponseLengthEvaluator(String name, int minLength, int maxLength) {
        super(name, 1.0, List.of(EvalTestCaseParam.ACTUAL_OUTPUT));
        this.minLength = minLength;
        this.maxLength = maxLength;
    }

    @Override
    protected EvalResult runEvaluation(EvalTestCase testCase) {
        String output = testCase.actualOutput();
        int length = output.length();

        boolean withinBounds = length >= minLength && length <= maxLength;
        double score = withinBounds ? 1.0 : 0.0;

        String reason = String.format("Output length %d (expected %d-%d)",
            length, minLength, maxLength);

        return EvalResult.builder()
            .name(name())
            .score(score)
            .threshold(threshold())
            .reason(reason)
            .build();
    }
}

// Usage
Evaluator lengthCheck = new ResponseLengthEvaluator("Length Check", 50, 200);
```

```kotlin
class ResponseLengthEvaluator(
    private val minLength: Int,
    private val maxLength: Int,
    private val evaluatorName: String = "Length Check"
) : BaseEvaluator(evaluatorName, 1.0, listOf(EvalTestCaseParam.ACTUAL_OUTPUT)) {

    override fun runEvaluation(testCase: EvalTestCase): EvalResult {
        val output = testCase.actualOutput()
        val length = output.length

        val withinBounds = length in minLength..maxLength
        val score = if (withinBounds) 1.0 else 0.0

        val reason = "Output length $length (expected $minLength-$maxLength)"

        return EvalResult(
            name = name(),
            score = score,
            threshold = threshold(),
            reason = reason,
        )
    }
}

// Usage
val lengthCheck: Evaluator = ResponseLengthEvaluator(50, 200)
```

For very simple checks, implement the `Evaluator` interface directly.

## Combining multiple evaluators

Most applications need to pass several quality checks. Put the evaluators in a list and run them together.

```java
List<Evaluator> evaluators = List.of(
    // Check if the answer is correct
    LLMJudgeEvaluator.builder()
        .name("Correctness")
        .criteria("Is the answer factually correct?")
        .threshold(0.85)
        .judge(judge)
        .build(),

    // Check if it's grounded in retrieved docs (RAG)
    FaithfulnessEvaluator.builder()
        .threshold(0.80)
        .judge(judge)
        .contextKey("retrievedContext")
        .build(),

    // Check if it follows the required format
    RegexEvaluator.builder()
        .name("Format Check")
        .pattern("^[A-Z].*\\.$") // Must start with capital and end with period
        .threshold(1.0)
        .build()
);
```

```kotlin
val evaluators: List<Evaluator> = evaluators {
    // Check if the answer is correct
    llmJudge(judge) {
        name = "Correctness"
        criteria = "Is the answer factually correct?"
        threshold = 0.85
    }

    // Check if it's grounded in retrieved docs (RAG)
    faithfulness(judge) {
        threshold = 0.80
        contextKey = "retrievedContext"
    }

    // Check if it follows the required format
    regex {
        name = "Format Check"
        pattern = "^[A-Z].*\\.$" // Must start with capital and end with period
        threshold = 1.0
    }
}
```

An output passes only if it meets **all** the thresholds. This lets you enforce several quality dimensions at once.

## Best practices

### Pick the right evaluator for the job

- Use **ExactMatch** when there is only one correct answer (math, data extraction).
- Use **Regex** for format validation (dates, emails, IDs).
- Use **StructuralMatch** for structured or JSON output where formatting and numeric representation should not count as differences (see the [Structured & Typed Data](./structured-typed-data.md) hub).
- Use **LLMJudge** for semantic quality (helpfulness, clarity, tone).
- Use **Faithfulness** for RAG systems to measure how grounded the output is.
- Use **Hallucination** to measure and cap fabricated content.
- Use **ContextualRelevance** to evaluate retrieval quality without ground truth.
- Use **Precision/Recall** when you have ground truth labels for relevant items.
- Use **[Agent evaluators](./agent-evaluation)** to evaluate AI agents that use tools (task completion, tool validity, argument hallucination, tool reliability).
- Build **custom evaluators** for domain-specific requirements.

### Start with looser thresholds

Do not aim for perfection right away. Start around 0.7 to 0.8 and tighten as your system improves. A threshold of 1.0 fails on any imperfection.

### Write specific criteria for LLM judges

Be clear about what you are scoring.

```java
// Good (specific and measurable)
.criteria("Does the answer correctly explain the refund process and mention the 30-day policy?")

// Bad (too vague)
.criteria("Is this good?")
```

```kotlin
// Good (specific and measurable)
criteria = "Does the answer correctly explain the refund process and mention the 30-day policy?"

// Bad (too vague)
criteria = "Is this good?"
```

### Use multiple evaluators for important outputs

Check each aspect on its own: correctness, format, grounding, tone. This shows you exactly where things go wrong.

### Test your evaluators

Confirm your evaluators behave on known examples before you rely on them.

```java
@Test
void faithfulnessEvaluatorShouldCatchHallucination() {
    var testCase = EvalTestCase.builder()
        .actualOutput("The product costs $500") // Made up
        .metadata(Map.of("context", List.of("The product costs $100")))
        .build();

    var result = faithfulnessEvaluator.evaluate(testCase);

    // Should fail because claim isn't in context
    assertFalse(result.success());
}
```

```kotlin
@Test
fun faithfulnessEvaluatorShouldCatchHallucination() {
    val testCase = EvalTestCase.builder()
        .actualOutput("The product costs $500") // Made up
        .metadata(mapOf("context" to listOf("The product costs $100")))
        .build()

    val result = faithfulnessEvaluator.evaluate(testCase)

    // Should fail because claim isn't in context
    assertFalse(result.success())
}
```

## Using evaluator results

`evaluate` returns an `EvalResult` with the score, the pass status, and an explanation. Read them directly.

```java
EvalResult result = evaluator.evaluate(testCase);

System.out.println("Score: " + result.score());
System.out.println("Passed: " + result.success());
System.out.println("Reason: " + result.reason());
```

```kotlin
val result = evaluator.evaluate(testCase)

println("Score: ${result.score()}")
println("Passed: ${result.success()}")
println("Reason: ${result.reason()}")
```

In experiments, analyze results across all examples.

```java
ExperimentResult experimentResult = experiment.run();

// Average scores per evaluator
double avgCorrectness = experimentResult.averageScore("Correctness");
double avgFaithfulness = experimentResult.averageScore("Faithfulness");

// Dig into individual results
for (ItemResult item : experimentResult.itemResults()) {
    for (EvalResult eval : item.evalResults()) {
        if (!eval.success()) {
            System.out.println("Failed: " + eval.name() + " (" + eval.reason() + ")");
        }
    }
}
```

```kotlin
val experimentResult = experiment.run()

// Average scores per evaluator
val avgCorrectness = experimentResult.averageScore("Correctness")
val avgFaithfulness = experimentResult.averageScore("Faithfulness")

// Dig into individual results
experimentResult.itemResults().forEach { item ->
    item.evalResults()
        .filterNot { eval -> eval.success() }
        .forEach { eval -> println("Failed: ${eval.name()} (${eval.reason()})") }
}
```

In JUnit tests, a failing evaluator fails the test.

```java
@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void shouldProduceQualityAnswers(Example example) {
    String answer = aiService.generate(example.input());
    var testCase = example.toTestCase(answer);

    // Fails test if evaluators don't pass
    Assertions.assertEval(testCase, evaluators);
}
```

```kotlin
@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
fun shouldProduceQualityAnswers(example: Example) {
    val answer = aiService.generate(example.input())
    val testCase = example.toTestCase(answer)

    // Fails test if evaluators don't pass
    Assertions.assertEval(testCase, evaluators)
}
```
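
As a rough mental model for reading a batch of results, the pass rules stated on this page reduce to a per-result comparison plus an "all must pass" check. The sketch below uses hypothetical stand-in types (`PassRules`, `Result`, `failures`) and is not how Dokimos or its JUnit assertion is implemented. It only encodes two rules from the text: a score normally passes when it meets the threshold, and a hallucination-style score passes when it is at or below the threshold.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-ins only; these are not Dokimos types. */
class PassRules {

    /** Hypothetical, simplified result: a name, a score, and a threshold. */
    static final class Result {
        final String name;
        final double score;
        final double threshold;
        final boolean lowerIsBetter; // true for hallucination-style metrics

        Result(String name, double score, double threshold, boolean lowerIsBetter) {
            this.name = name;
            this.score = score;
            this.threshold = threshold;
            this.lowerIsBetter = lowerIsBetter;
        }

        /** score >= threshold normally; score <= threshold when lower is better. */
        boolean success() {
            return lowerIsBetter ? score <= threshold : score >= threshold;
        }
    }

    /** Names of every failing result; an empty list means the output passes. */
    static List<String> failures(List<Result> results) {
        List<String> failed = new ArrayList<>();
        for (Result r : results) {
            if (!r.success()) {
                failed.add(r.name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Result> results = List.of(
                new Result("Correctness", 0.90, 0.85, false),  // passes: 0.90 >= 0.85
                new Result("Faithfulness", 0.70, 0.80, false), // fails:  0.70 <  0.80
                new Result("Hallucination", 0.10, 0.30, true)  // passes: 0.10 <= 0.30
        );
        System.out.println(failures(results)); // [Faithfulness]
    }
}
```

Listing the failing names instead of returning a single boolean mirrors the advice above to check each quality dimension on its own.
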