Evaluators
An evaluator scores one of your LLM's outputs and tells you if it passes. Each evaluator returns a score from 0.0 to 1.0 and compares it against a threshold you set. Use this page to pick a built-in evaluator, configure it, and read its result.
Start with a built-in evaluator for common checks (exact matches, regex patterns, LLM judging, RAG grounding, retrieval quality). Write a custom one when none of them fit.
The Evaluator interface
Every evaluator implements Evaluator. It has three methods: score a test case, report its name, and report its threshold.
- Java
- Kotlin
public interface Evaluator {
    EvalResult evaluate(EvalTestCase testCase);
    String name();
    double threshold();
}
interface Evaluator {
    fun evaluate(testCase: EvalTestCase): EvalResult
    fun name(): String
    fun threshold(): Double
}
Evaluators that extend BaseEvaluator can also run asynchronously. Call evaluateAsync to get a CompletableFuture.
- Java
- Kotlin
// Async using the common fork-join pool
CompletableFuture<EvalResult> future = evaluator.evaluateAsync(testCase);
// Async with a custom executor (shut it down when you are done with it)
ExecutorService executor = Executors.newFixedThreadPool(4);
CompletableFuture<EvalResult> futureOnExecutor = evaluator.evaluateAsync(testCase, executor);
// Async using common fork-join pool
val evalResult = evaluator.evaluateAsync(testCase).await()
// Async with custom executor
val executor = Executors.newFixedThreadPool(4)
val evalResult2 = evaluator.evaluateAsync(testCase, executor).await()
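When several evaluations run at once, the futures can be started first and joined afterwards. The following is a minimal sketch using only JDK classes. `scoreAsync` is a hypothetical stand-in for `evaluator.evaluateAsync(testCase, executor)` and returns a plain score so the example runs without the library.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class AsyncFanOutSketch {

    // Hypothetical stand-in for evaluator.evaluateAsync(testCase, executor).
    static CompletableFuture<Double> scoreAsync(double score, ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> score, executor);
    }

    // Starts every evaluation first, then waits for all of them, keeping input order.
    static List<Double> runAll(List<Double> fakeScores) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<Double>> futures = fakeScores.stream()
                    .map(s -> scoreAsync(s, executor))
                    .collect(Collectors.toList());
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            executor.shutdown(); // a custom executor is yours to shut down
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(List.of(0.9, 0.4, 1.0)));
    }
}
```

Collecting the futures into a list before joining matters: joining inside the same stream pipeline that creates them would run the evaluations one after another.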
Every call returns an EvalResult. It holds:
- score: numeric score (0.0 to 1.0)
- success: whether the score meets the threshold
- reason: explanation of the score
- metadata: extra evaluation data
Built-in evaluators
ExactMatchEvaluator
Checks if the output matches the expected result exactly. Use it when there is one correct answer.
- Java
- Kotlin
Evaluator evaluator = ExactMatchEvaluator.builder()
.name("Exact Match")
.threshold(1.0)
.build();
val evaluator = exactMatch {
name = "Exact Match"
threshold = 1.0
}
Returns 1.0 if the strings match, 0.0 otherwise.
When to use: math calculations, code generation, or any case where the output is a string that should come back exactly as expected.
ExactMatchEvaluator compares the string forms of the outputs (toString()). For a structured output (a record, Map, or list) use StructuralMatchEvaluator instead. It compares the values structurally and ignores formatting and numeric representation (5 vs 5.0).
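The difference between string comparison and value comparison can be seen with plain JDK calls. This sketch is an illustration only; it does not reproduce how either evaluator is implemented, and both helper names are hypothetical.

```java
import java.math.BigDecimal;

final class StringVsValueSketch {

    // Compares the string forms, as a toString()-based exact match would.
    static boolean sameString(Object expected, Object actual) {
        return String.valueOf(expected).equals(String.valueOf(actual));
    }

    // Compares numbers by value: compareTo ignores scale, so 5 and 5.0 are equal.
    static boolean sameNumericValue(Number expected, Number actual) {
        return new BigDecimal(expected.toString())
                .compareTo(new BigDecimal(actual.toString())) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameString(5, 5.0));       // false: "5" vs "5.0"
        System.out.println(sameNumericValue(5, 5.0)); // true
    }
}
```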
RegexEvaluator
Checks if the output matches a pattern. Use it to validate format when the exact content can vary.
- Java
- Kotlin
Evaluator dateFormat = RegexEvaluator.builder()
.name("Date Format")
.pattern("\\d{4}-\\d{2}-\\d{2}") // YYYY-MM-DD
.threshold(1.0)
.build();
Evaluator emailFormat = RegexEvaluator.builder()
.name("Email Format")
.pattern("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")
.ignoreCase(true)
.threshold(1.0)
.build();
val dateFormat = regex {
name = "Date Format"
pattern = "\\d{4}-\\d{2}-\\d{2}" // YYYY-MM-DD
threshold = 1.0
}
val emailFormat = regex {
name = "Email Format"
pattern = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
ignoreCase = true
threshold = 1.0
}
When to use: validating dates, emails, phone numbers, IDs, or URLs, where the exact value varies but the pattern stays the same.
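This page does not say whether the pattern must cover the whole output or may match a substring, so it is worth testing your pattern both ways before relying on it. The sketch below uses `java.util.regex` directly with the date pattern from the example above; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexCheckSketch {

    static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // True only when the entire output is a date-shaped string.
    static boolean wholeOutputMatches(String output) {
        return DATE.matcher(output).matches();
    }

    // True when a date-shaped substring appears anywhere in the output.
    static boolean outputContainsMatch(String output) {
        return DATE.matcher(output).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeOutputMatches("Due on 2024-03-15."));  // false
        System.out.println(outputContainsMatch("Due on 2024-03-15.")); // true
    }
}
```

A pattern like this validates shape, not meaning: `2024-99-99` also matches.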
LLMJudgeEvaluator
Uses a second LLM to score outputs against criteria you write in plain language. Use it for quality checks that rules cannot capture.
- Java
- Kotlin
JudgeLM judge = prompt -> judgeModel.generate(prompt);
Evaluator helpfulness = LLMJudgeEvaluator.builder()
.name("Helpfulness")
.criteria("Is the answer helpful and complete? Does it actually solve the user's problem?")
.evaluationParams(List.of(
EvalTestCaseParam.INPUT,
EvalTestCaseParam.ACTUAL_OUTPUT
))
.threshold(0.8)
.judge(judge)
.build();
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }
val helpfulness: Evaluator = llmJudge(judge) {
name = "Helpfulness"
criteria = "Is the answer helpful and complete? Does it actually solve the user's problem?"
params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
threshold = 0.8
}
The evaluator sends your criteria and the test case to the judge model, which replies with a score between 0 and 1. The reply is expected as JSON and is parsed leniently: a one-sentence preamble or trailing prose around the JSON is dropped, so a usable judgment is not lost to a formatting quirk.
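One way lenient parsing can work is to cut the reply down to the outermost braces before reading the score. This is a sketch of the idea, not the library's parser. The reply shape and the `score` field name are assumptions made for illustration.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LenientJudgeReplySketch {

    // Assumed reply field: "score": <number>.
    private static final Pattern SCORE =
            Pattern.compile("\"score\"\\s*:\\s*(-?\\d+(?:\\.\\d+)?)");

    static OptionalDouble parseScore(String reply) {
        int start = reply.indexOf('{');
        int end = reply.lastIndexOf('}');
        if (start < 0 || end < start) {
            return OptionalDouble.empty(); // no JSON object in the reply
        }
        // Everything before the first '{' and after the last '}' is dropped.
        String json = reply.substring(start, end + 1);
        Matcher m = SCORE.matcher(json);
        return m.find()
                ? OptionalDouble.of(Double.parseDouble(m.group(1)))
                : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        String reply = "Sure, here is my verdict: {\"score\": 0.9, \"reason\": \"grounded\"} Hope that helps!";
        System.out.println(parseScore(reply));
    }
}
```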
A structured output (a record, Map, or list) is rendered to the judge as pretty-printed JSON, so you can judge a structured value directly. String and primitive output is passed through verbatim.
By default the judge scores on a 0..1 scale. To let it work on a different range, set scoreRange(min, max). The reported score is normalized back to 0..1, so your threshold always stays on the 0..1 scale.
- Java
- Kotlin
Evaluator helpfulness = LLMJudgeEvaluator.builder()
.name("Helpfulness")
.criteria("Rate the answer's helpfulness.")
.evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
.scoreRange(1, 5) // judge replies 1..5; score is normalized to 0..1
.threshold(0.8)
.judge(judge)
.build();
val helpfulness: Evaluator = llmJudge(judge) {
name = "Helpfulness"
criteria = "Rate the answer's helpfulness."
params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
scoreRange(1.0, 5.0) // judge replies 1..5; score is normalized to 0..1
threshold = 0.8
}
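To see how a custom range interacts with the threshold, it helps to work through the arithmetic. The sketch below assumes a linear min-max mapping and clamps out-of-range replies; both are assumptions, since this page only states that the score is normalized back to 0..1.

```java
final class ScoreRangeSketch {

    // Assumed linear mapping: min -> 0.0, max -> 1.0.
    static double normalize(double raw, double min, double max) {
        if (max <= min) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        double clamped = Math.max(min, Math.min(max, raw)); // assumption: clamp stray replies
        return (clamped - min) / (max - min);
    }

    public static void main(String[] args) {
        System.out.println(normalize(4, 1, 5)); // 0.75
        System.out.println(normalize(5, 1, 5)); // 1.0
    }
}
```

Under this mapping a judge reply of 4 on a 1..5 scale becomes 0.75, which is below the 0.8 threshold used above, so only a 5 would pass. Pick the threshold with the normalized values in mind.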
When to use: semantic correctness, helpfulness, tone, clarity, or any quality you can describe in words more easily than in code.
StructuralMatchEvaluator
Compares the actual output against the expected output as JSON structures, not as opaque strings. Both sides are normalized to a JSON tree first, so a record, a Map, or a JSON string all compare object against object. This is the right tool for structured output (extraction results, function-call arguments, typed POJOs) where reformatting, key ordering, or numeric representation should not count as a difference.
Numbers compare by value, not representation: 5 equals 5.0, and 1.0 equals 1.00, in both STRICT and LENIENT modes (described below). Plain string equality of the serialized form would flag those as mismatches; structural comparison does not.
- Java
- Kotlin
record Invoice(String id, double total, List<String> items) {}
Evaluator structural = StructuralMatchEvaluator.builder()
.name("Invoice Match")
.threshold(1.0)
.build(); // STRICT mode, outputKey "output", partial scoring
var testCase = EvalTestCase.builder()
.expectedOutput("output", new Invoice("INV-1", 42.0, List.of("a", "b")))
.actualOutput("output", new Invoice("INV-1", 42.00, List.of("a", "b")))
.build();
EvalResult result = structural.evaluate(testCase);
// result.score() == 1.0 because 42.0 and 42.00 are value-equal
data class Invoice(val id: String, val total: Double, val items: List<String>)
val structural: Evaluator = StructuralMatchEvaluator.builder()
.name("Invoice Match")
.threshold(1.0)
.build() // STRICT mode, outputKey "output", partial scoring
val testCase = EvalTestCase.builder()
.expectedOutput("output", Invoice("INV-1", 42.0, listOf("a", "b")))
.actualOutput("output", Invoice("INV-1", 42.00, listOf("a", "b")))
.build()
val result = structural.evaluate(testCase)
// result.score() == 1.0 because 42.0 and 42.00 are value-equal
Comparison modes
Set the mode with .mode(...) using StructuralMatchMode:
- STRICT (the default) requires the exact field set and exact array order. An extra field in the actual output is a mismatch and lowers the score. A null value is distinct from a missing field.
- LENIENT allows extra actual fields (the actual object may be a superset of the expected one) and ignores array order, comparing arrays as multisets: [1, 1, 2] does not match [1, 2], but order does not matter. A null value and a missing field are treated as equal.
- Java
- Kotlin
Evaluator lenient = StructuralMatchEvaluator.builder()
.name("Extraction Match")
.mode(StructuralMatchMode.LENIENT) // tolerate extra fields, ignore array order
.threshold(0.9)
.build();
val lenient: Evaluator = StructuralMatchEvaluator.builder()
.name("Extraction Match")
.mode(StructuralMatchMode.LENIENT) // tolerate extra fields, ignore array order
.threshold(0.9)
.build()
Scoring
By default the score is the fraction of matching leaf paths in [0.0, 1.0], so one wrong field on a large object is a partial miss, not a total failure. In STRICT the denominator is the union of expected and actual leaf paths (extra fields lower the score). In LENIENT the denominator is the expected leaf paths only.
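The two denominators can be worked through on a small example. The sketch below flattens nested maps and lists into leaf paths and counts matches. It is a simplified model under stated assumptions, not the library's algorithm: the path syntax is invented, and the LENIENT handling of array order and null values is left out.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

final class LeafPathScoreSketch {

    // Flattens nested maps and lists into entries such as "items[0]" -> "a".
    static Map<String, Object> leaves(Object node, String path, Map<String, Object> out) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                String child = path.isEmpty() ? String.valueOf(e.getKey()) : path + "." + e.getKey();
                leaves(e.getValue(), child, out);
            }
        } else if (node instanceof List) {
            List<?> list = (List<?>) node;
            for (int i = 0; i < list.size(); i++) {
                leaves(list.get(i), path + "[" + i + "]", out);
            }
        } else {
            out.put(path, node);
        }
        return out;
    }

    // Numbers compare by value (42 equals 42.0); everything else by equals().
    static boolean sameLeaf(Object a, Object b) {
        if (a instanceof Number && b instanceof Number) {
            return new BigDecimal(a.toString()).compareTo(new BigDecimal(b.toString())) == 0;
        }
        return Objects.equals(a, b);
    }

    static double score(Object expected, Object actual, boolean strict) {
        Map<String, Object> exp = leaves(expected, "", new LinkedHashMap<>());
        Map<String, Object> act = leaves(actual, "", new LinkedHashMap<>());
        long matching = exp.entrySet().stream()
                .filter(e -> act.containsKey(e.getKey()) && sameLeaf(e.getValue(), act.get(e.getKey())))
                .count();
        Set<String> denominator = new TreeSet<>(exp.keySet());
        if (strict) {
            denominator.addAll(act.keySet()); // extra actual fields count against the score
        }
        return denominator.isEmpty() ? 1.0 : (double) matching / denominator.size();
    }

    public static void main(String[] args) {
        Object expected = Map.of("id", "INV-1", "total", 42, "items", List.of("a", "b"));
        Object actual = Map.of("id", "INV-1", "total", 42.0, "items", List.of("a", "b"), "note", "extra");
        System.out.println(score(expected, actual, true));  // 0.8: 4 matches over 5 paths
        System.out.println(score(expected, actual, false)); // 1.0: 4 matches over 4 expected paths
    }
}
```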
Call .binary() for an exact-contract gate. The score collapses to 1.0 when the structures match completely and 0.0 when anything differs. Pair it with threshold(1.0) when the output contract must be satisfied exactly.
- Java
- Kotlin
Evaluator contract = StructuralMatchEvaluator.builder()
.name("Schema Contract")
.binary() // 1.0 if everything matches, 0.0 otherwise
.threshold(1.0)
.build();
val contract: Evaluator = StructuralMatchEvaluator.builder()
.name("Schema Contract")
.binary() // 1.0 if everything matches, 0.0 otherwise
.threshold(1.0)
.build()
By default the evaluator reads both sides from the "output" key of the expected and actual output maps. Use .outputKey(...) to read from a different key. The expected value is required. If it is absent, the evaluator throws.
This evaluator pairs with the typed output accessors on EvalTestCase (actualOutputAs(...) and expectedOutputAs(...)). Store your structured result under a map key as a record or Map, compare it structurally here, and read it back as a typed object elsewhere. See the Structured & Typed Data hub for the whole pipeline end to end.
When to use: structured or JSON output (extraction results, tool-call arguments, typed response objects) where you care about the data, not its textual formatting, and where numeric representation differences (5 vs 5.0) should never count as a regression.
FaithfulnessEvaluator
Checks if the output is grounded in the provided context. Use it in RAG systems to make sure the LLM is not making things up.
- Java
- Kotlin
JudgeLM judge = prompt -> judgeModel.generate(prompt);
Evaluator faithfulness = FaithfulnessEvaluator.builder()
.threshold(0.8)
.judge(judge)
.contextKey("retrievedContext") // Where to find the context in outputs
.includeReason(true)
.build();
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }
val faithfulness: Evaluator = faithfulness(judge) {
threshold = 0.8
contextKey = "retrievedContext" // Where to find the context in outputs
includeReason = true
}
The evaluator:
- Breaks the output into individual claims.
- Checks each claim against the retrieved context.
- Calculates score = (supported claims) / (total claims).
When to use: any RAG system where accuracy matters. If your LLM answers from retrieved documents, use this to catch hallucinations.
HallucinationEvaluator
Detects output that the context does not support. FaithfulnessEvaluator measures how much is grounded. This evaluator measures the share of content that is hallucinated.
- Java
- Kotlin
JudgeLM judge = prompt -> judgeModel.generate(prompt);
Evaluator hallucination = HallucinationEvaluator.builder()
.threshold(0.3) // Allow at most 30% hallucinated content
.judge(judge)
.contextKey("context")
.includeReason(true)
.build();
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }
val hallucination: Evaluator = hallucination(judge) {
threshold = 0.3 // Allow at most 30% hallucinated content
contextKey = "context"
includeReason = true
}
The evaluator:
- Breaks the output into individual statements.
- Checks if the context supports each statement.
- Calculates score = (unsupported statements) / (total statements).
Important: for this evaluator, lower scores are better (0.0 means no hallucinations). Success is score <= threshold.
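The two grounding scores and their opposite pass checks can be compared side by side. In this sketch the per-claim verdicts are given as booleans, standing in for what the judge model would decide. The empty-list behavior is an assumption, and the class and method names are hypothetical.

```java
import java.util.List;

final class GroundingScoresSketch {

    // supported.get(i) is true when the context supports claim i.
    static double faithfulness(List<Boolean> supported) {
        if (supported.isEmpty()) {
            return 1.0; // assumption: no claims means nothing is unsupported
        }
        long ok = supported.stream().filter(b -> b).count();
        return (double) ok / supported.size();
    }

    static double hallucination(List<Boolean> supported) {
        if (supported.isEmpty()) {
            return 0.0; // assumption: no statements means no hallucination
        }
        long bad = supported.stream().filter(b -> !b).count();
        return (double) bad / supported.size();
    }

    // Faithfulness passes at or above the threshold.
    static boolean faithfulnessPasses(double score, double threshold) {
        return score >= threshold;
    }

    // Hallucination passes at or below the threshold.
    static boolean hallucinationPasses(double score, double threshold) {
        return score <= threshold;
    }

    public static void main(String[] args) {
        List<Boolean> verdicts = List.of(true, true, true, false);
        System.out.println(faithfulness(verdicts));  // 0.75
        System.out.println(hallucination(verdicts)); // 0.25
    }
}
```

With three of four claims supported, the same output fails a faithfulness threshold of 0.8 but passes a hallucination threshold of 0.3. The two thresholds from the examples above are not equivalent, so set them deliberately.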
When to use: when you need to measure and cap the hallucination rate, especially in high-stakes applications where any fabricated information is a problem.
ContextualRelevanceEvaluator
Measures how relevant the retrieved context chunks are to the user's query. Use it to evaluate retrieval quality in RAG systems.
- Java
- Kotlin
JudgeLM judge = prompt -> judgeModel.generate(prompt);
Evaluator relevance = ContextualRelevanceEvaluator.builder()
.threshold(0.5)
.judge(judge)
.retrievalContextKey("retrievalContext")
.includeReason(true)
.strictMode(false) // Set to true for threshold of 1.0
.build();
The evaluator:
- Scores each context chunk on its own (0.0 to 1.0) for relevance to the query.
- Calculates the final score as the mean of all chunk scores.
- Stores the individual chunk scores in the result metadata.
var testCase = EvalTestCase.builder()
.input("What are symptoms of dehydration?")
.actualOutput("retrievalContext", List.of(
"Dehydration symptoms include thirst and fatigue.", // Highly relevant
"The Pacific Ocean is the largest ocean.", // Irrelevant
"Severe dehydration can cause dizziness." // Highly relevant
))
.build();
EvalResult result = relevance.evaluate(testCase);
// result.score() ≈ 0.63 (average of individual scores)
// result.metadata().get("contextScores") contains per-chunk details
val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }
val relevance: Evaluator = contextualRelevance(judge) {
threshold = 0.5
retrievalContextKey = "retrievalContext"
includeReason = true
strictMode = false // Set to true for threshold of 1.0
}
val testCase = EvalTestCase(
input = "What are symptoms of dehydration?",
actualOutputs = mapOf("retrievalContext" to listOf(
"Dehydration symptoms include thirst and fatigue.", // Highly relevant
"The Pacific Ocean is the largest ocean.", // Irrelevant
"Severe dehydration can cause dizziness." // Highly relevant
)))
val result = relevance.evaluate(testCase)
// result.score() ≈ 0.63 (average of individual scores)
// result.metadata()["contextScores"] contains per-chunk details
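The final score is a plain mean, so one irrelevant chunk pulls it down in proportion to the number of chunks. The sketch below uses made-up per-chunk scores (0.9, 0.1, 0.9) to show how a value near 0.63 can arise. Real judge scores will differ, and returning 0.0 for an empty list is an assumption.

```java
import java.util.List;

final class ChunkMeanSketch {

    // Mean of the per-chunk relevance scores; 0.0 when there are no chunks (assumption).
    static double meanScore(List<Double> chunkScores) {
        return chunkScores.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        // Hypothetical scores for the three chunks in the example above.
        System.out.printf("%.2f%n", meanScore(List.of(0.9, 0.1, 0.9)));
    }
}
```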
When to use: evaluating retrieval quality in RAG pipelines. It tells you when your retriever returns irrelevant documents that could confuse the LLM or dilute the answer.
PrecisionEvaluator
Measures what fraction of retrieved items are actually relevant. Needs ground truth labels.
- Java
- Kotlin
Evaluator precision = PrecisionEvaluator.builder()
.name("retrieval-precision")
.retrievedKey("retrievedDocs") // Key in actualOutputs
.expectedKey("relevantDocs") // Key in expectedOutputs (ground truth)
.matchingStrategy(MatchingStrategy.byEquality())
.threshold(0.8)
.build();
val precision: Evaluator = precision {
name = "retrieval-precision"
retrievedKey = "retrievedDocs" // Key in actualOutputs
expectedKey = "relevantDocs" // Key in expectedOutputs (ground truth)
matchingStrategy = MatchingStrategy.byEquality()
threshold = 0.8
}
Formula: precision = |relevant ∩ retrieved| / |retrieved|
A precision of 1.0 means every retrieved item was relevant (no false positives).
When to use: when you need to minimize noise in retrieved results. High precision matters when downstream processing is expensive or when irrelevant items could mislead the LLM.
RecallEvaluator
Measures what fraction of relevant items were actually retrieved. Needs ground truth labels.
- Java
- Kotlin
Evaluator recall = RecallEvaluator.builder()
.name("retrieval-recall")
.retrievedKey("retrievedDocs")
.expectedKey("relevantDocs")
.matchingStrategy(MatchingStrategy.byEquality())
.threshold(0.8)
.build();
val recall: Evaluator = recall {
name = "retrieval-recall"
retrievedKey = "retrievedDocs"
expectedKey = "relevantDocs"
matchingStrategy = MatchingStrategy.byEquality()
threshold = 0.8
}
Formula: recall = |relevant ∩ retrieved| / |relevant|
A recall of 1.0 means all relevant items were found (no false negatives).
When to use: when missing relevant information is costly. High recall matters for complete answers or when the user expects full coverage.
Matching strategies
Both PrecisionEvaluator and RecallEvaluator support several strategies for matching retrieved items to ground truth.
- Java
- Kotlin
// Simple equality (default, for string IDs)
MatchingStrategy.byEquality()
// Case-insensitive string matching
MatchingStrategy.caseInsensitive()
// Match by a specific field (for Map/JSON objects)
MatchingStrategy.byField("id")
// Match by multiple fields (e.g., knowledge graph triples)
MatchingStrategy.byFields("subject", "predicate", "object")
// Substring containment matching
MatchingStrategy.byContainment(true) // normalized
// LLM-based semantic matching (most flexible, most expensive)
MatchingStrategy.llmBased(judge)
// Combine strategies
MatchingStrategy.anyOf(strategy1, strategy2) // OR
MatchingStrategy.allOf(strategy1, strategy2) // AND
// Simple equality (default, for string IDs)
MatchingStrategy.byEquality()
// Case-insensitive string matching
MatchingStrategy.caseInsensitive()
// Match by a specific field (for Map/JSON objects)
MatchingStrategy.byField("id")
// Match by multiple fields (e.g., knowledge graph triples)
MatchingStrategy.byFields("subject", "predicate", "object")
// Substring containment matching
MatchingStrategy.byContainment(normalize = true)
// LLM-based semantic matching (most flexible, most expensive)
MatchingStrategy.llmBased(judge)
// Combine strategies
MatchingStrategy.anyOf(strategy1, strategy2) // OR
MatchingStrategy.allOf(strategy1, strategy2) // AND
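Conceptually, a matching strategy is a yes/no test on a pair of items, and `anyOf` and `allOf` are OR and AND over such tests. The sketch below models that with `BiPredicate`. It mirrors the method names above for readability but is not the library's `MatchingStrategy` type, and the behavior shown is an assumption.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Objects;
import java.util.function.BiPredicate;

final class MatchingSketch {

    static BiPredicate<Object, Object> byEquality() {
        return Objects::equals;
    }

    static BiPredicate<Object, Object> caseInsensitive() {
        return (a, b) -> String.valueOf(a).equalsIgnoreCase(String.valueOf(b));
    }

    // Two map-like items match when the named field is equal in both.
    static BiPredicate<Object, Object> byField(String field) {
        return (a, b) -> a instanceof Map && b instanceof Map
                && Objects.equals(((Map<?, ?>) a).get(field), ((Map<?, ?>) b).get(field));
    }

    @SafeVarargs
    static BiPredicate<Object, Object> anyOf(BiPredicate<Object, Object>... strategies) {
        return (a, b) -> Arrays.stream(strategies).anyMatch(s -> s.test(a, b)); // OR
    }

    @SafeVarargs
    static BiPredicate<Object, Object> allOf(BiPredicate<Object, Object>... strategies) {
        return (a, b) -> Arrays.stream(strategies).allMatch(s -> s.test(a, b)); // AND
    }

    public static void main(String[] args) {
        System.out.println(anyOf(byEquality(), caseInsensitive()).test("Doc-1", "doc-1")); // true
        System.out.println(allOf(byEquality(), caseInsensitive()).test("Doc-1", "doc-1")); // false
    }
}
```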
Example with knowledge graph triples:
- Java
- Kotlin
var precision = PrecisionEvaluator.builder()
.retrievedKey("triples")
.expectedKey("relevantTriples")
.matchingStrategy(MatchingStrategy.byFields("subject", "predicate", "object"))
.build();
var testCase = EvalTestCase.builder()
.input("Who founded Microsoft?")
.actualOutput("triples", List.of(
Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft")
))
.expectedOutput("relevantTriples", List.of(
Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft"),
Map.of("subject", "Paul Allen", "predicate", "co-founded", "object", "Microsoft")
))
.build();
val precision = precision {
retrievedKey = "triples"
expectedKey = "relevantTriples"
matchingStrategy = MatchingStrategy.byFields("subject", "predicate", "object")
}
val testCase = EvalTestCase(
input = "Who founded Microsoft?",
actualOutputs = mapOf("triples" to listOf(
mapOf("subject" to "Bill Gates", "predicate" to "founded", "object" to "Microsoft")
)),
expectedOutputs = mapOf("relevantTriples" to listOf(
mapOf("subject" to "Bill Gates", "predicate" to "founded", "object" to "Microsoft"),
mapOf("subject" to "Paul Allen", "predicate" to "co-founded", "object" to "Microsoft")
)))
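Applying the two formulas to this test case by hand gives a precision of 1.0 (the one retrieved triple is relevant) and a recall of 0.5 (one of two relevant triples was found). The sketch below reproduces that arithmetic with plain collections and equality on whole maps. The helper names are hypothetical, and returning 0.0 for an empty list is an assumption.

```java
import java.util.List;
import java.util.Map;

final class PrecisionRecallSketch {

    // |relevant ∩ retrieved| / |retrieved|
    static <T> double precision(List<T> retrieved, List<T> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // assumption for the empty case
        }
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / retrieved.size();
    }

    // |relevant ∩ retrieved| / |relevant|
    static <T> double recall(List<T> retrieved, List<T> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // assumption for the empty case
        }
        long found = relevant.stream().filter(retrieved::contains).count();
        return (double) found / relevant.size();
    }

    public static void main(String[] args) {
        Map<String, String> gates =
                Map.of("subject", "Bill Gates", "predicate", "founded", "object", "Microsoft");
        Map<String, String> allen =
                Map.of("subject", "Paul Allen", "predicate", "co-founded", "object", "Microsoft");
        System.out.println(precision(List.of(gates), List.of(gates, allen))); // 1.0
        System.out.println(recall(List.of(gates), List.of(gates, allen)));    // 0.5
    }
}
```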
Agent evaluators
Dokimos ships specialized evaluators for AI agents that use tools. They cover task completion, tool call validation, argument hallucination detection, and tool definition quality.
See the dedicated Agent Evaluation guide for full documentation.
Common configuration
Every evaluator supports these settings.
Name sets how the evaluator shows up in results.
- Java
- Kotlin
.name("Answer Quality")
name = "Answer Quality"
Threshold sets the minimum score needed to pass. HallucinationEvaluator is the exception: its threshold is the maximum score allowed.
- Java
- Kotlin
.threshold(0.8) // Needs 80% or higher
threshold = 0.8 // Needs 80% or higher
Evaluation parameters set which fields the evaluator reads.
- Java
- Kotlin
.evaluationParams(List.of(
    EvalTestCaseParam.INPUT,            // The user's question
    EvalTestCaseParam.EXPECTED_OUTPUT,  // What you expect
    EvalTestCaseParam.ACTUAL_OUTPUT     // What the LLM actually said
))
params(
EvalTestCaseParam.INPUT, // The user's question
EvalTestCaseParam.EXPECTED_OUTPUT, // What you expect
EvalTestCaseParam.ACTUAL_OUTPUT, // What the LLM actually said
)
Creating custom evaluators
When no built-in evaluator fits, write your own by extending BaseEvaluator. Override runEvaluation and return an EvalResult.
- Java
- Kotlin
public class ResponseLengthEvaluator extends BaseEvaluator {
    private final int minLength;
    private final int maxLength;

    public ResponseLengthEvaluator(String name, int minLength, int maxLength) {
        super(name, 1.0, List.of(EvalTestCaseParam.ACTUAL_OUTPUT));
        this.minLength = minLength;
        this.maxLength = maxLength;
    }

    @Override
    protected EvalResult runEvaluation(EvalTestCase testCase) {
        String output = testCase.actualOutput();
        int length = output.length();
        boolean withinBounds = length >= minLength && length <= maxLength;
        double score = withinBounds ? 1.0 : 0.0;
        String reason = String.format("Output length %d (expected %d-%d)",
                length, minLength, maxLength);
        return EvalResult.builder()
                .name(name())
                .score(score)
                .threshold(threshold())
                .reason(reason)
                .build();
    }
}
// Usage
Evaluator lengthCheck = new ResponseLengthEvaluator("Length Check", 50, 200);
class ResponseLengthEvaluator(
    private val minLength: Int,
    private val maxLength: Int,
    private val evaluatorName: String = "Length Check"
) : BaseEvaluator(evaluatorName, 1.0, listOf(EvalTestCaseParam.ACTUAL_OUTPUT)) {
    override fun runEvaluation(testCase: EvalTestCase): EvalResult {
        val output = testCase.actualOutput()
        val length = output.length
        val withinBounds = length in minLength..maxLength
        val score = if (withinBounds) 1.0 else 0.0
        val reason = "Output length $length (expected $minLength-$maxLength)"
        return EvalResult(
            name = name(),
            score = score,
            threshold = threshold(),
            reason = reason,
        )
    }
}
// Usage
val lengthCheck: Evaluator = ResponseLengthEvaluator(50, 200)
For very simple checks, implement the Evaluator interface directly.
Combining multiple evaluators
Most applications need to pass several quality checks. Put the evaluators in a list and run them together.
- Java
- Kotlin
List<Evaluator> evaluators = List.of(
// Check if the answer is correct
LLMJudgeEvaluator.builder()
.name("Correctness")
.criteria("Is the answer factually correct?")
.threshold(0.85)
.judge(judge)
.build(),
// Check if it's grounded in retrieved docs (RAG)
FaithfulnessEvaluator.builder()
.threshold(0.80)
.judge(judge)
.contextKey("retrievedContext")
.build(),
// Check if it follows the required format
RegexEvaluator.builder()
.name("Format Check")
.pattern("^[A-Z].*\\.$") // Must start with capital and end with period
.threshold(1.0)
.build()
);
val evaluators: List<Evaluator> = evaluators {
// Check if the answer is correct
llmJudge(judge) {
name = "Correctness"
criteria = "Is the answer factually correct?"
threshold = 0.85
}
// Check if it's grounded in retrieved docs (RAG)
faithfulness(judge) {
threshold = 0.80
contextKey = "retrievedContext"
}
// Check if it follows the required format
regex {
name = "Format Check"
pattern = "^[A-Z].*\\.$" // Must start with capital and end with period
threshold = 1.0
}
}
An output passes only if it meets all the thresholds. This lets you enforce several quality dimensions at once.
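The all-must-pass rule amounts to a logical AND over the individual checks. The sketch below models it with a small hypothetical `Check` record in place of `EvalResult`, assuming every check passes at or above its threshold (HallucinationEvaluator, which passes at or below, would need the inverse test). Listing the failed names shows which dimension blocked the output.

```java
import java.util.List;
import java.util.stream.Collectors;

final class AllMustPassSketch {

    // Hypothetical stand-in for an evaluator result.
    record Check(String name, double score, double threshold) {
        boolean passed() {
            return score >= threshold;
        }
    }

    // The output passes only when every check passes.
    static boolean outputPasses(List<Check> checks) {
        return checks.stream().allMatch(Check::passed);
    }

    // Names of the checks that blocked the output.
    static List<String> failedNames(List<Check> checks) {
        return checks.stream()
                .filter(c -> !c.passed())
                .map(Check::name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Check> checks = List.of(
                new Check("Correctness", 0.90, 0.85),
                new Check("Faithfulness", 0.70, 0.80),
                new Check("Format Check", 1.0, 1.0));
        System.out.println(outputPasses(checks)); // false
        System.out.println(failedNames(checks));  // [Faithfulness]
    }
}
```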
Best practices
Pick the right evaluator for the job
- Use ExactMatch when there is only one correct answer and it is a plain string (math results, a single extracted value).
- Use Regex for format validation (dates, emails, IDs).
- Use StructuralMatch for structured or JSON output where formatting and numeric representation should not count as differences (see the Structured & Typed Data hub).
- Use LLMJudge for semantic quality (helpfulness, clarity, tone).
- Use Faithfulness for RAG systems to measure how grounded the output is.
- Use Hallucination to measure and cap fabricated content.
- Use ContextualRelevance to evaluate retrieval quality without ground truth.
- Use Precision/Recall when you have ground truth labels for relevant items.
- Use Agent evaluators to evaluate AI agents that use tools (task completion, tool validity, argument hallucination, tool reliability).
- Build custom evaluators for domain-specific requirements.
Start with looser thresholds
Do not aim for perfection right away. Start around 0.7 to 0.8 and tighten as your system improves. A threshold of 1.0 fails on any imperfection.
Write specific criteria for LLM judges
Be clear about what you are scoring.
- Java
- Kotlin
// Good (specific and measurable)
.criteria("Does the answer correctly explain the refund process and mention the 30-day policy?")
// Bad (too vague)
.criteria("Is this good?")
// Good (specific and measurable)
criteria = "Does the answer correctly explain the refund process and mention the 30-day policy?"
// Bad (too vague)
criteria = "Is this good?"
Use multiple evaluators for important outputs
Check each aspect on its own: correctness, format, grounding, tone. This shows you exactly where things go wrong.
Test your evaluators
Confirm your evaluators behave on known examples before you rely on them.
- Java
- Kotlin
@Test
void faithfulnessEvaluatorShouldCatchHallucination() {
var testCase = EvalTestCase.builder()
.actualOutput("The product costs $500") // Made up
.metadata(Map.of("context", List.of("The product costs $100")))
.build();
var result = faithfulnessEvaluator.evaluate(testCase);
// Should fail because claim isn't in context
assertFalse(result.success());
}
@Test
fun faithfulnessEvaluatorShouldCatchHallucination() {
val testCase = EvalTestCase.builder()
.actualOutput("The product costs $500") // Made up
.metadata(mapOf("context" to listOf("The product costs $100")))
.build()
val result = faithfulnessEvaluator.evaluate(testCase)
// Should fail because claim isn't in context
assertFalse(result.success())
}
Using evaluator results
evaluate returns an EvalResult with the score, the pass status, and an explanation. Read them directly.
- Java
- Kotlin
EvalResult result = evaluator.evaluate(testCase);
System.out.println("Score: " + result.score());
System.out.println("Passed: " + result.success());
System.out.println("Reason: " + result.reason());
val result = evaluator.evaluate(testCase)
println("Score: ${result.score()}")
println("Passed: ${result.success()}")
println("Reason: ${result.reason()}")
In experiments, analyze results across all examples.
- Java
- Kotlin
ExperimentResult experimentResult = experiment.run();
// Average scores per evaluator
double avgCorrectness = experimentResult.averageScore("Correctness");
double avgFaithfulness = experimentResult.averageScore("Faithfulness");
// Dig into individual results
for (ItemResult item : experimentResult.itemResults()) {
    for (EvalResult eval : item.evalResults()) {
        if (!eval.success()) {
            System.out.println("Failed: " + eval.name() + " (" + eval.reason() + ")");
        }
    }
}
val experimentResult = experiment.run()
// Average scores per evaluator
val avgCorrectness = experimentResult.averageScore("Correctness")
val avgFaithfulness = experimentResult.averageScore("Faithfulness")
// Dig into individual results
experimentResult.itemResults().forEach { item ->
    item.evalResults()
        .filterNot { eval -> eval.success() }
        .forEach { eval ->
            println("Failed: ${eval.name()} (${eval.reason()})")
        }
}
In JUnit tests, a failing evaluator fails the test.
- Java
- Kotlin
@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void shouldProduceQualityAnswers(Example example) {
String answer = aiService.generate(example.input());
var testCase = example.toTestCase(answer);
// Fails test if evaluators don't pass
Assertions.assertEval(testCase, evaluators);
}
@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
fun shouldProduceQualityAnswers(example: Example) {
val answer = aiService.generate(example.input())
val testCase = example.toTestCase(answer)
// Fails test if evaluators don't pass
Assertions.assertEval(testCase, evaluators)
}