
LLM Evaluation with Spring AI and Dokimos: Building and Evaluating an AI Agent

This tutorial walks you through building a complete AI agent with Spring AI and systematically evaluating its performance using Dokimos in Java. You will create a knowledge assistant that retrieves documents from a vector store and generates contextual answers, then evaluate it across multiple quality dimensions.

By the end of this tutorial, you will have:

  • A working Spring AI agent with RAG (Retrieval-Augmented Generation) capabilities
  • A multi-evaluator pipeline that checks faithfulness, hallucination, and answer quality
  • Insights into how your agent performs and where it can improve

Why Evaluate Your AI Agent?

Building an AI agent is just the start. The real challenge is ensuring it performs reliably in production and meets user expectations. Traditional testing approaches fall short when dealing with LLM applications because:

LLM outputs are nondeterministic: The same question can produce different (but equally valid) answers. You cannot simply assert that the output will always equal some expected string.

Quality is multidimensional: A response might be factually correct but poorly worded, or helpful but not grounded in your documents.

Failures are subtle: An agent might generate information that sounds confident and convincing at first glance but is factually incorrect.

Dokimos provides a systematic framework for evaluating LLM applications. It lets you define quality criteria, run evaluations, and track performance over time.

What We Are Building

We will build a knowledge assistant for a fictional company's documentation. The assistant will:

  1. Accept user questions about products, policies, and services
  2. Retrieve relevant documents from a vector store
  3. Generate contextual answers based on the retrieved documents

Then we will evaluate this assistant on multiple dimensions:

  • Faithfulness: Are the responses grounded in the retrieved documents?
  • Answer Quality: Are the responses helpful and complete?
  • Contextual Relevance: Is the retriever finding relevant documents?
  • Hallucination Detection: Is the agent making things up?

Prerequisites

Before you begin, make sure you have:

  • Java 21 or later
  • Maven or Gradle
  • An OpenAI API key (or another supported LLM provider)
  • Basic familiarity with Spring Boot and Spring AI

Project Setup

Dependencies

Create a new Spring Boot project and add the following dependencies:

Maven

<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Spring AI -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    </dependency>

    <!-- Dokimos Core -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-core</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- Dokimos Spring AI Integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-spring-ai</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- For JUnit 5 integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-junit5</artifactId>
        <version>${dokimos.version}</version>
        <scope>test</scope>
    </dependency>

    <!-- Spring Boot Test -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

Gradle

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-web'
    implementation 'org.springframework.ai:spring-ai-openai-spring-boot-starter'
    implementation "dev.dokimos:dokimos-core:${dokimosVersion}"
    implementation "dev.dokimos:dokimos-spring-ai:${dokimosVersion}"
    testImplementation "dev.dokimos:dokimos-junit5:${dokimosVersion}"
    testImplementation 'org.springframework.boot:spring-boot-starter-test'
}

Note the double quotes around the Dokimos coordinates: in the Groovy DSL, ${dokimosVersion} is only interpolated inside double-quoted strings.

Configuration

Add your OpenAI API key and model settings to application.properties:

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-5-nano
spring.ai.openai.chat.options.temperature=1.0
spring.ai.openai.embedding.options.model=text-embedding-3-small

Note: The gpt-5-nano model only supports temperature=1.0. If using a different model like gpt-4o-mini, you can omit the temperature setting.
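For example, if you switch to gpt-4o-mini, the configuration would look like this, with the temperature line simply dropped (model names here are illustrative; check your provider's current model list):

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o-mini
spring.ai.openai.embedding.options.model=text-embedding-3-small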

The SimpleVectorStore requires an embedding model to convert text to vectors. We use OpenAI's text-embedding-3-small, which is fast and cost-effective.

Part 1: Building the AI Agent

We start by building our knowledge assistant. This is a simple RAG pipeline that retrieves documents and generates answers.

Setting Up the Vector Store

First, we need a vector store to hold our company documents. For this tutorial, we will use Spring AI's SimpleVectorStore, which stores embeddings in memory.

import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.SimpleVectorStore;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.util.List;

@Configuration
public class VectorStoreConfig {

    @Bean
    public VectorStore vectorStore(EmbeddingModel embeddingModel) {
        SimpleVectorStore store = SimpleVectorStore.builder(embeddingModel).build();

        // Load our company documents
        List<Document> documents = List.of(
                new Document(
                        "Our return policy allows customers to return any product within 30 days " +
                        "of purchase for a full refund. Items must be in original condition with " +
                        "tags attached. Refunds are processed within 5 business days."
                ),
                new Document(
                        "Premium members receive free shipping on all orders, 20% discount on " +
                        "all products, early access to new releases, and priority customer support. " +
                        "Premium membership costs $99 per year."
                ),
                new Document(
                        "Our customer support team is available Monday through Friday from 9 AM " +
                        "to 6 PM Eastern Time. You can reach us by email at support@example.com " +
                        "or by phone at 1-800-EXAMPLE."
                ),
                new Document(
                        "We offer three shipping options: Standard (5-7 business days, $5.99), " +
                        "Express (2-3 business days, $12.99), and Next Day ($24.99). " +
                        "Orders over $50 qualify for free standard shipping."
                ),
                new Document(
                        "Gift cards are available in denominations of $25, $50, $100, and $200. " +
                        "Gift cards never expire and can be used for any purchase on our website. " +
                        "They cannot be redeemed for cash."
                )
        );

        store.add(documents);
        return store;
    }
}

Creating the Knowledge Assistant

Now we create our AI agent that combines retrieval with generation:

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.stream.Collectors;

@Service
public class KnowledgeAssistant {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public KnowledgeAssistant(ChatClient.Builder chatClientBuilder, VectorStore vectorStore) {
        this.chatClient = chatClientBuilder.build();
        this.vectorStore = vectorStore;
    }

    public AssistantResponse answer(String question) {
        // Step 1: Retrieve relevant documents
        List<Document> retrievedDocs = vectorStore.similaritySearch(
                SearchRequest.builder()
                        .query(question)
                        .topK(3)
                        .build()
        );

        // Step 2: Build context from retrieved documents
        String context = retrievedDocs.stream()
                .map(Document::getText)
                .collect(Collectors.joining("\n\n"));

        // Step 3: Generate response using context
        String systemPrompt = """
                You are a helpful customer service assistant. Answer the user's question
                based ONLY on the provided context. If the context does not contain
                enough information to answer the question, say so clearly.

                Context:
                %s
                """.formatted(context);

        String response = chatClient.prompt()
                .system(systemPrompt)
                .user(question)
                .call()
                .content();

        // Return both the response and retrieved context for evaluation
        return new AssistantResponse(response, retrievedDocs);
    }

    public record AssistantResponse(
            String answer,
            List<Document> retrievedDocuments
    ) {}
}

Exposing the Assistant as a REST API

To make the assistant usable as a real service, expose it via a REST endpoint:

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
@RequestMapping("/api")
public class KnowledgeAssistantController {

    private final KnowledgeAssistant assistant;

    public KnowledgeAssistantController(KnowledgeAssistant assistant) {
        this.assistant = assistant;
    }

    @PostMapping("/chat")
    public ResponseEntity<ChatResponse> chat(@RequestBody ChatRequest request) {
        var response = assistant.answer(request.question());

        List<String> sources = response.retrievedDocuments().stream()
                .map(doc -> doc.getText())
                .toList();

        return ResponseEntity.ok(new ChatResponse(response.answer(), sources));
    }

    public record ChatRequest(String question) {}

    public record ChatResponse(String answer, List<String> sources) {}
}

You can now interact with your assistant:

curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "What is your return policy?"}'

Part 2: Setting Up Evaluation with Dokimos

Now that we have a working assistant, we can evaluate it systematically. We create a dataset of test questions and run them through multiple evaluators.

Creating the Evaluation Dataset

First, create a dataset with questions and expected behaviors:

import dev.dokimos.core.Dataset;
import dev.dokimos.core.Example;

Dataset dataset = Dataset.builder()
        .name("Knowledge Assistant Evaluation")
        .addExample(Example.builder()
                .input("What is your return policy?")
                .expectedOutput("30 days, full refund, original condition")
                .metadata("category", "returns")
                .build())
        .addExample(Example.builder()
                .input("How much does premium membership cost?")
                .expectedOutput("$99 per year")
                .metadata("category", "membership")
                .build())
        .addExample(Example.builder()
                .input("What are your customer support hours?")
                .expectedOutput("Monday through Friday, 9 AM to 6 PM Eastern")
                .metadata("category", "support")
                .build())
        .addExample(Example.builder()
                .input("Do gift cards expire?")
                .expectedOutput("Gift cards never expire")
                .metadata("category", "gift-cards")
                .build())
        .addExample(Example.builder()
                .input("How can I get free shipping?")
                .expectedOutput("Orders over $50 or premium membership")
                .metadata("category", "shipping")
                .build())
        .addExample(Example.builder()
                .input("What is the fastest shipping option?")
                .expectedOutput("Next Day shipping for $24.99")
                .metadata("category", "shipping")
                .build())
        .addExample(Example.builder()
                .input("Can I return a product after 60 days?")
                .expectedOutput("No, returns must be within 30 days")
                .metadata("category", "returns")
                .build())
        .addExample(Example.builder()
                .input("What benefits do premium members get?")
                .expectedOutput("Free shipping, 20% discount, early access, priority support")
                .metadata("category", "membership")
                .build())
        .build();

Alternatively, you can load datasets from JSON files for easier maintenance:

{
  "name": "Knowledge Assistant Evaluation",
  "examples": [
    {
      "input": "What is your return policy?",
      "expectedOutput": "30 days, full refund, original condition",
      "metadata": { "category": "returns" }
    },
    {
      "input": "How much does premium membership cost?",
      "expectedOutput": "$99 per year",
      "metadata": { "category": "membership" }
    }
  ]
}

Then load it in code:

Dataset dataset = Dataset.fromJson(Paths.get("src/test/resources/datasets/qa-dataset.json"));

Defining the Evaluation Task

The Task interface bridges your application with Dokimos. It takes an example and returns the outputs that evaluators will check:

import dev.dokimos.core.Task;
import org.springframework.ai.document.Document;

import java.util.List;
import java.util.Map;

Task evaluationTask = example -> {
    // Run our assistant
    var response = assistant.answer(example.input());

    // Extract context texts for evaluation
    List<String> contextTexts = response.retrievedDocuments().stream()
            .map(Document::getText)
            .toList();

    // Return outputs for evaluators to check
    return Map.of(
            "output", response.answer(),
            "context", contextTexts
    );
};

The key insight here is that we return both the answer and the retrieved context. This allows evaluators to check not just what the agent said, but whether it was grounded in the documents it retrieved.

Setting Up the LLM Judge

Dokimos uses the LLM-as-judge pattern for semantic evaluation. We will use Spring AI's ChatModel as our judge:

import dev.dokimos.core.JudgeLM;
import dev.dokimos.springai.SpringAiSupport;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.beans.factory.annotation.Autowired;

@Autowired
private ChatModel chatModel;

// Convert Spring AI ChatModel to Dokimos JudgeLM
JudgeLM judge = SpringAiSupport.asJudge(chatModel);

Using a Different Model for Judging

For better evaluation quality, consider using a more capable model as your judge. You can configure a separate ChatModel bean specifically for evaluation:

@Bean
@Qualifier("judgeModel")
public ChatModel judgeModel() {
    return OpenAiChatModel.builder()
            .apiKey(System.getenv("OPENAI_API_KEY"))
            .model("gpt-5.2")
            .build();
}
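Then inject the qualified bean wherever you construct the judge, so the application model and the evaluation model stay separate. A minimal sketch using standard Spring qualifiers and the SpringAiSupport adapter shown above:

@Autowired
@Qualifier("judgeModel")
private ChatModel judgeChatModel;

// Build the judge from the dedicated evaluation model rather than the default ChatModel
JudgeLM judge = SpringAiSupport.asJudge(judgeChatModel);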

Part 3: Configuring Multiple Evaluators

Next, set up evaluators to check different quality dimensions. Dokimos provides several built-in evaluators, and you can create custom ones for specific needs.

API Costs

LLM-based evaluators (FaithfulnessEvaluator, HallucinationEvaluator, LLMJudgeEvaluator, ContextualRelevanceEvaluator) make API calls to your judge model for each test case. Running evaluations on large datasets can incur significant costs. Start with a small dataset (10 to 20 examples) during development, and consider using a more cost-effective model for judging.

Faithfulness Evaluator

The FaithfulnessEvaluator checks whether the response is grounded in the retrieved context. This is crucial for RAG systems where you want to ensure the agent is not making things up.

import dev.dokimos.core.evaluators.FaithfulnessEvaluator;

Evaluator faithfulness = FaithfulnessEvaluator.builder()
        .threshold(0.8)
        .judge(judge)
        .contextKey("context")   // Key where we stored retrieved documents
        .includeReason(true)     // Get explanation for the score
        .build();

The evaluator works by:

  1. Breaking down the response into individual claims
  2. Checking each claim against the retrieved context
  3. Calculating score = (supported claims) / (total claims)

A score of 0.8 means 80% of the claims in the response are supported by the context.
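To see this on a single case, you can run the evaluator directly and inspect the result. This is a minimal sketch: it assumes Dokimos exposes a synchronous evaluate(EvalTestCase) counterpart to the evaluateAsync method shown later, and it reuses the EvalTestCase builder from the JUnit integration section.

// Build a single test case: the answer plus the context it should be grounded in
EvalTestCase testCase = EvalTestCase.builder()
        .input("What is your return policy?")
        .actualOutput("You can return any product within 30 days for a full refund.")
        .actualOutput("context", List.of(
                "Our return policy allows customers to return any product within 30 days " +
                "of purchase for a full refund."))
        .build();

// Run the faithfulness check and inspect the outcome (synchronous API assumed)
EvalResult evalResult = faithfulness.evaluate(testCase);
System.out.println(evalResult.score());    // fraction of claims supported by the context
System.out.println(evalResult.reason());   // explanation, since includeReason(true) was set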

Hallucination Evaluator

While faithfulness measures how much of the response is grounded in the context, the HallucinationEvaluator specifically measures the proportion of fabricated content:

import dev.dokimos.core.evaluators.HallucinationEvaluator;

Evaluator hallucination = HallucinationEvaluator.builder()
        .threshold(0.2)          // Allow at most 20% hallucinated content
        .judge(judge)
        .contextKey("context")
        .includeReason(true)
        .build();

Important: For this evaluator, lower scores are better. A score of 0.0 means no hallucinations were detected. The evaluator passes when score <= threshold.

Answer Quality Evaluator

The LLMJudgeEvaluator lets you define custom evaluation criteria in natural language:

import dev.dokimos.core.evaluators.LLMJudgeEvaluator;
import dev.dokimos.core.EvalTestCaseParam;

Evaluator answerQuality = LLMJudgeEvaluator.builder()
        .name("Answer Quality")
        .criteria("""
                Evaluate the answer based on these criteria:
                1. Does it directly address the user's question?
                2. Is it clear and easy to understand?
                3. Does it provide specific, actionable information?
                4. Is it appropriately concise without missing key details?
                """)
        .evaluationParams(List.of(
                EvalTestCaseParam.INPUT,
                EvalTestCaseParam.ACTUAL_OUTPUT
        ))
        .threshold(0.7)
        .judge(judge)
        .build();

Contextual Relevance Evaluator

This evaluator checks whether the retriever is finding relevant documents for each query:

import dev.dokimos.core.evaluators.ContextualRelevanceEvaluator;

Evaluator contextRelevance = ContextualRelevanceEvaluator.builder()
        .threshold(0.6)
        .judge(judge)
        .retrievalContextKey("context")
        .includeReason(true)
        .build();

The evaluator scores each retrieved chunk independently and calculates the mean. For example, if the three retrieved chunks score 0.9, 0.8, and 0.1, the overall score is 0.6. This helps identify when your retriever is returning irrelevant documents that could confuse the LLM.

Combining All Evaluators

Put all evaluators together:

List<Evaluator> evaluators = List.of(
        // Check if response is grounded in context
        FaithfulnessEvaluator.builder()
                .threshold(0.8)
                .judge(judge)
                .contextKey("context")
                .includeReason(true)
                .build(),

        // Check for hallucinated content
        HallucinationEvaluator.builder()
                .threshold(0.2)
                .judge(judge)
                .contextKey("context")
                .includeReason(true)
                .build(),

        // Check answer quality
        LLMJudgeEvaluator.builder()
                .name("Answer Quality")
                .criteria("Is the answer helpful and clear, and does it directly address the question?")
                .evaluationParams(List.of(
                        EvalTestCaseParam.INPUT,
                        EvalTestCaseParam.ACTUAL_OUTPUT
                ))
                .threshold(0.7)
                .judge(judge)
                .build(),

        // Check retrieval quality
        ContextualRelevanceEvaluator.builder()
                .threshold(0.6)
                .judge(judge)
                .retrievalContextKey("context")
                .includeReason(true)
                .build()
);

Part 4: Running the Evaluation Experiment

With our dataset, task, and evaluators ready, we can run a full evaluation experiment:

import dev.dokimos.core.Experiment;
import dev.dokimos.core.ExperimentResult;

import java.time.Instant;

ExperimentResult result = Experiment.builder()
        .name("Knowledge Assistant v1.0 Evaluation")
        .description("Evaluating the RAG-based knowledge assistant")
        .dataset(dataset)
        .task(evaluationTask)
        .evaluators(evaluators)
        .metadata("model", "gpt-5-nano")
        .metadata("retrievalTopK", 3)
        .metadata("timestamp", Instant.now().toString())
        .build()
        .run();

Analyzing Results

The experiment returns aggregate metrics and detailed per-example results:

// Overall metrics
System.out.println("=== Experiment Results ===");
System.out.println("Name: " + result.name());
System.out.println("Total examples: " + result.totalCount());
System.out.println("Passed: " + result.passCount());
System.out.println("Failed: " + result.failCount());
System.out.println("Pass rate: " + String.format("%.1f%%", result.passRate() * 100));

// Per evaluator metrics
System.out.println("\n=== Average Scores by Evaluator ===");
System.out.println("Faithfulness: " + String.format("%.2f", result.averageScore("Faithfulness")));
System.out.println("Hallucination: " + String.format("%.2f", result.averageScore("Hallucination")));
System.out.println("Answer Quality: " + String.format("%.2f", result.averageScore("Answer Quality")));
System.out.println("Contextual Relevance: " + String.format("%.2f", result.averageScore("ContextualRelevance")));

Investigating Failures

When tests fail, you can dig into the details:

System.out.println("\n=== Failed Cases ===");
for (ItemResult item : result.itemResults()) {
if (!item.success()) {
System.out.println("\nQuestion: " + item.example().input());
System.out.println("Expected: " + item.example().expectedOutput());
System.out.println("Actual: " + item.actualOutputs().get("output"));

System.out.println("Evaluator Results:");
for (EvalResult eval : item.evalResults()) {
String status = eval.success() ? "PASS" : "FAIL";
System.out.println(" " + eval.name() + ": " + status +
" (score: " + String.format("%.2f", eval.score()) + ")");
if (!eval.success() && eval.reason() != null) {
System.out.println(" Reason: " + eval.reason());
}
}
}
}

Part 5: Integrating with JUnit 5

For continuous integration, you can run evaluations as part of your test suite using the Dokimos JUnit 5 integration.

Organizing Evaluators

In a real application, you will want to organize your evaluators in a reusable way. Create a factory class that encapsulates your evaluation configuration:

package com.example.evaluation;

import dev.dokimos.core.EvalTestCaseParam;
import dev.dokimos.core.Evaluator;
import dev.dokimos.core.JudgeLM;
import dev.dokimos.core.evaluators.*;

import java.util.List;

public final class QAEvaluators {

    public static final String CONTEXT_KEY = "context";

    private QAEvaluators() {}

    public static List<Evaluator> standard(JudgeLM judge) {
        return List.of(
                faithfulness(judge),
                hallucination(judge),
                answerQuality(judge),
                contextualRelevance(judge)
        );
    }

    public static Evaluator faithfulness(JudgeLM judge) {
        return FaithfulnessEvaluator.builder()
                .threshold(0.8)
                .judge(judge)
                .contextKey(CONTEXT_KEY)
                .includeReason(true)
                .build();
    }

    public static Evaluator hallucination(JudgeLM judge) {
        return HallucinationEvaluator.builder()
                .threshold(0.2)
                .judge(judge)
                .contextKey(CONTEXT_KEY)
                .includeReason(true)
                .build();
    }

    public static Evaluator answerQuality(JudgeLM judge) {
        return LLMJudgeEvaluator.builder()
                .name("Answer Quality")
                .criteria("""
                        Evaluate the answer based on:
                        1. Does it directly address the user's question?
                        2. Is it clear and easy to understand?
                        3. Does it provide specific, actionable information?
                        4. Is it appropriately concise?
                        """)
                .evaluationParams(List.of(
                        EvalTestCaseParam.INPUT,
                        EvalTestCaseParam.ACTUAL_OUTPUT
                ))
                .threshold(0.7)
                .judge(judge)
                .build();
    }

    public static Evaluator contextualRelevance(JudgeLM judge) {
        return ContextualRelevanceEvaluator.builder()
                .threshold(0.6)
                .judge(judge)
                .retrievalContextKey(CONTEXT_KEY)
                .includeReason(true)
                .build();
    }
}

This approach keeps your evaluation logic separate from your application code and makes it easy to reuse across different tests.

Writing the Evaluation Test

Now write a clean test that uses the evaluator factory:

import dev.dokimos.core.Assertions;
import dev.dokimos.core.EvalTestCase;
import dev.dokimos.core.Evaluator;
import dev.dokimos.core.Example;
import dev.dokimos.core.JudgeLM;
import dev.dokimos.junit5.DatasetSource;
import dev.dokimos.springai.SpringAiSupport;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.params.ParameterizedTest;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.document.Document;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.util.List;

@SpringBootTest
class KnowledgeAssistantEvaluationTest {

    @Autowired
    private KnowledgeAssistant assistant;

    @Autowired
    private ChatModel chatModel;

    private List<Evaluator> evaluators;

    @BeforeEach
    void setup() {
        JudgeLM judge = SpringAiSupport.asJudge(chatModel);
        evaluators = QAEvaluators.standard(judge);
    }

    @ParameterizedTest
    @DatasetSource("classpath:datasets/qa-dataset.json")
    void shouldProvideQualityAnswers(Example example) {
        var response = assistant.answer(example.input());

        List<String> contextTexts = response.retrievedDocuments().stream()
                .map(Document::getText)
                .toList();

        EvalTestCase testCase = EvalTestCase.builder()
                .input(example.input())
                .actualOutput(response.answer())
                .actualOutput(QAEvaluators.CONTEXT_KEY, contextTexts)
                .expectedOutput(example.expectedOutput())
                .build();

        Assertions.assertEval(testCase, evaluators);
    }
}

Running in CI/CD

Add this to your GitHub Actions workflow:

name: AI Agent Evaluation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Run Evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: mvn test -Dtest=KnowledgeAssistantEvaluationTest

Part 6: Tracking Results Over Time

For production applications, you want to track evaluation results over time. The Dokimos Server provides a web UI for visualizing trends and comparing experiments.

Starting the Server

Download the Docker Compose file and start the server:

curl -O https://raw.githubusercontent.com/dokimos-dev/dokimos/master/docker-compose.yml
docker compose up -d

The server will be available at http://localhost:8080. If your Spring Boot application is already running on port 8080, adjust the port mapping in the Compose file (or run the application on a different port) to avoid a conflict.

Sending Results to the Server

Use DokimosServerReporter to send experiment results to the server:

import dev.dokimos.server.client.DokimosServerReporter;

var reporter = DokimosServerReporter.builder()
        .serverUrl("http://localhost:8080")
        .projectName("knowledge-assistant")
        .build();

ExperimentResult result = Experiment.builder()
        .name("Knowledge Assistant v1.0")
        .dataset(dataset)
        .task(evaluationTask)
        .evaluators(evaluators)
        .reporter(reporter)
        .build()
        .run();

The reporter batches results and sends them to the server during experiment execution. After the experiment completes, you can view the results in the web UI.

The server stores results and lets you:

  • View pass rates and scores over time
  • Compare different model configurations
  • Drill down into specific failures
  • Share results with your team

Part 7: Creating Custom Evaluators

Sometimes the built-in evaluators do not cover your specific needs. You can create custom evaluators by extending BaseEvaluator. Place these in your evaluation package alongside QAEvaluators:

package com.example.evaluation;

import dev.dokimos.core.BaseEvaluator;
import dev.dokimos.core.EvalResult;
import dev.dokimos.core.EvalTestCase;
import dev.dokimos.core.EvalTestCaseParam;

import java.util.List;

/**
 * Custom evaluator that checks if the response length is within acceptable bounds.
 * This demonstrates a deterministic evaluator that does not require an LLM judge.
 */
public class ResponseLengthEvaluator extends BaseEvaluator {

    private final int minWords;
    private final int maxWords;

    public ResponseLengthEvaluator(int minWords, int maxWords) {
        super("Response Length", 1.0, List.of(EvalTestCaseParam.ACTUAL_OUTPUT));
        this.minWords = minWords;
        this.maxWords = maxWords;
    }

    @Override
    protected EvalResult runEvaluation(EvalTestCase testCase) {
        String output = testCase.actualOutput();
        int wordCount = output.split("\\s+").length;

        boolean withinBounds = wordCount >= minWords && wordCount <= maxWords;
        double score = withinBounds ? 1.0 : 0.0;
        String reason = String.format(
                "Response has %d words (expected %d-%d)",
                wordCount, minWords, maxWords);

        return EvalResult.builder()
                .name(name())
                .score(score)
                .threshold(threshold())
                .reason(reason)
                .build();
    }
}

You can then add it to your evaluator factory:

// In QAEvaluators.java
public static Evaluator responseLength(int minWords, int maxWords) {
    return new ResponseLengthEvaluator(minWords, maxWords);
}
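You can then mix it into the standard set when you want the deterministic length check to run alongside the LLM-based evaluators. A short sketch; the 10-150 word bounds are arbitrary, so pick limits that suit your product:

// Combine the reusable LLM-based evaluators with the deterministic length check
List<Evaluator> allEvaluators = new ArrayList<>(QAEvaluators.standard(judge));
allEvaluators.add(QAEvaluators.responseLength(10, 150));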

Part 8: Advanced Evaluation Patterns

Evaluating Precision and Recall

For RAG systems where you have ground truth labels for relevant documents, you can measure traditional IR (information retrieval) metrics: precision (the fraction of retrieved documents that are actually relevant) and recall (the fraction of relevant documents that were actually retrieved):

import dev.dokimos.core.evaluators.PrecisionEvaluator;
import dev.dokimos.core.evaluators.RecallEvaluator;
import dev.dokimos.core.evaluators.MatchingStrategy;

// Example with document IDs
var exampleWithIds = Example.builder()
        .input("What is your return policy?")
        .expectedOutput("relevantDocs", List.of("doc-returns-1", "doc-returns-2"))
        .build();

// Note: the lambda parameter must not shadow a local variable named "example"
Task taskWithDocIds = example -> {
    var response = assistant.answer(example.input());

    List<String> retrievedIds = response.retrievedDocuments().stream()
            .map(doc -> doc.getMetadata().get("id").toString())
            .toList();

    return Map.of(
            "output", response.answer(),
            "retrievedDocs", retrievedIds
    );
};
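For this to work, the documents loaded into the vector store need an "id" entry in their metadata, which the task above reads back out. Spring AI's Document accepts a metadata map, so in VectorStoreConfig you would create documents along these lines (the id values are illustrative and must match the relevantDocs labels in your dataset):

// Attach an id to each document so the retriever can report which ones it returned
new Document(
        "Our return policy allows customers to return any product within 30 days " +
        "of purchase for a full refund.",
        Map.of("id", "doc-returns-1")
)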

Evaluator precision = PrecisionEvaluator.builder()
        .name("Retrieval Precision")
        .retrievedKey("retrievedDocs")
        .expectedKey("relevantDocs")
        .matchingStrategy(MatchingStrategy.byEquality())
        .threshold(0.8)
        .build();

Evaluator recall = RecallEvaluator.builder()
        .name("Retrieval Recall")
        .retrievedKey("retrievedDocs")
        .expectedKey("relevantDocs")
        .matchingStrategy(MatchingStrategy.byEquality())
        .threshold(0.8)
        .build();

Flexible Matching Strategies

Dokimos provides various matching strategies for comparing retrieved items:

// Case-insensitive matching
MatchingStrategy.caseInsensitive()

// Match by a specific field in objects
MatchingStrategy.byField("id")

// Match by multiple fields
MatchingStrategy.byFields("subject", "predicate", "object")

// Substring containment
MatchingStrategy.byContainment(true)

// LLM-based semantic matching (most flexible)
MatchingStrategy.llmBased(judge)

// Combine strategies
MatchingStrategy.anyOf(strategy1, strategy2) // OR
MatchingStrategy.allOf(strategy1, strategy2) // AND

Async Evaluation

For large datasets, you can run evaluations asynchronously:

// Single evaluator async
CompletableFuture<EvalResult> future = evaluator.evaluateAsync(testCase);

// With custom executor for parallel evaluation
ExecutorService executor = Executors.newFixedThreadPool(4);
CompletableFuture<EvalResult> future = evaluator.evaluateAsync(testCase, executor);
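To fan an entire batch of test cases out in parallel and wait for all of them, standard CompletableFuture composition works. This sketch assumes testCases is a prepared List<EvalTestCase> and evaluator is any Dokimos evaluator:

ExecutorService executor = Executors.newFixedThreadPool(4);

// Kick off all evaluations in parallel on the shared executor
List<CompletableFuture<EvalResult>> futures = testCases.stream()
        .map(tc -> evaluator.evaluateAsync(tc, executor))
        .toList();

// Block until every evaluation has finished, then collect the results
List<EvalResult> results = futures.stream()
        .map(CompletableFuture::join)
        .toList();

executor.shutdown();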

Best Practices

Start with a Small, High-Quality Dataset

Do not try to build a huge dataset upfront. Start with 10 to 20 carefully crafted examples that cover your main use cases. Add more examples as you discover edge cases and failures.

Use Multiple Evaluators

Different evaluators catch different problems:

  • Faithfulness catches responses that stray from the context
  • Hallucination quantifies fabricated content
  • Answer Quality catches unhelpful or unclear responses
  • Contextual Relevance identifies retrieval issues

Set Realistic Thresholds

Do not expect perfection right away. Start with achievable thresholds (around 0.7) and increase them as you improve your system. A threshold of 1.0 means any imperfection fails.

Run Evaluations Regularly

Integrate evaluations into your CI/CD pipeline. Run quick evaluations on every PR with a small dataset, and full evaluations nightly or weekly with larger datasets.
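One straightforward way to separate the two is with JUnit 5 tags, selected via Maven Surefire's groups property. The tag names and dataset files below are illustrative:

import org.junit.jupiter.api.Tag;

// Quick smoke evaluation, run on every pull request
@Tag("eval-smoke")
@ParameterizedTest
@DatasetSource("classpath:datasets/qa-smoke.json")
void smokeEvaluation(Example example) {
    // same evaluation body as shouldProvideQualityAnswers, against a small dataset
}

// Full evaluation, run nightly against the complete dataset
@Tag("eval-full")
@ParameterizedTest
@DatasetSource("classpath:datasets/qa-full.json")
void fullEvaluation(Example example) {
    // identical body; larger dataset and stricter thresholds if desired
}

Run the smoke suite in the PR pipeline with mvn test -Dgroups=eval-smoke, and the full suite from a scheduled workflow with mvn test -Dgroups=eval-full.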

Conclusion

Evaluating AI agents is essential for building reliable applications. In this tutorial, you learned how to:

  1. Build a RAG-based knowledge assistant with Spring AI and expose it as a REST API
  2. Create evaluation datasets with examples and expected outputs
  3. Organize evaluators in a reusable factory class
  4. Configure multiple evaluators for different quality dimensions
  5. Integrate evaluations with JUnit 5 for CI/CD
  6. Track results over time with the Dokimos Server
  7. Create custom evaluators for domain-specific needs

The combination of Spring AI for building and Dokimos for evaluating provides a complete toolkit for developing production-ready AI applications in Java.

Next Steps

Resources


If you found this tutorial helpful, please consider giving the repository a star on GitHub. It helps others discover the project and keeps us motivated to improve it ⭐.