LangChain4j Integration
Dokimos works with LangChain4j so you can evaluate your AI Services and RAG pipelines.
Why Use This Integration?
- Automatic context extraction: LangChain4j's Result<T> objects already contain the retrieved documents. Dokimos extracts them automatically so you don't have to track context manually.
- Simple conversion: turn a ChatModel or AI Service into a Dokimos Task with one line of code.
- RAG evaluation ready: use the FaithfulnessEvaluator to check whether answers are grounded in the retrieved documents.
Setup
Add the LangChain4j integration dependency:
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-langchain4j</artifactId>
    <version>${dokimos.version}</version>
</dependency>
Basic Usage
Evaluating a Simple ChatModel
Convert a LangChain4j ChatModel to a Task:
import dev.dokimos.core.*;
import dev.dokimos.langchain4j.LangChain4jSupport;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

ChatModel model = OpenAiChatModel.builder()
        .apiKey(System.getenv("OPENAI_API_KEY"))
        .modelName("gpt-5.2")
        .build();

// Convert to Task
Task task = LangChain4jSupport.simpleTask(model);

// Run experiment
ExperimentResult result = Experiment.builder()
        .name("ChatModel Evaluation")
        .dataset(dataset)
        .task(task)
        .evaluators(evaluators)
        .build()
        .run();
Using ChatModel as LLM Judge
Convert a ChatModel to a JudgeLM for evaluation:
import dev.dokimos.langchain4j.LangChain4jSupport;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

ChatModel judgeModel = OpenAiChatModel.builder()
        .apiKey(System.getenv("OPENAI_API_KEY"))
        .modelName("gpt-5.2")
        .build();

// Convert to JudgeLM
JudgeLM judge = LangChain4jSupport.asJudge(judgeModel);

// Use in evaluators
Evaluator correctness = LLMJudgeEvaluator.builder()
        .name("Answer Correctness")
        .criteria("Is the answer factually correct?")
        .judge(judge)
        .threshold(0.8)
        .build();
Evaluating RAG Systems
Evaluating RAG systems is the main reason to use this integration. Here's a complete example:
import dev.dokimos.langchain4j.LangChain4jSupport;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.Result;

import java.util.List;

// 1. Define your AI Service interface (must return Result<String>)
interface Assistant {
    Result<String> chat(String userMessage);
}

// 2. Build your RAG pipeline
Assistant assistant = AiServices.builder(Assistant.class)
        .chatModel(chatModel)
        .contentRetriever(EmbeddingStoreContentRetriever.builder()
                .embeddingStore(embeddingStore)
                .embeddingModel(embeddingModel)
                .maxResults(3)
                .build())
        .build();

// 3. Create dataset
Dataset dataset = Dataset.builder()
        .name("customer-qa")
        .addExample(Example.of("What is the refund policy?", "30-day money-back guarantee"))
        .addExample(Example.of("How long does shipping take?", "5-7 business days"))
        .build();

// 4. Create Task (automatically extracts context from Result)
Task task = LangChain4jSupport.ragTask(assistant::chat);

// 5. Set up evaluators
JudgeLM judge = LangChain4jSupport.asJudge(judgeModel);

List<Evaluator> evaluators = List.of(
        // Check answer correctness
        LLMJudgeEvaluator.builder()
                .name("Answer Correctness")
                .criteria("Is the answer accurate and complete?")
                .judge(judge)
                .threshold(0.8)
                .build(),
        // Check faithfulness to retrieved context
        FaithfulnessEvaluator.builder()
                .threshold(0.7)
                .judge(judge)
                .build()
);

// 6. Run experiment
ExperimentResult result = Experiment.builder()
        .name("RAG Evaluation")
        .dataset(dataset)
        .task(task)
        .evaluators(evaluators)
        .build()
        .run();

// 7. Analyze results
System.out.println("Pass rate: " + result.passRate() * 100 + "%");
System.out.println("Faithfulness: " + result.averageScore("Faithfulness"));
How It Works
The ragTask() method extracts the input from each example, calls your AI Service, and automatically pulls the retrieved context from Result.sources(). The resulting output includes both the answer and the context:
{
  "output": "We offer a 30-day money-back guarantee",
  "context": [
    "Refund policy: 30-day guarantee...",
    "Contact support to process refunds..."
  ]
}
This lets the FaithfulnessEvaluator check if the answer is grounded in what was actually retrieved.
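Conceptually, ragTask() behaves like the sketch below, written with customTask() (covered under Tracking Extra Metrics). This is a simplified sketch, not the actual implementation; it assumes assistant is the AI Service from the example above:

// Roughly what ragTask() does under the hood (simplified sketch)
Task task = LangChain4jSupport.customTask(example -> {
    // Call the AI Service and unpack the Result
    Result<String> result = assistant.chat(example.input());
    return Map.of(
            "output", result.content(),
            "context", LangChain4jSupport.extractTexts(result.sources())
    );
});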
Advanced Usage
Custom Dataset Keys
If your dataset uses different key names:
// Dataset with custom keys
Dataset dataset = Dataset.builder()
        .addExample(Example.builder()
                .input("question", "What is the refund policy?")
                .expectedOutput("answer", "30-day money-back guarantee")
                .build())
        .build();

// Map keys accordingly
Task task = LangChain4jSupport.ragTask(
        assistant::chat,
        "question",          // input key
        "answer",            // output key
        "retrievedContext"   // context key
);
Tracking Extra Metrics
Use customTask() to track latency, source counts, or other metrics:
Task task = LangChain4jSupport.customTask(example -> {
    long start = System.currentTimeMillis();
    Result<String> result = assistant.chat(example.input());
    long latency = System.currentTimeMillis() - start;

    return Map.of(
            "output", result.content(),
            "context", LangChain4jSupport.extractTexts(result.sources()),
            "latencyMs", latency,
            "numSources", result.sources().size()
    );
});
Context Extraction Utilities
Extract retrieved context in different formats:
// Simple text extraction
List<String> contextTexts = LangChain4jSupport.extractTexts(result.sources());
// ["Text from doc 1", "Text from doc 2"]

// With metadata (for source attribution)
List<Map<String, Object>> contextsWithMeta =
        LangChain4jSupport.extractTextsWithMetadata(result.sources());
// [
//   {"text": "...", "metadata": {"source": "doc1.pdf", "page": 5}},
//   {"text": "...", "metadata": {"source": "doc2.pdf", "page": 12}}
// ]
RAG-Specific Evaluators
Faithfulness Evaluation
Verify outputs are grounded in retrieved context:
Evaluator faithfulness = FaithfulnessEvaluator.builder()
        .threshold(0.8)
        .judge(judge)
        .contextKey("context")   // Must match the Task's context key
        .includeReason(true)
        .build();
The evaluator:
- Extracts claims from the actual output
- Verifies each claim against the retrieved context
- Computes score = (supported claims) / (total claims), as in the worked example below
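For instance, if the judge extracts three claims from an answer and can verify only two of them against the retrieved context (the numbers here are illustrative):

int totalClaims = 3;        // claims extracted from the actual output
int supportedClaims = 2;    // claims verified against the retrieved context
double score = (double) supportedClaims / totalClaims;  // ~0.67
boolean passed = score >= 0.8;  // false: 0.67 falls below the 0.8 threshold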
Multi-dimensional RAG Evaluation
Evaluate different quality aspects:
List<Evaluator> evaluators = List.of(
        // Answer quality
        LLMJudgeEvaluator.builder()
                .name("Answer Quality")
                .criteria("Is the answer helpful and accurate?")
                .evaluationParams(List.of(
                        EvalTestCaseParam.INPUT,
                        EvalTestCaseParam.ACTUAL_OUTPUT
                ))
                .judge(judge)
                .threshold(0.8)
                .build(),
        // Faithfulness to sources
        FaithfulnessEvaluator.builder()
                .name("Faithfulness")
                .threshold(0.85)
                .judge(judge)
                .build(),
        // Context relevance
        LLMJudgeEvaluator.builder()
                .name("Context Relevance")
                .criteria("Is the retrieved context relevant to answering the question?")
                .evaluationParams(List.of(
                        EvalTestCaseParam.INPUT,
                        EvalTestCaseParam.METADATA  // Contains the retrieved context
                ))
                .judge(judge)
                .threshold(0.75)
                .build()
);
Complete Working Example
import dev.dokimos.core.*;
import dev.dokimos.langchain4j.LangChain4jSupport;
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.Result;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

import java.util.List;

public class RAGEvaluation {

    // AI Service interface (must return Result<String> so context can be extracted)
    interface Assistant {
        Result<String> chat(String userMessage);
    }

    public static void main(String[] args) {
        // 1. Set up RAG components
        var embeddingModel = new BgeSmallEnV15QuantizedEmbeddingModel();
        var embeddingStore = new InMemoryEmbeddingStore<TextSegment>();

        // Ingest documents
        var documents = List.of(
                Document.from("We offer a 30-day money-back guarantee."),
                Document.from("Standard shipping takes 5-7 business days.")
        );
        EmbeddingStoreIngestor.builder()
                .embeddingModel(embeddingModel)
                .embeddingStore(embeddingStore)
                .build()
                .ingest(documents);

        // 2. Build AI Service
        Assistant assistant = AiServices.builder(Assistant.class)
                .chatModel(OpenAiChatModel.builder()
                        .apiKey(System.getenv("OPENAI_API_KEY"))
                        .modelName("gpt-5.2")
                        .build())
                .contentRetriever(EmbeddingStoreContentRetriever.builder()
                        .embeddingStore(embeddingStore)
                        .embeddingModel(embeddingModel)
                        .maxResults(2)
                        .build())
                .build();

        // 3. Create dataset
        Dataset dataset = Dataset.builder()
                .name("customer-qa")
                .addExample(Example.of(
                        "What is the refund policy?",
                        "30-day money-back guarantee"
                ))
                .addExample(Example.of(
                        "How long does shipping take?",
                        "5-7 business days"
                ))
                .build();

        // 4. Set up evaluation
        var judgeModel = OpenAiChatModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))
                .modelName("gpt-5.2")
                .build();
        JudgeLM judge = LangChain4jSupport.asJudge(judgeModel);

        List<Evaluator> evaluators = List.of(
                LLMJudgeEvaluator.builder()
                        .name("Answer Quality")
                        .criteria("Is the answer accurate?")
                        .judge(judge)
                        .threshold(0.8)
                        .build(),
                FaithfulnessEvaluator.builder()
                        .threshold(0.7)
                        .judge(judge)
                        .build()
        );

        // 5. Run experiment
        ExperimentResult result = Experiment.builder()
                .name("RAG Evaluation")
                .dataset(dataset)
                .task(LangChain4jSupport.ragTask(assistant::chat))
                .evaluators(evaluators)
                .build()
                .run();

        // 6. Display results
        System.out.println("Pass rate: " +
                String.format("%.0f%%", result.passRate() * 100));
        System.out.println("Answer Quality: " +
                String.format("%.2f", result.averageScore("Answer Quality")));
        System.out.println("Faithfulness: " +
                String.format("%.2f", result.averageScore("Faithfulness")));
    }
}
JUnit 5 Integration
Combine with JUnit 5 for testing:
import dev.dokimos.junit5.DatasetSource;
import org.junit.jupiter.params.ParameterizedTest;

@ParameterizedTest
@DatasetSource("classpath:datasets/rag-qa.json")
void ragSystemShouldAnswerCorrectly(Example example) {
    // Call your RAG system
    Result<String> result = assistant.chat(example.input());

    // Create test case with context
    Map<String, Object> outputs = Map.of(
            "output", result.content(),
            "context", LangChain4jSupport.extractTexts(result.sources())
    );
    EvalTestCase testCase = example.toTestCase(outputs);

    // Assert faithfulness
    Assertions.assertEval(testCase, faithfulnessEvaluator);
}
Best Practices
Always return Result<String>: Your AI Service interface must return Result<String>, not just String. This is how LangChain4j provides the retrieved context.
// ✅ Good
interface Assistant {
    Result<String> chat(String message);
}

// ❌ Won't work (context can't be extracted)
interface Assistant {
    String chat(String message);
}
Use a stronger model for judging: Evaluate with GPT-5.2 or similar, even if your application uses a smaller model for generation.
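For example (the smaller generation model name here is purely illustrative):

ChatModel generationModel = OpenAiChatModel.builder()
        .apiKey(System.getenv("OPENAI_API_KEY"))
        .modelName("gpt-5.2-mini")  // illustrative: a cheaper model for generation
        .build();

ChatModel judgeModel = OpenAiChatModel.builder()
        .apiKey(System.getenv("OPENAI_API_KEY"))
        .modelName("gpt-5.2")       // a stronger model reserved for judging
        .build();
JudgeLM judge = LangChain4jSupport.asJudge(judgeModel);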
Track retrieval quality: Monitor how many documents are retrieved and whether they're relevant. Use customTask() to add metrics (see Tracking Extra Metrics above).
Test different retrieval settings: Use experiments to compare different maxResults values, embedding models, or reranking strategies.
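For instance, reusing the dataset, evaluators, and RAG components from the complete example above, a comparison sketch might look like this:

// Sketch: run one experiment per retrieval depth (assumes dataset, evaluators,
// chatModel, embeddingStore, and embeddingModel are already in scope)
for (int maxResults : List.of(1, 3, 5)) {
    Assistant assistant = AiServices.builder(Assistant.class)
            .chatModel(chatModel)
            .contentRetriever(EmbeddingStoreContentRetriever.builder()
                    .embeddingStore(embeddingStore)
                    .embeddingModel(embeddingModel)
                    .maxResults(maxResults)
                    .build())
            .build();

    ExperimentResult result = Experiment.builder()
            .name("RAG maxResults=" + maxResults)
            .dataset(dataset)
            .task(LangChain4jSupport.ragTask(assistant::chat))
            .evaluators(evaluators)
            .build()
            .run();

    System.out.println("maxResults=" + maxResults +
            " -> Faithfulness: " + result.averageScore("Faithfulness"));
}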