# Multi-Turn Conversations

This page shows you how to test a chat assistant across a full back-and-forth conversation, not just one prompt and reply.

Single-turn tests check one answer. Real users keep talking. They follow up, change their mind, and get frustrated. To test that, you need to drive a whole conversation and then judge how it went.

Dokimos gives you three pieces to do that:

- **Simulated users**: an LLM that plays a role and types like a real person (an angry customer, a confused user, a technical expert).
- **Conversation simulator**: takes turns between your app and the simulated user until the chat ends.
- **Trajectory evaluator**: scores the whole conversation with an LLM as the judge.

## Quick Example

Here is the full loop: build a fake user, wrap your app, run the chat, then grade it. Copy this and replace `chatClient` and `judgeLM` with your own.

```java
// 1. Create a simulated user (frustrated customer)
SimulatedUser user = UserPersonas.aggressiveCustomer(judgeLM);

// 2. Wrap your application
ConversationalApplication app = trajectory -> {
    String response = chatClient.chat(formatHistory(trajectory));
    return Message.assistant(response);
};

// 3. Run the simulation
ConversationTrajectory trajectory = ConversationSimulator.builder()
    .simulatedUser(user)
    .application(app)
    .maxTurns(8)
    .scenario("Handle product return request")
    .initialMessage("I want to return this defective product!")
    .build()
    .simulate();

// 4. Evaluate the conversation
EvalResult result = TrajectoryEvaluator.builder()
    .name("Customer Service Quality")
    .threshold(0.7)
    .judge(judgeLM)
    .criteria(List.of(
        TrajectoryEvaluationCriteria.userSatisfaction(),
        TrajectoryEvaluationCriteria.problemResolution()
    ))
    .build()
    .evaluate(EvalTestCase.builder()
        .actualOutput("trajectory", trajectory)
        .build());
```

```kotlin
// 1. Create a simulated user (frustrated customer)
val user: SimulatedUser = UserPersonas.aggressiveCustomer(judgeLM)

// 2. Wrap your application
val app: ConversationalApplication = ConversationalApplication { trajectory ->
    val response = chatClient.chat(formatHistory(trajectory))
    Message.assistant(response)
}

// 3. Run the simulation
val trajectory = simulator {
    simulatedUser = user
    application = app
    maxTurns = 8
    scenario = "Handle product return request"
    initialMessage = "I want to return this defective product!"
}.simulate()

// 4. Evaluate the conversation
val result = trajectoryEvaluator(judgeLM) {
    name = "Customer Service Quality"
    threshold = 0.7
    criteria(listOf(
        TrajectoryEvaluationCriteria.userSatisfaction(),
        TrajectoryEvaluationCriteria.problemResolution()
    ))
}
    .evaluate(
        EvalTestCase(
            actualOutputs = mapOf("trajectory" to trajectory)
        )
    )
```

The rest of this page breaks down each step.

## Core Concepts

### Messages and Trajectories

A conversation is a list of messages. Each message has a role: user, assistant, or system. Build one with the matching factory method.

```java
Message userMsg = Message.user("I need help with my order");
Message assistantMsg = Message.assistant("I'd be happy to help. What's your order number?");
Message systemMsg = Message.system("You are a helpful support agent");
```

```kotlin
val userMsg = Message.user("I need help with my order")
val assistantMsg = Message.assistant("I'd be happy to help. What's your order number?")
val systemMsg = Message.system("You are a helpful support agent")
```

A `ConversationTrajectory` holds the whole conversation.
The simulator builds one for you, but you can also build one by hand to test a fixed transcript.

```java
ConversationTrajectory trajectory = ConversationTrajectory.builder()
    .scenario("Customer support interaction")
    .userMessage("I need help")
    .assistantMessage("How can I assist you?")
    .userMessage("My order is late")
    .assistantMessage("Let me check that for you")
    .build();

// Methods you will use
trajectory.turnCount();          // Number of complete turns
trajectory.userMessages();       // All user messages
trajectory.assistantMessages();  // All assistant messages
trajectory.lastMessage();        // Most recent message
trajectory.toJson();             // JSON for debugging
trajectory.toText();             // Plain text transcript
```

```kotlin
val trajectory = trajectory {
    scenario = "Customer support interaction"
    user("I need help")
    assistant("How can I assist you?")
    user("My order is late")
    assistant("Let me check that for you")
}

// Methods you will use
trajectory.turnCount()          // Number of complete turns
trajectory.userMessages()       // All user messages
trajectory.assistantMessages()  // All assistant messages
trajectory.lastMessage()        // Most recent message
trajectory.toJson()             // JSON for debugging
trajectory.toText()             // Plain text transcript
```

### Tool Calls on Turns

A real agent calls tools mid-conversation: it looks up the weather, searches flights, then books a hotel. An assistant turn can carry the tool calls it made, so you can score *what the agent did each turn*, not just what it said.

Attach a typed `List<ToolCall>` to an assistant turn. A turn that called no tools needs no change.
```java
ConversationTrajectory trajectory = ConversationTrajectory.builder()
    .userMessage("What's the weather in Paris?")
    .assistantMessage("It's 18C and sunny.", List.of(
        ToolCall.builder().name("get_weather").argument("city", "Paris").result("18C, sunny").build()
    ))
    .userMessage("Book me a hotel there.")
    .assistantMessage("Booked the Hotel Le Marais.", List.of(
        ToolCall.of("book_hotel", Map.of("city", "Paris"))
    ))
    .userMessage("Thanks!")
    .assistantMessage("You're all set!") // tool-free turn, unchanged
    .build();
```

```kotlin
val trajectory = trajectory {
    user("What's the weather in Paris?")
    assistant("It's 18C and sunny.", listOf(
        ToolCall.builder().name("get_weather").argument("city", "Paris").result("18C, sunny").build()
    ))
    user("Book me a hotel there.")
    assistant("Booked the Hotel Le Marais.", listOf(
        ToolCall.of("book_hotel", mapOf("city" to "Paris"))
    ))
    user("Thanks!")
    assistant("You're all set!") // tool-free turn, unchanged
}
```

`Message` carries the tool calls as a typed `List<ToolCall>`; an assistant message built without them returns an empty list. When your app produces a turn, attach the calls with `Message.assistant(content, toolCalls)`.

#### Per-Turn Evaluation (Primary Path)

This is the recommended way to grade tool use across a conversation. `toolCallsByTurn()` returns one tool-call list per assistant turn, in order. Pair each turn with the calls you expected and run the [deterministic agent evaluators](./agent-evaluation.md), with no LLM and no API key.
```java
List<List<ToolCall>> actualByTurn = trajectory.toolCallsByTurn();
List<List<ToolCall>> expectedByTurn = List.of(
    List.of(ToolCall.of("get_weather", Map.of())),
    List.of(ToolCall.of("book_hotel", Map.of())),
    List.of() // final turn calls no tools
);

var validity = ToolCallValidityEvaluator.builder().build();
var correctness = ToolCorrectnessEvaluator.builder().build();

for (int turn = 0; turn < actualByTurn.size(); turn++) {
    EvalTestCase turnCase = EvalTestCase.builder()
        .actualOutput("toolCalls", actualByTurn.get(turn))
        .expectedOutput("toolCalls", expectedByTurn.get(turn))
        .metadata("tools", tools)
        .build();
    EvalResult v = validity.evaluate(turnCase);
    EvalResult c = correctness.evaluate(turnCase);
}
```

```kotlin
val actualByTurn = trajectory.toolCallsByTurn()
val expectedByTurn = listOf(
    listOf(ToolCall.of("get_weather", mapOf())),
    listOf(ToolCall.of("book_hotel", mapOf())),
    listOf() // final turn calls no tools
)

val validity = ToolCallValidityEvaluator.builder().build()
val correctness = ToolCorrectnessEvaluator.builder().build()

actualByTurn.forEachIndexed { turn, calls ->
    val turnCase = EvalTestCase.builder()
        .actualOutput("toolCalls", calls)
        .expectedOutput("toolCalls", expectedByTurn[turn])
        .metadata("tools", tools)
        .build()
    val v = validity.evaluate(turnCase)
    val c = correctness.evaluate(turnCase)
}
```

:::note
`toolCallsByTurn()` groups by **assistant message**, which can differ from `turnCount()` (user/assistant pairs) when a conversation has consecutive or leading assistant messages. Each inner list lines up with `assistantMessages()`.
:::

See [`MultiTurnToolCallExample.java`](https://github.com/dokimos-dev/dokimos/blob/master/dokimos-examples/src/main/java/dev/dokimos/examples/conversation/MultiTurnToolCallExample.java) for a complete runnable version.

#### Whole-Conversation Shortcuts

When you want to assert over the whole conversation rather than per turn, build a test case straight from the trajectory.

- `toolCalls()`: every turn's calls flattened into one list, in order.
- `toTestCase()` and `toTestCase(tools)`: a **deterministic** test case. The flattened `toolCalls` go in the actual outputs, the input is the **last user message**, and `tools` (when given) go in metadata. As-is, it feeds the rule-based evaluators that read only actual outputs (validity, error, efficiency). `ToolCorrectnessEvaluator` and `ToolTrajectoryEvaluator` additionally need an expected list, which this path does not set; wire one in yourself (for example, `EvalTestCase.builder().expectedOutput("toolCalls", expected)`) or they throw an `EvaluationException`.
- `toTestCase(tools, tasks)`: the **judge** test case for `TaskCompletionEvaluator` and `ToolArgumentHallucinationEvaluator`. Its input is the rendered transcript of the whole conversation, but tool calls are rendered **name-only** (`[tool: name]`, not `[tool: name(args)]`) so the argument values a hallucination judge assesses never appear in the grounding it reads; the arguments stay available through the actual outputs. No separate output is set, so the transcript is not double-wrapped.
- `toAgentTrace()` / `toAgentOutputs()`: collapse the conversation into a single `AgentTrace` (or its output map) for the standard agent data flow.
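To picture the difference between the per-turn view (`toolCallsByTurn()`) and the flattened view (`toolCalls()`), the plain-Java sketch below models both on a toy transcript. The `Call` and `Msg` records and the helper methods are hypothetical stand-ins written for this illustration, not Dokimos classes, so read it as a mental model rather than the library's implementation.

```java
import java.util.List;

/** Illustrative stand-ins only; these are not Dokimos types. */
final class ToolCallViewsSketch {

    /** Hypothetical tool call: just a name. */
    record Call(String name) {}

    /** Hypothetical message: a role, its text, and the calls made on that turn. */
    record Msg(String role, String content, List<Call> calls) {}

    /** Per-turn view: one list per assistant message, in order (empty when no tool was called). */
    static List<List<Call>> byAssistantTurn(List<Msg> messages) {
        return messages.stream()
            .filter(m -> m.role().equals("assistant"))
            .map(Msg::calls)
            .toList();
    }

    /** Flattened view: every assistant call in conversation order. */
    static List<Call> flattened(List<Msg> messages) {
        return byAssistantTurn(messages).stream()
            .flatMap(List::stream)
            .toList();
    }

    /** Toy transcript: weather lookup, hotel booking, then a tool-free closing turn. */
    static List<Msg> sample() {
        return List.of(
            new Msg("user", "What's the weather in Paris?", List.of()),
            new Msg("assistant", "It's 18C and sunny.", List.of(new Call("get_weather"))),
            new Msg("user", "Book me a hotel there.", List.of()),
            new Msg("assistant", "Booked the Hotel Le Marais.", List.of(new Call("book_hotel"))),
            new Msg("user", "Thanks!", List.of()),
            new Msg("assistant", "You're all set!", List.of())
        );
    }

    public static void main(String[] args) {
        List<Msg> messages = sample();
        System.out.println("per turn:  " + byAssistantTurn(messages));
        System.out.println("flattened: " + flattened(messages));
    }
}
```

In this sketch the per-turn view keeps three lists (the last one empty), while the flattened view holds two calls and no longer records which turn made them. That loss of position is why the per-turn path suits turn-by-turn expectations and the shortcuts suit whole-conversation assertions.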
```java
// Deterministic: input is the last user message, calls are flattened across turns
EvalTestCase deterministic = trajectory.toTestCase(tools);
EvalResult validity = ToolCallValidityEvaluator.builder().build().evaluate(deterministic);

// Judge: input is the transcript (tool calls name-only), tasks listed in metadata
EvalTestCase judgeCase = trajectory.toTestCase(tools, List.of("Check weather", "Book a hotel"));
EvalResult completion = TaskCompletionEvaluator.builder().judge(judgeLM).build().evaluate(judgeCase);
```

```kotlin
// Deterministic: input is the last user message, calls are flattened across turns
val deterministic = trajectory.toTestCase(tools)
val validity = ToolCallValidityEvaluator.builder().build().evaluate(deterministic)

// Judge: input is the transcript (tool calls name-only), tasks listed in metadata
val judgeCase = trajectory.toTestCase(tools, listOf("Check weather", "Book a hotel"))
val completion = TaskCompletionEvaluator.builder().judge(judgeLM).build().evaluate(judgeCase)
```

#### Tool Calls in the Transcript

`toText()` and `toJson()` render each turn's tool calls. `toText()` adds one compact `[tool: name(args)]` line per call under the message; `toJson()` adds a `toolCalls` array to a turn that has any. A tool-free conversation renders exactly as before, byte-identical, so adding tool calls to one turn never reshapes the rest.

To let the trajectory judge reason over tool usage, turn it on with `includeToolCalls(true)`. It is off by default, so existing judge suites see an unchanged prompt.
```java
TrajectoryEvaluator evaluator = TrajectoryEvaluator.builder()
    .name("Support Quality")
    .judge(judgeLM)
    .criteria(List.of(TrajectoryEvaluationCriteria.goalCompletion()))
    .includeToolCalls(true) // render each turn's tool calls in the judge prompt
    .build();
```

```kotlin
// includeToolCalls is on the Java builder; call it directly from Kotlin
val evaluator = TrajectoryEvaluator.builder()
    .name("Support Quality")
    .judge(judgeLM)
    .criteria(listOf(TrajectoryEvaluationCriteria.goalCompletion()))
    .includeToolCalls(true) // render each turn's tool calls in the judge prompt
    .build()
```

### Simulated Users

A simulated user types the user side of the chat. The `SimulatedUser` interface takes the conversation so far and returns the next user message.

```java
@FunctionalInterface
public interface SimulatedUser {
    Message generateMessage(ConversationTrajectory trajectory);
}
```

```kotlin
fun interface SimulatedUser {
    fun generateMessage(trajectory: ConversationTrajectory): Message
}
```

#### LLM-Based Simulated User

`LLMSimulatedUser` uses an LLM to write each message. Give it a persona and a few behavior rules, and it stays in character across turns.

```java
SimulatedUser user = LLMSimulatedUser.builder()
    .judge(judgeLM)
    .persona("impatient customer who is in a hurry")
    .behaviorGuidelines("""
        - Express time pressure
        - Ask for quick solutions
        - Show frustration with long explanations
        """)
    .build();
```

```kotlin
val user: SimulatedUser = llmUser(judgeLM) {
    persona = "impatient customer who is in a hurry"
    behaviorGuidelines = """
        - Express time pressure
        - Ask for quick solutions
        - Show frustration with long explanations
        """
}
```

Want the conversation to start the same way every run? Set fixed responses for the opening turns.
```java
SimulatedUser user = LLMSimulatedUser.builder()
    .judge(judgeLM)
    .persona("customer with a complaint")
    .fixedResponses(List.of(
        "I ordered a blue shirt but received a red one!",
        "I want a full refund, not a replacement"
    ))
    .build();
```

```kotlin
val user: SimulatedUser = llmUser(judgeLM) {
    persona = "customer with a complaint"
    fixedResponses(listOf(
        "I ordered a blue shirt but received a red one!",
        "I want a full refund, not a replacement"
    ))
}
```

The simulated user sends each fixed response in order, one per turn. After the list runs out, the LLM takes over and writes contextual replies.

#### Pre-Built Personas

`UserPersonas` ships ready-made characters for common tests. Pass your `judgeLM` and you get a configured `SimulatedUser`.

```java
// Customer service
UserPersonas.aggressiveCustomer(judgeLM)   // Frustrated, demanding
UserPersonas.confusedUser(judgeLM)         // Needs clarification
UserPersonas.impatientUser(judgeLM)        // Wants quick answers
UserPersonas.satisfiedCustomer(judgeLM)    // Cooperative, positive

// Technical users
UserPersonas.technicalExpert(judgeLM)      // Uses jargon, probes details
UserPersonas.noviceUser(judgeLM)           // Needs basic explanations

// Edge cases
UserPersonas.adversarialUser(judgeLM)      // Tests boundaries (red-teaming)
UserPersonas.offTopicUser(judgeLM)         // Goes on tangents
```

```kotlin
// Customer service
UserPersonas.aggressiveCustomer(judgeLM)   // Frustrated, demanding
UserPersonas.confusedUser(judgeLM)         // Needs clarification
UserPersonas.impatientUser(judgeLM)        // Wants quick answers
UserPersonas.satisfiedCustomer(judgeLM)    // Cooperative, positive

// Technical users
UserPersonas.technicalExpert(judgeLM)      // Uses jargon, probes details
UserPersonas.noviceUser(judgeLM)           // Needs basic explanations

// Edge cases
UserPersonas.adversarialUser(judgeLM)      // Tests boundaries (red-teaming)
UserPersonas.offTopicUser(judgeLM)         // Goes on tangents
```

Need a character that is not in the list? Build your own with `UserPersonas.custom`.
Pass the judge, a one-line persona, and the behavior rules.

```java
SimulatedUser user = UserPersonas.custom(
    judgeLM,
    "elderly user unfamiliar with technology",
    """
    - Use simple language
    - Ask about basic terminology
    - Express confusion about technical steps
    - Need reassurance
    """
);
```

```kotlin
val user: SimulatedUser = llmUser(judgeLM) {
    persona = "elderly user unfamiliar with technology"
    behaviorGuidelines = """
        - Use simple language
        - Ask about basic terminology
        - Express confusion about technical steps
        - Need reassurance
        """
}
```

### Conversation Simulator

`ConversationSimulator` runs the chat. It alternates between the simulated user and your app until it hits `maxTurns` or your stopping condition. Each option is commented below.

```java
ConversationSimulator simulator = ConversationSimulator.builder()
    .simulatedUser(user)
    .application(myApp)
    .maxTurns(10)                            // Limit conversation length
    .scenario("Product return request")      // Context for the user
    .initialMessage("I want to return...")   // First user message
    .stoppingCondition(trajectory -> {       // Optional early termination
        Message last = trajectory.lastAssistantMessage();
        return last != null && last.content().contains("goodbye");
    })
    .build();

ConversationTrajectory trajectory = simulator.simulate();
```

```kotlin
val simulator = simulator {
    simulatedUser = user
    application = myApp
    maxTurns = 10                            // Limit conversation length
    scenario = "Product return request"      // Context for the user
    initialMessage = "I want to return..."   // First user message
    stoppingCondition = { trajectory ->      // Optional early termination
        val last = trajectory.lastAssistantMessage()
        last != null && last.content().contains("goodbye")
    }
}

val trajectory = simulator.simulate()
```

To run the chat off the calling thread, use `simulateAsync` instead of `simulate`.

```java
CompletableFuture<ConversationTrajectory> future = simulator.simulateAsync();
// ... do other work ...
ConversationTrajectory trajectory = future.get();
```

```kotlin
// await() here is kotlinx.coroutines.future.await, so call it from a coroutine
val trajectory: ConversationTrajectory = simulator.simulateAsync().await()
```

### Wrapping Your Application

The simulator needs to call your app each turn. Implement `ConversationalApplication`. It takes the conversation so far and returns the assistant's next reply.

```java
@FunctionalInterface
public interface ConversationalApplication {
    Message respond(ConversationTrajectory trajectory);
}
```

```kotlin
fun interface ConversationalApplication {
    fun respond(trajectory: ConversationTrajectory): Message
}
```

Inside `respond`, convert the trajectory to your framework's message type, call your model, and wrap the reply in `Message.assistant(...)`. Here is how to do that with Spring AI.

```java
ConversationalApplication app = trajectory -> {
    // Convert trajectory to Spring AI messages.
    // Spring AI's Message is fully qualified so it does not clash with the Dokimos Message,
    // and the explicit type argument on map keeps the list typed as List<Message>.
    List<org.springframework.ai.chat.messages.Message> messages = trajectory.messages().stream()
        .<org.springframework.ai.chat.messages.Message>map(m -> switch (m.role()) {
            case USER -> new UserMessage(m.content());
            case ASSISTANT -> new AssistantMessage(m.content());
            case SYSTEM -> new SystemMessage(m.content());
        })
        .toList();

    String response = chatClient.prompt()
        .messages(messages)
        .call()
        .content();

    return Message.assistant(response);
};
```

```kotlin
val app: ConversationalApplication = ConversationalApplication { trajectory ->
    // Convert trajectory to Spring AI messages
    val messages = trajectory.messages()
        .map { m ->
            when (m.role()) {
                Message.Role.USER -> UserMessage(m.content())
                Message.Role.ASSISTANT -> AssistantMessage(m.content())
                Message.Role.SYSTEM -> SystemMessage(m.content())
            }
        }

    val response = chatClient.prompt()
        .messages(messages)
        .call()
        .content()

    Message.assistant(response)
}
```

The same pattern works with LangChain4j. Map the roles to LangChain4j message types and call your `chatModel`.
```java
ConversationalApplication app = trajectory -> {
    // Convert trajectory to LangChain4j messages
    List<ChatMessage> messages = trajectory.messages().stream()
        .map(m -> switch (m.role()) {
            case USER -> new UserMessage(m.content());
            case ASSISTANT -> new AiMessage(m.content());
            case SYSTEM -> new SystemMessage(m.content());
        })
        .toList();

    // In LangChain4j 1.x, chat(List<ChatMessage>) returns a ChatResponse; unwrap the reply text
    String response = chatModel.chat(messages).aiMessage().text();
    return Message.assistant(response);
};
```

```kotlin
val app: ConversationalApplication = ConversationalApplication { trajectory ->
    // Convert trajectory to LangChain4j messages
    val messages = trajectory.messages()
        .map { m ->
            when (m.role()) {
                Message.Role.USER -> UserMessage(m.content())
                Message.Role.ASSISTANT -> AiMessage(m.content())
                Message.Role.SYSTEM -> SystemMessage(m.content())
            }
        }

    // In LangChain4j 1.x, chat(List<ChatMessage>) returns a ChatResponse; unwrap the reply text
    val response = chatModel.chat(messages).aiMessage().text()
    Message.assistant(response)
}
```

## Trajectory Evaluation

Once you have a trajectory, `TrajectoryEvaluator` grades it. It sends the whole conversation to the judge LLM and scores it against the criteria you pick. Set a `threshold` to decide pass or fail.

```java
TrajectoryEvaluator evaluator = TrajectoryEvaluator.builder()
    .name("Support Quality")
    .threshold(0.7)
    .judge(judgeLM)
    .criteria(List.of(
        TrajectoryEvaluationCriteria.userSatisfaction(),
        TrajectoryEvaluationCriteria.goalCompletion(),
        TrajectoryEvaluationCriteria.professionalTone()
    ))
    .aggregationStrategy(AggregationStrategy.WEIGHTED_MEAN)
    .includePerCriterionScores(true)
    .build();
```

```kotlin
val evaluator = trajectoryEvaluator(judgeLM) {
    name = "Support Quality"
    threshold = 0.7
    criteria(listOf(
        TrajectoryEvaluationCriteria.userSatisfaction(),
        TrajectoryEvaluationCriteria.goalCompletion(),
        TrajectoryEvaluationCriteria.professionalTone()
    ))
    aggregationStrategy = AggregationStrategy.WEIGHTED_MEAN
    includePerCriterionScores = true
}
```

### Evaluation Criteria

Each criterion is one thing the judge checks. An `EvaluationCriterion` has a name, a description of what to look for, and a weight.
Raise the weight to make a criterion count more in the final score.

```java
EvaluationCriterion criterion = new EvaluationCriterion(
    "Response Time Awareness",
    "Evaluate if the assistant acknowledged and respected the user's time constraints",
    1.5 // Higher weight
);
```

```kotlin
val criterion = EvaluationCriterion(
    "Response Time Awareness",
    "Evaluate if the assistant acknowledged and respected the user's time constraints",
    1.5 // Higher weight
)
```

You do not have to write your own. `TrajectoryEvaluationCriteria` has ready-made criteria grouped by what they check.

```java
// Core quality
TrajectoryEvaluationCriteria.userSatisfaction()      // Was the user satisfied?
TrajectoryEvaluationCriteria.goalCompletion()        // Was the goal achieved?
TrajectoryEvaluationCriteria.conversationQuality()   // Natural flow and coherence

// Professional quality
TrajectoryEvaluationCriteria.responseRelevance()     // On-topic responses
TrajectoryEvaluationCriteria.professionalTone()      // Appropriate demeanor
TrajectoryEvaluationCriteria.problemResolution()     // Issues resolved

// Information quality
TrajectoryEvaluationCriteria.informationAccuracy()   // Factually correct
TrajectoryEvaluationCriteria.clarity()               // Easy to understand
TrajectoryEvaluationCriteria.helpfulness()           // Genuinely helpful

// Behavioral
TrajectoryEvaluationCriteria.consistency()           // No contradictions
TrajectoryEvaluationCriteria.safety()                // Appropriate boundaries
```

```kotlin
// Core quality
TrajectoryEvaluationCriteria.userSatisfaction()      // Was the user satisfied?
TrajectoryEvaluationCriteria.goalCompletion()        // Was the goal achieved?
TrajectoryEvaluationCriteria.conversationQuality()   // Natural flow and coherence

// Professional quality
TrajectoryEvaluationCriteria.responseRelevance()     // On-topic responses
TrajectoryEvaluationCriteria.professionalTone()      // Appropriate demeanor
TrajectoryEvaluationCriteria.problemResolution()     // Issues resolved

// Information quality
TrajectoryEvaluationCriteria.informationAccuracy()   // Factually correct
TrajectoryEvaluationCriteria.clarity()               // Easy to understand
TrajectoryEvaluationCriteria.helpfulness()           // Genuinely helpful

// Behavioral
TrajectoryEvaluationCriteria.consistency()           // No contradictions
TrajectoryEvaluationCriteria.safety()                // Appropriate boundaries
```

### Aggregation Strategies

The judge scores each criterion. The aggregation strategy decides how those scores combine into one number.

```java
AggregationStrategy.MEAN           // Simple average
AggregationStrategy.WEIGHTED_MEAN  // Weighted by criterion weights
AggregationStrategy.MIN            // Strictest: lowest score wins
AggregationStrategy.MAX            // Most lenient: highest score wins
```

```kotlin
AggregationStrategy.MEAN           // Simple average
AggregationStrategy.WEIGHTED_MEAN  // Weighted by criterion weights
AggregationStrategy.MIN            // Strictest: lowest score wins
AggregationStrategy.MAX            // Most lenient: highest score wins
```

### Evaluation Results

`evaluate` returns an `EvalResult` with the overall score, a pass flag, and metadata. When you set `includePerCriterionScores(true)`, the metadata holds the score and reason for every criterion under `criterionScores`. Read it like this.
```java
EvalResult result = evaluator.evaluate(testCase);

System.out.println("Overall Score: " + result.score());
System.out.println("Passed: " + result.success());
System.out.println("Turn Count: " + result.metadata().get("turnCount"));

// Per-criterion breakdown (metadata values are untyped, so these casts are unchecked)
Map<String, Object> criterionScores = (Map<String, Object>) result.metadata().get("criterionScores");
criterionScores.forEach((name, details) -> {
    Map<String, Object> d = (Map<String, Object>) details;
    System.out.println(name + ": " + d.get("score") + " - " + d.get("reason"));
});
```

```kotlin
val result = evaluator.evaluate(testCase)

println("Overall Score: ${result.score()}")
println("Passed: ${result.success()}")
println("Turn Count: ${result.metadata()["turnCount"]}")

// Per-criterion breakdown (metadata values are untyped, so these casts are unchecked)
val criterionScores = result.metadata()["criterionScores"] as Map<String, Any>
criterionScores.forEach { (name, details) ->
    val d = details as Map<String, Any>
    println("$name: ${d["score"]} - ${d["reason"]}")
}
```

## Complete Example

This puts every step together: a runnable `main` that tests a customer service chatbot end to end. Swap `myChatbot` and `openAiClient` for your own.

```java
public class CustomerServiceEvaluation {

    public static void main(String[] args) {
        // Setup judge LLM
        JudgeLM judgeLM = prompt -> openAiClient.chat(prompt);

        // Create simulated user with specific persona
        SimulatedUser user = LLMSimulatedUser.builder()
            .judge(judgeLM)
            .persona("frustrated customer who received a damaged product")
            .behaviorGuidelines("""
                - Express disappointment about the damaged item
                - Request either replacement or refund
                - Be firm but not abusive
                - Mention you've been a loyal customer
                """)
            .fixedResponses(List.of(
                "I just received my order and the item is completely damaged!"
            ))
            .build();

        // Wrap the chatbot being tested
        ConversationalApplication chatbot = trajectory -> {
            // Your chatbot implementation here
            String response = myChatbot.respond(trajectory.toText());
            return Message.assistant(response);
        };

        // Run the simulation
        ConversationTrajectory trajectory = ConversationSimulator.builder()
            .simulatedUser(user)
            .application(chatbot)
            .maxTurns(6)
            .scenario("Customer received damaged product and wants resolution")
            .build()
            .simulate();

        // Print the conversation
        System.out.println("=== Conversation ===");
        System.out.println(trajectory.toText());

        // Evaluate
        TrajectoryEvaluator evaluator = TrajectoryEvaluator.builder()
            .name("Customer Service Quality")
            .threshold(0.7)
            .judge(judgeLM)
            .criteria(List.of(
                TrajectoryEvaluationCriteria.userSatisfaction(),
                TrajectoryEvaluationCriteria.problemResolution(),
                TrajectoryEvaluationCriteria.professionalTone(),
                TrajectoryEvaluationCriteria.helpfulness()
            ))
            .aggregationStrategy(AggregationStrategy.WEIGHTED_MEAN)
            .build();

        EvalTestCase testCase = EvalTestCase.builder()
            .actualOutput("trajectory", trajectory)
            .build();

        EvalResult result = evaluator.evaluate(testCase);

        // Print the results
        System.out.println("\n=== Evaluation Results ===");
        System.out.println("Overall Score: " + String.format("%.2f", result.score()));
        System.out.println("Passed: " + result.success());
        System.out.println("Reason: " + result.reason());
    }
}
```

```kotlin
object CustomerServiceEvaluation {

    @JvmStatic
    fun main(args: Array<String>) {
        // Setup judge LLM
        val judgeLM = JudgeLM { prompt -> openAiClient.chat(prompt) }

        // Create simulated user with specific persona
        val user: SimulatedUser = llmUser(judgeLM) {
            persona = "frustrated customer who received a damaged product"
            behaviorGuidelines = """
                - Express disappointment about the damaged item
                - Request either replacement or refund
                - Be firm but not abusive
                - Mention you've been a loyal customer
                """
            fixedResponses(listOf("I just received my order and the item is completely damaged!"))
        }

        // Wrap the chatbot being tested
        val chatbot: ConversationalApplication = ConversationalApplication { trajectory ->
            // Your chatbot implementation here
            val response = myChatbot.respond(trajectory.toText())
            Message.assistant(response)
        }

        // Run the simulation
        val trajectory = simulator {
            simulatedUser = user
            application = chatbot
            maxTurns = 6
            scenario = "Customer received damaged product and wants resolution"
        }.simulate()

        // Print the conversation
        println("=== Conversation ===")
        println(trajectory.toText())

        // Evaluate
        val evaluator = trajectoryEvaluator(judgeLM) {
            name = "Customer Service Quality"
            threshold = 0.7
            criteria(listOf(
                TrajectoryEvaluationCriteria.userSatisfaction(),
                TrajectoryEvaluationCriteria.problemResolution(),
                TrajectoryEvaluationCriteria.professionalTone(),
                TrajectoryEvaluationCriteria.helpfulness()
            ))
            aggregationStrategy = AggregationStrategy.WEIGHTED_MEAN
        }

        val testCase = EvalTestCase(
            actualOutputs = mapOf("trajectory" to trajectory)
        )

        val result = evaluator.evaluate(testCase)

        // Print the results
        println("\n=== Evaluation Results ===")
        println("Overall Score: ${"%.2f".format(result.score())}")
        println("Passed: ${result.success()}")
        println("Reason: ${result.reason()}")
    }
}
```

## Best Practices

### Choose appropriate personas

Pick the persona that matches what you are testing:

- Testing how it holds up under pressure? Use `adversarialUser` or `aggressiveCustomer`.
- Testing clarity? Use `confusedUser` or `noviceUser`.
- Testing happy paths? Use `satisfiedCustomer`.

### Set realistic turn limits

Most real conversations resolve in 5 to 10 turns. A `maxTurns` that is too high wastes API calls. One that is too low cuts the chat off before it resolves.

### Use stopping conditions for efficiency

Stop the chat as soon as the goal is met, so you do not pay for extra turns.
```java
.stoppingCondition(trajectory -> {
    Message last = trajectory.lastAssistantMessage();
    return last != null && (
        last.content().contains("Is there anything else") ||
        last.content().contains("Have a great day")
    );
})
```

```kotlin
.stoppingCondition { trajectory ->
    val last = trajectory.lastAssistantMessage()
    last != null && (
        last.content().contains("Is there anything else") ||
        last.content().contains("Have a great day")
    )
}
```

### Choose the right aggregation strategy

- **WEIGHTED_MEAN**: good default. Lets you prioritize criteria by weight.
- **MIN**: every criterion must pass. Use it as a strict quality gate.
- **MEAN**: simple equal weighting.
- **MAX**: lenient. Use it sparingly.

### Test multiple scenarios

Do not test only one user type. Loop over several personas so you catch the problems each one exposes.

```java
List<SimulatedUser> personas = List.of(
    UserPersonas.aggressiveCustomer(judgeLM),
    UserPersonas.confusedUser(judgeLM),
    UserPersonas.satisfiedCustomer(judgeLM)
);

for (SimulatedUser user : personas) {
    ConversationTrajectory trajectory = ConversationSimulator.builder()
        .simulatedUser(user)
        .application(app)
        .maxTurns(8)
        .build()
        .simulate();

    EvalResult result = evaluator.evaluate(
        EvalTestCase.builder()
            .actualOutput("trajectory", trajectory)
            .build()
    );

    System.out.println(user + ": " + result.score());
}
```

```kotlin
val personas = listOf(
    UserPersonas.aggressiveCustomer(judgeLM),
    UserPersonas.confusedUser(judgeLM),
    UserPersonas.satisfiedCustomer(judgeLM)
)

personas.forEach { user ->
    val trajectory = simulator {
        simulatedUser = user
        application = app
        maxTurns = 8
    }.simulate()

    val result = evaluator.evaluate(
        EvalTestCase(
            actualOutputs = mapOf("trajectory" to trajectory)
        )
    )

    println("$user: ${result.score()}")
}
```

### Debug with trajectory JSON

When a test fails, print the full conversation to see what the assistant actually said.
```java
System.out.println(trajectory.toJson());  // Pretty-printed JSON
System.out.println(trajectory.toText());  // Human-readable transcript
```

```kotlin
println(trajectory.toJson())  // Pretty-printed JSON
println(trajectory.toText())  // Human-readable transcript
```
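When a failure comes down to the score rather than the transcript, it can help to redo the aggregation by hand. The sketch below is a worked example for the four strategies named under "Choose the right aggregation strategy". It assumes the conventional definitions (arithmetic mean, weight-normalized mean, minimum, maximum); the `Scored` record, the method names, and the sample numbers are hypothetical and not taken from Dokimos, whose exact arithmetic may differ.

```java
import java.util.List;

/** Illustrative only: plain-Java arithmetic, not the Dokimos implementation. */
final class AggregationSketch {

    /** Hypothetical per-criterion result: a score in [0, 1] and a weight. */
    record Scored(String name, double score, double weight) {}

    /** Simple average; weights are ignored. */
    static double mean(List<Scored> scores) {
        return scores.stream().mapToDouble(Scored::score).average().orElse(0.0);
    }

    /** Sum of score * weight divided by the sum of weights. */
    static double weightedMean(List<Scored> scores) {
        double weightSum = scores.stream().mapToDouble(Scored::weight).sum();
        if (weightSum == 0.0) {
            return 0.0; // avoid dividing by zero when every weight is 0
        }
        double weighted = scores.stream().mapToDouble(s -> s.score() * s.weight()).sum();
        return weighted / weightSum;
    }

    /** Strictest rule: the lowest criterion score. */
    static double min(List<Scored> scores) {
        return scores.stream().mapToDouble(Scored::score).min().orElse(0.0);
    }

    /** Most lenient rule: the highest criterion score. */
    static double max(List<Scored> scores) {
        return scores.stream().mapToDouble(Scored::score).max().orElse(0.0);
    }

    /** Made-up scores: one weak criterion, one middling, one strong and double-weighted. */
    static List<Scored> sample() {
        return List.of(
            new Scored("userSatisfaction", 0.5, 1.0),
            new Scored("professionalTone", 0.75, 1.0),
            new Scored("problemResolution", 1.0, 2.0)
        );
    }

    public static void main(String[] args) {
        List<Scored> scores = sample();
        double threshold = 0.7;
        System.out.printf("MEAN          %.4f pass=%b%n", mean(scores), mean(scores) >= threshold);
        System.out.printf("WEIGHTED_MEAN %.4f pass=%b%n", weightedMean(scores), weightedMean(scores) >= threshold);
        System.out.printf("MIN           %.4f pass=%b%n", min(scores), min(scores) >= threshold);
        System.out.printf("MAX           %.4f pass=%b%n", max(scores), max(scores) >= threshold);
    }
}
```

With these sample numbers the mean is 0.75, the weighted mean is 0.8125, the minimum is 0.5, and the maximum is 1.0. Against a 0.7 threshold, only the minimum rule fails, which shows why it works as a strict gate: one weak criterion is enough to sink the result.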