Evaluation Testing

Testing AI applications requires evaluating the generated content to ensure the AI model has not produced a hallucinated response.

One method to evaluate the response is to use the AI model itself for evaluation. Select the best AI model for the evaluation, which may not be the same model used to generate the response.

The Spring AI interface for evaluating responses is Evaluator, defined as:

@FunctionalInterface
public interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest evaluationRequest);
}
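Because the interface declares a single method, an evaluator can be supplied as a lambda. The following is a minimal sketch of that idea, not Spring AI code: SketchRequest, SketchResponse, and SketchEvaluator are hypothetical, simplified stand-ins introduced here so that the example compiles without any dependencies, and the "non-empty response" rule is only an illustration.

```java
import java.util.List;

// Hypothetical, simplified stand-ins for the Spring AI types (assumptions, not the real classes).
record SketchRequest(String userText, List<String> dataList, String responseContent) {}

record SketchResponse(boolean pass, String feedback) {}

@FunctionalInterface
interface SketchEvaluator {
    SketchResponse evaluate(SketchRequest request);
}

class EvaluatorLambdaSketch {

    // A trivial rule-based evaluator: it passes when the response is neither null nor blank.
    static final SketchEvaluator NON_EMPTY = request -> {
        String content = request.responseContent();
        boolean pass = content != null && !content.isBlank();
        return new SketchResponse(pass, pass ? "" : "Response is empty");
    };

    public static void main(String[] args) {
        SketchResponse result = NON_EMPTY.evaluate(
                new SketchRequest("What is the purpose of Carina?", List.of(), "A placeholder answer."));
        System.out.println(result.pass()); // prints "true"
    }
}
```

A model-backed evaluator has the same shape; only the body of the lambda (or class) changes.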

The input to the evaluation is the EvaluationRequest, defined as:

public class EvaluationRequest {

	private final String userText;

	private final List<Content> dataList;

	private final String responseContent;

	public EvaluationRequest(String userText, List<Content> dataList, String responseContent) {
		this.userText = userText;
		this.dataList = dataList;
		this.responseContent = responseContent;
	}

  ...
}
  • userText: The raw input from the user, as a String.

  • dataList: Contextual data, such as from Retrieval Augmented Generation, appended to the raw input.

  • responseContent: The AI model’s response content, as a String.

RelevancyEvaluator

One implementation is the RelevancyEvaluator, which uses the AI model for evaluation. More implementations will be available in future releases.

The RelevancyEvaluator uses the input (userText), the AI model’s output (responseContent), and the contextual data (dataList) to ask the evaluating model the question:

Your task is to evaluate if the response for the query
is in line with the context information provided.\n
You have two options to answer. Either YES/ NO.\n
Answer - YES, if the response for the query
is in line with context information otherwise NO.\n
Query: \n {query}\n
Response: \n {response}\n
Context: \n {context}\n
Answer:
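To make the role of the three placeholders concrete, the sketch below fills an abbreviated copy of the template and maps the model's YES/NO answer to a pass/fail result. It is an illustration under assumptions, not the RelevancyEvaluator source: the class and method names are hypothetical, the context documents are assumed to be joined with line separators, and an answer is assumed to count as a pass when it starts with "yes" (case-insensitive).

```java
import java.util.List;
import java.util.Locale;

class RelevancyPromptSketch {

    // Abbreviated template: only the placeholder lines of the prompt quoted above.
    static final String TEMPLATE =
            "Query: \n {query}\n" + "Response: \n {response}\n" + "Context: \n {context}\n" + "Answer: ";

    // Naive placeholder substitution; a real template engine would handle escaping.
    static String render(String query, String response, List<String> contextDocs) {
        String context = String.join(System.lineSeparator(), contextDocs);
        return TEMPLATE.replace("{query}", query)
                .replace("{response}", response)
                .replace("{context}", context);
    }

    // Assumption: any answer beginning with "yes" counts as a pass.
    static boolean isPass(String modelAnswer) {
        return modelAnswer != null && modelAnswer.trim().toLowerCase(Locale.ROOT).startsWith("yes");
    }
}
```

Here render corresponds to building the prompt from an EvaluationRequest, and isPass to the value that a test would later read through evaluationResponse.isPass().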

Here is an example of a JUnit test that performs a RAG query over a PDF document loaded into a Vector Store and then evaluates if the response is relevant to the user text.

@Test
void testEvaluation() {

    dataController.delete();
    dataController.load();

    String userText = "What is the purpose of Carina?";

    ChatResponse response = ChatClient.builder(chatModel)
            .build().prompt()
            .advisors(new QuestionAnswerAdvisor(vectorStore))
            .user(userText)
            .call()
            .chatResponse();
    String responseContent = response.getResult().getOutput().getContent();

    var relevancyEvaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

    EvaluationRequest evaluationRequest = new EvaluationRequest(userText,
            (List<Content>) response.getMetadata().get(QuestionAnswerAdvisor.RETRIEVED_DOCUMENTS), responseContent);

    EvaluationResponse evaluationResponse = relevancyEvaluator.evaluate(evaluationRequest);

    assertTrue(evaluationResponse.isPass(), "Response is not relevant to the question");

}

The code above is from the example application located here.
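The test casts the metadata entry to List<Content> without checking it, so a missing entry only surfaces later as a null context. A defensive lookup could be sketched as follows. This is an assumption-laden illustration: the metadata is modeled as a plain Map<String, Object>, and RetrievedDocumentsSketch and retrievedOrEmpty are hypothetical names, not part of Spring AI.

```java
import java.util.List;
import java.util.Map;

class RetrievedDocumentsSketch {

    // Returns the list stored under the key, or an empty list when the entry is missing or is not a list.
    static List<?> retrievedOrEmpty(Map<String, Object> metadata, String key) {
        Object value = metadata.get(key);
        return (value instanceof List<?> list) ? list : List.of();
    }
}
```

With a helper like this, a test can assert that the returned list is non-empty before building the EvaluationRequest, which gives a clearer failure message than a failed relevancy check.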

FactCheckingEvaluator

The FactCheckingEvaluator is another implementation of the Evaluator interface, designed to assess the factual accuracy of AI-generated responses against provided context. This evaluator helps detect and reduce hallucinations in AI outputs by verifying if a given statement (claim) is logically supported by the provided context (document).

The 'claim' and 'document' are presented to the AI model for evaluation. Smaller and more efficient AI models dedicated to this purpose are available, such as Bespoke’s Minicheck, which helps reduce the cost of performing these checks compared to flagship models like GPT-4. Minicheck is also available for use through Ollama.

Usage

The FactCheckingEvaluator constructor takes a ChatClient.Builder as a parameter:

public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
  this.chatClientBuilder = chatClientBuilder;
}

The evaluator uses the following prompt template for fact-checking:

Document: {document}
Claim: {claim}

Where {document} is the context information, and {claim} is the AI model’s response to be evaluated.
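The flow from template to verdict can be sketched end to end with a stubbed model. This is not the FactCheckingEvaluator implementation: FactCheckSketch is a hypothetical class, the Function<String, String> stands in for the chat model behind ChatClient.Builder, and the rule that a "yes" answer means the claim is supported is an assumption.

```java
import java.util.Locale;
import java.util.function.Function;

class FactCheckSketch {

    // Same two placeholders as the fact-checking template quoted above.
    static final String TEMPLATE = "Document: {document}\nClaim: {claim}";

    // Hypothetical stand-in for the chat model: takes the rendered prompt, returns the raw answer.
    private final Function<String, String> model;

    FactCheckSketch(Function<String, String> model) {
        this.model = model;
    }

    // Naive placeholder substitution, sufficient for illustration.
    static String render(String document, String claim) {
        return TEMPLATE.replace("{document}", document).replace("{claim}", claim);
    }

    // Assumption: the model answers "yes" when the document supports the claim.
    boolean isSupported(String document, String claim) {
        String answer = model.apply(render(document, claim));
        return answer != null && answer.trim().toLowerCase(Locale.ROOT).startsWith("yes");
    }
}
```

Passing the model in through the constructor mirrors the way the evaluator receives a ChatClient.Builder, and lets a unit test substitute a stub such as prompt -> "No" for a live model.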

Example

Here’s an example of how to use the FactCheckingEvaluator with an Ollama-based ChatModel, specifically the Bespoke-Minicheck model:

@Test
void testFactChecking() {
  // Set up the Ollama API
  OllamaApi ollamaApi = new OllamaApi("http://localhost:11434");

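  // BESPOKE_MINICHECK is assumed to be a String constant, defined elsewhere in the test class,
  // that holds the Ollama model name
  ChatModel chatModel = new OllamaChatModel(ollamaApi,
      OllamaOptions.builder().model(BESPOKE_MINICHECK).numPredict(2).temperature(0.0d).build());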

  // Create the FactCheckingEvaluator
  var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

  // Example context and claim
  String context = "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.";
  String claim = "The Earth is the fourth planet from the Sun.";

  // Create an EvaluationRequest
  EvaluationRequest evaluationRequest = new EvaluationRequest(context, Collections.emptyList(), claim);

  // Perform the evaluation
  EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

  assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");

}