Evaluation Testing

Testing AI applications requires evaluating the generated content to ensure the AI model has not produced a hallucinated response.

One method to evaluate the response is to use the AI model itself for evaluation. Select the best AI model for the evaluation, which may not be the same model used to generate the response.

The Spring AI interface for evaluating responses is Evaluator, defined as:

@FunctionalInterface
public interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest evaluationRequest);
}

The input to the evaluation is the EvaluationRequest defined as

public class EvaluationRequest {

	private final String userText;

	private final List<Content> dataList;

	private final String responseContent;

	public EvaluationRequest(String userText, List<Content> dataList, String responseContent) {
		this.userText = userText;
		this.dataList = dataList;
		this.responseContent = responseContent;
	}

  ...
}

userText: The raw input from the user as a String
dataList: Contextual data, such as from Retrieval Augmented Generation, appended to the raw input.
responseContent: The AI model’s response content as a String

Relevancy Evaluator

The RelevancyEvaluator is an implementation of the Evaluator interface, designed to assess the relevance of AI-generated responses against provided context. This evaluator helps assess the quality of a RAG flow by determining if the AI model’s response is relevant to the user’s input with respect to the retrieved context.

The evaluation is based on the user input, the AI model’s response, and the context information. It uses a prompt template to ask the AI model if the response is relevant to the user input and context.

This is the default prompt template used by the RelevancyEvaluator:

Your task is to evaluate if the response for the query
is in line with the context information provided.

You have two options to answer. Either YES or NO.

Answer YES, if the response for the query
is in line with context information otherwise NO.

Query:
{query}

Response:
{response}

Context:
{context}

Answer:

You can customize the prompt template by providing your own PromptTemplate object via the .promptTemplate() builder method. See Custom Template for details.

Usage in Integration Tests

Here is an example of usage of the RelevancyEvaluator in an integration test, validating the result of a RAG flow using the RetrievalAugmentationAdvisor:

@Test
void evaluateRelevancy() {
    String question = "Where does the adventure of Anacletus and Birba take place?";

    RetrievalAugmentationAdvisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
        .documentRetriever(VectorStoreDocumentRetriever.builder()
            .vectorStore(pgVectorStore)
            .build())
        .build();

    ChatResponse chatResponse = ChatClient.builder(chatModel).build()
        .prompt(question)
        .advisors(ragAdvisor)
        .call()
        .chatResponse();

    EvaluationRequest evaluationRequest = new EvaluationRequest(
        // The original user question
        question,
        // The retrieved context from the RAG flow
        chatResponse.getMetadata().get(RetrievalAugmentationAdvisor.DOCUMENT_CONTEXT),
        // The AI model's response
        chatResponse.getResult().getOutput().getText()
    );

    RelevancyEvaluator evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

    EvaluationResponse evaluationResponse = evaluator.evaluate(evaluationRequest);

    assertThat(evaluationResponse.isPass()).isTrue();
}

You can find several integration tests in the Spring AI project that use the RelevancyEvaluator to test the functionality of the QuestionAnswerAdvisor (see tests) and RetrievalAugmentationAdvisor (see tests).

Custom Template

The RelevancyEvaluator uses a default template to prompt the AI model for evaluation. You can customize this behavior by providing your own PromptTemplate object via the .promptTemplate() builder method.

The custom PromptTemplate can use any TemplateRenderer implementation (by default, it uses StPromptTemplate based on the StringTemplate engine). The important requirement is that the template must contain the following placeholders:

a query placeholder to receive the user question.
a response placeholder to receive the AI model’s response.
a context placeholder to receive the context information.

FactCheckingEvaluator

The FactCheckingEvaluator is another implementation of the Evaluator interface, designed to assess the factual accuracy of AI-generated responses against provided context. This evaluator helps detect and reduce hallucinations in AI outputs by verifying if a given statement (claim) is logically supported by the provided context (document).

The 'claim' and 'document' are presented to the AI model for evaluation. Smaller and more efficient AI models dedicated to this purpose are available, such as Bespoke’s Minicheck, which helps reduce the cost of performing these checks compared to flagship models like GPT-4. Minicheck is also available for use through Ollama.

Usage

The FactCheckingEvaluator constructor takes a ChatClient.Builder as a parameter:

public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
  this.chatClientBuilder = chatClientBuilder;
}

The evaluator uses the following prompt template for fact-checking:

Document: {document}
Claim: {claim}

Where {document} is the context information, and {claim} is the AI model’s response to be evaluated.

Example

Here’s an example of how to use the FactCheckingEvaluator with an Ollama-based ChatModel, specifically the Bespoke-Minicheck model:

@Test
void testFactChecking() {
  // Set up the Ollama API
  OllamaApi ollamaApi = new OllamaApi("http://localhost:11434");

  ChatModel chatModel = new OllamaChatModel(ollamaApi,
				OllamaOptions.builder().model(BESPOKE_MINICHECK).numPredict(2).temperature(0.0d).build())


  // Create the FactCheckingEvaluator
  var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

  // Example context and claim
  String context = "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.";
  String claim = "The Earth is the fourth planet from the Sun.";

  // Create an EvaluationRequest
  EvaluationRequest evaluationRequest = new EvaluationRequest(context, Collections.emptyList(), claim);

  // Perform the evaluation
  EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

  assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");

}