The contextual recall scorer is a default LLM judge scorer that measures whether the retrieval_context aligns with the expected_output. In practice, this scorer helps determine whether your RAG pipeline’s retriever is effectively retrieving relevant contexts.

There are many factors to consider when evaluating the quality of your RAG pipeline. judgeval offers a suite of default scorers to construct a comprehensive evaluation of each RAG component. Check out our guide on RAG system evaluation for a deep dive!

Required Fields

To run the contextual recall scorer, you must include the following fields in your Example:

  • input
  • actual_output
  • expected_output
  • retrieval_context

Scorer Breakdown

ContextualRecall scores are calculated in two steps: an LLM judge first extracts all statements made in the expected_output, then classifies which of those statements are supported by the retrieval_context.

This scorer uses the expected_output rather than the actual_output because we're measuring the retriever, not the generator: if the retriever is performing well, every statement in the ideal answer should be recoverable from the retrieved contexts.

The score is calculated as:

$$\text{Contextual Recall} = \frac{\text{Number of Relevant Statements in Retrieval Context}}{\text{Number of Relevant Statements in Expected Output}}$$
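
To make the arithmetic concrete, here is a minimal sketch of that calculation. In judgeval, the statement extraction and support classification are performed by the LLM judge; the statements and verdicts below are hard-coded purely for illustration.

# Illustrative only: in practice, an LLM judge extracts these statements
# from expected_output and decides whether each one is supported.
expected_output_statements = [
    "Customers are eligible for a 30-day return policy.",
    "Returns are accepted no questions asked.",
]
# Judge verdicts: is each statement backed by the retrieval_context?
supported = [True, True]

contextual_recall = sum(supported) / len(expected_output_statements)
print(contextual_recall)  # 2 supported / 2 statements = 1.0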

Our contextual recall scorer is based on Stanford NLP's ARES paper (Saad-Falcon et al., 2024).

Sample Implementation

contextual_recall.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ContextualRecallScorer

client = JudgmentClient()
example = Example(
    input="What's your return policy for a pair of socks?",
    # Replace this with your LLM system's output
    actual_output="We offer a 30-day return policy for all items, including socks!",
    # Replace this with the ideal output from your RAG generator model
    expected_output="All customers are eligible for a 30-day return policy, no questions asked.",
    # Replace this with the contexts retrieved by your RAG retriever
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
)
# Supply your own threshold
scorer = ContextualRecallScorer(threshold=0.8)

results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
print(results)

The ContextualRecall scorer uses an LLM judge, so you’ll receive a reason for the score in the reason field of the results. This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
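
For instance, you might print each score alongside its reason. This is a sketch that assumes the returned results expose a scorers_data list whose entries carry name, score, and reason attributes; adjust the attribute names if your judgeval version differs.

# Sketch: inspect the judge's reasoning for each scorer.
# Assumes result objects expose `scorers_data` with `name`, `score`,
# and `reason` attributes; check your judgeval version if the shape differs.
for result in results:
    for scorer_data in result.scorers_data:
        print(f"{scorer_data.name}: {scorer_data.score}")
        print(f"Reason: {scorer_data.reason}")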