The contextual recall scorer is a default LLM judge scorer that measures whether the retrieval_context aligns with the expected_output. In practice, this scorer helps determine whether your RAG pipeline’s retriever is effectively retrieving relevant contexts.

There are many factors to consider when evaluating the quality of your RAG pipeline. judgeval offers a suite of default scorers to construct a comprehensive evaluation of each RAG component. Check out our guide on RAG system evaluation for a deep dive!

Required Fields

To run the contextual recall scorer, you must include the following fields in your Example:

  • input
  • actual_output
  • expected_output
  • retrieval_context

Scorer Breakdown

ContextualRecall scores are calculated in two steps: an LLM judge first extracts all statements made in the expected_output, then classifies which of those statements are supported by the retrieval_context.

This scorer uses the expected_output rather than the actual_output because we're measuring the retriever, not the generator: if the retriever is performing well, every statement in the ideal answer should be recoverable from the retrieved contexts.

The score is calculated as:

$$\text{Contextual Recall} = \frac{\text{Number of Relevant Statements in Retrieval Context}}{\text{Number of Relevant Statements in Expected Output}}$$
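
To make the arithmetic concrete, here is a minimal sketch of that calculation. In judgeval, the statement extraction and support classification are performed by the LLM judge; the statements and verdicts below are hard-coded purely for illustration.

# Illustrative only: in practice, an LLM judge extracts these statements
# from expected_output and decides whether each one is supported.
expected_output_statements = [
    "Customers are eligible for a 30-day return policy.",
    "Returns are accepted no questions asked.",
]
# Judge verdicts: is each statement backed by the retrieval_context?
supported = [True, True]

contextual_recall = sum(supported) / len(expected_output_statements)
print(contextual_recall)  # 2 supported / 2 statements = 1.0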

Our contextual recall scorer is based on Stanford NLP's ARES paper (Saad-Falcon et al., 2024).

Sample Implementation

contextual_recall.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ContextualRecallScorer

client = JudgmentClient()
example = Example(
    input="What's your return policy for a pair of socks?",
    # Replace this with your LLM system's output
    actual_output="We offer a 30-day return policy for all items, including socks!",
    # Replace this with the ideal output from your RAG generator model
    expected_output="All customers are eligible for a 30-day return policy, no questions asked.",
    # Replace this with the contexts retrieved by your RAG retriever
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
)
# Supply your own threshold
scorer = ContextualRecallScorer(threshold=0.8)

results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
print(results)

The ContextualRecall scorer uses an LLM judge, so you’ll receive a reason for the score in the reason field of the results. This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
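
For instance, you might print each score alongside its reason. This is a sketch that assumes the returned results expose a scorers_data list whose entries carry name, score, and reason attributes; adjust the attribute names if your judgeval version differs.

# Sketch: inspect the judge's reasoning for each scorer.
# Assumes result objects expose `scorers_data` with `name`, `score`,
# and `reason` attributes; check your judgeval version if the shape differs.
for result in results:
    for scorer_data in result.scorers_data:
        print(f"{scorer_data.name}: {scorer_data.score}")
        print(f"Reason: {scorer_data.reason}")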