Contextual Recall
The contextual recall scorer is a default LLM judge scorer that measures whether the retrieval_context aligns with the expected_output.
In practice, this scorer helps you determine whether your RAG pipeline's retriever is effectively retrieving relevant contexts.
There are many factors to consider when evaluating the quality of your RAG pipeline. judgeval offers a suite of default scorers for constructing a comprehensive evaluation of each RAG component. Check out our guide on RAG system evaluation for a deep dive!
Required Fields
To run the contextual recall scorer, you must include the following fields in your Example:
- input
- actual_output
- expected_output
- retrieval_context
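As a rough sketch of what those four fields carry, the snippet below assembles them into a plain dataclass. The dataclass here is illustrative only, not judgeval's actual Example class; the field names mirror the list above.

```python
from dataclasses import dataclass

# Illustrative stand-in for an evaluation example; the field names
# match the required fields listed above.
@dataclass
class Example:
    input: str              # the user query sent to the RAG pipeline
    actual_output: str      # what the pipeline actually generated
    expected_output: str    # the ideal/reference answer
    retrieval_context: list # documents returned by the retriever

example = Example(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris is the capital of France.",
    retrieval_context=["Paris is the capital and largest city of France."],
)
```

The scorer compares the expected_output against the retrieval_context, so both must be populated for the evaluation to run.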
Scorer Breakdown
ContextualRecall scores are calculated by first extracting all statements made in the expected_output, then classifying which of those statements are backed up by the retrieval_context.
This scorer uses the expected_output rather than the actual_output because we're interested in whether the retriever is performing well.
The score is calculated as:

Contextual Recall = (number of statements in the expected_output supported by the retrieval_context) / (total number of statements in the expected_output)
Our contextual recall scorer is based on Stanford NLP's ARES paper (Saad-Falcon et al., 2024).
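The breakdown above reduces to a simple ratio once each statement has been judged. A minimal sketch, with the classification step stubbed out as precomputed booleans (a real scorer would obtain these from an LLM judge):

```python
def contextual_recall(statement_judgments: list[bool]) -> float:
    """Fraction of expected_output statements backed by the retrieval_context.

    `statement_judgments` holds one boolean per statement extracted from
    the expected_output: True if the judge found that statement supported
    by the retrieval_context.
    """
    if not statement_judgments:
        return 0.0
    return sum(statement_judgments) / len(statement_judgments)

# Three statements extracted from an expected_output; two are supported.
print(contextual_recall([True, True, False]))  # 2/3 ≈ 0.667
```

A score of 1.0 means every statement in the reference answer could have been produced from the retrieved contexts, which is the signal that the retriever surfaced everything it needed to.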
Sample Implementation
The ContextualRecall scorer uses an LLM judge, so you'll receive a reason for the score in the reason field of the results. This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
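To illustrate how a score and its reason fit together, here is a hedged sketch that assembles both from per-statement judgments. The function name, result shape, and reason wording are assumptions for illustration, not judgeval's actual return type.

```python
def build_recall_result(judgments: list[tuple[str, bool]]) -> dict:
    """Assemble a score and a human-readable reason from per-statement
    judgments: (statement, supported_by_retrieval_context) pairs.
    Illustrative only; not judgeval's actual result object."""
    supported = [s for s, ok in judgments if ok]
    missing = [s for s, ok in judgments if not ok]
    score = len(supported) / len(judgments) if judgments else 0.0
    reason = (
        f"{len(supported)}/{len(judgments)} statements in the expected_output "
        f"are backed by the retrieval_context."
    )
    if missing:
        reason += " Unsupported: " + "; ".join(missing)
    return {"score": score, "reason": reason}

result = build_recall_result([
    ("Paris is the capital of France.", True),
    ("It has a population of 67 million.", False),
])
print(result["score"])   # 0.5
print(result["reason"])
```

Inspecting the reason alongside the score makes it easy to spot which reference statements the retriever failed to cover, which is usually the fastest path to diagnosing retrieval gaps.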