The contextual relevancy scorer is a default LLM judge scorer that measures how relevant the contexts in retrieval_context are to a given input. In practice, this scorer helps you determine whether your RAG pipeline's retriever is actually retrieving contexts relevant to a query.

Required Fields

To run the contextual relevancy scorer, you must include the following fields in your Example:

  • input
  • actual_output
  • retrieval_context

Scorer Breakdown

ContextualRelevancy scores are calculated by first extracting all statements from retrieval_context and then classifying which of them are relevant to the input.

The score is then calculated as:

\text{Contextual Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}}
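For example, if the judge extracts four statements and marks three of them as relevant, the score is 3/4 = 0.75. The sketch below mocks the two LLM steps (statement extraction and relevance classification) with hard-coded outputs to show the arithmetic; the statements and labels are hypothetical, not produced by the judgeval API.

# Hypothetical sketch of the scoring arithmetic -- the real scorer uses an
# LLM judge for both steps; the statements and labels here are illustrative.

# Step 1 (mocked): statements the judge might extract from retrieval_context
statements = [
    "All items can be returned within 30 days.",
    "Returns receive a full refund.",
    "No questions are asked for returns.",
    "The store was founded in 1998.",
]

# Step 2 (mocked): relevance labels the judge might assign for the input
# "What's your return policy for a pair of socks?"
relevant = [True, True, True, False]

# Final score: number of relevant statements / total number of statements
score = sum(relevant) / len(statements)
print(score)  # 0.75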

Our contextual relevancy scorer is based on Stanford NLP’s ARES paper (Saad-Falcon et al., 2024).

Sample Implementation

contextual_relevancy.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ContextualRelevancyScorer

client = JudgmentClient()
example = Example(
    input="What's your return policy for a pair of socks?",
    # Replace this with your LLM system's output
    actual_output="We offer a 30-day return policy for all items, including socks!",
    # Replace this with the contexts retrieved by your RAG retriever
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
)
# Set your own threshold; an example passes when its score meets or exceeds it
scorer = ContextualRelevancyScorer(threshold=0.8)

results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
print(results)

The ContextualRelevancy scorer uses an LLM judge, so you’ll receive a reason for the score in the reason field of the results. This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
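To inspect the scores and reasons programmatically, you can iterate over the returned results. The attribute names below (scorers_data, score, reason) are assumptions based on judgeval's result objects and may differ across versions:

# Hypothetical sketch -- attribute names are assumptions and may vary by version
for result in results:
    for scorer_data in result.scorers_data:
        print(f"Score: {scorer_data.score}")
        print(f"Reason: {scorer_data.reason}")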