Faithfulness

The Faithfulness scorer is a default LLM judge scorer that measures how factually aligned the actual_output is to the retrieval_context. In practice, this scorer helps determine the degree to which your RAG pipeline’s generator is hallucinating.

For optimal Faithfulness scoring, check out our leading evaluation foundation model research here!

The Faithfulness scorer is similar to but not identical to the Hallucination scorer.

Faithfulness is concerned with contradictions between the actual_output and retrieval_context, while Hallucination is concerned with context. If you’re building an app with a RAG pipeline, you should try the Faithfulness scorer first.

Required Fields

To run the Faithfulness scorer, you must include the following fields in your Example:

input
actual_output
retrieval_context

Scorer Breakdown

Faithfulness scores are calculated by first extracting all statements in actual_output and then classifying which ones are contradicted by the retrieval_context. A claim is considered faithful if it does not contradict any information in retrieval_context.

The score is calculated as:

\text{Faithfulness} = \frac{\text{Number of Faithful Statements}}{\text{Total Number of Statements}}

Sample Implementation

faithfulness.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()
example = Example(
    input="What's your return policy for a pair of socks?",
    # Replace this with your LLM system's output
    actual_output="We offer a 30-day return policy for all items, including socks!",
    # Replace this with the contexts retrieved by your RAG retriever
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
)
# supply your own threshold
scorer = FaithfulnessScorer(threshold=0.8)

results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
print(results)

The Faithfulness scorer uses an LLM judge, so you’ll receive a reason for the score in the reason field of the results. This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.

Welcome!

Evaluation (Experiments)

Monitoring

Integrations

Alerts

API Reference

Required Fields

Scorer Breakdown

Sample Implementation

Welcome!

Evaluation (Experiments)

Monitoring

Integrations

Alerts

API Reference

​Required Fields

​Scorer Breakdown

​Sample Implementation

Required Fields

Scorer Breakdown

Sample Implementation