The Summarization scorer is a default LLM judge scorer that measures whether your LLM can accurately summarize text. Here, input is the original text to summarize and actual_output is the generated summary.

Required Fields

To run the Summarization scorer, you must include the following fields in your Example:

  • input
  • actual_output

Scorer Breakdown

Summarization scores are calculated by determining:

  1. Whether the summary contains information that contradicts the original text.
  2. Whether the summary covers all of the important information from the original text.

To capture these, we compute two subscores. For the contradiction score, we extract the statements in the summary and check the fraction that contradict the original text:

\text{contradiction score} = \frac{\text{Number of Contradictory Statements in Summary}}{\text{Total Number of Statements in Summary}}
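As a rough illustration (a hypothetical sketch, not the library's internal implementation), this subscore reduces to a fraction over per-statement verdicts produced by the judge model:

# Hypothetical sketch; the real scorer obtains these verdicts from an LLM judge.
def contradiction_score(contradicts: list[bool]) -> float:
    # contradicts[i] is True if summary statement i contradicts the original text
    if not contradicts:
        return 0.0
    return sum(contradicts) / len(contradicts)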

For the information score, we generate a list of important questions from the original text and check the fraction of those questions that are answered by information in the summary.

\text{information score} = \frac{\text{Number of Important Questions Answered in Summary}}{\text{Total Number of Important Questions}}
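The same kind of sketch applies here, assuming a hypothetical answered list marking which of the generated questions the summary answers:

def information_score(answered: list[bool]) -> float:
    # answered[i] is True if the summary answers important question i
    if not answered:
        return 0.0
    return sum(answered) / len(answered)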

The final score is the minimum of the two subscores:

\text{Summarization Score} = \min(\text{contradiction score}, \text{information score})
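Continuing the sketch above, combining the two subscores is then a single min:

def summarization_score(contradicts: list[bool], answered: list[bool]) -> float:
    return min(contradiction_score(contradicts), information_score(answered))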

Sample Implementation

summarization.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import SummarizationScorer

client = JudgmentClient()
example = Example(
    # Replace this with the original text to be summarized
    input="...",
    # Replace this with your LLM system's summary
    actual_output="...",
)
# Supply your own threshold; both subscores are fractions, so scores fall in [0, 1]
scorer = SummarizationScorer(threshold=0.8)

results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
print(results)