A ClassifierScorer is a powerful tool for evaluating your LLM system using natural language criteria. Classifier scorers are great for prototyping new evaluation criteria on a small set of examples before using them to benchmark your workflows at scale.

Creating a Classifier Scorer

judgeval SDK

You can create a ClassifierScorer by providing a natural language description of your evaluation task/criteria and a set of choices that an LLM judge can choose from when evaluating an example.

Specifically, you need to provide a conversation that describes the task/criteria and an options dictionary that maps each choice to a score. You can also reference Example fields in your conversation using the mustache {{variable_name}} syntax.

Here’s an example of creating a ClassifierScorer that determines if a response is friendly or not:

friendliness_scorer.py
from judgeval.scorers import ClassifierScorer

friendliness_scorer = ClassifierScorer(
    name="Friendliness Scorer",
    threshold=1.0,
    conversation=[
        {
            "role": "system", 
            "content": "Is the response positive (Y/N)? The response is: {{actual_output}}."
        }
    ],
    options={"Y": 1, "N": 0}
)

Use variables from Examples in your conversation with the mustache {{variable_name}} syntax.
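Conceptually, the mustache syntax behaves like simple template rendering: each {{variable_name}} placeholder is replaced with the corresponding field from the Example before the prompt is sent to the judge. Here's a minimal stdlib sketch of that idea (the `render_template` helper is hypothetical and for illustration only; it is not part of judgeval):

```python
import re

def render_template(template: str, example_fields: dict) -> str:
    """Fill {{variable_name}} placeholders with Example field values.
    Illustrative sketch only; judgeval performs this substitution internally."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(example_fields[m.group(1)]), template)

prompt = "Is the response positive (Y/N)? The response is: {{actual_output}}."
rendered = render_template(prompt, {"actual_output": "Happy to help!"})
print(rendered)
# Is the response positive (Y/N)? The response is: Happy to help!.
```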

Using a Classifier Scorer

Classifier scorers can be used in the same way as any other scorer in judgeval. They can also be run in conjunction with other scorers in a single evaluation run!

run_classifier_scorer.py
...  # client, example1, and friendliness_scorer defined as above

results = client.run_evaluation(
    examples=[example1],
    scorers=[friendliness_scorer],
    model="gpt-4o"
)
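Behind the scenes, the judge's chosen label is mapped through the `options` dictionary to a numeric score, which is then compared against the scorer's `threshold` to determine pass/fail. A minimal sketch of that logic (the `classifier_score` helper is hypothetical and does not reflect judgeval's actual internals):

```python
def classifier_score(choice: str, options: dict, threshold: float) -> tuple:
    """Map the judge's chosen label to its score and check it against
    the threshold. Illustrative only; not judgeval's implementation."""
    score = options[choice]
    return score, score >= threshold

# With options={"Y": 1, "N": 0} and threshold=1.0 from the friendliness scorer:
print(classifier_score("Y", {"Y": 1, "N": 0}, 1.0))  # (1, True)  -> passes
print(classifier_score("N", {"Y": 1, "N": 0}, 1.0))  # (0, False) -> fails
```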

Saving Classifier Scorers

Whether you create a ClassifierScorer via the judgeval SDK or the Judgment platform, you can save it to the Judgment platform for reuse in future evaluations.

  • If you create a ClassifierScorer via the judgeval SDK, you can save it by calling client.push_classifier_scorer().
  • Similarly, you can load a ClassifierScorer by calling client.fetch_classifier_scorer().
  • Each ClassifierScorer has a unique slug that you can use to identify it.
from judgeval import JudgmentClient

client = JudgmentClient()

# Saving a ClassifierScorer from SDK to platform
friendliness_slug = client.push_classifier_scorer(friendliness_scorer)

# Loading a ClassifierScorer from platform to SDK
# You can load any ClassifierScorer from your account by providing the slug
loaded_friendliness_scorer = client.fetch_classifier_scorer(friendliness_slug)

Real World Examples

You can find some real world examples of how our community has used ClassifierScorers to evaluate their LLM systems in our cookbook repository! Here are some of our favorites: