Custom Scorers
If none of judgeval
’s built-in scorers fit your evaluation criteria, you can easily build your own custom metric to be run through a JudgevalScorer
.
JudgevalScorer
s are automatically integrated within judgeval
’s infrastructure, so you can:
- Run your own scorer with the same syntax as any other
judgeval
scorer. - Use
judgeval
’s batched evaluation infrastructure to execute scalable evaluation runs. - Have your scorer’s results be viewed and analyzed in the Judgment platform.
Be creative with your custom scorers! You can measure anything you want using a JudgevalScorer
,
including using evaluations that aren’t LLM judge-based such as ROUGE or embedding similarity.
Guidelines for Implementing Custom Scorers
To implement your own custom scorer, you must:
1. Inherit from the JudgevalScorer
class
This will help judgeval
integrate your scorer into evaluation runs.
2. Implement the __init__()
method
JudgevalScorer
s have some required attributes that must be determined in the __init__()
method.
For instance, you must set a threshold
to determine what constitutes success/failure for a scorer.
There are additional optional attributes that can be set here for even more flexibility:
score_type (str)
: The name of your scorer. This will be displayed in the Judgment platform.include_reason (bool)
: Whether your scorer includes a reason for the score in the results. Only for LLM judge-based scorers.async_mode (bool)
: Whether your scorer should be run asynchronously during evaluations.strict_mode (bool)
: Whether your scorer fails if the score is not perfect (1.0).verbose_mode (bool)
: Whether your scorer produces verbose logs.custom_example (bool)
: Whether your scorer should be run on custom examples.
3. Implement the score_example()
and a_score_example()
methods
The score_example()
and a_score_example()
methods take an Example
object and execute your scorer to produce a score (float) between 0 and 1.
Optionally, you can include a reason to accompany the score if applicable (e.g. for LLM judge-based scorers).
The only requirement for score_example()
and a_score_example()
is that they:
- Take an
Example
as an argument (you can add other arguments too) - Set the
self.score
attribute - Set the
self.success
attribute
You can optionally set the self.reason attribute, depending on your preference.
a_score_example()
is simply the async version of score_example()
, so the implementation should largely be identical.
These methods are the core of your scorer, and you can implement them in any way you want. Be creative!
Handling Errors
If you want to handle errors gracefully, you can use a try
block and in the except
block, set the self.error
attribute to the error message.
This will allow judgeval
to catch the error but still execute the rest of an evaluation run, assuming you have multiple examples to evaluate.
Here’s a sample implementation that integrates everything we’ve covered:
4. Implement the _success_check()
method
When executing an evaluation run, judgeval
will check if your scorer has passed the _success_check()
method.
You can implement this method in any way you want, but it should return a bool
. Here’s a perfectly valid implementation:
5. Give your scorer a name
This is so that when displaying your scorer’s results in the Judgment platform, you can easily sort by and find your scorer.
Congratulations! 🎉
You’ve made your first custom judgeval scorer! Now that your scorer is implemented, you can run it on your own datasets
just like any other judgeval
scorer. Your scorer is fully integrated with judgeval
’s infrastructure so you can view it on
the Judgment platform too.
Using a Custom Scorer
Once you’ve implemented your custom scorer, you can use it in the same way as any other scorer in judgeval
.
They can be run in conjunction with other scorers in a single evaluation run!
Custom Scorers with Custom Examples
If you want to use a custom scorer with a custom example, you can do so by passing the custom scorer and custom example to the run_evaluation()
method.
Make sure to set the custom_example
attribute to True
in the __init__()
method of your custom scorer.
Real World Examples
You can find some real world examples of how our community has used custom JudgevalScorer
s to evaluate their LLM systems in our cookbook repository!
Here are some of our favorites:
- Code Style Scorer - Evaluates code quality and style
- Cold Email Scorer - Evaluates the effectiveness of cold emails
For more examples and detailed documentation on custom scorers, check out our Custom Scorers Cookbook.