Judges
Judges are LLMs used to evaluate a component of your LLM system. judgeval's LLM judge scorers, such as AnswerRelevancyScorer, use judge models to execute evaluations.
A good judge model should evaluate your LLM system's performance with high consistency and strong alignment with human preferences. judgeval allows you to pick from a variety of leading judge models, or you can use your own custom judge!
LiteLLM Judge Models
judgeval supports all models available through the LiteLLM API. This includes popular closed-source model families such as OpenAI (GPT), Anthropic (Claude), and Gemini.
To use a LiteLLM judge model, simply pass the model name to the model parameter in client.run_evaluation():
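For instance, selecting GPT-4o as the judge might look like the following sketch (the example data and threshold are illustrative, and the exact import paths are assumptions based on judgeval's public API):

```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()

example = Example(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)

# Any LiteLLM model name can be passed here, e.g. "gpt-4o"
# or "claude-3-5-sonnet-20241022".
results = client.run_evaluation(
    examples=[example],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4o",
)
```

Running this requires valid Judgment and model-provider API keys to be configured in your environment.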
Open Source Judge Models
In addition to LiteLLM judge models, judgeval supports a variety of popular open-source judge models via TogetherAI inference, including the Llama, DeepSeek, Qwen, and Mistral families (and more!).
To use an open-source judge model, simply pass the model name to the model parameter in client.run_evaluation():
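The call is identical to the LiteLLM case; only the model string changes. A hedged sketch, assuming the same client API as above and a TogetherAI-served Llama model name for illustration:

```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()

example = Example(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)

# TogetherAI-served open-source model name (illustrative).
results = client.run_evaluation(
    examples=[example],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
)
```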
Use Your Own Judge Model
If you have a custom model you'd like to use as a judge, such as a finetuned gpt-4o-mini, you can use it in your JudgevalScorer evaluations. Simply inherit from the judgevalJudge class and implement the following methods:
- __init__(): sets the model_name (str) and model attributes.
- load_model(): loads the model.
- generate(): generates a response from the model given a conversation history (List[dict]).
- a_generate(): generates a response from the model asynchronously given a conversation history (List[dict]).
- get_model_name(): returns the model name.
Here’s an example of implementing a custom judge model for Gemini Flash 1.5: