Introduction
Overview
Evaluation is the process of scoring an LLM system’s outputs with metrics; an evaluation is composed of:
- An evaluation dataset
- Metrics we are interested in tracking
Examples
In `judgeval`, an `Example` is a unit of data that allows you to use evaluation scorers on your LLM system.
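Here is a minimal sketch of constructing an `Example` for a RAG chatbot (the field values are illustrative, and the import path is an assumption based on `judgeval`’s quickstart pattern):

```python
from judgeval.data import Example  # assumed import path

# Illustrative values for a RAG chatbot interaction
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=[
        "All customers are eligible for a 30-day full refund at no extra cost."
    ],
)
```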
In this example, `input` represents a user talking with a RAG-based LLM application, where `actual_output` is the output of your chatbot and `retrieval_context` is the retrieved context.
There are many fields in an `Example` that can be used in an evaluation.
To learn more about the `Example` class, see the `Example` docs.
Creating an `Example` allows you to evaluate using `judgeval`’s default scorers:
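For instance, continuing from the `Example` above, here is a sketch of scoring it with a built-in scorer (the `JudgmentClient.run_evaluation` call, the scorer name, and the model string are assumptions drawn from `judgeval`’s quickstart pattern):

```python
from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer  # assumed built-in scorer

client = JudgmentClient()

# Score the Example with a default scorer; threshold is the passing cutoff
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",  # illustrative judge model
)
print(results)
```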
Datasets
An Evaluation Dataset is a collection of Examples. It provides an interface for running scaled evaluations of your LLM system using one or more scorers.
`EvalDataset`s can be saved to or loaded from disk in `csv` and `json` format, or uploaded to the Judgment platform.
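As a sketch, a dataset can be built from `Example`s and then persisted or uploaded (the `EvalDataset` import path and the `save_as` / `push_dataset` method names are assumptions; see the EvalDataset docs referenced below for the exact interface):

```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.data.datasets import EvalDataset  # assumed import path

# Collect Examples (such as the one created earlier) into a dataset
dataset = EvalDataset(examples=[example])

# Save to disk as json (csv is also supported) -- method name is an assumption
dataset.save_as("json", "datasets/")

# Or upload to the Judgment platform -- method name is an assumption
client = JudgmentClient()
client.push_dataset(alias="my_dataset", dataset=dataset)
```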
For more information on how to use `EvalDataset`s, please see the EvalDataset docs.
Then, you can run evaluations on the dataset:
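Continuing the sketch above, one way to do this is to pass the dataset’s examples to the same evaluation call (again assuming the `run_evaluation` interface, the scorer names, and that `EvalDataset` exposes its `examples`):

```python
from judgeval.scorers import FaithfulnessScorer, AnswerRelevancyScorer  # assumed scorer names

# Score every Example in the dataset with one or more scorers
results = client.run_evaluation(
    examples=dataset.examples,  # assumes the dataset exposes its Examples
    scorers=[
        FaithfulnessScorer(threshold=0.5),
        AnswerRelevancyScorer(threshold=0.5),
    ],
    model="gpt-4o",
)
```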
Metrics
`judgeval` comes with a set of 10+ built-in evaluation metrics. These metrics are accessible through `judgeval`’s `Scorer` interface.
Every `Scorer` has a `threshold` parameter that you can use in the context of unit testing your app.
You can use scorers to evaluate your LLM system’s outputs by using `Example`s.
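As a sketch of the unit-testing use, a scorer’s `threshold` sets the minimum passing score, and the evaluation can be asserted on inside a test (the `assert_test` helper and the scorer name are assumptions; `client` and `example` come from the snippets above):

```python
from judgeval.scorers import AnswerRelevancyScorer  # assumed built-in scorer

# threshold is the pass/fail cutoff for the metric score
scorer = AnswerRelevancyScorer(threshold=0.7)

def test_answer_relevancy():
    # Fails the test if the Example scores below the threshold
    # (assert_test is an assumed unit-testing helper on the client)
    client.assert_test(
        examples=[example],
        scorers=[scorer],
        model="gpt-4o",
    )
```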
We’re always working on adding new scorers, so if you have a metric you’d like to add, please let us know!
Congratulations! 🎉
You’ve learned the basics of building and running evaluations with `judgeval`.
For a deep dive into all the metrics you can run using `judgeval` scorers, see the scorer docs.