Examples
Overview
An `Example` is a basic unit of data in `judgeval` that allows you to run evaluation scorers on your LLM system.
An `Example` can be composed of a mixture of the following fields:
- `input`
- `actual_output`
- `expected_output` [Optional]
- `retrieval_context` [Optional]
- `context` [Optional]
Here’s a sample of creating an `Example`:
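The snippet below is a minimal sketch; the `judgeval.data` import path is assumed, and the field values are illustrative.

```python
from judgeval.data import Example  # import path may vary by judgeval version

example = Example(
    input="What is your return policy for opened electronics?",
    actual_output="You can return opened electronics within 30 days with a receipt.",
    expected_output="Opened electronics can be returned within 30 days of purchase.",
    retrieval_context=[
        "Return policy: opened electronics may be returned within 30 days with proof of purchase."
    ],
)
```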
The `input` and `actual_output` fields are required for all examples. However, you don’t need to use them in your evaluations. For example, if you’re evaluating whether a chatbot’s response is friendly, you don’t need to use `input`.
Other fields are optional and depend on the type of evaluation. If you want to detect hallucinations in a RAG system, you’d use the `retrieval_context` field for the Faithfulness scorer.
Example Fields
Here, we cover the possible fields that make up an `Example`.
Input
The `input` field represents a sample interaction between a user and your LLM system. The input should represent the direct input to your prompt template(s), and SHOULD NOT CONTAIN your prompt template itself.
Prompt templates are hyperparameters that you optimize based on the scorer you’re executing. Evaluation is always tied to optimization, so you should try to isolate your system’s independent variables (e.g. prompt template, model, etc.) from your evaluation.
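For instance, if your system wraps user questions in a prompt template, only the raw question belongs in `input`. The template and values below are hypothetical:

```python
from judgeval.data import Example  # import path may vary by judgeval version

# Hypothetical prompt template: a hyperparameter of your system, not part of the Example
PROMPT_TEMPLATE = "You are a helpful support agent. Answer the user's question:\n{question}"

user_question = "How do I reset my password?"

# The Example stores only the raw user input...
example = Example(
    input=user_question,
    actual_output="Go to Settings > Account > Reset Password and follow the emailed link.",
)

# ...while the prompt template is applied separately when calling your LLM system.
prompt = PROMPT_TEMPLATE.format(question=user_question)
```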
Actual Output
The `actual_output` field represents what your LLM system outputs based on the `input`. This is the actual output of your LLM system, created either at evaluation time or with saved answers.
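As a sketch, you might populate `actual_output` at evaluation time by calling your own system; `generate_answer` below is a hypothetical stand-in for your LLM pipeline:

```python
from judgeval.data import Example  # import path may vary by judgeval version

def generate_answer(question: str) -> str:
    # Hypothetical stand-in for your LLM system (RAG pipeline, agent, etc.)
    return "Our premium plan costs $20 per month."

question = "How much does the premium plan cost?"
example = Example(
    input=question,
    actual_output=generate_answer(question),  # produced at evaluation time
)
```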
Expected Output
The `expected_output` field is `Optional[str]` and represents the ideal output of your LLM system.
One great part of `judgeval`’s scorers is that they use LLMs, which allow for flexible evaluation criteria. You don’t need to worry about your `expected_output` perfectly matching your `actual_output`!
To learn more about how `judgeval`’s scorers work, please see the scorer docs.
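For instance, the two outputs below phrase the same answer differently, which an LLM-based scorer can still treat as a match; the values are illustrative:

```python
from judgeval.data import Example  # import path may vary by judgeval version

example = Example(
    input="When was the Eiffel Tower completed?",
    actual_output="The Eiffel Tower was finished in 1889.",
    expected_output="It was completed in 1889.",  # different wording, same meaning
)
```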
Context
The `context` field is `Optional[List[str]]` and represents information that is supplied to the LLM system as ground truth.
For instance, context could be a list of facts that the LLM system is aware of. However, `context` should not be confused with `retrieval_context`.
In RAG systems, contextual information is retrieved from a vector database and is represented in `judgeval` by `retrieval_context`, not `context`. If you’re building a RAG system, you’ll want to use `retrieval_context`.
Retrieval Context
The `retrieval_context` field is `Optional[List[str]]` and represents the context that is actually retrieved from a vector database. This is often the context that is used to generate the `actual_output` in a RAG system.
Some common cases for using `retrieval_context` are:
- Checking for hallucinations in a RAG system
- Evaluating the quality of a retriever model by comparing `retrieval_context` to `context` (see the sketch after this list)
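Below is a minimal sketch of an `Example` for a RAG system that sets both fields; the `judgeval.data` import path is assumed and the document strings are illustrative:

```python
from judgeval.data import Example  # import path may vary by judgeval version

example = Example(
    input="What is the warranty period for the X200 laptop?",
    actual_output="The X200 laptop comes with a two-year warranty.",
    # Ground-truth facts the system is expected to know
    context=["All X200 laptops are covered by a 24-month manufacturer warranty."],
    # Chunks actually returned by your vector database at generation time
    retrieval_context=[
        "Warranty policy: the X200 series includes a 24-month manufacturer warranty.",
        "Shipping: orders over $50 qualify for free shipping.",
    ],
)
```

A faithfulness-style scorer can then check whether `actual_output` is supported by `retrieval_context`, while comparing `retrieval_context` against `context` gauges how well the retriever surfaced the ground-truth facts.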
Custom Examples
Custom Examples are a way to evaluate your LLM system with more complex data structures. A Custom Example is composed of the same fields as an `Example`, but the fields are dictionaries instead of strings.
Custom Examples are designed to be used with our Custom Scorers. See the Custom Scorers documentation for more information.
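As an illustration only (the `CustomExample` class name and import path below are assumptions; see the Custom Scorers documentation for the exact API), the fields carry dictionaries rather than plain strings:

```python
from judgeval.data import CustomExample  # assumed name/import; check the Custom Scorers docs

example = CustomExample(
    input={"question": "What is the capital of France?", "locale": "en-US"},
    actual_output={"answer": "Paris", "confidence": 0.97},
)
```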
Conclusion
Congratulations! 🎉
You’ve learned how to create an `Example` and can begin using examples to execute evaluations or create datasets.