Overview

An Example is a basic unit of data in judgeval that allows you to run evaluation scorers on your LLM system. An Example can be composed of a mixture of the following fields:

  • input [Required]
  • actual_output [Required]
  • expected_output [Optional]
  • retrieval_context [Optional]
  • context [Optional]

Here’s how to create an Example:

example_test.py
from judgeval.data import Example

example = Example(
    input="Who founded Microsoft?",
    actual_output="Bill Gates and Paul Allen.",
    expected_output="Bill Gates and Paul Allen founded Microsoft in New Mexico in 1975.",
    retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
    context=["Bill Gates and Paul Allen are the founders of Microsoft."],
)

The input and actual_output fields are required for every Example, but not every evaluation needs to use them.

For example, if you’re evaluating whether a chatbot’s response is friendly, you don’t need to use input.

Other fields are optional and depend on the type of evaluation. If you want to detect hallucinations in a RAG system, you’d use the retrieval_context field for the Faithfulness scorer.
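
For instance, here’s a minimal sketch of running the Faithfulness scorer on an Example, assuming judgeval’s JudgmentClient and FaithfulnessScorer as shown in the quickstart (see the scorer docs for the exact names and parameters):

run_faithfulness.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

example = Example(
    input="Who founded Microsoft?",
    actual_output="Bill Gates and Paul Allen.",
    retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
)

# Faithfulness checks that actual_output is grounded in retrieval_context
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",  # the judge model; swap in your preferred judge
)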

Example Fields

Here, we cover the possible fields that make up an Example.

Input

The input field represents a sample interaction between a user and your LLM system. It should be the direct input to your prompt template(s), and SHOULD NOT CONTAIN the prompt template itself.

Prompt templates are hyperparameters that you optimize based on the scorer you’re executing.

Evaluation is always tied to optimization, so you should isolate your system’s independent variables (e.g. prompt template, model, etc.) from your evaluation.
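
To make this concrete, here’s a minimal sketch that keeps the prompt template inside your system so the Example only ever sees the raw user input. PROMPT_TEMPLATE, my_llm_system, and call_model are hypothetical stand-ins for your own code:

input_isolation.py
from judgeval.data import Example

# The template is part of your system, not part of the Example
PROMPT_TEMPLATE = "You are a concise assistant.\n\nQuestion: {question}"

def my_llm_system(question: str) -> str:
    prompt = PROMPT_TEMPLATE.format(question=question)
    return call_model(prompt)  # hypothetical call to your model

question = "Who founded Microsoft?"
example = Example(
    input=question,  # raw user input only, no template text
    actual_output=my_llm_system(question),
)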

Actual Output

The actual_output field represents what your LLM system outputs for a given input. It can be generated live at evaluation time or loaded from saved answers.

actual_output.py
from judgeval.data import Example

# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question)
)
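
If you’ve already generated and stored your system’s answers, you can build Examples from them instead of calling the system at evaluation time. A minimal sketch, assuming a hypothetical saved_answers.json file of question/answer records:

saved_answers.py
import json

from judgeval.data import Example

# saved_answers.json is a hypothetical file like:
# [{"question": "...", "answer": "..."}, ...]
with open("saved_answers.json") as f:
    records = json.load(f)

examples = [
    Example(input=record["question"], actual_output=record["answer"])
    for record in records
]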

Expected Output

The expected_output field is Optional[str] and represents the ideal output of your LLM system.

judgeval’s scorers use LLM judges with flexible evaluation criteria, so your expected_output doesn’t need to match your actual_output word for word.

To learn more about how judgeval’s scorers work, please see the scorer docs.

expected_output.py
from judgeval.data import Example

# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy."
)
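
For instance, a scorer that compares actual_output against expected_output can pass the example above even though the two answers are worded differently. A minimal sketch, assuming judgeval’s AnswerCorrectnessScorer (check the scorer docs for the exact class name and parameters):

answer_correctness.py
from judgeval import JudgmentClient
from judgeval.scorers import AnswerCorrectnessScorer

client = JudgmentClient()

# The LLM judge scores semantic agreement, not exact string equality
results = client.run_evaluation(
    examples=[example],  # the example built above
    scorers=[AnswerCorrectnessScorer(threshold=0.5)],
    model="gpt-4o",
)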

Context

The context field is Optional[List[str]] and represents information that is supplied to the LLM system as ground truth.

For instance, context could be a list of facts that the LLM system is aware of. However, context should not be confused with retrieval_context.

In RAG systems, contextual information is retrieved from a vector database and is represented in judgeval by retrieval_context, not context. If you’re building a RAG system, you’ll want to use retrieval_context.

context.py
from judgeval.data import Example

# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."]
)

Retrieval Context

The retrieval_context field is Optional[List[str]] and represents the context that is actually retrieved from a vector database. This is often the context that is used to generate the actual_output in a RAG system.

Some common cases for using retrieval_context are:

  • Checking for hallucinations in a RAG system
  • Evaluating the quality of a retriever model by comparing retrieval_context to context

retrieval_context.py
from judgeval.data import Example

# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."],
    retrieval_context=["Sparkling water is carbonated and has no calories."]
)
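
In practice, retrieval_context usually comes straight from your retriever at generation time. Here’s a minimal sketch where vector_db and its similarity_search method are hypothetical stand-ins for your own retrieval layer:

retriever_example.py
from judgeval.data import Example

# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"

# Hypothetical retriever call; substitute your own vector database query
docs = vector_db.similarity_search(question, k=3)

example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    retrieval_context=[doc.text for doc in docs],
)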

Custom Examples

Custom Examples let you evaluate your LLM system with more complex data structures. A CustomExample is composed of the same fields as an Example, but its fields are dictionaries instead of strings.

custom_example.py
from judgeval.data import CustomExample

example = CustomExample(
    input={"question": "Is sparkling water healthy?"},
    actual_output={"answer": "Sparkling water is neither healthy nor unhealthy."},
)

Custom Examples are designed to be used with our Custom Scorers. See the Custom Scorers documentation for more information.

Conclusion

Congratulations! 🎉

You’ve learned how to create Examples and can begin using them to execute evaluations or create datasets.