Getting Started
This guide will help you learn the essential components of `judgeval`.
Installation
pip install judgeval
Judgeval runs evaluations that you can manage from within the library. You can also analyze and manage your evaluations, datasets, and metrics on the natively integrated Judgment Platform, an all-in-one suite for LLM system evaluation.
Our team is always making new releases of the `judgeval` package! To get the latest version, run `pip install --upgrade judgeval`.
You can follow our latest updates via our GitHub.
Judgment API Keys
Our API keys give you access to the `JudgmentClient` and `Tracer`, which enable you to track your agents and run evaluations on Judgment Labs’ infrastructure, access our state-of-the-art judge models, and manage your evaluations and datasets on the Judgment Platform.
To get your account and organization API keys, create an account on the Judgment Platform.
For assistance with your registration and setup, such as dealing with sensitive data that has to reside in your private VPCs, feel free to get in touch with our team.
Create Your First Experiment
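Here is a minimal sketch of what a first experiment might look like. The import paths, `Example` fields, and `run_evaluation` signature reflect common judgeval usage but may differ slightly in the version you have installed, so treat this as an illustration rather than a definitive reference:

```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()  # reads your Judgment API keys from the environment

# Mimic a user query, your system's answer, and the retrieved RAG context.
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30-day full refund at no extra cost."],
)

# FaithfulnessScorer checks whether actual_output is grounded in retrieval_context.
scorer = FaithfulnessScorer(threshold=0.5)

results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",  # the judge model used to measure faithfulness
)
print(results)
```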
Congratulations! Your evaluation should have passed. Let’s break down what happened.
- The variable `input` mimics a user input and `actual_output` is a placeholder for what your LLM system returns based on the input.
- The variable `retrieval_context` represents the retrieved context from your RAG knowledge base.
- `FaithfulnessScorer(threshold=0.5)` is a scorer that checks if the output is hallucinated relative to the retrieved context.
  - The threshold is used in the context of unit testing.
- We chose `gpt-4o` as our judge model to measure faithfulness. Judgment Labs offers ANY judge model for your evaluation needs. Consider trying out our state-of-the-art Osiris judge models for your next evaluation!
To learn more about using the Judgment Client to run evaluations, click here.
Create Your First Trace
`judgeval` traces let you monitor your LLM systems during online development and in production.
Traces enable you to track your LLM system’s flow end-to-end and measure:
- LLM costs
- Workflow latency
- Quality metrics, such as hallucination, retrieval quality, and more.
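As a rough sketch, tracing typically means wrapping your LLM client and decorating the functions you want to observe. The `Tracer`, `wrap`, and `observe` names below follow judgeval’s decorator-style tracing API, but the exact import path and arguments are assumptions that may vary by version:

```python
from judgeval.common.tracer import Tracer, wrap  # import path may differ by version
from openai import OpenAI

judgment = Tracer(project_name="my_first_project")  # hypothetical project name
client = wrap(OpenAI())  # wrapping the client captures LLM calls, tokens, and cost

@judgment.observe(span_type="function")
def answer_question(question: str) -> str:
    # Each call to this function is recorded as a span in the trace.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

answer_question("What is the capital of France?")
```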
Congratulations! You’ve just created your first trace. You can view it on the Judgment Platform.
There are many benefits to monitoring your LLM systems with `judgeval` tracing, including:
- Debugging LLM workflows in seconds with full observability
- Using production workflow data to create experimental datasets for future improvement/optimization
- Tracking and creating Slack/Email alerts on any metric (e.g. latency, cost, hallucination, etc.)
To learn more about `judgeval`’s tracing module, click here.
Create Your First Online Evaluation
In addition to tracing, `judgeval` allows you to run online evaluations on your LLM systems. This enables you to:
- Catch real-time quality regressions to take action before customers are impacted
- Gain insights into your agent performance in real-world scenarios
To run an online evaluation, you can simply add one line of code to your existing trace:
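The snippet below sketches what that might look like inside a traced function. The `async_evaluate` call and `AnswerRelevancyScorer` are assumptions based on judgeval’s online-evaluation interface and may be named differently in your version:

```python
from judgeval.common.tracer import Tracer  # import path may differ by version
from judgeval.scorers import AnswerRelevancyScorer

judgment = Tracer(project_name="my_first_project")  # hypothetical project name

@judgment.observe(span_type="function")
def answer_question(question: str) -> str:
    answer = "Paris is the capital of France."  # stand-in for your real LLM call

    # The one added line: score this span in real time as the trace is recorded.
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],  # assumed scorer name
        input=question,
        actual_output=answer,
        model="gpt-4o",
    )
    return answer

answer_question("What is the capital of France?")
```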
Online evaluations are automatically logged to the Judgment Platform as part of your traces. You can view them by navigating to your trace and clicking on the trace span that contains the online evaluation. If there is a quality regression, the UI will display an alert.
Optimizing Your LLM System
Evaluation and monitoring are the building blocks for optimizing LLM systems. Measuring the quality of your LLM workflows allows you to compare design iterations and ultimately find the optimal set of prompts, models, RAG architectures, etc. that make your LLM excel in your production use cases.
A typical experimental setup might look like this:
- Create a new Project in the Judgment Platform by either running an evaluation from the SDK or via the platform UI. This will help you keep track of all evaluations and traces for different iterations of your LLM system.
A Project keeps track of Experiments and Traces relating to a specific workflow. Each Experiment contains a set of Scorers that have been run on a set of Examples.
- You can create separate Experiments for different iterations of your LLM system, allowing you to test each component independently.
You can try different models (e.g. `gpt-4o`, `claude-3-5-sonnet`, etc.) and prompt templates in each Experiment to find the optimal setup for your LLM system, as shown in the sketch below.
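For illustration, an iteration loop might pass a project name and an experiment (run) name when running evaluations, so each variant shows up as its own Experiment under the same Project. The `project_name` and `eval_run_name` parameters and the `generate_answer` helper below are assumptions for the sake of the sketch and may differ from the actual `run_evaluation` signature in your version:

```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

def generate_answer(question: str, model: str) -> str:
    # Hypothetical helper: runs your RAG workflow with the given candidate model.
    return f"placeholder answer generated by {model}"

# Compare two candidate models as separate Experiments under one Project.
for candidate_model in ["gpt-4o", "claude-3-5-sonnet"]:
    example = Example(
        input="What if these shoes don't fit?",
        actual_output=generate_answer("What if these shoes don't fit?", model=candidate_model),
        retrieval_context=["All customers are eligible for a 30-day full refund at no extra cost."],
    )
    client.run_evaluation(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.5)],
        model="gpt-4o",                          # judge model stays fixed across iterations
        project_name="my_first_project",         # hypothetical Project name
        eval_run_name=f"rag-{candidate_model}",  # assumed parameter: one Experiment per iteration
    )
```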
Next Steps
Congratulations! You’ve just finished getting started with `judgeval` and the Judgment Platform.
For a deeper dive into using `judgeval`, learn more about experiments, unit testing, and monitoring!