Installation

pip install judgeval

Judgeval runs evaluations that you can manage from within the library. Additionally, you can analyze and manage your evaluations, datasets, and metrics on the natively integrated Judgment Platform, an all-in-one suite for LLM system evaluation.

Our team is always shipping new releases of the judgeval package! To get the latest version, run pip install --upgrade judgeval. You can follow the latest updates on our GitHub.

Judgment API Keys

Our API keys allow you to access the JudgmentClient and Tracer which enable you to track your agents and run evaluations on Judgment Labs’ infrastructure, access our state-of-the-art judge models, and manage your evaluations/datasets on the Judgment Platform.

To get your account and organization API keys, create an account on the Judgment Platform.

export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"
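
Both the JudgmentClient and the Tracer pick these values up from the environment, as in the examples below. If you want a quick sanity check that the keys are visible to your Python process, a minimal sketch like this works (it only inspects the two variables exported above):

import os

# Fail fast if the credentials exported above are not visible to Python.
for var in ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set; export it before using judgeval.")
print("Judgment credentials found.")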

For assistance with your registration and setup, such as dealing with sensitive data that has to reside in your private VPCs, feel free to get in touch with our team.

Create Your First Experiment

sample_eval.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()  # picks up JUDGMENT_API_KEY and JUDGMENT_ORG_ID from the environment

# An Example bundles the user input, your system's output, and any retrieved context.
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

# FaithfulnessScorer checks that actual_output is grounded in retrieval_context.
scorer = FaithfulnessScorer(threshold=0.5)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",  # judge model used to score the example
)
print(results)

Congratulations! Your evaluation should have passed. Let’s break down what happened.

  • The input field mimics a user query, and actual_output is a placeholder for what your LLM system returns in response.
  • The retrieval_context field represents the context retrieved from your RAG knowledge base.
  • FaithfulnessScorer(threshold=0.5) is a scorer that checks whether the output is hallucinated relative to the retrieved context (a contrasting example follows this list).
  • We chose gpt-4o as our judge model to measure faithfulness. Judgment Labs offers ANY judge model for your evaluation needs. Consider trying out our state-of-the-art Osiris judge models for your next evaluation!
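
For contrast, here is a minimal sketch of a case the faithfulness check should flag. It reuses the exact API from sample_eval.py; the unsupported warranty claim in actual_output is invented purely for illustration.

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# This output makes claims that the retrieved context does not support,
# so the faithfulness score should fall below the 0.5 threshold.
hallucinated = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a lifetime warranty and free replacements on every pair.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

results = client.run_evaluation(
    examples=[hallucinated],
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",
)
print(results)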

To learn more about using the Judgment Client to run evaluations, click here.

Create Your First Trace

judgeval traces let you monitor your LLM systems during development and in production. Traces track your LLM system’s flow end-to-end and measure:

  • LLM costs
  • Workflow latency
  • Quality metrics, such as hallucination, retrieval quality, and more.

trace_example.py
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

# wrap() instruments the OpenAI client so its calls are captured in the trace
client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{task_input}"}]
    )
    return res.choices[0].message.content

main()  # running the observed function records the trace

Congratulations! You’ve just created your first trace. You can view it on the Judgment Platform.

There are many benefits of monitoring your LLM systems with judgeval tracing, including:

  • Debugging LLM workflows in seconds with full observability
  • Using production workflow data to create experimental datasets for future improvement/optimization
  • Tracking and creating Slack/Email alerts on any metric (e.g. latency, cost, hallucination, etc.)

To learn more about judgeval’s tracing module, click here.

Create Your First Online Evaluation

In addition to tracing, judgeval allows you to run online evaluations on your LLM systems. This enables you to:

  • Catch real-time quality regressions to take action before customers are impacted
  • Gain insights into your agent performance in real-world scenarios

To run an online evaluation, simply add a single judgment.async_evaluate() call to your existing trace:

trace_example.py
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{task_input}"}]
    ).choices[0].message.content

    # Score this span in real time; the result is attached to the trace on the Judgment Platform
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=task_input,
        actual_output=res,
        model="gpt-4o"
    )

    return res

main()

Online evaluations are automatically logged to the Judgment Platform as part of your traces. You can view them by navigating to your trace and clicking on the trace span that contains the online evaluation. If there is a quality regression, the UI will display an alert.

Optimizing Your LLM System

Evaluation and monitoring are the building blocks for optimizing LLM systems. Measuring the quality of your LLM workflows allows you to compare design iterations and ultimately find the optimal set of prompts, models, RAG architectures, etc. that makes your LLM system excel in your production use cases.

A typical experimental setup might look like this:

  1. Create a new Project in the Judgment platform by either running an evaluation from the SDK or via the platform UI. This will help you keep track of all evaluations and traces for different iterations of your LLM system.

A Project keeps track of Experiments and Traces relating to a specific workflow. Each Experiment contains a set of Scorers that have been run on a set of Examples.

  2. You can create separate Experiments for different iterations of your LLM system, allowing you to test each component independently.

You can try different models (e.g. gpt-4o, claude-3-5-sonnet, etc.) and prompt templates in each Experiment to find the optimal setup for your LLM system.
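
As a rough sketch of that workflow, the snippet below evaluates two candidate configurations of your system against the same example and scorer, one run per candidate. The generate_answer helper is a stand-in for your own LLM system, and the project_name and eval_run_name parameters are assumptions used here to illustrate grouping runs into Experiments within a Project; check run_evaluation's signature in your installed version.

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

QUESTION = "What if these shoes don't fit?"
CONTEXT = ["All customers are eligible for a 30 day full refund at no extra cost."]

def generate_answer(model: str, question: str) -> str:
    # Placeholder for your own LLM system (prompt template, RAG pipeline, etc.);
    # swap in a real call that uses `model` to produce the answer.
    return "We offer a 30-day full refund at no extra cost."

for candidate in ["gpt-4o", "claude-3-5-sonnet"]:
    example = Example(
        input=QUESTION,
        actual_output=generate_answer(candidate, QUESTION),
        retrieval_context=CONTEXT,
    )
    results = client.run_evaluation(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.5)],
        model="gpt-4o",  # judge model stays fixed across runs
        project_name="my_project",  # assumed parameter: groups runs under one Project
        eval_run_name=f"faithfulness-{candidate}",  # assumed parameter: one Experiment per candidate
    )
    print(candidate, results)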

Next Steps

Congratulations! You’ve just finished getting started with judgeval and the Judgment Platform.

For a deeper dive into using judgeval, learn more about experiments, unit testing, and monitoring!