Introduction

Experiment comparisons allow you to systematically A/B test changes in your LLM workflows. Whether you’re testing different prompts, models, or architectures, Judgment helps you compare results across experiments to make data-driven decisions about your LLM systems.

Creating Your First Comparison

Let’s walk through how to create and run experiment comparisons:

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerCorrectnessScorer

client = JudgmentClient()

# Define your test examples
examples = [
    Example(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris"
    ),
    Example(
        input="What is the capital of Japan?",
        actual_output="Tokyo is the capital of Japan.",
        expected_output="Tokyo"
    )
]

# Define your scorer
scorer = AnswerCorrectnessScorer(threshold=0.7)

# Run first experiment with GPT-4
experiment_1 = client.run_evaluation(
    examples=examples,
    scorers=[scorer],
    model="gpt-4",
    project_name="capital_cities",
    eval_name="gpt4_experiment"
)

# Run second experiment with a different model
experiment_2 = client.run_evaluation(
    examples=examples,
    scorers=[scorer],
    model="gpt-3.5-turbo",
    project_name="capital_cities",
    eval_name="gpt35_experiment"
)

After running the code above, click the View Results link to open your experiment run on the Judgment Platform.
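
You can also inspect the returned results in code before opening the platform. The sketch below continues the script above and assumes run_evaluation returns a list of per-example scoring results that expose a success flag; that attribute name is an assumption, so check the SDK reference for the exact result shape.

# Hedged sketch: compare aggregate pass rates of the two runs locally.
# NOTE: the `success` attribute is an assumption about the result objects;
# consult the judgeval SDK reference for the actual field names.
def pass_rate(results):
    passed = sum(1 for r in results if getattr(r, "success", False))
    return passed / len(results) if results else 0.0

print(f"gpt-4 pass rate:         {pass_rate(experiment_1):.0%}")
print(f"gpt-3.5-turbo pass rate: {pass_rate(experiment_2):.0%}")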

Analyzing Results

Once your experiments are complete, you can compare them on the Judgment Platform:

  1. You’ll be automatically directed to your Experiment page. Here you’ll see your latest experiment results and a “Compare” button.

  2. Click the “Compare” button to navigate to the Experiments page. Here you can select a previous experiment to compare against your current results.

  3. After selecting an experiment, you’ll return to the Experiment page with both experiments’ results displayed side by side.

  4. For detailed insights, click on any row in the comparison table to see specific metrics and analysis.

Use these detailed comparisons to make data-driven decisions about which model, prompt, or architecture performs best for your specific use case.
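
The same pattern extends beyond model swaps. To A/B test prompts, generate actual_output with each prompt variant and run one evaluation per variant under the same project so they appear side by side in the Compare view. The sketch below continues the setup above; the prompt strings and the generate_answer helper are hypothetical placeholders for your own generation step, not part of the judgeval SDK.

# Hedged sketch: A/B test two prompt variants with the same scorer.
# `generate_answer` is a placeholder for your own LLM call; the prompt
# templates below are illustrative only.
prompts = {
    "terse_prompt": "Answer in one word: {question}",
    "verbose_prompt": "Answer the question in a full sentence: {question}",
}

questions = ["What is the capital of France?", "What is the capital of Japan?"]
expected = ["Paris", "Tokyo"]

for prompt_name, template in prompts.items():
    prompt_examples = [
        Example(
            input=q,
            actual_output=generate_answer(template.format(question=q)),  # your generation step
            expected_output=e,
        )
        for q, e in zip(questions, expected)
    ]
    client.run_evaluation(
        examples=prompt_examples,
        scorers=[scorer],
        model="gpt-4",
        project_name="capital_cities",
        eval_name=f"{prompt_name}_experiment",
    )

Keeping every variant in the same project is what makes the platform's Compare button useful: each run shows up as a separate experiment you can select for a side-by-side comparison.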

Next Steps

  • To learn more about creating datasets to use in your experiments, check out our Datasets section.