Overview

judgeval contains a suite of monitoring tools that let you measure the quality of your LLM applications in production.

Using judgeval in production, you can:

  • Measure the quality of your LLM agent systems in real time using Judgment’s 10+ research-backed scoring metrics (see the scorer sketch after this list).
    • Check for regressions in retrieval quality, hallucination rates, or any other metric you care about.
  • Measure token usage
  • Track latency of different system components (web searching, LLM generation, etc.)
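For instance, a one-off quality check on a single production response might look like the following sketch. The JudgmentClient, Example, and FaithfulnessScorer names follow judgeval's documented patterns, but treat the exact fields and signatures as assumptions and verify them against the current API reference.

```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()  # picks up JUDGMENT_API_KEY from the environment

# Score one response for hallucinations against its retrieval context.
example = Example(
    input="What is the baggage allowance on this fare?",
    actual_output="Economy passengers may check two 23 kg bags.",
    retrieval_context=["Economy fares include two checked bags up to 23 kg each."],
)

results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",
)
```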

Why evaluate your system in production?

Production data provides the highest signal for improving your LLM system on use cases you care about. Judgment Labs’ infrastructure enables LLM teams to capture quality signals from production use cases and provides actionable insights for improving any component of your system.

Standard Setup

A typical setup of judgeval on production systems involves:

  • Tracing your application using judgeval’s tracing module.
  • Embedding evaluation runs into your traces using the async_evaluate() function (see the sketch after this list).
  • Tracking your LLM agent’s performance in real time using the Judgment platform.
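Concretely, the standard setup might look like the sketch below: a traced function with an evaluation embedded in it. The import path, Tracer arguments, and async_evaluate() signature follow judgeval's documented examples but may differ between versions, and call_llm() is a hypothetical stand-in for your own generation step.

```python
from judgeval.common.tracer import Tracer
from judgeval.scorers import AnswerRelevancyScorer

judgment = Tracer(project_name="travel_agent")  # reads JUDGMENT_API_KEY from the environment


def call_llm(question: str) -> str:
    # Placeholder for your real LLM call (e.g. an OpenAI chat completion).
    return "..."


@judgment.observe(span_type="function")  # traces this function as a span
def answer_question(question: str) -> str:
    answer = call_llm(question)

    # Embed an evaluation run in the trace; it is scored asynchronously,
    # so it does not block the response path.
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=question,
        actual_output=answer,
        model="gpt-4o",
    )
    return answer
```

Because the evaluation runs asynchronously, scoring adds no user-facing latency; results appear alongside the trace on the Judgment platform.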

For a full example of how to set up judgeval in a production system, see our OpenAI Travel Agent example.

Disabling Monitoring

If your setup requires it, you can disable monitoring in production environments by setting the following environment variables (see the sketch below):

  • Setting the JUDGMENT_MONITORING environment variable to false (disables tracing)
  • Setting the JUDGMENT_EVALUATIONS environment variable to false (disables async_evaluate() calls)
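Both variables are normally exported in the deployment environment, but a Python-side sketch (assuming judgeval reads them at initialization, so they must be set before the tracer is constructed) would be:

```python
import os

# Disable tracing and async evaluations before judgeval is initialized.
os.environ["JUDGMENT_MONITORING"] = "false"   # disables tracing
os.environ["JUDGMENT_EVALUATIONS"] = "false"  # disables async_evaluate() calls
```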