Performance Monitoring Workflows with Judgment
Overview
`judgeval` contains a suite of monitoring tools that allow you to measure the quality of your LLM applications in production scenarios.
Using `judgeval` in production, you can:
- Measure the quality of your LLM agent systems in real time using Judgment's 10+ research-backed scoring metrics.
- Check for regressions in retrieval quality, hallucination rate, or any other scoring metric you care about.
- Measure token usage.
- Track the latency of different system components (web searching, LLM generation, etc.).
Why evaluate your system in production?
Production data provides the highest signal for improving your LLM system on use cases you care about. Judgment Labs’ infrastructure enables LLM teams to capture quality signals from production use cases and provides actionable insights for improving any component of your system.
Standard Setup
A typical setup of `judgeval` on production systems involves:
- Tracing your application using `judgeval`'s tracing module.
- Embedding evaluation runs into your traces using the `async_evaluate()` function.
- Tracking your LLM agent's performance in real time using the Judgment platform.
For a full example of how to set up `judgeval` in a production system, see our OpenAI Travel Agent example.
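The following is a minimal sketch of this setup. The `Tracer` class, `@observe` decorator, scorer import, and `JUDGMENT_API_KEY` variable are assumptions about `judgeval`'s API based on typical usage; only `async_evaluate()` is taken directly from the steps above, so check the `judgeval` docs for the exact names and signatures.

```python
import os

# Assumed: judgeval authenticates via a JUDGMENT_API_KEY environment variable.
os.environ.setdefault("JUDGMENT_API_KEY", "your-api-key")

# Assumed import paths and class names; verify against your judgeval version.
from judgeval.common.tracer import Tracer
from judgeval.scorers import AnswerRelevancyScorer

judgment = Tracer(project_name="travel_agent")


def call_llm(question: str) -> str:
    # Placeholder for your actual LLM call (e.g., an OpenAI chat completion).
    return f"Stub answer for: {question}"


@judgment.observe(span_type="function")  # traces latency for this span
def answer_question(question: str) -> str:
    answer = call_llm(question)

    # Embed an evaluation run into the trace; scoring happens asynchronously,
    # so it does not block the response path.
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=question,
        actual_output=answer,
        model="gpt-4o",
    )
    return answer


print(answer_question("What are the best months to visit Tokyo?"))
```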
Disabling Monitoring
If your setup requires you to disable monitoring in production environments, you can do so by:
- Setting the `JUDGMENT_MONITORING` environment variable to `false` (disables tracing).
- Setting the `JUDGMENT_EVALUATIONS` environment variable to `false` (disables `async_evaluate()` calls).
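For example, here is a sketch of disabling both flags in code. It assumes `judgeval` reads these variables at startup; in most deployments you would instead set them in your container or process manager configuration.

```python
import os

# Set before judgeval is imported/initialized so the flags take effect.
os.environ["JUDGMENT_MONITORING"] = "false"   # disables tracing
os.environ["JUDGMENT_EVALUATIONS"] = "false"  # disables async_evaluate() runs
```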