Tracing
Overview
judgeval’s tracing module allows you to view your LLM application’s execution from end to end.
Using tracing, you can:
- Gain observability into every layer of your agentic system, from database queries to tool calling and text generation.
- Measure the performance of each system component in any way you want. For instance:
  - Catch regressions in retrieval quality, factuality, answer relevance, and 10+ other research-backed metrics.
  - Quantify the quality of each tool call your agent makes.
  - Track the latency of each system component.
  - Count the token usage of each LLM generation.
- Export your workflow runs to the Judgment platform for real-time analysis or as a dataset for offline experimentation.
Tracing Your Workflow
Setting up tracing with judgeval takes two simple steps:
1. Initialize a tracer with your API keys and project name
The Judgment tracer is a singleton object that should be shared across your application. Your project name will be used to organize your traces in one place on the Judgment platform.
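A minimal setup might look like the following sketch. It assumes the Tracer class is importable from judgeval.common.tracer (the exact import path may differ by version) and that your API key lives in a JUDGMENT_API_KEY environment variable; the project name is hypothetical.

```python
import os
from judgeval.common.tracer import Tracer  # import path may differ by version

# The tracer is a singleton: create it once and share it across your application.
# The project name groups all traces from this app on the Judgment platform.
judgment = Tracer(
    api_key=os.getenv("JUDGMENT_API_KEY"),  # assumed environment variable
    project_name="my_project",              # hypothetical project name
)
```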
2. Wrap your workflow components
judgeval provides wrapping mechanisms for your workflow components:
wrap()
The wrap() function wraps your LLM client (e.g. OpenAI, Anthropic, etc.) and captures metadata surrounding your LLM calls, such as:
- Latency
- Token usage
- Prompt/Completion
- Model name
Here’s an example of using wrap() on an OpenAI client:
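The sketch below assumes wrap is importable alongside Tracer and that the wrapped client preserves the standard OpenAI interface; the project name is hypothetical.

```python
from openai import OpenAI
from judgeval.common.tracer import Tracer, wrap  # import path may differ by version

judgment = Tracer(project_name="my_project")  # hypothetical project name
client = wrap(OpenAI())  # captures latency, token usage, prompt/completion, and model name

# The wrapped client is used exactly like the normal OpenAI client.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Recommend a song."}],
)
print(response.choices[0].message.content)
```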
@observe
The @observe decorator wraps your functions/tools and captures metadata surrounding your function calls, such as:
- Latency
- Input/Output
- Span type (e.g. retriever, tool, LLM call, etc.)
Here’s an example of using the @observe decorator on a function:
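A minimal sketch, assuming observe is exposed as a method on the Tracer singleton created above and that search_index stands in for your own retrieval logic:

```python
@judgment.observe(span_type="retriever")
def fetch_documents(query: str) -> list[str]:
    # The function's inputs, outputs, latency, and span type are captured
    # automatically on this span.
    return search_index(query)  # hypothetical retrieval helper
```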
span_type is a string that you can use to categorize and organize your trace spans.
Span types are displayed on the trace UI so you can easily navigate a visualization of your workflow.
Common span types include tool, function, retriever, database, web search, etc.
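For instance, you might tag the different components of an agent as shown in this sketch (function names and bodies are hypothetical):

```python
@judgment.observe(span_type="retriever")
def retrieve_context(query: str) -> list[str]:
    ...  # fetch documents from your vector store

@judgment.observe(span_type="tool")
def get_weather(city: str) -> str:
    ...  # call an external weather API

@judgment.observe(span_type="web search")
def search_web(query: str) -> list[str]:
    ...  # query a search engine
```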
Putting it all Together
Here’s a complete example of using judgeval’s tracing mechanisms:
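The sketch below shows how the pieces could fit together, reusing the hypothetical names and import paths from the snippets above:

```python
import os
from openai import OpenAI
from judgeval.common.tracer import Tracer, wrap  # import path may differ by version

judgment = Tracer(api_key=os.getenv("JUDGMENT_API_KEY"), project_name="music_bot")
client = wrap(OpenAI())  # LLM calls made through this client are traced automatically

@judgment.observe(span_type="retriever")
def get_listening_history(user_id: str) -> list[str]:
    # Hypothetical lookup; inputs, outputs, and latency are recorded on this span.
    return ["Radiohead - Weird Fishes", "Bon Iver - Holocene"]

@judgment.observe(span_type="function")
def recommend_song(user_id: str) -> str:
    history = get_listening_history(user_id)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Recommend one new song based on this listening history."},
            {"role": "user", "content": ", ".join(history)},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(recommend_song("user_123"))
```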
And the trace will appear on the Judgment platform as follows:
3. Running Production Evaluations
Optionally, you can run asynchronous evaluations directly inside your traces.
This enables you to run evaluations on your production data in real-time, which can be useful for:
- Guardrailing your production system against quality regressions (hallucinations, toxic responses, revealing private data, etc.).
- Exporting production data for offline experimentation (e.g. A/B testing your workflow versions on relevant use cases).
- Getting actionable insights on how to fix common failure modes in your workflow (e.g. missing knowledge base info, suboptimal prompts, etc.).
To execute an asynchronous evaluation, you can use the trace.async_evaluate() method. Here’s an example:
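A sketch under a few assumptions: that a scorer such as AnswerRelevancyScorer can be imported from judgeval.scorers, that async_evaluate accepts the scorers/input/actual_output/model arguments shown, and that answer_question is an @observe-decorated function of your own. The with judgment.trace() context manager used here to obtain the trace handle is covered in the Advanced section below.

```python
from judgeval.scorers import AnswerRelevancyScorer  # assumed scorer import

with judgment.trace("qa_workflow") as trace:  # hypothetical trace name
    question = "What is the tallest mountain in the world?"
    answer = answer_question(question)  # hypothetical @observe-decorated function

    # Queue an evaluation that runs asynchronously; the results are attached
    # to this trace on the Judgment platform.
    trace.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=question,
        actual_output=answer,
        model="gpt-4o",  # judge model used to run the evaluation
    )
```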
Your async evaluations will be logged as part of the original trace, and a new evaluation will be created on the Judgment platform.
Example: Music Recommendation Agent
In this video, we’ll walk through all of the topics covered in this guide by tracing over a simple OpenAI API-based music recommendation agent.
Advanced: Customizing Traces Using the Context Manager
If you need to customize your tracing context, you can use the with judgment.trace() context manager.
The context manager can save/print the state of the trace at any point in the workflow. This is useful for debugging, or for exporting the state of your workflow at any point to run an evaluation on.
The with judgment.trace() context manager detects any @observe-decorated functions or wrapped LLM calls within the context and automatically captures their metadata.
Here’s an example of using the context manager to trace a workflow:
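A sketch, assuming judgment.trace() accepts a name for the trace and that the trace handle exposes print/save helpers as described above; recommend_song is the hypothetical function from the complete example earlier.

```python
with judgment.trace("recommendation_workflow") as trace:  # hypothetical trace name
    # Any @observe-decorated functions or wrapped LLM calls made inside this
    # block are captured as spans of the trace automatically.
    result = recommend_song("user_123")

    trace.print()  # assumed helper: print the current state of the trace
    trace.save()   # assumed helper: export the trace to the Judgment platform
```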
The with judgment.trace() context manager should only be used if you need to customize the context over which you’re tracing. In most cases, you should trace using the @observe decorator.