Introduction to Judgeval
Judgeval is an evaluation library for multi-step LLM systems, built and maintained by Judgment Labs.
Judgeval helps AI teams benchmark and iterate on their LLM apps. It is designed to:
- Offer a development and production evaluation layer for multi-step LLM applications, especially for agentic systems.
- Plug-and-evaluate LLM systems with 10+ research-backed metrics, including hallucination detection and RAG retriever quality (see the sketch after this list).
- Construct powerful custom evaluation pipelines for your LLM systems.
- Monitor LLM systems in production using state-of-the-art real-time evaluation foundation models.
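As an illustration, the snippet below sketches what a plug-and-evaluate run might look like. It follows the pattern of Judgeval's quickstart, but the exact class names, parameters, and model string (`JudgmentClient`, `Example`, `FaithfulnessScorer`, `run_evaluation`, `"gpt-4o"`) should be treated as assumptions; check the current documentation before running it.

```python
# Minimal sketch of a plug-and-evaluate run. Names mirror Judgeval's quickstart;
# verify them against the current docs -- treat them as assumptions here.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()  # expects your Judgment API key in the environment

# One example: the model's answer plus the retrieved context it should be faithful to.
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=[
        "All customers are eligible for a 30-day full refund at no extra cost."
    ],
)

# Score faithfulness (hallucination detection) of the answer against the retrieval context.
client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",
)
```

The same pattern extends to the other built-in scorers and to custom evaluation pipelines: swap in different scorers, batch more examples, or wire the run into CI as a unit test.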
Judgeval integrates natively with the Judgment Labs Platform, allowing you to evaluate, unit test, and monitor LLM applications in the cloud.
Judgeval was built by a passionate team of LLM researchers from Stanford, Datadog, and Together AI 💜.
Click here to get started.