Judgeval is an evaluation library for multi-step LLM systems, built and maintained by Judgment Labs.

Judgeval helps AI teams benchmark and iterate on their LLM apps. It was designed to:

  • Offer a development and production evaluation layer for multi-step LLM applications, especially for agentic systems.
  • Plug in 10+ research-backed metrics, including hallucination detection and RAG retriever quality, to evaluate LLM systems out of the box (see the sketch after this list).
  • Construct powerful custom evaluation pipelines for your LLM systems.
  • Monitor LLM systems in production using state-of-the-art real-time evaluation foundation models.
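
For example, a basic evaluation run might look like the sketch below. The names used here (`JudgmentClient`, `Example`, `FaithfulnessScorer`, `run_evaluation`) and their parameters are illustrative assumptions; consult the Judgeval docs for the exact API.

```python
# Minimal sketch of an offline evaluation run. Class names and parameters are
# assumptions for illustration; they may differ from the current Judgeval API.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# One evaluation example: the model's answer plus the retrieved context it
# should remain faithful to.
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=[
        "All customers are eligible for a 30-day full refund at no extra cost."
    ],
)

# Score the example with a faithfulness (hallucination) metric, using an LLM
# judge model to produce the verdict.
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",
)
print(results)
```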

Judgeval integrates natively with the Judgment Labs Platform, allowing you to evaluate, unit test, and monitor LLM applications in the cloud.
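
In production, monitoring might be wired up roughly as in the sketch below. The `Tracer` import path, the `wrap` helper, and the `observe` decorator are assumptions about Judgeval's tracing interface; check the docs for the current setup.

```python
# Hedged sketch of production tracing. The Tracer/wrap/observe names are
# assumptions, not confirmed Judgeval API; see the documentation for details.
from openai import OpenAI
from judgeval.common.tracer import Tracer, wrap

judgment = Tracer(project_name="my_project")  # hypothetical project name
client = wrap(OpenAI())  # wrap the LLM client so its calls are captured as spans

@judgment.observe(span_type="function")
def answer_question(question: str) -> str:
    # Each traced call can then be scored in real time on the platform.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```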

Judgeval was built by a passionate team of LLM researchers from Stanford, Datadog, and Together AI 💜.

Click here to get started.