Agent-as-a-Judge

Basic Information

Agent-as-a-Judge is an open-source implementation and methodology for using agents to evaluate other agents and their outputs. The repository provides runnable tools, example scripts, and a dataset workflow to automate the evaluation of agentic tasks, collect evidence, and generate step-by-step feedback that can serve as reward signals for further training. It includes demos such as Ask Anything, an Agent-as-a-Judge run on the DevAI code-generation benchmark, and an OpenWiki demo for producing a DeepWiki-style knowledge resource. The project targets researchers and developers who need reproducible, scalable evaluation pipelines for agentic systems, and it documents the installation and LLM configuration steps required to run the provided scripts. The work behind the repository is described in an academic paper that was accepted at ICML 2025.
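
The exact configuration keys are documented in the repository itself; purely as a rough illustration of the LLM configuration step, the sketch below assumes LiteLLM-style, OpenAI-compatible credentials read from the environment. The variable names, placeholder key, and script path are assumptions, not taken from the repository.

```python
import os

# Hypothetical configuration sketch: LiteLLM-backed scripts commonly read
# provider credentials from environment variables. These are standard
# OpenAI-style names, not necessarily the ones this repository expects.
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"             # provider API key (placeholder)
os.environ["OPENAI_API_BASE"] = "https://api.openai.com/v1"   # or a compatible proxy endpoint

# With credentials in place, the provided entry points (e.g. run_ask.py)
# can be launched from the repository root as ordinary Python scripts.
```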

App Details

Features
The repository emphasizes automated evaluation and continuous reward-signal generation. The README reports roughly a 97.72% reduction in time and a 97.64% reduction in cost compared with human expert evaluation. Prebuilt scripts are provided: run_ask.py for workspace queries, run_aaaj.py for running the judge on DevAI-style developer agents, and run_wiki.py for the OpenWiki demo. The codebase integrates with external LLMs through environment configuration and LiteLLM, ships the DevAI benchmark dataset (55 tasks with 365 hierarchical requirements), and includes examples showing how the judge collects evidence for scoring. The README also gives quick-start installation instructions using conda and poetry and points to usage examples and dataset hosting.
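
The repository's scripts encapsulate the full evaluation pipeline; purely to illustrate the core idea of evidence-based judging, the sketch below shows how a judge-style check might ask an LLM whether collected evidence satisfies a single requirement, using LiteLLM's completion API. The prompt wording, helper function, and example inputs are illustrative assumptions, not the repository's implementation.

```python
from litellm import completion

def judge_requirement(requirement: str, evidence: str, model: str = "gpt-4o") -> bool:
    """Illustrative judge step (not the repo's implementation): ask an LLM
    whether evidence gathered from an agent's workspace satisfies one requirement."""
    prompt = (
        "You are evaluating a developer agent's output.\n"
        f"Requirement: {requirement}\n"
        f"Evidence collected from the workspace:\n{evidence}\n"
        "Answer strictly with SATISFIED or UNSATISFIED."
    )
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("SATISFIED")

# Example call with made-up requirement and evidence strings:
# judge_requirement(
#     "The script saves a confusion matrix to results/confusion_matrix.png",
#     "Found results/confusion_matrix.png; plotting code located in src/eval.py",
# )
```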
Use Cases
This project helps teams and researchers scale the evaluation of agentic systems by automating judgment tasks that would otherwise require costly human experts. It produces reproducible, stepwise feedback suitable as a reward signal for agent training and self-improvement, enabling faster iteration on agent design and benchmarking. By providing scripts, example workspaces, and a benchmark dataset, the repository lowers the barrier to running large-scale evaluations and generating labeled agentic datasets across domains. The OpenWiki demo also shows how the approach can be adapted to build knowledge resources. Overall, the toolkit is useful for academic benchmarking, for developing evaluation pipelines, and for generating training signals for autonomous agents.
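
As a rough, hypothetical illustration of how stepwise judgments could be turned into a scalar training signal, the snippet below averages per-requirement pass/fail verdicts into a reward in [0, 1]. This aggregation scheme is an assumption for illustration, not the project's documented method.

```python
from typing import Dict

def requirements_to_reward(verdicts: Dict[str, bool]) -> float:
    """Hypothetical aggregation: map per-requirement judgments
    (e.g. produced by an agent judge) to a scalar reward in [0, 1]."""
    if not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)

# Example with made-up requirement labels:
reward = requirements_to_reward({
    "R1: loads dataset": True,
    "R2: trains model": True,
    "R3: saves metrics plot": False,
})
print(f"reward = {reward:.2f}")  # -> reward = 0.67
```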
