Agent-as-a-Judge
Basic Information
Agent-as-a-Judge is an open-source implementation and methodology for using agents to evaluate other agents and their outputs. The repository provides runnable tools, example scripts, and a dataset workflow to automate evaluation of agentic tasks, collect evidence, and generate step-by-step feedback that can serve as reward signals for further training. It includes demos such as Ask Anything, an Agent-as-a-Judge run on the DevAI code-generation benchmark, and an OpenWiki demo for producing a DeepWiki-style knowledge resource. The project targets researchers and developers who need reproducible, scalable evaluation pipelines for agentic systems, and it documents the installation and LLM configuration steps required to run the provided scripts. The work behind the repository is described in an academic paper accepted at ICML 2025.
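The core idea can be sketched as a judge that checks an agent's output against a list of requirements and converts the per-requirement verdicts into a scalar reward. The sketch below is a minimal, hypothetical illustration and not the repository's actual API: a real Agent-as-a-Judge would query an LLM with collected evidence at the marked point, whereas this stand-in uses a simple keyword check so the example is self-contained and runnable.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """One criterion the judged agent's output must satisfy."""
    criterion: str
    keyword: str  # toy evidence proxy; a real judge would inspect files, logs, etc.


def judge_output(output: str, requirements: list[Requirement]) -> list[dict]:
    """Return a per-requirement verdict with step-by-step feedback.

    Placeholder logic: in an actual Agent-as-a-Judge pipeline, each
    verdict would come from an LLM call grounded in collected evidence.
    """
    verdicts = []
    for req in requirements:
        satisfied = req.keyword.lower() in output.lower()
        verdicts.append({
            "criterion": req.criterion,
            "satisfied": satisfied,
            "feedback": ("evidence found"
                         if satisfied
                         else f"no evidence of '{req.keyword}' in output"),
        })
    return verdicts


def reward_signal(verdicts: list[dict]) -> float:
    """Fraction of satisfied requirements, usable as a training reward."""
    return sum(v["satisfied"] for v in verdicts) / len(verdicts)


if __name__ == "__main__":
    reqs = [
        Requirement("results are saved to disk", "csv"),
        Requirement("a plot is produced", "plot"),
    ]
    verdicts = judge_output("The script writes metrics to results.csv", reqs)
    print(reward_signal(verdicts))  # one of two requirements satisfied
```

The key design point this illustrates is that the judge emits structured, per-criterion feedback rather than a single opaque score, so the same output can drive both human-readable evaluation reports and automated reward computation.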