Basic Information

AgentTuning is a research and engineering repository that demonstrates instruction tuning of large language models on entire interaction trajectories to enable generalized agent abilities. The project packages a curated interaction dataset called AgentInstruct, a family of tuned models named AgentLM (released in 7B, 13B, and 70B sizes), and evaluation tooling that measures performance on both held-in and held-out agent tasks. The repository includes code and instructions for running inference via a Text-Generation-Inference (TGI) Docker setup, example client requests, and evaluation scripts for general benchmarks. Its main purpose is to provide resources for developing, reproducing, and evaluating LLMs tuned for multi-step agent behaviors while retaining general language capabilities.
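
For example, once the TGI container from the repository's Docker setup is running, a client can query it over HTTP. The sketch below is a minimal Python equivalent of the documented curl-style request; the host, port, and generation parameters are assumptions for illustration and should be taken from the repository's compose configuration.

```python
# Minimal sketch of querying a running Text-Generation-Inference (TGI) server.
# The endpoint URL, port, and generation parameters below are assumptions;
# use the values defined in the repository's Docker compose setup.
import requests

TGI_ENDPOINT = "http://localhost:8080/generate"  # assumed host/port

payload = {
    # AgentLM follows the Llama-2-chat format, so the instruction is wrapped
    # in [INST] ... [/INST] tags.
    "inputs": "[INST] List the files in the current directory. [/INST]",
    "parameters": {"max_new_tokens": 256, "temperature": 0.7, "do_sample": True},
}

response = requests.post(TGI_ENDPOINT, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["generated_text"])
```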

App Details

Features

The repository provides several core artifacts and capabilities:

- AgentInstruct: a curated dataset of 1,866 high-quality interaction trajectories across six real-world agent tasks, with chain-of-thought rationales and strict filtering to exclude low-quality or leaked data.
- AgentLM models: mixed-trained on AgentInstruct and ShareGPT data, following the Llama-2-chat conversation format (see the prompt-format sketch below); 7B, 13B, and 70B checkpoints are published.
- Reproducible inference via Docker compose and Text-Generation-Inference, with example curl requests.
- Evaluation suites covering held-in tasks from AgentBench.old and held-out tasks recompiled from SciWorld, MiniWoB++, HotpotQA, ReWOO, WebArena, and a digital card game.
- Scripts to run MMLU, GSM8k, and MT-Bench evaluations and to integrate with FastChat for judge-based assessment.
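
As a rough illustration of the Llama-2-chat conversation format that AgentLM expects, the sketch below builds a multi-turn prompt in Python. The tag placement follows the standard Llama-2-chat convention and the example contents are invented; the repository's own serving and client code is the authoritative reference.

```python
# Minimal sketch of building a Llama-2-chat style prompt for AgentLM.
# Note: in practice <s> / </s> are special tokens added by the tokenizer;
# they are written as literal strings here only for illustration.

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(system: str, turns: list[tuple[str, str | None]]) -> str:
    """Format a system prompt plus (user, assistant) turns.

    The assistant reply of the final turn may be None, leaving the prompt
    open for the model to complete.
    """
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        # The system prompt is folded into the first user turn.
        user_block = f"{B_SYS}{system}{E_SYS}{user}" if i == 0 else user
        prompt += f"<s>{B_INST} {user_block} {E_INST}"
        if assistant is not None:
            prompt += f" {assistant} </s>"
    return prompt


# Example: an agent-style observation/action exchange (contents are illustrative).
prompt = build_prompt(
    "You are an agent operating a Linux shell. Think step by step, then act.",
    [
        ("List the files in the current directory.",
         "Thought: I should run ls.\nAction: ls"),
        ("Observation: data.csv  run.sh", None),
    ],
)
print(prompt)
```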

Use Cases

This repository is useful to researchers and developers who want to build, fine-tune, or evaluate LLMs with agent-like, multi-step decision capabilities. It supplies a high-quality, task-diverse dataset for instruction tuning, ready-to-use AgentLM checkpoints to deploy or extend, and practical instructions for serving models with TGI and Docker. The included evaluation pipelines let users measure general language performance as well as generalization to held-in and held-out agent benchmarks. By publishing both data and models, the project supports reproducible research into agent behaviors, enables comparisons across model sizes, and provides integration examples for common evaluation workflows such as MMLU, GSM8k, and MT-Bench.
