
Basic Information

Code and tooling to reproduce the MLE-bench benchmark for evaluating machine learning agents on machine learning engineering tasks. The repository provides the code used to construct the dataset, the evaluation and grading logic, and the agent implementations used in the benchmark study. It packages a collection of 75 Kaggle competitions with scripts to download raw data, split training sets into new training and test sets, and prepare both full and lite datasets. The repo also includes a leaderboard of evaluated agents, example usage, experimental artifacts from the paper, and extras such as rule-violation and plagiarism detectors. It supplies a base Docker environment and guidance on recommended resources and evaluation procedures for reproducible benchmarking.
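As a rough sketch of the dataset-preparation flow (the competition ID is only an example and the exact flags should be checked against the repository's README), preparing one competition or the full set looks roughly like:

    # Requires Kaggle API credentials and Git LFS for the raw downloads.
    mlebench prepare -c spaceship-titanic    # prepare a single competition (ID shown as an example)
    mlebench prepare --all                   # prepare all competitions (flag name assumed)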

App Details

Features
Includes a curated dataset of 75 Kaggle competitions and a low-complexity "lite" split of 22 competitions to reduce compute and storage. Provides command-line tools to prepare datasets and grade submissions, including mlebench prepare, mlebench grade, and mlebench grade-sample. Bundles per-competition grading scripts and a grading server to validate submission structure. Offers a Docker base image with a Conda environment for running agents and optional heavy dependencies. Contains examples, an experiments directory with splits and familiarity experiments, scripts to compile agent submissions, evaluated agents, a leaderboard, and extras for rule-violation and plagiarism detection. Supports Git LFS and the Kaggle API for data download.
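A hedged sketch of the grading workflow named above (paths are placeholders, and flag names should be confirmed against the CLI's own help output):

    # Validate one CSV against a single competition's grader and sample-submission format:
    mlebench grade-sample path/to/submission.csv <competition-id>

    # Grade a batch of agent submissions listed in a JSONL manifest and write per-competition reports:
    mlebench grade --submission path/to/submissions.jsonl --output-dir path/to/reports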
Use Cases
Enables standardized and reproducible comparison of ML engineering capabilities across agent implementations. It automates dataset preparation and splitting so researchers do not need access to the original competitions' held-out test sets. The grading tools and sample submission format allow consistent scoring and produce per-competition grading reports. The Docker environment and recommended compute/runtime guidance make it easier to run agents reliably. The lite evaluation option and clear benchmarking protocol reduce resource burden while preserving comparability. Experiment artifacts, examples, and scripts to build submissions help replicate the paper's results and deploy new agents for evaluation. Extras help enforce fairness by detecting rule violations and plagiarism.
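For running agents in the supplied base Docker environment, a minimal sketch (the image tag and Dockerfile path are assumptions; the repository's environment directory and README define the actual build steps):

    # Build the base image containing the Conda environment used to run agents:
    docker build -t mlebench-env -f environment/Dockerfile .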
