bigcodebench

Basic Information

BigCodeBench is an open benchmark and tooling project for evaluating large language models on practical, HumanEval-like code generation tasks that involve diverse function calls and complex instructions. The repository provides a dataset of tasks (including a harder subset called BigCodeBench-Hard), an evaluation CLI and workflow for generating model outputs and executing them, and integrations with multiple inference backends and execution endpoints. It supports two task splits: Complete, for docstring-driven completion, and Instruct, for instruction-tuned or chat models. The project publishes pre-generated LLM samples, maintains a public leaderboard on Hugging Face, and supplies reproducible real-time execution for ranking models. Packaging and distribution are available via PyPI, and guidance is supplied for remote evaluation using backends such as vLLM, Hugging Face, OpenAI, Anthropic, Mistral, and E2B.
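
As a minimal illustration of how the task data can be inspected, the sketch below loads the benchmark from Hugging Face with the `datasets` library. The dataset ID `bigcode/bigcodebench` and the split layout are assumptions based on the public dataset card, not details stated on this page.

```python
# Minimal sketch: peek at the BigCodeBench tasks via the Hugging Face `datasets` library.
# Assumption: the dataset is published under the ID "bigcode/bigcodebench".
from datasets import load_dataset

dataset = load_dataset("bigcode/bigcodebench")  # returns a DatasetDict of versioned splits

# Print how many tasks each split contains.
for split_name, split in dataset.items():
    print(f"{split_name}: {len(split)} tasks")

# Inspect the fields of one task record (field names may vary by dataset version).
first_split = next(iter(dataset.values()))
print(sorted(first_split[0].keys()))
```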

App Details

Features
The repository includes a command-line evaluator that accepts model name, execution backend, split, and subset parameters, and writes generated samples and evaluation results to structured JSON/JSONL files. It supports execution backends (gradio, e2b, local) and inference backends (vllm, openai, anthropic, google, mistral, hf, hf-inference). It produces calibrated outputs, pass@k metrics, and standardized filenames for reproducibility. BigCodeBench provides pre-generated LLM outputs for the models it has evaluated, a public Hugging Face leaderboard with real-time code execution sessions, a hard subset of 148 realistic tasks (BigCodeBench-Hard), and guidance on handling chat vs. base tokenizers and the direct_completion flag. The project is distributed on PyPI and documents advanced usage and result submission procedures.
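
The pass@k numbers mentioned above are conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021). The sketch below is a generic implementation of that formula for context, not code taken from the BigCodeBench repository.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per task, c: samples that passed the tests,
    k: the k in pass@k. Returns the estimated probability that at least one
    of k randomly drawn samples passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generated samples passed; estimate pass@1 and pass@5.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # ~0.917
```
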
Use Cases
BigCodeBench helps researchers, model developers and benchmarking teams measure and compare code-generation capabilities of LLMs in realistic programming scenarios. By providing a large suite of tasks, standardized evaluation scripts, remote execution options and pre-generated samples, it reduces the cost and effort of running reproducible evaluations and collecting pass@k and execution-based metrics. The Hugging Face leaderboard and shared outputs enable community comparison and transparency. Support for multiple backends and execution modes allows teams to evaluate instruction-tuned chat models and base models consistently. The project also documents result submission and reproducible file naming so teams can contribute model results to the public leaderboard with minimal friction.
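
Because results are written to structured files with standardized names, aggregating scores across models is straightforward. The sketch below shows one way to build a simple comparison table; the directory layout, file pattern, and the "pass@1" key are illustrative assumptions rather than the repository's documented schema.

```python
import json
from pathlib import Path

# Hypothetical layout: one results JSON per model, as written by the evaluator.
# The "results" directory, the "*_eval_results.json" pattern, and the "pass@1"
# key are assumptions for illustration only.
results_dir = Path("results")

leaderboard = []
for path in sorted(results_dir.glob("*_eval_results.json")):
    with path.open() as f:
        data = json.load(f)
    leaderboard.append((path.stem, data.get("pass@1")))

# Sort models by pass@1 (highest first, missing scores last) and print a table.
for model, score in sorted(leaderboard, key=lambda x: (x[1] is None, -(x[1] or 0))):
    print(f"{model:60s} {score}")
```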
