bigcodebench
Basic Information
BigCodeBench is an open benchmark and tooling project for evaluating large language models on practical, HumanEval-like code generation tasks that involve diverse function calls and complex instructions. The repository provides a dataset of tasks (including a harder subset called BigCodeBench-Hard), an evaluation CLI and workflow for generating model outputs and executing them against tests, and integration with multiple inference backends and execution endpoints. Each task is offered in two splits: Complete, which prompts models with a docstring for code completion, and Instruct, which targets instruction-tuned or chat models with natural-language instructions. The project publishes pre-generated LLM samples, maintains a public leaderboard on Hugging Face, and ranks models through reproducible, real-time execution of their generated code. Packaging and distribution are handled via PyPI, and guidance is provided for remote generation and evaluation with backends such as vLLM, Hugging Face, OpenAI, Anthropic, Mistral, and E2B.
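To make the Complete/Instruct distinction concrete, the sketch below loads the task set from the Hugging Face Hub and prints both prompt variants for a few tasks. The dataset ID, version split name, and field names are assumptions based on the public dataset card; consult the repository README for the exact schema and the CLI-based workflow it recommends.

```python
# Minimal sketch: inspect BigCodeBench tasks from the Hugging Face Hub.
# Dataset ID, version split, and field names below are assumptions; check
# the dataset card at https://huggingface.co/datasets/bigcode/bigcodebench.
from datasets import load_dataset

# Load a released version of the task set (version tag assumed here).
tasks = load_dataset("bigcode/bigcodebench", split="v0.1.2")

for task in tasks.select(range(3)):
    print(task["task_id"])
    # Complete split: docstring-driven prompt for code-completion models.
    print(task["complete_prompt"][:200])
    # Instruct split: natural-language instruction for chat/instruction-tuned models.
    print(task["instruct_prompt"][:200])
```

In the repository's own workflow these prompts are not consumed by hand; the packaged CLI drives generation with a chosen backend and then executes the sampled code against each task's tests to produce leaderboard scores.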