Chinese LLM Benchmark
Basic Information
This repository is a continuously updated benchmarking project for Chinese large language models (LLMs) named ReLE (Really Reliable Live Evaluation). It evaluates hundreds of models (the README states coverage of 288 commercial and open models) across domains relevant to Chinese users, and provides standardized test suites, per-model rankings, and aggregated leaderboards summarizing accuracy, latency, token consumption, and cost.

Evaluations are organized into six major fields: education, medical and mental health, finance, law and public administration, reasoning and mathematics, and language and instruction following. These are further subdivided into roughly 300 fine-grained dimensions. The repository also publishes raw evaluation outputs and a large defect library of model failure cases. Its stated goals are to provide public, objective benchmarking, to support model selection, and to expose the strengths and limitations of diverse Chinese LLMs. The project maintains changelogs and a process for the ongoing addition of new models and test sets.
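The leaderboard metrics named above (accuracy, latency, token consumption, cost) can in principle be recomputed from the published raw evaluation outputs. The following is a minimal Python sketch of such an aggregation; the file path, JSON schema, and field names (`correct`, `latency_s`, `completion_tokens`, `cost`) are assumptions for illustration and are not taken from the repository's actual output format.

```python
# Minimal sketch: aggregate one model's raw evaluation records into
# leaderboard-style metrics. Schema and paths are hypothetical.
import json
from statistics import mean


def load_results(path: str) -> list[dict]:
    """Load a list of per-question evaluation records for one model."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(records: list[dict]) -> dict:
    """Reduce raw records to accuracy, latency, token, and cost summaries."""
    return {
        "accuracy": mean(1.0 if r["correct"] else 0.0 for r in records),
        "avg_latency_s": mean(r["latency_s"] for r in records),
        "avg_completion_tokens": mean(r["completion_tokens"] for r in records),
        "total_cost": sum(r["cost"] for r in records),
    }


if __name__ == "__main__":
    # Hypothetical file holding one model's raw outputs.
    records = load_results("results/example-model.json")
    print(summarize(records))
```

A real pipeline would repeat this per model and sort the resulting summaries to produce a leaderboard; the repository's own aggregation may weight dimensions or fields differently.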