chinese llm benchmark

Basic Information

This repository is a continuously updated Chinese large language model (LLM) benchmarking project named ReLE (Really Reliable Live Evaluation). It collects and evaluates hundreds of models (the README states coverage of 288 commercial and open models) across many domains relevant to Chinese users, and provides standardized test suites, per-model rankings and aggregated leaderboards that summarize accuracy, latency, token consumption and cost. Evaluations are organized into six major fields: education, medical and mental health, finance, law and public administration, reasoning and mathematics, and language and instruction following, and are further subdivided into roughly 300 fine-grained dimensions. The repo also publishes raw evaluation outputs and a large defect library of model failure cases. Its stated goals are to provide public, objective benchmarking, support model selection, and expose the strengths and limitations of diverse Chinese LLMs. The project includes changelogs and mechanisms for the ongoing addition of models and test sets.
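As a rough illustration of the metrics the project reports for each model, the sketch below models a single leaderboard entry as a Python data structure. The class name, field names and the unweighted averaging are illustrative assumptions, not the repository's actual schema or aggregation rule.

```python
from dataclasses import dataclass, field

@dataclass
class LeaderboardEntry:
    """Hypothetical shape of one leaderboard row, based on the metrics described above."""
    model_name: str                                                # commercial or open Chinese LLM
    field_scores: dict[str, float] = field(default_factory=dict)   # one of the six major fields -> 0-100 score
    avg_latency_s: float = 0.0                                     # average response time, seconds
    avg_tokens: float = 0.0                                        # average token consumption per call
    cost_per_1k_calls: float = 0.0                                 # estimated cost per thousand calls

    @property
    def overall_score(self) -> float:
        # Unweighted mean across fields; the project's actual weighting may differ.
        if not self.field_scores:
            return 0.0
        return sum(self.field_scores.values()) / len(self.field_scores)
```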

App Details

Features
Provides extensive, domain-specific leaderboards and per-benchmark rankings covering both multi-modal and text-only evaluations. Contains specialized benchmark suites for education (from primary school through the gaokao), medical and psychology exams, finance and legal qualification tests, reasoning and math tasks (including BBH, arithmetic, Sudoku and olympiad problems), and many language-understanding tasks such as sentiment, entailment, pronoun resolution and idiom matching. Multi-modal test sets have been added for various school subjects. Raw evaluation data is published in an eval directory, and a large badcase/defect repository reportedly exceeds two million examples. Scores are computed by grading each question on a 1–5 scale and normalizing to a 100-point system. Leaderboards report accuracy, average response time, average token consumption and estimated cost per thousand calls. The repo is versioned with detailed changelogs and weekly model updates.
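A minimal sketch of the scoring step described above: per-question grades on a 1–5 scale are averaged and mapped onto a 100-point scale. The linear mapping used here (1 maps to 0, 5 maps to 100) is an assumption; the project's exact normalization formula is not given in this summary.

```python
def normalize_scores(question_grades: list[int]) -> float:
    """Map per-question 1-5 grades onto a 100-point scale.

    Assumes a simple linear rescaling of the average grade; adjust if the
    repository documents a different formula.
    """
    if not question_grades:
        return 0.0
    avg = sum(question_grades) / len(question_grades)   # average grade in [1, 5]
    return (avg - 1) / 4 * 100                          # rescale to [0, 100]
```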
Use Cases
Helps researchers, engineers and product teams compare Chinese LLM performance across many realistic, domain-specific tasks and curricula. The fine-grained dimensions and public raw outputs enable error analysis, reproducibility and model debugging by exposing failure cases alongside quantitative metrics for accuracy, latency, token use and monetary cost; a hedged sketch of such an analysis follows this section. Teams without in-house evaluation pipelines can use the published datasets and leaderboards to inform model selection and procurement. Educational and exam-specific datasets (including multi-year gaokao items and new multi-modal school-subject sets) support evaluation for tutoring or assessment applications. The large defect library supports safety testing, regression testing and targeted fine-tuning. The project also offers contact channels for free private-model evaluation services and runs a community for ongoing feedback and benchmarking updates.
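For teams mining the published raw outputs for failure cases, the sketch below collects low-scoring records for error analysis. The directory layout and record fields ("score", "question", "model_answer") are assumptions made for illustration and would need to be adapted to the repository's actual published format.

```python
import json
from pathlib import Path

def collect_failures(eval_dir: str, threshold: int = 3) -> list[dict]:
    """Scan raw evaluation outputs (assumed JSON files under an eval/ directory)
    and collect items scored below `threshold` for error analysis."""
    failures = []
    for path in Path(eval_dir).glob("**/*.json"):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)            # assumed: each file holds a list of per-question records
        for rec in records:
            if rec.get("score", 5) < threshold:
                failures.append({
                    "file": str(path),
                    "question": rec.get("question"),
                    "model_answer": rec.get("model_answer"),
                    "score": rec.get("score"),
                })
    return failures
```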
