Basic Information

optillm is an OpenAI API-compatible inference proxy designed to improve the accuracy and performance of large language models at inference time. It is intended for developers and researchers who want to apply state-of-the-art inference-time techniques to boost reasoning on coding, logical, and mathematical queries without changing client code. The proxy can run locally or forward requests to remote providers, exposes the same chat completions endpoint as OpenAI, and acts as a drop-in replacement that requires only a change to the client's base_url. It supports running a built-in local inference server with HuggingFace models and LoRAs, wrapping other providers via LiteLLM, and connecting to external model servers. The project centralizes many optimization strategies and plugins so engineers can experiment with and deploy inference-time enhancements inside existing tools and workflows.
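As a sketch of the drop-in usage described above, the snippet below points the standard OpenAI Python client at a locally running proxy. The localhost address, port 8000, and the /v1 path are assumptions for a typical local deployment and may differ from your configuration.

```python
from openai import OpenAI

# Point the standard OpenAI client at the optillm proxy instead of api.openai.com.
# The host, port, and /v1 path are assumptions for a typical local setup;
# adjust them to match how the proxy was started.
client = OpenAI(
    api_key="sk-...",                        # forwarded to the upstream provider
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",                     # any model the configured provider serves
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(response.choices[0].message.content)
```

Because the proxy speaks the same chat completions protocol, existing tools that accept a custom base_url should work without further changes.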

App Details

Features
The repository implements many inference-time optimization approaches, including CePO, CoT with reflection, PlanSearch, ReRead, Self-Consistency, Z3 solver integration, the R* algorithm, LEAP, Round-Trip Optimization, Best-of-N sampling, Mixture-of-Agents, MCTS, the Prover-Verifier Game (PVG), CoT decoding, Entropy decoding, and AutoThink. It includes plugins for system prompt learning, deep think, long-context processing, majority voting, MCP client integration, routing, chain-of-code, memory, privacy/anonymization, URL reading, code execution, structured JSON outputs, generative selection, web search, and deep research. optillm also supports provider flexibility (OpenAI, Azure, Cerebras, LiteLLM, local HuggingFace), LoRA stacking, configurable parameters, CLI and Docker deployment, per-request approach control via model-name slugs or request fields (see the sketch below), and an automated test suite with CI.
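The per-request approach control works in two ways: by prefixing the model name with a technique's slug, or by naming the technique in the request body. The sketch below shows both patterns; the "moa-" prefix, the "bon" slug, and the optillm_approach field are drawn from the project's documented conventions but should be checked against the current README.

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="http://localhost:8000/v1")

# Option 1: select the technique by prefixing the model name with its slug.
# The "moa-" prefix (Mixture-of-Agents) is an example slug; consult the README
# for the exact slug of each approach.
response = client.chat.completions.create(
    model="moa-gpt-4o-mini",
    messages=[{"role": "user", "content": "Prove that the sum of two even numbers is even."}],
)

# Option 2: keep the plain model name and select the approach through a request
# field instead; the "optillm_approach" key is assumed from the project's
# documentation and is passed via the SDK's extra_body escape hatch.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Prove that the sum of two even numbers is even."}],
    extra_body={"optillm_approach": "bon"},
)
print(response.choices[0].message.content)
```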
Use Cases
optillm helps teams and researchers get better results from existing models by applying additional inference-time compute and strategies that improve reasoning and coding performance. As a transparent OpenAI-compatible proxy, it integrates with existing clients and tools with minimal changes, enabling experimentation with individual techniques or with combinations run as a pipeline or in parallel. The MCP plugin lets models securely access filesystem, search, and database tools to enrich context. Built-in local inference, LoRA support, and provider wrapping let users run private or custom models and apply specialized decoding methods. The README documents benchmark results and state-of-the-art improvements on public evaluations, and provides configuration, Docker, and testing guidance to support deployment, evaluation, and reproducible experimentation.
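To illustrate the combined-technique experimentation mentioned above, the sketch below requests a pipelined and a parallel combination through the same request field. The "&" (pipeline) and "|" (parallel) separators are assumptions based on the project's documented combination syntax and should be verified against the README.

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="http://localhost:8000/v1")

# Assumed combination syntax: "&" chains approaches as a pipeline (each step's
# output feeds the next), while "|" runs them in parallel and returns one
# response per approach. Verify both separators against the current README.
pipelined = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Plan, then write a binary search in Python."}],
    extra_body={"optillm_approach": "plansearch&moa"},
)

parallel = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Plan, then write a binary search in Python."}],
    extra_body={"optillm_approach": "cot_reflection|bon"},
)
```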
