Phoenix

Basic Information

Phoenix is an open-source AI observability platform for experimentation, evaluation, and troubleshooting of LLM applications. It captures traces of model calls, benchmarks responses and retrievals, and organizes versioned datasets and experiments for evaluating prompts, models, and retrieval components. The platform provides a playground for prompt engineering, tools for prompt management with versioning and tagging, and the ability to replay traced LLM calls. Phoenix is vendor- and language-agnostic and integrates with popular frameworks and providers through OpenTelemetry-based instrumentation. It can run locally, in notebooks, in containers, or in cloud deployments, and is distributed as a full Python package plus lighter subpackages and JavaScript/TypeScript clients for working with deployed Phoenix instances.
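
For example, a minimal local setup, assuming the arize-phoenix and arize-phoenix-otel packages and an OpenInference instrumentor are installed (names follow the Phoenix documentation, though exact signatures can vary by version), looks roughly like this:

    import phoenix as px
    from phoenix.otel import register
    from openinference.instrumentation.openai import OpenAIInstrumentor

    # Start Phoenix in-process; the UI is served locally (port 6006 by default).
    session = px.launch_app()

    # Register an OpenTelemetry tracer provider that exports spans to Phoenix.
    tracer_provider = register(project_name="my-llm-app")

    # Instrument the OpenAI client so each LLM call is captured as a trace.
    OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Once instrumented, subsequent LLM calls appear as traces in the local Phoenix UI.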

App Details

Features
Phoenix offers OpenTelemetry-based tracing to record the runtime behavior of LLM applications. It includes evaluation tooling to run LLM response and retrieval evaluations and to benchmark relevance and answer quality. The platform supports creating and versioning datasets and experiments to compare prompts, models, and retrieval setups. A playground enables prompt optimization, model comparison, parameter tuning, and trace replay. Prompt management features support version control, tagging, and systematic testing. Packaging includes the full arize-phoenix package plus subpackages such as arize-phoenix-otel, arize-phoenix-client, and arize-phoenix-evals, as well as JavaScript packages including phoenix-client, phoenix-evals, and phoenix-mcp. Deployments are available via pip, conda, Docker images, and Kubernetes.
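
As a sketch of the evaluation tooling, assuming arize-phoenix-evals is installed and an OpenAI API key is configured (the template constants and expected column names below follow the Phoenix docs but may differ across versions), a retrieval-relevance evaluation could look like:

    import pandas as pd
    from phoenix.evals import (
        OpenAIModel,
        RAG_RELEVANCY_PROMPT_RAILS_MAP,
        RAG_RELEVANCY_PROMPT_TEMPLATE,
        llm_classify,
    )

    # Each row pairs a user query ("input") with a retrieved document ("reference").
    df = pd.DataFrame(
        {
            "input": ["What does Phoenix trace?"],
            "reference": ["Phoenix records traces of LLM calls and retrievals."],
        }
    )

    # An LLM judge labels each retrieved document as relevant or irrelevant.
    results = llm_classify(
        dataframe=df,
        model=OpenAIModel(model="gpt-4o-mini"),
        template=RAG_RELEVANCY_PROMPT_TEMPLATE,
        rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,
    )
    print(results[["label", "explanation"]])
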
Use Cases
Phoenix helps developers and ML engineers observe, diagnose, and improve LLM applications by providing end-to-end telemetry and evaluation workflows. Tracing captures detailed runtime context so issues can be reproduced and performance bottlenecks identified. Evaluation tools and versioned datasets enable systematic benchmarking and A/B testing of prompts and models. Experiments track changes to prompts, LLMs, and retrieval setups so teams can measure impact and catch regressions. The playground and prompt management features accelerate prompt engineering and model selection. Vendor- and framework-agnostic integrations reduce integration work across OpenAI, Bedrock, Vertex AI, LangChain, LlamaIndex, and others. Lightweight clients and container images simplify deployment in local, notebook, containerized, and cloud environments.
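
A rough sketch of the dataset-and-experiment workflow, assuming a running Phoenix instance and the phoenix.experiments API described in the docs (the upload arguments, task, and evaluator here are illustrative placeholders), might look like:

    import phoenix as px
    from phoenix.experiments import run_experiment

    # Upload a small dataset of inputs and expected outputs to the Phoenix server.
    dataset = px.Client().upload_dataset(
        dataset_name="faq-examples",
        inputs=[{"question": "What does Phoenix trace?"}],
        outputs=[{"answer": "LLM calls and retrievals"}],
    )

    def task(example):
        # Hypothetical stand-in for the application under test, e.g. an LLM pipeline.
        return "Phoenix records traces of LLM calls and retrievals."

    def contains_answer(output, expected):
        # Hypothetical evaluator: naive substring check against the expected answer.
        return float(expected["answer"].lower() in str(output).lower())

    # Run the task over each dataset example and score the outputs.
    experiment = run_experiment(dataset, task, evaluators=[contains_answer])

Results are stored with the experiment, so later prompt or model changes can be compared against the same dataset.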
