Basic Information

Giskard is an open-source Python library and evaluation framework designed to automatically detect performance, bias and security issues in AI systems, with specific support for LLM-based and RAG applications as well as traditional ML models. It provides tools to wrap model calls, run automated scans that identify problems such as hallucinations, harmful generation, prompt injection, robustness failures and sensitive information disclosure, and produce actionable scan reports. The repository includes a RAG Evaluation Toolkit (RAGET) that generates evaluation datasets and measures component-level performance of RAG applications. It targets developers and ML engineers who need systematic, reproducible testing, and it integrates with common tooling and workflows. The project is delivered as a Python package, supports Python 3.9–3.11, and offers notebooks and community resources for getting started.
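
A minimal sketch of the wrap-and-scan workflow, loosely following the project's quickstart; the `ask_llm` helper, the `question` column and the model name/description are illustrative placeholders, and exact argument names may vary between Giskard versions:

```python
import pandas as pd
import giskard

def ask_llm(question: str) -> str:
    # Placeholder: replace with a call to your own LLM or RAG chain.
    return "stub answer"

def model_predict(df: pd.DataFrame) -> list:
    # Giskard passes a batch of inputs as a pandas DataFrame and
    # expects one output per row.
    return [ask_llm(q) for q in df["question"]]

# The name and description help Giskard generate domain-specific probes.
wrapped_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Product support assistant",
    description="Answers questions about the product documentation.",
    feature_names=["question"],
)

# Run the automated scan and export the findings as an HTML report.
scan_results = giskard.scan(wrapped_model)
scan_results.to_html("scan_results.html")
```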

App Details

Features

Giskard offers automated scanning of LLMs and ML models that detects a range of issues, including hallucinations, harmful content generation, prompt injection, robustness problems, sensitive data leakage and bias. It provides a simple Python model wrapper interface that accepts pandas DataFrames and a scan function that returns detailed results that can be displayed inline or exported to HTML. The RAGET module can automatically generate QA testsets from a knowledge base and computes scores for RAG components such as the generator, retriever, rewriter, router and knowledge base. The repo includes examples using LangChain and FAISS, utilities to save and load generated testsets, and support for multiple question types; it also mentions companion tooling for computer vision evaluation.
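
A hedged sketch of the RAGET testset workflow, based on the published examples; the CSV path, the `text` column and the question count are assumptions, and the API may differ slightly across versions:

```python
import pandas as pd
from giskard.rag import KnowledgeBase, QATestset, generate_testset

# Illustrative knowledge base: one text chunk per row of a CSV file.
df = pd.read_csv("knowledge_base.csv")  # assumed to contain a "text" column
knowledge_base = KnowledgeBase.from_pandas(df, columns=["text"])

# Generate a QA testset spanning several question types (this calls an LLM).
testset = generate_testset(
    knowledge_base,
    num_questions=60,
    agent_description="A chatbot answering questions about the product docs",
)

# Save the generated testset and reload it for later evaluation runs.
testset.save("raget_testset.jsonl")
loaded_testset = QATestset.load("raget_testset.jsonl")
```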
Use Cases

Giskard helps teams find and quantify weaknesses in AI systems early in development by automating detection of critical issues that affect safety, correctness and fairness. The scan and RAGET workflows let engineers generate targeted evaluation datasets, reproduce failures, and track regressions via saved test suites. Component-level scoring for RAG systems clarifies which part of a pipeline needs improvement, and the generated testsets provide realistic questions and reference answers for continuous validation. Integration examples and notebook workflows make it straightforward to plug into existing ML pipelines and CI processes. The project is oriented toward practitioners who need repeatable, explainable model assessments and tooling to improve model reliability and reduce deployment risk.
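
Continuing the sketches above (and reusing the `testset`, `knowledge_base` and `scan_results` objects defined there), a hedged example of producing a component-level evaluation report and a reusable regression test suite; the answer function body is a placeholder for your own pipeline:

```python
from giskard.rag import evaluate

def answer_fn(question: str, history=None) -> str:
    # Placeholder: call your RAG pipeline here and return its answer as a string.
    return "stub answer"

# Score the pipeline on the generated testset; the report breaks results down
# by RAG component (generator, retriever, rewriter, router, knowledge base).
report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)
report.to_html("rag_evaluation_report.html")

# A scan can also be converted into a reusable test suite to track regressions.
test_suite = scan_results.generate_test_suite("LLM regression suite")
test_suite.run()
```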
