multimodal-agents-course

Basic Information

This repository is an open-source, hands-on course called the Kubrick Course that teaches developers how to build production-ready multimodal AI agents capable of understanding video, images, audio, and text. The course focuses on building an MCP (Model Context Protocol) server for video processing, designing custom agent clients, and integrating multimodal pipelines and observability tools. It is aimed at engineers who want to move beyond simple tutorials and learn how to architect, implement, and operate agentic systems that combine vision-language models (VLMs), LLMs, prompt versioning, and streaming APIs. The materials include modular lessons, code examples, and a guided path to run a Kubrick agent that demonstrates video search, tool use, and end-to-end API integrations.

Features
- Structured five-module syllabus with step-by-step lessons, code, and video summaries.
- Examples and code for building a multimodal processing pipeline covering video, image, audio, and text.
- Guidance on creating MCP servers with FastMCP and exposing resources, tools, prompts, and APIs.
- Instructions for implementing agent clients, a memory layer backed by Pixeltable, and custom MCP tools.
- Integration examples with Groq and OpenAI models, a Groq-powered agent, and LLMOps observability via Opik, including prompt versioning and tracing.
- A GETTING_STARTED guide, production-oriented engineering best practices, and ready-made demos such as a video search engine and a Kubrick agent demo.
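The MCP-server material centers on registering tools and serving them to agent clients over a standard protocol. As a rough conceptual sketch only (not the course's actual code; the tool name and dispatch helper here are hypothetical), an MCP-style server pairs each tool with metadata and routes incoming `tools/list` and `tools/call` requests to the registered function:

```python
from typing import Callable, Dict

# Hypothetical, minimal stand-in for the plumbing FastMCP automates:
# a registry mapping tool names to callables plus their descriptions.
TOOLS: Dict[str, dict] = {}

def tool(name: str, description: str) -> Callable:
    """Register a function as an MCP-style tool."""
    def decorator(fn: Callable) -> Callable:
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return decorator

@tool("get_video_clip", "Return a clip from a processed video (hypothetical tool).")
def get_video_clip(video_id: str, start_s: float, end_s: float) -> dict:
    # A real implementation would slice frames from an indexed video.
    return {"video_id": video_id, "start_s": start_s, "end_s": end_s}

def handle_request(request: dict) -> dict:
    """Dispatch a JSON-RPC-style MCP request to the tool registry."""
    if request.get("method") == "tools/list":
        return {"tools": [
            {"name": n, "description": t["description"]} for n, t in TOOLS.items()
        ]}
    if request.get("method") == "tools/call":
        params = request["params"]
        fn = TOOLS[params["name"]]["fn"]
        return {"result": fn(**params["arguments"])}
    return {"error": "unknown method"}
```

With FastMCP itself, this boilerplate collapses to decorating plain Python functions and starting the server; the input schema is derived from the function's type hints rather than written by hand.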
Use Cases
The course helps ML engineers, software engineers, and data practitioners learn practical skills for designing, building, and operating multimodal agent systems. Learners gain hands-on experience creating MCP servers and clients, connecting VLMs and LLMs, implementing stateful memory, and exposing agent tools as production APIs. It teaches prompt versioning and monitoring practices with Opik to improve reproducibility and observability. The repo includes runnable examples, a recommended setup workflow, and guidance on using freemium model providers so participants can experiment at low cost. By following the modules, users can produce a working video-capable agent, a multimodal pipeline, and an API-based agent interface suitable for further development or integration.
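The prompt-versioning practice taught with Opik can be illustrated by a toy in-memory store (the class and prompt names below are hypothetical; Opik itself persists versions in its backend and links them to traces): each commit of a changed template yields a new content-derived version identifier, so any run can be reproduced against the exact prompt it used.

```python
import hashlib
from typing import Dict, List

class PromptRegistry:
    """Toy in-memory prompt store: each changed template gets a content-hashed version."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[dict]] = {}

    def commit(self, name: str, template: str) -> str:
        # Derive a short, stable version id from the template text.
        version = hashlib.sha256(template.encode()).hexdigest()[:8]
        history = self._versions.setdefault(name, [])
        # Skip no-op commits so the history only records real changes.
        if not history or history[-1]["version"] != version:
            history.append({"version": version, "template": template})
        return version

    def latest(self, name: str) -> dict:
        return self._versions[name][-1]

registry = PromptRegistry()
v1 = registry.commit("video_qa", "Answer questions about {video_title}.")
v2 = registry.commit("video_qa", "Answer concisely about {video_title}.")
```

The same idea underlies hosted prompt libraries: because versions are content-addressed and immutable, a trace can record which prompt version produced an answer, which is what makes agent behavior auditable after the fact.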
