gemini-multimodal-playground

Basic Information

This repository provides a Python application and an accompanying frontend for interacting with Google's Gemini 2.0 model through multimodal inputs. It runs real-time conversations that accept voice, live camera video, and screen-sharing input while producing audio responses. The project ships in two forms: a full-stack web application with a Python backend and a Node frontend, and a standalone Python script that runs as a desktop app. The README documents prerequisites such as Python 3.12+, Node.js 18+, a Google Cloud account, and a Gemini API key, and explains how to configure the application via environment variables and local servers. The repo is intended as a playground for running, demoing, and experimenting with real-time multimodal interactions with Gemini.
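As a rough illustration of the environment-variable setup the README describes, a minimal sketch of reading the key might look like this (it assumes the python-dotenv package; only the GEMINI_API_KEY variable name comes from the README, and the repo's actual loading code may differ):

```python
# Minimal sketch: load GEMINI_API_KEY from a local .env file.
# Assumes python-dotenv is installed; illustrative, not the repo's code.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; create a .env file as the README describes")
```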

App Details

Features
- Real-time multimodal I/O: webcam video, screen sharing, and microphone input, with audio responses.
- Two versions: a full-stack web version with a backend and a Node frontend, and a standalone Python script with a simple GUI.
- Configuration options: a customizable system prompt, selectable input mode (video or screen sharing), five voice choices (Puck, Charon, Kore, Fenrir, and Aoede), an option to enable Google search for current information, and an "allow interruptions" toggle (sketched in code after this list).
- Documented setup: step-by-step instructions for the backend and frontend, creation of a .env file with GEMINI_API_KEY, and platform notes about Tkinter for the standalone build.
- Troubleshooting: guidance on audio feedback loop issues and mitigation strategies.
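The option set above maps naturally onto a small settings object. The sketch below is purely illustrative (the class and field names are hypothetical, not the repository's actual API), but it captures the documented choices:

```python
# Hypothetical settings object mirroring the documented options; the names
# here are illustrative and not taken from the repository's code.
from dataclasses import dataclass
from typing import Literal

Voice = Literal["Puck", "Charon", "Kore", "Fenrir", "Aoede"]  # the five documented voices

@dataclass
class PlaygroundConfig:
    system_prompt: str = "You are a helpful assistant."  # customizable system prompt
    input_mode: Literal["camera", "screen"] = "camera"   # webcam video or screen sharing
    voice: Voice = "Puck"                                # synthesized voice selection
    enable_google_search: bool = False                   # ground answers in current information
    allow_interruptions: bool = False                    # let the user talk over the model
```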
Use Cases
This project helps developers and power users prototype and demo conversational agents that combine voice, vision, and screen context. It makes it straightforward to connect a local or self-hosted frontend to Gemini 2.0 through the documented backend, to test different system prompts and voice behaviors, and to compare the browser-based full-stack deployment against the standalone desktop experience. The configurable options let users experiment with live search, interruption behavior, and the available synthesized voices to evaluate interaction flows. The included troubleshooting tips and setup commands reduce friction when running locally, and the sample streams and demos illustrate how real-time multimodal exchanges look in practice.
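One concrete experiment the troubleshooting notes invite is how interruption handling interacts with the audio feedback loop: if the microphone picks up the model's own speech from the speakers, the model can end up interrupting itself. A common mitigation, shown here as an assumed half-duplex gate rather than code from the repository, is to drop microphone audio while model audio is playing, which is one way an "allow interruptions" toggle can be implemented:

```python
# Assumed half-duplex gating sketch: mute the mic while the model is speaking
# so speaker output is not fed back into the conversation. Illustrative only;
# not the repository's actual mitigation code.
import threading

model_speaking = threading.Event()

def on_model_audio_start() -> None:
    model_speaking.set()    # playback began: stop forwarding mic chunks

def on_model_audio_end() -> None:
    model_speaking.clear()  # playback ended: resume forwarding mic chunks

def should_send_mic_chunk() -> bool:
    # Drop chunks captured while the model is talking; with gating disabled
    # ("allow interruptions" on), every chunk would be forwarded instead.
    return not model_speaking.is_set()
```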
