gemini-multimodal-playground

Basic Information

This repository provides a Python application and an accompanying frontend for interacting with Google's Gemini 2.0 model through multimodal inputs. It runs real-time conversations that accept voice, live camera video, and screen-sharing input while producing audio responses. The project ships in two forms: a full-stack web application with a Python backend and a Node frontend, and a standalone Python script that runs as a desktop app. The README documents prerequisites such as Python 3.12+, Node.js 18+, a Google Cloud account, and a Gemini API key, and explains how to configure the application via environment variables and local servers. The repo is intended as a playground for running, demoing, and experimenting with real-time multimodal interactions with Gemini.
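As a rough illustration of the environment-variable setup the README describes, a minimal sketch of reading the key might look like this (it assumes the python-dotenv package; only the GEMINI_API_KEY variable name comes from the README, and the repo's actual loading code may differ):

```python
# Minimal sketch: load GEMINI_API_KEY from a local .env file.
# Assumes python-dotenv is installed; illustrative, not the repo's code.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; create a .env file as the README describes")
```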

App Details

Features
- Real-time multimodal I/O: webcam video, screen sharing, and microphone input, with audio responses.
- Two versions: a full-stack web version with a backend and a Node frontend, and a standalone Python script with a simple GUI.
- Configuration options: a customizable system prompt, selectable input mode (video or screen sharing), five voice choices (Puck, Charon, Kore, Fenrir, and Aoede), an option to enable Google search for current information, and an "allow interruptions" toggle (sketched in code after this list).
- Documented setup: step-by-step instructions for the backend and frontend, creation of a .env file with GEMINI_API_KEY, and platform notes about Tkinter for the standalone build.
- Troubleshooting: guidance on audio feedback loop issues and mitigation strategies.
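The option set above maps naturally onto a small settings object. The sketch below is purely illustrative (the class and field names are hypothetical, not the repository's actual API), but it captures the documented choices:

```python
# Hypothetical settings object mirroring the documented options; the names
# here are illustrative and not taken from the repository's code.
from dataclasses import dataclass
from typing import Literal

Voice = Literal["Puck", "Charon", "Kore", "Fenrir", "Aoede"]  # the five documented voices

@dataclass
class PlaygroundConfig:
    system_prompt: str = "You are a helpful assistant."  # customizable system prompt
    input_mode: Literal["camera", "screen"] = "camera"   # webcam video or screen sharing
    voice: Voice = "Puck"                                # synthesized voice selection
    enable_google_search: bool = False                   # ground answers in current information
    allow_interruptions: bool = False                    # let the user talk over the model
```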
Use Cases
This project helps developers and power users prototype and demo conversational agents that combine voice, vision, and screen context. It makes it straightforward to connect a local or self-hosted frontend to Gemini 2.0 through the documented backend, to test different system prompts and voice behaviors, and to compare the browser-based full-stack deployment against the standalone desktop experience. The configurable options let users experiment with live search, interruption behavior, and the available synthesized voices to evaluate interaction flows. The included troubleshooting tips and setup commands reduce friction when running locally, and the sample streams and demos illustrate how real-time multimodal exchanges look in practice.
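One concrete experiment the troubleshooting notes invite is how interruption handling interacts with the audio feedback loop: if the microphone picks up the model's own speech from the speakers, the model can end up interrupting itself. A common mitigation, shown here as an assumed half-duplex gate rather than code from the repository, is to drop microphone audio while model audio is playing, which is one way an "allow interruptions" toggle can be implemented:

```python
# Assumed half-duplex gating sketch: mute the mic while the model is speaking
# so speaker output is not fed back into the conversation. Illustrative only;
# not the repository's actual mitigation code.
import threading

model_speaking = threading.Event()

def on_model_audio_start() -> None:
    model_speaking.set()    # playback began: stop forwarding mic chunks

def on_model_audio_end() -> None:
    model_speaking.clear()  # playback ended: resume forwarding mic chunks

def should_send_mic_chunk() -> bool:
    # Drop chunks captured while the model is talking; with gating disabled
    # ("allow interruptions" on), every chunk would be forwarded instead.
    return not model_speaking.is_set()
```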
