Magma
Basic Information
This repository provides a research-grade foundation model and codebase for building, training, evaluating, and deploying multimodal AI agents. It centers on Magma, a vision-language-action foundation model designed to perceive images and videos, reason about spatiotemporal content, and produce goal-driven action plans across digital and physical tasks. The repo includes pretraining and finetuning pipelines, data preprocessing tools, Set-of-Mark (SoM) and Trace-of-Mark (ToM) generation code, inference examples, references to released Magma-8B checkpoints, evaluation scripts, and several agent demos (UI navigation, gaming, robot visual planning). It also supplies a FastAPI server and Docker deployment for serving the model as a REST API. The materials are intended for researchers and developers to reproduce experiments, adapt the model for downstream tasks, and prototype multimodal agent behaviors.
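As a rough illustration of the inference path mentioned above, the sketch below loads the model through Hugging Face transformers. The checkpoint id microsoft/Magma-8B, the trust_remote_code requirement, and the prompt format are assumptions based on how remote-code multimodal checkpoints are typically published, not details confirmed by this section; consult the repo's inference examples for the exact usage.

```python
# Minimal inference sketch (assumptions: the checkpoint is published as
# "microsoft/Magma-8B" on the Hugging Face Hub and ships custom modeling
# code, hence trust_remote_code=True; the prompt format is illustrative).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Magma-8B"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Any UI screenshot or robot camera frame works as visual input here.
image = Image.open("screenshot.png")
prompt = "<image>\nWhat should the agent do next to complete the task?"

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```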
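For the FastAPI/Docker deployment, a client call might look like the following. The base URL, the /generate route, and the JSON field names are hypothetical placeholders, since this section does not document the server's actual schema; check the repo's server code for the real endpoints.

```python
# Hypothetical client for the FastAPI deployment. The host/port, the
# /generate route, and the request/response fields are illustrative
# assumptions -- verify them against the repo's server implementation.
import base64

import requests

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:8000/generate",  # assumed host, port, and route
    json={
        "image": image_b64,
        "prompt": "What should the agent do next?",
        "max_new_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```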