Multi-Modal Mastery
OpenClaw isn't limited to text. The Media Pipeline allows the agent to "see" your screen, "hear" your voice memos, and process incoming documents.
How it Works
- Ingestion: Media is uploaded via a channel (e.g., an image sent on WhatsApp).
- Normalization: The gateway resizes or compresses the media to meet LLM input requirements (e.g., 512x512 for vision models).
- Embedding/Analysis: The media is sent to the vision/audio model, and the resulting description or transcription is fed into the main conversation loop.
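The three stages above can be sketched as a simple pipeline. This is an illustrative outline, not OpenClaw's actual API: the `Media` class, function names, and the 512-pixel limit are assumptions standing in for the real gateway code.

```python
from dataclasses import dataclass

# Hypothetical sketch of the pipeline stages; names are illustrative,
# not OpenClaw's actual internals.

@dataclass
class Media:
    kind: str          # "image" | "audio" | "document"
    width: int = 0
    height: int = 0

MAX_DIM = 512  # example vision-model input limit from the text

def ingest(payload: dict) -> Media:
    """Stage 1: wrap an incoming channel payload (e.g., a WhatsApp image)."""
    return Media(kind=payload["kind"],
                 width=payload.get("width", 0),
                 height=payload.get("height", 0))

def normalize(m: Media) -> Media:
    """Stage 2: downscale oversized images, preserving aspect ratio."""
    if m.kind == "image" and max(m.width, m.height) > MAX_DIM:
        scale = MAX_DIM / max(m.width, m.height)
        m.width, m.height = int(m.width * scale), int(m.height * scale)
    return m

def analyze(m: Media) -> str:
    """Stage 3: placeholder for the vision/audio model call; the returned
    text is what gets fed into the main conversation loop."""
    return f"[{m.kind} described at {m.width}x{m.height}]"

if __name__ == "__main__":
    media = ingest({"kind": "image", "width": 2048, "height": 1024})
    print(analyze(normalize(media)))  # description joins the chat context
```

A real implementation would decode actual image bytes (e.g., with Pillow) and call a model endpoint in stage 3; the shape of the flow is what matters here.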
Local Processing: If you use a local model with vision support (e.g., LLaVA), the media pipeline ensures that files never leave your local workspace.
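One way to picture the local-processing guarantee is as a routing decision: media bytes go to a localhost endpoint when a local vision model is configured, and only otherwise to a remote API. The model names and both endpoint URLs below are assumptions for illustration, not OpenClaw configuration values.

```python
# Hypothetical routing sketch: local vision models keep media on-machine.
# The model set and endpoints are illustrative placeholders.

LOCAL_VISION_MODELS = {"llava"}  # assumed set of locally hosted models

def route_media(model: str) -> str:
    """Return the endpoint a media payload would be sent to (illustrative)."""
    if model.lower() in LOCAL_VISION_MODELS:
        # Files stay in the local workspace: nothing crosses the network
        # beyond the loopback interface.
        return "http://127.0.0.1:11434/api/generate"
    # Placeholder remote endpoint; a cloud vision API would go here.
    return "https://api.example.com/v1/vision"

print(route_media("llava"))   # loopback address: media never leaves the host
```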