Multi-Modal Mastery
OpenClaw isn't limited to text. The Media Pipeline allows the agent to "see" your screen, "hear" your voice memos, and process incoming documents.
How it Works
- Ingestion: Media is uploaded via a channel (e.g., an image sent on WhatsApp).
- Normalization: The gateway resizes or compresses the media to meet LLM input requirements (e.g., 512x512 for vision models).
- Embedding/Analysis: The media is sent to the vision/audio model, and the resulting description or transcription is fed into the main conversation loop.
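The three stages above can be sketched as a simple pipeline. This is an illustrative outline, not OpenClaw's actual API: the `Media` class, function names, and the 512-pixel limit are assumptions standing in for the real gateway code.

```python
from dataclasses import dataclass

# Hypothetical sketch of the pipeline stages; names are illustrative,
# not OpenClaw's actual internals.

@dataclass
class Media:
    kind: str          # "image" | "audio" | "document"
    width: int = 0
    height: int = 0

MAX_DIM = 512  # example vision-model input limit from the text

def ingest(payload: dict) -> Media:
    """Stage 1: wrap an incoming channel payload (e.g., a WhatsApp image)."""
    return Media(kind=payload["kind"],
                 width=payload.get("width", 0),
                 height=payload.get("height", 0))

def normalize(m: Media) -> Media:
    """Stage 2: downscale oversized images, preserving aspect ratio."""
    if m.kind == "image" and max(m.width, m.height) > MAX_DIM:
        scale = MAX_DIM / max(m.width, m.height)
        m.width, m.height = int(m.width * scale), int(m.height * scale)
    return m

def analyze(m: Media) -> str:
    """Stage 3: placeholder for the vision/audio model call; the returned
    text is what gets fed into the main conversation loop."""
    return f"[{m.kind} described at {m.width}x{m.height}]"

if __name__ == "__main__":
    media = ingest({"kind": "image", "width": 2048, "height": 1024})
    print(analyze(normalize(media)))  # description joins the chat context
```

A real implementation would decode actual image bytes (e.g., with Pillow) and call a model endpoint in stage 3; the shape of the flow is what matters here.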
Local Processing: If you use a local model with vision support (e.g., LLaVA), the media pipeline ensures that files never leave your local workspace.
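One way to picture the local-processing guarantee is as a routing decision: media bytes go to a localhost endpoint when a local vision model is configured, and only otherwise to a remote API. The model names and both endpoint URLs below are assumptions for illustration, not OpenClaw configuration values.

```python
# Hypothetical routing sketch: local vision models keep media on-machine.
# The model set and endpoints are illustrative placeholders.

LOCAL_VISION_MODELS = {"llava"}  # assumed set of locally hosted models

def route_media(model: str) -> str:
    """Return the endpoint a media payload would be sent to (illustrative)."""
    if model.lower() in LOCAL_VISION_MODELS:
        # Files stay in the local workspace: nothing crosses the network
        # beyond the loopback interface.
        return "http://127.0.0.1:11434/api/generate"
    # Placeholder remote endpoint; a cloud vision API would go here.
    return "https://api.example.com/v1/vision"

print(route_media("llava"))   # loopback address: media never leaves the host
```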