Media Understanding - Inbound (2026-01-17)
OpenClaw can summarize inbound media (image/audio/video) before the reply pipeline runs. It auto‑detects when local tools or provider keys are available, and it can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual. Vendor-specific media behavior is registered by vendor plugins, while OpenClaw core owns the shared `tools.media` config, the fallback order, and reply-pipeline integration.
Goals
- Optional: pre‑digest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support provider APIs and CLI fallbacks.
- Allow multiple models with ordered fallback (error/size/timeout).
High-level behavior
- Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
- For each enabled capability (image/audio/video), select attachments per policy (default: first).
- Choose the first eligible model entry (size + capability + auth).
- If a model fails or the media is too large, fall back to the next entry.
- On success:
  - `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
  - Audio sets `{{Transcript}}`; command parsing uses caption text when present, otherwise the transcript.
  - Captions are preserved as `User text:` inside the block.
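For illustration, an audio attachment with a caption might produce a body block shaped roughly like this (only the `[Audio]` label, the `User text:` caption line, and the transcript content are grounded in the text above; the exact layout is an assumption):

```text
[Audio]
User text: can you summarize this voice note?
<transcript or summary text>
```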
Config overview
`tools.media` supports shared models plus per‑capability overrides:
- `tools.media.models`: shared model list (use `capabilities` to gate).
- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
  - defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
  - provider overrides (`baseUrl`, `headers`, `providerOptions`)
  - Deepgram audio options via `tools.media.audio.providerOptions.deepgram`
  - audio transcript echo controls (`echoTranscript`, default `false`; `echoFormat`)
  - optional per‑capability `models` list (preferred before shared models)
  - `attachments` policy (`mode`, `maxAttachments`, `prefer`)
  - `scope` (optional gating by channel/chatType/session key)
- `tools.media.concurrency`: max concurrent capability runs (default 2).
Model entries
Each `models[]` entry can be provider or CLI. Template variables available to CLI commands:
- `{{MediaDir}}` (directory containing the media file)
- `{{OutputDir}}` (scratch dir created for this run)
- `{{OutputBase}}` (scratch file base path, no extension)
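A CLI-based entry might use these template variables along the following lines (an illustrative sketch: the entry field names `type` and `command`, the model path, and the filename under `{{MediaDir}}` are assumptions, not documented values; the `whisper-cli` flags are standard whisper.cpp options):

```json
{
  "tools": {
    "media": {
      "audio": {
        "models": [
          {
            "type": "cli",
            "command": "whisper-cli -m /opt/models/ggml-tiny.bin -f {{MediaDir}}/audio.ogg -of {{OutputBase}} -otxt"
          }
        ]
      }
    }
  }
}
```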
Defaults and limits
Recommended defaults:
- `maxChars`: 500 for image/video (short, command‑friendly)
- `maxChars`: unset for audio (full transcript unless you set a limit)
- `maxBytes`:
  - image: 10MB
  - audio: 20MB
  - video: 50MB
- If media exceeds `maxBytes`, that model is skipped and the next model is tried.
- Audio files smaller than 1024 bytes are treated as empty/corrupt and skipped before provider/CLI transcription.
- If the model returns more than `maxChars`, the output is trimmed.
- `prompt` defaults to a simple “Describe the …” instruction plus the `maxChars` guidance (image/video only).
- If the active primary image model already supports vision natively, OpenClaw skips the `[Image]` summary block and passes the original image into the model instead.
- If `<capability>.enabled: true` but no models are configured, OpenClaw tries the active reply model when its provider supports the capability.
Auto-detect media understanding (default)
If `tools.media.<capability>.enabled` is not set to `false` and you haven’t
configured models, OpenClaw auto-detects in this order and stops at the first
working option:
- Active reply model when its provider supports the capability.
- `agents.defaults.imageModel` primary/fallback refs (image only).
- Local CLIs (audio only; if installed):
  - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
  - `whisper-cli` (whisper-cpp; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
  - `whisper` (Python CLI; downloads models automatically)
- Gemini CLI (`gemini`) using `read_many_files`
- Provider auth:
  - Configured `models.providers.*` entries that support the capability are tried before the bundled fallback order.
  - Image-only config providers with an image-capable model auto-register for media understanding even when they are not a bundled vendor plugin.
  - Bundled fallback order:
    - Audio: OpenAI → Groq → Deepgram → Google → Mistral
    - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
    - Video: Google → Qwen → Moonshot
- Configured CLI tools must be discoverable on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
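To see which of the local-CLI candidates above would be found on your machine, you can probe `PATH` yourself (a plain shell check, not an OpenClaw command):

```shell
# Probe PATH for the audio CLIs (plus the gemini CLI) in auto-detect order
for cli in sherpa-onnx-offline whisper-cli whisper gemini; do
  if command -v "$cli" >/dev/null 2>&1; then
    echo "found:   $cli"
  else
    echo "missing: $cli"
  fi
done
```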
Proxy environment support (provider models)
When provider-based audio and video media understanding is enabled, OpenClaw honors the standard outbound proxy environment variables for provider HTTP calls: `HTTPS_PROXY`, `HTTP_PROXY`, `https_proxy`, `http_proxy`.
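For example, to route provider calls through an outbound proxy (the proxy host and port here are placeholders):

```shell
# Standard outbound proxy variables; OpenClaw picks these up for provider HTTP calls
export HTTPS_PROXY="http://proxy.example.com:3128"
export HTTP_PROXY="http://proxy.example.com:3128"
```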
Capabilities (optional)
If you set `capabilities`, the entry only runs for those media types. For shared
lists, OpenClaw can infer defaults:
- `openai`, `anthropic`, `minimax`: image
- `minimax-portal`: image
- `moonshot`: image + video
- `openrouter`: image
- `google` (Gemini API): image + audio + video
- `qwen`: image + video
- `mistral`: audio
- `zai`: image
- `groq`: audio
- `deepgram`: audio
- Any `models.providers.<id>.models[]` catalog with an image-capable model: image

Set `capabilities` explicitly to avoid surprising matches. If you omit `capabilities`, the entry is eligible for the list it appears in.
Provider support matrix (OpenClaw integrations)
| Capability | Provider integration | Notes |
|---|---|---|
| Image | OpenAI, OpenRouter, Anthropic, Google, MiniMax, Moonshot, Qwen, Z.AI, config providers | Vendor plugins register image support; MiniMax and MiniMax OAuth both use MiniMax-VL-01; image-capable config providers auto-register. |
| Audio | OpenAI, Groq, Deepgram, Google, Mistral | Provider transcription (Whisper/Deepgram/Gemini/Voxtral). |
| Video | Google, Qwen, Moonshot | Provider video understanding via vendor plugins; Qwen video understanding uses the Standard DashScope endpoints. |
- `minimax` and `minimax-portal` image understanding comes from the plugin-owned `MiniMax-VL-01` media provider.
- The bundled MiniMax text catalog still starts text-only; explicit `models.providers.minimax` entries materialize image-capable M2.7 chat refs.
Model selection guidance
- Prefer the strongest latest-generation model available for each media capability when quality and safety matter.
- For tool-enabled agents handling untrusted inputs, avoid older/weaker media models.
- Keep at least one fallback per capability for availability (quality model + faster/cheaper model).
- CLI fallbacks (`whisper-cli`, `whisper`, `gemini`) are useful when provider APIs are unavailable.
- `parakeet-mlx` note: with `--output-dir`, OpenClaw reads `<output-dir>/<media-basename>.txt` when the output format is `txt` (or unspecified); non-`txt` formats fall back to stdout.
Attachment policy
Per‑capability `attachments` controls which attachments are processed:
- `mode`: `first` (default) or `all`
- `maxAttachments`: cap the number processed (default 1)
- `prefer`: `first`, `last`, `path`, `url`

With `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
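For example, to process up to two image attachments and prefer URL attachments, the documented keys can be combined like this (a sketch; only the keys and values named above are used):

```json
{
  "tools": {
    "media": {
      "image": {
        "attachments": {
          "mode": "all",
          "maxAttachments": 2,
          "prefer": "url"
        }
      }
    }
  }
}
```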
File-attachment extraction behavior:
- Extracted file text is wrapped as untrusted external content before it is appended to the media prompt.
- The injected block uses explicit boundary markers like `<<<EXTERNAL_UNTRUSTED_CONTENT id="...">>>` / `<<<END_EXTERNAL_UNTRUSTED_CONTENT id="...">>>` and includes a `Source: External` metadata line.
- This attachment-extraction path intentionally omits the long `SECURITY NOTICE:` banner to avoid bloating the media prompt; the boundary markers and metadata still remain.
- If a file has no extractable text, OpenClaw injects `[No extractable text]`.
- If a PDF falls back to rendered page images in this path, the media prompt keeps the placeholder `[PDF content rendered to images; images not forwarded to model]` because this attachment-extraction step forwards text blocks, not the rendered PDF images.
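Putting the pieces together, an injected extraction block is shaped roughly like this (the boundary-marker syntax and `Source: External` line are documented; the ordering and surrounding layout are illustrative):

```text
<<<EXTERNAL_UNTRUSTED_CONTENT id="...">>>
Source: External
<extracted file text>
<<<END_EXTERNAL_UNTRUSTED_CONTENT id="...">>>
```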
Config examples
1) Shared models list + overrides
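A sketch of a shared models list with per-capability overrides (the entry field `provider` is an assumption; the other keys are documented above):

```json
{
  "tools": {
    "media": {
      "models": [
        { "provider": "openai", "capabilities": ["image", "audio"] },
        { "provider": "google", "capabilities": ["image", "audio", "video"] }
      ],
      "image": { "maxChars": 500 },
      "audio": { "echoTranscript": true }
    }
  }
}
```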
2) Audio + Video only (image off)
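A sketch of enabling audio and video while turning image understanding off, using the documented size limits (20MB audio, 50MB video, expressed in bytes):

```json
{
  "tools": {
    "media": {
      "image": { "enabled": false },
      "audio": { "enabled": true, "maxBytes": 20971520 },
      "video": { "enabled": true, "maxBytes": 52428800, "maxChars": 500 }
    }
  }
}
```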
3) Optional image understanding
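A sketch of optional image understanding with a custom prompt (the prompt wording is illustrative):

```json
{
  "tools": {
    "media": {
      "image": {
        "enabled": true,
        "maxChars": 500,
        "prompt": "Describe the image in one short sentence."
      }
    }
  }
}
```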
4) Multi-modal single entry (explicit capabilities)
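A sketch of a single multi-modal entry gated by explicit `capabilities` (the `provider` field is an assumption):

```json
{
  "tools": {
    "media": {
      "models": [
        { "provider": "google", "capabilities": ["image", "audio", "video"] }
      ]
    }
  }
}
```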
Status output
When media understanding runs, `/status` includes a short summary line.
Notes
- Understanding is best‑effort. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use `scope` to limit where understanding runs (e.g. only DMs).
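A sketch of `scope` gating audio understanding to direct messages (the `chatType` key follows the documented channel/chatType/session-key gating; the `"dm"` value is an assumption):

```json
{
  "tools": {
    "media": {
      "audio": {
        "scope": { "chatType": "dm" }
      }
    }
  }
}
```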