Multimodal — images, voice, screen
Modern assistants take more than text. What to feed them, what to ask for, and which tool for which medium.
"Type, get text" is the entry-level mental model. The capable assistants — Claude, ChatGPT, Gemini — now read images, watch video, hear audio, respond in voice, and can see your screen. Most of the value is in knowing which mode fits the job.
What you can feed in
| Input | Good for | Tools |
|---|---|---|
| Photo / screenshot | "What does this error mean?" · "Read this receipt." · "Describe the chart." | Claude, ChatGPT, Gemini — all chat apps |
| PDF / Word / slides | Summarize, extract, rewrite | Claude.ai, ChatGPT Plus, Gemini |
| Audio file or live voice | Transcribe meetings · talk hands-free | ChatGPT Voice, Gemini Live, Claude voice (mobile) |
| Video clip | "What happened?" · timestamps · summary | Gemini (longest video context), Claude (short) |
| Live screen share | "Help me through this UI" | ChatGPT advanced voice + screen, Gemini Live |
What you can ask out
| Output | Tool category | Examples |
|---|---|---|
| Generated images | Image generators | DALL-E (in ChatGPT), Imagen (in Gemini), Midjourney, Flux, Stable Diffusion, Ideogram |
| Generated video | Video generators | Sora (OpenAI), Veo (Google), Runway, Pika, Kling, Hailuo |
| Generated speech | TTS | ElevenLabs, OpenAI TTS, Google Cloud TTS, Play.ht |
| Music | Music generators | Suno, Udio |
| 3D models | 3D generators | Meshy, Tripo, Luma Genie |
The "drag in a screenshot" superpower
This is the single most underused trick. Take a screenshot of anything confusing — a stack trace, a confusing UI, a chart in a report, a handwritten whiteboard photo, a receipt in Korean — and drop it into the chat. Then ask in plain language.
Voice — when it actually helps
- Hands-busy tasks. Driving, cooking, walking, exercising.
- Brainstorming. Talking out loud unblocks thinking the way typing doesn't.
- Practice. Language learning, mock interviews, public-speaking rehearsal.
- Accessibility. Lower friction than typing for some users.
Modern voice modes (ChatGPT Advanced Voice, Gemini Live) are full-duplex — you can interrupt them, they can interrupt you. Latency is low enough that it feels like a conversation, not a transcription round-trip.
Picking an image generator
| You want | Try first |
|---|---|
| A polished marketing image, fast | Midjourney or Flux (aesthetic defaults) |
| Exactly-rendered text in an image | Ideogram or Imagen (best text fidelity) |
| A quick illustration inside a chat conversation | DALL-E in ChatGPT, Imagen in Gemini |
| Local / private / no per-image cost | Stable Diffusion / Flux locally |
| Photorealistic faces / product shots | Flux Pro, latest Midjourney |
What multimodal can't do yet (early 2026)
- Reliable text rendering inside generated images is improving but still flaky. Use Ideogram or Imagen for text-heavy stuff.
- Coherent multi-minute video — single shots of ~10 seconds are great; sustained narratives drift.
- Spatial / numerical precision in images — "show exactly 7 apples" sometimes gives 6 or 8.
- Real-time vision through a phone camera works (Gemini Live, ChatGPT) but battery + bandwidth costs are real.