5 min

Multimodal — images, voice, screen

Modern assistants take more than text. What to feed them, what to ask for, and which tool for which medium.

"Type, get text" is the entry-level mental model. The capable assistants (Claude, ChatGPT, Gemini) now read images, watch video, hear audio, respond in voice, and see your screen. Most of the value is in knowing which mode fits the job.

What you can feed in

Input	Good for	Tools
Photo / screenshot	"What does this error mean?" · "Read this receipt." · "Describe the chart."	Claude, ChatGPT, Gemini — all chat apps
PDF / Word / slides	Summarize, extract, rewrite	Claude.ai, ChatGPT Plus, Gemini
Audio file or live voice	Transcribe meetings · talk hands-free	ChatGPT Voice, Gemini Live, Claude voice (mobile)
Video clip	"What happened?" · timestamps · summary	Gemini (longest video context), Claude (short)
Live screen share	"Help me through this UI"	ChatGPT advanced voice + screen, Gemini Live

What you can ask for

Output	Tool category	Examples
Generated images	Image generators	DALL-E (in ChatGPT), Imagen (in Gemini), Midjourney, Flux, Stable Diffusion, Ideogram
Generated video	Video generators	Sora (OpenAI), Veo (Google), Runway, Pika, Kling, Hailuo
Generated speech	TTS	ElevenLabs, OpenAI TTS, Google Cloud TTS, Play.ht
Music	Music generators	Suno, Udio
3D models	3D generators	Meshy, Tripo, Luma Genie

The "drag in a screenshot" superpower

This is the single most underused trick. Take a screenshot of anything confusing (a stack trace, an unfamiliar UI, a chart in a report, a photo of a handwritten whiteboard, a receipt in Korean) and drop it into the chat. Then ask in plain language.

Real cases that work: "What's wrong with this CSS?" (with screenshot) · "Translate this menu" (photo) · "Summarize the trend in this chart" · "What does this error message mean and how do I fix it?" · "Read the labels on these jars and tell me which one is the chili oil."

Voice — when it actually helps

Hands-busy tasks. Driving, cooking, walking, exercising.
Brainstorming. Talking out loud unblocks thinking the way typing doesn't.
Practice. Language learning, mock interviews, public-speaking rehearsal.
Accessibility. Lower friction than typing for some users.

Modern voice modes (ChatGPT Advanced Voice, Gemini Live) are full-duplex: they keep listening while they speak, so you can interrupt mid-sentence. Latency is low enough that it feels like a conversation, not a transcription round-trip.

Picking an image generator

You want	Try first
A polished marketing image, fast	Midjourney or Flux (aesthetic defaults)
Exactly-rendered text in an image	Ideogram or Imagen (best text fidelity)
A quick illustration inside a chat conversation	DALL-E in ChatGPT, Imagen in Gemini
Local / private / no per-image cost	Stable Diffusion / Flux locally
Photorealistic faces / product shots	Flux Pro, latest Midjourney

Prompt tip for images. The same four-part shape works: subject · style · setting · constraints. "A weathered fisherman mending a net · in the style of a Winslow Homer painting · golden-hour light on a Maine coastline · square, no text, no logo."
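The four-part shape can be sketched as a small helper. This is a minimal illustration only: the function name `build_image_prompt` and the comma separator are assumptions, not something any image generator requires.

```python
def build_image_prompt(subject: str, style: str, setting: str, constraints: str) -> str:
    """Join subject, style, setting and constraints into one prompt.

    Blank parts are skipped, so the helper also works for shorter prompts.
    """
    parts = [subject, style, setting, constraints]
    return ", ".join(p.strip() for p in parts if p.strip())


prompt = build_image_prompt(
    subject="A weathered fisherman mending a net",
    style="in the style of a Winslow Homer painting",
    setting="golden-hour light on a Maine coastline",
    constraints="square, no text, no logo",
)
print(prompt)
```

Keeping the four parts as separate arguments makes it easy to change one (say, the style) while holding the others fixed and comparing results.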

What multimodal can't do yet (early 2026)

Text rendering inside generated images is improving but still flaky. Use Ideogram or Imagen for text-heavy work.
Coherent multi-minute video — single shots of ~10 seconds are great; sustained narratives drift.
Spatial / numerical precision in images — "show exactly 7 apples" sometimes gives 6 or 8.
Real-time vision through a phone camera works (Gemini Live, ChatGPT) but battery + bandwidth costs are real.

Cost — is multimodal much more expensive?

Yes, but less than you'd think. A medium-resolution image costs roughly the same as 1,000–4,000 text tokens. A minute of voice is ~1,500 tokens. Video is the priciest, at several cents per generated second. See the cost lesson for tier breakdowns.
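To see what those rough figures add up to, here is a back-of-the-envelope sketch. The token-equivalents come from the paragraph above; the price per million tokens is a made-up placeholder, so substitute the real rate for whichever model and tier you use.

```python
# Rough token-equivalents quoted in this lesson.
IMAGE_TOKENS_LOW = 1_000
IMAGE_TOKENS_HIGH = 4_000
VOICE_TOKENS_PER_MINUTE = 1_500

# Hypothetical price, for illustration only. Check your provider's pricing.
HYPOTHETICAL_USD_PER_MILLION_TOKENS = 3.0


def session_token_range(images: int, voice_minutes: float) -> tuple[int, int]:
    """Return (low, high) token-equivalents for a session."""
    voice = round(voice_minutes * VOICE_TOKENS_PER_MINUTE)
    return (images * IMAGE_TOKENS_LOW + voice, images * IMAGE_TOKENS_HIGH + voice)


def dollars(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a given per-million rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Five screenshots plus ten minutes of voice:
# images 5,000 to 20,000 tokens, voice 15,000 tokens.
low, high = session_token_range(images=5, voice_minutes=10)
print(f"{low:,} to {high:,} token-equivalents")
print(f"about ${dollars(low, HYPOTHETICAL_USD_PER_MILLION_TOKENS):.2f} "
      f"to ${dollars(high, HYPOTHETICAL_USD_PER_MILLION_TOKENS):.2f} at the placeholder rate")
```

Generated video is left out on purpose: it is billed per second rather than per token, so it does not fit this estimate.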

Privacy concerns specific to images / voice?

The same defaults as in the safety lesson apply; the additional wrinkle is metadata. Strip EXIF from photos before uploading (it can embed location and device details). For voice, assume the audio is logged unless you've opted out.
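Stripping EXIF can be done with nothing but the Python standard library. The sketch below walks the segments of a JPEG file and drops any APP1 segment that carries Exif data. The function name `strip_exif` is illustrative, and the code assumes a well-formed JPEG; it does not handle PNG or HEIC, and it leaves other metadata (XMP, for example) in place.

```python
import struct


def strip_exif(jpeg: bytes) -> bytes:
    """Return a copy of a JPEG with its Exif APP1 segments removed."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(jpeg[:2])
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a marker at byte {pos}")
        marker = jpeg[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, copy the rest as-is.
            out += jpeg[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", jpeg[pos + 2 : pos + 4])
        segment = jpeg[pos : pos + 2 + length]
        # Bytes 4..9 of the segment are the start of its payload.
        is_exif = marker == 0xE1 and segment[4:10] == b"Exif\x00\x00"
        if not is_exif:
            out += segment
        pos += 2 + length
    # No scan segment found (truncated file): keep whatever is left.
    out += jpeg[pos:]
    return bytes(out)
```

Typical use is to read the photo as bytes, pass it through `strip_exif`, and write the result to a new file, leaving the original untouched. Many phones and chat apps also have a "remove location" option when sharing, which is simpler if it is available.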

Why are there so many image generators?

Two reasons. (1) Different models hit different style sweet spots — Midjourney is editorial, Flux is photoreal, DALL-E is illustrative, Ideogram nails text. (2) The non-LLM frontier moves fast and lab-to-product cycles are short. Six months from now this table will look different.