Learn AI

    5 min

    Multimodal — images, voice, screen

    Modern assistants take more than text. What to feed them, what to ask for, and which tool for which medium.

    "Type, get text" is the entry-level mental model. The capable assistants — Claude, ChatGPT, Gemini — now read images, watch video, hear audio, respond in voice, and can see your screen. Most of the value is in knowing which mode fits the job.

    What you can feed in

    Input | Good for | Tools
    Photo / screenshot | "What does this error mean?" · "Read this receipt." · "Describe the chart." | Claude, ChatGPT, Gemini (all chat apps)
    PDF / Word / slides | Summarize, extract, rewrite | Claude.ai, ChatGPT Plus, Gemini
    Audio file or live voice | Transcribe meetings · talk hands-free | ChatGPT Voice, Gemini Live, Claude voice (mobile)
    Video clip | "What happened?" · timestamps · summary | Gemini (longest video context), Claude (short)
    Live screen share | "Help me through this UI" | ChatGPT Advanced Voice + screen, Gemini Live

    What you can ask for

    Output | Tool category | Examples
    Generated images | Image generators | DALL-E (in ChatGPT), Imagen (in Gemini), Midjourney, Flux, Stable Diffusion, Ideogram
    Generated video | Video generators | Sora (OpenAI), Veo (Google), Runway, Pika, Kling, Hailuo
    Generated speech | TTS | ElevenLabs, OpenAI TTS, Google Cloud TTS, Play.ht
    Music | Music generators | Suno, Udio
    3D models | 3D generators | Meshy, Tripo, Luma Genie

    The "drag in a screenshot" superpower

    This is the single most underused trick. Take a screenshot of anything confusing (a stack trace, an unfamiliar UI, a chart in a report, a photo of a handwritten whiteboard, a receipt in Korean) and drop it into the chat. Then ask in plain language.

    Real cases that work: "What's wrong with this CSS?" (with screenshot) · "Translate this menu" (photo) · "Summarize the trend in this chart" · "What does this error message mean and how do I fix it?" · "Read the labels on these jars and tell me which one is the chili oil."

    Voice — when it actually helps

    • Hands-busy tasks. Driving, cooking, walking, exercising.
    • Brainstorming. Talking out loud unblocks thinking the way typing doesn't.
    • Practice. Language learning, mock interviews, public-speaking rehearsal.
    • Accessibility. Lower friction than typing for some users.

    Modern voice modes (ChatGPT Advanced Voice, Gemini Live) are full-duplex — you can interrupt them, they can interrupt you. Latency is low enough that it feels like a conversation, not a transcription round-trip.

    Picking an image generator

    You want | Try first
    A polished marketing image, fast | Midjourney or Flux (aesthetic defaults)
    Exactly-rendered text in an image | Ideogram or Imagen (best text fidelity)
    A quick illustration inside a chat conversation | DALL-E in ChatGPT, Imagen in Gemini
    Local / private / no per-image cost | Stable Diffusion / Flux locally
    Photorealistic faces / product shots | Flux Pro, latest Midjourney

    Prompt tip for images. The same four-part shape works: subject · style · setting · constraints. "A weathered fisherman mending a net · in the style of a Winslow Homer painting · golden-hour light on a Maine coastline · square, no text, no logo."
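
    If you reuse the four-part shape often, it can help to keep the parts separate and join them only at the end. The sketch below is one way to do that; the ImagePrompt class and its render method are hypothetical names made up for illustration, not part of any image generator's API.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    """Hypothetical container for the four parts of an image prompt."""
    subject: str
    style: str
    setting: str
    constraints: str

    def render(self, separator: str = ", ") -> str:
        # Join the non-empty parts in a fixed order:
        # subject, style, setting, constraints.
        parts = [self.subject, self.style, self.setting, self.constraints]
        return separator.join(p.strip() for p in parts if p.strip())


prompt = ImagePrompt(
    subject="A weathered fisherman mending a net",
    style="in the style of a Winslow Homer painting",
    setting="golden-hour light on a Maine coastline",
    constraints="square, no text, no logo",
)
print(prompt.render())
```

    Keeping the parts as separate fields makes it easy to swap one of them (say, the style) while holding the other three fixed, which is a quick way to compare generators on the same subject.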

    What multimodal can't do yet (early 2026)

    • Reliable text rendering inside generated images. It is improving but still flaky; use Ideogram or Imagen for text-heavy work.
    • Coherent multi-minute video — single shots of ~10 seconds are great; sustained narratives drift.
    • Spatial / numerical precision in images — "show exactly 7 apples" sometimes gives 6 or 8.
    • Real-time vision through a phone camera works (Gemini Live, ChatGPT) but battery + bandwidth costs are real.

    Cost: is multimodal much more expensive?
    Yes, but less than you'd think. A medium-resolution image costs roughly the same as 1,000–4,000 text tokens. A minute of voice is ~1,500 tokens. Video is the priciest, at several cents per generated second. Check the cost lesson for tier breakdowns.
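
    To make those rough figures concrete, here is a back-of-the-envelope estimate in Python. The token constants are the approximate numbers quoted above; the function names and the per-million-token price you pass in are assumptions for illustration, not any provider's actual pricing.

```python
# Rough figures from the lesson; real pricing varies by provider and tier.
IMAGE_TOKENS_LOW = 1_000       # medium-resolution image, low end
IMAGE_TOKENS_HIGH = 4_000      # medium-resolution image, high end
VOICE_TOKENS_PER_MINUTE = 1_500


def estimate_token_range(images: int, voice_minutes: float) -> tuple[int, int]:
    """Return a (low, high) text-token equivalent for a multimodal session."""
    voice = round(voice_minutes * VOICE_TOKENS_PER_MINUTE)
    low = images * IMAGE_TOKENS_LOW + voice
    high = images * IMAGE_TOKENS_HIGH + voice
    return low, high


def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert tokens to dollars at a caller-supplied (hypothetical) rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Three screenshots plus a ten-minute voice conversation.
low, high = estimate_token_range(images=3, voice_minutes=10)
print(low, high)  # 18000 27000
```

    Passing either bound to estimate_cost_usd with the rate from your own plan gives a dollar range. The point of the exercise is the order of magnitude: a handful of images and a short voice chat land in the tens of thousands of tokens, not millions.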

    Privacy concerns specific to images / voice?
    Same defaults as the safety lesson. The additional wrinkle is metadata: strip EXIF from photos before uploading (it can embed GPS location and device details). For voice, assume the audio is logged unless you've opted out.

    Why are there so many image generators?
    Two reasons. (1) Different models hit different style sweet spots — Midjourney is editorial, Flux is photoreal, DALL-E is illustrative, Ideogram nails text. (2) The non-LLM frontier moves fast and lab-to-product cycles are short. Six months from now this table will look different.