Review · AI Audio & Voice

GPT-4o Audio Models

AI speech and voice models

Build voice agents with steerable TTS and sharper transcription

Reviewed by The Desk · Last verified July 2026

The Verdict

“A strong, cheap set of API models that beat Whisper on transcription accuracy and finally make text-to-speech tone-steerable, but they are raw developer building blocks — no UI, no diarization, and some hallucination quirks to design around.”

Skip it ifYou want a ready-to-use transcription or voice app rather than an API, or you need speaker diarization and reliable word-level timestamps for meetings — Deepgram or AssemblyAI serve those better, and non-developers get nothing usable out of the box.

4.3Editorial

Ease

Value

Power

What it is

GPT-4o Audio Models are OpenAI's next-generation speech stack for developers, announced in March 2025. The release bundles two speech-to-text models — gpt-4o-transcribe and the cheaper gpt-4o-mini-transcribe — with a new text-to-speech model, gpt-4o-mini-tts. All three are accessed through the OpenAI API (including the Realtime API and a dedicated realtime transcription endpoint) rather than any consumer app, positioning them as building blocks for voice agents, transcription pipelines, and narration features.

The transcription models are OpenAI's answer to its own Whisper: the company reports lower word error rates across benchmarks such as the 100+ language FLEURS suite, crediting reinforcement learning and expanded midtraining for better handling of accents, background noise, and uneven speech speed. The headline change on the synthesis side is steerability — for the first time developers can instruct gpt-4o-mini-tts on how to say something (tone, emotion, delivery), not just the words, which opens up more expressive customer-service and storytelling voices.

Pricing is usage-based and aggressive, roughly $0.003–0.006 per minute for transcription and about $0.015 per minute for generated speech, with no subscription. The trade-off is that these are raw models: there is no interface, no built-in speaker diarization, and word-level timestamp support is limited, so feature-rich meeting-transcription rivals like Deepgram and AssemblyAI still have an edge for certain jobs. Early testers, including Simon Willison, also flagged prompt-injection-style behavior where spoken content can be misinterpreted as instructions — a real consideration for production voice apps.

Key features

gpt-4o-transcribe and gpt-4o-mini-transcribe speech-to-text models
gpt-4o-mini-tts text-to-speech with instructable tone and style
Lower WER than Whisper v2/v3 on FLEURS and other benchmarks
Streaming via Realtime API and dedicated realtime transcription endpoint
50+ languages for TTS, 100+ for transcription
Works with OpenAI Agents SDK and Chat Completions
Usage-based per-minute and per-token pricing
Interactive TTS demo at OpenAI.fm

Who it’s for

Building real-time voice agents and customer-support bots

Transcribing calls, meetings, and voice notes inside apps

Adding steerable text-to-speech narration to products

Multilingual transcription across 100+ languages

Pros & cons

The good

Lower word error rate than Whisper v2/v3 across multilingual FLEURS benchmarks, with better handling of accents and noise
Steerable TTS lets you instruct tone and emotion, not just the words — genuinely useful for voice agents and storytelling
Cheap usage-based pricing (~$0.003-0.006/min STT, ~$0.015/min TTS) with no subscription and simple OpenAI SDK integration

The catch

No speaker diarization and limited word-level timestamps make it weaker than Deepgram or AssemblyAI for meeting transcription
Raw API models with no interface — only developers can use them
Early testers flagged prompt-injection-style hallucinations where spoken audio can be misread as instructions

Pricing

Tier	Price	What you get
gpt-4o-mini-transcribe (STT)	Custom	~$0.003 per minute of audio · Cost-efficient speech-to-text
gpt-4o-transcribe (STT)	Custom	~$0.006 per minute of audio · $2.50/1M audio input, $10/1M output tokens · Highest transcription accuracy
gpt-4o-mini-tts (TTS)	Custom	~$0.015 per minute of generated audio · $0.60/1M text input, $12/1M audio output tokens · Steerable tone and emotion

Alternatives to GPT-4o Audio Models

More AI Audio & Voice →

AI Voices4.2Studio-grade AI voiceovers at roughly half of ElevenLabs' price$11.99/mo

Wispr Flow4.4Dictate anywhere and get clean, formatted text 3x faster than typing$12/mo

Voicenotes4.2Capture ideas by voice and publish them without the polish$10/mo

Llama 44.0Open-weight multimodal models you can run cheap at massive contextFree tier

Questions people ask

What models are included in GPT-4o Audio Models?

Three: gpt-4o-transcribe and gpt-4o-mini-transcribe for speech-to-text, and gpt-4o-mini-tts for text-to-speech. They launched in March 2025 as OpenAI's next-generation audio models for developers.

How much do they cost?

Pricing is usage-based with no subscription. gpt-4o-mini-transcribe runs about $0.003/min, gpt-4o-transcribe about $0.006/min, and gpt-4o-mini-tts about $0.015/min of generated audio. There is no free tier.

Are these better than Whisper?

OpenAI reports lower word error rates than Whisper v2 and v3 across benchmarks like FLEURS, with better performance on accents, noise, and varying speech speed. In practice accuracy is strong, though some users have noted occasional hallucinations.

What makes the new text-to-speech different?

gpt-4o-mini-tts is steerable — you can instruct not just what it says but how it says it (tone, emotion, delivery), which is useful for customer service bots and creative narration. You can try the voices at OpenAI.fm.

Can non-developers use these?

Not directly. These are API models with no standalone app or interface, so you need to build with them or use a product that already integrates them.