OpenAI Ships GPT-Realtime-2, Translate, and Whisper, Bringing GPT-5 Reasoning Into Voice Apps
OpenAI rolled out three new voice models on May 7 — a reasoning agent with a 128K context window, a 70-language live translator at $0.034 a minute, and a streaming Whisper that transcribes as you speak.
OpenAI took a sweeping swing at the realtime voice market on May 7, 2026, releasing three new models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper — through a single Realtime API endpoint. The headline of the launch is that voice agents now get GPT-5–class reasoning natively, instead of routing user audio through a separate text model and stitching the response back into speech.
GPT-Realtime-2 expands the context window from 32,000 tokens to 128,000 and adds parallel tool calls, so a single voice agent can hold a long conversation while pulling from calendars, booking platforms, internal databases, and enterprise APIs without losing the thread. OpenAI prices it at $32 per million audio input tokens and $64 per million output tokens, with reduced rates for cached input — a meaningful step toward voice agents that can run customer support shifts end-to-end rather than handing off to a human after a handful of turns.
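If the new model keeps the session shape of the existing Realtime API, wiring up a tool-calling voice agent looks roughly like the sketch below. The WebSocket handshake, session.update client event, and response.output_item.done server event come from today's API; the model name is from the announcement, and the two tool definitions are hypothetical placeholders.

```typescript
// Minimal sketch of a GPT-Realtime-2 voice agent with tools registered.
// Assumes the current Realtime API session shape carries over to the new
// model; check_calendar and book_restaurant are hypothetical tools.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  // Register tools up front; with parallel tool calls the model may emit
  // several function_call items within a single response turn.
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a scheduling assistant.",
      tools: [
        {
          type: "function",
          name: "check_calendar", // hypothetical tool
          description: "List free slots for a given day.",
          parameters: {
            type: "object",
            properties: { date: { type: "string" } },
            required: ["date"],
          },
        },
        {
          type: "function",
          name: "book_restaurant", // hypothetical tool
          description: "Reserve a table.",
          parameters: {
            type: "object",
            properties: {
              name: { type: "string" },
              time: { type: "string" },
            },
            required: ["name", "time"],
          },
        },
      ],
    },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // A turn that invokes tools in parallel surfaces one completed
  // function_call item per tool within the same response.
  if (
    event.type === "response.output_item.done" &&
    event.item?.type === "function_call"
  ) {
    console.log("tool call:", event.item.name, event.item.arguments);
  }
});
```

The 128K window matters here because every tool result gets appended back into the conversation; a long support session that pulls records on each turn would blow through the old 32K budget quickly.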
The translation model is the most eye-catching consumer-facing piece. GPT-Realtime-Translate accepts speech in more than 70 languages and produces output in 13, with dynamic voice adaptation that follows the source speaker's tone, pitch, and cadence rather than locking the listener into a single synthetic voice. In multi-speaker sessions the model swaps voices as new speakers come in. OpenAI reports a 12.5% lower word error rate than competing models on Hindi, Tamil, and Telugu evaluations, and pegs pricing at $0.034 per minute, cheap enough to plausibly underwrite live broadcast captioning, multilingual conferencing, and in-call translation for video platforms.
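OpenAI has not published a parameter list for the translate model, so treat this as a guess at the ergonomics: the handshake and the input_audio_buffer.append client event exist in the current Realtime API, while the output_language and adaptive-voice session fields are assumptions about how the announced behavior might be exposed.

```typescript
// Sketch of a live-translation session. The model name comes from the
// announcement; the session fields are guesses at how the announced
// behavior might be surfaced, not documented parameters.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

// App-side queue that a playback loop would drain.
const playbackQueue: Buffer[] = [];

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      // Hypothetical knobs: input auto-detected across the 70 supported
      // source languages, output pinned to one of the 13 targets, and an
      // adaptive voice that tracks the speaker's tone and cadence.
      output_language: "es",
      voice: "adaptive",
    },
  }));
});

// Stream microphone audio in as base64-encoded PCM chunks; this client
// event exists in the current Realtime API.
function pushAudioChunk(pcm: Buffer): void {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm.toString("base64"),
  }));
}

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Translated speech streams back incrementally as audio deltas.
  if (event.type === "response.audio.delta") {
    playbackQueue.push(Buffer.from(event.delta, "base64"));
  }
});
```

At $0.034 per minute, an hour of continuous translation runs about $2.04, which is where the broadcast-captioning and conferencing math starts to work.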
GPT-Realtime-Whisper, the third model, is the budget streaming transcription option at $0.017 per minute. Unlike the original Whisper, which processed audio in chunks after the fact, the new version emits a live transcript as the speaker talks, targeting use cases such as courtroom and newsroom captioning, accessibility tooling, classroom note-taking, and healthcare documentation. The three models share an API, so a single application can mix and match — for instance, transcribing with Whisper, reasoning with Realtime-2, and outputting in another language through Translate — without juggling separate SDKs.
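Because the three models hang off the same endpoint, mixing them is mostly a matter of changing the model query parameter. A rough sketch, assuming today's connection flow carries over unchanged; the model names are from the announcement, and the transcription delta event name is borrowed from the current API:

```typescript
// One helper covers all three models; only the model parameter differs.
import WebSocket from "ws";

type RealtimeModel =
  | "gpt-realtime-2"
  | "gpt-realtime-translate"
  | "gpt-realtime-whisper";

function connect(model: RealtimeModel): WebSocket {
  return new WebSocket(`wss://api.openai.com/v1/realtime?model=${model}`, {
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
  });
}

// One app, three sessions: live transcript, reasoning agent, translated audio.
const transcriber = connect("gpt-realtime-whisper");
const agent = connect("gpt-realtime-2");
const translator = connect("gpt-realtime-translate");

transcriber.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Streaming transcription: partial text arrives while the speaker is
  // still talking (event name assumed from the current transcription API).
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    process.stdout.write(event.delta);
  }
});
```

The shared surface is the quiet design win: a captioning product could start on the $0.017-per-minute Whisper tier and add reasoning or translation later without re-platforming.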
Microsoft is shipping the trio in parallel through Azure AI Foundry, which means enterprise customers already on Azure get the same models without a separate procurement cycle. With Anthropic and Google both deepening their voice and agent roadmaps, the launch puts pressure on the rest of the field to match a stack that combines reasoning, translation, and transcription at sub-second latency under one developer surface.