Microsoft Launches Three MAI Models to Rival OpenAI and Google in Speech, Voice, and Image

Microsoft unveils MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 through its Foundry platform, marking the company's boldest push yet to build AI infrastructure independent of its OpenAI partnership.

Microsoft fired a significant shot across the bow of OpenAI and Google on April 2, 2026, announcing three new foundational AI models developed entirely in-house: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. The models are now available through Microsoft Foundry and a newly launched MAI Playground, representing the most aggressive move yet by the company to develop AI capabilities independent of its $13 billion OpenAI investment.

The announcement was led by Mustafa Suleyman, CEO of Microsoft AI, whose Superintelligence team, formed in November 2025, delivered the trio of models in just five months. MAI-Transcribe-1 is a speech recognition model supporting 25 languages that logs an error rate of 3.9%, beating both Google's Gemini 3.1 Flash and OpenAI's GPT-Transcribe on standard benchmarks. It delivers batch transcription at 2.5x the speed of Microsoft's existing Azure Fast service while cutting GPU costs by approximately 50% compared to leading alternatives.

MAI-Voice-1 is a text-to-speech model that can generate 60 seconds of expressive, human-quality audio in under one second on a single GPU. It preserves speaker identity across long-form content and supports custom voice creation from just a few seconds of reference audio, a feature aimed directly at enterprise customers building branded voice experiences. Meanwhile, MAI-Image-2, a high-capability text-to-image model that first appeared on the MAI Playground in March, is now broadly available and debuted at number three on the Arena.ai image leaderboard.

Pricing is positioned as a deliberate competitive weapon. MAI-Transcribe-1 costs $0.36 per hour of transcribed speech; MAI-Voice-1 starts at $22 per million characters; MAI-Image-2 runs $5 per million input tokens and $33 per million output tokens. Microsoft characterized these rates as substantially below comparable offerings from Google and OpenAI. The launch underscores a broader strategic shift: even as Microsoft continues to distribute OpenAI models through Azure, Redmond is now building a parallel, self-owned AI stack capable of standing on its own.
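
For readers who want to translate those list prices into a budget, the arithmetic can be sketched in a few lines of Python. This is a rough estimator, not an official calculator: the function names are hypothetical, the rates are simply the figures quoted above, and it assumes flat per-unit pricing with no tiers, minimums, or discounts (the article says only that MAI-Voice-1 "starts at" $22, so real bills may differ).

```python
# Rates as quoted in the article, in US dollars.
TRANSCRIBE_USD_PER_HOUR = 0.36          # MAI-Transcribe-1, per hour of speech
VOICE_USD_PER_MILLION_CHARS = 22.0      # MAI-Voice-1, "starts at" rate
IMAGE_USD_PER_MILLION_INPUT = 5.0       # MAI-Image-2, input tokens
IMAGE_USD_PER_MILLION_OUTPUT = 33.0     # MAI-Image-2, output tokens


def transcribe_cost(audio_hours: float) -> float:
    """Cost of transcribing the given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Cost of synthesizing speech for the given number of characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of image generation for the given token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT
    )


def estimate_total(audio_hours: float, characters: int,
                   input_tokens: int, output_tokens: int) -> dict:
    """Per-model and total cost for a hypothetical workload (flat rates assumed)."""
    costs = {
        "transcribe": transcribe_cost(audio_hours),
        "voice": voice_cost(characters),
        "image": image_cost(input_tokens, output_tokens),
    }
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    # Hypothetical workload: 1,000 hours of audio, 2 million characters of
    # speech, and 1 million input plus 1 million output image tokens.
    for name, usd in estimate_total(1_000, 2_000_000, 1_000_000, 1_000_000).items():
        print(f"{name}: ${usd:,.2f}")
```

At the quoted rates, the sample workload comes to $360 for transcription, $44 for voice, and $38 for images, or $442 in total.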
