Products · 2 min read · xAI

xAI Ships Grok Imagine 1.5 Image-to-Video — and Makes Grok the Voice of Vapi

In a busy few days, xAI previewed Grok Imagine 1.5, a single-image-to-video model that debuted at No. 1 on the Artificial Analysis arena with audio baked into every clip — and named Grok the default engine behind the voices on Vapi’s 2.5-million-agent platform.

[Image: Grok Imagine 1.5 title card. Products · June 3, 2026. One image becomes a video; No. 1 on the image-to-video arena (preview); audio baked into every clip; multi-shot. Plus: Grok now powers Vapi’s voices. Source: xAI · Artificial Analysis]

xAI used the first days of June to push hard into generative media. On June 3 it released Grok Imagine 1.5 as an API preview — a single-image-to-video model — and around the same time announced that Grok now powers the voices on Vapi, one of the largest platforms for building voice agents. Together the two launches stake out territory in AI video and AI voice at once.

Grok Imagine 1.5 takes a single still image and animates it, adding motion, camera moves, atmosphere and physics while preserving the source image's lighting and detail. On the Artificial Analysis image-to-video arena, the model debuted in first place, with an Elo score of roughly 1404 as of June 3. The capability traces back to xAI's March 2025 acquisition of Hotshot, a San Francisco video-generation startup that had built several video foundation models.

The specs lean toward production use. Grok Imagine 1.5 outputs up to 720p (480p is also supported), and — notably — generates synchronized audio in every clip at no extra charge. It supports multi-shot sequencing, letting creators stage a frame, animate it, and chain shots into longer scenes that hold a consistent look across a project. The preview is rate-limited to 60 requests per minute and priced at $0.08 per second at 480p and $0.14 per second at 720p, with a $0.01 image-input cost per generation.
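To make the preview pricing concrete, here is a rough cost sketch in Python built only from the figures quoted above. The function names, the 6-second clip length and the 150-clip batch are illustrative assumptions, as is the idea that billing scales linearly with clip duration and that each clip is one request; none of this is xAI's API.

```python
from decimal import Decimal

# Preview figures quoted in the article (USD). All names here are illustrative.
PRICE_PER_SECOND = {"480p": Decimal("0.08"), "720p": Decimal("0.14")}
IMAGE_INPUT_FEE = Decimal("0.01")   # charged once per generation
RATE_LIMIT_PER_MINUTE = 60          # preview rate limit, requests per minute


def clip_cost(seconds, resolution="720p"):
    """Estimated cost of one clip: per-second price plus the image-input fee.

    Assumes linear per-second billing with no rounding (not confirmed by xAI).
    """
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution}")
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return PRICE_PER_SECOND[resolution] * Decimal(str(seconds)) + IMAGE_INPUT_FEE


def batch_estimate(num_clips, seconds, resolution="720p"):
    """Total cost of a batch and the number of one-minute rate-limit
    windows it occupies, assuming one request per clip."""
    total = clip_cost(seconds, resolution) * num_clips
    windows = -(-num_clips // RATE_LIMIT_PER_MINUTE)  # ceiling division
    return total, windows


if __name__ == "__main__":
    print("6 s at 480p:", clip_cost(6, "480p"))
    print("6 s at 720p:", clip_cost(6, "720p"))
    print("150 clips, 6 s, 720p:", batch_estimate(150, 6, "720p"))
```

Under these assumptions the image-input fee is a small fixed add-on, so clip length and resolution drive nearly all of the bill.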

On the audio side, Grok became the default engine for Vapi's 12 core voices, bringing more naturalness and emotional range to the 2.5-million-plus voice agents built on the platform; xAI says Grok took the top spot in Vapi's head-to-head evaluation. Grok Speech-to-Text and Text-to-Speech are now selectable in the Vapi dashboard, and teams can integrate them directly through the Grok Voice API — including custom voice cloning — for use cases such as narration, podcasts, advertising and voiceover.

The through-line is strategy. xAI is packaging creative video and voice as paid API layers, aimed squarely at OpenAI's Sora and realtime-voice stack and Google's media models. Bundling audio into every clip and slotting Grok beneath an existing army of voice agents are both bids to win developers by default rather than by demo — a sign the generative-media race is shifting from flashy showcases to the unglamorous business of being the cheapest, most reliable engine inside someone else's product.
