Together AI Open-Sources OSCAR, a 2-Bit KV Cache That Cuts Long-Context Serving Costs in Half
OSCAR (Offline Spectral Covariance-Aware Rotation) squeezes the KV cache that bloats every long-context request down to two bits per element, while staying within a few tenths of a point of FP16 accuracy on Qwen3-32B and GLM-4.7 across reasoning, math and coding benchmarks. Together AI shipped it into SGLang on May 25, framing it as one of the more practical wins for genuinely affordable 1M-token serving.
On May 25, Together AI open-sourced OSCAR (Offline Spectral Covariance-Aware Rotation), an attention-aware 2-bit quantization scheme for the KV cache — the per-token activation buffer that has quietly become the single most expensive thing about serving long-context LLMs. The release, written up by MarkTechPost, lands inside SGLang as a drop-in INT2 KV cache mode with full paged attention compatibility, meaning operators can switch on 2-bit storage without rewriting their serving stack.
The hard part of pushing a KV cache to two bits is not the math but the outliers. KV activations contain a handful of channels that carry extremely large values, with the rest of the channels well-behaved. Naive INT2 quantization — four representable levels per element — burns almost the entire range on those rare spikes, collapsing every normal value into one or two buckets and shredding attention quality. OSCAR’s answer is to rotate the activations into a basis where the outliers are spread across many channels before they ever hit the quantizer, using a spectral covariance analysis computed offline from a calibration set.
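To make the outlier problem concrete, here is a toy sketch in plain Python. The eight-channel token is made up, and the rotation is a fixed Walsh-Hadamard transform used purely as a stand-in: OSCAR's actual rotation is derived offline from calibration covariance and is not reproduced here. The function names are hypothetical.

```python
import math

def int2_codes(values):
    """Asymmetric 2-bit quantization: snap each value to one of 4 levels in [min, max]."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 3 or 1.0  # three steps separate four levels; avoid a zero step
    codes = [min(3, max(0, round((v - lo) / step))) for v in values]
    return codes, step

def hadamard_rotate(values):
    """Orthonormal Walsh-Hadamard rotation (stand-in); length must be a power of two."""
    out = list(values)
    half = 1
    while half < len(out):
        for start in range(0, len(out), 2 * half):
            for j in range(start, start + half):
                a, b = out[j], out[j + half]
                out[j], out[j + half] = a + b, a - b
        half *= 2
    norm = math.sqrt(len(out))
    return [v / norm for v in out]

# Hypothetical token: seven well-behaved channels plus one outlier channel.
token = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1, 50.0]

naive_codes, naive_step = int2_codes(token)
rotated = hadamard_rotate(token)
rotated_codes, rotated_step = int2_codes(rotated)

print("naive codes:  ", naive_codes)
print("naive step:   ", round(naive_step, 2))
print("rotated peak: ", round(max(abs(v) for v in rotated), 2))
print("rotated step: ", round(rotated_step, 2))
```

Without rotation, the single spike sets the quantization step and every normal channel lands in the same bucket. After the rotation, the spike's energy is shared across all channels, so the peak and the step both shrink. In this tiny example the gain is modest; a lone spike of magnitude a becomes a/sqrt(d) per channel, so the effect grows with head dimension d.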
The runtime path is engineered to keep that math invisible. On the write side, each token is rotated, clipped to a calibration-derived percentile threshold, then quantized with per-token asymmetric INT2. On the read side, the INT2 kernel unpacks the packed bytes, dequantizes, applies the inverse rotation and hands results to the attention kernel, all in a single fused pass with no extra memory traffic. Because the rotation matrix and clipping thresholds are precomputed offline, nothing has to be calibrated at serving time; the only added runtime work is the unpack-dequantize-rotate sequence inside that fused kernel.
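The write and read paths can be sketched step by step in plain Python. This is an illustration under stated assumptions, not Together's kernel: the names `write_kv` and `read_kv`, the `clip` value and the sample token are hypothetical, a fixed Walsh-Hadamard transform stands in for the calibrated rotation, and nothing here is fused or GPU-resident.

```python
import math

def hadamard(values):
    """Orthonormal Walsh-Hadamard transform; applying it twice returns the input."""
    out = list(values)
    half = 1
    while half < len(out):
        for start in range(0, len(out), 2 * half):
            for j in range(start, start + half):
                a, b = out[j], out[j + half]
                out[j], out[j + half] = a + b, a - b
        half *= 2
    norm = math.sqrt(len(out))
    return [v / norm for v in out]

def write_kv(token, clip):
    """Write path: rotate, clip, per-token asymmetric INT2 quantize, pack 4 codes per byte."""
    rotated = hadamard(token)
    clipped = [max(-clip, min(clip, v)) for v in rotated]
    zero = min(clipped)                       # per-token zero point
    scale = (max(clipped) - zero) / 3 or 1.0  # per-token scale: 3 steps, 4 levels
    codes = [min(3, max(0, round((v - zero) / scale))) for v in clipped]
    # Channel count is assumed to be a power of two >= 4, so groups of 4 are complete.
    packed = bytes(
        codes[i] | (codes[i + 1] << 2) | (codes[i + 2] << 4) | (codes[i + 3] << 6)
        for i in range(0, len(codes), 4)
    )
    return packed, scale, zero

def read_kv(packed, scale, zero):
    """Read path: unpack, dequantize, apply the inverse rotation."""
    codes = [(byte >> shift) & 0b11 for byte in packed for shift in (0, 2, 4, 6)]
    dequantized = [zero + scale * c for c in codes]
    return hadamard(dequantized)  # the orthonormal Hadamard is its own inverse

# Hypothetical token and clip threshold (large enough that nothing is clipped here).
token = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1, 50.0]
packed, scale, zero = write_kv(token, clip=100.0)
restored = read_kv(packed, scale, zero)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(token, restored)))
print(len(packed), "bytes for", len(token), "channels; L2 error", round(error, 3))
```

Each token carries its own scale and zero point, which is what "per-token asymmetric" means in practice. Because the rotation is orthonormal, the reconstruction error after the inverse rotation has the same L2 norm as the rounding error in the rotated domain.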
Together evaluated OSCAR on four reference models: Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B and the 358B-parameter GLM-4.7-FP8, drawn from the same model families covered in xAI’s and Alibaba Cloud’s recent long-context benchmarks. On AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6 and MATH500, INT2 OSCAR matched the FP16 baselines within a few tenths of a point, while shrinking the per-token KV cache footprint by roughly 8x relative to FP16 (16 bits down to 2) and 4x relative to typical INT8 caches.
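The 8x and 4x figures follow directly from the bit widths, as a quick back-of-the-envelope calculation shows. The model shape below (64 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration, not a figure from Together's write-up, and the estimate ignores the per-token scale and zero-point metadata a real INT2 cache also stores.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits):
    """Bytes of K plus V stored per token, ignoring scale/zero-point metadata."""
    elements = 2 * layers * kv_heads * head_dim  # factor 2: one key and one value vector
    return elements * bits / 8

# Hypothetical model shape, for illustration only.
shape = dict(layers=64, kv_heads=8, head_dim=128)
context_tokens = 1_000_000

footprint = {
    name: kv_bytes_per_token(bits=bits, **shape)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT2", 2)]
}

for name, per_token in footprint.items():
    total_gib = per_token * context_tokens / 2**30
    print(f"{name}: {per_token / 1024:.0f} KiB per token, {total_gib:.1f} GiB at 1M tokens")

print("FP16 / INT2:", footprint["FP16"] / footprint["INT2"])
print("INT8 / INT2:", footprint["INT8"] / footprint["INT2"])
```

The ratios do not depend on the model shape, only on the bits per element. The quantization metadata left out here is why the real-world saving is "roughly" rather than exactly 8x.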
The strategic read is simple: the long-context arms race has shifted from “who can claim a million tokens” to “who can actually afford to serve them.” Anthropic’s 1M-token context on Claude and Gemini’s 2M-token window are priced like luxury goods today in part because the KV cache scales linearly with sequence length and dominates the GPU memory bill. Plugging a 2-bit cache that survives reasoning benchmarks into the open serving stack is the kind of plumbing improvement that quietly resets the floor on what self-hosted long-context inference costs, and it is exactly the angle Together has been working since its Q1 funding round.