AI Intel: ARC-AGI-3 Humbles Frontier Models + Claude’s 1,487% Surge + More
The benchmark crowd got a rude wake-up call this week. ARC-AGI-3 launched on March 25, and the headline number is ugly for anyone claiming frontier models are close to general intelligence: humans score 100%, while frontier AI sits at 0.26% on the new interactive benchmark. Reddit noticed fast, because the result points at a real weakness: current models are still bad at learning a brand-new world by trial and error.
Claude is still riding a huge usage wave, but the rollout is getting noisy, OpenAI has officially pulled the plug on Sora, and developers are once again refreshing their timelines for the next DeepSeek drop. Here’s the Friday briefing.
ARC-AGI-3 just made the benchmark debate interesting again
What happened. ARC Prize officially launched ARC-AGI-3 this week at Y Combinator HQ. The new benchmark swaps static puzzles for interactive environments and game-style levels. Agents get no instructions, no rules, and no stated goals. ARC Prize says humans score 100% while frontier AI scores 0.26%, and the 2026 competition carries more than $2 million in prizes.
Why it matters. This is the clearest reminder in months that pattern matching is not the same thing as adaptive intelligence. Many benchmark wins happen in settings where the task is already framed for the model. ARC-AGI-3 removes that framing. When the model has to probe an unfamiliar system, notice feedback, and change strategy, the floor drops out.
Developer angle. If you’re building agents that browse, operate tools, or handle messy workflows, ARC-AGI-3 is your warning label. Prompt quality still matters, but state management, memory, retries, and action selection matter more than the average demo suggests. Teams that treat agents like fancy autocomplete will keep getting blindsided the moment the environment changes.
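If that sounds abstract, here’s a minimal sketch of the shape, not a prescription. Everything in it is hypothetical scaffolding: `env` and `call_model` stand in for whatever environment interface and model client you actually run.

```python
import time

class AgentLoop:
    """Toy agent loop with explicit state, memory, and retries.

    Hypothetical sketch: `env` needs observe() and act(), and
    `call_model` maps the running history to the next action.
    """

    def __init__(self, env, call_model, max_retries=3):
        self.env = env
        self.call_model = call_model
        self.max_retries = max_retries
        self.history = []  # the memory the model sees on every call

    def step(self):
        observation = self.env.observe()
        self.history.append({"role": "observation", "content": observation})
        for attempt in range(self.max_retries):
            try:
                action = self.call_model(self.history)  # may fail or time out
                result = self.env.act(action)
                self.history.append({"role": "action", "content": action})
                return result
            except Exception as exc:
                # Feed the failure back into memory so the next attempt is
                # conditioned on what went wrong, not a blind repeat.
                self.history.append({"role": "error", "content": str(exc)})
                time.sleep(2 ** attempt)
        raise RuntimeError("agent stuck: all retries exhausted")
```

The backoff is the boring part. The part most demos skip is appending the error to history, so the model can change strategy instead of repeating the same dead action.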
Claude’s growth is real, but the rough edges are getting louder
What happened. Claude usage has been exploding. Forbes, citing Appfigures data, reported a 1,487% surge in sessions, from roughly 1,112 in mid-January to 17,648 in the second week of March. Anthropic also ran a March 13-28 promotion that doubled usage limits during weekday off-peak hours. At the same time, Reddit filled up with complaints about Claude Code burning through limits too fast and hanging during longer sessions. Anthropic’s status page showed repeated incidents on March 25-27, including MCP issues and elevated errors on Opus 4.6 and Sonnet 4.6.
Why it matters. The easy take is “Claude is winning.” The better take is that demand for coding agents is now large enough that every reliability wobble gets amplified instantly. Growth promotions are great until they collide with real usage. Claude still has serious momentum, but momentum and operational smoothness are not the same thing.
Developer angle. Don’t architect your workflow as if one provider will always be available, generous, and perfectly stable. Keep prompts checkpointed. Break long coding sessions into smaller units. Build fallback paths for key automations. And if you want Claude quality without locking yourself into a big monthly subscription, this is where pay-as-you-go gateways like KissAPI make sense: you can route to Claude when it’s the right tool, then switch when it isn’t. On the API side, the pricing spread is still real: Claude Opus 4.6 runs around $15 per million input tokens and $75 per million output tokens, while Sonnet 4.6 is closer to $3 and $15. Reliability problems feel a lot worse when the expensive model is also the flaky one.
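As a concrete illustration, here’s a minimal fallback sketch using the OpenAI Python SDK pointed at an OpenAI-compatible gateway. The base URL and model IDs are placeholders; substitute whatever your provider actually exposes.

```python
from openai import OpenAI

# Placeholder endpoint and key: any OpenAI-compatible gateway works here.
client = OpenAI(base_url="https://gateway.example.com/v1", api_key="YOUR_KEY")

# Ordered by preference: the model you want first, the fallback second.
MODELS = ["claude-opus-4.6", "claude-sonnet-4.6"]  # hypothetical IDs

def complete(messages):
    last_error = None
    for model in MODELS:
        try:
            resp = client.chat.completions.create(model=model, messages=messages)
            return model, resp.choices[0].message.content
        except Exception as exc:
            last_error = exc  # rate-limited or down: fall through to the next
    raise RuntimeError(f"all models failed, last error: {last_error}")

model, text = complete([{"role": "user", "content": "Refactor this function."}])
print(f"[{model}] {text}")
```

At the listed prices, a call with 10k input tokens and 2k output tokens costs roughly $0.30 on Opus 4.6 versus $0.06 on Sonnet 4.6, so which model sits first in that list is a cost decision as much as a quality one.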
Sora’s shutdown says more about AI economics than AI video
What happened. OpenAI said on March 24 that it is shutting down Sora, its video-generation product. The news spread fast across major outlets, and the reaction on Reddit was part surprise, part “well, that was expensive.” OpenAI did not publish a detailed technical postmortem. It simply made clear that Sora, once one of its flashiest products, is done.
Why it matters. This is a business story dressed up as a product story. AI video looks magical in demos, but it is brutally expensive to run at scale, and consumer excitement does not guarantee a healthy margin. The Sora shutdown is a clean example of what happens when GPU-heavy products stop looking like efficient use of compute. The lesson is bigger than video: every AI feature is now being judged by whether it earns its keep, not whether it wins a keynote.
Developer angle. If your app depends on a flashy, single-provider media API, price in the risk that it disappears or changes shape fast. Abstract your model layer early. Store generation settings outside hardcoded paths. Keep a fallback vendor list. Vendor lock-in feels cheap right up until the vendor decides your favorite endpoint is dead weight.
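As a sketch rather than a prescription, the abstraction can be thin: a protocol between your app and any media vendor, with generation settings loaded from config instead of hardcoded per provider. Every name below is hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Protocol

class MediaProvider(Protocol):
    def generate(self, prompt: str, settings: dict) -> bytes: ...

@dataclass
class GenerationJob:
    prompt: str
    settings: dict  # loaded from config, never baked into call sites

def load_settings(path: str = "generation.json") -> dict:
    # Settings live in a file, so swapping vendors is a config change
    # plus one new adapter, not a rewrite of every call site.
    with open(path) as f:
        return json.load(f)

def render(job: GenerationJob, providers: list[MediaProvider]) -> bytes:
    # Walk the fallback vendor list until one of them succeeds.
    for provider in providers:
        try:
            return provider.generate(job.prompt, job.settings)
        except Exception:
            continue
    raise RuntimeError("every provider on the fallback list failed")
```

When a vendor kills an endpoint, the blast radius is one adapter class, not your whole product.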
DeepSeek rumor season is back, and that alone is moving the market
What happened. Reddit spent another day watching for signs of a new DeepSeek release. Reports outside Reddit say the company is close to unveiling a fresh model, but there still isn’t a broad public launch to point at. That hasn’t stopped the speculation loop.
Why it matters. DeepSeek changed expectations. It trained developers to assume that a serious model can show up with aggressive pricing and scramble the routing stack overnight. Even unconfirmed DeepSeek chatter now puts pressure on premium inference vendors.
Developer angle. The smartest move is not to guess the winner. It’s to be structurally ready when the next model lands. Keep your eval harness current. Run the same workload across multiple providers. Measure quality, latency, and cost together instead of chasing screenshots from social media. If you’re exploring alternatives, this is exactly why OpenAI-compatible access matters: the teams that can test Claude, GPT, Gemini, and DeepSeek on the same day will adapt faster than the teams waiting to rewrite their integrations.
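A usable version of that harness can be small. The sketch below assumes an OpenAI-compatible client; the model IDs and per-million-token prices are made up, so fill in your own. The point is that quality, latency, and cost come out of the same run over the same cases.

```python
import time

# Hypothetical per-million-token (input, output) prices: use your real rates.
PRICES = {"model-a": (3.00, 15.00), "model-b": (0.50, 2.00)}

def run_eval(client, model, cases, score_fn):
    """Run one model over the same cases; report quality, latency, cost."""
    scores, latencies, cost = [], [], 0.0
    in_price, out_price = PRICES[model]
    for case in cases:
        start = time.perf_counter()
        resp = client.chat.completions.create(model=model, messages=case["messages"])
        latencies.append(time.perf_counter() - start)
        cost += (resp.usage.prompt_tokens * in_price
                 + resp.usage.completion_tokens * out_price) / 1_000_000
        scores.append(score_fn(case, resp.choices[0].message.content))
    return {
        "model": model,
        "quality": sum(scores) / len(scores),
        "p50_latency_s": sorted(latencies)[len(latencies) // 2],
        "total_cost_usd": round(cost, 4),
    }
```

Run it against every provider you care about the day a new model drops, and you have an answer in an afternoon instead of a week of screenshot archaeology.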
Quick Hits
- Anthropic’s usage promo is about to end. The temporary 2x off-peak usage boost ends after March 28.
- Status noise stayed high. Claude’s status page logged multiple incidents this week, including MCP issues and elevated Opus 4.6 and Sonnet 4.6 errors.
- Claude Code trust took a small hit. Users now have one more reason to keep a backup workflow ready.
The pattern this week is pretty simple: the benchmarks got harsher, the traffic got heavier, and the market got less patient with expensive AI toys. That’s healthy. It forces vendors to prove they can do more than trend for 48 hours, and it forces developers to build stacks that survive contact with reality.
Need a safer way to route across models?
Use one OpenAI-compatible endpoint for Claude, GPT, DeepSeek, Gemini, and more. Pay as you go, compare models faster, and avoid baking one vendor’s drama into your whole product.
Get Started Free →