AI Intel: Ex-Manus Lead Ditches Function Calling, llama.cpp Gets MCP, and You Can Run 1M Tokens on a Mac

The former backend lead at Manus — the AI agent startup that had everyone talking earlier this year — just dropped a bombshell on r/LocalLLaMA: after two years of building production agents, he stopped using function calling entirely. The post racked up 1,300+ upvotes and 310 comments in a day. It's the kind of take that makes you rethink your entire agent architecture. Meanwhile, the local LLM scene is having its best week in months.

"I Stopped Using Function Calling Entirely" — The Manus Confession

The post, written in Chinese and translated with AI help, laid out a simple argument: function calling — the structured JSON tool-use pattern that every major API provider pushes — is a trap for production agent systems. Instead, the Manus team exposed their tools as CLIs that the agent interacts with through plain text commands.

Why? Function calling looks clean in demos but breaks in ugly ways at scale. Schema validation errors, hallucinated parameter names, rigid type constraints that don't match how LLMs actually think. The CLI approach lets the model use tools the same way a developer would — by reading help text and typing commands. It's messier on paper but far more resilient in practice.
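The pattern is easy to sketch. Below is a minimal, hypothetical version of the idea: tool help text goes into the prompt as plain text, the model emits a command line the way a developer would, and the agent parses and dispatches it, feeding errors back as text instead of failing schema validation. Every name here (`TOOL_HELP`, `run_tool`, the `wordcount` tool) is invented for illustration, not taken from Manus.

```python
import shlex

# Hypothetical tool catalog. In this pattern, the model discovers tools by
# reading plain help text injected into the system prompt -- no JSON schemas.
TOOL_HELP = """\
Available tools (invoke them like shell commands):
  wordcount <text>   count the words in <text>
  search <query>     search the web for <query>
"""

def run_tool(command: str) -> str:
    """Execute a model-emitted command line and return plain text.

    Parse errors and unknown commands come back as text the model can read
    and retry on -- there is no schema layer to trip over.
    """
    try:
        argv = shlex.split(command)
    except ValueError as e:
        return f"parse error: {e}"
    if not argv:
        return "error: empty command"
    if argv[0] == "wordcount":
        return str(len(" ".join(argv[1:]).split()))
    # In a real agent, unknown commands would fall through to the shell PATH;
    # here we answer the way a CLI would, which the model reads like any error.
    return f"{argv[0]}: command not found\n{TOOL_HELP}"

# The surrounding agent loop (model call elided) would look roughly like:
#   reply = llm(system=TOOL_HELP, messages=history)
#   if reply names a tool: history.append(run_tool(reply))
print(run_tool('wordcount "the quick brown fox"'))  # -> 4
```

Note the failure mode: a malformed command produces a readable error string, not a rejected request, which is exactly the resilience argument from the post.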

The 310-comment thread is worth reading in full. Several commenters confirmed that Manus discussed this philosophy in blog posts and interviews before, but this is the first time someone from the team spelled out the reasoning so clearly. Others pushed back, arguing that function calling with good schema design works fine for simpler use cases.

Here's the developer takeaway: if you're building agents that chain multiple tools together, consider whether structured function calling is actually helping or just adding a brittle abstraction layer. The Manus approach — treat the LLM like a junior developer who reads docs and runs commands — scales better than most people expect. And it works with any model that can follow instructions, not just ones with native tool-use support.

For API users, this has cost implications too. Function calling adds overhead tokens for schema definitions on every request. A CLI-based approach with concise help text can actually be cheaper per interaction, especially when you're running agents that make dozens of tool calls per task. If you're routing through a provider like KissAPI that supports multiple models, you can test this pattern across Claude, GPT, and open-source models without changing your agent code.
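To make the overhead point concrete, here is a rough back-of-envelope comparison for a single invented tool, using the common ~4-characters-per-token heuristic rather than a real tokenizer. The schema and help text are illustrative, not from any actual provider payload.

```python
import json

# A typical function-calling schema for one tool (invented for illustration).
schema = {
    "name": "search_web",
    "description": "Search the web and return the top results.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query."},
            "max_results": {"type": "integer",
                            "description": "How many results to return."},
        },
        "required": ["query"],
    },
}

# The equivalent one-line CLI help text.
help_text = "search <query> [-n N]   search the web, return top N results"

def approx_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token. Good enough for a comparison.
    return len(text) // 4

schema_tokens = approx_tokens(json.dumps(schema))
help_tokens = approx_tokens(help_text)
print(schema_tokens, help_tokens)  # the schema is several times larger
```

Multiply that gap across every request an agent makes per task and the per-interaction savings add up quickly.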

llama.cpp Merges MCP: Local LLMs Get Real Tool Use

The Model Context Protocol (MCP) PR for llama.cpp merged last week, and the community is already going wild with it. A post titled "llama.cpp + Brave search MCP — not gonna lie, it is pretty addictive" hit the front page of r/LocalLLaMA today, with users sharing setups that give local models the same tool-calling capabilities that used to require cloud APIs.

The setup is straightforward: run llama-server with the --webui-mcp-proxy flag, connect an MCP server (Brave Search, DuckDuckGo, SearXNG, whatever you want), and your local Qwen or Llama model can now search the web, read files, and call external APIs. Docker Desktop 4.42 even added streaming and tool calling support for llama.cpp out of the box.
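Once the server is up, it speaks the OpenAI-compatible chat API, so talking to it from code is a few lines. The sketch below assumes a llama-server listening on `localhost:8080` with defaults; the URL, port, and model name are assumptions to adjust for your own invocation.

```python
import json
import urllib.request

# Assumed local endpoint -- llama-server exposes an OpenAI-compatible API.
SERVER = "http://localhost:8080/v1/chat/completions"

def build_request(user_message: str) -> dict:
    return {
        "model": "local",  # llama-server serves whatever model it was started with
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }

def ask(user_message: str) -> str:
    payload = json.dumps(build_request(user_message)).encode()
    req = urllib.request.Request(
        SERVER, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]
    except OSError as e:  # covers connection refused, timeouts, HTTP errors
        return f"server not reachable: {e}"

if __name__ == "__main__":
    print(ask("What's new in llama.cpp this week?"))
```

With the MCP proxy in front, the tool calls themselves happen server-side; your client code stays a plain chat request.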

The community is already debating the best search backend. Brave gives you 1,000 free searches but costs $5/1K after that. SearXNG — a self-hosted meta-search engine — is the popular free alternative. One commenter pointed to Perplexica, an Apache 2.0 licensed project that wraps SearXNG with all the features you'd want.

This matters because it closes the last major gap between local and cloud LLMs. A year ago, "local model with web search" meant janky Python scripts and prayer. Now it's a config flag. The agent ecosystem is decentralizing fast, and MCP is the protocol making it happen.

Nemotron 3 Super: 1 Million Tokens on an M1 Ultra

Someone on r/LocalLLaMA ran NVIDIA's Nemotron 3 Super through llama-bench with a 1-million-token context window on an M1 Ultra Mac Studio. And it worked.

The numbers aren't fast — we're talking minutes for prompt processing at that context length — but the fact that a consumer desktop can handle a million tokens locally is a milestone worth noting. A year ago, 128K context was the bleeding edge for local inference. Now we're at 8x that on hardware you can buy at the Apple Store.

Nemotron 3 Super is NVIDIA's latest open-weight model, optimized for long-context tasks. Running it locally means no API costs, no rate limits, and no data leaving your machine. For use cases like codebase analysis, document processing, or RAG over large corpora, this changes the math on build-vs-buy decisions.

The practical limit is still memory. A fully loaded Mac Studio with 128GB of unified memory can handle it; most machines can't. But Apple Silicon prices keep dropping, and the M4 Ultra is around the corner. Local million-token inference is going from "stunt" to "Tuesday" faster than anyone expected.
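Why memory is the limit is easy to see from the KV cache alone. The calculator below uses illustrative architecture numbers (48 layers, 8 KV heads of dimension 128 — not Nemotron's actual configuration) to show how a million-token context inflates the cache on top of the model weights.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: float = 2.0) -> float:
    """Back-of-envelope KV cache size in GiB.

    2 tensors (K and V) * layers * KV heads * head dim * context * bytes.
    Real memory use adds weights, activations, and runtime overhead on top.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem) / 2**30

# Illustrative only: FP16 cache at a 1M-token context.
print(round(kv_cache_gib(48, 8, 128, 1_000_000), 1))   # -> 183.1 GiB
# An 8-bit quantized KV cache halves that:
print(round(kv_cache_gib(48, 8, 128, 1_000_000, 1.0), 1))  # -> 91.6 GiB
```

Even with grouped-query attention keeping KV heads low, a million tokens at FP16 can eat more memory than the weights themselves, which is why KV cache quantization matters so much for these runs.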

Qwen 3.5 on a Raspberry Pi: The Floor Keeps Dropping

On the opposite end of the hardware spectrum, a developer got Qwen 3.5 35B (the A3B mixture-of-experts variant) running on a Raspberry Pi 5. The post got 52 upvotes — modest by r/LocalLLaMA standards — but the technical achievement is wild.

They patched llama.cpp with a mix of upstream code, ik_llama optimizations, and custom tweaks to squeeze a 35-billion-parameter model onto an $80 single-board computer. It's not fast. It's not practical for production. But it proves that the inference optimization community is nowhere near hitting a wall.
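The raw numbers show why this takes surgery. A mixture-of-experts model only activates a few billion parameters per token, which helps compute, but all the weights still have to live somewhere (in RAM or streamed from storage). A quick footprint estimate at common quantization levels, weights only:

```python
def model_size_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate footprint of quantized weights alone.

    Ignores the KV cache, activations, and runtime overhead, so treat it
    as a lower bound on what the model needs.
    """
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Illustrative: a 35B-parameter model at common quantization levels.
for bits in (16, 8, 4, 2):
    print(f"{bits}-bit: ~{model_size_gib(35, bits):.1f} GiB")
```

Even at 4-bit, 35B parameters is roughly 16 GiB of weights, which is at or beyond what a Pi 5 holds in RAM; hence the custom memory tricks.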

Two years ago, running a 7B model locally was impressive. Now we're at 35B on a Pi. The trajectory here is clear: the hardware floor for useful local AI keeps dropping. Combined with MCP tool use, you could theoretically build a fully autonomous agent on a device that fits in your palm and costs less than a month of ChatGPT Pro.

⚡ Quick Hits

  • New York passes GenAI warning bill: The state legislature passed a bill on March 9 requiring all GenAI systems to display a "conspicuous warning" that outputs may be inaccurate. It's the latest in a wave of state-level AI legislation — HB 2321 (the AI-Generated Content Accountability Act), Colorado's healthcare AI bills, and Oregon's chatbot bill all advanced this week. 2026 is shaping up as the year AI regulation actually lands.
  • 2026: "The Year of AI Agents": India Today ran a piece cataloging Claude Cowork, Perplexity Computer, and Manus as evidence that autonomous agents are replacing traditional SaaS workflows. The article argues that by year-end, agent-building skills will be the most in-demand capability in tech hiring. Given what we're seeing in the Reddit communities, hard to disagree.
  • API prices down 80% year-over-year: Yesterday's Reddit intel roundup flagged that API pricing across major providers has dropped roughly 80% compared to March 2025. Claude Sonnet, GPT-4o, and Gemini Pro all cost a fraction of what they did a year ago. For developers running agents that make hundreds of API calls per task, the economics have fundamentally shifted — and aggregators that negotiate volume rates can push costs even lower.

Building Agents? Your Model Choice Shouldn't Lock You In.

Whether you're testing the Manus CLI pattern across Claude and GPT, or mixing cloud APIs with local inference — KissAPI gives you one endpoint for 200+ models with pay-as-you-go pricing.

Explore Models →