# Qwen 3.5 API Complete Guide: Every Model, Pricing & Code Examples (2026)
Alibaba's Qwen team dropped the full Qwen 3.5 family across February and early March 2026, and it's a lot to take in. There are dense models, MoE models, API-only models, and tiny models you can run on a phone. Some of them punch way above their weight class.
This guide covers every Qwen 3.5 model available right now, what each one costs, and how to actually use them through an API or locally. No fluff — just the stuff you need to pick the right model and start building.
## The Full Qwen 3.5 Model Lineup
Qwen 3.5 isn't one model. It's a family of ten models spanning three tiers. Here's the breakdown:
### Flagship (API-only)
| Model | Type | Context | Released |
|---|---|---|---|
| Qwen3.5-Plus | MoE (API-only) | 262K | Feb 15, 2026 |
| Qwen3.5-Flash | MoE (API-only) | 262K | Feb 15, 2026 |
Plus is the flagship. Flash is the speed-optimized variant — cheaper, faster, slightly less capable. Both are multimodal (text + vision) out of the box.
### Large Open-Weight Models
| Model | Params | Type | Context | Released |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | 397B (17B active) | MoE | 262K | Feb 24 |
| Qwen3.5-122B-A10B | 122B (10B active) | MoE | 262K | Feb 24 |
| Qwen3.5-35B-A3B | 35B (3B active) | MoE | 262K | Feb 24 |
| Qwen3.5-27B | 27B | Dense | 262K | Feb 24 |
### Small Open-Weight Models
| Model | Params | Type | Context | Released |
|---|---|---|---|---|
| Qwen3.5-9B | 9B | Dense | 262K | Mar 2 |
| Qwen3.5-4B | 4B | Dense | 262K | Mar 2 |
| Qwen3.5-2B | 2B | Dense | 262K | Mar 2 |
| Qwen3.5-0.8B | 0.8B | Dense | 262K | Mar 2 |
The standout here is the 9B model. On GPQA Diamond, it scored 81.7 — matching models 13x its size. That's not a typo. For local inference or cost-sensitive API calls, the 9B is the sweet spot of the entire family.
## What Makes Qwen 3.5 Different
Three things stand out compared to Qwen 3 and competing model families:
Unified vision-language architecture. Every Qwen 3.5 model handles both text and images natively. No separate "-VL" variant needed. You send an image in the same API call as text, and it just works.
262K context across the board. Even the 0.8B model gets 262K tokens of context. That's double what GPT-5 offers (128K) and more than Claude's 200K. For processing long documents or codebases, this matters.
MoE efficiency at every scale. The 35B-A3B model has 35 billion parameters but only activates 3 billion per token. You get 35B-quality output at 3B inference cost. The 122B-A10B follows the same pattern — 122B knowledge, 10B compute.
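To make that trade-off concrete, here's a back-of-envelope sketch (plain arithmetic, using the parameter counts from the tables above): per-token compute scales with *active* parameters, while memory footprint scales with *total* parameters.

```python
# Per-token inference compute scales with *active* parameters; memory
# footprint scales with *total* parameters. Numbers are from the model
# tables above (in billions of parameters).
def moe_compute_ratio(total_params_b: float, active_params_b: float) -> float:
    """Fraction of an equivalent dense model's per-token compute the MoE spends."""
    return active_params_b / total_params_b

# Qwen3.5-35B-A3B: 35B total, 3B active per token
ratio = moe_compute_ratio(35, 3)
print(f"Active fraction: {ratio:.1%}")  # Active fraction: 8.6%
```

So the 35B-A3B spends under a tenth of a dense 35B model's per-token compute, which is why its API pricing and local throughput look like a much smaller model's.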
## API Pricing Comparison
Qwen 3.5 is available through multiple providers. Prices vary significantly:
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Qwen3.5-Plus | DashScope | $0.10 | $0.30 |
| Qwen3.5-Plus | OpenRouter | $0.26 | $1.56 |
| Qwen3.5-Flash | DashScope | $0.02 | $0.06 |
| Qwen3.5-27B | OpenRouter | $0.10 | $0.18 |
| Qwen3.5-9B | Groq | ~$0.05 | ~$0.10 |
For context, GPT-5 charges $2.50/$10.00 per million tokens. Claude Sonnet 4.6 is $3/$15. Qwen3.5-Plus at $0.10/$0.30 through DashScope is 30x cheaper on input and 50x cheaper on output than Claude Sonnet.
That's not a fair comparison in terms of raw capability — Claude and GPT-5 are still stronger on complex reasoning. But for many production workloads (summarization, extraction, classification, simple code generation), Qwen 3.5 gets the job done at a fraction of the cost.
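To see what that gap means in practice, here's a quick cost calculation using the table's per-million-token prices; the 10M-input / 2M-output daily workload is an illustrative assumption, not a benchmark.

```python
# Back-of-envelope daily cost for a hypothetical workload of 10M input /
# 2M output tokens, using (input, output) prices per 1M tokens from above.
PRICES = {
    "qwen3.5-plus (DashScope)": (0.10, 0.30),
    "qwen3.5-flash (DashScope)": (0.02, 0.06),
    "gpt-5": (2.50, 10.00),
    "claude-sonnet-4.6": (3.00, 15.00),
}

def daily_cost(input_m: float, output_m: float, prices: tuple) -> float:
    in_price, out_price = prices
    return input_m * in_price + output_m * out_price

for model, prices in PRICES.items():
    print(f"{model}: ${daily_cost(10, 2, prices):.2f}/day")
# qwen3.5-plus (DashScope): $1.60/day
# qwen3.5-flash (DashScope): $0.32/day
# gpt-5: $45.00/day
# claude-sonnet-4.6: $60.00/day
```

At that volume, the difference between $1.60/day and $60/day compounds into tens of thousands of dollars a year.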
## Setting Up the API: Python
Qwen 3.5 uses the OpenAI-compatible API format, so you can use the standard OpenAI SDK. Here's how to call Qwen3.5-Plus through DashScope:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-dashscope-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

response = client.chat.completions.create(
    model="qwen-plus",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that validates email addresses using regex"}
    ],
    temperature=0.3,
    max_tokens=1024
)

print(response.choices[0].message.content)
```
To send an image (vision), add it to the message content:
```python
response = client.chat.completions.create(
    model="qwen-plus",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this screenshot?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}}
        ]
    }]
)
```
## Setting Up the API: Node.js
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});

const response = await client.chat.completions.create({
  model: "qwen-plus",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Explain the difference between Promise.all and Promise.allSettled" }
  ],
  temperature: 0.3,
});

console.log(response.choices[0].message.content);
```
## Setting Up the API: curl
```bash
curl https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "messages": [
      {"role": "user", "content": "What are the SOLID principles in software engineering?"}
    ],
    "temperature": 0.5
  }'
```
## Using Qwen 3.5 Through an OpenAI-Compatible Gateway
If you're already using the OpenAI SDK in your app and don't want to manage multiple API keys, you can access Qwen 3.5 through an OpenAI-compatible gateway like KissAPI. Same SDK, same code — just change the base URL and model name:
```python
client = OpenAI(
    api_key="your-kissapi-key",
    base_url="https://api.kissapi.ai/v1"
)

# Use Qwen 3.5 Plus
response = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Optimize this SQL query: ..."}]
)

# Switch to Claude or GPT-5 by changing the model name
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Same query, different model"}]
)
```
One API key, one endpoint, every model. That's the point of an OpenAI-compatible gateway — you don't rewrite your code when you switch models.
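One pattern this enables is routing: send routine tasks to a cheap model and hard ones to a frontier model, all through the same client. A minimal sketch — the task categories and routing table here are this example's own, not a gateway feature:

```python
# Hypothetical routing table: cheap model for routine work, frontier model
# for hard reasoning. Model names follow the examples above.
ROUTES = {
    "summarize": "qwen3.5-flash",
    "extract": "qwen3.5-flash",
    "codegen": "qwen-plus",
    "hard_reasoning": "claude-sonnet-4-6",
}

def pick_model(task: str) -> str:
    """Return the model for a task, defaulting to the Qwen flagship."""
    return ROUTES.get(task, "qwen-plus")

print(pick_model("extract"))         # qwen3.5-flash
print(pick_model("hard_reasoning"))  # claude-sonnet-4-6
```

Pass the result as the `model` argument to `client.chat.completions.create(...)` and the rest of your request code stays identical.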
## Running Qwen 3.5 Locally with Ollama
Every open-weight Qwen 3.5 model is available on Ollama. For the small models, setup takes about 30 seconds:
```bash
# Pull and run the 9B model (recommended sweet spot)
ollama run qwen3.5:9b

# Or the tiny 0.8B for edge/mobile use cases
ollama run qwen3.5:0.8b

# The 27B dense model for serious local work
ollama run qwen3.5:27b

# MoE models (need more RAM but activate fewer params)
ollama run qwen3.5:35b-a3b
ollama run qwen3.5:122b-a10b
```
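Once a model is pulled, Ollama also serves an OpenAI-compatible HTTP API on `localhost:11434`, so you can call the local model from code with a plain POST. A minimal stdlib-only sketch (assumes the Ollama daemon is running and `qwen3.5:9b` has been pulled):

```python
import json
import urllib.request

# Ollama exposes an OpenAI-compatible endpoint locally; no API key needed.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat(prompt: str, model: str = "qwen3.5:9b") -> str:
    """Send one chat turn to the local Ollama server and return the reply text."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires a running Ollama daemon):
# print(chat("Summarize mixture-of-experts routing in one sentence."))
```

Because the endpoint is OpenAI-compatible, the earlier OpenAI SDK examples also work against it if you point `base_url` at `http://localhost:11434/v1`.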
Memory requirements for the quantized (Q4) versions:
| Model | Min RAM/VRAM | Recommended |
|---|---|---|
| 0.8B | 1 GB | 2 GB |
| 2B | 2 GB | 4 GB |
| 4B | 3 GB | 6 GB |
| 9B | 6 GB | 10 GB |
| 27B | 16 GB | 24 GB |
| 35B-A3B | 20 GB | 32 GB |
| 122B-A10B | 64 GB | 96 GB |
The 9B model runs comfortably on any modern GPU with 10+ GB VRAM. On an M4 Mac with 24 GB unified memory, it does about 40 tokens/second — fast enough for interactive use.
## Which Qwen 3.5 Model Should You Use?
Here's the decision tree:
- Maximum quality, don't care about cost: Qwen3.5-Plus via API. It's the flagship and it shows.
- Best bang for buck via API: Qwen3.5-Flash. Absurdly cheap at $0.02/$0.06 per million tokens. Good enough for classification, extraction, and simple generation.
- Best local model (GPU with 10+ GB): Qwen3.5-9B. The benchmark numbers relative to its size are hard to argue with.
- Best local model (24+ GB GPU): Qwen3.5-27B. Dense architecture means predictable performance, and 27B is the sweet spot before you need multi-GPU setups.
- Edge/mobile/embedded: Qwen3.5-0.8B or 2B. They won't win any benchmarks, but they run on a Raspberry Pi.
- Cost-sensitive production at scale: Qwen3.5-35B-A3B via API or self-hosted. MoE means you get 35B quality at 3B inference cost.
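The decision tree above, condensed into a (deliberately simplified, illustrative) function — the thresholds come from the memory table earlier, and the priority labels are this example's own:

```python
# Simplified model chooser: local picks follow the Q4 memory table above;
# API picks follow the decision tree. Thresholds are rough guidelines.
def choose_model(vram_gb=None, priority="cost"):
    if vram_gb is not None:  # local deployment: pick by available VRAM
        if vram_gb >= 24:
            return "qwen3.5:27b"
        if vram_gb >= 10:
            return "qwen3.5:9b"
        return "qwen3.5:2b"
    # API deployment: flagship for quality, Flash for cost
    return "qwen-plus" if priority == "quality" else "qwen3.5-flash"

print(choose_model(vram_gb=12))          # qwen3.5:9b
print(choose_model(priority="quality"))  # qwen-plus
```

Real deployments have more dimensions (latency budgets, vision needs, data residency), but this captures the two questions that matter most: where does it run, and what are you optimizing for?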
## Qwen 3.5 vs the Competition
| Model | Input Price | Output Price | Context | Vision |
|---|---|---|---|---|
| Qwen3.5-Plus | $0.10 | $0.30 | 262K | Yes |
| GPT-5 | $2.50 | $10.00 | 128K | Yes |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | Yes |
| Gemini 3.1 Pro | $1.25 | $5.00 | 2M | Yes |
| DeepSeek V4 | $0.14 | $0.28 | 128K | No |
Qwen 3.5 sits in the "absurdly cheap" tier alongside DeepSeek V4. The difference: Qwen 3.5 has native vision, longer context (262K vs 128K), and a wider range of model sizes for different deployment scenarios.
Is it as smart as Claude Opus or GPT-5 on hard reasoning tasks? No. But it doesn't need to be. Most API calls in production aren't asking the model to solve PhD-level math. They're asking it to parse JSON, summarize text, classify intent, or generate boilerplate code. For those tasks, paying 25-50x more for a frontier model is just burning money.
## Enabling Reasoning Mode
Qwen 3.5 models support a "thinking" mode similar to OpenAI's o1. For the small models (0.8B-9B), reasoning is disabled by default. To enable it via the API:
```python
response = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational"}],
    extra_body={"enable_thinking": True}
)
```
Reasoning mode uses more tokens (the model "thinks" before answering), so expect higher costs per request. Use it selectively — math, logic, and multi-step planning benefit from it. Simple Q&A doesn't.
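One way to apply that "use it selectively" advice is a small wrapper that only sets the flag for task types that benefit. A sketch — the task categories here are this example's own convention, not part of the API:

```python
# Illustrative: enable thinking only for tasks that benefit from it.
REASONING_TASKS = {"math", "logic", "planning"}

def request_kwargs(task: str, prompt: str) -> dict:
    """Build kwargs for client.chat.completions.create, toggling thinking by task."""
    kwargs = {
        "model": "qwen-plus",
        "messages": [{"role": "user", "content": prompt}],
    }
    if task in REASONING_TASKS:
        kwargs["extra_body"] = {"enable_thinking": True}
    return kwargs

# Usage: client.chat.completions.create(**request_kwargs("math", "Prove ..."))
print(request_kwargs("qa", "Capital of France?").get("extra_body"))  # None
```

Simple Q&A requests skip the flag entirely, so you never pay for thinking tokens on tasks that don't need them.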
## Getting Started in 2 Minutes
Fastest path to a working Qwen 3.5 API call:
- Sign up for DashScope (Alibaba's API platform) or an OpenAI-compatible gateway like KissAPI
- Grab your API key from the dashboard
- Install the OpenAI SDK: `pip install openai`
- Run this:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

r = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Hello, Qwen 3.5!"}]
)
print(r.choices[0].message.content)
```
That's it. You're running one of the most cost-effective model families available in 2026, through the same SDK you already know.