# Qwen 3.5 API Complete Guide: Every Model, Pricing & Code Examples (2026)
Alibaba's Qwen team dropped the full Qwen 3.5 family across February and early March 2026, and it's a lot to take in. There are dense models, MoE models, API-only models, and tiny models you can run on a phone. Some of them punch way above their weight class.
This guide covers every Qwen 3.5 model available right now, what each one costs, and how to actually use them through an API or locally. No fluff — just the stuff you need to pick the right model and start building.
## The Full Qwen 3.5 Model Lineup
Qwen 3.5 isn't one model. It's a family of ten models spanning three tiers. Here's the breakdown:
### Flagship (API-only)
| Model | Type | Context | Released |
|---|---|---|---|
| Qwen3.5-Plus | MoE (API-only) | 262K | Feb 15, 2026 |
| Qwen3.5-Flash | MoE (API-only) | 262K | Feb 15, 2026 |
Plus is the flagship. Flash is the speed-optimized variant — cheaper, faster, slightly less capable. Both are multimodal (text + vision) out of the box.
### Large Open-Weight Models
| Model | Params | Type | Context | Released |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | 397B (17B active) | MoE | 262K | Feb 24 |
| Qwen3.5-122B-A10B | 122B (10B active) | MoE | 262K | Feb 24 |
| Qwen3.5-35B-A3B | 35B (3B active) | MoE | 262K | Feb 24 |
| Qwen3.5-27B | 27B | Dense | 262K | Feb 24 |
### Small Open-Weight Models
| Model | Params | Type | Context | Released |
|---|---|---|---|---|
| Qwen3.5-9B | 9B | Dense | 262K | Mar 2 |
| Qwen3.5-4B | 4B | Dense | 262K | Mar 2 |
| Qwen3.5-2B | 2B | Dense | 262K | Mar 2 |
| Qwen3.5-0.8B | 0.8B | Dense | 262K | Mar 2 |
The standout here is the 9B model. On GPQA Diamond, it scored 81.7 — matching models 13x its size. That's not a typo. For local inference or cost-sensitive API calls, the 9B is the sweet spot of the entire family.
## What Makes Qwen 3.5 Different
Three things stand out compared to Qwen 3 and competing model families:
Unified vision-language architecture. Every Qwen 3.5 model handles both text and images natively. No separate "-VL" variant needed. You send an image in the same API call as text, and it just works.
262K context across the board. Even the 0.8B model gets 262K tokens of context. That's double what GPT-5 offers (128K) and more than Claude's 200K. For processing long documents or codebases, this matters.
MoE efficiency at every scale. The 35B-A3B model has 35 billion parameters but only activates 3 billion per token. You get 35B-quality output at 3B inference cost. The 122B-A10B follows the same pattern — 122B knowledge, 10B compute.
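To make that trade-off concrete, here's a back-of-envelope sketch (plain arithmetic, using the parameter counts from the tables above): per-token compute scales with *active* parameters, while memory footprint scales with *total* parameters.

```python
# Per-token inference compute scales with *active* parameters; memory
# footprint scales with *total* parameters. Numbers are from the model
# tables above (in billions of parameters).
def moe_compute_ratio(total_params_b: float, active_params_b: float) -> float:
    """Fraction of an equivalent dense model's per-token compute the MoE spends."""
    return active_params_b / total_params_b

# Qwen3.5-35B-A3B: 35B total, 3B active per token
ratio = moe_compute_ratio(35, 3)
print(f"Active fraction: {ratio:.1%}")  # Active fraction: 8.6%
```

So the 35B-A3B spends under a tenth of a dense 35B model's per-token compute, which is why its API pricing and local throughput look like a much smaller model's.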
## API Pricing Comparison
Qwen 3.5 is available through multiple providers. Prices vary significantly:
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Qwen3.5-Plus | DashScope | $0.10 | $0.30 |
| Qwen3.5-Plus | OpenRouter | $0.26 | $1.56 |
| Qwen3.5-Flash | DashScope | $0.02 | $0.06 |
| Qwen3.5-27B | OpenRouter | $0.10 | $0.18 |
| Qwen3.5-9B | Groq | ~$0.05 | ~$0.10 |
For context, GPT-5 charges $2.50/$10.00 per million tokens. Claude Sonnet 4.6 is $3/$15. Qwen3.5-Plus at $0.10/$0.30 through DashScope is 30x cheaper on input and 50x cheaper on output than Claude Sonnet.
That's not a fair comparison in terms of raw capability — Claude and GPT-5 are still stronger on complex reasoning. But for many production workloads (summarization, extraction, classification, simple code generation), Qwen 3.5 gets the job done at a fraction of the cost.
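To see what that gap means in practice, here's a quick cost calculation using the table's per-million-token prices; the 10M-input / 2M-output daily workload is an illustrative assumption, not a benchmark.

```python
# Back-of-envelope daily cost for a hypothetical workload of 10M input /
# 2M output tokens, using (input, output) prices per 1M tokens from above.
PRICES = {
    "qwen3.5-plus (DashScope)": (0.10, 0.30),
    "qwen3.5-flash (DashScope)": (0.02, 0.06),
    "gpt-5": (2.50, 10.00),
    "claude-sonnet-4.6": (3.00, 15.00),
}

def daily_cost(input_m: float, output_m: float, prices: tuple) -> float:
    in_price, out_price = prices
    return input_m * in_price + output_m * out_price

for model, prices in PRICES.items():
    print(f"{model}: ${daily_cost(10, 2, prices):.2f}/day")
# qwen3.5-plus (DashScope): $1.60/day
# qwen3.5-flash (DashScope): $0.32/day
# gpt-5: $45.00/day
# claude-sonnet-4.6: $60.00/day
```

At that volume, the difference between $1.60/day and $60/day compounds into tens of thousands of dollars a year.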
## Setting Up the API: Python
Qwen 3.5 uses the OpenAI-compatible API format, so you can use the standard OpenAI SDK. Here's how to call Qwen3.5-Plus through DashScope:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-dashscope-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

response = client.chat.completions.create(
    model="qwen-plus",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that validates email addresses using regex"}
    ],
    temperature=0.3,
    max_tokens=1024
)

print(response.choices[0].message.content)
```
To send an image (vision), add it to the message content:
```python
response = client.chat.completions.create(
    model="qwen-plus",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this screenshot?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}}
        ]
    }]
)
```
## Setting Up the API: Node.js
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});

const response = await client.chat.completions.create({
  model: "qwen-plus",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Explain the difference between Promise.all and Promise.allSettled" }
  ],
  temperature: 0.3,
});

console.log(response.choices[0].message.content);
```
## Setting Up the API: curl
```bash
curl https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "messages": [
      {"role": "user", "content": "What are the SOLID principles in software engineering?"}
    ],
    "temperature": 0.5
  }'
```
## Using Qwen 3.5 Through an OpenAI-Compatible Gateway
If you're already using the OpenAI SDK in your app and don't want to manage multiple API keys, you can access Qwen 3.5 through an OpenAI-compatible gateway like KissAPI. Same SDK, same code — just change the base URL and model name:
```python
client = OpenAI(
    api_key="your-kissapi-key",
    base_url="https://api.kissapi.ai/v1"
)

# Use Qwen 3.5 Plus
response = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Optimize this SQL query: ..."}]
)

# Switch to Claude or GPT-5 by changing the model name
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Same query, different model"}]
)
```
One API key, one endpoint, every model. That's the point of an OpenAI-compatible gateway — you don't rewrite your code when you switch models.
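One pattern this enables is routing: send routine tasks to a cheap model and hard ones to a frontier model, all through the same client. A minimal sketch — the task categories and routing table here are this example's own, not a gateway feature:

```python
# Hypothetical routing table: cheap model for routine work, frontier model
# for hard reasoning. Model names follow the examples above.
ROUTES = {
    "summarize": "qwen3.5-flash",
    "extract": "qwen3.5-flash",
    "codegen": "qwen-plus",
    "hard_reasoning": "claude-sonnet-4-6",
}

def pick_model(task: str) -> str:
    """Return the model for a task, defaulting to the Qwen flagship."""
    return ROUTES.get(task, "qwen-plus")

print(pick_model("extract"))         # qwen3.5-flash
print(pick_model("hard_reasoning"))  # claude-sonnet-4-6
```

Pass the result as the `model` argument to `client.chat.completions.create(...)` and the rest of your request code stays identical.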
## Running Qwen 3.5 Locally with Ollama
Every open-weight Qwen 3.5 model is available on Ollama. For the small models, setup takes about 30 seconds:
```bash
# Pull and run the 9B model (recommended sweet spot)
ollama run qwen3.5:9b

# Or the tiny 0.8B for edge/mobile use cases
ollama run qwen3.5:0.8b

# The 27B dense model for serious local work
ollama run qwen3.5:27b

# MoE models (need more RAM but activate fewer params)
ollama run qwen3.5:35b-a3b
ollama run qwen3.5:122b-a10b
```
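Once a model is pulled, Ollama also serves an OpenAI-compatible HTTP API on `localhost:11434`, so you can call the local model from code with a plain POST. A minimal stdlib-only sketch (assumes the Ollama daemon is running and `qwen3.5:9b` has been pulled):

```python
import json
import urllib.request

# Ollama exposes an OpenAI-compatible endpoint locally; no API key needed.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat(prompt: str, model: str = "qwen3.5:9b") -> str:
    """Send one chat turn to the local Ollama server and return the reply text."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires a running Ollama daemon):
# print(chat("Summarize mixture-of-experts routing in one sentence."))
```

Because the endpoint is OpenAI-compatible, the earlier OpenAI SDK examples also work against it if you point `base_url` at `http://localhost:11434/v1`.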
Memory requirements for the quantized (Q4) versions:
| Model | Min RAM/VRAM | Recommended |
|---|---|---|
| 0.8B | 1 GB | 2 GB |
| 2B | 2 GB | 4 GB |
| 4B | 3 GB | 6 GB |
| 9B | 6 GB | 10 GB |
| 27B | 16 GB | 24 GB |
| 35B-A3B | 20 GB | 32 GB |
| 122B-A10B | 64 GB | 96 GB |
The 9B model runs comfortably on any modern GPU with 10+ GB VRAM. On an M4 Mac with 24 GB unified memory, it does about 40 tokens/second — fast enough for interactive use.
## Which Qwen 3.5 Model Should You Use?
Here's the decision tree:
- Maximum quality, don't care about cost: Qwen3.5-Plus via API. It's the flagship and it shows.
- Best bang for buck via API: Qwen3.5-Flash. Absurdly cheap at $0.02/$0.06 per million tokens. Good enough for classification, extraction, and simple generation.
- Best local model (GPU with 10+ GB): Qwen3.5-9B. The benchmark numbers relative to its size are hard to argue with.
- Best local model (24+ GB GPU): Qwen3.5-27B. Dense architecture means predictable performance, and 27B is the sweet spot before you need multi-GPU setups.
- Edge/mobile/embedded: Qwen3.5-0.8B or 2B. They won't win any benchmarks, but they run on a Raspberry Pi.
- Cost-sensitive production at scale: Qwen3.5-35B-A3B via API or self-hosted. MoE means you get 35B quality at 3B inference cost.
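The decision tree above, condensed into a (deliberately simplified, illustrative) function — the thresholds come from the memory table earlier, and the priority labels are this example's own:

```python
# Simplified model chooser: local picks follow the Q4 memory table above;
# API picks follow the decision tree. Thresholds are rough guidelines.
def choose_model(vram_gb=None, priority="cost"):
    if vram_gb is not None:  # local deployment: pick by available VRAM
        if vram_gb >= 24:
            return "qwen3.5:27b"
        if vram_gb >= 10:
            return "qwen3.5:9b"
        return "qwen3.5:2b"
    # API deployment: flagship for quality, Flash for cost
    return "qwen-plus" if priority == "quality" else "qwen3.5-flash"

print(choose_model(vram_gb=12))          # qwen3.5:9b
print(choose_model(priority="quality"))  # qwen-plus
```

Real deployments have more dimensions (latency budgets, vision needs, data residency), but this captures the two questions that matter most: where does it run, and what are you optimizing for?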
## Qwen 3.5 vs the Competition
| Model | Input Price | Output Price | Context | Vision |
|---|---|---|---|---|
| Qwen3.5-Plus | $0.10 | $0.30 | 262K | Yes |
| GPT-5 | $2.50 | $10.00 | 128K | Yes |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | Yes |
| Gemini 3.1 Pro | $1.25 | $5.00 | 2M | Yes |
| DeepSeek V4 | $0.14 | $0.28 | 128K | No |
Qwen 3.5 sits in the "absurdly cheap" tier alongside DeepSeek V4. The difference: Qwen 3.5 has native vision, longer context (262K vs 128K), and a wider range of model sizes for different deployment scenarios.
Is it as smart as Claude Opus or GPT-5 on hard reasoning tasks? No. But it doesn't need to be. Most API calls in production aren't asking the model to solve PhD-level math. They're asking it to parse JSON, summarize text, classify intent, or generate boilerplate code. For those tasks, paying 25-50x more for a frontier model is just burning money.
## Enabling Reasoning Mode
Qwen 3.5 models support a "thinking" mode similar to OpenAI's o1. For the small models (0.8B-9B), reasoning is disabled by default. To enable it via the API:
```python
response = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational"}],
    extra_body={"enable_thinking": True}
)
```
Reasoning mode uses more tokens (the model "thinks" before answering), so expect higher costs per request. Use it selectively — math, logic, and multi-step planning benefit from it. Simple Q&A doesn't.
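One way to apply that "use it selectively" advice is a small wrapper that only sets the flag for task types that benefit. A sketch — the task categories here are this example's own convention, not part of the API:

```python
# Illustrative: enable thinking only for tasks that benefit from it.
REASONING_TASKS = {"math", "logic", "planning"}

def request_kwargs(task: str, prompt: str) -> dict:
    """Build kwargs for client.chat.completions.create, toggling thinking by task."""
    kwargs = {
        "model": "qwen-plus",
        "messages": [{"role": "user", "content": prompt}],
    }
    if task in REASONING_TASKS:
        kwargs["extra_body"] = {"enable_thinking": True}
    return kwargs

# Usage: client.chat.completions.create(**request_kwargs("math", "Prove ..."))
print(request_kwargs("qa", "Capital of France?").get("extra_body"))  # None
```

Simple Q&A requests skip the flag entirely, so you never pay for thinking tokens on tasks that don't need them.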
## Getting Started in 2 Minutes
Fastest path to a working Qwen 3.5 API call:
- Sign up for DashScope (Alibaba's API platform) or an OpenAI-compatible gateway like KissAPI
- Grab your API key from the dashboard
- Install the OpenAI SDK: `pip install openai`
- Run this:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

r = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Hello, Qwen 3.5!"}]
)
print(r.choices[0].message.content)
```
That's it. You're running one of the most cost-effective model families available in 2026, through the same SDK you already know.