Gemini 3.1 Pro API Python Quickstart: Streaming, Thinking Levels & Cost Tips
Google dropped Gemini 3.1 Pro on February 12, and the benchmarks are hard to ignore: 77.1% on ARC-AGI-2 (more than double the previous version's score), 80.6% on SWE-Bench Verified, and it costs the same as Gemini 3 Pro — $2 per million input tokens. That's 7.5x cheaper than Claude Opus 4.6 for input.
If you haven't tried it yet, this guide gets you from zero to working code in about 10 minutes. We'll cover the basics, streaming, thinking levels, context caching, and some real cost numbers so you know what to expect on your bill.
Prerequisites
You need two things:
- Python 3.10+ installed
- A Google AI API key (free from AI Studio) or an API gateway key that supports Gemini models
Install the SDK:
```bash
pip install google-genai
```
That's the new unified SDK. If you're still using google-generativeai, it works too, but the new google-genai package is cleaner and what Google recommends going forward.
Basic Request: Your First API Call
Here's the simplest possible call to Gemini 3.1 Pro:
```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Write a Python function that finds the longest palindromic substring in a string.",
)

print(response.text)
```
The model name is gemini-3.1-pro-preview while it's in preview. Google will likely drop the -preview suffix once it goes GA.
One thing you'll notice right away: responses feel faster than Gemini 3 Pro, and the output is more concise. Google specifically optimized 3.1 Pro to use fewer tokens while maintaining quality — which also means lower costs per request.
Streaming Responses
For anything user-facing, you want streaming. Nobody likes staring at a blank screen for 5 seconds. Here's how:
```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content_stream(
    model="gemini-3.1-pro-preview",
    contents="Explain how B-trees work and why databases use them.",
)

for chunk in response:
    print(chunk.text, end="", flush=True)
```
The streaming API returns chunks as they're generated. Each chunk has a .text property with the partial response. Time-to-first-token is typically under 500ms, which feels snappy in a chat interface.
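If you also need the complete response after streaming (for logging, caching, or a database write), accumulate the chunks as they arrive. The pattern is independent of the SDK, so this sketch uses a stand-in `Chunk` class for illustration; with the real API you would pass the `generate_content_stream(...)` result directly.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Chunk:
    # Stand-in for the SDK's streaming chunk objects, which expose .text the same way.
    text: str

def stream_and_collect(chunks: Iterable) -> str:
    """Print each chunk as it arrives and return the assembled full response."""
    parts = []
    for chunk in chunks:
        if chunk.text:  # chunks can occasionally carry empty text
            print(chunk.text, end="", flush=True)
            parts.append(chunk.text)
    print()  # final newline once the stream ends
    return "".join(parts)

# With the real SDK you would call:
# full_text = stream_and_collect(client.models.generate_content_stream(...))
```

This way the user sees tokens immediately while your backend still ends up with the full string in one place.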
Thinking Levels: Low, Medium, and High
This is where Gemini 3.1 Pro gets interesting. Unlike the binary "thinking on/off" in some other models, Gemini gives you three thinking levels. This lets you trade off between speed, cost, and reasoning depth per request.
```python
from google import genai
from google.genai.types import GenerateContentConfig, ThinkingConfig

client = genai.Client(api_key="YOUR_API_KEY")

# Low thinking — fast, cheap, good for simple tasks
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Convert this SQL to a SQLAlchemy ORM query: SELECT * FROM users WHERE age > 25 ORDER BY name",
    config=GenerateContentConfig(
        thinking_config=ThinkingConfig(thinking_budget=1024)
    ),
)

# High thinking — slower, more tokens, but nails hard problems
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Find and fix the race condition in this Go code: [your code here]",
    config=GenerateContentConfig(
        thinking_config=ThinkingConfig(thinking_budget=8192)
    ),
)

print(response.text)
```
The thinking_budget controls how many tokens the model can use for internal reasoning. Lower budget = faster and cheaper. Higher budget = more thorough reasoning.
Here's a rough guide for when to use each level:
| Thinking Budget | Best For | Typical Latency |
|---|---|---|
| 1024 (low) | Simple Q&A, formatting, translation, code conversion | 1-3s |
| 4096 (medium) | Code generation, debugging, analysis, summarization | 3-8s |
| 8192+ (high) | Complex reasoning, multi-step math, architecture decisions | 8-20s |
The cost difference is real. A high-thinking request might use 3-5x more tokens than a low-thinking one. For production apps, routing simple queries to low thinking and hard queries to high thinking can cut your bill significantly.
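A minimal router along these lines is just a classifier in front of the API call. The keyword heuristics and budget tiers below are illustrative assumptions (in production you might use request metadata or a small classifier model instead), not part of the Gemini API:

```python
def pick_thinking_budget(prompt: str) -> int:
    """Map a prompt to a thinking budget with a crude keyword heuristic.

    Tiers mirror the table above: 1024 for simple transforms,
    8192 for genuinely hard reasoning, 4096 for everything else.
    """
    text = prompt.lower()
    hard_markers = ("race condition", "prove", "architecture", "deadlock", "optimize")
    simple_markers = ("convert", "translate", "format", "rename")
    if any(marker in text for marker in hard_markers):
        return 8192
    if any(marker in text for marker in simple_markers):
        return 1024
    return 4096

# The chosen budget then goes into ThinkingConfig(thinking_budget=...)
```

Even a heuristic this crude keeps "convert this JSON to YAML" requests off the expensive path.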
Context Caching: The Cost Hack Most People Miss
If you're sending the same system prompt or reference documents with every request, you're paying full price for those tokens every time. Context caching fixes this.
```python
from google import genai
from google.genai.types import (
    Content,
    Part,
    CreateCachedContentConfig,
    GenerateContentConfig,
)

client = genai.Client(api_key="YOUR_API_KEY")

# Cache your system prompt + reference docs (must be 32K+ tokens)
cache = client.caches.create(
    model="gemini-3.1-pro-preview",
    config=CreateCachedContentConfig(
        contents=[
            Content(
                role="user",
                parts=[Part(text="[Your 50K-token codebase or documentation here]")],
            )
        ],
        system_instruction="You are a senior developer reviewing code for security vulnerabilities.",
        ttl="3600s",  # Cache lives for 1 hour
    ),
)

# Now use the cache — input tokens from cached content cost 75% less
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Review the authentication module for SQL injection risks.",
    config=GenerateContentConfig(
        cached_content=cache.name,
    ),
)

print(response.text)
```
Cached input tokens cost $0.50 per million instead of $2.00 — a 75% discount. The catch: your cached content needs to be at least 32,768 tokens, and you pay a small storage fee ($1.00 per million tokens per hour). For repeated queries against a large codebase or document set, the savings add up fast.
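Whether caching pays off is simple arithmetic. This sketch plugs in the prices above ($2/M regular input, $0.50/M cached, $1/M tokens per hour of storage); the break-even point depends on how many requests reuse the cache within its TTL:

```python
def caching_cost_usd(context_tokens: int, requests: int, hours: float,
                     regular: float = 2.00, cached: float = 0.50,
                     storage_per_hour: float = 1.00) -> tuple:
    """Return (cost_without_cache, cost_with_cache) for repeated requests
    against the same context. Prices are USD per million tokens."""
    millions = context_tokens / 1_000_000
    without = requests * millions * regular
    with_cache = requests * millions * cached + hours * millions * storage_per_hour
    return without, with_cache

# 50K-token codebase, 20 review requests within one hour:
without, with_cache = caching_cost_usd(50_000, 20, 1.0)
print(f"${without:.2f} without vs ${with_cache:.2f} with caching")  # → $2.00 without vs $0.55 with caching
```

The storage fee only dominates when the cache sits idle, so size the TTL to your actual request rate.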
Pricing Breakdown: What It Actually Costs
Let's put real numbers on this. Say you're building a coding assistant that handles 100 requests per day, averaging 2K input tokens and 1K output tokens per request.
| Model | Input Cost/day | Output Cost/day | Monthly Total |
|---|---|---|---|
| Gemini 3.1 Pro | $0.40 | $1.20 | ~$48 |
| Claude Sonnet 4.6 | $0.60 | $1.50 | ~$63 |
| Claude Opus 4.6 | $3.00 | $7.50 | ~$315 |
| GPT-5.2 | $2.00 | $3.00 | ~$150 |
Gemini 3.1 Pro is the cheapest option here by a solid margin, and it's competitive with Claude Opus on benchmarks. For cost-sensitive production workloads, that's a compelling combination.
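The table numbers fall out of straightforward arithmetic you can adapt to your own traffic. The $12/M output price for Gemini 3.1 Pro is inferred from the table above ($1.20/day for 100K output tokens), not independently confirmed:

```python
def monthly_cost_usd(requests_per_day: int, in_tokens: int, out_tokens: int,
                     in_price: float, out_price: float, days: int = 30) -> float:
    """Estimate monthly API spend. Prices are USD per million tokens."""
    input_cost = requests_per_day * in_tokens / 1e6 * in_price
    output_cost = requests_per_day * out_tokens / 1e6 * out_price
    return (input_cost + output_cost) * days

# Gemini 3.1 Pro at the assumed $2/M input, $12/M output:
print(round(monthly_cost_usd(100, 2_000, 1_000, 2.00, 12.00), 2))  # → 48.0
```

Swap in your own token averages and current list prices before budgeting; these rates can change.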
Using Gemini 3.1 Pro Through an OpenAI-Compatible Endpoint
If you're already using the OpenAI SDK in your codebase and don't want to add another dependency, you can access Gemini through an OpenAI-compatible API gateway. This is handy when you want to switch between models without rewriting your client code.
```python
from openai import OpenAI

# Use an OpenAI-compatible gateway like KissAPI
client = OpenAI(
    api_key="your-gateway-api-key",
    base_url="https://api.kissapi.ai/v1",
)

response = client.chat.completions.create(
    model="gemini-3.1-pro-preview",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Redis caching decorator in Python with TTL support."},
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Same OpenAI SDK, same code structure — just a different model name. If Gemini is slow or down, you can swap to claude-sonnet-4-6 or gpt-5 by changing one string. Gateways like KissAPI handle the translation between OpenAI format and Google's native API behind the scenes.
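Because the gateway speaks one protocol for every model, failover is just a loop over model names. In this sketch, `call` stands in for a thin wrapper around `client.chat.completions.create`, and the model list and retry policy are your choices, not anything the gateway mandates:

```python
def complete_with_fallback(call, messages,
                           models=("gemini-3.1-pro-preview",
                                   "claude-sonnet-4-6",
                                   "gpt-5")):
    """Try each model in order; return the first successful response.

    `call` is any function call(model=..., messages=...) that raises on
    failure — e.g. a small wrapper around the OpenAI SDK client.
    """
    last_error = None
    for model in models:
        try:
            return call(model=model, messages=messages)
        except Exception as exc:  # in production, catch the SDK's specific error types
            last_error = exc
    raise RuntimeError(f"all models failed: {last_error}")
```

Injecting `call` rather than hardcoding the SDK keeps the routing logic trivially testable without network access.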
5 Tips to Get the Most Out of Gemini 3.1 Pro
- Match thinking level to task difficulty. Don't burn 8K thinking tokens on "convert this JSON to YAML." Use low thinking for simple tasks, save high thinking for genuinely hard problems.
- Use context caching for repeated contexts. If you're sending the same codebase or docs with every request, cache it. The 75% input discount pays for itself after a handful of requests.
- Stream everything user-facing. Gemini 3.1 Pro's time-to-first-token is fast. Streaming makes your app feel responsive even when the full response takes 10+ seconds.
- Be specific in your prompts. Gemini 3.1 Pro is good at following detailed instructions. Instead of "review this code," try "review this code for memory leaks, focusing on the connection pool management in lines 45-80."
- Compare before committing. Run the same prompts through Gemini 3.1 Pro and Claude Sonnet 4.6. For coding tasks, they're closer than the benchmarks suggest. Pick the one that works better for your specific use case.
Try Gemini 3.1 Pro API Today
Access Gemini 3.1 Pro, Claude, GPT-5, and 100+ models through one OpenAI-compatible API. Sign up and get $1 in free credits.
Get Started Free →

When to Use Gemini 3.1 Pro vs. Other Models
No single model wins at everything. Here's a practical decision framework:
- Gemini 3.1 Pro — Best price-performance ratio. Strong at reasoning, math, and agentic tasks. Great default choice for production workloads where cost matters.
- Claude Sonnet 4.6 — Better at nuanced writing and following complex instructions. Preferred by many developers for code review and refactoring.
- Claude Opus 4.6 — When you need the absolute best quality and cost isn't the primary concern. Expert-level tasks, research, complex multi-file changes.
- GPT-5 — Strong all-rounder with the largest ecosystem of tools and integrations.
The smart move is to not lock yourself into one model. Use an API gateway, test different models on your actual workload, and route requests to whichever model gives you the best results for the price.
Wrapping Up
Gemini 3.1 Pro is a serious contender. The benchmarks are strong, the pricing is aggressive, and the thinking levels give you fine-grained control over the cost-quality tradeoff. Whether you use it through Google's native SDK or an OpenAI-compatible gateway, getting started takes about 10 minutes.
The AI model landscape moves fast. Six months ago, nobody was talking about thinking budgets or context caching as standard features. Today they're table stakes. The best approach is to stay flexible, keep your code model-agnostic, and let the models compete for your workload.