Gemini 3.1 Pro launched February 19, 2026, and posted 77.1% on ARC-AGI-2, 80.6% on SWE-bench Verified, and 54.2% on SWE-bench Pro. The ARC-AGI-2 number deserves attention: that benchmark tests genuinely novel logic patterns that cannot be solved by training-data pattern matching, and 77.1% more than doubles Gemini 3.0 Pro’s 31.1% in three months. The second story is price. Gemini 3.1 Pro costs $2/$12 per million tokens (up to 200K input) with a 2M-token context window at standard pricing, not as a paid upgrade. The recurring r/MachineLearning and r/Bard threads land on the same takeaway: it is the cheapest closed-source frontier model with the largest standard-tier long context and the most production-believable RAG scores. This is the review.
Quick answer: for RAG over large corpora, multilingual workloads, and long-context single-shot tasks, Gemini 3.1 Pro is the price-per-task leader among closed-source frontier models in May 2026. For hard cross-package refactors, agent loops, and strict-JSON pipelines at volume, Opus 4.7 and GPT-5.3-Codex are tighter. Pair Gemini 3.1 Pro with one of them for the best coverage of cost and quality.
Gemini 3.0 Pro vs 3.1 Pro: what actually changed
Google called this a point-version update but broke its own naming convention to do it. Previous mid-cycle updates used the “.5” format; the shift to “.1” signals a capability jump that warranted its own tier. Gemini 3.0 Pro was discontinued on Vertex AI on March 26, 2026, with all projects migrated to 3.1 Pro by default.
| Feature | Gemini 3.0 Pro | Gemini 3.1 Pro |
|---|---|---|
| ARC-AGI-2 | 31.1% | 77.1% |
| SWE-bench Verified | ~70% | 80.6% |
| Max output tokens | ~21,000 | 65,536 |
| Thinking levels | LOW / HIGH | LOW / MEDIUM / HIGH |
| Context window | 2M | 2M |
| API pricing (≤200K input) | $2/$12 | $2/$12 (same) |
| Agentic endpoint | Standard only | gemini-3.1-pro-preview-customtools |
| Availability | GA | Preview (GA coming) |
The API price is unchanged while the ARC-AGI-2 score more than doubled. The output token ceiling jumped from roughly 21K to 65,536, which eliminates truncation on whole-file refactors and long-form generation. JetBrains reported more than 50% improvement on benchmark task completion over 3.0 Pro. Databricks tested it on their OfficeQA benchmark and recorded best-in-class results on multi-document reasoning tasks.
TurboQuant: the infrastructure story
Alongside the model, Google DeepMind shipped TurboQuant, a KV cache quantization algorithm that compresses 16-bit floating-point cache values to 3 bits with statistically negligible accuracy loss. The results on NVIDIA H100 GPUs: 8x faster inference and 6x less memory usage. When the results were published, Micron and SK Hynix stock prices dropped; investors had been pricing those companies on the assumption that AI scale-up means proportional memory demand. TurboQuant cracked that assumption.
Two techniques drive it: per-head calibration (each attention head gets individually tuned compression parameters) and outlier-aware compression (high-magnitude values are handled separately to prevent accuracy spikes). Previous quantization approaches failed one or both of these. TurboQuant solves both, which is why the accuracy loss is effectively zero in production.
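To make the two ideas concrete, here is a toy sketch in plain Python. It is not TurboQuant itself: the min/max range fit, the 10% outlier fraction, and the function names are all assumptions chosen for illustration. It only shows why fitting a 3-bit grid per head, and keeping high-magnitude values out of that grid, reduces reconstruction error.

```python
from typing import Dict, List, Tuple

BITS = 3
LEVELS = 2 ** BITS  # 8 representable codes per cached value

Quantized = Tuple[List[int], float, float, Dict[int, float]]


def quantize_head(values: List[float], outlier_fraction: float = 0.1) -> Quantized:
    """Quantize one attention head's cache values to 3-bit codes (toy version).

    Returns (codes, lo, step, outliers); outliers maps index -> exact value.
    """
    n_outliers = int(len(values) * outlier_fraction)
    # Outlier-aware step: the largest-magnitude entries bypass the 3-bit grid.
    by_magnitude = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outliers = {i: values[i] for i in by_magnitude[:n_outliers]}
    inliers = [v for i, v in enumerate(values) if i not in outliers]
    # Per-head calibration: the grid is fitted to this head's inliers only.
    lo, hi = (min(inliers), max(inliers)) if inliers else (0.0, 0.0)
    step = (hi - lo) / (LEVELS - 1) or 1.0  # avoid a zero step for constant heads
    codes = [0 if i in outliers else round((v - lo) / step) for i, v in enumerate(values)]
    return codes, lo, step, outliers


def dequantize_head(codes: List[int], lo: float, step: float,
                    outliers: Dict[int, float]) -> List[float]:
    """Rebuild approximate values; outliers come back exactly."""
    return [outliers.get(i, lo + c * step) for i, c in enumerate(codes)]


if __name__ == "__main__":
    head = [0.1, -0.2, 0.05, 0.3, -0.15, 0.25, -0.05, 0.0, 0.2, 9.0]
    for fraction in (0.0, 0.1):
        restored = dequantize_head(*quantize_head(head, fraction))
        worst = max(abs(a - b) for a, b in zip(head, restored))
        print(f"outlier_fraction={fraction}: worst error {worst:.4f}")
```

In the demo head, a single 9.0 sits among values near zero. With the outlier held aside, the eight levels span -0.2 to 0.3 instead of -0.2 to 9.0, so the small values keep their resolution.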
Caveat for self-hosters: TurboQuant’s 8x speedup and 6x memory reduction are validated on H100 GPUs. A100 and V100 hardware sees smaller gains. If you are evaluating long-context workloads on older infrastructure, the efficiency story does not fully apply.
MEDIUM thinking level: the cost-optimization feature
Gemini 3.1 Pro introduces a third thinking_level parameter value: MEDIUM. Previous Gemini versions had LOW (fast, cheap) and HIGH (deep, expensive). MEDIUM delivers reasoning quality equivalent to Gemini 3.0 Pro’s HIGH setting at LOW pricing. JetBrains confirmed that MEDIUM matches previous-generation HIGH on the majority of software engineering tasks at lower latency and token cost.
The practical guidance: default all production deployments to MEDIUM and escalate to HIGH only for genuinely complex multi-step reasoning, scientific derivations, or deep agent chains where the quality difference is measurable. On standard engineering tasks, MEDIUM is the right setting.
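One way to encode that default-and-escalate policy is a small routing helper. This is a sketch under assumptions: the task labels are hypothetical, and the returned value still has to be passed to whatever client you use as the thinking_level setting.

```python
from enum import Enum


class ThinkingLevel(str, Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"


# Hypothetical task labels; substitute the taxonomy your own router uses.
ESCALATE_TO_HIGH = {
    "multi_step_reasoning",
    "scientific_derivation",
    "deep_agent_chain",
}


def pick_thinking_level(task_kind: str) -> ThinkingLevel:
    """Default to MEDIUM; escalate to HIGH only for the listed task kinds."""
    if task_kind in ESCALATE_TO_HIGH:
        return ThinkingLevel.HIGH
    return ThinkingLevel.MEDIUM
```

Keeping the escalation list explicit makes it easy to audit which workloads are paying for HIGH, and to remove an entry when measurement shows no quality difference.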
Thinking token cost warning: thinking tokens are billed at the standard output rate ($12/M). On HIGH mode, complex agentic tasks can consume 3-5x more tokens in thinking than in visible output, creating unexpected invoice spikes. Always set a token budget cap in production and benchmark thinking token consumption before deploying HIGH mode at scale.
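The invoice effect is simple arithmetic, sketched below. The $12/M rate comes from the pricing above; the function names and the idea of a per-request dollar cap are illustrative assumptions, not part of any SDK.

```python
OUTPUT_RATE_PER_M = 12.00  # USD per million output tokens (<=200K input tier)


def output_cost_usd(visible_tokens: int, thinking_tokens: int) -> float:
    """Thinking tokens are billed at the same rate as visible output tokens."""
    return (visible_tokens + thinking_tokens) * OUTPUT_RATE_PER_M / 1_000_000


def exceeds_budget(visible_tokens: int, thinking_tokens: int, cap_usd: float) -> bool:
    """True when a single request's output-side cost is above the cap."""
    return output_cost_usd(visible_tokens, thinking_tokens) > cap_usd
```

For a response with 4,000 visible tokens, thinking at 5x the visible length adds 20,000 billed tokens, so the output-side cost is six times what the visible text alone would suggest.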
Public benchmarks (April-May 2026)
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.7 | GPT-5.5 | GPT-5.3-Codex |
|---|---|---|---|---|
| ARC-AGI-2 | 77.1% | ~68% | ~72% | ~71% |
| SWE-bench Verified | 80.6% | 87.6% | n/p | n/p |
| SWE-bench Pro | 54.2% | 64.3% | 58.6% | n/p |
| GPQA Diamond | 94.3% | 94.2% | 92.8% | n/p |
| Terminal-Bench 2.0 | 68.5% | 69.4% | 82.7% | n/p |
| BrowseComp | 85.9% | 79.3% | 89.3% | n/p |
| MMMLU multilingual | 92.6% | 91.5% | n/p | n/p |
| MCP-Atlas tool use | 78.2% | 79.1% | n/p | 70.6% |
Three outright leads: ARC-AGI-2 abstract reasoning (77.1%), GPQA Diamond graduate science (94.3%), and MMMLU multilingual (92.6%). GPT-5.5 takes Terminal-Bench 2.0 by a significant 14-point margin, which matters for agentic terminal automation. Claude Opus 4.7 leads on SWE-bench Verified and SWE-bench Pro, which matters for coding agent quality. The Gemini leads are in reasoning generalization, science, and non-English tasks.
Where it wins, in our 14-task editorial scoring
| Domain | Gemini 3.1 Pro | vs Opus 4.7 | vs GPT-5.3-Codex | vs GPT-5.5 |
|---|---|---|---|---|
| Refactor | 7.8 | -1.2 | -0.6 | -0.9 |
| Test-gen | 7.6 | -0.8 | -1.1 | -1.0 |
| Debug | 7.9 | -0.9 | -0.5 | -0.8 |
| Agent & tool use | 7.4 | -1.7 | -1.2 | -1.9 |
| RAG (long context) | 9.2 | +0.6 | +0.7 | +0.7 |
| Multilingual coding | 9.0 | +0.8 | +1.2 | +0.9 |
| Cost per successful task | $0.31 | -$0.10 | +$0.10 | -$0.15 |
The 9.2 on RAG retrieval over a 1M+ context corpus is the editorial peak. Gemini 3.1 Pro does not just hold the context; it retrieves correctly at positions across the window. The 2M context is the most production-believable on the market right now. The cost line is the other unlock: $0.31 per successful task on our suite is about 76% of Opus 4.7’s $0.41 and 67% of GPT-5.5’s $0.46, though GPT-5.3-Codex is cheaper still at $0.21. For workloads where cost per successful task is the key metric, Gemini 3.1 Pro beats Opus 4.7 and GPT-5.5 when the prompt fits under the 200K threshold; above it, the per-task cost doubles to $0.62.
Where it loses
Hard agent loops and agentic terminal work. GPT-5.5 hits 82.7% on Terminal-Bench 2.0 versus Gemini’s 68.5% and Opus 4.7’s 69.4%. That 14-point gap is a concrete story: for multi-step terminal automation, neither Gemini nor Opus competes with GPT-5.5 right now. For cross-package TypeScript refactors with re-exported types, Gemini 3.1 Pro misses 4 of 14 call sites; Opus 4.7 misses 3. On 6-step agent plans with a forced give-up at step 4, Gemini 3.1 Pro overruns the budget on 2 of 5 runs. Opus 4.7 hits the stop cleanly on 5 of 5. The community r/Bard threads on agent setups converge on the same shape: great model for long reading and one-shot reasoning, average for multi-step agent loops.
There is also a hallucination rate gap. Independent testing puts Gemini 3.1 Pro at approximately 6% hallucination rate, versus Claude Opus 4.7 at around 3% and GPT-5 at around 4.8%. For scientific, medical, or legal pipelines where factual precision is non-negotiable, this gap requires a verification layer in production.
The dedicated agentic API endpoint
Google shipped a specialized endpoint alongside the main model: gemini-3.1-pro-preview-customtools. This endpoint is optimized for mixed-tool agentic workflows where the agent chooses between bash commands, custom functions, and web search. Previous Gemini versions would sometimes pick the wrong tool in these environments, selecting a web search when a local file read would have sufficed. The custom tools endpoint addresses this with tuned tool-priority calibration. Pricing is identical to the standard endpoint. For teams building agentic coding workflows and evaluating Gemini as an alternative backend, use this endpoint, not the standard chat completion API.
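A minimal sketch of that choice as code, assuming you already know which kinds of tools a session exposes. The tool-kind labels and the "more than one kind" rule are assumptions for illustration, and the standard model ID is passed in by the caller rather than guessed here.

```python
from typing import Set

CUSTOM_TOOLS_MODEL = "gemini-3.1-pro-preview-customtools"  # ID quoted in the text

# Hypothetical labels for the three tool kinds the endpoint is tuned for.
MIXED_TOOL_KINDS = {"bash", "custom_function", "web_search"}


def pick_model(tool_kinds: Set[str], standard_model: str) -> str:
    """Route to the custom-tools endpoint when the agent mixes tool kinds."""
    mixed = len(tool_kinds & MIXED_TOOL_KINDS) > 1
    return CUSTOM_TOOLS_MODEL if mixed else standard_model
```

Since pricing is the same on both endpoints, the only thing this routing trades off is tool-selection behavior.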
Pricing, context, and caching
| Model | Input $/M | Output $/M | Cache $/M | Context | TCC per-task |
|---|---|---|---|---|---|
| Gemini 3.1 Pro (≤200K input) | $2.00 | $12.00 | $0.20-0.40 | 2M | $0.31 |
| Gemini 3.1 Pro (>200K input) | $4.00 | $18.00 | $0.40 | 2M | $0.62 |
| Claude Opus 4.7 | $5.00 | $25.00 | n/p | 1M | $0.41 |
| GPT-5.5 | $5.00 | $30.00 | n/p | 1M | $0.46 |
| GPT-5.3-Codex | $1.75 | $14.00 | n/p | 400K | $0.21 |
| DeepSeek V4-Pro | $1.74 | $3.48 | n/p | 1M | ~$0.18 |
The 200K input threshold is critical to plan around. Above 200K input tokens, Vertex AI reprices the entire request at $4/$18, doubling the input cost and raising the output cost by 50%. Most standard engineering tasks stay below this threshold; whole-repository analysis tasks can breach it. Design retrieval batches to stay under 200K input per call or budget the higher rate explicitly.
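The tier logic can be sketched as a single function. The rates and the 200K threshold come from the table above; the function name is illustrative, and the sketch ignores caching and any other discounts.

```python
TIER_THRESHOLD = 200_000  # input tokens; requests above this are repriced entirely

STANDARD_RATES = (2.00, 12.00)      # (input, output) USD per million tokens
LONG_CONTEXT_RATES = (4.00, 18.00)  # applied to the whole request above the threshold


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request; the tier is decided by input size alone."""
    if input_tokens > TIER_THRESHOLD:
        in_rate, out_rate = LONG_CONTEXT_RATES
    else:
        in_rate, out_rate = STANDARD_RATES
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

A 300K-input call with 2K output costs about $1.24 under these rates, while two 150K-input calls with 2K output each total about $0.65, which is the case for batching retrieval under the threshold when the task allows it.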
Context caching reduces costs by up to 90% on repeated-context workloads: applications that repeatedly send the same system prompt, code context, or document context alongside user queries. Cached input tokens run $0.20-0.40/M, against $2.00/M for uncached input. For applications that cache aggressively, the effective input rate approaches that cached rate, roughly a tenth of the standard price. This is the real cost engineering lever.
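A rough sketch of the caching arithmetic, using the low end of the cached range. It assumes the only charges are per-token input rates; any cache storage or write fees are outside what this review covers and are not modeled. Function names are illustrative.

```python
STANDARD_INPUT_RATE = 2.00  # USD per million uncached input tokens
CACHED_INPUT_RATE = 0.20    # USD per million cached tokens (low end of $0.20-0.40)


def input_cost_usd(cached_tokens: int, fresh_tokens: int,
                   cached_rate: float = CACHED_INPUT_RATE) -> float:
    """Input-side cost when part of the prompt is served from cache."""
    return (cached_tokens * cached_rate + fresh_tokens * STANDARD_INPUT_RATE) / 1_000_000


def savings_fraction(cached_tokens: int, fresh_tokens: int,
                     cached_rate: float = CACHED_INPUT_RATE) -> float:
    """Share of input cost saved relative to sending everything uncached."""
    uncached = (cached_tokens + fresh_tokens) * STANDARD_INPUT_RATE / 1_000_000
    if uncached == 0:
        return 0.0
    return 1 - input_cost_usd(cached_tokens, fresh_tokens, cached_rate) / uncached
```

With 150K cached tokens and a 1K fresh query, the savings land just under 90%; the larger the cached share of the prompt, the closer the result gets to the headline figure.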
What about GPT-5.5?
OpenAI shipped GPT-5.5 on April 23, 2026. First fully retrained base since GPT-5. 1M-token context. $5/$30 per M tokens. Two numbers reframe where Gemini 3.1 Pro sits. Terminal-Bench 2.0: GPT-5.5 hits 82.7% versus Gemini’s 68.5% and Opus 4.7’s 69.4%. A 14-point lead on agentic terminal automation. Artificial Analysis Intelligence Index: GPT-5.5 (xhigh) scores 60 versus 57 for both Opus 4.7 and Gemini 3.1 Pro.
Gemini 3.1 Pro keeps two structural wins. First, output is 60% cheaper at $12/M versus GPT-5.5’s $30/M; if generation is long, the math still favors Gemini. Second, long context stays at standard pricing up to 200K input, and GPT-5.5 has its own tiered pricing above 272K. For agentic tool use and terminal automation, GPT-5.5 takes the slot Opus 4.7 used to hold. For long-context retrieval and multilingual work at scale, Gemini 3.1 Pro holds the position.
Consumer access
- Google AI Pro: $19.99/month via gemini.google.com. Limited daily Gemini 3.1 Pro messages.
- Google AI Ultra: Higher subscription. Expanded limits and priority access.
- Free tier: Very limited, primarily for testing.
- Google One AI Premium: $19.99/month includes Gemini 3.1 Pro across Gmail, Docs, Sheets, and Drive. Gemini 3.0 Pro was discontinued March 26, 2026; 3.1 Pro is now the default in Workspace for premium subscribers.
- API / Vertex AI: $300 free credit (90-day expiry) for new Google Cloud accounts. Production deployment via Vertex AI at standard token pricing.
What the threads are saying
Three patterns dominate community discussion since February 2026:
- Long context that holds up. The r/MachineLearning “Needle in a 1M haystack” threads post real retrieval results that match the 9.2 TCC score. The community sees what we measure: the 2M context is production-believable, not just benchmark-visible.
- Multilingual is the practical superpower. Non-English coding tasks (Russian docstrings, Japanese variable names, German specs) post higher accuracy on Gemini 3.1 Pro than on any other frontier model. The MMMLU lead shows up in inline comment translation and non-Latin string handling.
- Tool use lags in long chains. r/Bard threads on agent setups converge on the same conclusion: Gemini is the model you route heavy reading to, and Opus 4.7 handles the last-mile agent steps. “Great reader, average actor” is the fair summary.
Pros and cons
| Pros | Cons |
|---|---|
| ARC-AGI-2 leadership (77.1%): highest abstract reasoning benchmark score of any frontier model | 29-second time-to-first-token: 24x slower than peer median, a real constraint for interactive chat |
| 2M context window at standard pricing: no paid upgrade for long context | Hallucination rate (~6%) higher than Opus 4.7 (~3%): requires a verification layer for high-stakes output |
| Same API price as Gemini 3.0 Pro: substantially better performance for identical cost | Preview status: behavior may change before GA; production deployments carry stability risk |
| MEDIUM thinking level: previous-gen HIGH quality at LOW cost | 200K input threshold doubles cost: whole-repo analysis regularly breaches it |
| Context caching up to 90% cost reduction on repeated-context workflows | HIGH thinking mode: thinking tokens billed at output rate, causing cost spikes without monitoring |
| Best multilingual scores (MMMLU 92.6%): non-English work is a clear win | Agent loops: lags GPT-5.5 by 14 points on Terminal-Bench 2.0 |
| Deep Google Workspace integration for existing GCP teams | TurboQuant gains are H100-specific; older GPU infrastructure sees smaller efficiency improvements |
Best Gemini 3.1 Pro stacks
| Goal | Recommended use | Why |
|---|---|---|
| RAG over large corpora | Gemini 3.1 Pro (MEDIUM thinking) + Cohere Rerank v4.0 | 2M context; cheapest closed-source per token; 9.2 TCC RAG score |
| Multilingual coding | Gemini 3.1 Pro | MMMLU 92.6%; non-English work is a clear lead |
| Scientific research | Gemini 3.1 Pro (HIGH thinking) | GPQA Diamond 94.3%; ARC-AGI-2 77.1% for novel reasoning |
| Cheap one-shot code gen | Gemini 3.1 Pro for first draft, Opus 4.7 review pass | Cost-blended frontier quality |
| Agentic terminal automation | GPT-5.5 | Terminal-Bench 2.0: 82.7% vs Gemini’s 68.5% |
| Hard multi-file refactors | Claude Opus 4.7 | SWE-bench Verified 87.6% vs Gemini’s 80.6% |
| Strict JSON pipelines at volume | GPT-5.3-Codex | Near-zero unescaped-quote failures; $0.21/task at 400K context |
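If you route requests programmatically, the table above maps naturally onto a lookup. This is only a sketch: the goal keys are hypothetical labels, and the values are the recommendations copied from the table.

```python
# Goal keys are hypothetical labels; values mirror the recommendations table.
STACKS = {
    "rag_large_corpora": "Gemini 3.1 Pro (MEDIUM thinking) + Cohere Rerank v4.0",
    "multilingual_coding": "Gemini 3.1 Pro",
    "scientific_research": "Gemini 3.1 Pro (HIGH thinking)",
    "cheap_one_shot_codegen": "Gemini 3.1 Pro for first draft, Opus 4.7 review pass",
    "agentic_terminal_automation": "GPT-5.5",
    "hard_multi_file_refactors": "Claude Opus 4.7",
    "strict_json_at_volume": "GPT-5.3-Codex",
}


def recommend(goal: str) -> str:
    """Return the recommended stack, failing loudly on unknown goals."""
    try:
        return STACKS[goal]
    except KeyError:
        raise ValueError(f"no recommendation for goal: {goal!r}") from None
```

Raising on unknown goals, rather than silently defaulting to one model, keeps new workload types from being routed without a deliberate decision.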
Who should use it
Use Gemini 3.1 Pro if: your workload is RAG over large corpora, multilingual code or content, research-grade reasoning, or any long-context task that fits under the 200K input threshold. Teams already on Google Cloud get the easiest adoption path. Budget-sensitive teams running high-volume inference get the best cost math in closed-source frontier models.
Look elsewhere if: latency matters for interactive users (29s TTFT disqualifies it for chat), agentic terminal work is your primary use case (GPT-5.5 leads by 14 points), your pipelines require absolute factual precision (Opus 4.7’s lower hallucination rate wins), or you need stability for a production system before GA.
How it compares: full editorial scorecard
| TCC editorial score | Gemini 3.1 Pro | Claude Opus 4.7 | GPT-5.5 | GPT-5.3-Codex |
|---|---|---|---|---|
| Refactor | 7.8 | 9.0 | 8.7 | 8.4 |
| Test-gen | 7.6 | 8.4 | 8.6 | 8.7 |
| Debug | 7.9 | 8.8 | 8.7 | 8.4 |
| Agent & tool use | 7.4 | 9.1 | 9.3 | 8.6 |
| Strict JSON | 7.9 | 8.2 | 8.9 | 9.0 |
| RAG long context | 9.2 | 8.6 | 8.5 | 8.5 |
| Multilingual | 9.0 | 8.2 | 8.1 | 7.8 |
| Cost per successful task | $0.31 | $0.41 | $0.46 | $0.21 |
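The per-domain gaps and cost ratios follow directly from this scorecard. A short sketch for recomputing them, with the numbers copied from the table and helper names chosen for illustration:

```python
BASELINE = "Gemini 3.1 Pro"

# Scores copied from the scorecard above.
SCORES = {
    "Refactor": {"Gemini 3.1 Pro": 7.8, "Claude Opus 4.7": 9.0, "GPT-5.5": 8.7, "GPT-5.3-Codex": 8.4},
    "Test-gen": {"Gemini 3.1 Pro": 7.6, "Claude Opus 4.7": 8.4, "GPT-5.5": 8.6, "GPT-5.3-Codex": 8.7},
    "Debug": {"Gemini 3.1 Pro": 7.9, "Claude Opus 4.7": 8.8, "GPT-5.5": 8.7, "GPT-5.3-Codex": 8.4},
    "Agent & tool use": {"Gemini 3.1 Pro": 7.4, "Claude Opus 4.7": 9.1, "GPT-5.5": 9.3, "GPT-5.3-Codex": 8.6},
    "Strict JSON": {"Gemini 3.1 Pro": 7.9, "Claude Opus 4.7": 8.2, "GPT-5.5": 8.9, "GPT-5.3-Codex": 9.0},
    "RAG long context": {"Gemini 3.1 Pro": 9.2, "Claude Opus 4.7": 8.6, "GPT-5.5": 8.5, "GPT-5.3-Codex": 8.5},
    "Multilingual": {"Gemini 3.1 Pro": 9.0, "Claude Opus 4.7": 8.2, "GPT-5.5": 8.1, "GPT-5.3-Codex": 7.8},
}

# USD per successful task, from the last row of the scorecard.
COST_PER_TASK = {"Gemini 3.1 Pro": 0.31, "Claude Opus 4.7": 0.41, "GPT-5.5": 0.46, "GPT-5.3-Codex": 0.21}


def deltas(domain: str) -> dict:
    """Baseline score minus each rival's score, rounded to one decimal."""
    row = SCORES[domain]
    return {m: round(row[BASELINE] - s, 1) for m, s in row.items() if m != BASELINE}


def cost_ratio(model: str) -> float:
    """Baseline cost per task as a fraction of the given model's cost."""
    return round(COST_PER_TASK[BASELINE] / COST_PER_TASK[model], 2)
```

Positive deltas mean Gemini 3.1 Pro scores higher in that domain; a cost ratio below 1 means it is cheaper per successful task than the model being compared.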
FAQ
When was Gemini 3.1 Pro released?
February 19, 2026, in public preview on Google AI Studio and Vertex AI. Gemini 3.0 Pro was discontinued on Vertex AI March 26, 2026. General availability for 3.1 Pro is pending as of May 2026.
What is TurboQuant?
TurboQuant is a KV cache quantization algorithm from Google DeepMind that compresses cache values from 16-bit to 3 bits using per-head calibration and outlier-aware compression. The result is 8x faster inference and 6x less memory usage on H100 GPUs at effectively zero accuracy loss. It directly lowers inference costs and improves latency for long-context workloads at scale.
What is the MEDIUM thinking level?
MEDIUM is a new thinking_level parameter value introduced in Gemini 3.1 Pro, sitting between LOW and HIGH. It delivers reasoning quality equivalent to Gemini 3.0 Pro’s HIGH mode at significantly lower token cost and latency. JetBrains confirmed MEDIUM matches previous-generation HIGH on most software engineering tasks. Default to MEDIUM for production and use HIGH only for genuinely complex reasoning tasks.
Is Gemini 3.1 Pro better than Claude Opus 4.7?
For ARC-AGI-2 abstract reasoning (77.1% vs ~68%), long-context RAG, and multilingual tasks, Gemini 3.1 Pro wins. For SWE-bench Verified (80.6% vs 87.6%), agentic loops, and pipelines requiring low hallucination rates, Claude Opus 4.7 wins. For most workloads, the cost difference ($2/$12 vs $5/$25) makes Gemini the default and Opus the upgrade for hard tasks.
How much does the context window cost above 200K tokens?
Above 200K input tokens, Vertex AI reprices the entire request at $4/$18 per million tokens (up from $2/$12). This doubles the input cost. Plan retrieval batches to stay under 200K per call. Use context caching ($0.20-0.40/M cached tokens) to reduce repeated-context costs by up to 90%.
Can I use Gemini 3.1 Pro for agentic coding workflows?
Yes, with the dedicated gemini-3.1-pro-preview-customtools endpoint, which has tuned tool-priority calibration for mixed bash/function environments. For raw agent loop quality, Claude Opus 4.7 scores higher on MCP-Atlas tool use (79.1% vs 78.2%) and GPT-5.5 leads on Terminal-Bench 2.0. Gemini 3.1 Pro is the right choice when the agentic task involves heavy long-context reading rather than multi-step execution chains.
Is Gemini 3.1 Pro free?
Gemini 3.1 Pro is free for prototyping via Google AI Studio with no credit card required. New Google Cloud accounts get $300 in free credits (90-day expiry) for Vertex AI testing. Paid consumer access starts at $19.99/month (Google AI Pro or Google One AI Premium). API usage for production workloads is billed at standard token pricing ($2/$12 per million tokens).
Verdict
Gemini 3.1 Pro is the cheapest closed-source frontier model with a production-believable 2M context window and the highest ARC-AGI-2 abstract reasoning score of any model in May 2026. It sits in the top five on SWE-bench leaderboards but leads outright on cost-per-task for RAG, multilingual, and research-grade workloads. The hallucination rate is higher than Opus 4.7, the latency disqualifies it for interactive chat, and agentic terminal loops belong to GPT-5.5. Pair Gemini 3.1 Pro for the cheap-and-long reading tasks with Opus 4.7 for the hard agent work and you cover both sides of the cost/quality envelope.
For methodology behind the scores above, see the 14-task scorecard. For long-context behavior in production RAG, see the long-context divergence trend post. For the reranker stack that closes the gap between vendor needle numbers and production retrieval, see the RAG defaults cheatsheet.