Gemini 3.5 Flash raises accuracy across every domain and the Gemini app integrates enterprise content with Box


Today, Box is sharing results from our internal benchmarking of Gemini 3.5 Flash against Gemini 3 Flash on agentic workflows. The findings make a clear case for the newer model, and they arrive at a pivotal moment: the Box MCP Server is coming to the Gemini app, as announced at Google I/O, bringing Box's enterprise-grade content and agentic capabilities to consumer, education, and business users in Gemini for the first time.

Across multiple industry domains and thousands of tasks, Gemini 3.5 Flash reached an accuracy of 74% versus 62% for Gemini 3 Flash: an improvement of 12 percentage points, or roughly 19% in relative terms, with gains observed across every domain we tested.
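The distinction between the absolute and the relative gain is easy to check with a few lines of arithmetic. The sketch below uses only the two headline accuracy figures quoted above; the function names are illustrative and are not part of any Box or Google tooling.

```python
def point_gain(new: float, old: float) -> float:
    """Absolute gain, in percentage points."""
    return new - old


def relative_gain(new: float, old: float) -> float:
    """Gain relative to the old score, expressed as a percentage."""
    return (new - old) / old * 100


# Headline accuracy figures (percent) from the paragraph above.
flash_35, flash_3 = 74.0, 62.0

print(f"{point_gain(flash_35, flash_3):.0f}pp absolute")       # 12pp absolute
print(f"{relative_gain(flash_35, flash_3):.1f}% relative")     # 19.4% relative
```

Percentage points describe the raw difference between two scores, while the relative figure describes that difference as a share of the older model's score.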

As enterprise AI use cases grow more complex, the bar for a production-ready model keeps rising. Retrieving information or drafting a passable response is no longer sufficient. Models need to pull precise data from dense documents, perform multi-step numerical reasoning, and synthesize findings across multiple sources without losing accuracy along the way. Our testing shows that Gemini 3.5 Flash is materially better at exactly these tasks.

All metrics below come from the Box Complex Work Evaluation, which was deliberately designed to demand heavy reasoning across longer-horizon tasks: analysis, report drafting, due diligence, and expert verification across a wide variety of industries.

Key insights:

  • Gemini 3.5 Flash delivered stronger accuracy across every domain Box tested, especially in high-stakes areas like healthcare and life sciences
  • The gains came from more careful work: It makes more tool calls, reads source material more thoroughly, and verifies inputs before answering
  • Box content and agentic workflows will become accessible to a much broader set of users with Box MCP Server coming to the Gemini app

Higher accuracy across every domain

Gemini 3.5 Flash didn't just improve on average: It was the stronger model in every domain we tested.

That kind of across-the-board consistency is what makes the result meaningful. The largest gains came in the domains where errors carry the highest cost. For example, healthcare improved by 22 percentage points and life sciences by 20 — areas where a wrong number isn't a minor issue, it's a downstream problem someone has to correct.

Seeing Gemini 3.5 Flash pull ahead most sharply in exactly these domains is what makes the results substantive rather than cosmetic. The gains show up where the work is hardest and where reliability matters most.

The accuracy gains don't come from a smarter model writing better prose. They come from a different approach to the work. Gemini 3.5 Flash makes about 40% more tool calls per task than Gemini 3 Flash, spending real time reading source material and verifying its inputs before committing to an answer. That behavioral shift, not raw intelligence, is what contributes to the higher accuracy.

Industry subset

Specialized industry benchmarks

Gains are largest in the domains where document complexity and reasoning demands are highest (Gemini 3.5 Flash vs Gemini 3 Flash):

  • Financial services: 81% vs 73% (+8pp). Generate financial reports from structured data with accurate number transcription, reading source data methodically before drafting rather than generating before retrieval is complete
  • Public sector: 76% vs 59% (+17pp). Apply complex, math-focused policies to raw datasets, correctly extracting and computing numerical values for large-scale compliance and assessment reporting
  • Healthcare: 73% vs 51% (+22pp). Generate accurate quantitative analyses of utilization data, correctly computing multi-step derivations from dense clinical records and cross-referencing multiple data sources with tight variance
  • Life sciences: 67% vs 47% (+20pp). Extract and compute scientific ratios from diagnostic testing records by systematically reading and parsing source documents, rather than reasoning from incomplete context

The largest improvements appear precisely where errors carry the highest cost. These are the domains characterized by dense, schema-rich documents — clinical records, diagnostic data, regulatory filings, and policy documents — where Gemini 3.5 Flash's ability to hold chains of interdependent reasoning intact translates directly into accuracy gains.
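For readers who want to reproduce the ranking, the following minimal sketch recomputes the percentage-point gain for each industry from the scores in the list above and sorts them from largest to smallest. The dictionary layout and the helper name are assumptions made for illustration; only the numbers come from the reported results.

```python
# Accuracy (percent) per industry: (Gemini 3.5 Flash, Gemini 3 Flash),
# copied from the list above.
industry_scores = {
    "Financial services": (81, 73),
    "Public sector": (76, 59),
    "Healthcare": (73, 51),
    "Life sciences": (67, 47),
}


def gains_by_domain(scores):
    """Return (domain, percentage-point gain) pairs, largest gain first."""
    gains = [(name, new - old) for name, (new, old) in scores.items()]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


for name, gain in gains_by_domain(industry_scores):
    print(f"{name}: +{gain}pp")
```

Run as written, this lists healthcare first (+22pp), followed by life sciences (+20pp), public sector (+17pp), and financial services (+8pp).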

Use case subset

Where reasoning defines the frontier

Gemini 3.5 Flash's most significant results came from tasks requiring sustained, multi-step reasoning — those demanding chained logic, calculation, and synthesis across the full pipeline.

  • Report drafting from data: 80% vs 65% (+15pp)
  • Expert review/verification: 78% vs 75% (+3pp)
  • Data analysis: 77% vs 60% (+17pp)
  • Due diligence: 58% vs 51% (+7pp)

The data analysis gap is the most striking. At 17 percentage points, it reflects what happens as task complexity increases — moving from structured review toward open-ended analysis and inference. These tasks require a model to not just retrieve and summarize, but actively transform and reconcile information across sources to reach a defensible conclusion.

Why this matters for enterprises

On every dimension we examined — accuracy by domain, task-level win rates, mathematical reasoning, end-to-end reliability — Gemini 3.5 Flash did the work more carefully and produced answers you can trust. That's the combination enterprises need: not just a faster model, but one whose reasoning holds up under the kind of multi-step, document-heavy work teams are actually doing. A model that reads before it reasons, retrieves before it generates, and works from source data rather than around it changes what's practical at scale.

Workflows that previously required a human to verify every number become viable for agents — financial reporting, regulatory analysis, clinical data review, public sector reporting. Depth of reasoning and cost are often treated as a tradeoff in agentic systems, but Gemini 3.5 Flash makes a strong case that they don't have to be. The additional tool calls and tokens aren't waste. They're why the answers are right.

For Box customers, the practical upshot is straightforward: AI you can rely on for the work that actually matters. Less rework on data-heavy tasks. Agents you can trust with the numbers, not just the narrative. And enough consistency across runs — tight variance, reliably high scores — that agentic workflows can finally move from pilot to production.

A broader moment for agentic AI

The Box MCP Server will be coming to the Gemini app as part of new MCP connector support Google announced at Google I/O. That means the same Box content and agentic capabilities that power enterprise workflows will become accessible to consumer, education, and small business Gemini users, giving them a direct line from Gemini to their files, folders, and Box-powered workflows.

This collaboration reflects where agentic AI is headed. The most valuable models aren't the ones that generate the most fluent prose — they're the ones that can reach into real systems, retrieve real data, and act on it reliably. That's what the Box MCP Server is built to enable. And it's exactly the behavior we see in Gemini 3.5 Flash: a model designed to navigate real-world complexity with even more speed and efficiency. Together, Box and Gemini turn a connector into something you can actually trust to do the work. 

For Box, this is a meaningful step in making our content and capabilities available wherever people want to work — whether that's in Box AI Studio, through the Box API, or now in the Gemini app. For Google, it brings unstructured data and actions into a surface that millions of people use every day. For users, it means agentic AI that finally connects to the systems where the actual work lives.

More coming soon

Gemini 3.5 Flash will be available soon in Box AI Studio and through the Box API. The Box MCP Server will soon be available in the Gemini app, with more details to come.