How Gemini 3 Pro in Box AI unlocks true enterprise reasoning

Today, Google launched Gemini 3 Pro. This is an exciting advancement in large language model innovation and represents a new class of capability for AI. For the past year, we’ve been evaluating the world's top AI models, but the AI frontier is moving so fast that our understanding of performance has needed to evolve. This has pushed us to expand how we test, moving beyond simple accuracy to measure complex, real-world reasoning. Gemini 3 Pro excels at complex, real-world work, and the numbers back it up: on our new reasoning benchmark, it scored 19 percentage points higher than Gemini 2.5 Pro.

Evolving our evals for a new class of AI

Each new model release pushes LLM capabilities further. To keep pace, we're adding more specialized benchmarks focused on complex, multi-step reasoning, alongside the extraction evals we’ve run in the past.

This focus is essential because multi-step reasoning is what makes complex task automation possible. Multi-step reasoning is about the "workflow." It’s the ability to find 15 different pieces of information at once (like dates, vendor codes, and line items), understand the quantitative and logical relationships between them, and then apply them correctly to achieve an end goal. This is what it takes to automate an entire process, not just one part of it, which is why we have matured our evals to measure these capabilities in depth.

Each answer is checked against a variety of criteria per question, using weighted scores across line items to evaluate nuance, not just accuracy. This rubric adds a critical layer for measuring an AI's ability to handle the true complexity of enterprise workflows.
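To make the idea of a weighted rubric more concrete, here is a minimal Python sketch of how line-item scores could roll up into a per-question score and then a benchmark score. This post doesn't publish the rubric itself, so the `Criterion` structure, the criterion names, the weights, the 0.0 to 1.0 partial-credit scale, and the choice to average all questions equally are assumptions for illustration only. This is not Box's actual grading code.

```python
from dataclasses import dataclass

# Hypothetical rubric structure for illustration; not Box's actual eval harness.


@dataclass
class Criterion:
    """One rubric line item for a question."""
    name: str
    weight: float  # relative importance of this line item (assumed scale)
    score: float   # grader's score for this item, 0.0 (miss) to 1.0 (fully correct)


def question_score(criteria: list[Criterion]) -> float:
    """Weighted average of line-item scores for a single question."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criteria weights must sum to a positive number")
    return sum(c.weight * c.score for c in criteria) / total_weight


def benchmark_score(questions: list[list[Criterion]]) -> float:
    """Unweighted mean of per-question scores across the whole benchmark."""
    if not questions:
        raise ValueError("no questions to score")
    return sum(question_score(q) for q in questions) / len(questions)


# Example: one multi-step invoice question graded on three hypothetical line items.
invoice_question = [
    Criterion("extracted vendor code", weight=1.0, score=1.0),
    Criterion("matched line items to totals", weight=2.0, score=0.5),
    Criterion("applied payment-term logic", weight=3.0, score=1.0),
]
print(f"{question_score(invoice_question):.3f}")  # prints 0.833
```

Weighting lets a partially correct answer earn partial credit, so a model that nails the simple extraction but fumbles the harder reasoning step scores lower than one that gets the reasoning right, which a single pass/fail accuracy check would miss.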

Raising the baseline for complex reasoning

It’s important to note that this new benchmark is designed to be difficult. Gemini 2.5 Pro, which is itself a highly capable model for complex reasoning, set a formidable baseline. It proved itself a top-tier performer for handling the nuanced demands of enterprise workflows. What our testing revealed, however, is that Gemini 3 Pro represents a new class of capability, taking that high baseline and delivering a significant leap in performance.

On this benchmark, Gemini 2.5 Pro scored 64%, while Gemini 3 Pro scored 83%.

The gains were especially pronounced in specialized vertical domains:

In Healthcare & Life Sciences, Gemini 3 Pro achieved 94% accuracy, compared to just 45% for Gemini 2.5 Pro.
In Media & Entertainment, Gemini 3 Pro reached 92% accuracy, a massive increase from 47% for the previous model.
In Financial Services, Gemini 3 Pro achieved 60% accuracy, up from 51% for Gemini 2.5 Pro.

In all of these verticals, Gemini 3 Pro showed strong command of large, complex, multi-step reasoning and a significant improvement in handling quantitative data requested across multiple fields at the same time.

What this means for your business

These gains make it practical to automate complex, end-to-end workflows.

For Legal Teams: You can now apply Gemini 3 Pro to an entire portfolio of 500 supplier contracts at the same time and trust the output.
For Finance Departments: You can run a single, complex query to find information from 1,000 invoices at once, with a higher degree of accuracy than ever before.
For R&D and Marketing: You can synthesize findings from dozens of disparate research reports and marketing plans in a single request, pulling nuanced data from all of them to create a single, comprehensive summary.

Simply put, Gemini 3 Pro in Box AI unlocks true, large-scale automation without sacrificing performance.

Get started today

Gemini 3 Pro is a frontier-level model. It pushes the known boundaries of AI and delivers a new level of performance that feels faster and more responsive.

Gemini 3 Pro is available today in Box AI Studio and via the Box AI APIs, giving you access to the cutting edge of enterprise AI.