First look: Grok 3 and Box AI, coming soon to Box AI Studio

With each new model launch, the boundaries of capability – enhanced reasoning, faster processing, and more nuanced understanding – keep pushing forward. With today’s launch of Grok 3, we turn our focus to xAI to see how the latest Grok model stacks up against demanding Intelligent Content Management (ICM) workflows.

To assess Grok 3 against these workflows, we used our Box eval process and a challenge document set derived from CUAD. These are the same intricate legal contracts we’ve previously used in our analyses of other leading models. This benchmark tests performance on real-world use cases: complex, multi-faceted questions that require careful data extraction and computation and demand single-shot extraction accuracy. On this benchmark, Grok 3 performed at parity with other top models. Reaching parity marks a significant advancement and confirms Grok 3's position at the cutting edge for handling demanding, enterprise-grade content.

First Look: The Box AI evaluation of Grok 3

Underlying this performance, Grok 3 shows capabilities in advanced analytical tasks, demonstrating potential in multi-step reasoning, information retrieval, and quantitative analysis – particularly where deep document understanding is required. Furthermore, in our evaluation we found Grok 3 to be 9% more capable than its predecessor, Grok 2, with an improved ability to retrieve and use information from individual documents.

These promising initial results warrant a closer look. Let's dive into the data.

Grok 3 excels in sophisticated analytical tasks

Grok 3 shows capability in sophisticated analytical tasks that require more than simple information retrieval. One observation is its potential for multi-step reasoning and computation within complex queries. The model can deconstruct intricate questions, extract relevant data points from documents, perform the necessary calculations, and synthesize results according to instructions. This capacity for handling chained operations suggests an architecture suited to involved analytical workflows.

Another area explored is Grok 3's information retrieval and relevance filtering. It shows an aptitude for discerning and extracting specific or nuanced information that addresses the core of a query, even when embedded within larger documents. This suggests an understanding of context and the ability to filter pertinent details while disregarding less relevant information, leading to responses that aim to be complete and targeted.

Furthermore, Grok 3 was tested on quantitative analysis and criteria-based ranking. When tasks require identifying and ranking entities based on specific numerical metrics within a dataset, the model works to pinpoint top performers or specific data points according to defined parameters. This suggests potential for tasks demanding quantitative assessment and the application of evaluation criteria.

Collectively, these performance characteristics highlight Grok 3's potential for complex analysis, particularly in scenarios involving deep document understanding, computation, and information synthesis.

Evaluating Grok 3's multi-document recall capabilities

To truly understand Grok 3’s capabilities, it’s necessary to go beyond broad benchmarks. Direct comparisons against other state-of-the-art (SOTA) models on specific, challenging tasks reveal the nuances of performance. The following analysis focuses on multi-document question-answering, a critical area for real-world applications that require synthesizing information from various sources.

Contextual understanding: Grok 3 exhibits strong contextual understanding in most scenarios. Testing on policy-related question-answering tasks shows its context recall performing comparably to, and sometimes slightly better than, other leading models. This proficiency in accurately identifying relevant passages across multiple documents is fundamental for effective information synthesis from disparate sources.
Factual correctness: Regarding factual accuracy, Grok 3 demonstrates high reliability in many scenarios. Generally strong correctness scores across these evaluations indicate a dependable level of accuracy when generating information based on the provided documents.
Answer recall and helpfulness: While other SOTA models currently score higher on directly recalling specific answers and on overall perceived helpfulness in these multi-document QA comparisons, Grok 3 remains highly competitive, with scores within standard confidence intervals of those models. This indicates on-par performance in overall answer generation and utility for these tasks.
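
The comparison above does not say how its confidence intervals were computed. As a rough sketch only, assuming a 95% normal-approximation interval on a per-question success rate, two models can be called "within confidence intervals" of each other when their intervals overlap. The function names and the counts below are hypothetical and are not Box evaluation data.

```python
import math

def proportion_ci(successes, total, z=1.96):
    """Normal-approximation confidence interval for a success rate.

    z=1.96 corresponds to roughly 95% confidence (an assumption here).
    """
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts: correct answers out of 100 questions per model.
model_a = proportion_ci(78, 100)
model_b = proportion_ci(84, 100)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

Overlapping intervals are a coarse check: they suggest the observed gap could be explained by sample size, not that the two models are identical.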

What this means for you

Where can you use Grok 3 based on these strengths? We tested it against business documents like data tables, HR frameworks, and SEC filings.

From a data table with economic information about each country, Grok 3 identified countries by GDP, then extracted related figures to calculate and round their median population density and GDP per capita. Its success in handling data and performing multi-step computations here, on a task where other SOTA models stumbled during testing, shows its capability for users who need sophisticated analysis within documents.
From a career framework guide, Grok 3 provided answers listing specified job attributes, aiming for comprehensive information aligned with the specific queries asked.
Across SEC filings, Grok 3 demonstrated criteria-based identification by ranking companies on revenue data, offering users a way to pinpoint key entities or data points according to specific parameters.
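
To make the data-table task concrete, here is a minimal sketch of the chain of steps it involves: select countries by GDP, derive per-country figures, then take and round the medians. It assumes "by GDP" means the top n countries, which the test description does not confirm. The rows, field names, and function name are invented for illustration and are not the test data.

```python
from statistics import median

# Hypothetical rows standing in for the economic data table.
countries = [
    {"name": "A", "gdp": 900.0, "population": 30.0, "area": 100.0},
    {"name": "B", "gdp": 600.0, "population": 40.0, "area": 50.0},
    {"name": "C", "gdp": 300.0, "population": 10.0, "area": 40.0},
    {"name": "D", "gdp": 100.0, "population": 20.0, "area": 10.0},
]

def summarize_top_by_gdp(rows, n=3, digits=2):
    """Pick the n largest economies, then compute rounded medians."""
    # Step 1: rank by GDP, largest first, and keep the top n (assumed criterion).
    top = sorted(rows, key=lambda r: r["gdp"], reverse=True)[:n]
    # Step 2: derive the per-country figures the question asks about.
    density = [r["population"] / r["area"] for r in top]
    gdp_per_capita = [r["gdp"] / r["population"] for r in top]
    # Step 3: aggregate with the median and round as instructed.
    return {
        "countries": [r["name"] for r in top],
        "median_density": round(median(density), digits),
        "median_gdp_per_capita": round(median(gdp_per_capita), digits),
    }

print(summarize_top_by_gdp(countries))
```

Each step depends on the one before it, so an early slip such as selecting the wrong countries carries through to the final medians. That dependency is what makes this kind of question a useful test of multi-step extraction and computation.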

Grok 3 shows potential when tackling analytical problems: it can dig into documents to pull out data, run calculations, and follow detailed instructions. However, this capability doesn't always translate to polished output; Grok 3 can sometimes be less precise with language, a bit wordy, or occasionally stumble with math or complex logic. This positions Grok 3 as a tool to explore for complex research and data analysis tasks.
To try Grok 3 in Box AI Studio and Box AI APIs today, send us an email at [email protected] to request access.