Claude Sonnet 5 raises the bar for enterprise intelligence across key industries

Anthropic's Claude Sonnet 5 delivers enterprise-grade quality comparable to Sonnet 4.6 on Box's Complex Work Eval, our proprietary benchmark for enterprise document intelligence, and pulls ahead in several of the operational domains enterprises rely on most: Energy, Retail, Professional Services, and Technology. It reaches those results through an efficient, streamlined agent loop, making it a strong fit for the high-volume workflows that define real enterprise deployment.

How we evaluate

Box's Complex Work Eval is built to measure models the way enterprises actually experience them — not on isolated question-and-answer pairs, but on complete, multi-step work performed against real business documents. Every model runs inside the same end-to-end agent framework: it plans an approach, retrieves the relevant source files, reads and reconciles multi-format content (spreadsheets, PDFs, presentations, and images), performs the required analysis, and produces a finished deliverable — a report, a due-diligence assessment, a recommendation. This mirrors the conditions of production deployment, where the quality that matters is the quality of the final output, not any single intermediate step.

The benchmark spans tasks across 12 industries and a range of enterprise use cases, from data analysis and report drafting to due diligence and expert review. Each task is scored against a detailed rubric of weighted pass/fail criteria (often several dozen per task) that check for the specific facts, figures, and conclusions a domain expert would expect in the deliverable. To ensure results are stable rather than the product of a single lucky or unlucky run, every task is evaluated across multiple trials, and the scores are aggregated by criterion weight. Both Sonnet 5 and Sonnet 4.6 ran under identical agent configurations, so the differences below reflect the models themselves.
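As a rough illustration of how weighted pass/fail scoring across trials can work, here is a minimal Python sketch. It is not Box's actual scoring code: the `Criterion` class, the `task_score` function, the example rubric, and the choice to pool weight earned over weight available across all trials are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One pass/fail rubric check (hypothetical structure)."""
    name: str
    weight: float  # relative importance of this check within the rubric


def task_score(criteria, trials):
    """Return weight earned divided by weight available, pooled over trials.

    `trials` is a list of dicts mapping criterion name -> True/False.
    Pooling across trials is an assumed aggregation rule, not a documented one.
    """
    if not criteria or not trials:
        raise ValueError("need at least one criterion and one trial")
    available = sum(c.weight for c in criteria) * len(trials)
    earned = sum(
        c.weight for trial in trials for c in criteria if trial[c.name]
    )
    return earned / available


# Made-up rubric and results, for illustration only.
rubric = [
    Criterion("cites correct revenue figure", 3.0),
    Criterion("flags supplier risk", 2.0),
    Criterion("states a recommendation", 1.0),
]
trials = [
    {"cites correct revenue figure": True,
     "flags supplier risk": True,
     "states a recommendation": True},
    {"cites correct revenue figure": False,
     "flags supplier risk": True,
     "states a recommendation": True},
]

score = task_score(rubric, trials)
print(f"{score:.1%}")  # prints "75.0%"
```

In this made-up example, a plain count of passed checks would give 5 of 6 (about 83%), while the weighted score is 75% because the one failed check carries the most weight.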

[Figure: Industry subset]

Pulling ahead where enterprises operate

Sonnet 5's quality gains concentrate in the structured, operational domains that drive day-to-day enterprise work:

  • Energy (+4pp): Sonnet 5 reaches 68% vs Sonnet 4.6's 64%, tied with Retail for the largest gain among the industries listed here, on tasks spanning operational reporting and multi-document analysis.
  • Retail (+4pp): 76% vs 72%, with stronger performance on due-diligence and verification workflows across product and supplier data.
  • Professional Services (+2pp): 71% vs 69%, reflecting more reliable handling of multi-source analytical deliverables.
  • Technology (+1pp): 63% vs 62%, a narrow edge over Sonnet 4.6 on technical analysis tasks.

Across these domains, Sonnet 5 delivers higher accuracy through a streamlined agent loop — better answers on the workflows enterprises run at the highest volume.

The practical effect shows up where it actually matters: in document-heavy operational work, this upgrade should translate into fewer reconciliation errors across multi-document reports, more reliable due-diligence and verification output, and less manual double-checking before a deliverable ships. Quality gains like these compound at volume — small accuracy improvements on the highest-frequency workflows add up to meaningfully less rework over time. For enterprises building production workflows that thousands of employees depend on, that repeatability matters as much as peak performance: a model that is reliably right is easier to deploy with confidence than one that is occasionally brilliant and occasionally off.

Built for production scale

The pattern across the evaluation is consistent: Sonnet 5 matches the frontier quality enterprises expect and improves on it in key operational industries, all through an efficient, streamlined agent loop. For teams moving AI from pilot to production — where every task multiplies across thousands of users — that profile makes Sonnet 5 a compelling choice for scaling enterprise AI.

Get started

Claude Sonnet 5 will be available soon for Box AI customers. To explore how it performs on your enterprise workflows, visit Box AI Studio or contact your Box account team.