Anthropic's Claude Fable 5 represents the most significant model improvement we've measured on Box's Complex Work Eval — our proprietary benchmark for enterprise document intelligence. Across tasks spanning 11 industries, Fable 5 achieves 83% accuracy compared to 79% for Opus 4.8, a 4.4 percentage point improvement that concentrates in the tasks enterprises care about most: report drafting from data, financial analysis, and multi-step document reasoning.
Box evaluates every frontier model using an end-to-end agent framework that mirrors real enterprise workflows. The agent orchestrates document retrieval, processes multi-format content (spreadsheets, PDFs, presentations, images), and produces deliverables — exactly as it would in production.
Our Complex Work Eval comprises tasks scored against detailed rubric criteria with weighted pass/fail evaluation. Both models ran identical agent configurations, and we evaluated each task with numerous trials and criteria to ensure testing stability.
Where Fable 5 pulls ahead
Fable 5's advantage concentrates in tasks that demand numerical precision across multi-step calculations and careful reasoning about edge cases — exactly the scenarios where errors compound and carry real consequences.
Report Drafting from Data (+6.8pp): The largest task category in our benchmark, report drafting tests whether a model can synthesize numerical analysis from spreadsheets and source documents into structured deliverables. Fable 5 scores 87% vs Opus 4.8's 81% — producing more complete and accurate analytical reports with fewer missed calculations.
Data Analysis (+4.6pp): Across tasks requiring extraction and computation of metrics from complex documents, Fable 5 scores 80% vs 76%. The improvement stems from more precise formula application and fewer arithmetic errors on multi-step problems.
Due Diligence (+3.4pp): On tasks requiring verification of document details against compliance criteria, Fable 5 scores 77% vs 74%.

Performance by industry
Fable 5's largest advantages appear in content-heavy and numerically complex domains. Media & Entertainment shows the most dramatic improvement, with Fable 5 reaching 78% compared to Opus 4.8's 61% — a 17 percentage point gap. Technology follows at 81% vs 73% (+7.9pp), Healthcare at 66% vs 60% (+6.2pp), Financial Services at 89% vs 83% (+5.9pp), Retail at 93% vs 88% (+5.5pp), and Consumer Products at 89% vs 84% (+5.3pp).
The remaining industries (Legal, Industrial Goods & Automotive, Energy, Professional Services) show smaller but consistent improvements in the 1–2pp range.
What the improvement looks like in practice
The data tells us where Fable 5 wins. But the more instructive question is how it wins — and the answer reveals a behavioral shift in how the model approaches complex enterprise tasks.
Media & Entertainment — Genre profitability projection: A task required projecting net profits across multiple content genres with region-specific tax treatments. A 20% Argentine tax deduction was already embedded in the historical revenue figures in the source spreadsheet. Fable 5 traced the lineage of the numbers and recognized it was already factored in. Opus 4.8 applied the deduction again on top — a compounding error across four genre calculations that produced significantly inflated cost projections.
Financial Services — Debt facility modeling: On a 5-year projection requiring multi-step financial calculations, Fable 5 correctly applied interest to opening balances and used the appropriate capex treatment. Opus 4.8 applied interest to the total facility amount and computed tax from the wrong base — two errors that cascade through every downstream cell in the model.
Legal — M&A due diligence: Reviewing NDA terms against a semiconductor company's contracting policy, Fable 5 correctly identified that a joint-ownership clause violates exclusivity requirements while a fixed liability cap is permitted under a Super Cap exception. Opus 4.8 missed the nuance of the ownership structure. On a separate M&A task, Fable 5 correctly identified that an HSR filing exemption applies due to foreign issuer status, while Opus 4.8 asserted filing is required — incorrect guidance with real regulatory consequences.
Retail — Investment benchmark analysis: A task required identifying which high-growth product articles exceed a specific investment growth threshold. Fable 5 computed each article's growth rate individually and correctly identified that only 2 of 5 exceeded the 15.10% benchmark. Opus 4.8 confused "high growth relative to the category average" with "above the absolute benchmark" — a subtle but consequential analytical error.
Consistency as a feature
Beyond raw accuracy, Fable 5 demonstrates significantly higher consistency across repeated runs of the same task. On complex multi-document workflows, it produces materially the same answer nearly every time. Opus 4.8 exhibited notably higher variance — sometimes producing excellent results and sometimes making errors on identical inputs.
For enterprises building production workflows that thousands of employees depend on, this predictability matters as much as peak accuracy. A model that is reliably correct is more deployable than one that is occasionally brilliant.
Get started
Claude Fable 5 will be available soon for Box AI customers as a Beta model. When available, explore how it performs on your enterprise workflows in Box AI Studio.



