How OpenAI’s GPT-5.4 improves data extraction across complex enterprise documents and industries


Today OpenAI released GPT-5.4, and we're excited to share the results of our latest model evaluation: a comparison of GPT-5.4 with its predecessor, GPT-5.2.

GPT-5.4 delivers a 6-percentage-point improvement in overall extraction accuracy across all document effort levels, rising from 72% to 78%. This upgrade translates to more reliable outputs and fewer errors, which is especially important for enterprises that rely on AI to extract data, analyze a variety of document types, and automate complex workflows. GPT-5.4 particularly shines for tasks that require multi-step calculations, deep reasoning across varied document types, and inference from complex content.

Overall extraction improvement

Our evaluation shows that extraction performance has improved across all categories, with significant gains across diverse document types and industries. By reliably capturing metadata from inconsistent structures, organizations can now automate complex financial and legal workflows with higher confidence and less manual verification.

Performance by document type

To understand how these improvements manifest in daily operations, we evaluated the models against several specific document types:

  • Clinical Data: Improved by 5 percentage points (81% → 86%) — supports categorizing patient risk groups and extracting numerical values for trial recruitment and clinical workflows.
  • Legal Agreement: Improved by 3 percentage points (82% → 85%) — assists in identifying procedural details, mapping them to specific legal principles, and capturing contract terms during due diligence.
  • Regulatory Filing: Improved by 5 percentage points (79% → 84%) — extracts compliance data and metadata from structured enterprise filings.
  • Research Publication: Improved by 7 percentage points (71% → 78%) — parses experimental findings and data from scientific literature.
  • Government Statistical Publication: Improved by 10 percentage points (60% → 70%) — analyzes large-scale datasets and demographic data for public sector and strategic planning.
  • Industry & Media Report: Improved by 7 percentage points (61% → 68%) — extracts market trends, cost-per-click metrics, and performance data from analyst reports.

These improvements deliver a measurable increase in effectiveness for specialized content. By providing a higher baseline for data extraction, GPT-5.4 streamlines high-stakes workflows—from clinical trial recruitment and R&D to legal due diligence—significantly reducing the need for manual oversight.
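The per-document-type figures can be compared side by side with a few lines of arithmetic. The sketch below is illustrative only: the scores are copied from the list above, while the names `SCORES`, `point_gain`, `relative_gain`, and `ranked_gains` are made up for this example and are not part of any Box or OpenAI API. It separates the percentage-point gain reported here from the relative gain over the GPT-5.2 baseline, which are easy to confuse.

```python
# Accuracy figures in percent, copied from the list above: (GPT-5.2, GPT-5.4).
# All names in this block are hypothetical and used only for illustration.
SCORES = {
    "Clinical Data": (81, 86),
    "Legal Agreement": (82, 85),
    "Regulatory Filing": (79, 84),
    "Research Publication": (71, 78),
    "Government Statistical Publication": (60, 70),
    "Industry & Media Report": (61, 68),
}


def point_gain(before, after):
    """Absolute change in percentage points."""
    return after - before


def relative_gain(before, after):
    """Change expressed as a percentage of the earlier score."""
    return (after - before) / before * 100


def ranked_gains(scores):
    """Return (name, point gain, relative gain) rows, largest point gain first."""
    rows = [
        (name, point_gain(before, after), relative_gain(before, after))
        for name, (before, after) in scores.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, points, relative in ranked_gains(SCORES):
        print(f"{name:<36} +{points} pts  ({relative:.1f}% relative)")
```

Ranked this way, the categories with the lowest GPT-5.2 baselines (Government Statistical Publication, Industry & Media Report) show the largest gains in both absolute and relative terms.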

Performance on complex reasoning tasks

Use case subset

Beyond document types, GPT-5.4 showed broad improvements across our core high-effort extraction categories:

  • Fields with heavy calculations (85% → 89%): Yields fewer errors when extracting data that requires quantitative analysis.
  • Many steps of reasoning (73% → 79%): Surpasses GPT-5.2 on nuanced legal and procurement reviews by more effectively executing multi-step tasks.
  • Many fields to extract across long documents (64% → 71%): Reduces gaps and inconsistencies when populating large metadata templates that require numerous fields to be extracted from extensive files.

Q&A performance increased across industries

Industry subset

Our evaluations confirm that GPT-5.4 delivers more consistent data extraction in complex, industry-specific scenarios. The model is now significantly better at identifying correct answers directly from source text, largely eliminating instances where it previously omitted information or incorrectly returned "not applicable."

  • Healthcare: Gained 9 percentage points (57% → 66%). In a healthcare recruitment analysis task, GPT-5.4 successfully mapped categories from underlying data patterns and extracted precise numerical values. In the same test, GPT-5.2 incorrectly marked several categories as "not applicable" and extracted a slightly inaccurate numerical value.
  • Legal: Saw an 11-percentage-point gain (52% → 63%). In related evaluation tasks, such as legal brief drafting, GPT-5.4 successfully navigated multi-criteria document requirements and avoided negative-weight criteria—like citing irrelevant authorities—that resulted in score penalties for GPT-5.2.
  • Energy: Improved by 16 percentage points (44% → 60%). This gain was driven by stronger performance in expert review and verification tasks, where the model demonstrated a better ability to infer correct information directly from the source text.
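Another way to read the industry figures is in terms of errors removed rather than points gained. The following is a rough sketch, assuming each percentage above is an accuracy score so that the error rate is simply 100 minus that score; the evaluation's actual scoring method is not described here, and the names `INDUSTRY_SCORES` and `error_reduction` are hypothetical.

```python
# Scores in percent, copied from the list above: (GPT-5.2, GPT-5.4).
# Assumption: each score is an accuracy, so error rate = 100 - score.
INDUSTRY_SCORES = {
    "Healthcare": (57, 66),
    "Legal": (52, 63),
    "Energy": (44, 60),
}


def error_reduction(before, after):
    """Percentage of the earlier error rate that the newer score eliminates."""
    errors_before = 100 - before
    errors_after = 100 - after
    return (errors_before - errors_after) / errors_before * 100


if __name__ == "__main__":
    for industry, (before, after) in INDUSTRY_SCORES.items():
        reduction = error_reduction(before, after)
        print(f"{industry}: {100 - before}% -> {100 - after}% errors "
              f"({reduction:.1f}% fewer)")
```

Under that assumption, the Energy result corresponds to roughly 29% fewer incorrect answers, compared with about 21% for Healthcare and 23% for Legal.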

Now available in Box AI Studio

GPT-5.4 is available today in Box AI Studio, giving you the ability to build custom AI workflows with a more capable model. Whether you're extracting data from spreadsheets, automating document review, or building workflows for healthcare, legal, or energy use cases, you can leverage these improvements directly within your enterprise content environment.