Box AI enterprise eval: OpenAI's o3 and o4-mini for data extraction with Box AI
This week, OpenAI released the o3 and o4-Mini reasoning models. Today, we look at how these two models handle a critical enterprise task: accurate data extraction.
Using the Box AI Enterprise Eval framework, we tested OpenAI's o3 and o4-Mini on a challenging subset of the CUAD dataset, which is representative of complex enterprise documents that require nuanced understanding for accurate extraction. Key performance metrics were correctness percentage and a composite F1 score, which balances precision and recall (it is their harmonic mean).
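To make the two metrics concrete, here is a minimal sketch of how correctness and F1 can be computed for a batch of extracted fields. It is not the Box AI Enterprise Eval implementation, which this post does not describe. The function name `score_extractions` and the counting convention (a wrong value counts as both a false positive and a false negative, and `None` means "no value") are assumptions chosen for illustration.

```python
from typing import Optional, Sequence


def score_extractions(
    predicted: Sequence[Optional[str]],
    expected: Sequence[Optional[str]],
) -> dict:
    """Score extracted field values against ground truth.

    Hypothetical convention (not taken from the Box AI framework):
    None means "no value"; a wrong value is both a false positive
    and a false negative; a spurious value is a false positive;
    a missed value is a false negative.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one field is required")

    tp = fp = fn = exact = 0
    for pred, gold in zip(predicted, expected):
        if pred == gold:
            exact += 1
            if gold is not None:
                tp += 1  # correct value extracted
            continue
        if pred is not None:
            fp += 1  # extracted something that is not correct
        if gold is not None:
            fn += 1  # failed to produce the correct value

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "correctness_pct": 100.0 * exact / len(expected),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Made-up field values, for illustration only.
    predicted = ["2021-01-01", "Delaware", None, "30 days"]
    expected = ["2021-01-01", "New York", "Acme", None]
    print(score_extractions(predicted, expected))
```

Because F1 is a harmonic mean, it drops sharply when either precision or recall is low, which is why it is a useful companion to a plain correctness percentage.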
Extraction performance insights

Our testing revealed strong extraction capabilities from both models, positioning them competitively alongside other leading models evaluated by Box AI:
- o4-Mini: Demonstrated excellent performance on the hard CUAD subset, achieving 84% correctness with an F1 score of 0.85. This indicates a high degree of accuracy and reliability in identifying and extracting the correct information from complex legal documents.
- o3: Also showed robust performance, achieving 80% correctness with an F1 score of 0.81 on the same challenging dataset.
While both models successfully extracted a high volume of the requested information, o4-Mini exhibited a noticeable edge in overall accuracy and F1 score. This level of performance makes both models valuable tools for automating information retrieval from enterprise content.
Understanding the performance edge: o4-Mini vs. o3
Both models deliver strong results suitable for enterprise use, but the nuanced difference in performance warrants closer examination.
- o4-Mini: Peak Accuracy for Critical Tasks: With 84% correctness and a higher F1 score (0.85), o4-Mini stands out for its reliability. Because F1 combines precision and recall, the higher score suggests it makes fewer incorrect extractions while still capturing most of the relevant information. This edge is particularly valuable for use cases where accuracy is paramount: final reviews of legal agreements, compliance checks involving sensitive data, or financial reporting where even small errors can have significant consequences. Its performance places it among the top-tier models we've evaluated for these demanding extraction tasks.
- o3: Robust and Capable: Achieving 80% correctness and a 0.81 F1 score on the challenging CUAD subset is a strong showing. o3 proves itself capable of handling complex extraction demands reliably. While o4-Mini demonstrates slightly higher accuracy, o3 provides a powerful baseline of performance that is more than sufficient for a wide array of enterprise tasks. Organizations might consider o3 for high-volume processing or scenarios where its robust capabilities meet the requirements, potentially offering a different balance of performance and efficiency compared to o4-Mini.
Putting o3 and o4-Mini to work across your organization
The refined extraction capabilities of these OpenAI models unlock powerful use cases across various teams:
- Legal & Compliance: Use o4-Mini for high-stakes contract analysis, identifying specific clauses, dates, and obligations with maximum accuracy for risk assessment and compliance verification. Employ o3 for initial contract review sweeps or categorizing large volumes of legal documents efficiently.
- Finance: Leverage o4-Mini to extract precise figures, terms, and counterparty details from financial agreements or regulatory filings. Use o3 for processing batches of invoices or expense reports where robust accuracy is sufficient.
- Sales Operations: Automatically populate CRM fields by extracting key terms, deal values, and renewal dates from sales contracts. o4-Mini can ensure the highest data integrity for critical fields, while o3 can handle broader data capture needs.
- Procurement: Analyze supplier agreements to extract delivery timelines, payment terms, and service level agreements (SLAs). Choose o4-Mini for verifying critical SLA commitments and o3 for general supplier contract data management.
- HR: Quickly process employment agreements or policy documents to extract start dates, compensation details, or non-compete clauses. o4-Mini ensures precision for sensitive employee data, while o3 aids in efficiently managing large volumes of HR records.
Get started today
With OpenAI's o3 and o4-Mini models available in Box AI, you have powerful new options for tackling complex data extraction challenges. Whether you need the peak accuracy of o4-Mini for critical workflows or the robust, capable performance of o3 for broader tasks, Box AI provides the flexibility to choose the right tool for the job.
Unlock the potential of precise, efficient AI for your enterprise content. To access OpenAI's o3 and o4-Mini in Box AI Studio and via the Box AI APIs, send us an email at [email protected].