First Look: GPT-4.1 now available with Box AI Studio

Today, OpenAI is launching GPT-4.1, a revamped version of, and successor to, the GPT-4o multimodal model. In our assessment, we found GPT-4.1 to be a powerful model that does well at complex tasks. It demonstrates strong performance, comparable to top models, in hard data extraction, and represents one of the largest gains we’ve seen in a model family across this data set. GPT-4.1 also shows excellent capabilities in multi-document Q&A, exhibits superior reasoning on complex visual data, and showcases robust step-by-step problem-solving skills. For Box customers, GPT-4.1 is available today, upon request, in Box AI Studio.

GPT-4.1 demonstrates strong performance on hard content extraction tasks, with 80% correctness on the CUAD subset, and excellent multi-document QA, scoring at parity with other leading state-of-the-art models in both categories. GPT-4.1 also shows superior reasoning on complex visual data and strong step-by-step problem-solving abilities. These strengths, particularly the high accuracy in metadata extraction for fields like "Expiration Date" and "Warranty Duration," suggest that GPT-4.1 isn't just pattern-matching but has developed the sophisticated reasoning skills necessary to handle the complexity and nuance common in real-world enterprise documents.

Excelling at tough data extraction

Extracting specific details from complex documents is tough, especially finding multiple related pieces of information scattered throughout (it's like finding several needles in a haystack) and doing it accurately in a single-shot extraction. But this is one of the areas where GPT-4.1 performed really well, getting it right 85% of the time on our hard test set, which is on par with other state-of-the-art models and a 27% improvement over GPT-4o. This improved accuracy in complex, single-shot extraction is crucial: it means faster processing of enterprise content, more reliable downstream automation, less manual review time, and ultimately a lower risk of missing critical information. For example, in our eval, GPT-4.1 showed a strong ability to extract tricky, interlinked details, such as calculating correct expiration dates based on other related terms within a contract.
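To make the "interlinked details" point concrete, here is a minimal Python sketch of the kind of derivation involved: an expiration date that is never stated outright but follows from an effective date and a term length found elsewhere in a contract. The function names, the sample values, and the convention that the contract expires on the same calendar day a whole number of months later are all assumptions for illustration; this is not how Box AI or GPT-4.1 computes the field, and real contracts may define the term differently.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    If the target month is shorter than the start day (e.g. Jan 31 + 1 month),
    the day is clamped to the last day of the target month.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def expiration_date(effective: date, term_months: int) -> date:
    """Derive an expiration date from two separately extracted fields.

    Assumption (hypothetical convention): the contract expires exactly
    `term_months` calendar months after the effective date.
    """
    return add_months(effective, term_months)


# Hypothetical extracted values: effective date on page 1, term on page 12.
print(expiration_date(date(2023, 3, 15), 24))  # 2025-03-15
```

A model that extracts "Expiration Date" correctly in a single shot has to locate both inputs and apply the contract's own rule for combining them, which is why this class of field is a useful stress test.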

Strong single-document question answering

GPT-4.1 answered questions from a single document 5% more accurately than GPT-4o. This strong reasoning helps users quickly find reliable answers within dense reports or policies. For instance, GPT-4.1 correctly identified the number of distinct authors in sample texts by analyzing writing styles, and it understood how local laws (like those in Japan) would apply to a general document like a liability waiver. These are tasks where the older model made errors.

Advanced image question answering & visual reasoning

GPT-4.1 is adept at interpreting images, also performing 5% better than GPT-4o. This allows GPT-4.1 to analyze business performance from charts or make sense of technical diagrams. GPT-4.1 particularly excels with complex visuals. For instance, it accurately read a detailed heatmap to determine developer utilization over consecutive months and correctly analyzed expense trends from a financial chart, both cases where the older model made errors.

Leading the pack in multi-document question answering

Multi-doc Q&A is the standout strength for GPT-4.1, which scored over 4% better than GPT-4o at synthesizing information from multiple sources. Synthesizing information this way is crucial for complex tasks like compliance checks, research, or just getting the full picture from scattered files. For example, it correctly identified the primary company policy to consult for handling accidental email disclosures and accurately found which pension provider Scottish employees should contact, pulling details correctly from different documents. GPT-4o had been less consistently accurate on these tasks.

The takeaway

GPT-4.1 represents a meaningful step forward from GPT-4o, particularly in its ability to extract business-critical information from your documents. It also shows a step change over GPT-4o in its ability to reason across multiple documents and analyze complex visual data with high fidelity. GPT-4.1’s synthesis and complex reasoning abilities make it a compelling option for enterprise use cases demanding nuanced understanding and accuracy across diverse information types. Its performance in multi-document QA, visual analysis, and hard extraction tasks positions it as a powerful tool for unlocking deeper insights from your business content.

GPT-4.1 is available today, upon request, in Box AI Studio. To test GPT-4.1 yourself, send us an email at [email protected].