Call it a hot take or an obvious truth: AI is better at grading than generating from scratch.
If someone asked you, "What year was the Magna Carta issued?" would you know the answer right off the bat? You might recognize the name of the royal charter from a high school history class, but it's likely you've forgotten the date.
Now, how about if someone asks, "Was the Magna Carta issued in 1215?" Chances are, the question would jog your memory, and you'd answer, "Why, yes."
This scenario is a simple way of understanding the concept of validation within enterprise AI. In a recent Box Partner Podcast, Box CTO Ben Kus and Ankur Goyal, CEO of Braintrust, discussed a counterintuitive truth about AI systems. While we focus on making AI smarter at generating answers, the real breakthrough comes from teaching it to recognize good answers. Just as humans find it easier to grade an essay than to write one, AI models are even better at validation than they are at content generation.
This "grading paradox" is the key to moving AI projects from impressive demos to reliable production systems — and it's why the most successful enterprises are building evaluation frameworks that leverage this strength.
Why grading beats generating in enterprise AI
Traditional software is comfortably predictable. Run the same 20 lines of code multiple times, and you get the same result. AI agents shatter this certainty, because every step of an agent is non-deterministic.
Agents often run hundreds of steps, generating giant traces of data where each step compounds the probability of an error. This leads to common production failures:
- Context confusion: Mixing logic between different domains (like banking vs. invoicing workflows)
- Confidence without accuracy: Providing wrong answers with apparent certainty
- Drift: Performance degrading as real-world usage diverges from training scenarios
Here's where the grading paradox becomes crucial: While an AI might struggle to generate the perfect response to a complex query, it can reliably identify whether a given response is good or bad. This insight fundamentally changes how we should approach AI development.
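The asymmetry can be pictured with a toy sketch. In a real system the grader would itself be an LLM call; here a simple rubric check stands in so the sketch runs on its own, and the rubric fields (`must_include`, `must_avoid`) are illustrative names, not any particular product's schema:

```python
# Toy illustration of the grading asymmetry: verifying a candidate
# answer against a rubric is a far simpler task than producing one.
# A keyword check stands in for what would be an LLM judge in practice.

def grade_answer(candidate: str, rubric: dict) -> bool:
    """Return True if the candidate satisfies every rubric criterion."""
    text = candidate.lower()
    has_required = all(term.lower() in text for term in rubric["must_include"])
    avoids_banned = all(term.lower() not in text for term in rubric["must_avoid"])
    return has_required and avoids_banned

rubric = {
    "must_include": ["1215"],        # the Magna Carta's year
    "must_avoid": ["1512", "1066"],  # plausible wrong dates
}

print(grade_answer("The Magna Carta was issued in 1215.", rubric))  # True
print(grade_answer("It dates to 1066.", rubric))                    # False
```

The grader never has to know the answer from scratch; it only has to recognize whether a given answer holds up, which is exactly the easier half of the problem.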
The shift toward an evaluation model with non-deterministic AI
This shift from predictable outcomes to non-deterministic AI requires a fundamental change in how we build software. Goyal thinks about it this way: "AI evals are the new product requirements documents."
In the old world, product managers defined requirements and engineers built to meet them. But in the AI world, the most valuable thing a product manager can do is capture real-world failures and articulate them in "crystal clear English" that an LLM can use to identify problems — essentially creating a grading rubric for AI outputs.
At Box, this approach has proved transformative. Kus notes that having product managers write specific examples of desired outputs was "dramatically useful." He elaborates: "For a while, we would tell our product teams and engineers 'We expect it to work like this,' but at some point we instead started having the product manager write an example of the correct output. That turned out to be a big revelation for us."
These examples become the foundation of evaluation frameworks — systematic ways to leverage AI's grading ability to assess its own generation quality. For instance, Goyal notes: "A lot of the leverage and help a product manager can offer happens after you ship the chatbot. Product managers can look at what happens with your agent in production and figure out how it's misbehaving," then use AI's grading capability to automatically identify similar failures in the future.
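One lightweight way to capture those PM-written examples is as a plain eval set: each case pairs an input with the product manager's example of a correct output, and a grader scores the agent against it. This is a minimal sketch, not any specific framework's API; the stub `agent` and string-match `grader` stand in for a real AI system and an LLM judge:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str      # the prompt or task given to the agent
    expected: str   # the PM's example of a correct output

def run_evals(cases, agent, grader):
    """Score an agent against PM-written examples; return the pass rate."""
    passed = sum(1 for c in cases if grader(agent(c.input), c.expected))
    return passed / len(cases)

# Stubs so the sketch is self-contained; in practice `agent` is your
# AI system and `grader` is typically an LLM acting as judge.
cases = [
    EvalCase("What year was the Magna Carta issued?", "1215"),
    EvalCase("Who sealed the Magna Carta?", "King John"),
]
agent = lambda q: "1215" if "year" in q else "unknown"
grader = lambda output, expected: expected.lower() in output.lower()

print(run_evals(cases, agent, grader))  # 0.5
```

The durable asset here is `cases`, not `agent`: the examples keep defining "good" even after the underlying model or implementation is swapped out.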
Building a measurement-first culture
The grading paradox explains why successful teams are shifting from a model-centric to a measurement-first mindset. Rather than obsessing over which AI model to use or how to optimize prompts, they're investing heavily in building comprehensive evaluation frameworks that capture what "good" looks like for their specific use cases.
This approach works because:
- LLMs excel at grading other LLMs — validation is often more accurate than generation
- Evaluations are durable assets while models and implementations change constantly
- Systematic grading enables rapid iteration without sacrificing quality standards
Model-centric approaches often fall prey to "marketing evals" that prioritize showmanship over real-world utility. But by focusing on measurement — using AI's strength in grading — teams can manage the inherent uncertainty of non-deterministic systems by quantifying accuracy and correctness rather than just speed and reliability.
Practical steps for leveraging the grading paradox
To adopt this measurement-first approach, Goyal recommends three key practices:
- Embrace failure as grading data: Failing evaluations provide a roadmap of what isn't possible today. Capture these failures as "lightning in a bottle" opportunities to refine your grading criteria.
- Test like a human would grade: Ask if a smart human without domain context could identify a good answer given the same inputs. If yes, your AI grading system can likely do the same.
- Build systematic feedback loops: Create workflows that reconcile offline testing with online production data, using multi-path analysis to check for consistency across different evaluation methods.
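The third practice, the feedback loop, can be sketched as a small routine that promotes graded production failures into the offline eval set, so each real-world miss becomes a permanent regression check. The trace schema (`input`, `output`, `corrected`) is illustrative, and the exact-match grader again stands in for an LLM judge:

```python
def update_eval_set(eval_set, production_traces, grader):
    """Promote graded production failures into the offline eval set.

    Each trace holds the agent's input, its production output, and a
    human-corrected output. When the grader flags a mismatch, the
    trace becomes a new offline case, so the same failure can't
    slip back in silently.
    """
    for trace in production_traces:
        if not grader(trace["output"], trace["corrected"]):
            eval_set.append({"input": trace["input"],
                             "expected": trace["corrected"]})
    return eval_set

grader = lambda output, expected: output.strip() == expected.strip()
traces = [
    {"input": "Invoice total?", "output": "$90", "corrected": "$100"},
    {"input": "Due date?", "output": "June 1", "corrected": "June 1"},
]
eval_set = update_eval_set([], traces, grader)
print(len(eval_set))  # 1 -- only the failing trace is promoted
```

Run on a schedule, this loop is what reconciles offline testing with online behavior: the offline suite keeps growing to cover exactly the ways the agent has actually misbehaved.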
"Every time a new model comes out, you should be prepared to rewrite your entire agent," Goyal advises. However, the work invested in defining and measuring success — the eval set that leverages AI's grading ability — stays relevant across model generations.
Leveraging the grading paradox for better enterprise AI
Kus reflects that if he could travel back in time as CTO of Box, he would prioritize his evaluation sets over the agents themselves. Those sets encapsulate the collective work of many people defining exactly what constitutes a good answer in specific situations — the grading rubric that makes everything else possible.
Success in enterprise AI doesn't come from finding the perfect model or crafting the ideal prompt. It emerges from understanding and leveraging the grading paradox: AI is better at recognizing good answers than generating them. By building robust evaluation systems that harness this strength, enterprises can finally move from impressive demos to reliable production systems.
In the end, making AI agents work isn't about making them smarter at generating. It's about getting smarter about how we use their natural ability to grade.
Watch the full podcast or browse more Box Partner Series episodes.

