Pilot programs: Pressure-testing AI big bets in advance

This article is part of a series on how Box executes as an AI-first company. See our first article here.

In this series we’re following one of Box’s “hero agents” — an Intelligent Prospecting Agent that streamlines messaging for Sales. Creator Alex Hudzik, Senior Director of Sales Development, sees this agent as essentially “a product marketer in every seller’s back pocket.” 

But the prospective new tool must first move through Box’s four-part approach to developing AI agents, graduating from Ideation to Piloting and Rollout before reaching the agentic promised land of Scaled Adoption.

Having survived the initial Ideation phase, the Prospecting Agent now transitions into Piloting, a 25-user stress test that puts new AI agents in the hands of users solving actual business problems. “It’s a unique opportunity to take ideas, and test and iterate and measure feedback in a condensed timeframe,” explains Nora Soza, Senior Director of GTM Strategy and Operations. “You’re incubating that to get the highest value as quickly as possible.”

[Funnel graphic]

Your Pilot speaking: A program for agent validation 

Before committing resources to a full build, you need a structured way to validate whether a proposed AI agent truly warrants company-wide investment. Piloting is a focused opportunity to test, measure, and decide whether to scale — or sunset — an agent before costs compound.

“It’s the point at which the leader says, ‘Yes, this is something I think we should build that I want my entire team to use,’” says Robert Ferguson, Box’s Head of Corporate Strategy and Chief of Staff to the CEO.

The moment an idea becomes a Pilot

For the Prospecting Agent, that transition point came when Hudzik stepped back from the experimental agents scattered across his team and recognized an opportunity. Multiple overlapping initiatives had emerged from both EMEA and US teams, each addressing similar outbound prospecting challenges. Rather than continue running efforts in parallel, leadership saw the potential to fuse them into something more powerful — something worth formal validation.

“Alex essentially took a step back,” says Soza, “and thought more holistically about what it would really take to create an intelligent Prospecting Agent. It would require very clean best-practice messaging, how we market to different verticals and personas and leverage information about current product usage.”

This shift from exploration to execution is what distinguishes Ideation from Piloting. During Ideation, you ask, “What could this be?” During Piloting, you ask, “Will this actually work at scale?”

Building for validation, not perfection

Enabling the Prospecting Agent’s functionality meant connecting multiple data sources: curated product marketing content, Salesforce data, product usage analytics, and an industry-specific knowledge Hub. Building all of that at once would have delayed testing and feedback indefinitely.

Instead, the team took what Greg Keiser, Sales Strategy and Operations Director, calls “an MVP (minimum viable product) approach.”

“We wanted to get feedback as quickly as possible,” Keiser explains. “So we sprinted on the foundational development, and once the tool offered users enough value to prove the concept, we launched the Pilot. Post-launch, we’re continuing to sprint forward on dev work, now with the benefit of feedback from our Pilot users.”

This iterative approach — build minimum, test with real users, add based on feedback — exemplifies what makes Piloting distinct from both ideation and full rollout. You’re building something real enough to validate, but not so complete that you’ve over-invested before confirming it works.

The roles you need in order to run an effective Pilot

To run an effective Pilot, you need clear ownership across four distinct functions: Functional Leaders, AI Managers, a Build team, and test users. Assign these roles carefully and early — without them, accountability gaps will stall progress and muddy feedback.

[Essential roles graphic]

I. FUNCTIONAL LEADERS: Setting strategy and defining success

“The Functional Leader is going to be someone fairly senior who’s overseeing an entire departmental area,” Ferguson explains. “Their role is to define the Pilot’s outcomes — what success looks like.”

Before a Pilot can deliver meaningful results, leaders have to green-light the fundamentals: Do we have the right people to build it? Do we have the time? Has the architecture been defined and approved?

“You need to lay out what the architecture is, get that signed off,” Ferguson says. “Then you need to figure out what work is required to build that and who needs to build it.” 

Only once those building blocks are in place can teams move into defining success metrics and measuring outcomes with confidence.


II. AI MANAGERS: Handling planning and ground-level work

AI Managers sit closer to execution. “These individuals need to be extremely consultative and collaborative in working through the design and the approach,” Soza notes. 

For simpler agents built entirely within Box AI, the AI Manager might handle everything. For complex builds like the Prospecting Agent, they become a critical liaison between business needs and the Build team that’s driving technical execution.


III. BUILD TEAM: Making sure context delivers quality

For complex agents like the Prospecting Agent, Box’s Build team — combining expertise from Enterprise Solutions, IT, and strategic operations — handles technical execution that functional teams can’t. “To be truly powerful, it needed to be connected to customer and product data,” Ferguson explains. “So we ended up getting the data science team involved and built a data insights portal, which sits on top of our product data in GCP.”

One of the Build Team's initial tasks is auditing content the agent will draw from. For the Prospecting Agent, this meant structuring product marketing messaging into formats the agent could effectively leverage. “You really need that curated knowledge for the agent to leverage to be effective,” Soza notes.

This structure enables functional teams to maintain ownership of business requirements while the Build team provides the specialized technical expertise needed to architect agents for production-grade performance and cross-functional reuse.


IV. TEST USERS: Delivering feedback to promote agent growth

Choose your test users carefully: they’re the ones who determine whether the feedback on your fledgling agent is signal or noise.

Not everyone is suited to the task. Look for people with dedicated time, clear accountability, and genuine investment in outcomes. “Your testing group has to have a sense of ownership,” Soza emphasizes. “They have to have time to dedicate to this. Testing agents isn’t a side hobby, it has to be part of their job, and they have to feel personally accountable for the end product.”

For the Prospecting Agent, Hudzik turned to his managers and team leads — people he felt were close enough to the problem to provide meaningful feedback. “They were running tests side by side,” Soza says. “They’d write sales messages and then try them with the agent to pressure-test their language’s efficacy.”

Defining the metrics that track real success

You can’t improve what you don’t measure, and you can’t justify continued investment without clear metrics tied to business outcomes. Define the metrics that will determine success before you launch a Pilot.

Box’s CIO Ravi Malick has a clear philosophy on measurement. “It’s not as complex as people make it out to be,” he says. “At the end of the day, your success metrics need to be linked to revenue or margin. It all leads back to the balance sheet.”

To measure Pilot agent success, Ferguson explains, teams work backward to identify leading indicators by validating at least one of three categories of performance metrics:

  1. Efficiency gains — Does the agent deliver measurable time savings, such as hours reduced to produce marketing copy or assemble a recruiting kick-off pack?
  2. Automation rates — Does the agent increase the percentage of work handled without human intervention, like support tickets deflected to self-service?
  3. Net-new work enabled — Does the agent enable capabilities that didn’t exist before, such as industry-specific meeting preparation or persona-tailored outreach at scale? 
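To make those categories concrete, here’s a minimal sketch of how a Pilot team might roll per-task test results up into all three measures. The data structure, field names, and sample numbers below are illustrative assumptions for the sake of the example, not Box’s actual tooling.

```python
# Minimal sketch: rolling Pilot feedback into the three metric categories.
# All field names, sample numbers, and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TaskResult:
    minutes_baseline: float      # time the task took before the agent
    minutes_with_agent: float    # time the task took during the Pilot
    needed_human_edit: bool      # did a person have to rework the output?
    net_new: bool                # was this work skipped entirely pre-agent?

def pilot_metrics(results: list[TaskResult]) -> dict[str, float]:
    """Summarize a batch of Pilot task results into the three categories."""
    n = len(results)
    # 1. Efficiency gains: hours saved on work that already existed
    hours_saved = sum(r.minutes_baseline - r.minutes_with_agent
                      for r in results if not r.net_new) / 60
    # 2. Automation rate: share of tasks that needed no human rework
    automation_rate = sum(not r.needed_human_edit for r in results) / n
    # 3. Net-new work: share of tasks that didn't happen before the agent
    net_new_share = sum(r.net_new for r in results) / n
    return {
        "efficiency_hours_saved": round(hours_saved, 1),
        "automation_rate": round(automation_rate, 2),
        "net_new_share": round(net_new_share, 2),
    }

# Example: three prospecting messages from one test user.
sample = [
    TaskResult(45, 5, needed_human_edit=False, net_new=False),
    TaskResult(30, 5, needed_human_edit=True,  net_new=False),
    TaskResult(0,  6, needed_human_edit=False, net_new=True),  # persona-tailored outreach
]
print(pilot_metrics(sample))
# {'efficiency_hours_saved': 1.1, 'automation_rate': 0.67, 'net_new_share': 0.33}
```

In practice, the inputs would come from wherever the Pilot logs its usage; the point is simply that each test task can be tagged once and then counted toward efficiency, automation, or net-new work.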

“Is this actually saving them time or making them better at their jobs?” Ferguson asks. “Is it not hallucinating — and essentially doing its job reliably and accurately?”

For the Prospecting Agent, efficiency and automation gains showed up in the specifics, starting with reliability and accuracy. Time per message dropped from 30-60 minutes to roughly 5 minutes, and messaging became far more consistent. Adoption signals also played a role.

Beyond faster outreach, the Prospecting Agent actually unlocked net-new work, enabling the SDR team to tailor messaging to specific industries and personas at scale, a powerful personalization exercise that they lacked the time or expertise to do before. 

Sending your Pilot into wider use — or off into the sunset 

Not every Pilot succeeds, and that’s okay. Even failed Pilots — sometimes especially failed Pilots — generate long-term value as an organization learns what tools will and won’t facilitate its AI transformation. 

“Even if the use case itself isn’t successful,” says Soza, “there are often learnings that in and of themselves are valuable.” The Prospecting Agent’s current architecture, for instance, may not be its final form, but the team has learned critical lessons about accessing structured data and curating knowledge that will inform future builds.

The key to keeping AI transformation on track while adhering to a rigorous Pilot program is aggressive timelines that maximize momentum. “Six months feels like a long, long time,” says Malick. “90 days is generous. You should try to review every 30 days.” 

“The beauty of a Pilot,” Soza adds, “is that you’re doing your turns in such quick increments that your loss in terms of investment is minimal compared to committing to a big nine-month long project.” 

For the Prospecting Agent, the Pilot phase continues. The team pushes updates to production every two to three weeks, gathering feedback, iterating, and building toward business-wide availability.

“We still have so much work to be done,” Keiser admits. “I don’t want to sound like we just won the Super Bowl. We’ve maybe won a few games in the preseason.”

But with Pilots, that’s a fine place to be — far enough along to know you’re onto something interesting, but with enough time left to improve it before the metrics start to impact that aforementioned balance sheet. 

For our Prospecting Agent, thankfully, there’s no sunset in sight. Preseason is over, and the regular season is about to begin. 

In our next article, we’ll follow this agent into the Rollout phase, where teams develop a production-ready version in preparation for company-wide scaling.

Track ROI, time saved, and revenue alignment in one place

Download this worksheet to bridge the gap between technical experimentation and measurable business impact by tracking your AI agent's performance, resourcing, and alignment with revenue goals.