Navigating the unstructured data problem: a CyberWire discussion with Heather Ceylan

AI offers enterprises remarkable opportunities to take advantage of what can be decades’ worth of content. But the independent nature of autonomous AI agents also poses unprecedented security challenges. Heather Ceylan, Chief Information Security Officer at Box, sat down with Dave Bittner, a co-founder of the CyberWire podcast, to talk about how forward-looking enterprises are managing their AI transformation and navigating the unstructured data problem.

This transcript has been edited for brevity and clarity.

We're talking about how governed AI starts with solving the unstructured data problem. Could you describe this problem for folks who aren't familiar with it?

For a long time, enterprises have struggled to govern their unstructured data: all the content that lives on users’ local devices and in the cloud, in Microsoft 365, Box, and other cloud storage providers. This is the distributed content problem: the content may not be labeled or classified correctly, and it may not be permissioned and controlled correctly. Organizations have tolerated the unstructured content governance problem for a long time because when you’re talking about humans accessing that content, the blast radius is limited. But now, as we start to talk about agents accessing that content on such a large scale, that problem becomes much more critical to solve.

Let's talk about that problem. What’s the issue with providing access to this sort of data to an AI?

I’ll start with an example. Let's say the legal team is working on a new partnership deal that includes material non-public information. They have this data in their folders: it's not classified appropriately, access isn't properly controlled. But a human who's not supposed to have access probably won’t find that data because they're not looking for it.

But say the product team is using an agent to research that same company. The agent is taking that data, surfacing it to the product team, creating roadmaps. It's acting on that knowledge. Now you've got a really big problem. Agents will access whatever content they can to try to solve a problem. So the blast radius is just much bigger.

Agents have an insatiable desire to gobble up every piece of data that they have permission to touch.

That's right. And the permissions for these agents are often vast because they need to do multiple-step processes. That's why it's so important to have restrictive permissions on your content and have that content classified.

Walk me through the process. If I want to get my data in proper condition before I expose it to AI, what sorts of things do I need to do in preparation?

This is one thing that I get really excited about because AI can help us here. It's not just part of the problem, it's part of the solution.

Organizations have really struggled with data classification. It's really hard to do that manually, to get humans to go in and properly classify data, especially when you're talking about years and years and years of unstructured content.

But AI can help us classify that content. And not just based on keywords, but by actually understanding what the content is and how sensitive it is, then proactively labeling it and applying permissions based on that understanding.

How exactly does this process work? Are we making copies of the data, or creating a roadmap to where it’s stored?

You aren’t creating content or creating copies of the data. You have an agent that reads and understands what's in the document and then applies a label to that content. Then once that label is attached to the content, it flows with it no matter where it's going across your systems.
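
Editor's illustration: the label-in-place idea described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Box's implementation. The `Document` type, the `classify` function, and the three sensitivity levels are all hypothetical, and the keyword check inside `classify` is only a runnable stand-in for the AI model that would actually read and understand the content.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical document record: a location, its text, and attached metadata."""
    path: str
    text: str
    metadata: dict = field(default_factory=dict)


def classify(text: str) -> str:
    """Stand-in for an AI classifier (assumption).

    A real system would call a model that understands the content; this
    keyword check exists only so the sketch runs on its own.
    """
    lowered = text.lower()
    if "non-public" in lowered:
        return "confidential"
    if "internal" in lowered:
        return "internal"
    return "public"


def label_in_place(doc: Document) -> Document:
    """Attach the label to the document's own metadata; no copy is made."""
    doc.metadata["sensitivity"] = classify(doc.text)
    return doc


def move(doc: Document, new_path: str) -> Document:
    """Moving the document changes its location, not its label."""
    doc.path = new_path
    return doc


deal = Document("legal/partnership.txt",
                "Draft terms with material non-public information.")
label_in_place(deal)
move(deal, "shared/partnership.txt")
print(deal.path, deal.metadata["sensitivity"])
# prints: shared/partnership.txt confidential
```

The point of the sketch is that the label lives in the document's metadata, so it travels with the document when the document moves.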

Does this improve the quality of the output I could get out of my AI agent?

That's another great benefit. One struggle you can have with AI is that when you have distributed content, the context is also distributed. Agents do better work when they have more context to work from. And if that context is distributed across multiple systems, and the agent has access to one system but not another, it may not have the context that it needs. And we're talking about agents that can then take action based on that data, even if they're missing the context.

Let’s dig into some of those risks. I can imagine this would be an area where if you don't have the tools you need, you’ll be open to shadow AI.

Yes, that risk is very real. And all the security leaders that I speak with on a regular basis are working actively to try to solve this problem. As security leaders and practitioners, it's important for us not to say no to AI. We just have to equip people with the tools that they need to do their jobs in a secure, safe manner.

Here at Box, it’s very important to us that our people use AI. We want them to be productive, we want them to be better at their jobs. We just want to make sure they do it securely.

What is it like when an organization decides to go down this path? How heavy a lift is it? I can imagine people being a little intimidated.

It's a very real challenge, especially for large organizations that might have forty, fifty, even a hundred years of content. Where do you even start? You don't want agents to go back and search or try to classify a hundred years' worth of content. So archiving content that's no longer relevant is another thing AI can really help with.

How do you dial in human oversight?

Again: the blast radius for agents across content is much bigger than it is for humans. Take that legal example I mentioned, involving sensitive non-public information. If an agent that’s working on that deal is grabbing information and acting on it, putting it into product roadmaps, messaging product leaders on Slack, your blast radius just gets so much bigger. Having a human in the loop for these types of sensitive actions is still necessary.

Now, as we start to get more comfortable with agents, there will be some low-risk things that we’ll let them do on their own, because we're comfortable with the controls and guardrails that we have in place. But there are going to be some actions that are so sensitive that we’ll still want humans in the loop.
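
Editor's illustration: one way to picture that split between low-risk and sensitive actions is an approval gate in front of the agent. The Python sketch below is an assumption-laden example, not a description of any product. The action names, the `"confidential"` label, and the `approve` callback (which stands in for a real human review step) are all hypothetical.

```python
from typing import Callable

# Hypothetical action names treated as sensitive enough to need human sign-off.
SENSITIVE_ACTIONS = {"share_externally", "send_message", "update_roadmap"}


def run_action(
    action: str,
    label: str,
    execute: Callable[[], str],
    approve: Callable[[str, str], bool],
) -> str:
    """Run low-risk actions directly; route sensitive ones to a human approver."""
    needs_review = label == "confidential" or action in SENSITIVE_ACTIONS
    if needs_review and not approve(action, label):
        return "blocked"
    return execute()


def deny_all(action: str, label: str) -> bool:
    """Stand-in reviewer that declines every request (assumption for the demo)."""
    return False


# Low-risk action on non-sensitive content runs without review.
print(run_action("summarize", "internal", lambda: "done", deny_all))
# prints: done

# Sensitive action on confidential content is held until a human approves.
print(run_action("send_message", "confidential", lambda: "done", deny_all))
# prints: blocked
```

In practice the `approve` callback would notify a person and wait for a decision; the gate itself stays the same.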

What are your recommendations for organizations that want to go down this path? How do you prepare?

I’d say, start by knowing where all your content is and storing as much of it as possible in a single location. Once you have all of your context and controls and governance in a single plane, it's about understanding what data you have, what the sensitivity is, classifying it, applying labels.

And again, AI can help us do that in a way that’s actually feasible. If we were relying on humans, that project might take years for security compliance teams to oversee. Now we can do it much more quickly.

And then once you have the data classified, you need to make sure you apply controls based on those classifications. You may want to restrict permissions. You may want to restrict file sharing. You may want to restrict agent access to some of that content.
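
Editor's illustration: applying controls based on classifications amounts to a lookup from label to policy. The Python sketch below assumes three hypothetical labels and two hypothetical controls (external sharing and agent access); real taxonomies and controls will differ by organization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controls:
    """Hypothetical controls derived from a sensitivity label."""
    external_sharing: bool
    agent_access: bool


# Assumed policy table: tighter controls as sensitivity increases.
POLICY = {
    "public": Controls(external_sharing=True, agent_access=True),
    "internal": Controls(external_sharing=False, agent_access=True),
    "confidential": Controls(external_sharing=False, agent_access=False),
}


def controls_for(label: str) -> Controls:
    """Unknown or missing labels fall back to the most restrictive policy."""
    return POLICY.get(label, POLICY["confidential"])


def agent_can_read(label: str) -> bool:
    """Check a label before handing content to an agent."""
    return controls_for(label).agent_access


print(agent_can_read("internal"))      # prints: True
print(agent_can_read("confidential"))  # prints: False
print(agent_can_read("unlabeled"))     # prints: False
```

Defaulting unlabeled content to the most restrictive policy is a design choice made for this sketch: it means content that has not been classified yet is not exposed to agents by accident.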

Then once you feel you have a strong, secure content layer, you can talk about agents operating on that content and think about the guardrails you need to have in place.