When AI can see, hear, and understand: The power of multimodal data extraction

AI has created enormous potential for organizations to structure their unstructured data so they can take advantage of the context that lies within it. With AI-powered data extraction, you can surface critical details from your content to drive more efficient workflows and make stronger business decisions. In episode #14 of the Box AI Explainer podcast, Box CTO Ben Kus explains:

“Data extraction is the idea of structuring unstructured data, so you get the attributes that you want.”

Box CTO, Ben Kus

With Intelligent Content Management, you can add that structure to your unstructured data by applying AI, metadata, and workflow automation — not just to text, but to all kinds of files, including images, videos, and audio. Multimodal data extraction is a key part of this capability.

In the latest installment of the AI Explainer Series, Kus and host Meena Ganesh explore AI-driven multimodal data extraction and what it means for business. As Ganesh puts it, “It’s not just extracting data. It’s helping businesses better understand the massive amounts of content they have, better manage it, and get more insights from it.”

From there, they can put that data and all those insights to work — automating processes that could never have been automated before, and creating new ways of working altogether.

Key takeaways:

  • Multimodal AI systems go beyond basic labeling to extract meaningful insights and make complex decisions based on a holistic understanding of the data presented to them, in much the same way humans interpret multiple types of sensory input at once
  • Transformer neural networks process content by tokenizing information — whether text or converted pixel data from images — to predict patterns, enabling AI to effectively see, hear, and understand multimedia content
  • By taking advantage of multimodal data extraction, enterprises can process massive amounts of content at scale and improve workflows like compliance audits, production quality assessments, and DAM (digital asset management) metadata labeling while reducing manual effort and cost

Advanced AI enables multimodal extraction

Multimodal data extraction uses advanced AI models to analyze and retrieve structured information from various types of content — text, images, audio, video — making the data that lies within accessible and actionable for businesses. This approach goes beyond basic labeling or transcription: AI interprets unstructured content much as humans perceive and analyze information, gleaning meaningful insights from diverse file formats that were previously challenging to process.

Of course, AI's ability to label images with object recognition has been around for a while. Audio transcription is also nothing new. But multimodal data extraction is different. Multimodal AI models are designed to mimic human cognitive processes, analyzing content from diverse sources. As Ben Kus describes it, “The AI model can actually see, hear, and understand an image” — almost in the same way a person would.

For example, if you open up an image within Box AI and ask the model to describe it, it will tell you exactly what it sees in human language. If you give it a complex illustrated storyboard for a movie, it can create a detailed text storyline.

“In the world of data extraction, you can now do a lot of interesting things. Almost anything you could ask a person to do, you can have an AI system do.”

Box CTO, Ben Kus

Under the hood of AI

Transformers are neural networks integral to natural language processing. “Under the hood” of AI, transformers systematically go through files and “tokenize” words. “What it’s doing,” says Kus, “is actually trying to predict the next word. That’s what all these models do; they just predict the next words. Everything you see from agentic models is based on this fundamental premise.”
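
To make that premise concrete, here is a toy sketch of next-word prediction in Python. A real transformer learns next-token probabilities from enormous training corpora; this example hard-codes a tiny bigram table purely to show the shape of the loop, so every word and probability in it is illustrative.

```python
# Toy sketch of "predict the next word": not a real transformer.
# A trained LLM learns these probabilities from data; a tiny
# hard-coded bigram table stands in to illustrate the loop.

bigram_probs = {
    "data": {"extraction": 0.7, "is": 0.3},
    "extraction": {"structures": 0.6, "is": 0.4},
    "structures": {"unstructured": 0.9, "data": 0.1},
    "unstructured": {"data": 1.0},
}

def predict_next(word: str) -> str:
    """Return the most probable next word, or a stop marker."""
    candidates = bigram_probs.get(word, {})
    return max(candidates, key=candidates.get) if candidates else "<end>"

def generate(start: str, max_new_words: int = 5) -> str:
    """Repeat the prediction step, one word at a time.
    This is, at heart, the same loop an LLM runs over tokens."""
    sequence = [start]
    for _ in range(max_new_words):
        next_word = predict_next(sequence[-1])
        if next_word == "<end>":
            break
        sequence.append(next_word)
    return " ".join(sequence)

print(generate("data"))
# -> data extraction structures unstructured data extraction
```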

With text-based content, this task seems like it would be easy enough for AI, but when it comes to other file formats, the process gets a little more complicated. For instance, AI “reads” an image by analyzing the pixels. Kus explains: “You can take little groups of these pixels — like 16-by-16 blocks of pixels — and put them into a long sequence with some positional information.”

You can then convert these strings of pixel information into tokens and feed them to an LLM in much the same way you would text. “In practice,” Kus says, “this comes across as AI being able to ‘see’ the image.”
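
Here's a minimal sketch of that patching step in Python with NumPy. The 16-by-16 block size comes straight from Kus's description; the image dimensions are an assumption for illustration, and a real model would additionally project each flattened patch into the same embedding space as text tokens, which is what lets the same next-token machinery "read" pixels.

```python
import numpy as np

# Illustrative input: a 224x224 RGB image as raw pixel data.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

PATCH = 16  # "16-by-16 blocks of pixels," per Kus

# Cut the image into a grid of 16x16 patches, then flatten each
# patch into a single row of pixel values.
rows, cols = image.shape[0] // PATCH, image.shape[1] // PATCH
patches = (
    image.reshape(rows, PATCH, cols, PATCH, 3)
         .transpose(0, 2, 1, 3, 4)      # group by grid position
         .reshape(rows * cols, -1)      # one flat vector per patch
)

# Pair each patch with its index: the "positional information"
# that tells the model where in the image each block came from.
sequence = list(enumerate(patches))

print(len(sequence))         # 196 patch "tokens" (a 14x14 grid)
print(sequence[0][1].shape)  # (768,): 16 * 16 * 3 pixel values
```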

Hands-on enterprise use of multimodal data extraction

By simulating the way humans process information from multiple sensory inputs, multimodal AI systems can extract meaningful insights and make complex decisions based on a holistic understanding of the data presented to them. Multimodal AI can also pull insights from various sources simultaneously, revolutionizing how we interact with and derive value from data across multiple sensory domains.

Now, enterprises can process massive amounts of content at scale, improving workflows like compliance audits, production quality assessments, and DAM metadata labeling while reducing manual effort and cost. This innovation empowers industries that naturally generate many different types of content, like media and entertainment (M&E) and retail, turning their raw data into actionable intelligence.

Consider:

  • A retail company automatically labeling an entire catalog of product images in various ways (a rough API sketch follows this list)
  • A construction company scanning security cam footage of workers to make sure there are no workplace violations
  • An insurance company analyzing media submitted by policyholders to assess things like flood damage and car accidents
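
As a rough illustration of the retail example, here is a sketch in Python against Box's AI extract endpoint. The endpoint path and payload shape follow Box's published AI API as we understand it, but treat them as assumptions and check the current Box API reference; the access token, file IDs, prompt, and attribute names are all placeholders.

```python
import requests

# Placeholders: substitute a real OAuth 2.0 token and real file IDs.
ACCESS_TOKEN = "YOUR_BOX_ACCESS_TOKEN"
PRODUCT_IMAGE_FILE_IDS = ["1234567890", "1234567891"]

def label_product_image(file_id: str) -> dict:
    """Ask Box AI to pull structured attributes from one product image.
    The prompt and attribute list are illustrative, not a fixed schema."""
    response = requests.post(
        "https://api.box.com/2.0/ai/extract",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        json={
            "prompt": (
                "Extract product attributes from this image: "
                "category, color, material, and season."
            ),
            "items": [{"id": file_id, "type": "file"}],
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Label the whole catalog, image by image.
for file_id in PRODUCT_IMAGE_FILE_IDS:
    print(file_id, label_product_image(file_id))
```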

For enterprise organizations grappling with vast unstructured data repositories, multimodal data extraction offers a solution to a longstanding challenge: how to extract the value from all of that data. Kus says, “Now, they can structure it. It’s almost like having a huge army of people going in and labeling all of it — which you would never do. It would be way too expensive. But now, AI can do it for you.”

Extraction on an enterprise scale, finally

You don’t know what you don’t know, as they say. For a lot of leaders, just figuring out how to think about structuring, labeling, and categorizing the information buried in their unstructured data is the first hurdle. “It was often so hard to understand what you were looking at,” Kus says. “Now, you can have AI assist you in the process of structuring, labeling, filtering, sorting, and so on.”

Kus uses a Box customer as an example of a typical candidate for multimodal data extraction. The company has terabytes of images to sort through on a regular basis and wanted to use AI to surface the best ones. This is a basic data extraction problem, and with multimodal AI, they can easily apply metadata to large repositories of images to quickly identify the best files.

Extrapolating this capability to a typical M&E company — which has tons of video, audio, and storyboards — Kus says, “Often, their eyes light up when they think about what they can do with all their assets.” By tagging assets with metadata via multimodal data extraction, they can easily find, for instance, every asset associated with a particular scene in a movie or a certain actor. 

“For businesses at this scale — especially ones that might operate on a global level — this doesn't just mean an image here, an audio there. It means doing this level of extraction, getting this level of insights, on a massive scale — and in a way that helps them take full advantage of AI.”

Box CTO, Ben Kus

Catch the full episode

For companies interested in getting more out of their full spectrum of digital assets, multimodal data extraction is a game-changer. Watch the full episode to discover how multimodal AI is transforming the enterprise.