
Box AI just gained a significant capability that developers should know about: OCR support for image files in the structured metadata extraction endpoints. No marketing fluff here — this is a practical enhancement that eliminates preprocessing steps and expands what you can do with the Box AI API.
What changed?
Previously, if you wanted to extract structured data from scanned documents or images, you’d need to convert them to PDF first. Now, Box AI’s structured extraction endpoints can directly process:
- TIFF, PNG, and JPEG files alongside the existing PDF support
- Multiple languages: English, Japanese, Chinese, Korean, and Cyrillic scripts
This works with both the standard and enhanced structured extraction endpoints. Note that the freeform extraction API doesn’t include OCR — this is specifically for structured extraction where you define fields or templates.
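To make that concrete, here is a minimal sketch of a direct call to the standard structured extraction endpoint with an image file. The token, file ID, and field definitions are placeholders I've made up for illustration, not values from Box's documentation:

```python
import json
import urllib.request

EXTRACT_URL = "https://api.box.com/2.0/ai/extract_structured"

def build_extract_request(file_id: str, fields: list) -> dict:
    """Assemble the body for POST /2.0/ai/extract_structured.

    The same body shape works for TIFF, PNG, and JPEG files as for
    PDFs -- no conversion step is needed before calling the API.
    """
    return {
        "items": [{"id": file_id, "type": "file"}],
        "fields": fields,
    }

def extract_from_image(token: str, file_id: str, fields: list) -> dict:
    """POST the request; OCR is applied automatically for image files."""
    req = urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(build_extract_request(file_id, fields)).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Illustrative field definitions for a scanned receipt:
RECEIPT_FIELDS = [
    {"key": "vendor", "type": "string", "description": "Merchant name"},
    {"key": "total", "type": "float", "description": "Total amount paid"},
]
```

No language parameter appears in the body: as noted below, OCR language detection is automatic.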
Why this matters
From a development perspective, this removes a conversion step from your workflow. Scanned receipts, invoices, forms, or any image-based document can go straight into your extraction pipeline. For applications serving Japanese, Chinese, Korean, or Russian-speaking markets, this opens up metadata extraction use cases that simply weren’t feasible before.
Building a receipt processing system
Let me show you how this works in practice. I’ll use the Box Community MCP Server — a Model Context Protocol implementation that lets Claude interact directly with Box APIs.
The scenario
I have five images of restaurant receipts that need to be processed and cataloged. Rather than manually defining a metadata structure, I’ll let Box AI analyze the images and suggest an appropriate schema, then use that to extract data from all the receipts.

Locate the files
First, I need to find my folder containing the receipt images:

The MCP server’s box_search_folder_by_name_tool and box_list_folder_content_by_folder_id make this straightforward. I now have the folder ID and file IDs for all five receipt images.
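Under the hood, these two tools presumably map to Box's search and folder-items endpoints. A sketch of the request URLs they would hit (the mapping is my assumption, not documented MCP server internals):

```python
import urllib.parse

BOX_API = "https://api.box.com/2.0"

def folder_search_url(name: str) -> str:
    """GET /2.0/search restricted to folders -- roughly what
    box_search_folder_by_name_tool does (assumption)."""
    query = urllib.parse.urlencode({"query": name, "type": "folder"})
    return f"{BOX_API}/search?{query}"

def folder_items_url(folder_id: str) -> str:
    """GET /2.0/folders/{id}/items lists the folder contents,
    matching box_list_folder_content_by_folder_id (assumption)."""
    return f"{BOX_API}/folders/{folder_id}/items"
```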
Generate a metadata template schema
Here’s where it gets interesting. Instead of manually defining fields, I’ll ask Box AI to analyze these images and suggest a metadata structure:

Box AI examines the images and proposes fields like:
- Restaurant name
- Date
- Total amount
- Payment method
- Line items with descriptions and prices
- Tax and tip amounts
The AI understands the domain context and suggests appropriate field types (string, date, float, enum for payment methods).
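Behind box_ai_ask_file_multi_tool sits Box AI's question-answering endpoint, POST /2.0/ai/ask, which can take multiple files as context. A sketch of the payload that asks for a schema suggestion (the prompt wording and file IDs are mine):

```python
def build_schema_suggestion_request(file_ids: list, prompt: str) -> dict:
    """Body for POST /2.0/ai/ask: ask Box AI to propose a metadata
    schema from a set of receipt images."""
    return {
        # Box AI distinguishes single- and multi-item question answering.
        "mode": "multiple_item_qa" if len(file_ids) > 1 else "single_item_qa",
        "prompt": prompt,
        "items": [{"id": fid, "type": "file"} for fid in file_ids],
    }

SUGGESTION_PROMPT = (
    "Analyze these receipt images and suggest a metadata template: "
    "field keys, display names, and types (string, date, float, enum)."
)
```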
Create the metadata template
Using the MCP server’s box_metadata_template_create_tool, I can turn this suggestion into an actual Box metadata template:
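The equivalent direct API call is POST /2.0/metadata_templates/schema. Here is a sketch of the template body the suggestion above might translate into; all keys and display names are illustrative, and since Box metadata templates are flat, the "line items" field is modeled as a multiline string (my assumption):

```python
RECEIPT_TEMPLATE = {
    "scope": "enterprise",
    "displayName": "Restaurant Receipt",
    "templateKey": "restaurantReceipt",
    "fields": [
        {"type": "string", "key": "restaurantName", "displayName": "Restaurant Name"},
        {"type": "date", "key": "date", "displayName": "Date"},
        {"type": "float", "key": "totalAmount", "displayName": "Total Amount"},
        {"type": "enum", "key": "paymentMethod", "displayName": "Payment Method",
         "options": [{"key": "Cash"}, {"key": "Credit Card"}, {"key": "Debit Card"}]},
        # Templates have no nested types, so line items land in one string field.
        {"type": "string", "key": "lineItems", "displayName": "Line Items"},
        {"type": "float", "key": "taxAmount", "displayName": "Tax Amount"},
        {"type": "float", "key": "tipAmount", "displayName": "Tip Amount"},
    ],
}
```

POSTing this body (with an OAuth bearer token) creates the template enterprise-wide, which is why it then shows up in the admin console.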

The template is now available in Box’s admin console:

Extract data from all images
Now comes the payoff. For each receipt image, I’ll use Box AI’s enhanced structured extraction with OCR:

The box_ai_extract_structured_enhanced_using_template_tool processes each image, applies OCR, and extracts the data according to our template schema. The enhanced endpoint uses more sophisticated models (like Google's Gemini) for better accuracy with complex layouts.
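Template-based extraction references the template by key rather than repeating field definitions. A sketch of the request body (the template key matches the illustrative template above; how the enhanced agent is selected depends on your Box AI configuration):

```python
def build_template_extract_request(file_id: str, template_key: str) -> dict:
    """Body for POST /2.0/ai/extract_structured using an existing
    metadata template instead of inline field definitions."""
    return {
        "items": [{"id": file_id, "type": "file"}],
        "metadata_template": {
            "type": "metadata_template",
            "scope": "enterprise",
            "template_key": template_key,
        },
    }

# Looping over the five receipt images would look like:
# for fid in receipt_file_ids:
#     body = build_template_extract_request(fid, "restaurantReceipt")
#     ... POST body to /2.0/ai/extract_structured with OAuth headers ...
```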
Apply the metadata
Finally, I apply the extracted data to each file as metadata instances:

And here’s the result in the Box web app:

Each receipt now has structured, searchable metadata extracted directly from the image.
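Applying the extracted values is a metadata-instance create: POST /2.0/files/{file_id}/metadata/enterprise/{template_key}, with the field values as the JSON body. A sketch (the example values are invented):

```python
def metadata_instance_url(file_id: str, template_key: str) -> str:
    """POST the extracted field values here to attach a metadata
    instance to the file."""
    return (
        f"https://api.box.com/2.0/files/{file_id}"
        f"/metadata/enterprise/{template_key}"
    )

# Example body: keys must match the template's field keys.
EXAMPLE_INSTANCE = {
    "restaurantName": "Blue Plate Diner",
    "totalAmount": 42.50,
    "paymentMethod": "Credit Card",
}
```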
Technical considerations
API endpoints
The OCR capability is available in the structured extraction endpoints:
POST /2.0/ai/extract_structured (standard extraction) accepts file_ids pointing to TIFF, PNG, JPEG, or PDF files. The enhanced extraction variant accepts the same file types.
Language support
The OCR engine automatically detects and processes:
- English
- Japanese
- Simplified and Traditional Chinese
- Korean
- Cyrillic scripts (Russian, Ukrainian, etc.)
No language parameter is required — detection is automatic.
Limitations
- Freeform extraction (POST /2.0/ai/extract) does not include OCR
- You need to use structured extraction with defined fields or templates
- OCR quality depends on image resolution and clarity (as with any OCR system)
Building with the Box Community MCP Server
The Box Community MCP Server provides convenient access to Box AI capabilities through Claude or any other MCP client. Key tools demonstrated here:
- box_search_folder_by_name_tool - Locate folders by name
- box_list_folder_content_by_folder_id - List folder contents
- box_ai_ask_file_multi_tool - Ask Box AI questions about multiple files
- box_metadata_template_create_tool - Create metadata templates programmatically
- box_ai_extract_structured_enhanced_using_fields_tool - Extract with custom fields
- box_ai_extract_structured_enhanced_using_template_tool - Extract with predefined templates
- box_metadata_set_instance_on_file_tool - Apply metadata to files
The MCP server handles authentication and API complexity, letting you focus on building functionality.
Practical applications
With OCR-enabled structured extraction, you can build:
- Invoice processing systems that extract vendor info, line items, and totals from scanned invoices
- Receipt management for expense tracking applications
- Document classification pipelines that extract key fields from forms
- Multilingual document processing for global operations
- Automated data entry from images uploaded by users
Getting started
- Set up Box AI: Ensure your Box instance has Box AI enabled
- Create metadata templates: Define your data schema or use Box AI to suggest one
- Call the extraction API: Pass image file IDs and your template key
- Process the results: The API returns structured JSON matching your template
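The steps above can be sketched as one pipeline: extract from an image, then write the result back as metadata. The HTTP helper is injected so the flow can be exercised without real Box credentials, and the response shape (an `answer` object carrying field values) is my assumption about the API, not a documented guarantee:

```python
from typing import Callable

def process_receipt(
    file_id: str,
    template_key: str,
    post: Callable[[str, dict], dict],
) -> dict:
    """Extract structured data from one receipt image via OCR, then
    apply it to the same file as a metadata instance.

    `post` is an injected helper: (url, json_body) -> parsed response.
    """
    extracted = post(
        "https://api.box.com/2.0/ai/extract_structured",
        {
            "items": [{"id": file_id, "type": "file"}],
            "metadata_template": {
                "type": "metadata_template",
                "scope": "enterprise",
                "template_key": template_key,
            },
        },
    )
    # Assumed response shape: field values under "answer"; fall back to
    # the whole body if the shape differs.
    values = extracted.get("answer", extracted)
    return post(
        f"https://api.box.com/2.0/files/{file_id}"
        f"/metadata/enterprise/{template_key}",
        values,
    )
```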
For detailed API documentation, check the Box AI Developer Documentation.
If you want to experiment with the MCP server approach, check out the Box Community MCP Server repository.
Conclusion
Adding OCR to Box AI’s structured extraction endpoints removes a common preprocessing bottleneck. You can now build document processing workflows that handle images natively, without conversion steps. Combined with the ability to programmatically create metadata templates (or have Box AI suggest them), this creates a flexible foundation for intelligent document management systems.
The real value isn’t in the technology itself. It’s in what you can build with it. Eliminating the PDF conversion step and supporting multiple languages means fewer moving parts in your architecture and broader applicability across markets.
Now go build something useful with it.


