OCR-powered metadata extraction with Box AI and MCP


Box AI just gained a significant capability that developers should know about: OCR support for image files in the structured metadata extraction endpoints. No marketing fluff here — this is a practical enhancement that eliminates preprocessing steps and expands what you can do with the Box AI API.

What changed?

Previously, if you wanted to extract structured data from scanned documents or images, you’d need to convert them to PDF first. Now, Box AI’s structured extraction endpoints can directly process:

  • TIFF, PNG, and JPEG files alongside the existing PDF support
  • Multiple languages: English, Japanese, Chinese, Korean, and Cyrillic scripts

This works with both the standard and enhanced structured extraction endpoints. Note that the freeform extraction API doesn’t include OCR — this is specifically for structured extraction where you define fields or templates.
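In practice, a structured extraction call against an image now looks exactly like one against a PDF. Here's a minimal sketch of the request body for `POST /2.0/ai/extract_structured` (the field definitions and file ID are illustrative; authentication and the actual HTTP call are omitted):

```python
def build_extract_request(file_id: str, fields: list) -> dict:
    """Assemble the JSON body for POST /2.0/ai/extract_structured.

    The same payload shape now works for TIFF, PNG, and JPEG file IDs,
    not just PDFs — no conversion step required.
    """
    return {
        "items": [{"id": file_id, "type": "file"}],  # the image or PDF to process
        "fields": fields,                            # ad-hoc field definitions
    }

# Illustrative ad-hoc fields for a receipt image
receipt_fields_example = [
    {"key": "vendor", "type": "string", "description": "Vendor name on the receipt"},
    {"key": "total", "type": "float", "description": "Total amount charged"},
]

body = build_extract_request("1234567890", receipt_fields_example)
# POST this body to https://api.box.com/2.0/ai/extract_structured with a bearer token.
```

You'd send this with any HTTP client; the response is structured JSON keyed by your field definitions.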

Why this matters

From a development perspective, this removes a conversion step from your workflow. Scanned receipts, invoices, forms, or any image-based document can go straight into your extraction pipeline. For applications serving Japanese, Chinese, Korean, or Russian-speaking markets, this opens up metadata extraction use cases that simply weren’t feasible before.

Building a receipt processing system

Let me show you how this works in practice. I’ll use the Box Community MCP Server — a Model Context Protocol implementation that lets Claude interact directly with Box APIs.

The scenario

I have five images of restaurant receipts that need to be processed and cataloged. Rather than manually defining a metadata structure, I’ll let Box AI analyze the images and suggest an appropriate schema, then use that to extract data from all the receipts.

Fake restaurant invoices stored in Box

Locate the files

First, I need to find my folder containing the receipt images:

Locating the OCR folder and listing its contents in Claude

The MCP server’s box_search_folder_by_name_tool and box_list_folder_content_by_folder_id make this straightforward. I now have the folder ID and file IDs for all five receipt images.
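Under the hood, those two tools map onto standard Box REST calls. A sketch of the URLs involved (endpoint paths per the Box API; the base URL and query parameters are the documented ones):

```python
from urllib.parse import urlencode

BASE = "https://api.box.com/2.0"

def search_folder_url(name: str) -> str:
    """GET /2.0/search restricted to folders — what
    box_search_folder_by_name_tool wraps."""
    return f"{BASE}/search?" + urlencode({"query": name, "type": "folder"})

def list_folder_items_url(folder_id: str) -> str:
    """GET /2.0/folders/{id}/items — what
    box_list_folder_content_by_folder_id wraps."""
    return f"{BASE}/folders/{folder_id}/items"

search = search_folder_url("OCR demo")
listing = list_folder_items_url("12345")
```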

Generate a metadata template schema

Here’s where it gets interesting. Instead of manually defining fields, I’ll ask Box AI to analyze these images and suggest a metadata structure:

Using Box AI to analyze images and suggest metadata template structure

Box AI examines the images and proposes fields like:

  • Restaurant name
  • Date
  • Total amount
  • Payment method
  • Line items with descriptions and prices
  • Tax and tip amounts

The AI understands the domain context and suggests appropriate field types (string, date, float, enum for payment methods).
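Translated into a field list, the suggestion looks something like this (the field keys, display names, and enum options are illustrative; the `type` values are standard Box metadata field types):

```python
# Illustrative field definitions matching the AI-suggested receipt schema.
# "string", "date", "float", and "enum" are Box metadata template field types.
receipt_fields = [
    {"key": "restaurantName", "type": "string", "displayName": "Restaurant name"},
    {"key": "receiptDate", "type": "date", "displayName": "Date"},
    {"key": "totalAmount", "type": "float", "displayName": "Total amount"},
    {
        "key": "paymentMethod",
        "type": "enum",
        "displayName": "Payment method",
        "options": [{"key": "cash"}, {"key": "credit"}, {"key": "debit"}],
    },
    {"key": "taxAmount", "type": "float", "displayName": "Tax"},
    {"key": "tipAmount", "type": "float", "displayName": "Tip"},
]
```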

Create the metadata template

Using the MCP server’s box_metadata_template_create_tool, I can turn this suggestion into an actual Box metadata template:

Creating metadata template based on Box AI’s suggestions
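Behind that tool call is a single REST request to `POST /2.0/metadata_templates/schema`. A sketch of the body (the `displayName` and `templateKey` values are illustrative; the payload shape follows Box's metadata template API):

```python
def build_template_request(display_name: str, template_key: str,
                           fields: list) -> dict:
    """Assemble the body for POST /2.0/metadata_templates/schema."""
    return {
        "scope": "enterprise",        # templates live in the enterprise scope
        "templateKey": template_key,  # stable key, reused later for extraction
        "displayName": display_name,
        "fields": fields,
    }

template_body = build_template_request(
    "Restaurant Receipt",
    "restaurantReceipt",
    [{"key": "restaurantName", "type": "string", "displayName": "Restaurant name"}],
)
```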

The template is now available in Box’s admin console:

The created metadata template displayed in Box administration page

Extract data from all images

Now comes the payoff. For each receipt image, I’ll use Box AI’s enhanced structured extraction with OCR:

Extracting data from each file using the newly created template

The box_ai_extract_structured_enhanced_using_template_tool processes each image, applies OCR, and extracts the data according to our template schema. The enhanced endpoint uses more sophisticated models (like Google's Gemini) for better accuracy with complex layouts.
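When extracting against a predefined template rather than ad-hoc fields, the request body references the template by key. A sketch, assuming the `restaurantReceipt` template key from the previous step and placeholder file IDs:

```python
def build_template_extract_request(file_id: str, template_key: str) -> dict:
    """Body for POST /2.0/ai/extract_structured using a metadata_template
    reference instead of inline field definitions."""
    return {
        "items": [{"id": file_id, "type": "file"}],
        "metadata_template": {
            "type": "metadata_template",
            "scope": "enterprise",
            "template_key": template_key,
        },
    }

# One request per receipt image (placeholder file IDs)
receipt_requests = [
    build_template_extract_request(fid, "restaurantReceipt")
    for fid in ["111", "222", "333", "444", "555"]
]
```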

Apply the metadata

Finally, I apply the extracted data to each file as metadata instances:

Applying metadata to all 5 files

And here’s the result in the Box web app:

Document metadata displayed in the Box app

Each receipt now has structured, searchable metadata extracted directly from the image.
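The apply step is a metadata instance creation, `POST /2.0/files/{file_id}/metadata/{scope}/{template_key}`, with the extracted values as the JSON body. A sketch (the file ID and extracted values are illustrative):

```python
def metadata_instance_url(file_id: str, template_key: str,
                          scope: str = "enterprise") -> str:
    """URL for POST /2.0/files/{file_id}/metadata/{scope}/{template_key}."""
    return (f"https://api.box.com/2.0/files/{file_id}"
            f"/metadata/{scope}/{template_key}")

# Sample extracted values — the JSON body for the POST
extracted = {"restaurantName": "Example Bistro", "totalAmount": 42.50}
url = metadata_instance_url("111", "restaurantReceipt")
# POST `extracted` to `url` with Content-Type: application/json.
```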

Technical considerations

API Endpoints

The OCR capability is available in the structured extraction endpoint:

POST /2.0/ai/extract_structured (both the standard and enhanced variants) accepts file_ids pointing to TIFF, PNG, JPEG, or PDF files.

Language support

The OCR engine automatically detects and processes:

  • English
  • Japanese
  • Simplified and Traditional Chinese
  • Korean
  • Cyrillic scripts (Russian, Ukrainian, etc.)

No language parameter is required — detection is automatic.

Limitations

  • Freeform extraction (POST /2.0/ai/extract) does not include OCR
  • You need to use structured extraction with defined fields or templates
  • OCR quality depends on image resolution and clarity (as with any OCR system)

Building with the Box Community MCP Server

The Box Community MCP Server provides convenient access to Box AI capabilities through Claude or any other MCP client. Key tools demonstrated here:

  • box_search_folder_by_name_tool - Locate folders by name
  • box_list_folder_content_by_folder_id - List folder contents
  • box_ai_ask_file_multi_tool - Ask Box AI questions about multiple files
  • box_metadata_template_create_tool - Create metadata templates programmatically
  • box_ai_extract_structured_enhanced_using_fields_tool - Extract with custom fields
  • box_ai_extract_structured_enhanced_using_template_tool - Extract with predefined templates
  • box_metadata_set_instance_on_file_tool - Apply metadata to files

The MCP server handles authentication and API complexity, letting you focus on building functionality.

Practical applications

With OCR-enabled structured extraction, you can build:

  • Invoice processing systems that extract vendor info, line items, and totals from scanned invoices
  • Receipt management for expense tracking applications
  • Document classification pipelines that extract key fields from forms
  • Multilingual document processing for global operations
  • Automated data entry from images uploaded by users

Getting started

  1. Set up Box AI: Ensure your Box instance has Box AI enabled
  2. Create metadata templates: Define your data schema or use Box AI to suggest one
  3. Call the extraction API: Pass image file IDs and your template key
  4. Process the results: The API returns structured JSON matching your template

For detailed API documentation, check the Box AI Developer Documentation.

If you want to experiment with the MCP server approach, check out the Box Community MCP Server repository.

Conclusion

Adding OCR to Box AI’s structured extraction endpoints removes a common preprocessing bottleneck. You can now build document processing workflows that handle images natively, without conversion steps. Combined with the ability to programmatically create metadata templates (or have Box AI suggest them), this creates a flexible foundation for intelligent document management systems.

The real value isn’t in the technology itself. It’s in what you can build with it. Eliminating the PDF conversion step and supporting multiple languages means fewer moving parts in your architecture and broader applicability across markets.

Now go build something useful with it.