
Box AI just gained a significant capability that developers should know about: OCR support for image files in the structured metadata extraction endpoints. No marketing fluff here — this is a practical enhancement that eliminates preprocessing steps and expands what you can do with the Box AI API.
What changed?
Previously, if you wanted to extract structured data from scanned documents or images, you’d need to convert them to PDF first. Now, Box AI’s structured extraction endpoints can directly process:
- TIFF, PNG, and JPEG files alongside the existing PDF support
- Multiple languages: English, Japanese, Chinese, Korean, and Cyrillic scripts
This works with both the standard and enhanced structured extraction endpoints. Note that the freeform extraction API doesn’t include OCR — this is specifically for structured extraction where you define fields or templates.
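To make that concrete, here is a minimal sketch of a direct call to the standard structured extraction endpoint with an image file. The token, file ID, and field definitions are placeholders I've made up for illustration, not values from Box's documentation:

```python
import json
import urllib.request

EXTRACT_URL = "https://api.box.com/2.0/ai/extract_structured"

def build_extract_request(file_id: str, fields: list) -> dict:
    """Assemble the body for POST /2.0/ai/extract_structured.

    The same body shape works for TIFF, PNG, and JPEG files as for
    PDFs -- no conversion step is needed before calling the API.
    """
    return {
        "items": [{"id": file_id, "type": "file"}],
        "fields": fields,
    }

def extract_from_image(token: str, file_id: str, fields: list) -> dict:
    """POST the request; OCR is applied automatically for image files."""
    req = urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(build_extract_request(file_id, fields)).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Illustrative field definitions for a scanned receipt:
RECEIPT_FIELDS = [
    {"key": "vendor", "type": "string", "description": "Merchant name"},
    {"key": "total", "type": "float", "description": "Total amount paid"},
]
```

No language parameter appears in the body: as noted below, OCR language detection is automatic.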
Why this matters
From a development perspective, this removes a conversion step from your workflow. Scanned receipts, invoices, forms, or any image-based document can go straight into your extraction pipeline. For applications serving Japanese, Chinese, Korean, or Russian-speaking markets, this opens up metadata extraction use cases that simply weren’t feasible before.
Building a receipt processing system
Let me show you how this works in practice. I’ll use the Box Community MCP Server — a Model Context Protocol implementation that lets Claude interact directly with Box APIs.
The scenario
I have five images of restaurant receipts that need to be processed and cataloged. Rather than manually defining a metadata structure, I’ll let Box AI analyze the images and suggest an appropriate schema, then use that to extract data from all the receipts.

Locate the files
First, I need to find my folder containing the receipt images:

The MCP server’s box_search_folder_by_name_tool and box_list_folder_content_by_folder_id make this straightforward. I now have the folder ID and file IDs for all five receipt images.
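Under the hood, these two tools presumably map to Box's search and folder-items endpoints. A sketch of the request URLs they would hit (the mapping is my assumption, not documented MCP server internals):

```python
import urllib.parse

BOX_API = "https://api.box.com/2.0"

def folder_search_url(name: str) -> str:
    """GET /2.0/search restricted to folders -- roughly what
    box_search_folder_by_name_tool does (assumption)."""
    query = urllib.parse.urlencode({"query": name, "type": "folder"})
    return f"{BOX_API}/search?{query}"

def folder_items_url(folder_id: str) -> str:
    """GET /2.0/folders/{id}/items lists the folder contents,
    matching box_list_folder_content_by_folder_id (assumption)."""
    return f"{BOX_API}/folders/{folder_id}/items"
```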
Generate a metadata template schema
Here’s where it gets interesting. Instead of manually defining fields, I’ll ask Box AI to analyze these images and suggest a metadata structure:

Box AI examines the images and proposes fields like:
- Restaurant name
- Date
- Total amount
- Payment method
- Line items with descriptions and prices
- Tax and tip amounts
The AI understands the domain context and suggests appropriate field types (string, date, float, enum for payment methods).
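Behind box_ai_ask_file_multi_tool sits Box AI's question-answering endpoint, POST /2.0/ai/ask, which can take multiple files as context. A sketch of the payload that asks for a schema suggestion (the prompt wording and file IDs are mine):

```python
def build_schema_suggestion_request(file_ids: list, prompt: str) -> dict:
    """Body for POST /2.0/ai/ask: ask Box AI to propose a metadata
    schema from a set of receipt images."""
    return {
        # Box AI distinguishes single- and multi-item question answering.
        "mode": "multiple_item_qa" if len(file_ids) > 1 else "single_item_qa",
        "prompt": prompt,
        "items": [{"id": fid, "type": "file"} for fid in file_ids],
    }

SUGGESTION_PROMPT = (
    "Analyze these receipt images and suggest a metadata template: "
    "field keys, display names, and types (string, date, float, enum)."
)
```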
Create the metadata template
Using the MCP server’s box_metadata_template_create_tool, I can turn this suggestion into an actual Box metadata template:
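The equivalent direct API call is POST /2.0/metadata_templates/schema. Here is a sketch of the template body the suggestion above might translate into; all keys and display names are illustrative, and since Box metadata templates are flat, the "line items" field is modeled as a multiline string (my assumption):

```python
RECEIPT_TEMPLATE = {
    "scope": "enterprise",
    "displayName": "Restaurant Receipt",
    "templateKey": "restaurantReceipt",
    "fields": [
        {"type": "string", "key": "restaurantName", "displayName": "Restaurant Name"},
        {"type": "date", "key": "date", "displayName": "Date"},
        {"type": "float", "key": "totalAmount", "displayName": "Total Amount"},
        {"type": "enum", "key": "paymentMethod", "displayName": "Payment Method",
         "options": [{"key": "Cash"}, {"key": "Credit Card"}, {"key": "Debit Card"}]},
        # Templates have no nested types, so line items land in one string field.
        {"type": "string", "key": "lineItems", "displayName": "Line Items"},
        {"type": "float", "key": "taxAmount", "displayName": "Tax Amount"},
        {"type": "float", "key": "tipAmount", "displayName": "Tip Amount"},
    ],
}
```

POSTing this body (with an OAuth bearer token) creates the template enterprise-wide, which is why it then shows up in the admin console.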

The template is now available in Box’s admin console:

Extract data from all images
Now comes the payoff. For each receipt image, I’ll use Box AI’s enhanced structured extraction with OCR:

The box_ai_extract_structured_enhanced_using_template_tool processes each image, applies OCR, and extracts the data according to our template schema. The enhanced endpoint uses more sophisticated models (like Google's Gemini) for better accuracy with complex layouts.
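Template-based extraction references the template by key rather than repeating field definitions. A sketch of the request body (the template key matches the illustrative template above; how the enhanced agent is selected depends on your Box AI configuration):

```python
def build_template_extract_request(file_id: str, template_key: str) -> dict:
    """Body for POST /2.0/ai/extract_structured using an existing
    metadata template instead of inline field definitions."""
    return {
        "items": [{"id": file_id, "type": "file"}],
        "metadata_template": {
            "type": "metadata_template",
            "scope": "enterprise",
            "template_key": template_key,
        },
    }

# Looping over the five receipt images would look like:
# for fid in receipt_file_ids:
#     body = build_template_extract_request(fid, "restaurantReceipt")
#     ... POST body to /2.0/ai/extract_structured with OAuth headers ...
```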
Apply the metadata
Finally, I apply the extracted data to each file as metadata instances:

And here’s the result in the Box web app:

Each receipt now has structured, searchable metadata extracted directly from the image.
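Applying the extracted values is a metadata-instance create: POST /2.0/files/{file_id}/metadata/enterprise/{template_key}, with the field values as the JSON body. A sketch (the example values are invented):

```python
def metadata_instance_url(file_id: str, template_key: str) -> str:
    """POST the extracted field values here to attach a metadata
    instance to the file."""
    return (
        f"https://api.box.com/2.0/files/{file_id}"
        f"/metadata/enterprise/{template_key}"
    )

# Example body: keys must match the template's field keys.
EXAMPLE_INSTANCE = {
    "restaurantName": "Blue Plate Diner",
    "totalAmount": 42.50,
    "paymentMethod": "Credit Card",
}
```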
Technical considerations
API endpoints
The OCR capability is available in the structured extraction endpoints:
POST /2.0/ai/extract_structured (standard extraction) accepts file_ids pointing to TIFF, PNG, JPEG, or PDF files. The enhanced extraction variant accepts the same file types.
Language support
The OCR engine automatically detects and processes:
- English
- Japanese
- Simplified and Traditional Chinese
- Korean
- Cyrillic scripts (Russian, Ukrainian, etc.)
No language parameter is required — detection is automatic.
Limitations
- Freeform extraction (POST /2.0/ai/extract) does not include OCR
- You need to use structured extraction with defined fields or templates
- OCR quality depends on image resolution and clarity (as with any OCR system)
Building with the Box Community MCP Server
The Box Community MCP Server provides convenient access to Box AI capabilities through Claude or any other MCP client. Key tools demonstrated here:
- box_search_folder_by_name_tool - Locate folders by name
- box_list_folder_content_by_folder_id - List folder contents
- box_ai_ask_file_multi_tool - Ask Box AI questions about multiple files
- box_metadata_template_create_tool - Create metadata templates programmatically
- box_ai_extract_structured_enhanced_using_fields_tool - Extract with custom fields
- box_ai_extract_structured_enhanced_using_template_tool - Extract with predefined templates
- box_metadata_set_instance_on_file_tool - Apply metadata to files
The MCP server handles authentication and API complexity, letting you focus on building functionality.
Practical applications
With OCR-enabled structured extraction, you can build:
- Invoice processing systems that extract vendor info, line items, and totals from scanned invoices
- Receipt management for expense tracking applications
- Document classification pipelines that extract key fields from forms
- Multilingual document processing for global operations
- Automated data entry from images uploaded by users
Getting started
- Set up Box AI: Ensure your Box instance has Box AI enabled
- Create metadata templates: Define your data schema or use Box AI to suggest one
- Call the extraction API: Pass image file IDs and your template key
- Process the results: The API returns structured JSON matching your template
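The steps above can be sketched as one pipeline: extract from an image, then write the result back as metadata. The HTTP helper is injected so the flow can be exercised without real Box credentials, and the response shape (an `answer` object carrying field values) is my assumption about the API, not a documented guarantee:

```python
from typing import Callable

def process_receipt(
    file_id: str,
    template_key: str,
    post: Callable[[str, dict], dict],
) -> dict:
    """Extract structured data from one receipt image via OCR, then
    apply it to the same file as a metadata instance.

    `post` is an injected helper: (url, json_body) -> parsed response.
    """
    extracted = post(
        "https://api.box.com/2.0/ai/extract_structured",
        {
            "items": [{"id": file_id, "type": "file"}],
            "metadata_template": {
                "type": "metadata_template",
                "scope": "enterprise",
                "template_key": template_key,
            },
        },
    )
    # Assumed response shape: field values under "answer"; fall back to
    # the whole body if the shape differs.
    values = extracted.get("answer", extracted)
    return post(
        f"https://api.box.com/2.0/files/{file_id}"
        f"/metadata/enterprise/{template_key}",
        values,
    )
```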
For detailed API documentation, check the Box AI Developer Documentation.
If you want to experiment with the MCP server approach, check out the Box Community MCP Server repository.
Conclusion
Adding OCR to Box AI’s structured extraction endpoints removes a common preprocessing bottleneck. You can now build document processing workflows that handle images natively, without conversion steps. Combined with the ability to programmatically create metadata templates (or have Box AI suggest them), this creates a flexible foundation for intelligent document management systems.
The real value isn’t in the technology itself. It’s in what you can build with it. Eliminating the PDF conversion step and supporting multiple languages means fewer moving parts in your architecture and broader applicability across markets.
Now go build something useful with it.


