
Many companies rely on unstructured documents to store critical business data. Lease contracts, for example, contain key details like property information, lease terms, and tenant names. Extracting this data manually is inefficient and difficult to scale.
In this article, we explore a conceptual solution that automates structured data extraction from lease contracts using Box AI, Airbyte, and MotherDuck (DuckDB).
Overview of the solution
This conceptual solution enables businesses to extract structured data from lease contracts and store it in a queryable database. The workflow consists of:
- Box AI: Extracts structured data from lease contracts using the AI Extract Structured Data endpoint
- Airbyte: Moves the extracted data from Box AI to a database
- MotherDuck: Stores the structured data in an online DuckDB instance for easy querying
The role of the Airbyte connector for Box
To facilitate this process, we built an Airbyte source connector called “Box data extract” that interacts with Box AI through the Box API.
While this demo primarily uses the AI extract structured data endpoint, the connector also supports:
- AI Ask: Summarizes documents and extracts key points (not used in this demo)
- AI Extract free form: Extracts data from documents using a free-text prompt
- Text representation: Converts documents into readable text
At the time of writing, the connector has not yet been approved by Airbyte. If you want to test it, you can find the source code in our GitHub community repository. The documentation on how to use it is also available here.
Lease contracts stored in Box

With the exception of location, this is pretty much a standard lease agreement. The information that we’re looking for includes property type, lease start and end dates, contract date, lessee name and email, property location, monthly rent, and number of bedrooms.
Typically, in this type of document, the information is scattered all over contract clauses.
One of the most powerful aspects of Box AI is its ability to handle variations in document formats.
Since the AI “understands” the context of the document, it can locate specific information even when it appears in different formats and references.
To improve accuracy, users can configure each extracted field with prompt hints, guiding the AI in identifying relevant details.
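As a rough illustration of per-field prompt hints, the sketch below builds a request body in the shape that, to our understanding of the public Box API reference, the AI Extract Structured Data endpoint (`POST /2.0/ai/extract_structured`) accepts. The field keys, types, and prompt wording are assumptions made up for this example, not the configuration used in the demo, and the code only assembles the payload without calling Box.

```python
import json

# Hypothetical field definitions: keys and prompts are illustrative only.
LEASE_FIELDS = [
    {"key": "property_type", "type": "string",
     "prompt": "What kind of property is being leased (house, apartment, ...)?"},
    {"key": "lease_start_date", "type": "date",
     "prompt": "The date on which the lease term begins."},
    {"key": "lease_end_date", "type": "date",
     "prompt": "The date on which the lease term ends."},
    {"key": "lessee_name", "type": "string",
     "prompt": "Full name of the tenant (lessee), not the landlord."},
    {"key": "lessee_email", "type": "string",
     "prompt": "Email address of the tenant (lessee)."},
    {"key": "monthly_rent", "type": "float",
     "prompt": "Monthly rent amount as a number, without currency symbol."},
    {"key": "number_of_bedrooms", "type": "float",
     "prompt": "How many bedrooms the property has."},
]


def build_extract_request(file_id: str, fields: list[dict]) -> dict:
    """Assemble a request body for one Box file and a list of fields to extract."""
    return {
        "items": [{"id": file_id, "type": "file"}],
        "fields": fields,
    }


# "1234567890" is a placeholder file ID, not a real document.
payload = build_extract_request("1234567890", LEASE_FIELDS)
print(json.dumps(payload, indent=2))
```

Keeping the prompts next to the field keys makes it easy to adjust a hint when a field is frequently extracted incorrectly, without touching the rest of the pipeline.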
Airbyte pipeline
On the Airbyte side, we have a data pipeline configured to use the Box data extract source and send the information to the MotherDuck destination:

The source configuration is set as follows:

Note the JSON schema used to specify what we’re looking for in the document, including a specific prompt for each field to help the AI find the information.
The destination configuration for MotherDuck is as simple as setting the API key:

MotherDuck database
Once you run the data pipeline in Airbyte, it automatically creates a table in the MotherDuck (DuckDB) database.

From here we can easily query the table:
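As a minimal sketch of such a query, the function below lists leases that end before a given date. The table name `lease_contracts` and its column names are hypothetical, since the real ones depend on the schema configured in the Airbyte source. To keep the example runnable without a MotherDuck token, the demo data lives in an in-memory SQLite database used purely as a local stand-in; the function only relies on `execute(...).fetchall()` with `?` placeholders, which a DuckDB connection also provides.

```python
import sqlite3

# Hypothetical table and column names; adjust to the table Airbyte creates.
QUERY = """
SELECT lessee_name, property_location, monthly_rent, lease_end_date
FROM lease_contracts
WHERE lease_end_date < ?
ORDER BY lease_end_date
"""


def leases_ending_before(con, cutoff: str) -> list[tuple]:
    """Return leases whose end date is before `cutoff` (ISO date string)."""
    return con.execute(QUERY, [cutoff]).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory SQLite stand-in with made-up rows for local testing."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE lease_contracts ("
        "lessee_name TEXT, property_location TEXT, "
        "monthly_rent REAL, lease_end_date TEXT)"
    )
    con.executemany(
        "INSERT INTO lease_contracts VALUES (?, ?, ?, ?)",
        [
            ("Alex Example", "Unit 12", 1500.0, "2025-06-30"),
            ("Sam Sample", "Unit 7", 2100.0, "2026-12-31"),
            ("Jo Placeholder", "Unit 3", 1800.0, "2025-03-31"),
        ],
    )
    return con


for row in leases_ending_before(demo_connection(), "2026-01-01"):
    print(row)
```

Against MotherDuck, the same function could presumably be called with a connection opened through the `duckdb` Python package (for example `duckdb.connect("md:")` with a MotherDuck token available in the environment) instead of the SQLite stand-in.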

Performance and automation
During testing, the system processed 50 lease contracts in approximately four minutes, or roughly five seconds per contract. While this is not a full benchmark, it demonstrates the potential efficiency of this approach. Four minutes wouldn’t be enough time for a human to read a single contract, much less 50.
From an Airbyte perspective, the process can be triggered automatically on a schedule or manually as needed.
Ensuring data quality and best practices
Given that AI-generated data can contain errors or hallucinations, you should review extracted data. Businesses should implement mechanisms to verify data consistency before relying on it for decision-making.
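One simple form such a mechanism could take is a set of consistency checks run on each extracted record before it is used downstream. The sketch below is only an example of the idea: the field names are hypothetical, dates are assumed to arrive as ISO strings, and the rules (end date after start date, plausible email, positive rent) are illustrative rather than a complete validation policy.

```python
import re
from datetime import date

# Deliberately loose pattern: catches obvious garbage, not every invalid address.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_lease(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one extracted record."""
    problems = []

    # Dates must be present, parseable, and in chronological order.
    try:
        start = date.fromisoformat(record["lease_start_date"])
        end = date.fromisoformat(record["lease_end_date"])
        if end <= start:
            problems.append("lease_end_date is not after lease_start_date")
    except (KeyError, TypeError, ValueError):
        problems.append("lease dates are missing or not ISO formatted")

    # The email should at least look like an email address.
    email = str(record.get("lessee_email") or "")
    if not EMAIL_RE.match(email):
        problems.append("lessee_email does not look like an email address")

    # Rent must be a positive number.
    rent = record.get("monthly_rent")
    if not isinstance(rent, (int, float)) or rent <= 0:
        problems.append("monthly_rent is missing or not positive")

    return problems


# Made-up record with three deliberate inconsistencies.
suspicious = {
    "lease_start_date": "2025-01-01",
    "lease_end_date": "2024-01-01",
    "lessee_email": "not-an-email",
    "monthly_rent": -5,
}
for problem in validate_lease(suspicious):
    print(problem)
```

Records that return a non-empty list can be routed to a human reviewer, while clean records continue through the pipeline.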
Expanding beyond lease agreements
While this demo focuses on lease agreements, the same methodology applies to a wide range of document types, including:
- Invoices: Extracting vendor details, invoice numbers, payment terms, and amounts due
- Other legal contracts: Identifying parties, contract duration, key clauses, and obligations
- Financial reports: Extracting revenue, expenses, profit margins, and financial statements
- HR documents: Parsing employee contracts, benefits information, and compliance records
- Insurance claims: Extracting policy numbers, claim amounts, and coverage details
- Regulatory filings: Automating the processing of compliance documents and regulatory reports
Once extracted, structured data can power multiple business processes, including:
- Automated workflows: Extracted data can trigger actions such as invoice approvals, contract renewals, and compliance checks
- Customer relationship management (CRM): Customer details from contracts, forms, and other documents can be integrated into CRM platforms for better engagement
- Regulatory compliance and KYC: Automating compliance checks, verifying identities, and ensuring adherence to legal requirements
- Enterprise resource planning (ERP) systems: Populating ERP applications with structured financial and operational data
- Financial analysis and reporting: Enabling automated financial reporting and trend analysis
- Contract risk assessment: AI can flag contracts with unusual terms, missing clauses, or potential liabilities by comparing extracted data against standard agreements
Conclusion
This demo highlights the simplicity and power of using Box AI, Airbyte, and MotherDuck to extract structured data from lease contracts.
The ability to transform unstructured documents into structured, queryable data unlocks numerous automation and integration possibilities.
With minimal setup, businesses can streamline document-driven workflows and improve data accessibility across their organizations.


