Under the Hood: Building a Mobile Document Scanner

Overview
Today we released the ability to scan physical documents using Box Capture on iOS devices. Document scanning builds on Capture's existing support for creating photos, videos and audio files so that mobile workers can share critical information from the field quickly and securely with the right systems and people. To enable multiple media types, we took on the challenge of implementing a pipeline that could support and seamlessly transition between the different modes of photo, video, audio and document scanning.

The Challenge
Each mode requires a different combination of input and output connections, data formatters and user interface elements. We created reusable session components that we can quickly assemble to support any selected capture mode. As a result, adding 'scan' mode only required us to add a rectangle detection filter to the video output stream pipeline and append the detection metadata for downstream processing.

We decided to build the document scanning capability in-house rather than with an SDK because we needed a mobile scanning solution that would:

Scan multi-page documents
Save scans directly to Box as PDF files
Adapt to field conditions (e.g. holding a document up to scan or scanning in a moving vehicle)
Support automated and manual edge detection, cropping and de-skewing

Building a dynamic, scalable solution that enables us to support a variety of different media types is critical to providing our customers with a great mobile content creation and collaboration experience. Since launching in September, we've already seen customers use Capture in a variety of ways including for retail store inspections, construction progress updates and real estate location scouting. We've even seen customers use Capture to provide perimeter security checks and document evidence at crime scenes - two scenarios we never anticipated!

Capture Session Architecture
We used Apple's AV Foundation framework, and its AVCaptureSession class in particular, to support our capture modes. Together they provide access to the output data from the device's camera and audio hardware. Each active session supports its capture mode with the appropriate input and output connections. For 'scan' mode, edge detection uses the same session configuration as 'video' mode. We convert the frames retrieved from the raw video buffer to Core Image images and pipe them into Apple's CIDetector class for rectangle detection.

This design allows us to integrate not only rectangle detection capabilities, but also other technologies such as QR code detection, with little effort.
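One way to picture this is as a chain of pluggable detectors sitting on the video output. The sketch below is illustrative only and is not Capture's actual code: `Frame`, `Detection`, `FrameDetector`, `DetectionChain` and both detector types are hypothetical names, and the detectors return canned results instead of running Core Image detection.

```swift
// Hypothetical stand-in for a video frame; a real pipeline would wrap an image buffer.
struct Frame {
    let index: Int
}

// Metadata a detector attaches to a frame for downstream processing.
struct Detection: Equatable {
    let kind: String
    let frameIndex: Int
}

// Anything that can inspect a frame and report what it found.
protocol FrameDetector {
    func detect(in frame: Frame) -> [Detection]
}

// Placeholder detectors: real ones would run image analysis here.
struct RectangleDetector: FrameDetector {
    func detect(in frame: Frame) -> [Detection] {
        [Detection(kind: "rectangle", frameIndex: frame.index)]
    }
}

struct QRCodeDetector: FrameDetector {
    func detect(in frame: Frame) -> [Detection] {
        [Detection(kind: "qr", frameIndex: frame.index)]
    }
}

// The chain runs every registered detector against each frame, in order.
struct DetectionChain {
    private(set) var detectors: [FrameDetector] = []

    mutating func add(_ detector: FrameDetector) {
        detectors.append(detector)
    }

    func process(_ frame: Frame) -> [Detection] {
        detectors.flatMap { $0.detect(in: frame) }
    }
}
```

With this shape, supporting a new kind of detection is a single `add` call on the chain; nothing else in the pipeline has to change.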

We created a session-centric architecture that allows us to switch between capture modes and easily add new modes in the future. The controller has several components: the session exchange (AV session creation and management), capture output and processing, asset creation and file uploading, user interface rendering and error handling. The session exchange is capable of coordinating all aspects of the pipeline while swapping session components, even reflecting changes to the user interface and sending notifications to asynchronous recorders. When the user switches modes, the controller reconstructs the pipeline in part or in whole, minimizing the time spent suspending sessions so that the new mode activates quickly.

When developing a multimedia capture application, we learned an important lesson: it is necessary to identify the exact input and output connections required for each mode and, when switching modes, to coordinate the suspension of the previous mode's session (e.g. its buffers and data processing) before constructing the new mode's session.
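The ordering described here (suspend first, construct second) can be sketched as follows. This is a simplified model under stated assumptions, not the real session exchange: `ModeSession` and `SessionExchange` are hypothetical types that only track state and keep a log, with no AV Foundation calls.

```swift
enum CaptureMode { case video, audio, photo, scan }

// Hypothetical session wrapper that only tracks whether it is running.
final class ModeSession {
    let mode: CaptureMode
    private(set) var isRunning = false

    init(mode: CaptureMode) { self.mode = mode }

    func start() { isRunning = true }

    // Stands in for stopping buffers and data processing.
    func suspend() { isRunning = false }
}

final class SessionExchange {
    private(set) var current: ModeSession?
    private(set) var log: [String] = []

    func switchTo(_ mode: CaptureMode) {
        // 1. Suspend the previous mode's session first.
        if let previous = current {
            previous.suspend()
            log.append("suspended \(previous.mode)")
        }
        // 2. Only then construct and start the new mode's session.
        let next = ModeSession(mode: mode)
        next.start()
        log.append("started \(mode)")
        current = next
    }
}
```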

Below is our list of requirements for each capture mode:

Video Mode

Session - Audio and video input connections, Video data output connection
File - Video asset writer with audio layer option
UI - OpenGL display view

Audio Mode

Session - Audio input connection, Audio data output connection
File - Audio asset writer
UI - OpenGL display view

Photo Mode

Session - Still image input connection, Still image output connection
File - JPEG writer
UI - AVPreviewLayer view

Scan Mode

Session - Video and still image input connections, Video and still image output connections
File - JPEG writer
UI - AVPreviewLayer view; Scan guide overlay
Still image connections are required for high quality photos, while video connections are required for quick feature detection.
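The requirements above lend themselves to a lookup table keyed by mode. The following is a minimal sketch of that idea; every type and case name in it is hypothetical, and it only encodes the lists above rather than any real AV Foundation objects.

```swift
enum CaptureMode: CaseIterable { case video, audio, photo, scan }

// Hypothetical labels for the connection, writer and view types listed above.
enum Connection: Hashable { case video, audio, stillImage }
enum FileWriter { case videoAsset, audioAsset, jpeg }
enum DisplayView { case openGL, previewLayer }

struct ModeRequirements {
    let inputs: Set<Connection>
    let outputs: Set<Connection>
    let writer: FileWriter
    let view: DisplayView
    let showsScanGuide: Bool
}

extension CaptureMode {
    // One entry per mode, mirroring the Session / File / UI lines above.
    var requirements: ModeRequirements {
        switch self {
        case .video:
            return ModeRequirements(inputs: [.audio, .video], outputs: [.video],
                                    writer: .videoAsset, view: .openGL,
                                    showsScanGuide: false)
        case .audio:
            return ModeRequirements(inputs: [.audio], outputs: [.audio],
                                    writer: .audioAsset, view: .openGL,
                                    showsScanGuide: false)
        case .photo:
            return ModeRequirements(inputs: [.stillImage], outputs: [.stillImage],
                                    writer: .jpeg, view: .previewLayer,
                                    showsScanGuide: false)
        case .scan:
            return ModeRequirements(inputs: [.video, .stillImage],
                                    outputs: [.video, .stillImage],
                                    writer: .jpeg, view: .previewLayer,
                                    showsScanGuide: true)
        }
    }
}
```

Keeping the requirements in one place like this makes it easy to see what two modes have in common, which is what a reconfiguration step needs to know.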

Building a Dynamically Configurable Pipeline For Different Capture Modes
As shown above, video recording sessions require video and audio connectors, while document scanning sessions require video and still image connectors. When the user switches from video to scan mode, the session exchange keeps the video connectors in place, tears down only the audio connection and adds the required still image connection, so the pipeline is never rebuilt from scratch. The detection filter is chained into the video data pass-through, enabling feature detection.
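The reconfiguration step amounts to a set difference between the connections currently in place and those the target mode needs. Here is a minimal sketch, assuming connections can be compared by kind; `Connection`, `ReconfigurationPlan` and `plan(from:to:)` are hypothetical names introduced for illustration.

```swift
enum Connection: Hashable { case video, audio, stillImage }

struct ReconfigurationPlan: Equatable {
    let keep: Set<Connection>    // left untouched across the switch
    let remove: Set<Connection>  // torn down
    let add: Set<Connection>     // newly created
}

func plan(from current: Set<Connection>, to target: Set<Connection>) -> ReconfigurationPlan {
    ReconfigurationPlan(
        keep: current.intersection(target),
        remove: current.subtracting(target),
        add: target.subtracting(current)
    )
}

// Video mode to scan mode, using the connector sets described above.
let videoMode: Set<Connection> = [.video, .audio]
let scanMode: Set<Connection> = [.video, .stillImage]
let videoToScan = plan(from: videoMode, to: scanMode)
// keep: video, remove: audio, add: stillImage
```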

The session exchange also coordinates the user interface objects, media format helpers and file writers. Another example is the switch from Audio to Photo mode, which removes audio connections, adds still image connections and swaps the OpenGL viewer for Apple’s default AVPreviewLayer. The audio writer observes a tear-down notification and safely completes the recording of the data stream. The captured media file is then queued for upload to Box. In the meantime, the user might already be snapping photos for their next upload. Under certain memory conditions, we tear down underlying dormant components and recreate them when the mode is reactivated.

[Diagram: overview of the shared components we manage between modes]

Our Capture system architecture is composed of reusable session components that we can quickly assemble in these different combinations to support any selected capture mode. Our recording components stack dynamically, sharing format converters and asynchronous task managers that would otherwise be duplicated.

Document Detection
As shown in the 'scan mode' diagram above, the AVCaptureSession provides access to the video stream, allowing us to perform realtime feature identification in image frames. We use Apple's CIDetector class, which identifies notable features like faces or rectangles, as an image filter and apply it to the video frames. If a feature is found, CIDetector returns a CIFeature object (for rectangles, a CIRectangleFeature) that gives us the rectangle's corner points and location. We built our own filtering system on top of this to vet the best fitting rectangles.
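The article does not spell out how candidate rectangles are vetted, so the sketch below shows just one plausible heuristic, stated as an assumption: discard quadrilaterals that cover too little of the frame, then keep the largest. `Point`, `Quad`, `bestCandidate` and the 20% default threshold are all hypothetical; the corner names simply mirror the corner properties of CIRectangleFeature.

```swift
struct Point {
    let x: Double
    let y: Double
}

// Four corners, named after the corner properties of CIRectangleFeature.
struct Quad {
    let topLeft, topRight, bottomRight, bottomLeft: Point

    // Shoelace formula over the corners taken in order around the shape.
    var area: Double {
        let p = [topLeft, topRight, bottomRight, bottomLeft]
        var sum = 0.0
        for i in 0..<4 {
            let a = p[i], b = p[(i + 1) % 4]
            sum += a.x * b.y - b.x * a.y
        }
        return abs(sum) / 2
    }
}

// Assumed heuristic: ignore small candidates, then prefer the largest one.
func bestCandidate(among quads: [Quad],
                   frameArea: Double,
                   minimumCoverage: Double = 0.2) -> Quad? {
    quads
        .filter { $0.area / frameArea >= minimumCoverage }
        .max { $0.area < $1.area }
}
```

A production filter would likely also consider aspect ratio, skew and stability across consecutive frames; those are left out here to keep the example short.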

To ultimately deliver what we detect in the video frames as high resolution files, we added the still image connections to our 'scan' mode. The still image output has its own coordinate system, so we needed to map the detected rectangles to the new resolution. We managed this by adding transforms to the pipeline. We also use this data to draw a live visual overlay that shows the user which rectangles we are detecting. In the end, we achieved high quality scans that work across a range of low-light and motion-sensitive environments.
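As a rough illustration of the mapping step, the sketch below scales a corner point from the video frame's pixel space into the still image's pixel space. It assumes both outputs share the same orientation and aspect ratio, so a scale-only transform is enough; a real pipeline may also need rotation or axis flips. `Point`, `Size` and `ScaleTransform` are hypothetical types.

```swift
struct Point: Equatable {
    let x: Double
    let y: Double
}

struct Size {
    let width: Double
    let height: Double
}

// A scale-only transform between two pixel coordinate spaces.
struct ScaleTransform {
    let sx: Double
    let sy: Double

    init(from source: Size, to destination: Size) {
        sx = destination.width / source.width
        sy = destination.height / source.height
    }

    func apply(_ p: Point) -> Point {
        Point(x: p.x * sx, y: p.y * sy)
    }
}
```

For example, with made-up sizes of 1280x720 for the video frame and 3840x2160 for the still image, a corner detected at (100, 50) in the video frame maps to (300, 150) in the still image. The same transform is applied to all four corners of a detected rectangle.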

We also built tools for users to polish the document: they can crop and shade files manually and, when they are ready, the app stitches the final document into a PDF and saves it directly to Box.

Capture is Built Entirely on Our Mobile iOS SDKs
Building Box Capture and expanding its capabilities are exciting milestones for mobile developers building on Box. Because we built the app using our mobile iOS SDKs, developers can use the same SDKs to build apps that power new workflows and information sharing across distributed teams. Our SDKs are modular in design and allow developers to quickly integrate key Box functionality into their apps while maintaining Box's preview, browsing and sharing experiences. Customers and partners are using Box Mobile SDKs to build custom mobile applications that power their business.

As a team, we decided to move out of our comfort zone and extend ourselves with this ambitious challenge. We are happy to deliver Box Capture v1.2, available today on the App Store.