Making PHP Safer: Introducing Augmented Types

Introducing PHP Augmented Types, a PHP extension that enforces PHPDoc-style type annotations at runtime.
At Box, we depend heavily on PHP. While we have a diverse set of services written in a variety of languages, all external requests and most other significant processes end up interacting with our PHP web application. For better or for worse, PHP has been at the core of Box's architecture since the beginning.
Scaling our ever-growing PHP codebase creates a unique set of challenges. We have around 100 engineers who actively contribute to about 500,000 lines of PHP code in our web application. While we do everything in our power to buy safety in such a dynamic language (primarily through a strong focus on testing), we consistently find ourselves envious of the properties that a statically typed language would give us. Language envy struck hard when we started working with Scala, whose type system lets you write code that is virtually provably correct. Even working in C I found myself jealous, because it has (mostly) static types checked at compile time. Simply put, a consistent type system greatly boosts understandability, productivity, and safety in large codebases. Nowhere is this more apparent than when looking back at past issues caused by inconsistent types and our more byzantine legacy code paths.
We have made efforts in the past to improve runtime type safety. We strongly encourage typehints in function signatures, and we developed a handy assertion library that is widely used in our webapp. However, both approaches have drawbacks: PHP function typehints are limited to objects and arrays, and it is unreasonable to place assertions everywhere. Recently, we have taken a different approach straight from Facebook's playbook: modify the language itself. Fortunately, our particular needs didn't require us to fork PHP (more on this later); we only had to build an extension.
Today, I'm extremely excited to open-source this extension, dubbed Augmented Types, which creates a type system on top of PHPDoc function annotations.

Enter Augmented Types

Augmented Types provides a way of enforcing argument and return types on function calls in PHP. It parses a function's type information from PHPDoc-style type annotations (based on the phpDocumentor project) and stores this information efficiently for enforcement at runtime. With this extension, we can ensure that every function and instance method in our codebase accepts and returns values of statically known types. Valid types include primitive types (int, string, bool, etc.), class names, array types (int[], FooBar[]), null, disjunctions (int|string, FooBar|null), void, and more. Full documentation can be found on the project's GitHub page.
Before starting on this project, we had three key requirements. First and foremost, we did not want to fork PHP or modify its abstract syntax tree; either would create a maintainability nightmare. Second, the extension must not significantly impact runtime performance. Third, the type system had to be easily extensible. Each of these requirements created interesting engineering challenges, and they warrant a deeper dive into the internals of Augmented Types.
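To make the annotation format concrete, here is a minimal sketch in C of pulling the type expressions out of a PHPDoc-style comment. The docblock, the function name `docblock_type`, and the whitespace-delimited parsing rule are all assumptions made for illustration; this is not the extension's actual parser, which is grammar-based.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical docblock, written the way a PHPDoc comment might look. */
static const char *EXAMPLE_DOCBLOCK =
    "/**\n"
    " * @param int|string $id\n"
    " * @param FooBar[] $items\n"
    " * @return FooBar|null\n"
    " */";

/*
 * Copy the type expression that follows the index-th (0-based) occurrence
 * of `tag` into `out`. Assumes the type contains no whitespace.
 * Returns 1 on success, 0 if the tag is missing or the type does not fit.
 */
static int docblock_type(const char *doc, const char *tag, int index,
                         char *out, size_t out_size)
{
    size_t tag_len, len;
    const char *p = doc;

    assert(doc != NULL && tag != NULL && out != NULL);
    tag_len = strlen(tag);

    while ((p = strstr(p, tag)) != NULL) {
        p += tag_len;
        /* Require whitespace after the tag so "@param" skips "@parameter". */
        if (*p != ' ' && *p != '\t') continue;
        /* Skip earlier occurrences until we reach the requested one. */
        if (index-- > 0) continue;
        p += strspn(p, " \t");
        len = strcspn(p, " \t\r\n");
        if (len == 0 || len >= out_size) return 0;
        memcpy(out, p, len);
        out[len] = '\0';
        return 1;
    }
    return 0;
}
```

Calling `docblock_type(EXAMPLE_DOCBLOCK, "@param", 1, buf, sizeof buf)` copies `FooBar[]` into `buf`. Turning that string into something enforceable is the job of the type grammar.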

The Zend Engine and the Unbounded Power of Zend Extensions

Satisfying our first requirement of not modifying PHP turned out to be astonishingly easy, but a little knowledge of PHP internals is required to understand the mechanism. PHP is implemented entirely in C, and the Zend engine is the portion of the PHP source that defines how PHP is compiled and executed. In most other languages, the equivalent of the Zend engine would constitute the entire language, but the core of PHP also contains a few other pieces, such as a set of standard libraries and the interfaces it uses to talk to various web servers.
PHP is compiled, but not in the traditional sense of taking a bunch of source code and spitting out a binary. PHP code is compiled one file at a time, as source files are pulled in during the normal flow of execution. It is compiled down to opcodes, an intermediate representation analogous to Java bytecode, and opcodes are grouped by function. Once a PHP file is compiled, execution resumes from where it left off.
The Zend engine defines a single entry point for compiling PHP source files and a single entry point for executing groups of opcodes (which correspond to functions). Luckily for us, these two entry points are exposed as mutable global function pointers, meaning that we can completely change the way PHP is compiled and executed by manipulating these pointers from our extension. In Augmented Types, we wrap the compilation function pointer in order to compile the type information contained in the PHPDoc annotations of newly compiled code, and we wrap the execution function pointer in order to enforce the types of arguments and return values before and after each function is executed. All of this is done from our dynamically loaded extension; no modifications to outside sources were necessary.
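The pointer-wrapping pattern described above can be sketched in plain C without any PHP headers. Everything here is a stand-in: `op_array`, `compile_file_ptr`, `execute_ptr`, and the `at_*` functions are hypothetical names, and the structures are far simpler than the engine's (in the PHP 5 sources the real pointers are `zend_compile_file` and `zend_execute`). The sketch only shows the save-then-replace idiom, not the extension's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the engine's compiled-function structure (hypothetical). */
typedef struct {
    const char *source;   /* file the opcodes came from */
    int has_type_info;    /* set by the extension's compile wrapper */
} op_array;

static op_array last_compiled;

/* Stand-ins for the engine's default compile and execute routines. */
static op_array *engine_compile_file(const char *filename)
{
    last_compiled.source = filename;
    last_compiled.has_type_info = 0;
    return &last_compiled;
}

static int engine_execute(op_array *ops)
{
    (void)ops;
    return 42;   /* pretend return value of the executed function */
}

/* The two mutable global entry points that an extension may overwrite. */
static op_array *(*compile_file_ptr)(const char *) = engine_compile_file;
static int (*execute_ptr)(op_array *) = engine_execute;

/* Originals saved by the extension so its wrappers can delegate. */
static op_array *(*saved_compile_file)(const char *);
static int (*saved_execute)(op_array *);
static int type_checks_run;

static op_array *at_compile_file(const char *filename)
{
    op_array *ops = saved_compile_file(filename);
    if (ops != NULL) {
        ops->has_type_info = 1;   /* annotations would be parsed here */
    }
    return ops;
}

static int at_execute(op_array *ops)
{
    int result;
    assert(ops != NULL);
    if (ops->has_type_info) type_checks_run++;   /* check arguments */
    result = saved_execute(ops);
    if (ops->has_type_info) type_checks_run++;   /* check return value */
    return result;
}

/* Called when the extension is loaded and unloaded, respectively. */
static void at_install(void)
{
    saved_compile_file = compile_file_ptr;
    saved_execute = execute_ptr;
    compile_file_ptr = at_compile_file;
    execute_ptr = at_execute;
}

static void at_uninstall(void)
{
    compile_file_ptr = saved_compile_file;
    execute_ptr = saved_execute;
}
```

Because each wrapper delegates to whatever pointer it found at install time, several extensions can chain their hooks in this style without knowing about each other.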

Making Augmented Types Performant

A very important part of getting good performance with PHP is using an opcode cacher. Briefly, opcode cachers are a class of PHP extensions that cache PHP opcodes between requests, eliminating the cost of compiling the same PHP source files on every request. To accomplish this, every opcode cacher must also wrap the Zend engine's compilation function pointer.
To guarantee that Augmented Types would not significantly impact performance, we had to ensure that it would work with opcode cachers. If a function's type information weren't cached along with its opcodes, we would either have to re-parse every function's type information on every request or, even worse, discard the type information altogether. To avoid both outcomes, we employ a clever little hack: we store each function's type information in a contiguous block of memory masquerading as a string constant. All opcode cachers must store string constants to preserve the original functionality of cached opcodes, so our type information always sticks around too.
We store and process type information as efficiently as we can manage, which keeps overhead relatively low. Unfortunately, it's hard to put an exact number on the performance degradation that Augmented Types causes, because it depends on the average amount of work performed by each function and on the types that you enforce. For example, checking whether something is an integer or a string takes a single memory reference, but enforcing that a value is an array of integers (denoted int[]) is non-trivial, because it requires iterating through the entire array. Likewise, if every function performs only a small amount of computation, there will be more function calls (and thus more time spent type-checking) relative to execution time. That being said, in various micro-benchmarks we have seen anywhere from a 2% to 50% performance degradation.
In our web application, we haven't been able to detect any performance hit, although Augmented Types is only used in about 4% of our codebase (more on this later). For us, the safety provided by Augmented Types more than justifies the slowdown.
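The string-constant trick works only if the stored type information contains no pointers, so that a byte-for-byte copy made by an opcode cacher is still valid. A minimal sketch of such a layout follows. The byte format (a two-byte marker, a return type tag, an argument count, then one tag per argument), the tag values, and the function names are all hypothetical, chosen only to illustrate the idea; the extension's real encoding is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical one-byte type tags. */
enum { T_INT = 1, T_STRING = 2, T_BOOL = 3, T_VOID = 4 };

#define AT_HEADER_SIZE 4   /* 'A', 'T', return tag, argument count */

/*
 * Pack a function signature into a flat, pointer-free byte buffer.
 * Returns the number of bytes written, or 0 if it does not fit.
 */
static size_t at_pack(unsigned char *buf, size_t cap, unsigned char ret,
                      const unsigned char *args, size_t argc)
{
    size_t need = AT_HEADER_SIZE + argc;

    assert(buf != NULL);
    if (argc > 255 || need > cap) return 0;
    buf[0] = 'A';
    buf[1] = 'T';
    buf[2] = ret;
    buf[3] = (unsigned char)argc;
    if (argc > 0) memcpy(buf + AT_HEADER_SIZE, args, argc);
    return need;
}

/* Validate the marker and length before trusting the contents. */
static int at_valid(const unsigned char *buf, size_t len)
{
    return len >= AT_HEADER_SIZE && buf[0] == 'A' && buf[1] == 'T' &&
           (size_t)buf[3] + AT_HEADER_SIZE == len;
}

/* Read the return type tag; returns 1 on success, 0 on a malformed blob. */
static int at_return_type(const unsigned char *buf, size_t len,
                          unsigned char *out)
{
    if (!at_valid(buf, len)) return 0;
    *out = buf[2];
    return 1;
}

/* Read the tag of argument i; returns 0 if i is out of range. */
static int at_arg_type(const unsigned char *buf, size_t len, size_t i,
                       unsigned char *out)
{
    if (!at_valid(buf, len) || i >= buf[3]) return 0;
    *out = buf[AT_HEADER_SIZE + i];
    return 1;
}
```

Copying the packed bytes elsewhere with `memcpy` and discarding the original leaves the copy fully readable, which is the property the trick depends on when a cacher duplicates string constants into its own memory.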

Designing an Extensible Type System

In order to make our type system as powerful as possible, we decided to employ flex and bison (a popular lexer generator and parser generator, respectively). Flex and bison allowed us to define our type system as a full context-free grammar, which makes it easy to compose types and to add new ones. This enables types such as an array whose elements are each either an integer or an array of Foo objects (denoted (int|Foo[])[]). There is no intrinsic limit on the type system: any class of values that can be identified at runtime can be expressed in, or added to, the grammar.
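One way to picture what a composable grammar produces is a small tree of type nodes with a recursive checker. The sketch below builds the tree for (int|Foo[])[] by hand rather than with flex and bison, and the `value` and `type` structures are simplified, hypothetical stand-ins for PHP values and for whatever representation the extension really uses. Class matching here is by exact name only, ignoring inheritance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a runtime PHP value. */
typedef enum { V_INT, V_STRING, V_OBJECT, V_ARRAY } value_kind;

typedef struct value {
    value_kind kind;
    const char *class_name;       /* V_OBJECT only */
    const struct value *items;    /* V_ARRAY only */
    size_t count;                 /* V_ARRAY only */
} value;

/* Type expressions form a tree, so they compose freely. */
typedef enum { TY_INT, TY_STRING, TY_CLASS, TY_ARRAY_OF, TY_OR } type_kind;

typedef struct type {
    type_kind kind;
    const char *class_name;       /* TY_CLASS only */
    const struct type *left;      /* TY_ARRAY_OF: element; TY_OR: first option */
    const struct type *right;     /* TY_OR: second option */
} type;

/* Returns 1 if value v satisfies type t, 0 otherwise. */
static int type_accepts(const type *t, const value *v)
{
    size_t i;

    switch (t->kind) {
    case TY_INT:
        return v->kind == V_INT;            /* constant-time tag check */
    case TY_STRING:
        return v->kind == V_STRING;
    case TY_CLASS:
        return v->kind == V_OBJECT &&
               strcmp(v->class_name, t->class_name) == 0;
    case TY_ARRAY_OF:
        if (v->kind != V_ARRAY) return 0;
        for (i = 0; i < v->count; i++) {    /* must visit every element */
            if (!type_accepts(t->left, &v->items[i])) return 0;
        }
        return 1;
    case TY_OR:
        return type_accepts(t->left, v) || type_accepts(t->right, v);
    }
    return 0;
}

/* (int|Foo[])[] built by hand; a parser would produce an equivalent tree. */
static const type TYPE_INT       = { TY_INT, NULL, NULL, NULL };
static const type TYPE_FOO       = { TY_CLASS, "Foo", NULL, NULL };
static const type TYPE_FOO_ARRAY = { TY_ARRAY_OF, NULL, &TYPE_FOO, NULL };
static const type TYPE_ELEMENT   = { TY_OR, NULL, &TYPE_INT, &TYPE_FOO_ARRAY };
static const type TYPE_EXAMPLE   = { TY_ARRAY_OF, NULL, &TYPE_ELEMENT, NULL };
```

Adding a new kind of type in this design means adding one node kind and one case in the checker. The `TY_ARRAY_OF` case also shows why array types cost more to enforce than primitives: the loop touches every element, and nested arrays are walked recursively.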

Augmented Types at Box and Beyond

To aid in our initial internal rollout of Augmented Types at Box, we built a mechanism for selectively enforcing types in specific files and directories. Documenting the types that functions take and return is a slow, uphill battle, but we have succeeded in making around 4% of our PHP codebase AT-compliant in a couple of months. In addition, almost all new PHP code written at Box is AT-compliant. We use Augmented Types in our development and test environments; however, we leave it out of production, as we don't gain much from its presence there. Augmented Types is first and foremost a tool for developers, and we don't develop in our production environment.
We're really excited to give back to the PHP community. Augmented Types has been a great source of value for us, and we hope that it will be for others outside of Box too. In the future, we hope to build more tools around Augmented Types (such as static analysis utilities), and we will readily accept outside contributions. Open-sourcing Augmented Types is the beginning of its journey, not the end! Thanks for reading!