A fellow Box engineer and I were looking into a CRC32 problem that puzzled us, and we wanted to share our discoveries given that there are few resources available in one place regarding the CRC32 checksum that give detail about its variations.
In short, there is no standard Cyclic Redundancy Check (CRC) specification. Worse, when Unix cksum generates a CRC value, it won't match any others besides its own.
Background: We had been working on an experimental new upload method called “zip uploading,” in which the client creates a zip archive of files and folders, then uploads it to Box. The zip archive is then unzipped on the server-side and its contents are added to one’s Box account. During development of this upload method, we discovered that zip archives store a CRC checksum for each file in the archive. Not only do we generate a SHA-1 checksum for every file normally uploaded to Box, but we also generate a CRC checksum (using the Unix cksum utility).
We thought that we might be able to take advantage of these two pre-computed checksums, since we would be able to quickly tell if a file uploaded via zip-uploading was likely identical to an existing file in the same location (note: it’s not a good idea to rely on CRC checks entirely, due to the possibility of collisions). We were going to check the CRCs of files in the target location against CRCs stored in the zip header as the files were extracted. We expected that Unix cksum would generate the same kind of CRC32 that zip does. However, this was not true - cksum says it generates CRCs the way Ethernet does in its man page, which is different than zip. The crazy part of this puzzle is that cksum actually generates CRC32 checksums that match no CRC specification at all, including Ethernet’s.
We investigated using this test data file to generate CRCs, which contains the characters "asd" and a newline. You can verify this datafile by running "hexdump -b test" which should yield the following values: 141 163 144 012. This file generated the following CRCs (the colors help identify each checksum throughout this article):
- Asking PHP ZipArchive for the crc32 of the test file within a zip: [crc] => 355327694(0x152ddece)
- Running a free crc32 generator on my local 32-bit Windows resulted in: 355327694
- Running PHP hash_file('crc32b', $filename) resulted in: 355327694 ('crc32b' is commonly used in zip and other software CRCs)
- Running PHP hash_file('crc32', $filename) resulted in: 1531968134 (0x5b4ffa86) ('crc32' is commonly used in Ethernet packet CRCs)
- Running Unix cksum on the file on both 64- and 32-bit linux machines resulted in: 1814461271 (0x6c267b57)
We discovered that PHP's hash_file() uses a library called mhash, so we downloaded the source code and found it contained two precomputed tables for fast exponentiation of the CRC polynomial:
- The table 'crc32_table_b' is commented "This polynomial DOES generate the same CRC values as ZMODEM and PKZIP".
- The table 'crc32_table' is commented "This polynomial is used at: AUTODIN II, Ethernet, & FDDI". Also, this table is identical to the one that appears in the source code of our version of cksum (in coreutils 5.97)
What bugged us was that the CRC values computed by cksum and mhash do not match even though the identical polynomial tables are used. So we decided to investigate, because it would be nice if we could easily transform these CRCs to match each other - perhaps there was an extra bit flipped somewhere in the implementation...
Not exactly. Unix cksum does things very differently, even though it shares mhash's 'crc32' table. We modified a Java example to include code ported from mhash and cksum, to figure out what was going on. We were able to produce all three CRCs generated earlier, using the Java code we had written. This code outputs the following; CRCs are in hex, and match up to the previous decimal values by color:
CRC32 (via table lookup) = 5b4ffa86
CRC32b (via table b lookup) = 152ddece
cksum (via cksum craziness) = 6c267b57
CRC32b (via direct calculation) = 152ddece
CRC32b (via Java's library) = 152ddece
We discovered that Unix cksum differs from other CRC generators because it uses the polynomial table twice; it puts the file's bytes through the polynomial, but then also puts the length of the file through the polynomial as well, which is not mentioned anywhere in any documentation. Other things differ too (see the attached Java if interested; you can run its main() method which has the test file's bytes hardcoded into it - you can also use pass bytes in the first argument passed if you uncomment line 151, and comment 152).
Verdict: To check files in a zip archive against files on Box using the CRC in zip, we'll need to use a utility that can calculate the "crc32b” variant for a file (PHP's hash_file() would work, but it will load the whole file into memory).
Update: Perl can do it, and generates the same crc32 as zip. Here's a comparison with cksum:
We benchmarked it loosely using a 1.2GB file on one of our machines; cksum takes 46 seconds, while Perl takes 36 seconds. Perl’s execution appears to be less CPU intensive as well. While running Unix cksum, CPU usage over time looks like the following, via top:
$ top | grep cksum PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND ... 14843 rluecke 18 0 58924 564 432 D 19.3 0.1 0:05.76 cksum 14843 rluecke 18 0 58924 564 432 R 23.3 0.1 0:06.46 cksum ...
versus Perl's CPU usage over time:
$ top | grep perl PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND ... 7730 rluecke 18 0 80468 2488 1544 R 9.3 0.2 0:04.10 perl 7730 rluecke 18 0 80468 2488 1544 D 11.0 0.2 0:04.43 perl ...
Deriving too much in the way of performance conclusions from this test is probably not a good idea, however you can safely assume that Perl and the native cksum perform at similar speeds at the very least. You can find the Perl script to generate CRC32s here. To run it, you will need to make/install the String::CRC32 package from CPAN.
Post by Ryan Luecke (Software Engineer) and James Lyons (Software Engineer)