MS Office Documents — 2007 and Beyond

I’ve been experimenting with my new ZIP.BAR file. Right now, it only characterizes the ZIP archive data and central directory structure. Eventually, I’ll want to support the full DEFLATE algorithm, allowing you to effectively use BARfly to unzip files!

In the process of seeing what I could deserialize with ZIP.BAR, I discovered that it actually works on Office 2007 documents. Remember, before 2007, these documents were in OLE2 format. But the OLE2 deserializer wouldn’t touch the newer documents. No wonder: they’re not OLE2 files, they’re ZIP files!
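You can tell the two containers apart from the first few bytes of a file. The sketch below is plain Python, not BARfly or ZIP.BAR, and the function names are hypothetical. It checks for the OLE2 compound-file signature and for the ZIP local file header signature, and it builds a tiny ZIP in memory as a stand-in for a real Office 2007 document.

```python
import io
import zipfile

# Well-known file signatures ("magic numbers").
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header


def sniff_container(data: bytes) -> str:
    """Guess the container type from the leading bytes (hypothetical helper)."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def make_sample_zip() -> bytes:
    """Build a small in-memory ZIP standing in for an Office 2007 file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


if __name__ == "__main__":
    print(sniff_container(make_sample_zip()))
    print(sniff_container(OLE2_MAGIC + b"\x00" * 8))
```

A check like this only looks at the start of the file, so an empty archive or one with data prepended would come back as "unknown".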

It’s fortunate that the most common compression algorithms used in ZIP files are “Stored” (type 0) and “DEFLATE” (type 8). Nothing causes implementation chaos so much as obscure compression algorithms, especially ones that are difficult to code or understand. Take JPEG, for instance. Before you get a single image pixel, you must go through some 7 different decoding steps! DEFLATE is not as hard, but it’s going to take more than a day or two to get it implemented.
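To check which compression types an archive really uses, you can walk the central directory and read the method field of each entry. Here is a rough Python sketch of that walk, assuming a simple archive (single disk, no Zip64, no archive comment that happens to contain the end-of-central-directory signature). It is not how ZIP.BAR does it, and the names are hypothetical.

```python
import io
import struct
import zipfile

# Fixed-size parts of the two records, little-endian, per the ZIP format.
EOCD = struct.Struct("<4sHHHHIIH")           # end of central directory, 22 bytes
CDFH = struct.Struct("<4sHHHHHHIIIHHHHHII")  # central directory header, 46 bytes

METHOD_NAMES = {0: "Stored", 8: "DEFLATE"}


def list_methods(data: bytes):
    """Return (name, method) pairs read from the central directory."""
    eocd_pos = data.rfind(b"PK\x05\x06")
    if eocd_pos < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = EOCD.unpack_from(data, eocd_pos)
    total_entries, cd_offset = fields[4], fields[6]

    entries = []
    pos = cd_offset
    for _ in range(total_entries):
        header = CDFH.unpack_from(data, pos)
        if header[0] != b"PK\x01\x02":
            raise ValueError("bad central directory signature")
        method = header[4]
        name_len, extra_len, comment_len = header[10], header[11], header[12]
        name_start = pos + CDFH.size
        name = data[name_start:name_start + name_len].decode("utf-8", "replace")
        entries.append((name, METHOD_NAMES.get(method, f"method {method}")))
        # Variable-length name, extra field, and comment follow the header.
        pos = name_start + name_len + extra_len + comment_len
    return entries


def make_sample_zip() -> bytes:
    """Build an in-memory archive with one stored and one deflated entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", "hello", compress_type=zipfile.ZIP_STORED)
        zf.writestr("b.xml", "<x/>" * 100, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


if __name__ == "__main__":
    for name, method in list_methods(make_sample_zip()):
        print(name, method)
```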

Once Office 2007 documents are uncompressed, they are composed largely of XML. Therein lies another challenge. BAR stands for Binary Artifact Reference, which makes it fine for binary files, but not so great for text files. I’ve come up with XML.BAR, which I haven’t yet released, but even that I.F. is rather clunky, because XML is itself clunky.
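To give an idea of what sits inside the archive, here is a small Python sketch that pulls the text out of a Word-style part using the standard library's ZIP and XML modules. The sample package is built in memory and is deliberately incomplete (a real Word 2007 file carries more parts than this one), and the helper names are hypothetical. The part name `word/document.xml` and the `w:` namespace are the ones Word documents normally use.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in Word documents).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_sample_package() -> bytes:
    """Build a stripped-down, in-memory stand-in for a Word 2007 file."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>BAR</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", document_xml)
    return buf.getvalue()


def extract_text(package: bytes, part: str = "word/document.xml") -> str:
    """Unzip one part and join the contents of its w:t (text run) elements."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read(part))
    return "".join(t.text or "" for t in root.iter(f"{{{W_NS}}}t"))


if __name__ == "__main__":
    print(extract_text(make_sample_package()))
```

Even this toy example shows the layering: a binary container on the outside, namespaced text markup on the inside.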

People have come up with many solutions for dealing with XML. Native language support for regular expressions is always a plus, as with Perl or PHP. But regular expressions are themselves hard to read. There are also custom XML compression algorithms, which means back to binary again. There’s also Base64, which uses text to represent binary, which might ultimately be used to represent text (again)! Binary to text to binary to text to binary… can’t we all just get along?

I am working on the next version of the BAR protocol, which allows organized and unorganized blocks to optionally specify their syntax as regular expressions (the expressions ultimately compile to the same construct representations used in the current version). This will allow BAR to characterize both binary and text formats equally well.
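As a rough illustration of one notation covering both worlds (this is Python's `re` module, not BAR syntax, and the pattern and function names are hypothetical), the same regular expression machinery can describe the start of a ZIP local file header in bytes and an XML start tag in text:

```python
import re

# Binary: ZIP local file header signature followed by three 2-byte fields.
# re.DOTALL lets "." match any byte, including newline bytes.
LOCAL_HEADER = re.compile(
    rb"PK\x03\x04(?P<version>..)(?P<flags>..)(?P<method>..)", re.DOTALL
)

# Text: a simplified XML start tag (or empty-element tag). This is a
# deliberately loose pattern, not a full XML grammar.
XML_START_TAG = re.compile(r"<(?P<name>[A-Za-z_][\w:.-]*)(?P<attrs>[^<>]*?)/?>")


def compression_method(data: bytes):
    """Return the compression method from a local file header, or None."""
    m = LOCAL_HEADER.match(data)
    if m is None:
        return None
    return int.from_bytes(m.group("method"), "little")


def start_tag_names(text: str):
    """Return the element names of all start tags found in the text."""
    return [m.group("name") for m in XML_START_TAG.finditer(text)]


if __name__ == "__main__":
    header = b"PK\x03\x04\x14\x00\x00\x00\x08\x00"
    print(compression_method(header))
    print(start_tag_names("<w:document><w:body/></w:document>"))
```

Named groups play the role of fields here; whether a binary field is then read as an integer or a text group as a name is a separate step layered on top of the match.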

MS Office documents may be tough to handle, but I’m tougher. You wait and see…
