MS Office Documents — 2007 and Beyond

June 30th, 2009

I’ve been experimenting with my new ZIP.BAR file. Right now, it only characterizes the ZIP archive data and central directory structure. Eventually, I’ll want to support the full DEFLATE algorithm, allowing you to effectively use BARfly to unzip files!

In the process of seeing what I could deserialize with ZIP.BAR, I discovered that it actually works on Office 2007 documents. Remember, before 2007, these documents were in OLE2 format. But the OLE2 deserializer wouldn’t touch the newer documents. No wonder–they’re not OLE2 files–they’re ZIP files!
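For the curious, the difference shows up in the very first bytes of the file. Here is a minimal Python sketch (not BAR script; the function name `sniff_container` is made up for illustration) that tells an OLE2 compound file from a ZIP archive by its signature. It assumes a non-empty ZIP, which begins with a local file header.

```python
# Leading signatures of the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file header
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header


def sniff_container(head: bytes) -> str:
    """Classify a file by its first few bytes (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


if __name__ == "__main__":
    print(sniff_container(OLE2_MAGIC + b"\x00" * 8))  # pre-2007 style document
    print(sniff_container(ZIP_MAGIC + b"\x14\x00"))   # 2007 style document
```

Feed it the first eight bytes of a .doc and a .docx and the two should land in different buckets.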

It’s fortunate that the most common compression algorithms used in ZIP files are “Stored” (type 0) and “DEFLATE” (type 8). Nothing causes implementation chaos so much as obscure compression algorithms, especially ones that are difficult to code or understand. Take JPEG, for instance. Before you get a single image pixel, you must go through some 7 different decoding steps! DEFLATE is not as hard, but it’s going to take more than a day or two to get it implemented.
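To show where those type numbers live, here is a rough Python sketch (again, not ZIP.BAR itself) that finds the end-of-central-directory record, walks the central directory, and reports each entry's compression method. The helper names `make_sample_zip` and `list_methods` are hypothetical, and the sketch assumes a single-disk, non-ZIP64 archive with plain ASCII names.

```python
import io
import struct
import zipfile

# Fixed-size parts of the two records, little-endian, per the ZIP layout.
EOCD = struct.Struct("<IHHHHIIH")          # end of central directory (22 bytes)
CDH = struct.Struct("<IHHHHHHIIIHHHHHII")  # central directory header (46 bytes)
METHOD_NAMES = {0: "Stored", 8: "DEFLATE"}


def make_sample_zip() -> bytes:
    """Build a two-entry archive in memory, one entry per method."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", b"hello", compress_type=zipfile.ZIP_STORED)
        zf.writestr("b.txt", b"hello" * 100, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def list_methods(data: bytes):
    """Return (name, compression method) for every central directory entry."""
    pos = data.rfind(b"PK\x05\x06")  # EOCD signature, searched from the end
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = EOCD.unpack_from(data, pos)
    total_entries, offset = fields[4], fields[6]
    entries = []
    for _ in range(total_entries):
        hdr = CDH.unpack_from(data, offset)
        if hdr[0] != 0x02014B50:
            raise ValueError("bad central directory signature")
        method = hdr[4]
        name_len, extra_len, comment_len = hdr[10], hdr[11], hdr[12]
        start = offset + CDH.size
        name = data[start:start + name_len].decode("cp437")
        entries.append((name, method))
        # Variable-length name, extra field, and comment follow each header.
        offset = start + name_len + extra_len + comment_len
    return entries


if __name__ == "__main__":
    for name, method in list_methods(make_sample_zip()):
        print(name, method, METHOD_NAMES.get(method, "other"))
```

Note that this only characterizes the archive. Actually inflating a type 8 entry is the part that takes more than a day or two.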

Once Office 2007 documents are uncompressed, they are composed largely of XML. Therein lies another challenge. BAR stands for Binary Artifact Reference, which makes it fine for binary files, but not so great for text files. I’ve come up with XML.BAR, which I haven’t yet released, but even that I.F. is rather clunky, because XML is itself clunky.

The solutions that people have come up with to deal with XML are many. Native language support for regular expressions is always a plus, like with Perl or PHP. But regular expressions are themselves hard to read. There are also custom XML compression algorithms–which means, back to binary again. There’s also Base64, which uses text to represent binary which, ultimately, might be used to represent text (again)! Binary to text to binary to text to binary…can’t we all just get along?
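The text-to-binary-to-text merry-go-round is easy to see in a few lines of Python. This is only an illustrative sketch, and the two function names are made up: text becomes UTF-8 bytes, the bytes become Base64 text, and then the whole thing unwinds again.

```python
import base64


def text_to_base64(text: str) -> str:
    """Text -> UTF-8 bytes -> Base64 bytes -> ASCII text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def base64_to_text(encoded: str) -> str:
    """Base64 text -> raw bytes -> the original text."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    original = "<doc>hello</doc>"
    wrapped = text_to_base64(original)
    print(wrapped)                  # text standing in for binary
    print(base64_to_text(wrapped))  # and back to the text we started with
```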

I am working on the next version of the BAR protocol, which allows organized and unorganized blocks to specify syntaxes in regular expression format as a possible option (expressions ultimately compile to the same construct representations used in the current version). This will allow BAR to characterize both binary and text formats equally well.

MS Office documents may be tough to handle, but I’m tougher. You wait and see…

What’s in a BAR?

June 16th, 2009

Files, files, and more files. Am I contributing to implementation chaos by creating yet another file format? Not really. BAR is one of those formats that, like XML Schema, can have a “schema for itself.” You can get the BAR format specification on my website.

You have your JAR files and your BAR files. You have your EXE files and your OUT files. Code is code, data is data, and there always seems to be yet one more mechanism for interpreting the huge numbers of bytes that fly across your computer every day.

As for BAR, it has code and data, but unlike other mechanisms for general-purpose code execution or data representation, BAR is intended to be ultra-light. This means that whatever it takes to represent the contents of a file, be it code, markup, or some combination of the two, BAR gets it done with the least amount of resources.

Specifically, a BAR implementation file contains the following:

1) Header: information about the file.
2) Value table: long integers to be looked up later (enumerated constants, tokens, and critical-step UIDs).
3) Construct definitions: the “building blocks” of a file format, for the most part data representation.
4) Variable definitions: individual named quantities used to store data in structures, local variables, global variables, and parameter variables.
5) Enumerated constant definitions: collections of enumerated constants.
6) Function definitions: support functions (critical-step methods and utility functions).
7) Compiled script byte code.
8) String look-up table.
9) (Optional) The uncompiled source code.

And the beauty of all this? It fits into only a few kilobytes. Many BAR implementation files are only around 5,000 bytes!

Platform independence is a nice thing to have. It’s also somewhat taxing to design for platform independence. Many platforms, notably the Java runtime, are HUGE. This is because the runtime needs to do many things that a program is not expected to store in its own code, like compiled low-level calculation routines, interrupt callback handling, low-level network processing, etc. The operating system might provide similar services, or it might not.

What about the BAR engine’s runtime? The Win32 static library version weighs in at only a few hundred kilobytes.

The main reason the BAR engine and associated BAR implementation files are so light is that the purpose is limited. You are NOT trying to display an impressive front-end, handle real-time input, program hardware, or handle networking directly. The sole purpose of a BAR I.F. is to represent information. You can tweak a BAR I.F. to perform conversions to “cook” the data in a file, but this is optional. Use BAR to characterize primarily, and convert secondarily.

Right now, there are about 25 small, basic implementation files on the BARfly website. Not all the file formats, but still enough to characterize much of the information floating around the web. Take the average size of a BAR I.F., which is about 10 KB, and multiply by 100, and you’ve only got 1 MB worth of schemas to characterize virtually every format in existence!

It will be a good day when BAR becomes the end-all for representing file data. Contributions from BARfly Gold users will only make the job easier.

Dealing with MS Office documents

June 13th, 2009

Microsoft Office document formats have really proliferated over the years. Word, Excel, PowerPoint, Access, Visio…you get the idea. Why do formats proliferate? Two reasons:

1) Utility. The format is useful. It is capable of doing useful things, and software applications are in plentiful supply that allow people to do these useful things.
2) Widespread use. The popularity of a format, or at least the popularity of the applications that read and write it, determines what people more or less have to use.

It’s possible to have one and not have the other at all, but such formats are not as widely known or distributed. Why? Because there need to be good reasons for using the format. Simply wanting people to use a format is not enough–people decide on their own and come to their own conclusions about whether or not to use a format.

Now, Microsoft has hit both of the above points squarely on the nose. The formats are pretty useful, and Microsoft, being Microsoft, can dictate over a wide distribution network which formats individuals actually use. So, like it or not, continuing to use them still seems like a pretty good idea.

There was a recent “format war” between the likes of Microsoft, IBM, Sun, Apple, etc. about which office-oriented formats would be the most accepted and distributed. I look at these battles the way any foreigner to a region watches a long-running family feud play out: it seems so silly, so pointless, with people feeding their own local politics for no other reason than to fight. And, ultimately, accomplishing little to the benefit of you and me. I view format wars as pointless because to me, all formats, regardless of who “owns” them, are easy to read and write with BARfly.

The idea of the “open format,” which anybody is allowed to read and implement with their own software, significantly helps both utility and widespread use. But should ALL formats be open? If you really want that widespread distribution, you need to make something easy to use, easy to implement, easy to distribute, etc. So it helps, although it is by no means required, to release details of the format for other people to implement on their own platforms.

I recently released a BAR I.F. for OLE2. This is the foundation for MS Office documents leading up to about the year 2007, when Microsoft changed everything around yet again with “DOCX” and other formats. If documentation had not been released for easy implementation, this I.F. could not have been built in two days–try two months.

For software with inelastic popularity, no documentation release is necessary–utility is so strong that reverse engineers will hammer on the format day and night until it’s cracked. WAD is a good example.

But how far should a company go? Should a company release EVERY format specification, including internal stuff that isn’t designed to be distributed among users? Microsoft hasn’t done so, nor have most companies. Most companies plunk a series of nonsensical data files and configuration files in their own installation folders, thinking, “who cares what my end users think? They want a product.”

It all comes back to the goals. Remarkably vague “.DAT” files exist everywhere–and who knows what they contain. You can even say the same thing about “.XML” files, because, despite readability, utility is not necessarily known without a schema. The questions a company needs to ask are:

1) Does this format need to be open?
2) How much of my business depends on the format being proprietary?
3) How much of my business is directly or indirectly tied to sharing pertinent information?
4) Whose responsibility do I want it to be to support this format?
5) Is there a correlation between the utility/widespread use of the format and the ability to sell my products and services?

Note that these are not format design decisions. They are business decisions. Only after answering these must a company consider the two “proliferation” goals.

BARfly and the Free Software Movement

June 9th, 2009

On more than one occasion, I’ve had people suggest I offer my product for free. That is, completely for free, forever, with no theoretical limit to the free support that I’m also expected to give my constituents.

Okay…how do I say this? The Free Software Movement is composed of acts of charity. IT IS NOT A BUSINESS PHILOSOPHY. Everyone likes it (I know I sure do) when an entity, especially a large entity like Microsoft, decides to release something for free, like Express versions of their code compilers.

But how many people were petitioning Microsoft to release .NET Express for free? Surely people weren’t filing grievances, were they? Grievances for what?

Let’s take a look at how a hacker and a CEO approach offering their own software, meticulously scheduled for release with time-sensitive deadlines, and painstakingly developed on the company’s own dollar, for absolutely nothing:

1) Hacker: There’s a lot of software out there. I’m looking for something that serves my needs exactly. Not bloated, not invasive to my computer, not requiring half a zillion legal strings attached, not something that is going to make my computer vulnerable to attack. Something I can easily configure, something I can manipulate to exactly the ends I wish.
2) CEO: We see great opportunity for sales of the product. Seek out target markets, perform case studies, and assess how much individuals are willing to pay for the products we offer. Constantly seek out new business areas, new audiences, and serve their needs as appropriate. But whatever we do, NEVER, as a matter of business principle, simply GIVE away something off which we can make money!

Not much room for common ground, it would seem. It might seem disappointing, but the truth is, each person is connected directly or indirectly to both of these roles. You work for a company (or you own it), so you have to mind the store or ultimately conduct yourself in ways that contribute to minding the store. But you also want the best deal for your company. So, if you can get something for free, by all means, get it for free instead of spending money on it!

When Sun releases the Java runtime for free, why are they doing this? How does Microsoft benefit from releasing .NET Express for free? For the same reasons companies give away free stuff at conferences at their booths. For the same reasons supermarkets let you sample wares instead of making you purchase something you’ve never heard of before. For the same reasons drug dealers don’t make you pay full price for the first hit.

There are two major reasons why businesses will release their software for free. The first reason is that they get more business that way (more name recognition, more people trying the product, better feedback for improvements, low-cost form of advertising, etc). The second reason is that the marketability of the product is deemed insufficient, yielding a nothing-is-lost philosophy in releasing the software at little or no cost.

Should BARfly be free? Well, considering it’s the only product of its kind in the whole world, the answer is no. Economic scarcity dictates that a product in low supply should have a proportionally higher cost. But conversely, a product in low demand should also have a proportionally lower cost.

Of course, none of this is any reason not to develop free software. I’ve got plenty of that you can download at my personal site. My video game Vintage Hyperactive is one such example. It’s a simple project for a hobbyist, to be done in one’s spare time. That’s the “hacker” way of looking at it.

The “CEO” way of looking at it, of course, comes to the exact same conclusion: can I make money off VH? Probably not. So the game remains free!

BUT…if I make an iPhone version of VH, people might buy it. So that version would NOT be free.

Make sense?

A New Beginning

June 8th, 2009

Okay, people; BARfly Bronze Demo is now posted. Feel free to download and comment here as needed.

I’ve given the general public some basic versions of common file formats to play with. The detailed versions of the same file formats I’m saving for those who want to purchase them. For example, a detailed version of GIF performs LZW to generate pixel data, and a detailed version of MIDI parses the individual MIDI commands in each track!

There is no limit to how “low level” BARfly can go when it comes to analyzing data. Even better is the fact that the deserialization process is kind of like a “Wonkavator”: it can parse forwards, backwards, sideways, slantways, you name it! How else can a schema handle the likes of nonlinear formats like TIFF, ICO, and WAD?
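To make “nonlinear” concrete, take WAD: the header points to a directory that usually sits at the end of the file, and each directory entry points back to lump data stored earlier. The Python sketch below is only an illustration of that jumping around, not how BARfly does it, and the helper names `build_sample_wad` and `read_wad` are made up for the example.

```python
import struct

HEADER = struct.Struct("<4sii")  # magic, lump count, directory offset
ENTRY = struct.Struct("<ii8s")   # lump offset, lump size, null-padded name


def build_sample_wad(lumps) -> bytes:
    """Pack (name, payload) pairs into a tiny in-memory PWAD."""
    body = b""
    directory = b""
    pos = HEADER.size
    for name, payload in lumps:
        directory += ENTRY.pack(pos, len(payload), name.encode("ascii"))
        body += payload
        pos += len(payload)
    # The directory goes last, so the header must point forward to it.
    header = HEADER.pack(b"PWAD", len(lumps), HEADER.size + len(body))
    return header + body + directory


def read_wad(data: bytes) -> dict:
    """Jump to the directory, then jump back to each lump it names."""
    magic, count, dir_offset = HEADER.unpack_from(data, 0)
    if magic not in (b"IWAD", b"PWAD"):
        raise ValueError("not a WAD file")
    lumps = {}
    for i in range(count):
        filepos, size, raw_name = ENTRY.unpack_from(data, dir_offset + i * ENTRY.size)
        name = raw_name.rstrip(b"\x00").decode("ascii")
        lumps[name] = data[filepos:filepos + size]
    return lumps


if __name__ == "__main__":
    wad = build_sample_wad([("HELLO", b"abc"), ("WORLD", b"12345")])
    print(read_wad(wad))
```

A strictly front-to-back reader would hit the lump data before knowing what any of it is, which is why a schema for this kind of format has to be able to follow offsets in either direction.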