I’ve spent a fair amount of time recently marveling over Microsoft Word’s complete inability to generate clean xHTML, or even clean HTML for that matter. It is the year 2004 after all — you would think a company with Microsoft’s resources would be able to figure this stuff out.
Microsoft’s latest offering, Word 2003, features the ability to export to numerous formats including XML and two varieties of HTML (filtered and regular). I have to admit that I held out some small hope that ‘filtered’ would produce the sort of clean code we’ve all been waiting for. No luck: the resulting HTML still included embedded ‘mso’ class references on every element. I can understand, and even appreciate, the application’s attempt to generate a document-specific stylesheet. I’d appreciate it even more if I could turn that ‘feature’ off.
The majority of Word documents we come in contact with are smaller content items with light to moderate formatting, and are destined for organizational websites. In all cases, those organizational websites have their own stylesheets that would easily accommodate a properly marked-up xHTML document. In nearly all cases, exporting the Word document to HTML produces code that requires a significant amount of cleanup effort.
There are code sweepers, HTML tidiers, and even large and expensive applications devoted to the task of converting Word documents into well-structured HTML. Still, by the year 2004 you’d expect this ability to be built into the core product.
In Microsoft’s defense, as this WebAIM tutorial points out, most users do not create truly structured Word documents. That is to say, most Word users will create a heading by changing the selected text’s font properties rather than applying an H1 style. Clearly, this is just as much a training issue as it is a technology issue. Still, even when styles are applied consistently, the resulting HTML is loaded with the previously mentioned ‘mso’ class references, not to mention <i> in place of <em> and <b> in place of <strong>.
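To give a sense of what the simplest kind of code sweeper does, here is a rough Python sketch that handles only the two problems named above: it strips ‘mso’ class attributes and swaps <i> and <b> for <em> and <strong>. The function name `clean_word_html` and the sample markup are made up for illustration, and the regular expressions assume fairly regular markup. It is not how any particular tool works, and a real cleanup would want a proper HTML parser.

```python
import re


def clean_word_html(html: str) -> str:
    """Hypothetical helper: strip Word's 'Mso' classes and presentational tags."""
    # Remove class attributes whose value starts with "Mso", quoted or not,
    # e.g. class=MsoNormal or class="MsoNormal".
    html = re.sub(r'\s+class\s*=\s*(["\']?)Mso\w*\1', '', html, flags=re.IGNORECASE)
    # Swap <i>/<b> for <em>/<strong>. The lookahead requires whitespace or '>'
    # right after the tag name, so <img>, <br>, and <body> are left alone.
    html = re.sub(r'<(/?)i(?=[\s>])', r'<\1em', html, flags=re.IGNORECASE)
    html = re.sub(r'<(/?)b(?=[\s>])', r'<\1strong', html, flags=re.IGNORECASE)
    return html


# Made-up sample resembling Word's exported markup.
sample = '<p class=MsoNormal><b>Bold</b> and <i class="MsoEmphasis">italic</i><br></p>'
print(clean_word_html(sample))
```

Run on the sample, this prints `<p><strong>Bold</strong> and <em>italic</em><br></p>`. Note that it does nothing about inline styles, Office-specific tags, or headings faked with font changes, which is where most of the real cleanup effort goes.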
If I didn’t know better, I might suspect that Microsoft simply doesn’t want to solve this problem. If Word could be counted upon to reliably export clean xHTML, it would be much too easy for users to move their documents to some other editing tool. Bill Gates has made no secret of the fact that proprietary file formats create a market advantage for Microsoft, and major headaches for the rest of us.