I’ve spent a fair amount of time recently marveling over Microsoft Word’s complete inability to generate clean xHTML, or even clean HTML for that matter. It is the year 2004 after all — you would think a company with Microsoft’s resources would be able to figure this stuff out.
Microsoft’s latest offering, Word 2003, features the ability to export to numerous formats including XML and two varieties of HTML (filtered and regular). I have to admit that I held out some small hope that ‘filtered’ would produce the sort of clean code we’ve all been waiting for. No luck: the resulting HTML still included embedded ‘mso’ class references on every element. I can understand, and even appreciate, the application’s attempt to generate a document-specific stylesheet. I’d appreciate it even more if I could turn that ‘feature’ off.
The majority of Word documents we come in contact with are smaller content items with light to moderate formatting, and are destined for organizational websites. In all cases, those organizational websites have their own stylesheets that would easily accommodate a properly marked-up xHTML document. In nearly all cases, exporting the Word document to HTML produces code that requires a significant amount of cleanup effort.
There are code sweepers, HTML tidiers, and even large and expensive applications devoted to the task of converting Word documents into well-structured HTML. Still, by the year 2004 you’d expect this ability to be built into the core product.
In Microsoft’s defense, as this WebAIM tutorial points out, most users do not create truly structured Word documents. That is to say, most Word users will create a heading by changing the selected text’s font properties rather than applying a heading style such as Heading 1. Clearly, this is just as much a training issue as it is a technology issue. Still, even when styles are applied consistently, the resulting HTML is loaded with the previously mentioned ‘mso’ class references, not to mention <i> in place of <em> and <b> in place of <strong>.
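To give a sense of the cleanup involved, here is a minimal Python sketch of the two fixes mentioned above: dropping the ‘mso’ class attributes and swapping <b> and <i> for <strong> and <em>. It is an illustration rather than a real converter. The function name `clean_word_html` is made up for this example, and it assumes the offending class names begin with "Mso" (for instance `MsoNormal`). It also relies on regular expressions rather than a proper HTML parser, so it will not cope with every document Word can produce.

```python
import re

# Matches a class attribute whose value starts with "Mso", quoted or unquoted,
# e.g. class=MsoNormal or class="MsoBodyText".
# Assumption: Word's generated class names all share this prefix.
MSO_CLASS = re.compile(
    r"""\s+class\s*=\s*(?:"Mso[^"]*"|'Mso[^']*'|Mso[\w-]*)""",
    re.IGNORECASE,
)

# Match opening and closing <b> / <i> tags, with or without attributes.
# Requiring whitespace or ">" right after the tag name keeps <br>, <body>
# and <img> from matching.
B_TAG = re.compile(r"<(/?)b((?:\s[^>]*)?)>", re.IGNORECASE)
I_TAG = re.compile(r"<(/?)i((?:\s[^>]*)?)>", re.IGNORECASE)


def clean_word_html(html: str) -> str:
    """Strip Mso* class attributes and replace <b>/<i> with <strong>/<em>."""
    html = MSO_CLASS.sub("", html)
    html = B_TAG.sub(r"<\1strong\2>", html)
    html = I_TAG.sub(r"<\1em\2>", html)
    return html


sample = "<p class=MsoNormal><b>Note:</b> see <i>below</i>.</p>"
print(clean_word_html(sample))
# prints: <p><strong>Note:</strong> see <em>below</em>.</p>
```

Even this small example hints at why dedicated tools exist: a real cleanup pass would also have to deal with inline styles, empty spans, and the document-specific stylesheet, none of which are handled here.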
If I didn’t know better, I might suspect that Microsoft simply doesn’t want to solve this problem. If Word could be counted upon to reliably export clean xHTML it would be much too easy for users to move their documents to some other editing tool. Bill Gates has made no secret of the fact that proprietary file formats create a market advantage for Microsoft — and major headaches for the rest of us.
I really hope someone solves this problem soon as it’s a real PITA. We have 150 Word documents to convert now, and then 20% of them again each month. There must be an easier way of doing it!
Regs….David.
Your best bet is to look into one of the many tools for transforming Word’s XML output into another schema. I’ve linked to one suite in my original post, but I have no experience with the product and the link is not intended as an endorsement.
OpenOffice will take your Word doc and Save As almost squeaky clean HTML. Give it a try.
Good point about Open Office. If you have control over your authoring environment Open Office can be a valuable tool. Plus it’s free.
The problem here is that we’re dealing primarily with government agencies. While Open Office is free, some more restrictive IT groups have locked their systems down so tightly that users cannot install additional software.
There are other issues as well, but they’re all variations on the theme of restrictive bureaucracy and/or inadequate user training.
And really, if Open Office can do such a good job converting Word files, why the heck can’t Word do the same with its own file format?
I am going crazy: if you use Word’s comments or revisions and then convert the document to a PDF, the colors of the revisions and comments and the strikethroughs do not carry over to the resulting PDF. Have you run across anything that resolves this?
Does this Open Office thing really work? I’m currently finishing up my dissertation (in Word) and have gone through the considerable trouble of structuring the document such that it SHOULD be easily exported in clean XHTML….would be great for putting such a document on the web for public use.
Jabley – I’ve had mixed luck getting clean XHTML when using Open Office to save a Word document. Quite a few of the Microsoft-specific XML elements hang around, and the result is not anywhere close to what I’d call “clean”. At any rate, since Open Office is free it might be worth a try.
I’ve been monitoring the progress of a product called Word Cleaner. I hope to review it here when the next version comes out. It’s a commercial product that specializes in converting Word to XHTML. The batch conversion utility could be a boon for anyone with a large number of Word documents.