A few weeks ago one of our clients called to tell us that one of their web pages didn’t look quite right. The site in question had recently been redesigned using web standards and was table-free. It uses our Content Management System (CMS) to publish pages from XHTML 1.0 Strict templates. What could possibly go wrong?
A quick check of the page in question produced interesting results. The page rendered perfectly in Mozilla, Opera, and Safari. Internet Explorer was another matter entirely. The columns seemed to melt together in ways that defied web physics.
A look at the page source revealed the problem. One of the administrative users had added a calendar item to the CMS by copying and pasting a document from Word. While our CMS does make an attempt to sanitize Word’s output, there’s only so much we can do. The resulting markup looked identical to what you would expect to see if you had exported a Word document to HTML.
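To give a sense of what "sanitizing" Word's output can involve, here is a minimal Python sketch of a whitelist-based cleaner. It is not our CMS's actual code: the `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `RENAME`, and `SKIP_CONTENT` sets and the `clean_word_html` function are hypothetical names chosen for illustration, and a production sanitizer would also need to balance tags and vet link targets.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: the few elements a calendar item might need.
ALLOWED_TAGS = {"p", "br", "strong", "em", "ul", "ol", "li", "a", "h2", "h3"}
VOID_TAGS = {"br"}
# Hypothetical per-tag attribute whitelist; class, style, lang, etc. are dropped.
ALLOWED_ATTRS = {"a": {"href"}}
# Presentational tags mapped to semantic equivalents.
RENAME = {"b": "strong", "i": "em"}
# Elements whose text content is discarded along with the tag itself.
SKIP_CONTENT = {"style", "script", "xml"}


class WordHtmlCleaner(HTMLParser):
    """Re-emits only whitelisted tags and attributes; keeps the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        tag = RENAME.get(tag, tag)
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # drops <span>, <font>, <o:p>, ... but not their text
        allowed = ALLOWED_ATTRS.get(tag, set())
        attr_text = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in allowed and value is not None
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text} />")
        else:
            self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        tag = RENAME.get(tag, tag)
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(markup: str) -> str:
    """Return markup with everything outside the whitelist stripped."""
    parser = WordHtmlCleaner()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Feeding this a pasted fragment full of `MsoNormal` classes, inline `style` attributes, and `<o:p>` tags leaves only plain paragraphs, emphasis, lists, and links. The whitelist approach is a deliberate choice here: Word's markup varies too much between versions to enumerate everything that should be removed, so it is safer to list what may stay.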
It’s ironic that markup exported from Word would break the web page in Internet Explorer, and only in Internet Explorer. It’s also pretty darned annoying.
We quickly sanitized the HTML in question and the page returned to normal, rendering consistently across browsers and platforms. We also spent some time educating the user on the potential issues involved in this sort of document conversion. Still, the potential for disaster is just a cut-and-paste away.
Clearly, user education is a big part of the solution. While the transition from Word to Web should be transparent, it isn’t. Users charged with publishing web content need to be aware of the pitfalls.
As my previous article pointed out, even if Word could do a reasonable job of converting documents to clean XHTML, most Word documents do not contain the semantic information needed to translate to an HTML equivalent. Michael Gross’ article “When Word to XML Conversions Get Nasty” provides a good overview of the many challenges we face in repurposing Office documents for the web.
Your staff should not neglect proper Word markup just because Word is currently unable to easily export to the format the rest of the world uses to publish documents online. At some point either Microsoft will enhance Word to provide this functionality, or your organization will acquire document conversion tools as part of a larger document management initiative. In either case, you won’t be able to extract semantic information that doesn’t exist.
While it’s tempting to segregate Web production standards from day-to-day clerical functions, it’s important to understand that poorly trained front-line staff can have a direct impact on your website’s accessibility without ever touching your CMS (or whatever tool you may be using to publish your website).
If you haven’t done so already, it’s time to take a holistic view of your organization’s content. Inaccessible web content frequently starts out as a poorly produced Office document.
On a related note, I have to report that my experiments converting Word documents to HTML via Open Office have produced substandard results. Don’t get me wrong: Open Office is a fine replacement for MS Office, and I’ve had no trouble opening and editing Word documents. However, when exporting these documents to HTML, the results are only marginally better than those achieved when doing the same from Word.
Meanwhile, my preliminary tests with Dreamweaver MX 2004 have been very encouraging. So far I’ve found that running the ‘Clean Up Word HTML’ command does in fact clean up most of the issues I’ve noted in the past. Plus, the latest version comes with HomeSite+ (which some people still consider to be the greatest HTML editing tool ever created).
Update: The Textism: Word HTML Cleaner seems to do the trick, and best of all it’s free (although you can, and probably should, leave a donation if you’re making heavy use of this great resource). Imagine that, a free service that does what Microsoft can’t with their own proprietary file format.