MS Word HTML Cleanup Tool
Have you ever been frustrated with the awful HTML that Microsoft Word outputs? I have, and I wanted to do something about it. I searched online for cleanup tools and found some, but most focused on stripping MS-specific tags and/or cleaning up mis-nested statements. I have been using FCKeditor for quite some time, and its "paste from Word" feature is one of the best I have found. It still misses some cleanup, though: certain special characters, excessive use of div (<div>) and paragraph (<p>) tags, and, most notably, list items (using span statements with space characters to build a list item is just wrong in my opinion).
Rather than reinvent the wheel, I decided to use FCKeditor for what it already does well and add some code to fix what it doesn't do so well. In the future, all my work may become obsolete if FCKeditor continues to improve, or if Microsoft starts producing clean HTML (which I recently read they are planning to do).
Currently, what the MS Word HTML Cleanup Tool can (in theory) do:
- replace sized font tags with heading tags
- remove other useless font tags
- replace ... (<p> <p>) with ...
- replace various special characters
- remove useless span tags
- remove any comments (e.g., <!-- [if !supportFootnotes]-->)
- replace code for images with a default image (currently /images/logo.gif). This feature probably needs some more thought, but for now at least it will show you where an image belongs.
- remove <a name=...> code, as these are usually useless.
- option to remove or keep multiple spaces
- remove extraneous empty lines (sometimes caused by images I think); still needs work...
- remove all remaining <p> tags. I guess some people like these, but personally I think <p> tags are wasteful, and if <div> tags don't have a useful class (which they never do when pasting from Word) then why bother?
  Update: I now enclose all cleaned output in one <div> container; this is to make FCKeditor "play nice."
- convert list items to actual list items and as much as possible handle nested lists appropriately. This was the primary motivation for me writing this code and it is a somewhat nasty little problem to tackle, especially when you get multiple nested list levels. It's further complicated by the fact that SOMETIMES FCKeditor does convert a list from Word properly, but most times it doesn't.
- allow MS Word text that uses the code format to be converted to <code> statements and keep associated spacing
- option to keep or discard class statements
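Several of the steps above are plain pattern replacements, so a small sketch may help show the idea. This is written in Python for illustration only and is not the tool's PHP code. The function names, the font-size-to-heading mapping, and the assumed shape of a pasted Word bullet (a paragraph starting with a bullet character followed by non-breaking spaces) are all hypothetical. It handles flat lists only; nested lists, which are the hard part, are not attempted.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE

# Hypothetical size-to-heading mapping; the tool's actual mapping is not stated.
FONT_SIZE_TO_HEADING = {"6": "h1", "5": "h2", "4": "h3"}

# Assumed shape of a pasted Word bullet once span tags are gone:
# <p>(bullet char)&nbsp;&nbsp;&nbsp; item text</p>
BULLET_ITEM = r"<p\b[^>]*>\s*[\u00b7\u2022](?:&nbsp;|\s)+(.*?)</p>\s*"


def strip_comments(html):
    """Remove HTML comments, including Word's conditional comments."""
    return re.sub(r"<!--.*?-->", "", html, flags=FLAGS)


def fonts_to_headings(html):
    """Turn mapped <font size=N> tags into headings, then drop all other font tags."""
    def to_heading(match):
        tag = FONT_SIZE_TO_HEADING.get(match.group(1))
        text = match.group(2)
        return f"<{tag}>{text}</{tag}>" if tag else text

    pattern = r'<font\b[^>]*\bsize="?(\d)"?[^>]*>(.*?)</font>'
    html = re.sub(pattern, to_heading, html, flags=FLAGS)
    return re.sub(r"</?font\b[^>]*>", "", html, flags=FLAGS)


def strip_spans(html):
    """Remove opening and closing span tags, keeping their contents."""
    return re.sub(r"</?span\b[^>]*>", "", html, flags=FLAGS)


def strip_classes(html):
    """Remove class attributes (naive: also matches ' class=...' in plain text)."""
    return re.sub(r'\s+class\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)', "", html, flags=FLAGS)


def collapse_spaces(html):
    """Collapse runs of spaces or tabs into a single space."""
    return re.sub(r"[ \t]{2,}", " ", html)


def bullets_to_list(html):
    """Wrap each run of consecutive bullet paragraphs in one flat <ul>."""
    def to_list(match):
        items = re.findall(BULLET_ITEM, match.group(0), flags=FLAGS)
        return "<ul>" + "".join(f"<li>{text.strip()}</li>" for text in items) + "</ul>"

    return re.sub("(?:" + BULLET_ITEM + ")+", to_list, html, flags=FLAGS)


def clean(html, keep_spaces=False, keep_classes=False):
    """Run the steps in order; spans must go before bullets are detected."""
    html = strip_comments(html)
    html = fonts_to_headings(html)
    html = strip_spans(html)
    html = bullets_to_list(html)
    if not keep_classes:
        html = strip_classes(html)
    if not keep_spaces:
        html = collapse_spaces(html)
    return html


sample = (
    '<!--[if !supportFootnotes]--><font size="5">Title</font>'
    '<p class="MsoNormal">\u00b7<span style="font:7pt">&nbsp;&nbsp;&nbsp; </span>One</p>'
    '<p class="MsoNormal">\u00b7<span style="font:7pt">&nbsp;&nbsp;&nbsp; </span>Two</p>'
)
print(clean(sample))
```

Regular expressions are fragile on real-world HTML (nested font tags, for instance, will confuse the heading step), which is part of why leaning on FCKeditor for the first pass and only patching its output is attractive.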
Some things I still need to think about and handle appropriately:
- make sure I am not hampering tables in any way and/or improve the way FCKeditor handles them
- make the handling of <p> tags an option to keep or discard (I prefer discarding but I imagine others prefer them)
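One way the keep-or-discard option for <p> tags could look is sketched below in Python (illustration only, not the tool's PHP code). The function name and the keep_paragraphs flag are hypothetical, and replacing each closing </p> with a line break when discarding is an assumption; how the tool actually preserves paragraph breaks is not described here. The single wrapping <div> follows the update noted in the feature list.

```python
import re


def handle_paragraphs(html, keep_paragraphs=False):
    """Keep <p> tags as-is, or discard them; always wrap the result in one <div>."""
    if not keep_paragraphs:
        # Drop opening <p ...> tags entirely.
        html = re.sub(r"<p\b[^>]*>", "", html, flags=re.IGNORECASE)
        # Assumption: a closing </p> becomes a line break so text does not run together.
        html = re.sub(r"</p\s*>", "<br />", html, flags=re.IGNORECASE)
    return f"<div>{html}</div>"


print(handle_paragraphs("<p>One</p><p class='x'>Two</p>"))
print(handle_paragraphs("<p>One</p>", keep_paragraphs=True))
```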
Caveat: Right now, this code is still in development and I make no claims about its usefulness or reliability, but if you want to give it a shot, download it below. Naturally, I wrote this code to suit my needs, so some of the above may not suit your needs exactly. Feel free to modify it to your heart's content.
Instructions for Using the MS Word HTML Cleanup Tool Online Demo
Below is an editor created using the open source FCKeditor software mentioned above. All you need to do is copy any relevant material from an MS Word document and hit the "paste from Word" icon on the editor's top toolbar (reference image below).
This will open a popup window. Paste the content you copied into that window and hit OK. You will then see your (ugly) resulting code in the editor itself. Make any other changes you see fit and hit the submit button. The results page lets you copy the resulting HTML code and also preview how it will look.
NOTE: You can't always trust what the editor shows, as the CSS for that editor most likely doesn't match the CSS you will be using. So don't worry if your pasted content seems to have huge spacing, extra-large fonts, etc. Since the code will (hopefully) be standard HTML, you can use your own CSS to fix all those issues easily.
Related Resources/Articles:
Download the Code
Currently, there are 6 files to download (all in one .zip file):
- readme.txt: basic usage instructions
- cleanit-functions.php: contains the main functions I have worked on
- cleanit-functions.asp: a two-function ASP file that James Crooke (www.cj-design.com) wrote based on my work
- cleanup.php: provides a sample script to run the conversion code (basically like this page)
- ChangeLog.txt: discusses the changes made from version to version
- todo.txt: lists enhancements I would like to make someday