Your Guide to Website Design and Management

MS Word HTML Cleanup Tool

FCKeditor UPDATE: September 27, 2008 - Upgraded to version 2.6.3 of FCKeditor. Haven't thoroughly tested it but a quick test seems to indicate it works better than the previous version, which seemed to have some problems with paste from word in the newest version of Firefox. If you notice any problems, please let me know.

New Version Update: As of January 30, 2009, a new version (2.6.1) is now ready. See what's new in version 2.x.
Have you ever been frustrated with the awful HTML that Microsoft Word outputs? I have, and I wanted to do something about it. I searched online for cleanup tools and found some, but most focused on stripping MS-specific tags and/or cleaning up mis-nested statements. I have been using FCKeditor for quite some time and have found its "paste from Word" feature to be one of the best available, but it is missing some cleanup steps as well: certain special characters, excessive use of <div> and <p> tags and, most notably, list items (using span statements with space characters to build a list item is just wrong in my opinion).

Rather than reinvent the wheel, I decided to use FCKeditor for what it already does well and add some code to fix what it doesn't do so well. In the future, all my work may become obsolete if FCKeditor continues to improve and if Microsoft starts producing clean HTML (which I recently read they are planning to do).

Current Capabilities

Currently, what the MS Word HTML Cleanup Tool can (in theory) do:
  • replace sized font tags with heading tags (e.g., <font size="5"> becomes <h1>, <font size="4"> becomes <h2>, etc.)
  • remove other useless font tags
  • replace <div>&nbsp;</div> (or <p>&nbsp;</p>) with <br />
  • replace various special chars (currently: &rsquo;, &lsquo;, &ndash;, &mdash;, &quot;, &ldquo;, &rdquo;, and &hellip;)
  • remove useless span tags
  • remove any comments (e.g., <!-- [if !supportFootnotes]-->, <!--[endif]-->, etc.)
  • replace code for images with a default image (currently /images/logo.gif). This feature probably needs some more thought but for now at least it will show you where an image belongs.
  • remove <a name=...> code as these are usually useless.
  • option to remove or keep multiple spaces
  • remove extraneous empty lines (sometimes caused by images I think); still needs work...
  • remove all remaining div and p tags. I guess some people like these, but personally I think p tags are wasteful, and if div tags don't have a useful class (which they never do when pasting from Word) then why bother? Update: I now enclose all cleaned output in one div container; this is to make FCKeditor "play nice."
  • convert list items to actual list items and as much as possible handle nested lists appropriately. This was the primary motivation for me writing this code and it is a somewhat nasty little problem to tackle, especially when you get multiple nested list levels. It's further complicated by the fact that SOMETIMES FCKeditor does convert a list from Word properly, but most times it doesn't.
  • allow MS Word text that uses the code format to be converted to <code>...</code> statements and keep associated spacing
  • option to keep or discard class statements
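
As a rough illustration of how a few of the simpler rules above could be written, here is a minimal PHP sketch. It is not the code shipped in cleanit-functions.php; the function name sketch_cleanup and the size-to-heading map are assumptions made up for this example, and it ignores harder cases such as nested font tags and list conversion.

```php
<?php
// Hypothetical sketch of a few cleanup rules; NOT the code in cleanit-functions.php.
function sketch_cleanup(string $html): string
{
    // Assumed mapping of font sizes to heading levels (5 => h1, 4 => h2).
    $headings = [5 => 'h1', 4 => 'h2'];
    foreach ($headings as $size => $tag) {
        $pattern = '#<font\s+size="' . $size . '"\s*>(.*?)</font>#is';
        $html = preg_replace($pattern, "<$tag>\$1</$tag>", $html);
    }

    // Remove any remaining font tags but keep their contents.
    $html = preg_replace('#</?font\b[^>]*>#i', '', $html);

    // Replace spacer blocks such as <p>&nbsp;</p> or <div>&nbsp;</div> with a line break.
    $html = preg_replace('#<(p|div)\b[^>]*>\s*&nbsp;\s*</\1>#i', '<br />', $html);

    // Remove HTML comments, including Word's conditional comments.
    $html = preg_replace('#<!--.*?-->#s', '', $html);

    // Remove named anchors (<a name=...>) but keep the text they wrap.
    $html = preg_replace('#<a\s+name=[^>]*>(.*?)</a>#is', '$1', $html);

    return $html;
}
```

Each rule is a single regular-expression pass, which keeps the sketch short but means the order matters: sized font tags have to be converted to headings before the leftover font tags are stripped.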

Some things I still need to think about and handle appropriately:
  • make sure I am not hampering tables in any way and/or improve the way FCKeditor handles them
  • make the handling of <p> tags an option to keep or discard (I prefer discarding but I imagine others prefer them)

Caveat: Right now, this code is still in development and I make no claims about its usefulness or reliability, but if you want to give it a shot, download it below. Naturally, I wrote this code to suit my needs, so some of the above may not suit yours exactly. Feel free to modify it to your heart's content.

Online Demo

Instructions for Using the MS Word HTML Cleanup Tool Online Demo

Below is an editor created using the open source FCKeditor software mentioned above. All you need to do is copy any relevant material from an MS Word document and hit the "paste from word" icon on the editor's top toolbar (reference image below).

This will cause a popup window to open. Just paste the content you copied into that window and hit OK. You will then see your (ugly) resulting code in the editor itself. Make any other changes as you see fit and hit the submit button. The results page will let you copy the resulting HTML code and preview how it will look.

NOTE: You can't always trust what the editor shows, as the CSS for that editor most likely doesn't match the CSS you will be using. So don't worry if your pasted content seems to have huge spacing, extra large fonts, etc. Since the code will (hopefully) be standard HTML, you can use your own CSS to fix all those issues easily.

Download the Code

Currently, there are four files to download (all in one .zip file): readme.txt, which has basic usage instructions; cleanit-functions.php, which contains the main functions I have worked on; cleanit-functions.asp, a two-function ASP file that James Crooke (www.cj-design.com) wrote based on my work; and ChangeLog.txt, which describes the changes made from version to version.

download MS Word Cleanup Tool Code


Comments »

Where does the file go?
I may have missed it but I am unclear where I put the php file you made in the FCK directory structure.

Can you help?

Thanks much.
Will
#1 - Will - 07/31/2007 - 19:25
Re: Where does the file go?
Hi Will,

I probably should improve the readme file, but near the top it does say that this code doesn't modify FCKeditor directly. Basically, you implement FCKeditor however you see fit, for example:

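// Load the FCKeditor PHP integration and create an editor instance named "editor".
include_once("FCKeditor/fckeditor.php");
$oFCKeditor = new FCKeditor("editor");
$oFCKeditor->BasePath = '/FCKeditor/';
// $rows and $cols hold the editor height and width; set them to suit your page.
$oFCKeditor->Height = $rows;
$oFCKeditor->Width = $cols;
// Pre-fill the editor with any previously submitted content (empty on first load).
$oFCKeditor->Value = isset($_POST["editor"]) ? stripslashes($_POST["editor"]) : '';
$oFCKeditor->Create();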

Now, the key variable for the FCKeditor textarea will be called "editor". To clean up that variable AFTER the form is submitted, do something like this:

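// Load the cleanup functions, then clean the submitted editor content.
include("cleanit-functions.php");
$html = stripslashes($_POST["editor"]);
// cleanit() takes the HTML as an array of lines.
$lines = explode("\n", $html);
$_POST["editor"] = cleanit($lines, $imagepath, $imagestartnum, $alt_prefix);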

where the first parameter ($lines) is generated above and the other parameters are explained in the readme file.

I hope that helps. If not, please ask some specific question(s) so I can provide better information.

Regards,
Jeff
#2 - Admin - 08/01/2007 - 00:30
thanks
#3 - Warren - 08/23/2007 - 17:43
Web Development & Hosting
Hi the Converting from Microsoft Word to HTML link is broken and not working for me.

Regards,
Rich
#4 - Richard Metcalfe - 11/08/2008 - 16:02
Re: broken link
Hi Richard,

What link are you referring to? If you mean the download link, it works fine for me. Please provide more details so I can investigate.

Thanks,
Jeff
#5 - Jeff Blum - 11/08/2008 - 16:09
Perhaps Rich's comment refers to the fact that the "paste from word" icons work fine in Firefox, but they are disabled in IE7.
#6 - Bill - 06/18/2009 - 13:56