Cyrille Bonnet wrote:

> Daniel Dekany wrote:
>> BTW, has anybody found a solution for fixing HTML copy-pasted from
>> Microsoft Word (mostly 2000/XP)? A lot of users have MS Word, and the
>> HTML pasted from it is a CSS-killing mess. I tried mxTidy, but it
>> didn't improve the HTML substantially. So how do you guys do it? I
>> have looked for solutions for Epoz, but didn't find any. But I'm
>> not tied to Epoz, if there is already a solution for Kupu (is
>> Kupu already recommended over Epoz anyway?). Certainly the solution
>> would be an Epoz post-tidy Python script, but I didn't find any for
>> Word tidying. (However, the ideal would be for the HTML to be tidied
>> right on the client when it is pasted in, so the user would really
>> get what he sees, i.e. the HTML wouldn't be changed when he saves it.
>> That effect is really evil.)
> As Shane pointed out, there is a tidy-up in Kupu. However, in my
> experience, it is not a very good tidy-up (if I remember correctly, a
> lot of tags are still there after the tidy-up).

Unfortunately there is a fine line between tidying up the cruft pasted from
Word and stripping out things which might actually have been entered
legitimately. I think Kupu does this pretty well (but then I'm a bit
biased), but without any way to detect that the user is pasting from Word, I
don't see how much more could be stripped.
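
To illustrate the detection problem, here is a minimal Python sketch of a heuristic that guesses, after the fact, whether a fragment of HTML came from Word. It is not Kupu code: the function name `looks_like_word_html`, the `threshold` parameter and the marker list are all assumptions made for illustration, and a heuristic like this can misfire on legitimately entered content.

```python
import re

# Hypothetical marker list: patterns often seen in HTML that originated in
# Word. This is an illustrative assumption, not a definitive signature.
WORD_MARKERS = [
    re.compile(r'class\s*=\s*["\']?Mso', re.IGNORECASE),            # class="MsoNormal"
    re.compile(r'mso-[a-z-]+\s*:', re.IGNORECASE),                  # style="mso-...: ..."
    re.compile(r'</?o:p\b', re.IGNORECASE),                         # <o:p> elements
    re.compile(r'urn:schemas-microsoft-com:office', re.IGNORECASE), # Office namespaces
]


def looks_like_word_html(html, threshold=1):
    """Return True if at least `threshold` marker patterns occur in `html`."""
    hits = sum(1 for pattern in WORD_MARKERS if pattern.search(html))
    return hits >= threshold
```

A check like this could decide whether to apply a more aggressive cleanup, but it only inspects the markup; it does not tell you that a paste just happened.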

So far as I know, the only things which don't really get stripped from the
pasted Word text are the mso class names. These can be blacklisted manually,
but I never got round to producing a definitive blacklist.
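
As a rough sketch of what blacklisting the class names could look like in a server-side post-tidy script, the Python below drops every class name that starts with "mso" and removes the class attribute when nothing is left. The prefix rule and the name `strip_mso_classes` are assumptions for illustration, not the definitive blacklist. It is regex-based, so it only handles quoted attribute values and would also rewrite a matching string in plain text; a real cleanup would work on the parsed DOM.

```python
import re

# Assumption: Word-generated class names start with "Mso" (e.g. MsoNormal).
MSO_CLASS = re.compile(r'^mso', re.IGNORECASE)

# Matches a quoted class attribute together with its leading whitespace.
CLASS_ATTR = re.compile(r'\s+class\s*=\s*(["\'])(.*?)\1',
                        re.IGNORECASE | re.DOTALL)


def strip_mso_classes(html):
    """Remove mso* class names; drop the attribute if it becomes empty."""
    def fix(match):
        quote = match.group(1)
        kept = [name for name in match.group(2).split()
                if not MSO_CLASS.match(name)]
        if not kept:
            return ''
        return ' class=%s%s%s' % (quote, ' '.join(kept), quote)

    return CLASS_ATTR.sub(fix, html)
```

Class names that the user entered legitimately are kept, which is the point of blacklisting rather than stripping all class attributes.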

One of my thoughts is to provide a separate 'clean this up' button which
would apply a more aggressive tidy-up than the one applied when saving. Also,
I agree that applying the tidy only on save is bad, but there isn't a
cross-browser way to detect a paste, and applying the cleanup to a large
document every time you cut/paste one word wouldn't be nice either.

Suggestions for improvements are most welcome.

P.S. Pasting bad HTML isn't the only problem: some Microsoft
applications supply RTF on the clipboard but not HTML, and it turns out that
if you paste RTF into IE it generates seriously invalid HTML with a totally
weird and corrupted DOM. That is another area where I think the cleanup
code finally does a passable job, but not yet a perfect one.
