Cyrille Bonnet wrote:
> Daniel Dekany wrote:
>> BTW, has anybody found a solution for fixing HTML copy-pasted from
>> Microsoft Word (mostly 2000/XP)? A lot of users have MS Word, and the
>> HTML pasted from it is a CSS-killing mess. I tried mxTidy, but it
>> didn't substantially improve the HTML. So how do you guys do it? I
>> have looked for solutions for Epoz, but didn't find any. But I'm
>> not tied to Epoz... if there is already a solution for Kupu (is
>> Kupu already recommended over Epoz anyway?). Certainly the solution
>> would be an Epoz post-tidy Python script, but I didn't find any for
>> Word tidying. (However, the ideal would be for the HTML to be tidied
>> right on the client when the user pastes it in, so that users really
>> get what they see, i.e. the HTML wouldn't change when they save it.
>> That effect is really evil.)
> As Shane pointed out, there is a tidy-up in Kupu. However, in my
> experience, it is not a very good one (if I remember correctly, a
> lot of tags are still there after the tidy-up).
Unfortunately there is a fine line between tidying up the cruft pasted from
Word and stripping out things which might actually have been entered
legitimately. I think Kupu does this pretty well (though I'm a bit
biased), but without any way to detect that the user is pasting from Word I
don't see how much more could be stripped.
So far as I know, the only things which don't really get stripped from the
pasted Word text are the mso class names. These can be manually blacklisted,
but I never got round to producing a definitive blacklist.
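For illustration, here is a rough sketch of what such a blacklist might look
like as a server-side Python post-tidy step. This is not Kupu's own code: the
function name strip_mso_classes is made up, and the rule that Word's class
names start with "Mso" (MsoNormal and friends) is an assumption rather than
a definitive list. It is also regex-based, so it will trip over attribute
values that contain a ">".

```python
import re

# Assumption: class names generated by Word start with "Mso" (e.g. MsoNormal).
MSO_CLASS = re.compile(r"^mso", re.IGNORECASE)

# A whole tag, so that only attributes inside tags are touched, not body text.
TAG = re.compile(r"<[^<>]+>")

# A class attribute with a double-quoted, single-quoted or unquoted value.
CLASS_ATTR = re.compile(
    r"""\sclass\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def _filter_classes(match):
    """Drop blacklisted names from one class attribute."""
    value = match.group(1) or match.group(2) or match.group(3) or ""
    kept = [name for name in value.split() if not MSO_CLASS.match(name)]
    if not kept:
        return ""  # nothing legitimate left: remove the attribute entirely
    return ' class="%s"' % " ".join(kept)


def strip_mso_classes(html):
    """Remove Word-style class names, keeping any other classes."""
    return TAG.sub(lambda m: CLASS_ATTR.sub(_filter_classes, m.group(0)), html)
```

Classes that do not match the blacklist are left alone, which keeps this on
the safe side of the fine line mentioned above.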
One of my thoughts is to provide a separate 'clean this up' button which
would apply a more aggressive tidy-up than the one when saving. Also, I
agree that only applying the tidy on save is bad, but there isn't a cross-
browser way to detect a paste, and applying the cleanup on a large
document every time you cut/paste one word wouldn't be nice either.
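To make the two-level idea concrete, a minimal sketch in Python follows. It
only illustrates the split between a gentle pass on save and a harsher pass
behind a button; the function name cleanup, the aggressive flag and the
choice of what each level strips are all assumptions of mine, not what Kupu
currently does.

```python
import re

# Namespaced Office tags such as <o:p>, </o:p> or <st1:place ...>.
OFFICE_TAG = re.compile(r"</?[a-z0-9]+:[^>]*>", re.IGNORECASE)

# A whole tag, so attribute stripping never touches body text.
TAG = re.compile(r"<[^<>]+>")

# Presentational attributes with quoted values (hypothetical selection).
STYLE_ATTR = re.compile(
    r"""\s(?:style|class|lang)\s*=\s*(?:"[^"]*"|'[^']*')""",
    re.IGNORECASE,
)

# Wrapper tags that usually carry only formatting (hypothetical selection).
WRAPPER_TAG = re.compile(r"</?(?:span|font)\b[^>]*>", re.IGNORECASE)


def cleanup(html, aggressive=False):
    """Gentle pass by default; aggressive pass only when explicitly requested."""
    # Gentle: remove markup that nobody types by hand.
    html = OFFICE_TAG.sub("", html)
    if aggressive:
        # Aggressive: this may also remove formatting entered legitimately,
        # which is why it belongs behind a button rather than in the save path.
        html = TAG.sub(lambda m: STYLE_ATTR.sub("", m.group(0)), html)
        html = WRAPPER_TAG.sub("", html)
    return html
```

The save path would call cleanup(html), while the button would call
cleanup(html, aggressive=True), so the user decides when to take the risk.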
Suggestions for improvements are most welcome.
P.S. It isn't just pasting bad HTML which is a problem: some Microsoft
applications supply RTF on the clipboard but not HTML and it turns out that
if you paste RTF into IE it generates seriously invalid HTML with a totally
weird and corrupted DOM. That is another area where I think the cleanup
code finally does a passable job but not yet a perfect one.
Zope maillist - Zope@zope.org