"Dinesh B Vadhia" <dineshbvad...@hotmail.com> wrote

I'm processing tens of thousands of HTML files. A few of them contain
mismatched tags, and ElementTree throws the error:

"Unexpected error opening J:/F2/663/blahblah.html: mismatched tag: line 124, column 8"

IMHO the best way to cleanse HTML files is to use tidy.
It is available for *nix and Windows and has a wealth of
options to control its output. It can even convert HTML into
valid XHTML, which ElementTree should be happy with.

http://tidy.sourceforge.net/

It may not be Python but it's fast and effective!
And there is a Python wrapper:

http://utidylib.berlios.de/

although I've never used it.
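
For the command-line route, something like the following might work as a
starting point. It is only a sketch, not tested against your files: the
function names build_tidy_command and parse_html are made up for
illustration, and it assumes the tidy executable is on your PATH and that
the files are UTF-8. It tries ElementTree first and only hands a file to
tidy when parsing fails, so the well-formed majority is not reprocessed.

```python
import subprocess
import xml.etree.ElementTree as ET


def build_tidy_command(path):
    """Return the tidy command line used to convert one HTML file to XHTML."""
    return [
        "tidy",
        "-quiet",                 # suppress the banner and summary
        "-asxhtml",               # emit well-formed XHTML
        "-numeric",               # numeric entities; the XML parser rejects &nbsp; etc.
        "-utf8",                  # assumption: input and output are UTF-8
        "--show-warnings", "no",
        "--force-output", "yes",  # still write output when tidy reports errors
        path,
    ]


def parse_html(path):
    """Parse path directly; fall back to tidy only if the markup is malformed."""
    try:
        return ET.parse(path).getroot()
    except ET.ParseError:
        # tidy exits with 1 for warnings and 2 for errors, so the return code
        # is not checked here; if the cleanup did not help, the call below
        # raises ParseError again and the caller can log the file and move on.
        result = subprocess.run(build_tidy_command(path), capture_output=True)
        return ET.fromstring(result.stdout)
```

Note that the XHTML tidy writes carries the XHTML namespace, so elements
from a cleaned file will have tags like
{http://www.w3.org/1999/xhtml}body rather than plain body.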

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
