Stefan / Alan et al

Thank you for all the advice and links. A simple script using etree is scanning 500K+ xhtml files, and 2 files with mismatched tags have been found so far, which can be fixed manually. I'll definitely look into "tidy" as it sounds pretty cool. Because we are running data processing programs on a 64-bit Windows box (yes, I know, I know ...) using 64-bit Python, we can only use pure-Python libraries. I believe that lxml uses C libraries.

Again, thanks to everyone - a terrific community as usual!
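For anyone finding this thread later, the kind of etree scan described above might look roughly like the sketch below. It is not the actual script: the function names check_xhtml and scan_files are made up for illustration, and it assumes a Python version where xml.etree.ElementTree raises ParseError with a position attribute. It only reports the first error in each file, and entities such as &nbsp; may also be reported as errors, since the parser does not read the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def check_xhtml(source):
    """Return None if source is well-formed XML, else (line, column, message).

    Hypothetical helper: only the first parse error is reported, because
    the parser stops at the first problem it meets.
    """
    try:
        ET.fromstring(source)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based, column is 0-based
        return line, column, str(err)
    return None


def scan_files(paths):
    """Yield (path, line, column, message) for each file that fails to parse."""
    for path in paths:
        # Read bytes so the parser can honour the encoding declared in the file.
        with open(path, "rb") as handle:
            problem = check_xhtml(handle.read())
        if problem is not None:
            yield (path,) + problem
```

Looping over scan_files(list_of_paths) and printing each tuple gives a list of the files that need fixing by hand, with the line where the parser gave up.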
--------------------------------------------------------------------------------

Message: 5
Date: Tue, 28 Apr 2009 19:39:17 +0200
From: Stefan Behnel <stefan...@behnel.de>
Subject: Re: [Tutor] finding mismatched or unpaired html tags
To: tutor@python.org
Message-ID: <gt7f05$1o...@ger.gmane.org>
Content-Type: text/plain; charset=ISO-8859-1

A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired tags (by line number) in each file. I've read the ElementTree
>> docs and cannot see anything obvious how to do this. I know this is a
>> common problem but feeling a bit clueless here - any ideas?
>
> Don't use elementTree, use BeautifulSoup instead.

Actually, now that the code is there anyway, the OP might be happier with
lxml.html. It's a lot faster than BeautifulSoup, uses less memory, and often
parses broken HTML better. It's also more user friendly for many HTML tasks.

http://codespeak.net/lxml/lxmlhtml.html

This might also be worth a read:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

Stefan
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor