A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired tags (by line number) in each file. I've read the
>> ElementTree docs and cannot see anything obvious how to do this.
>> I know this is a common problem but feeling a bit clueless here -
>> any ideas?
>
> Don't use elementTree, use BeautifulSoup instead.

Actually, now that the code is there anyway, the OP might be happier
with lxml.html. It's a lot faster than BeautifulSoup, uses less memory,
and often parses broken HTML better. It's also more user friendly for
many HTML tasks.

http://codespeak.net/lxml/lxmlhtml.html

This might also be worth a read:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

Stefan
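
On the original question (reporting mismatched or unpaired tags by line
number), here is a minimal sketch that uses only the standard library's
html.parser module rather than lxml or BeautifulSoup. The names
TagChecker, find_tag_problems and VOID_TAGS are made up for this
illustration, and the check is stricter than HTML itself, so treat it as
a starting point and not a validator.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagChecker(HTMLParser):
    """Hypothetical helper: collect (line, message) pairs for tag problems."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags as (name, line)
        self.problems = []  # (line, message)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            # getpos() returns (line, column) of the tag being handled.
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        line = self.getpos()[0]
        if tag not in [name for name, _ in self.stack]:
            self.problems.append(
                (line, "closing </%s> has no opening tag" % tag))
            return
        # Pop back to the matching opener; anything skipped was not closed.
        while self.stack:
            name, opened = self.stack.pop()
            if name == tag:
                break
            self.problems.append((opened, "<%s> is never closed" % name))


def find_tag_problems(text):
    """Return a sorted list of (line, message) for one HTML document."""
    checker = TagChecker()
    checker.feed(text)
    checker.close()
    # Whatever is still open at end of input was never closed.
    for name, opened in checker.stack:
        checker.problems.append((opened, "<%s> is never closed" % name))
    return sorted(checker.problems)
```

To use it, read each file into a string and loop over the result, e.g.
"for line, msg in find_tag_problems(text): print(line, msg)". Note that
HTML allows some closing tags (</p>, </li> and a few others) to be
omitted, so this sketch will also flag those cases even though a browser
accepts them.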