Re: [Tutor] finding mismatched or unpaired html tags

Dinesh B Vadhia Tue, 28 Apr 2009 15:01:43 -0700

Stefan / Alan et al

Thank you for all the advice and links.  A simple script using etree is 
scanning 500K+ xhtml files, and 2 files with mismatched tags have been found so 
far, which can be fixed manually.  I'll definitely look into "tidy" as it sounds 
pretty cool.  Because we are running data processing programs on a 64-bit 
Windows box (yes, I know, I know ...) using 64-bit Python, we can only use 
pure-Python libraries.  I believe that lxml uses C libraries.  Again, thanks to 
everyone - a terrific community as usual!
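
For anyone finding this thread later, the scan is roughly along these lines. 
This is only a minimal sketch, not the exact script: the function name 
find_malformed is made up for illustration, and it assumes a Python version 
where ElementTree raises ParseError with a .position attribute (2.7 / 3.2 or 
later).  The parser stops at the first problem, so only the first mismatched 
tag in each file is reported.

```python
import xml.etree.ElementTree as ET


def find_malformed(paths):
    """Try to parse each file; collect the ones that are not well-formed.

    Returns a list of (path, line, column, message) tuples, one per file
    that fails to parse. find_malformed is a hypothetical helper name.
    """
    problems = []
    for path in paths:
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            # err.position is a (line, column) tuple for the first error
            # the parser hit; parsing stops there.
            line, column = err.position
            problems.append((str(path), line, column, str(err)))
    return problems
```

Feeding it the output of os.walk() or glob over the data directory and 
printing each tuple gives a list of files to fix by hand.  It uses only the 
standard library, so the pure-Python restriction is not a problem.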




--------------------------------------------------------------------------------

Message: 5
Date: Tue, 28 Apr 2009 19:39:17 +0200
From: Stefan Behnel <[email protected]>
Subject: Re: [Tutor] finding mismatched or unpaired html tags
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1

A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired tags (by line number) in each file. I've read the ElementTree
>> docs and cannot see anything obvious how to do this. I know this is a
>> common problem but feeling a bit clueless here - any ideas?
> 
> Don't use elementTree, use BeautifulSoup instead.

Actually, now that the code is there anyway, the OP might be happier with
lxml.html. It's a lot faster than BeautifulSoup, uses less memory, and
often parses broken HTML better. It's also more user friendly for many HTML
tasks.

http://codespeak.net/lxml/lxmlhtml.html

This might also be worth a read:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

Stefan

_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor
