Re: [Tutor] finding mismatched or unpaired html tags

Dinesh B Vadhia Tue, 28 Apr 2009 07:04:29 -0700

A.T. / Marty

I'd prefer that the HTML parser didn't replace the missing tags, as I want to 
know where and what the problems are.  Also, the source HTML documents were 
generated by another computer, i.e. they are not web page documents.  My sense 
is that only a few files out of tens of thousands are affected.  Cheers ...


Dinesh
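
One way to report problems without repairing them is to track open tags on
a stack and note the line number whenever a closing tag does not match.
The sketch below is only an illustration, not something proposed in the
thread: it assumes a current Python 3 and the standard library's
html.parser, and the names TagChecker, find_tag_problems and VOID_TAGS are
made up for this example. It treats every non-void tag as requiring a
closing tag, which is stricter than HTML itself but close to what an XML
parser such as ElementTree expects.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (assumed list).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagChecker(HTMLParser):
    """Record unpaired or mismatched tags; the input is never modified."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of (tag, line) for tags still open
        self.problems = []    # list of (line, description)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        line = self.getpos()[0]
        if tag not in [name for name, _ in self.open_tags]:
            self.problems.append(
                (line, "closing </%s> has no opening tag" % tag))
            return
        # Anything opened after the matching tag was never closed.
        while self.open_tags:
            name, opened = self.open_tags.pop()
            if name == tag:
                break
            self.problems.append((opened, "<%s> is never closed" % name))


def find_tag_problems(html_text):
    """Return a sorted list of (line, description) for one document."""
    checker = TagChecker()
    checker.feed(html_text)
    checker.close()
    # Whatever is still on the stack at end of input was never closed.
    for name, opened in checker.open_tags:
        checker.problems.append((opened, "<%s> is never closed" % name))
    return sorted(checker.problems)


if __name__ == "__main__":
    sample = "<html>\n<body>\n<p>one\n<div>two</p>\n</span>\n</body>\n</html>\n"
    for line, text in find_tag_problems(sample):
        print("line %d: %s" % (line, text))
```

For tens of thousands of files, the same function could be called once per
file and only the files with a non-empty result written to a report.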



--------------------------------------------------------------------------------

Message: 7
Date: Tue, 28 Apr 2009 08:54:33 -0500
From: Martin Walsh <[email protected]>
Subject: Re: [Tutor] finding mismatched or unpaired html tags
To: "[email protected]" <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset=us-ascii

A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired tags (by line number) in each file. I've read the ElementTree
>> docs and cannot see anything obvious how to do this. I know this is a
>> common problem but feeling a bit clueless here - any ideas?
>>
> 
> Don't use ElementTree, use BeautifulSoup instead.
> 
> ElementTree expects perfect input, typically generated by another computer.
> BeautifulSoup is designed to handle your everyday HTML page, filled with
> errors of all possible kinds.

But it also modifies the source html by default, adding closing tags,
etc. Important to know, I suppose, if you intend to re-write the html
files you parse with BeautifulSoup.
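
If the goal is only to locate the problem rather than repair it, the
position that ElementTree already reports can be captured directly. This
is a minimal sketch, assuming a current Python 3 where
xml.etree.ElementTree raises ParseError with a position attribute; the
helper name first_parse_error is hypothetical. The parser stops at the
first error, so this finds at most one problem per document.

```python
import xml.etree.ElementTree as ET


def first_parse_error(text):
    """Return (line, column, message) for the first error, else None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # position as reported by the parser
        return line, column, str(err)
    return None


if __name__ == "__main__":
    broken = "<html>\n<body>\n<p>one</div>\n</body>\n</html>"
    print(first_parse_error(broken))
```

For files on disk, ET.parse(path) raises the same exception, so the
try/except can wrap that call instead.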

Also, unless you're running Python 3.0 or greater, use the 3.0.x series
of BeautifulSoup -- otherwise you may run into the same parse errors on
malformed markup.

http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

HTH,
Marty

_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor
