Lawrence D'Oliveiro writes:
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception,
[Richie]
But Tidy fails on huge numbers of real-world HTML pages. [...]
Is there a Python HTML tidier which will do as good a job as a browser?
[Walter]
You can also use the HTML parser from libxml2
[Paul]
libxml2 will attempt to parse HTML if asked to [...] See how it fixes
up the
Rene Pijlman wrote:
Lawrence D'Oliveiro:
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not.
There are two solutions to this:
1.
Richie Hindle wrote:
But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:
from mx.Tidy import tidy
results = tidy("<html><body><pree>Hello world!</pre></body></html>")
[Various error messages]
Is there a Python HTML tidier which will do as good a
Rene Pijlman wrote:
2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/
That sounds like what I'm after!
--
http://mail.python.org/mailman/listinfo/python-list
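For anyone reading this thread later: a rough sketch of the "forgiving parser" idea, using only the standard library. It assumes Python 3.5 or later, where html.parser (the successor to the HTMLParser module discussed here) no longer has a strict mode and does not raise on malformed markup. The class name TagAndTextCollector and the helper scrape() are made up for illustration, and this is not a substitute for BeautifulSoup's tree repair; it only shows that the misspelled-tag example from this thread can be fed through without an exception.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Hypothetical helper: records start tags and non-blank text.

    Unknown tags (such as <pree>) and mismatched end tags are simply
    reported to the handlers; nothing here tries to repair the tree.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called for every start tag, whether or not it is valid HTML.
        self.tags.append(tag)

    def handle_data(self, data):
        # Keep only text that is not pure whitespace.
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def scrape(html):
    """Feed a (possibly malformed) document and return (tags, text)."""
    parser = TagAndTextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return parser.tags, parser.text


# The misspelled-tag document quoted earlier in the thread.
tags, text = scrape("<html><body><pree>Hello world!</pre></body></html>")
print(tags)  # ['html', 'body', 'pree']
print(text)  # ['Hello world!']
```

Note that the stray </pre> is silently ignored here because the sketch does not override handle_endtag. If you need a corrected document tree rather than a stream of events, a tree-building library such as BeautifulSoup is still the better fit.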
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused
Lawrence D'Oliveiro:
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not.
There are two solutions to this:
1. Tidy the source before parsing
Lawrence D'Oliveiro wrote:
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object
[Daniel]
You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
as a first step to get well formed HTML.
But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:
from mx.Tidy import tidy
results = tidy("<html><body><pree>Hello