AFAIK not with HTMLParser or htmllib. You might try htmllib (if you
haven't done so yet) and see which parser is more forgiving.
Thanks, I'll try htmllib.
In any case, I found a solution. Feeding data to the HTMLParser in
chunks extracted from the string using string.split() will allow me
to
From http://www.crummy.com/software/BeautifulSoup/:
You didn't write that awful page. You're just trying to get
some data out of it. Right now, you don't really care what
HTML is supposed to look like.
Neither does this parser.
True, I just want to extract some data from
florent wrote:
True, I just want to extract some data from HTML documents. But the
problem is the same. The parser loses its position in the string
when it encounters a bad tag.
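The chunk-feeding workaround mentioned earlier in the thread could look roughly like the sketch below. It is written for Python 3's html.parser rather than the Python 2.3 modules discussed here, and because that parser does not normally raise on bad markup, the StrictLinkParser class deliberately raises ValueError on unknown tags to stand in for a strict parser. The class name, the KNOWN_TAGS set, and the feed_in_chunks helper are all hypothetical names made up for this illustration.

```python
import re
from html.parser import HTMLParser


class StrictLinkParser(HTMLParser):
    """Collects href values; raises on unknown tags to mimic a strict parser."""

    # Hypothetical whitelist, only here to provoke an error on "bad" tags.
    KNOWN_TAGS = {"html", "body", "a", "p"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.KNOWN_TAGS:
            raise ValueError("bad tag: %r" % tag)
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


def feed_in_chunks(parser, text):
    """Feed text one tag-sized chunk at a time, skipping chunks that raise."""
    skipped = []
    # Each chunk is either leading text or one "<...>" plus the text after it.
    for chunk in re.findall(r"<[^<]*|[^<]+", text):
        try:
            parser.feed(chunk)
        except ValueError:
            # Discard the buffered bad chunk so the next feed starts clean.
            parser.reset()
            skipped.append(chunk)
    parser.close()
    return skipped


if __name__ == "__main__":
    sample = ('<html><body><a href="one.html">1</a><blink>x</blink>'
              '<a href="two.html">2</a></body></html>')
    link_parser = StrictLinkParser()
    bad_chunks = feed_in_chunks(link_parser, sample)
    print(link_parser.links)
    print(bad_chunks)
```

Because only the failing chunk is thrown away, the data collected before and after the bad tag survives. The price is that any text sharing a chunk with the bad tag is lost as well.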
Are you saying that Beautiful Soup can't parse the HTML? If so, I'm
sure the author would like an
AFAIK not with HTMLParser or htmllib. You might try htmllib (if you
haven't done so yet) and see which parser is more forgiving.
You were right, the HTMLParser of htmllib is more permissive. He just
ignores the bad tags!
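For readers on Python 3, where htmllib no longer exists, the same "just ignore the bad tags" behaviour can be sketched with the standard html.parser module, which as far as I know tolerates unclosed and mismatched tags without raising. The TextExtractor class and extract_text helper are hypothetical names for this illustration, not anything from the thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gathers the non-blank text runs of a document, ignoring tag structure."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the text pieces found in html, in document order."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.chunks


if __name__ == "__main__":
    # Unclosed <b> and <p>, plus a stray </i>: the text still comes out.
    print(extract_text("<p>first<b>bold<p>second</i>"))
```

The parser never checks that tags are balanced, so sloppy nesting does not stop the extraction.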
--
http://mail.python.org/mailman/listinfo/python-list
Are you saying that Beautiful Soup can't parse the HTML? If so, I'm
sure the author would like an example so he can fix it.
I ended up using the htmllib module, which is more permissive than the
HTMLParser module when parsing bad HTML documents.
Anyway, where can I find the author's contact
You were right, the HTMLParser of htmllib is more permissive. He just
ignores the bad tags!
The HTMLParser on my distribution is a she. But then again, I am using
ActivePython on Windows...
Steve M wrote:
You were right, the HTMLParser of htmllib is more permissive. He just
ignores the bad tags!
The HTMLParser on my distribution is a she. But then again, I am using
ActivePython on Windows...
Although building parsers is for some strange reason one of my favourite
programming
florent wrote:
I'm trying to parse HTML documents from the web, using the HTMLParser
class of the HTMLParser module (Python 2.3), but some web documents are
not fully valid.
Some?? Most of them :(
When the parser finds an invalid tag, it raises an
exception. Then it seems impossible to
florent wrote:
I'm trying to parse HTML documents from the web, using the HTMLParser
class of the HTMLParser module (Python 2.3), but some web documents are
not fully valid.
From http://www.crummy.com/software/BeautifulSoup/:
You didn't write that awful page. You're just trying to