Re: trying to parse non valid html documents with HTMLParser
> AFAIK not with HTMLParser or htmllib. You might try (if you haven't done so yet) htmllib and see which parser is more forgiving.
Thanks, I'll try htmllib. In any case, I found a workaround: feeding data to the HTMLParser in chunks, extracted from the string using string.split(), lets me lose only one tag at a time when an exception is raised!
-- http://mail.python.org/mailman/listinfo/python-list
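The chunk-feeding workaround described in this message could look roughly like the sketch below. It is an illustration, not the poster's actual code: the separator (">"), the names LinkCollector, StrictLinkCollector and feed_in_chunks, and the sample markup are all assumptions. It is written against Python 3's html.parser, which is far more lenient than the Python 2.3 HTMLParser module and no longer has HTMLParseError, so a subclass that raises on a made-up "bad" tag stands in for the strict parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical handler: records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


class StrictLinkCollector(LinkCollector):
    """Simulates the old strict parser by raising on a made-up <bad> tag."""

    def handle_starttag(self, tag, attrs):
        if tag == "bad":
            raise ValueError("simulated parse failure")
        super().handle_starttag(tag, attrs)


def feed_in_chunks(parser, text, sep=">"):
    """Feed text piece by piece; on an error, drop only the offending piece."""
    skipped = []
    pieces = text.split(sep)
    for i, piece in enumerate(pieces):
        # str.split() removes the separator, so restore it on all but the last piece.
        chunk = piece + sep if i < len(pieces) - 1 else piece
        try:
            parser.feed(chunk)
        except Exception:  # the Python 2 module raised HTMLParseError here
            skipped.append(chunk)
            parser.reset()  # discard the unparsed buffer and carry on
    parser.close()
    return skipped


parser = StrictLinkCollector()
skipped = feed_in_chunks(parser, '<a href="x">1</a><bad><a href="y">2</a>')
print(parser.links, skipped)
```

Because reset() only clears the parser's internal buffer, the links collected before the failure survive, and only the chunk that triggered the exception is lost.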
Re: trying to parse non valid html documents with HTMLParser
> From http://www.crummy.com/software/BeautifulSoup/: You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser.
True, I just want to extract some data from HTML documents. But the problem is the same: the parser loses the position he was at in the string when he encounters a bad tag.
Re: trying to parse non valid html documents with HTMLParser
florent wrote:
> True, I just want to extract some data from HTML documents. But the problem is the same: the parser loses the position he was at in the string when he encounters a bad tag.
Are you saying that Beautiful Soup can't parse the HTML? If so, I'm sure the author would like an example so he can fix it.
-- Benji York
Re: trying to parse non valid html documents with HTMLParser
> AFAIK not with HTMLParser or htmllib. You might try (if you haven't done so yet) htmllib and see which parser is more forgiving.
You were right, the HTMLParser of htmllib is more permissive. He just ignores the bad tags!
Re: trying to parse non valid html documents with HTMLParser
> Are you saying that Beautiful Soup can't parse the HTML? If so, I'm sure the author would like an example so he can fix it.
I ended up using the htmllib module, which is more permissive than the HTMLParser module when parsing bad HTML documents. Anyway, where can I find the author's contact information?
Re: trying to parse non valid html documents with HTMLParser
> You were right, the HTMLParser of htmllib is more permissive. He just ignores the bad tags!
The HTMLParser on my distribution is a she. But then again, I am using ActivePython on Windows...
Re: trying to parse non valid html documents with HTMLParser
Steve M wrote:
>> You were right, the HTMLParser of htmllib is more permissive. He just ignores the bad tags!
> The HTMLParser on my distribution is a she. But then again, I am using ActivePython on Windows...
Although building parsers is, for some strange reason, one of my favourite programming adventures, I do not have such a personal relationship with my classes ;)
-- Benjamin Niemann Email: pink at odahoda dot de WWW: http://www.odahoda.de/
trying to parse non valid html documents with HTMLParser
I'm trying to parse HTML documents from the web, using the HTMLParser class of the HTMLParser module (Python 2.3), but some web documents are not fully valid. When the parser finds an invalid tag, he raises an exception. Then it seems impossible to resume the parsing just after the point where the exception was raised. I'd like to continue parsing an HTML document even if an invalid tag was found. Is it possible to do this? Here is a small invalid HTML document:
-- html body a href=bogus link/a /body /html --
Re: trying to parse non valid html documents with HTMLParser
florent wrote:
> I'm trying to parse HTML documents from the web, using the HTMLParser class of the HTMLParser module (Python 2.3), but some web documents are not fully valid.
Some?? Most of them :(
> When the parser finds an invalid tag, he raises an exception. Then it seems impossible to resume the parsing just after the point where the exception was raised. I'd like to continue parsing an HTML document even if an invalid tag was found. Is it possible to do this?
AFAIK not with HTMLParser or htmllib. You might try (if you haven't done so yet) htmllib and see which parser is more forgiving. You might pipe the document through an external tool like HTML Tidy http://www.w3.org/People/Raggett/tidy/ before you feed it into HTMLParser.
-- Benjamin Niemann Email: pink at odahoda dot de WWW: http://www.odahoda.de/
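The Tidy suggestion above can be sketched as a small pre-cleaning step. This is only an illustration under stated assumptions: it uses Python 3's html.parser rather than the Python 2.3 module, it assumes a `tidy` executable may be on the PATH (and falls back to the raw markup if it is missing or prints nothing), and the names tidy_markup, TagLogger and parse_via_tidy are made up for the example.

```python
import shutil
import subprocess
from html.parser import HTMLParser


def tidy_markup(markup, tidy_cmd="tidy"):
    """Return markup cleaned by HTML Tidy, or None if that is not possible."""
    if shutil.which(tidy_cmd) is None:
        return None  # Tidy is not installed on this machine
    result = subprocess.run(
        # -q: quiet, -asxhtml: emit XHTML, --force-output: write output despite errors
        [tidy_cmd, "-q", "-asxhtml", "--show-warnings", "no", "--force-output", "yes"],
        input=markup,
        capture_output=True,
        text=True,
    )
    # Tidy exits non-zero on warnings and errors, so only the output is checked.
    return result.stdout or None


class TagLogger(HTMLParser):
    """Hypothetical handler: remembers the name of every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_via_tidy(markup):
    """Clean the markup with Tidy when available, then parse it."""
    cleaned = tidy_markup(markup)
    parser = TagLogger()
    parser.feed(cleaned if cleaned is not None else markup)
    parser.close()
    return parser.tags


print(parse_via_tidy("<p>unclosed <b>bold"))
```

When Tidy is present, the parser sees a complete document, so the tag list also contains the html, head and body elements that Tidy adds.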
Re: trying to parse non valid html documents with HTMLParser
florent wrote:
> I'm trying to parse HTML documents from the web, using the HTMLParser class of the HTMLParser module (Python 2.3), but some web documents are not fully valid.
From http://www.crummy.com/software/BeautifulSoup/:
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser.
-- Benji York