Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread florent
 AFAIK not with HTMLParser or htmllib. You might try htmllib (if you
 haven't already) and see which parser is more forgiving.

Thanks, I'll try htmllib.
In any case, I found a solution: feeding data to the HTMLParser in
chunks extracted from the string with string.split() lets me
lose only one tag at a time when an exception is raised!
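
A minimal sketch of that chunked-feeding workaround, written for Python 3, where the old HTMLParser module lives in html.parser. Several things here are assumptions and not from the thread: the split on "<", the LinkCollector class, the feed_in_chunks helper, and the made-up "bad" tag that raises ValueError as a stand-in for the parse error Python 2.3 raised on invalid markup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical handler: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "bad":
            # Simulated failure, standing in for the exception that the
            # Python 2.3 parser raised on an invalid tag.
            raise ValueError("simulated parse error")
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def feed_in_chunks(parser, text):
    """Feed text one tag at a time; return how many chunks were dropped."""
    skipped = 0
    parts = text.split("<")
    # Re-attach the "<" that split() removed, so each chunk starts at a tag.
    chunks = [parts[0]] + ["<" + part for part in parts[1:]]
    for chunk in chunks:
        try:
            parser.feed(chunk)
        except ValueError:
            # reset() discards the buffered chunk that caused the error,
            # so only that one tag is lost and parsing can continue.
            parser.reset()
            skipped += 1
    parser.close()
    return skipped


doc = '<html><body><a href="one">1</a><bad><a href="two">2</a></body></html>'
collector = LinkCollector()
dropped = feed_in_chunks(collector, doc)
print(dropped, collector.links)
```

Because reset() only clears the parser's internal buffer, anything the handler already collected (here, the links list) survives the error.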
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread florent
  From http://www.crummy.com/software/BeautifulSoup/:
 
 You didn't write that awful page. You're just trying to get
 some data out of it. Right now, you don't really care what
 HTML is supposed to look like.
 
 Neither does this parser.

True, I just want to extract some data from html documents. But the
problem is the same: the parser loses its position in the string
when it encounters a bad tag.


Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread Benji York
florent wrote:
 True, I just want to extract some data from html documents. But the
 problem is the same: the parser loses its position in the string
 when it encounters a bad tag.

Are you saying that Beautiful Soup can't parse the HTML?  If so, I'm 
sure the author would like an example so he can fix it.
--
Benji York




Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread florent
 AFAIK not with HTMLParser or htmllib. You might try htmllib (if you
 haven't already) and see which parser is more forgiving.

You were right, the HTMLParser of htmllib is more permissive. He just 
ignores the bad tags !


Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread florent
 Are you saying that Beautiful Soup can't parse the HTML?  If so, I'm 
 sure the author would like an example so he can fix it.

I ended up using the htmllib module, which is more permissive than the
HTMLParser module when parsing bad html documents.
Anyway, where can I find the author's contact information?


Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread Steve M
You were right, the HTMLParser of htmllib is more permissive. He just
ignores the bad tags !

The HTMLParser on my distribution is a she. But then again, I am using
ActivePython on Windows...



Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread Benjamin Niemann
Steve M wrote:

You were right, the HTMLParser of htmllib is more permissive. He just
 ignores the bad tags !
 
 The HTMLParser on my distribution is a she. But then again, I am using
 ActivePython on Windows...

Although building parsers is for some strange reason one of my favourite
programming adventures, I do not have such a personal relationship with my
classes ;)

-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/


trying to parse non valid html documents with HTMLParser

2005-08-02 Thread florent
I'm trying to parse html documents from the web, using the HTMLParser
class of the HTMLParser module (python 2.3), but some web documents are
not fully valid. When the parser finds an invalid tag, it raises an
exception. Then it seems impossible to resume the parsing just after
where the exception was raised. I'd like to continue parsing an html
document even if an invalid tag was found. Is it possible to do this?

Here is a small invalid html document.
--
html
body
a href=bogus link/a
/body
/html
--


Re: trying to parse non valid html documents with HTMLParser

2005-08-02 Thread Benjamin Niemann
florent wrote:

 I'm trying to parse html documents from the web, using the HTMLParser
 class of the HTMLParser module (python 2.3), but some web documents are
 not fully valid.

Some?? Most of them :(

 When the parser finds an invalid tag, it raises an
 exception. Then it seems impossible to resume the parsing just after
 where the exception was raised. I'd like to continue parsing an html
 document even if an invalid tag was found. Is it possible to do this?

AFAIK not with HTMLParser or htmllib. You might try htmllib (if you
haven't already) and see which parser is more forgiving.

You might pipe the document through an external tool like HTML Tidy
http://www.w3.org/People/Raggett/tidy/ before you feed it into
HTMLParser.
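
A rough sketch of that pipe in Python 3, assuming the `tidy` executable is installed and on the PATH. The tidy_html helper, its fallback to the unmodified input when tidy is missing, and the particular option set are illustrative choices, not something prescribed in this thread.

```python
import shutil
import subprocess


def tidy_html(html, tidy_cmd="tidy"):
    """Return html cleaned up by HTML Tidy, or unchanged if tidy is missing."""
    if shutil.which(tidy_cmd) is None:
        # Assumption: with no tidy binary available, pass the input through.
        return html
    result = subprocess.run(
        [tidy_cmd, "-q", "-utf8",
         "--show-warnings", "no",
         "--force-output", "yes"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # Tidy exits with 1 on warnings and 2 on errors, so a non-zero
    # return code is not treated as fatal here.
    cleaned = result.stdout.decode("utf-8", errors="replace")
    return cleaned or html


cleaned = tidy_html('<html><body><a href="x">bogus link</body></html>')
# The cleaned string can then be passed to the parser's feed() method.
```

Running the cleanup as a separate step keeps the parser code unchanged; only the string handed to feed() differs.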


-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/


Re: trying to parse non valid html documents with HTMLParser

2005-08-02 Thread Benji York
florent wrote:
 I'm trying to parse html documents from the web, using the HTMLParser 
 class of the HTMLParser module (python 2.3), but some web documents are 
 not fully valid.

 From http://www.crummy.com/software/BeautifulSoup/:

 You didn't write that awful page. You're just trying to get
 some data out of it. Right now, you don't really care what
 HTML is supposed to look like.

 Neither does this parser.
--
Benji York
