Agreed that the web sites are probably broken. Try running the HTML
through HTMLTidy (http://tidy.sourceforge.net/). Doing that has allowed
me to parse pages where I had problems such as yours.
I have also had luck with BeautifulSoup, which also includes a tidy
function in it.
Just Another Victim of t
Fredrik Lundh wrote:
> > Except it appears to be buggy or, at least, not very robust. There are
> > websites for which it falsely terminates early in the parsing.
>
> which probably means that the sites are broken. the amount of broken
> HTML on the net is staggering, as is the amount of
> Except it appears to be buggy or, at least, not very robust. There are
> websites for which it falsely terminates early in the parsing.
which probably means that the sites are broken. the amount of broken
HTML on the net is staggering, as is the amount of code in a typical web
browser
"Just Another Victim of the Ambient Morality" <[EMAIL PROTECTED]> wrote
in message news:[EMAIL PROTECTED]
>
>Okay, I think I found what I'm looking for in HTMLParser in the
> HTMLParser module.
Except it appears to be buggy or, at least, not very robust. There are
websites for which i
"Just Another Victim of the Ambient Morality" <[EMAIL PROTECTED]> wrote
in message news:[EMAIL PROTECTED]
>I'm trying to parse HTML in a very generic way.
>So far, I'm using SGMLParser in the sgmllib module. The problem is
> that it forces you to parse very specific tags through object
I'm trying to parse HTML in a very generic way.
So far, I'm using SGMLParser in the sgmllib module. The problem is that
it forces you to parse very specific tags through object methods like
start_a(), start_p() and the like, forcing you to know exactly which tags
you want to handle. I
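
For what it's worth, here is a minimal sketch of the generic approach, assuming the HTMLParser class mentioned elsewhere in this thread is acceptable. Its handle_starttag() method is called for every opening tag, so no per-tag start_a()/start_p() methods are needed. The names TagCollector and collect_tags are hypothetical, made up for this illustration, and the import uses the Python 3 module name (html.parser; the module was called HTMLParser in Python 2).

```python
from html.parser import HTMLParser  # Python 2 name: HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag, whatever its name."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Invoked once per opening tag; attrs is a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


def collect_tags(html_text):
    """Feed the text to the parser and return the (tag, attributes) pairs seen."""
    parser = TagCollector()
    parser.feed(html_text)
    parser.close()
    return parser.tags
```

Calling collect_tags() on a page returns every opening tag in document order, so the caller can decide afterwards which ones matter. This does not by itself address the robustness complaint about badly broken pages.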