Re: Beautiful Soup - close tags more promptly?

2022-10-25 Thread Chris Angelico
On Wed, 26 Oct 2022 at 04:59, Tim Delaney wrote: > > On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote: >> >> >> Ah, cool. Thanks. I'm not entirely sure of the various advantages and >> disadvantages of the different parsers; is there a tabulation >> anywhere, or at least a list of recommendatio

Re: Beautiful Soup - close tags more promptly?

2022-10-25 Thread Tim Delaney
On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote: > > Ah, cool. Thanks. I'm not entirely sure of the various advantages and > disadvantages of the different parsers; is there a tabulation > anywhere, or at least a list of recommendations on choosing a suitable > parser? > Coming to this a bit

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Tue, 25 Oct 2022 at 09:34, Peter J. Holzer wrote: > > One thing I find quite interesting, though, is the way that browsers > > *differ* in the face of bad nesting of tags. Recently I was struggling > > to figure out a problem with an HTML form, and eventually found that > > there was a spurious

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-25 06:56:58 +1100, Chris Angelico wrote: > On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote: > > There may be several reasons: > > > > * Historically, some browsers differed in which end tags were actually > > optional. Since (AFAIK) no mainstream browser ever implemented a real >

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote: > There may be several reasons: > > * Historically, some browsers differed in which end tags were actually > optional. Since (AFAIK) no mainstream browser ever implemented a real > SGML parser (they were always "tag soup" parsers with lots o

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven
Jon Ribbens via Python-list wrote on 24/10/2022 at 19:01: On 2022-10-24, Chris Angelico wrote: > On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list wrote: >> Adding in the omitted , , , , and >> would make no difference and there's no particular reason to recommend >> doing so as fa

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-25 03:09:33 +1100, Chris Angelico wrote: > On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list > wrote: > > On 2022-10-24, Chris Angelico wrote: > > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: > > >> Yes, I got that. What I wanted to say was that this is indeed a bug

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Jon Ribbens via Python-list
On 2022-10-24, Chris Angelico wrote: > On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list > wrote: >> >> On 2022-10-24, Chris Angelico wrote: >> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: >> >> Yes, I got that. What I wanted to say was that this is indeed a bug in >> >> html.p

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list wrote: > > On 2022-10-24, Chris Angelico wrote: > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: > >> Yes, I got that. What I wanted to say was that this is indeed a bug in > >> html.parser and not an error (or sloppiness, as you

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Jon Ribbens via Python-list
On 2022-10-24, Chris Angelico wrote: > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: >> Yes, I got that. What I wanted to say was that this is indeed a bug in >> html.parser and not an error (or sloppiness, as you called it) in the >> input or ambiguity in the HTML standard. > > I describe

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: > > On 2022-10-24 21:56:13 +1100, Chris Angelico wrote: > > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote: > > > Ron has already noted that the lxml and html5 parser do the right thing, > > > so just for the record: > > > > > > The HTML f

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-24 21:56:13 +1100, Chris Angelico wrote: > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote: > > Ron has already noted that the lxml and html5 parser do the right thing, > > so just for the record: > > > > The HTML fragment above is well-formed and contains a number of li > > element
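
A rough sketch of the point being argued here, using only the standard library. Python's html.parser.HTMLParser reports an end tag only when one is literally present in the input, so for a list written without </li> it never announces that an li has ended. The LiCloser class and li_events helper below are hypothetical names made up for this illustration; they are not how Beautiful Soup, lxml or html5lib handle optional end tags, and the sketch only covers li inside a single, non-nested ol or ul.

```python
from html.parser import HTMLParser


class LiCloser(HTMLParser):
    """Record tag events, synthesizing the </li> that HTML allows authors to omit.

    Hypothetical helper for illustration only: it handles <li> in flat lists
    and ignores nesting, attributes and every other optional end tag.
    """

    def __init__(self):
        super().__init__()
        self.events = []      # ("start" | "end" | "text", value) tuples
        self.li_open = False  # True while an <li> is waiting to be closed

    def _close_li(self):
        # Emit an end event for the pending <li>, if there is one.
        if self.li_open:
            self.events.append(("end", "li"))
            self.li_open = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._close_li()  # a new <li> implies the previous one ended
            self.li_open = True
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        if tag == "li":
            self._close_li()  # explicit </li>; a stray one is ignored
            return
        if tag in ("ol", "ul"):
            self._close_li()  # the end of the list closes the last item
        self.events.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.events.append(("text", text))


def li_events(html):
    """Parse an HTML fragment and return the recorded event list."""
    parser = LiCloser()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in li_events("<ol><li>one<li>two</ol>"):
        print(event)
```

With this handler, the fragment with omitted end tags and the fully closed version produce the same event sequence, which is the sibling structure the li elements are supposed to have.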

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote: > Ron has already noted that the lxml and html5 parser do the right thing, > so just for the record: > > The HTML fragment above is well-formed and contains a number of li > elements at the same level directly below the ol element, not lots of >

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-24 12:32:11 +0200, Peter J. Holzer wrote: > Ron has already noted that the lxml and html5 parser do the right thing, Oops, sorry. That was Roel. hp

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-24 13:29:13 +1100, Chris Angelico wrote: > Parsing ancient HTML files is something Beautiful Soup is normally > great at. But I've run into a small problem, caused by this sort of > sloppy HTML: > > from bs4 import BeautifulSoup > # See: https://gsarchive.net/gilbert/plays/princess/tenn

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven
(Oops, accidentally only sent to Chris instead of to the list) On 24/10/2022 at 10:02, Chris Angelico wrote: On Mon, 24 Oct 2022 at 18:43, Roel Schroeven wrote: > Using html5lib (install package html5lib) instead of html.parser seems > to do the trick: it inserts </li> right before the next <li>, and

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven
On 24/10/2022 at 9:42, Roel Schroeven wrote: Using html5lib (install package html5lib) instead of html.parser seems to do the trick: it inserts </li> right before the next <li>, and one before the closing </ol>. On my system the same happens when I don't specify a parser, but IIRC that's a bit fragile beca

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven wrote: > > Op 24/10/2022 om 4:29 schreef Chris Angelico: > > Parsing ancient HTML files is something Beautiful Soup is normally > > great at. But I've run into a small problem, caused by this sort of > > sloppy HTML: > > > > from bs4 import BeautifulSou

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven
On 24/10/2022 at 4:29, Chris Angelico wrote: Parsing ancient HTML files is something Beautiful Soup is normally great at. But I've run into a small problem, caused by this sort of sloppy HTML: from bs4 import BeautifulSoup # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm

Beautiful Soup - close tags more promptly?

2022-10-23 Thread Chris Angelico
Parsing ancient HTML files is something Beautiful Soup is normally great at. But I've run into a small problem, caused by this sort of sloppy HTML: from bs4 import BeautifulSoup # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm blob = b""" 'THERE sinks the nebulous star we c