On Wed, 26 Oct 2022 at 04:59, Tim Delaney wrote:
>
> On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote:
>>
>>
>> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
>> disadvantages of the different parsers; is there a tabulation
>> anywhere, or at least a list of recommendatio
On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote:
>
> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
> disadvantages of the different parsers; is there a tabulation
> anywhere, or at least a list of recommendations on choosing a suitable
> parser?
>
Coming to this a bit
On Tue, 25 Oct 2022 at 09:34, Peter J. Holzer wrote:
> > One thing I find quite interesting, though, is the way that browsers
> > *differ* in the face of bad nesting of tags. Recently I was struggling
> > to figure out a problem with an HTML form, and eventually found that
> > there was a spurious
On 2022-10-25 06:56:58 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote:
> > There may be several reasons:
> >
> > * Historically, some browsers differed in which end tags were actually
> > optional. Since (AFAIK) no mainstream browser ever implemented a real
>
On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote:
> There may be several reasons:
>
> * Historically, some browsers differed in which end tags were actually
> optional. Since (AFAIK) no mainstream browser ever implemented a real
> SGML parser (they were always "tag soup" parsers with lots o
Jon Ribbens via Python-list wrote on 24/10/2022 at 19:01:
On 2022-10-24, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
wrote:
>> Adding in the omitted , , , , and
>> would make no difference and there's no particular reason to recommend
>> doing so as fa
On 2022-10-25 03:09:33 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
> wrote:
> > On 2022-10-24, Chris Angelico wrote:
> > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
> > >> Yes, I got that. What I wanted to say was that this is indeed a bug
On 2022-10-24, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
> wrote:
>>
>> On 2022-10-24, Chris Angelico wrote:
>> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
>> >> Yes, I got that. What I wanted to say was that this is indeed a bug in
>> >> html.p
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
wrote:
>
> On 2022-10-24, Chris Angelico wrote:
> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
> >> Yes, I got that. What I wanted to say was that this is indeed a bug in
> >> html.parser and not an error (or sloppiness, as you
On 2022-10-24, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
>> Yes, I got that. What I wanted to say was that this is indeed a bug in
>> html.parser and not an error (or sloppiness, as you called it) in the
>> input or ambiguity in the HTML standard.
>
> I describe
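
For anyone following along, here is a minimal sketch of what the standard library's html.parser reports for a list with omitted end tags. The markup is a made-up stand-in, not the original blob, and EventLogger and tag_events are illustrative names. As far as I can tell the tokenizer reports only the tags that are literally present, so deciding where an li ends is left to whatever builds the tree on top of it.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the start/end tag events that html.parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Return the list of (kind, tag) events seen while parsing markup."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Hypothetical stand-in for the sloppy HTML discussed in the thread.
SAMPLE = "<ol><li>first<li>second</ol>"

# No ("end", "li") event appears: the omitted end tags are not synthesized.
print(tag_events(SAMPLE))
```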
On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
>
> On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote:
> > > Ron has already noted that the lxml and html5 parser do the right thing,
> > > so just for the record:
> > >
> > > The HTML f
On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote:
> > Ron has already noted that the lxml and html5 parser do the right thing,
> > so just for the record:
> >
> > The HTML fragment above is well-formed and contains a number of li
> > element
On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote:
> Ron has already noted that the lxml and html5 parser do the right thing,
> so just for the record:
>
> The HTML fragment above is well-formed and contains a number of li
> elements at the same level directly below the ol element, not lots of
>
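
To illustrate the point about the li elements being siblings, here is a rough sketch of the "a new li closes the previous one" rule, built on the stdlib tokenizer. It is a simplification made up for this example (ListDepth and li_depths are illustrative names), it only knows about li, ol and ul plus a few void elements, and it is not how any of the real parsers are implemented.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; only a small assumed subset.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class ListDepth(HTMLParser):
    """Track open elements and record the nesting depth of every <li>."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements, outermost first
        self.depths = []     # depth of each <li>, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag == "li":
            # Implicitly close an open <li> in the same list, but never
            # look past the enclosing <ol>/<ul>.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i] in ("ol", "ul"):
                    break
                if self.stack[i] == "li":
                    del self.stack[i:]
                    break
        self.stack.append(tag)
        if tag == "li":
            self.depths.append(len(self.stack))

    def handle_endtag(self, tag):
        # Close the nearest matching open element and everything inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] == tag:
                del self.stack[i:]
                break


def li_depths(markup):
    """Return the depth of each <li> after applying the implied-close rule."""
    parser = ListDepth()
    parser.feed(markup)
    parser.close()
    return parser.depths


# All three items end up at the same depth, directly below the <ol>.
print(li_depths("<ol><li>a<li>b<li>c</ol>"))
```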
On 2022-10-24 12:32:11 +0200, Peter J. Holzer wrote:
> Ron has already noted that the lxml and html5 parser do the right thing,
  ^^^
Oops, sorry. That was Roel.
hp
--
_ | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| | | h...@hjp.
On 2022-10-24 13:29:13 +1100, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
>
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tenn
(Oops, accidentally only sent to Chris instead of to the list)
On 24/10/2022 at 10:02, Chris Angelico wrote:
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven
wrote:
> Using html5lib (install package html5lib) instead of html.parser seems
> to do the trick: it inserts </li> right before the next <li>, and
On 24/10/2022 at 9:42, Roel Schroeven wrote:
Using html5lib (install package html5lib) instead of html.parser seems
to do the trick: it inserts </li> right before the next <li>, and one
before the closing </ol>. On my system the same happens when I don't
specify a parser, but IIRC that's a bit fragile beca
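
A quick way to compare parsers yourself, as a sketch: name the parser explicitly as the second argument to BeautifulSoup and count how many li elements are direct children of the ol. The helper direct_li_count and the sample markup are made up for this example, and the function returns None when bs4 or the requested parser is not installed.

```python
def direct_li_count(markup, parser):
    """Count <li> elements that are direct children of the first <ol>.

    Returns None if Beautiful Soup or the requested parser is unavailable.
    """
    try:
        from bs4 import BeautifulSoup, FeatureNotFound
    except ImportError:
        return None
    try:
        # Naming the parser explicitly avoids depending on whichever one
        # happens to be installed on the current system.
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    ol = soup.find("ol")
    if ol is None:
        return 0
    return len(ol.find_all("li", recursive=False))


# Hypothetical stand-in for the sloppy HTML discussed in the thread.
SAMPLE = "<ol><li>first<li>second<li>third</ol>"

for name in ("html.parser", "html5lib"):
    print(name, direct_li_count(SAMPLE, name))
```

If the count for one parser is lower than the number of items, that parser has nested the later items inside the earlier ones.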
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven wrote:
>
> On 24/10/2022 at 4:29, Chris Angelico wrote:
> > Parsing ancient HTML files is something Beautiful Soup is normally
> > great at. But I've run into a small problem, caused by this sort of
> > sloppy HTML:
> >
> > from bs4 import BeautifulSou
On 24/10/2022 at 4:29, Chris Angelico wrote:
Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:
from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:
from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
'THERE sinks the nebulous star we c