Re: [Python-Dev] HTMLParser and HTML5

2011-07-29 Thread Joao S. O. Bueno
On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel stefan...@behnel.de wrote:
 Brett Cannon, 28.07.2011 23:49:

 On Thu, Jul 28, 2011 at 11:25, Matt wrote:

 - What policies are in place for keeping parity with other HTML
 parsers (such as those in web browsers)?

 There aren't any beyond "it would be nice".
 [...]
 It's more of an issue of someone caring enough to do the coding work to
 bring the parser up to spec for HTML5 (or introduce new code to live
 beside the HTML4 parsing code).

 Which, given that html5lib readily exists, would likely be a lot more work
 than anyone who is interested in HTML5 handling would want to invest.

 I don't think we need a new HTML5 parsing implementation only to have it in
 the stdlib. That's the old sunny Java way of doing it.


I disagree.
Having proper HTML parsing out of the box is part of the "batteries
included" thing.
And it is not a matter of having HTML5: as stated in this thread, fixing it
for HTML5 will fix it for the HTML that exists in the real world.

Python _has_ to work for quick 30-50 line scripts that are deliverable
everywhere, not just have proper third-party libraries that work as part of
a huge project using buildout.


  js

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] HTMLParser and HTML5

2011-07-29 Thread Stefan Behnel

Joao S. O. Bueno, 29.07.2011 13:22:

 On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:

 Brett Cannon, 28.07.2011 23:49:

 On Thu, Jul 28, 2011 at 11:25, Matt wrote:

 - What policies are in place for keeping parity with other HTML
 parsers (such as those in web browsers)?

 There aren't any beyond "it would be nice".
 [...]
 It's more of an issue of someone caring enough to do the coding work to
 bring the parser up to spec for HTML5 (or introduce new code to live
 beside the HTML4 parsing code).

 Which, given that html5lib readily exists, would likely be a lot more work
 than anyone who is interested in HTML5 handling would want to invest.

 I don't think we need a new HTML5 parsing implementation only to have it in
 the stdlib. That's the old sunny Java way of doing it.

 I disagree.
 Having proper html parsing out of the box is part of the "batteries
 included" thing.


Well, you can easily prove me wrong by implementing this.

Stefan



Re: [Python-Dev] HTMLParser and HTML5

2011-07-29 Thread Glyph Lefkowitz

On Jul 29, 2011, at 7:46 AM, Stefan Behnel wrote:

 Joao S. O. Bueno, 29.07.2011 13:22:
 On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:
 Brett Cannon, 28.07.2011 23:49:
 
 On Thu, Jul 28, 2011 at 11:25, Matt wrote:
 
 - What policies are in place for keeping parity with other HTML
 parsers (such as those in web browsers)?
 
 There aren't any beyond "it would be nice".
 [...]
 It's more of an issue of someone caring enough to do the coding work to
 bring the parser up to spec for HTML5 (or introduce new code to live
 beside the HTML4 parsing code).
 
 Which, given that html5lib readily exists, would likely be a lot more work
 than anyone who is interested in HTML5 handling would want to invest.
 
 I don't think we need a new HTML5 parsing implementation only to have it in
 the stdlib. That's the old sunny Java way of doing it.
 
 I disagree.
 Having proper html parsing out of the box is part of the "batteries
 included" thing.
 
 Well, you can easily prove me wrong by implementing this.
 
 Stefan

Please don't implement this just to prove Stefan wrong :).

The thing to do, if you want html parsing in the stdlib, is to _incorporate_ 
html5lib, which is already a perfectly good, thoroughly tested HTML parser, and 
simply deprecate HTMLParser and friends.  Implementing a new parser would serve 
no purpose I can see.

-glyph



Re: [Python-Dev] HTMLParser and HTML5

2011-07-29 Thread Tres Seaver
On 07/29/2011 07:22 AM, Joao S. O. Bueno wrote:

 I disagree. Having proper html parsing out of the box is part of
 the "batteries included" thing. And it is not a matter of having
 html 5: as stated in this thread, fixing it for html5 will fix it
 for the html that exists in the real world.
 
 Python _has_ to work for quick 30-50 line scripts that are deliverable
 everywhere, not just have proper third-party libraries that work as
 part of a huge project using buildout.

Assuming it were merged today, that parser would only be available on
Python 3.3 and later:  how is that everywhere?  Having scripts that
work against html5lib (which *doesn't* need buildout to install, or even
setuptools) makes them portable to any version of Python supported by
the library (Python 2.3+, AFAICT).


Tres.


Re: [Python-Dev] HTMLParser and HTML5

2011-07-29 Thread Matt
On Fri, Jul 29, 2011 at 11:03 AM, Glyph Lefkowitz gl...@twistedmatrix.com wrote:


 On Jul 29, 2011, at 7:46 AM, Stefan Behnel wrote:

  Joao S. O. Bueno, 29.07.2011 13:22:
  On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:
  Brett Cannon, 28.07.2011 23:49:

  On Thu, Jul 28, 2011 at 11:25, Matt wrote:

  - What policies are in place for keeping parity with other HTML
  parsers (such as those in web browsers)?

  There aren't any beyond "it would be nice".
  [...]
  It's more of an issue of someone caring enough to do the coding work
  to bring the parser up to spec for HTML5 (or introduce new code to
  live beside the HTML4 parsing code).

  Which, given that html5lib readily exists, would likely be a lot more
  work than anyone who is interested in HTML5 handling would want to
  invest.

  I don't think we need a new HTML5 parsing implementation only to have
  it in the stdlib. That's the old sunny Java way of doing it.

  I disagree.
  Having proper html parsing out of the box is part of the "batteries
  included" thing.

  Well, you can easily prove me wrong by implementing this.


As far as the issue described in my initial message goes, there is a patch
and tests for the patch.



 Please don't implement this just to prove Stefan wrong :).

 The thing to do, if you want html parsing in the stdlib, is to
 _incorporate_ html5lib, which is already a perfectly good, thoroughly tested
 HTML parser, and simply deprecate HTMLParser and friends.  Implementing a
 new parser would serve no purpose I can see.


I don't see any real reason to drop a decent piece of code (HTMLParser, that
is) in favor of a third party library when only relatively minor updates are
needed to bring it up to speed with the latest spec. As far as structure
goes, HTML4 and HTML5 are practically identical. The differences between the
two that are applicable to HTMLParser involve the way the specs deal with
special element types and broken syntax. For what it's worth, browsers ignore
many of the rules HTML4 does define in favor of more lenient rules in the
spirit of Postel's Law. HTML5 simply standardized what browsers actually do.

Deprecating HTMLParser in favor of a newer/better/faster HTML library is a
bad thing for everybody that's already using HTMLParser, whether directly or
indirectly. html5lib does not have an interface compatible with HTMLParser,
so code would largely need to be rewritten from scratch to gain the benefits
of HTML5's support for broken code. Developers using HTMLParser would be
permanently stuck using a library that throws exceptions for perfectly valid
HTML. Keep in mind that these are solved problems: all of the thinking on
how to handle broken code has been done for us by the folks at the WHATWG.
It's simply a matter of updating our existing code with these new rules.

While I agree that there are merits to dropping support for the old code, it
does not solve the existing problems that folks are having right now (namely
incorrect parser output or exceptions). It would be better to patch the
obvious issues stemming from HTML4 support for now, and either leave anything
that goes beyond parity with browsers for a later time or implement it as an
opt-in feature (i.e., enabled by a parameter).

Matt


Re: [Python-Dev] HTMLParser and HTML5

2011-07-29 Thread Glyph Lefkowitz
On Jul 29, 2011, at 3:00 PM, Matt wrote:

 I don't see any real reason to drop a decent piece of code (HTMLParser, that 
 is) in favor of a third party library when only relatively minor updates are 
 needed to bring it up to speed with the latest spec.

I am not really one to throw stones here, as Twisted contains a lenient 
pseudo-XML parser which I still maintain: one which decidedly does not follow 
html5's requirements for dealing with invalid data, but instead relies on a 
bunch of ad-hoc guesses of my own.

My impression of HTML5 is that HTMLParser would require significant 
modifications and possibly a drastic re-architecture in order to really do 
HTML5 right; especially the parts that the html5lib authors claim make HTML5 
streaming-unfriendly, i.e. subtree reordering when encountering certain types 
of invalid data.

But if I'm wrong about that, and there are just a few spec updates and bugfixes 
that need to be applied, by all means, ignore my comment.

-glyph




Re: [Python-Dev] HTMLParser and HTML5

2011-07-29 Thread Brett Cannon
On Fri, Jul 29, 2011 at 11:31, Tres Seaver tsea...@palladion.com wrote:

 On 07/29/2011 07:22 AM, Joao S. O. Bueno wrote:

  I disagree. Having proper html parsing out of the box is part of
  the "batteries included" thing. And it is not a matter of having
  html 5: as stated in this thread, fixing it for html5 will fix it
  for the html that exists in the real world.
 
  Python _has_ to work for quick 30-50 line scripts that are deliverable
  everywhere, not just have proper third-party libraries that work as
  part of a huge project using buildout.

 Assuming it were merged today, that parser would only be available on
 Python 3.3 and later:  how is that everywhere?


Well, everywhere, eventually. This gets down to the usual philosophical
debate about what should (not) be in the stdlib: giving those who work under
strict rules about third-party code access to useful libraries, balanced
against the desire of those who want to keep the stdlib lean or to keep a
module's API from stagnating.


  Having scripts that
 work against html5lib (which *doesn't* need buildout to install, or even
 setuptools) makes them portable to any version of Python supported by
 the library (Python 2.3+, AFAICT).


If the library were brought in, they could probably continue to be portable
with just the addition of a try/except around the import line.
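
The import fallback described above is conventionally written with try/except ImportError. Below is a minimal sketch, not a definitive recipe. It assumes a hypothetical stdlib module name, `html.html5`, which does not exist and only stands in for wherever an incorporated html5lib might live; `import_first` is likewise an illustrative helper, and `html5lib` is the existing third-party package.

```python
import importlib


def import_first(*module_names):
    """Return the first importable module from module_names, or None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not available on this interpreter; try the next candidate.
            continue
    return None


# "html.html5" is a hypothetical stdlib location, used only for illustration;
# "html5lib" is the third-party package discussed in this thread.
parser_lib = import_first("html.html5", "html5lib")

if parser_lib is None:
    print("no HTML5 parser available; falling back to html.parser")
else:
    print("using", parser_lib.__name__)
```

In a short script the same idea is usually inlined as a `try: import ... / except ImportError: import ... as ...` pair at the top of the file; the helper form just makes the order of preference explicit.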

-Brett






Re: [Python-Dev] HTMLParser and HTML5

2011-07-29 Thread Brett Cannon
On Fri, Jul 29, 2011 at 13:16, Glyph Lefkowitz gl...@twistedmatrix.com wrote:

 On Jul 29, 2011, at 3:00 PM, Matt wrote:

 I don't see any real reason to drop a decent piece of code (HTMLParser,
 that is) in favor of a third party library when only relatively minor
 updates are needed to bring it up to speed with the latest spec.


 I am not really one to throw stones here, as Twisted contains a lenient
 pseudo-XML parser which I still maintain - one which decidedly does *not*
 agree with html5's requirements for dealing with invalid data, but just a
 bunch of ad-hoc guesses of my own.

 My impression of HTML5 is that HTMLParser would require significant
 modifications and possibly a drastic re-architecture in order to really do
 HTML5 right; especially the parts that the html5lib authors claim make
 HTML5 streaming-unfriendly, i.e. subtree reordering when encountering
 certain types of invalid data.


We could also have the code live side-by-side for a while (or indefinitely
if that was really desired) by bringing html5lib in as either a separate
module or having the relevant classes live in htmllib under different names.

But all of this is just hypothetical until someone decides to do the legwork
to actually make a proposal and get the coding done.

-Brett



 But if I'm wrong about that, and there are just a few spec updates and
 bugfixes that need to be applied, by all means, ignore my comment.

 -glyph





Re: [Python-Dev] HTMLParser and HTML5

2011-07-29 Thread Antoine Pitrou
On Fri, 29 Jul 2011 13:34:13 -0700
Brett Cannon br...@python.org wrote:
 On Fri, Jul 29, 2011 at 13:16, Glyph Lefkowitz gl...@twistedmatrix.com wrote:
 
  On Jul 29, 2011, at 3:00 PM, Matt wrote:
 
  I don't see any real reason to drop a decent piece of code (HTMLParser,
  that is) in favor of a third party library when only relatively minor
  updates are needed to bring it up to speed with the latest spec.
 
 
  I am not really one to throw stones here, as Twisted contains a lenient
  pseudo-XML parser which I still maintain - one which decidedly does *not*
  agree with html5's requirements for dealing with invalid data, but just a
  bunch of ad-hoc guesses of my own.
 
  My impression of HTML5 is that HTMLParser would require significant
  modifications and possibly a drastic re-architecture in order to really do
  HTML5 right; especially the parts that the html5lib authors claim make
  HTML5 streaming-unfriendly, i.e. subtree reordering when encountering
  certain types of invalid data.
 
 
 We could also have the code live side-by-side for a while (or indefinitely
 if that was really desired) by bringing html5lib in as either a separate
 module or having the relevant classes live in htmllib under different names.

Unless html5lib is better in some fundamental ways which are difficult
to fix in htmllib, I'm not sure there's any point in adding it to the
stdlib.

We don't really do users a service if we keep adding alternative APIs
for common functionality.

Regards

Antoine.




Re: [Python-Dev] HTMLParser and HTML5

2011-07-28 Thread Brett Cannon
On Thu, Jul 28, 2011 at 11:25, Matt mattba...@gmail.com wrote:

 Hello all,

 I wanted to ask a few questions and start a discussion about HTML5
 support within the HTMLParser class(es). Over on issue 670664, an
 inconsistency with the way browsers and the HTMLParser parse script
 and style tags was discovered. Currently, HTMLParser adheres strictly
 to the HTML4 standard, which says that these tags should exit CDATA
 mode when the start of *any* closing tag is found. No browsers, to my
 knowledge, have ever supported this (at least in the 21st century).
 Instead, all browsers implement the behavior described in the HTML5
 spec, which states that script tags should exit their raw text mode
 when the full closing tag for that element is encountered.
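
The difference can be seen with a small script. This is a minimal sketch that assumes a current Python 3, where the class lives in `html.parser` (at the time of this thread it was the Python 2 `HTMLParser` module, which behaved differently). The class name `ScriptCollector`, the helper `script_text`, and the sample markup are made up for illustration. Under the HTML5 rule, the `</b>` inside the script string does not end the script element; only the full `</script>` does.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Record start tags and the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so collect and join later.
        if self._in_script:
            self.script_chunks.append(data)


def script_text(markup):
    """Parse markup and return (start tags seen, joined script text)."""
    parser = ScriptCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags, "".join(parser.script_chunks)


SAMPLE = '<script>document.write("<b>hi</b>");</script>'
tags, text = script_text(SAMPLE)
print(tags)
print(text)
```

With a parser that follows the HTML5 rule, the only start tag reported is `script`, and the joined text is the complete JavaScript statement, including the `<b>hi</b>` markup inside the string literal.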

 The repercussions of adhering to the HTML4 standard in HTMLParser are
 somewhat serious: a good number of documents either produce incorrect
 parser output or raise exceptions for broken markup (which isn't actually
 broken). Libraries like Beautiful Soup (which depend on HTMLParser) are
 also affected, requiring the use of hacks just to get the document to
 parse at all.

 Rather than bore you all with another paragraph about how HTML4 is
 terrible, feel free to look at the issue
 (http://bugs.python.org/issue670664), which quite thoroughly outlines
 the pros and cons of this particular change. Any feedback/input on
 the proposed changes is welcome.

 So here are my questions:

 - What plans, if any, are there to support HTML5 parsing behaviors,
 since the HTML5 spec effectively describes current web browser
 behavior?


There are no specific plans that have been publicly brought up (to my
knowledge).


 - What policies are in place for keeping parity with other HTML
 parsers (such as those in web browsers)?


There aren't any beyond "it would be nice".



 Given the semi-backward-compatible nature of HTML5's syntax, this
 seems like a rather unique problem that could use some more
 discussion.


It's more of an issue of someone caring enough to do the coding work to
bring the parser up to spec for HTML5 (or introduce new code to live beside
the HTML4 parsing code). IOW there are no policies specifically about this
topic beyond the general desire to stay up-to-date with stable specs.


Re: [Python-Dev] HTMLParser and HTML5

2011-07-28 Thread Stefan Behnel

Brett Cannon, 28.07.2011 23:49:

 On Thu, Jul 28, 2011 at 11:25, Matt wrote:

 - What policies are in place for keeping parity with other HTML
 parsers (such as those in web browsers)?

 There aren't any beyond "it would be nice".
 [...]
 It's more of an issue of someone caring enough to do the coding work to
 bring the parser up to spec for HTML5 (or introduce new code to live beside
 the HTML4 parsing code).


Which, given that html5lib readily exists, would likely be a lot more work 
than anyone who is interested in HTML5 handling would want to invest.


I don't think we need a new HTML5 parsing implementation only to have it in 
the stdlib. That's the old sunny Java way of doing it.


Stefan
