Re: [Python-Dev] HTMLParser and HTML5
On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel stefan...@behnel.de wrote:
> Brett Cannon, 28.07.2011 23:49:
>> On Thu, Jul 28, 2011 at 11:25, Matt wrote:
>>> - What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
>>
>> There aren't any beyond "it would be nice". [...]
>> It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).
>
> Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest. I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it.

I disagree. Having proper HTML parsing out of the box is part of the "batteries included" thing. And it is not a matter of having HTML5: as stated in this thread, fixing it for HTML5 will fix it for the HTML that exists in the real world.

Python _has_ to work for quick 30-50 line scripts deliverable everywhere, not just have proper third-party libraries that can work as part of a huge project using buildout.

js

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] HTMLParser and HTML5
Joao S. O. Bueno, 29.07.2011 13:22:
> On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:
>> Brett Cannon, 28.07.2011 23:49:
>>> On Thu, Jul 28, 2011 at 11:25, Matt wrote:
>>>> - What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
>>>
>>> There aren't any beyond "it would be nice". [...]
>>> It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).
>>
>> Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest. I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it.
>
> I disagree. Having proper HTML parsing out of the box is part of the "batteries included" thing.

Well, you can easily prove me wrong by implementing this.

Stefan
Re: [Python-Dev] HTMLParser and HTML5
On Jul 29, 2011, at 7:46 AM, Stefan Behnel wrote:
> Joao S. O. Bueno, 29.07.2011 13:22:
>> On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:
>>> Brett Cannon, 28.07.2011 23:49:
>>>> On Thu, Jul 28, 2011 at 11:25, Matt wrote:
>>>>> - What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
>>>>
>>>> There aren't any beyond "it would be nice". [...]
>>>> It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).
>>>
>>> Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest. I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it.
>>
>> I disagree. Having proper HTML parsing out of the box is part of the "batteries included" thing.
>
> Well, you can easily prove me wrong by implementing this.
>
> Stefan

Please don't implement this just to prove Stefan wrong :).

The thing to do, if you want HTML parsing in the stdlib, is to _incorporate_ html5lib, which is already a perfectly good, thoroughly tested HTML parser, and simply deprecate HTMLParser and friends. Implementing a new parser would serve no purpose I can see.

-glyph
Re: [Python-Dev] HTMLParser and HTML5
On 07/29/2011 07:22 AM, Joao S. O. Bueno wrote:
> I disagree. Having proper HTML parsing out of the box is part of the "batteries included" thing. And it is not a matter of having HTML5: as stated in this thread, fixing it for HTML5 will fix it for the HTML that exists in the real world.
>
> Python _has_ to work for quick 30-50 line scripts deliverable everywhere, not just have proper third-party libraries that can work as part of a huge project using buildout.

Assuming it were merged today, that parser would only be available on Python 3.3 and later: how is that "everywhere"?

Having scripts that work against html5lib (which *doesn't* need buildout to install, or even setuptools) makes them portable to any version of Python supported by the library (Python 2.3+, AFAICT).

Tres.
--
Tres Seaver +1 540-429-0999 tsea...@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] HTMLParser and HTML5
On Fri, Jul 29, 2011 at 11:03 AM, Glyph Lefkowitz gl...@twistedmatrix.com wrote:
> On Jul 29, 2011, at 7:46 AM, Stefan Behnel wrote:
>> Joao S. O. Bueno, 29.07.2011 13:22:
>>> On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:
>>>> Brett Cannon, 28.07.2011 23:49:
>>>>> On Thu, Jul 28, 2011 at 11:25, Matt wrote:
>>>>>> - What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
>>>>>
>>>>> There aren't any beyond "it would be nice". [...]
>>>>> It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).
>>>>
>>>> Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest. I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it.
>>>
>>> I disagree. Having proper HTML parsing out of the box is part of the "batteries included" thing.
>>
>> Well, you can easily prove me wrong by implementing this.

As far as the issue described in my initial message goes, there is a patch and tests for the patch.

> Please don't implement this just to prove Stefan wrong :).
>
> The thing to do, if you want HTML parsing in the stdlib, is to _incorporate_ html5lib, which is already a perfectly good, thoroughly tested HTML parser, and simply deprecate HTMLParser and friends. Implementing a new parser would serve no purpose I can see.

I don't see any real reason to drop a decent piece of code (HTMLParser, that is) in favor of a third-party library when only relatively minor updates are needed to bring it up to speed with the latest spec. As far as structure goes, HTML4 and HTML5 are practically identical. The differences between the two that are applicable to HTMLParser involve the way the specs deal with special element types and broken syntax.

For what it's worth, the rules HTML4 does define are (in many cases) ignored by browsers in favor of more modern, Postel's Law-agreeable rules. HTML5 simply standardized what browsers actually do.

Deprecating HTMLParser in favor of a newer/better/faster HTML library is a bad thing for everybody who is already using HTMLParser, whether directly or indirectly. html5lib does not have an interface compatible with HTMLParser, so code would largely need to be rewritten from scratch to gain the benefits of HTML5's support for broken code. Developers who stay on HTMLParser would be permanently stuck using a library that throws exceptions for perfectly valid HTML.

Keep in mind that these are solved problems: all of the thinking on how to handle broken code has been done for us by the folks at the WHATWG. It's simply a matter of updating our existing code with these new rules.

While I agree that there are merits to dropping support for the old code, it does not solve the existing problems that folks are having right now (namely incorrect parser output or exceptions). It would be better to patch the obvious issues stemming from HTML4 support for now, leaving anything that goes beyond parity with browsers for a later time or implementing it as an opt-in feature (i.e., enabled by a parameter).

Matt
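As an illustration of the two raw-text rules discussed in this thread, and of what an opt-in switch could look like, here is a minimal sketch. The function name `rawtext_end` and its `html5` parameter are hypothetical and are not part of HTMLParser; the regular expressions are a simplified reading of the HTML4 and HTML5 rules, not the patch attached to the issue.

```python
import re

# HTML4 rule as described in the thread: raw text ends at the start of
# *any* closing tag ("</" followed by a letter).
HTML4_END = re.compile(r"</[A-Za-z]")


def rawtext_end(data, tag, html5=False):
    """Return the index where raw-text mode would end, or -1 if it never does.

    Hypothetical helper for illustration; `html5` is the opt-in switch.
    """
    if html5:
        # HTML5 rule (simplified): only the element's own closing tag counts.
        pattern = re.compile(
            r"</%s(?=[\t\n\f\r />])" % re.escape(tag), re.IGNORECASE
        )
    else:
        pattern = HTML4_END
    match = pattern.search(data)
    return match.start() if match else -1


body = 'document.write("<b>hi</b>");</script>'
print(rawtext_end(body, "script"))              # stops early, at "</b>"
print(rawtext_end(body, "script", html5=True))  # stops at "</script>"
```

Defaulting `html5` to False mirrors the "opt-in" suggestion: existing callers keep the old behavior unless they ask for the new rule.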
Re: [Python-Dev] HTMLParser and HTML5
On Jul 29, 2011, at 3:00 PM, Matt wrote:
> I don't see any real reason to drop a decent piece of code (HTMLParser, that is) in favor of a third-party library when only relatively minor updates are needed to bring it up to speed with the latest spec.

I am not really one to throw stones here, as Twisted contains a lenient pseudo-XML parser which I still maintain - one which decidedly does not agree with HTML5's requirements for dealing with invalid data, but is just a bunch of ad-hoc guesses of my own.

My impression of HTML5 is that HTMLParser would require significant modifications and possibly a drastic re-architecture in order to really do HTML5 right; especially the parts that the html5lib authors claim make HTML5 streaming-unfriendly, i.e. subtree reordering when encountering certain types of invalid data.

But if I'm wrong about that, and there are just a few spec updates and bugfixes that need to be applied, by all means, ignore my comment.

-glyph
Re: [Python-Dev] HTMLParser and HTML5
On Fri, Jul 29, 2011 at 11:31, Tres Seaver tsea...@palladion.com wrote:
> On 07/29/2011 07:22 AM, Joao S. O. Bueno wrote:
>> I disagree. Having proper HTML parsing out of the box is part of the "batteries included" thing. And it is not a matter of having HTML5: as stated in this thread, fixing it for HTML5 will fix it for the HTML that exists in the real world.
>>
>> Python _has_ to work for quick 30-50 line scripts deliverable everywhere, not just have proper third-party libraries that can work as part of a huge project using buildout.
>
> Assuming it were merged today, that parser would only be available on Python 3.3 and later: how is that "everywhere"?

Well, everywhere, eventually. This gets down to the usual philosophical debate about what should (not) be in the stdlib: giving those who have strict restrictions on third-party code access to useful libraries, while balancing that against the desire of those who want to keep the stdlib lean or avoid freezing a module's API.

> Having scripts that work against html5lib (which *doesn't* need buildout to install, or even setuptools) makes them portable to any version of Python supported by the library (Python 2.3+, AFAICT).

If the library were brought in, they could probably continue to be portable with possibly just the addition of a try/except on the import line.

-Brett

> Tres.
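The guarded-import pattern mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses today's module names (third-party `html5lib` if it happens to be installed, otherwise the stdlib `html.parser`), which is the reverse of the scenario in the message, where a stdlib copy would be tried first. The helper names `_TextCollector` and `extract_text` are made up for the example.

```python
try:
    import html5lib  # third-party; used only if it happens to be installed
except ImportError:
    html5lib = None

from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Fallback: gather character data with the stdlib parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Return the text content of `markup` using whichever parser is available."""
    if html5lib is not None:
        # html5lib.parse returns an ElementTree element by default.
        return "".join(html5lib.parse(markup).itertext())
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


print(extract_text("<p>Hello <b>world</b></p>"))
```

The rest of the script only ever calls `extract_text`, so the choice of parser stays confined to the import line and one function.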
Re: [Python-Dev] HTMLParser and HTML5
On Fri, Jul 29, 2011 at 13:16, Glyph Lefkowitz gl...@twistedmatrix.com wrote:
> On Jul 29, 2011, at 3:00 PM, Matt wrote:
>> I don't see any real reason to drop a decent piece of code (HTMLParser, that is) in favor of a third-party library when only relatively minor updates are needed to bring it up to speed with the latest spec.
>
> I am not really one to throw stones here, as Twisted contains a lenient pseudo-XML parser which I still maintain - one which decidedly does *not* agree with HTML5's requirements for dealing with invalid data, but is just a bunch of ad-hoc guesses of my own.
>
> My impression of HTML5 is that HTMLParser would require significant modifications and possibly a drastic re-architecture in order to really do HTML5 right; especially the parts that the html5lib authors claim make HTML5 streaming-unfriendly, i.e. subtree reordering when encountering certain types of invalid data.

We could also have the code live side-by-side for a while (or indefinitely if that was really desired) by bringing html5lib in as either a separate module or having the relevant classes live in htmllib under different names.

But all of this is just hypothetical until someone decides to do the legwork to actually make a proposal and get the coding done.

-Brett

> But if I'm wrong about that, and there are just a few spec updates and bugfixes that need to be applied, by all means, ignore my comment.
>
> -glyph
Re: [Python-Dev] HTMLParser and HTML5
On Fri, 29 Jul 2011 13:34:13 -0700 Brett Cannon br...@python.org wrote:
> On Fri, Jul 29, 2011 at 13:16, Glyph Lefkowitz gl...@twistedmatrix.com wrote:
>> On Jul 29, 2011, at 3:00 PM, Matt wrote:
>>> I don't see any real reason to drop a decent piece of code (HTMLParser, that is) in favor of a third-party library when only relatively minor updates are needed to bring it up to speed with the latest spec.
>>
>> I am not really one to throw stones here, as Twisted contains a lenient pseudo-XML parser which I still maintain - one which decidedly does *not* agree with HTML5's requirements for dealing with invalid data, but is just a bunch of ad-hoc guesses of my own.
>>
>> My impression of HTML5 is that HTMLParser would require significant modifications and possibly a drastic re-architecture in order to really do HTML5 right; especially the parts that the html5lib authors claim make HTML5 streaming-unfriendly, i.e. subtree reordering when encountering certain types of invalid data.
>
> We could also have the code live side-by-side for a while (or indefinitely if that was really desired) by bringing html5lib in as either a separate module or having the relevant classes live in htmllib under different names.

Unless html5lib is better in some fundamental ways which are difficult to fix in htmllib, I'm not sure there's any point in adding it to the stdlib. We don't really do users a service if we keep adding alternative APIs for common functionality.

Regards

Antoine.
Re: [Python-Dev] HTMLParser and HTML5
On Thu, Jul 28, 2011 at 11:25, Matt mattba...@gmail.com wrote:
> Hello all,
>
> I wanted to ask a few questions and start a discussion about HTML5 support within the HTMLParser class(es).
>
> Over on issue 670664, an inconsistency between the way browsers and HTMLParser parse script and style tags was discovered. Currently, HTMLParser adheres strictly to the HTML4 standard, which says that these tags should exit CDATA mode when the start of *any* closing tag is found. No browsers, to my knowledge, have ever supported this (at least in the 21st century). Instead, all browsers implement the behavior described in the HTML5 spec, which states that script tags should exit their raw text mode when the full closing tag for that element is encountered.
>
> The repercussions of adhering to the HTML4 standard in HTMLParser are somewhat serious: a good number of documents will either be parsed incorrectly or trigger exceptions for markup that isn't actually broken. Libraries like Beautiful Soup (which depend on HTMLParser) are also affected, requiring the use of hacks just to get the document to parse at all.
>
> Rather than bore you all with another paragraph about how HTML4 is terrible, feel free to look at the issue (http://bugs.python.org/issue670664), which quite thoroughly outlines the pros and cons of this particular change. Any feedback/input on the proposed changes is welcome.
>
> So here are my questions:
>
> - What plans, if any, are there to support HTML5 parsing behaviors, since the HTML5 spec effectively describes current web browser behavior?

There are no specific plans that have been publicly brought up (to my knowledge).

> - What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?

There aren't any beyond "it would be nice".

> Given the semi-backward-compatible nature of HTML5's syntax, this seems like a rather unique problem that could use some more discussion.

It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code). IOW there are no policies specifically about this topic beyond the general desire to stay up-to-date with stable specs.
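The script-tag behavior described in the original message can be checked with a short script. This is a minimal sketch, assuming a current Python 3, where the module lives at `html.parser` and follows the HTML5-style rule for script content; it does not reproduce the older behavior the thread is about. The class name `EventRecorder` is illustrative.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record start tags and the text seen inside <script>."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            # Data may arrive in several chunks, so collect and join later.
            self.script_text.append(data)


source = 'document.write("<b>hi</b>");'
recorder = EventRecorder()
recorder.feed("<script>" + source + "</script>")
recorder.close()

print(recorder.start_tags)                      # tags the parser reported
print("".join(recorder.script_text) == source)  # script body kept intact?
```

If the `<b>` inside the string literal were treated as markup, it would show up in `start_tags` and the joined script text would no longer match `source`.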
Re: [Python-Dev] HTMLParser and HTML5
Brett Cannon, 28.07.2011 23:49:
> On Thu, Jul 28, 2011 at 11:25, Matt wrote:
>> - What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
>
> There aren't any beyond "it would be nice". [...]
> It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).

Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest. I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it.

Stefan