[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-03 Thread Chenyun Yang

Chenyun Yang added the comment:

handle_startendtag is also called for non-void elements, such as , so
the override example will break in those situations.

The backward-compatible patch I am proposing is just a one-line check:

# http://www.w3.org/TR/html5/syntax.html#void-elements
_VOID_ELEMENT_TAGS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])

class HTMLParser.HTMLParser:
    # Internal -- handle starttag, return end or -1 if not terminated
    def parse_starttag(self, i):
        #...
        if end.endswith('/>'):
            # XHTML-style empty tag
            self.handle_startendtag(tag, attrs)
        #PATCH#
        elif end.endswith('>') and tag in _VOID_ELEMENT_TAGS:
            self.handle_startendtag(tag, attrs)
        #PATCH#

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25258>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-02 Thread Chenyun Yang

Chenyun Yang added the comment:

The example you give for <li> is a different case.

<link> and <img> are void elements, which are allowed to have no closing tag;
<li> without </li> is a browser implementation detail: most browsers
supply the missing </li> automatically.

Unless the parser calls handle_endtag(), client code that uses
HTMLParser cannot know whether the traversal of an element is finished.
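
The traversal problem described above can be sketched with a small client that keeps a stack of open elements. This is only an illustration under assumptions: it uses Python 3's html.parser rather than the Python 2.7 HTMLParser module, and OpenElementTracker and still_open are hypothetical names, not part of the library.

```python
from html.parser import HTMLParser


class OpenElementTracker(HTMLParser):
    """Hypothetical client that tracks which elements are still open."""

    def __init__(self):
        super().__init__()
        self.open_elements = []

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        # Pop only when the end tag matches the innermost open element.
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()


def still_open(markup):
    """Return the elements the tracker believes are open after parsing."""
    tracker = OpenElementTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.open_elements


print(still_open("<p><img/></p>"))  # []
print(still_open("<p><img></p>"))   # ['p', 'img']
```

With the XHTML-style tag the stack empties; with the bare void tag the tracker never sees an end event for img, so it cannot tell that the element is finished.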

Do you have a strong reason why we should not build the knowledge of void
elements into HTMLParser at this line?

https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py#L341

if end.endswith('/>') or (end.endswith('>') and tag in VOID_ELEMENTS)
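
As a rough sketch, the condition above can be pulled out into a standalone predicate so that it is easy to try on a few inputs. The function name is hypothetical, `end` is assumed to be the trailing text of the start tag as in parse_starttag, and VOID_ELEMENTS is assumed to hold the tag list from the HTML5 void-elements section cited in this thread.

```python
# Tag list as quoted in this thread from the HTML5 void-elements section.
VOID_ELEMENTS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img", "input", "keygen",
    "link", "meta", "param", "source", "track", "wbr",
])


def would_call_startendtag(end, tag):
    """Hypothetical helper mirroring the proposed condition.

    `end` is the tail of the start tag text ('>' or '/>'), and `tag`
    is the lower-cased tag name.
    """
    return end.endswith("/>") or (end.endswith(">") and tag in VOID_ELEMENTS)


print(would_call_startendtag("/>", "a"))   # XHTML-style syntax, any element
print(would_call_startendtag(">", "img"))  # bare void element
print(would_call_startendtag(">", "p"))    # ordinary start tag
```

Only the last call returns False, so under this condition a bare void tag and its XHTML-style spelling would be dispatched the same way.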

On Wed, Sep 30, 2015 at 7:05 PM, Martin Panter <rep...@bugs.python.org>
wrote:

>
> Martin Panter added the comment:
>
> My thinking is that the knowledge that <img> does not have a closing tag
> is at a higher level than the current HTMLParser class. It is similar to
> knowing where the following HTML implicitly closes the <li> elements:
>
> <li>Item A<li>Item B
>
> In both cases I would not expect the HTMLParser to report “virtual” empty
> or closing tags. I don’t think it should report an empty <img/> or closing
> </img> tag just because that is easy to do, because it would be
> inconsistent with other implied HTML tags. But maybe see what other people
> say.
>
> I don’t know your particular use case, but I would suggest if you need to
> parse non-XML HTML <img> tags, use the handle_starttag() method and don’t
> rely on the end tag :)
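
A minimal sketch of the suggestion to use handle_starttag() and not rely on the end tag, assuming Python 3's html.parser and using img as the example void element. ImageSourceCollector and image_sources are hypothetical names. It relies on the documented default handle_startendtag(), which calls handle_starttag() followed by handle_endtag().

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Hypothetical client that only needs start-tag events."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Reached for <img ...> directly and for <img ... /> through the
        # default handle_startendtag(), so both spellings are covered.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def image_sources(markup):
    collector = ImageSourceCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


print(image_sources('<img src="a.png"><img src="b.png" />'))
```

This works for clients that only care about the element's attributes; it does not help clients that need an explicit end event to close a traversal.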



[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-02 Thread Chenyun Yang

Chenyun Yang added the comment:

Correction to my previous comment: "consistent" should read "not consistent".

On Fri, Oct 2, 2015 at 1:16 PM, Chenyun Yang <rep...@bugs.python.org> wrote:

>
> Chenyun Yang added the comment:
>
> I am fine with either handle_startendtag or handle_starttag.
>
> The issue is that the behavior is consistent for the two equally valid
> syntaxes (<img> and <img/> are handled differently); this inconsistency
> cannot be fixed in a subclass, because the handle_* calls are dispatched
> in an internal method of HTMLParser.



[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-02 Thread Chenyun Yang

Chenyun Yang added the comment:

I am fine with either handle_startendtag or handle_starttag.

The issue is that the behavior is consistent for the two equally valid
syntaxes (<img> and <img/> are handled differently); this inconsistency
cannot be fixed in a subclass, because the handle_* calls are dispatched
in an internal method of HTMLParser.

On Fri, Oct 2, 2015 at 12:42 PM, Ezio Melotti <rep...@bugs.python.org>
wrote:

>
> Ezio Melotti added the comment:
>
> Note that HTMLParser tries to follow the HTML5 specs, and for this case
> they say [0]:
> "Set the self-closing flag of the current tag token. Switch to the data
> state. Emit the current tag token."
>
> So it seems that for <img/>, only the <img> (and not the closing </img>)
> should be emitted.  HTMLParser has no way to set the self-closing flag, so
> calling handle_startendtag seems the most reasonable thing to do, since it
> allows tree-builders to set the flag themselves.  That said, the default
> implementation of handle_startendtag should indeed just call
> handle_starttag; however, this would be a backward-incompatible change.
>
> [0]: http://www.w3.org/TR/html5/syntax.html#self-closing-start-tag-state
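
One way a tree-builder might set the self-closing flag itself is sketched below, assuming Python 3's html.parser. StartTagToken and TokenCollector are hypothetical names, and the token layout is an illustration, not the structure defined by the HTML5 spec.

```python
from html.parser import HTMLParser
from typing import NamedTuple


class StartTagToken(NamedTuple):
    """Hypothetical start-tag token carrying a self-closing flag."""
    tag: str
    attrs: list
    self_closing: bool


class TokenCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(StartTagToken(tag, attrs, False))

    def handle_startendtag(self, tag, attrs):
        # Emit a single token with the flag set instead of the default
        # handle_starttag() + handle_endtag() pair.
        self.tokens.append(StartTagToken(tag, attrs, True))


def start_tokens(markup):
    collector = TokenCollector()
    collector.feed(markup)
    collector.close()
    return collector.tokens


for token in start_tokens("<br><br/>"):
    print(token.tag, token.self_closing)
```

Overriding handle_startendtag() this way means no synthetic end tag is reported for the "/>" form, which matches the quoted spec text about emitting only the current tag token.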
>
> --
> type:  -> behavior



[issue25258] HtmlParser doesn't handle void element tags correctly

2015-09-29 Thread Chenyun Yang

Chenyun Yang added the comment:

I think the bug is mostly about inconsistent behavior: <img> and <img/>
shouldn't be parsed differently.

This causes a problem because the parser cannot tell consistently whether
it has finished visiting an <img> tag.

I propose one fix: in the `parse_internal' method, check for void elements
and call `handle_startendtag'.
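
The effect of that proposal can be approximated in client code, as a sketch only. It assumes Python 3's html.parser, reroutes void elements inside a subclass rather than inside the parser itself, and uses hypothetical names (VoidAwareParser, parse_events). The tag list is the one from the HTML5 void-elements section referenced in the report.

```python
from html.parser import HTMLParser

# Void elements per the HTML5 section referenced in the report.
VOID_ELEMENT_TAGS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img", "input", "keygen",
    "link", "meta", "param", "source", "track", "wbr",
])


class VoidAwareParser(HTMLParser):
    """Hypothetical subclass that reports <img> the same way as <img/>."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENT_TAGS:
            # Treat a bare void tag as if it had been written with '/>'.
            self.handle_startendtag(tag, attrs)
        else:
            self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Overridden so it does not call back into handle_starttag().
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse_events(markup):
    parser = VoidAwareParser()
    parser.feed(markup)
    parser.close()
    return parser.events


print(parse_events("<img><img/><p></p>"))
```

Both spellings of the void tag produce the same "startend" event here, while the ordinary element still produces separate start and end events.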

On Tue, Sep 29, 2015 at 1:27 PM, Martin Panter <rep...@bugs.python.org>
wrote:

>
> Martin Panter added the comment:
>
> Also applies to Python 3, though I’m not sure I would consider it a bug.
>
> --
> nosy: +martin.panter
> versions: +Python 3.4, Python 3.5, Python 3.6



[issue25258] HtmlParser doesn't handle void element tags correctly

2015-09-28 Thread Chenyun Yang

New submission from Chenyun Yang:

Void elements such as <link> and <img> do not need the XHTML-style empty
tag syntax (a trailing "/>"). HtmlParser, which relies on that syntax to
report the end of such an element, fails to handle this situation.

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

>>> parser = MyHTMLParser()
>>> parser.feed('<link><img>')
Encountered a start tag: link
Encountered a start tag: img
>>> parser.feed('<link/><img/>')
Encountered a start tag: link
Encountered an end tag : link
Encountered a start tag: img
Encountered an end tag : img
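
For readers on Python 3, the same behavior can be reproduced with html.parser. The sketch below records handler calls in a list instead of printing them; EventLogger and log_events are hypothetical names, and the two inputs are minimal stand-ins for a bare void tag and its XHTML-style spelling.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records handler calls."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_events(markup):
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


print(log_events("<link><img>"))
print(log_events("<link/><img/>"))
```

The first call reports only start events; the second also reports end events, because the default handle_startendtag() calls handle_starttag() and then handle_endtag().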


Reference:
https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py
http://www.w3.org/TR/html5/syntax.html#void-elements

--
components: Library (Lib)
messages: 251792
nosy: Chenyun Yang
priority: normal
severity: normal
status: open
title: HtmlParser doesn't handle void element tags correctly
versions: Python 2.7
