[issue25258] HtmlParser doesn't handle void element tags correctly

2021-01-04 Thread karl


karl  added the comment:

I wonder if the confusion comes from the name. HTMLParser is more of a
tokenizer than a full HTML parser, but that's probably a detail. It
doesn't create a DOM tree that you can access, but it can help you build
one (a DOM tree, which is not the same as a DOM Document object).

https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model

> Implementations that do not support scripting do not have to actually create 
> a DOM Document object, but the DOM tree in such cases is still used as the 
> model for the rest of the specification.
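
As a rough illustration of the point that HTMLParser only emits tokens and leaves tree construction to the caller, here is a minimal sketch of a tree builder layered on top of it. The names Node, TreeBuilder, build_tree and VOID_ELEMENTS are hypothetical and not part of html.parser; the void-element list is copied into the client, and the sketch ignores text nodes and the spec's implied end tags.

```python
from html.parser import HTMLParser

# Void elements, kept in the client; html.parser itself has no such list.
VOID_ELEMENTS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
])


class Node:
    """Hypothetical minimal tree node (not a DOM Document object)."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]  # stack of open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)  # only non-void elements stay open

    def handle_startendtag(self, tag, attrs):
        # Treat "<x/>" like "<x>"; void-ness is decided by the tag name.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore strays.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

With this sketch, `<img src="a.png">` and `<br/>` both end up as childless nodes regardless of which syntax was used, because the decision is made by the tree builder and not by the tokenizer.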

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25258] HtmlParser doesn't handle void element tags correctly

2021-01-04 Thread karl


karl  added the comment:

The parsing rules for the tokenization of HTML are at
https://html.spec.whatwg.org/multipage/parsing.html#tokenization

The stack of open elements has specific rules for certain elements:
https://html.spec.whatwg.org/multipage/parsing.html#special

From a DOM point of view, there is indeed no difference between
<img src="somewhere"> and <img src="somewhere"/>:

https://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%0A%3Cimg%20src%3D%22somewhere%22%3E%3Cimg%20src%3D%22somewhere%22%2F%3E

--
nosy: +karlcow




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-03 Thread R. David Murray

R. David Murray added the comment:

I suspect that calling startendtag is also backward incompatible, in that there
may be parsers out there that depend on starttag getting called for a void
element and endtag not getting called (that is, endtag getting called for it
will cause them to break).  I would hope that this is not the case, but
I'm worried about it.
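
To make the compatibility worry concrete, here is a sketch of the kind of existing client that could break. DepthTracker and its VOID set are hypothetical; the "patched" behaviour is only simulated by calling the handlers by hand in the order the proposed change would dispatch them.

```python
from html.parser import HTMLParser

# The client's own idea of void elements (hypothetical, abbreviated list).
VOID = frozenset(["br", "hr", "img", "input", "link", "meta"])


class DepthTracker(HTMLParser):
    """Hypothetical client that never expects an end tag for a void element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(tag)  # unexpected end tag


# Current dispatch: "<img ...>" only triggers handle_starttag.
current = DepthTracker()
current.feed('<p><img src="x"></p>')
current.close()

# Simulated patched dispatch: "<img ...>" goes through handle_startendtag,
# whose default implementation calls handle_starttag then handle_endtag.
patched = DepthTracker()
patched.handle_starttag("p", [])
patched.handle_startendtag("img", [("src", "x")])
patched.handle_endtag("p")
```

In this sketch `current.errors` stays empty while `patched.errors` records the unexpected `img` end tag, which is the breakage being described.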

--
nosy: +r.david.murray




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-03 Thread Chenyun Yang

Chenyun Yang added the comment:

handle_startendtag is also called for non-void elements written with the
'/>' syntax, so the override example will break in those situations.

The compatible patch I am proposing right now is just a one-line check:

# http://www.w3.org/TR/html5/syntax.html#void-elements
_VOID_ELEMENT_TAGS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])

class HTMLParser.HTMLParser:
    # Internal -- handle starttag, return end or -1 if not terminated
    def parse_starttag(self, i):
        #...
        if end.endswith('/>'):
            # XHTML-style empty tag: <span attr="value" />
            self.handle_startendtag(tag, attrs)
        #PATCH#
        elif end.endswith('>') and tag in _VOID_ELEMENT_TAGS:
            self.handle_startendtag(tag, attrs)
        #PATCH#
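
The objection to the override suggested elsewhere in this thread (making handle_startendtag call only handle_starttag) can be checked with a small sketch. Recorder and events are hypothetical helper names; the markup is illustrative.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Records handler calls, using the suggested handle_startendtag override."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # The override under discussion: drop the implicit end tag.
        self.handle_starttag(tag, attrs)


def events(html):
    recorder = Recorder()
    recorder.feed(html)
    recorder.close()
    return recorder.events
```

Under this override `events('<img/>')` and `events('<img>')` agree, which is the goal, but `events('<a/>')` is also just a start event, so a non-void element written with the empty-tag syntax can no longer be told apart from an unclosed `<a>`.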

--




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-02 Thread Chenyun Yang

Chenyun Yang added the comment:

The example you give (implicitly closed list items) is a different case.

Void elements such as <link> and <img> are allowed to have no close tag;
a missing close tag in your example is a browser implementation detail:
most browsers auto-complete it.

Unless the parser calls handle_endtag(), client code that uses
HTMLParser cannot know whether the traversal of an element is finished.

Do you have a strong reason why we should include the knowledge of  void
elements into the HTMLParser at this line?

https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py#L341

if end.endswith('/>') or (end.endswith('>') and tag in VOID_ELEMENTS)


--




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-02 Thread Chenyun Yang

Chenyun Yang added the comment:

Correction to my previous comment: "consistent" should read "not consistent".


--




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-02 Thread Ezio Melotti

Ezio Melotti added the comment:

Note that HTMLParser tries to follow the HTML5 specs, and for this case they
say [0]:
"Set the self-closing flag of the current tag token. Switch to the data state.
Emit the current tag token."

So it seems that for a self-closing tag, only the start tag (and not a
closing tag) should be emitted.  HTMLParser has no way to set the
self-closing flag, so calling handle_startendtag seems the most reasonable
thing to do, since it allows tree-builders to set the flag themselves.  That
said, the default implementation of handle_startendtag should indeed just
call handle_starttag; however, this would be a backward-incompatible change.

[0]: http://www.w3.org/TR/html5/syntax.html#self-closing-start-tag-state
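
As a sketch of how a client could recover the spec's self-closing flag from the existing callbacks, the subclass below records one token per start tag and marks it when handle_startendtag was the entry point. TokenRecorder, tokenize and the tuple layout are hypothetical choices for illustration, not an html.parser API.

```python
from html.parser import HTMLParser


class TokenRecorder(HTMLParser):
    """Emits (kind, tag, self_closing) tuples, loosely mirroring spec tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, False))

    def handle_startendtag(self, tag, attrs):
        # One start-tag token with the flag set, and no end-tag token.
        self.tokens.append(("start", tag, True))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, False))


def tokenize(html):
    recorder = TokenRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.tokens
```

For example, `tokenize('<br><br/>')` yields two start tokens for `br` that differ only in the flag, which is the information a tree builder would need.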

--
type:  -> behavior




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-02 Thread Chenyun Yang

Chenyun Yang added the comment:

I am fine with either handle_startendtag or handle_starttag.

The issue is that the behavior is consistent for the two equally valid
syntaxes (a void element written with and without the trailing slash is
handled differently); this inconsistency cannot be fixed from a subclass,
because the handle_* calls are dispatched in an internal method of HTMLParser.


--




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-10-02 Thread Ezio Melotti

Ezio Melotti added the comment:

> this inconsistent cannot be fixed from the inherited class as (handle_* 
> calls are dispatched in the internal method of HTMLParser)

You can override handle_startendtag() like this:

>>> from html.parser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print('start', tag)
...     def handle_endtag(self, tag):
...         print('end', tag)
...     def handle_startendtag(self, tag, attrs):
...         self.handle_starttag(tag, attrs)
...
>>> parser = MyHTMLParser()
>>> parser.feed('<link/><img/>')
start link
start img


(P.S. please don't quote the whole message in your reply)

--




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-09-30 Thread Martin Panter

Martin Panter added the comment:

My thinking is that the knowledge that a void element does not have a closing
tag is at a higher level than the current HTMLParser class. It is similar to
knowing where the following HTML implicitly closes the <li> elements:

<ul><li>Item A<li>Item B</ul>

In both cases I would not expect the HTMLParser to report “virtual” empty or
closing tags. I don’t think it should report an empty or closing tag for a
void element just because that is easy to do, because it would be
inconsistent with other implied HTML tags. But maybe see what other people say.

I don’t know your particular use case, but I would suggest that if you need to
parse non-XML HTML void tags, use the handle_starttag() method and don’t rely
on the end tag :)
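
The comparison with implied end tags can be illustrated with a short sketch. The markup is an assumed example of a list whose `<li>` elements are left unclosed; EventLog and events are hypothetical helper names.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Records the raw handler calls without adding any implied tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def events(html):
    log = EventLog()
    log.feed(html)
    log.close()
    return log.events


implied = events("<ul><li>Item A<li>Item B</ul>")
```

The log contains no `("end", "li")` entries: HTMLParser reports only the tags that literally appear in the input, and working out where each `<li>` ends is left to the client, just as it is for void elements.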

--




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-09-29 Thread Chenyun Yang

Chenyun Yang added the comment:

I think the bug is mostly about inconsistent behavior: a void element written
with and without the trailing slash shouldn't be parsed differently.

This causes a problem: the parser won't be able to know consistently
whether it has ended the visit of such a tag.

I propose one fix: in the `parse_internal' method call, check
for void elements and call `handle_startendtag'.


--




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-09-29 Thread Martin Panter

Martin Panter added the comment:

Also applies to Python 3, though I’m not sure I would consider it a bug.

--
nosy: +martin.panter
versions: +Python 3.4, Python 3.5, Python 3.6




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-09-28 Thread Xiang Zhang

Xiang Zhang added the comment:

From the specification, a void element has no end tag, so I think this
behaviour cannot be called incorrect. For a void element, only
handle_starttag is called.

For a start tag that ends with '/>', HTMLParser actually calls
handle_startendtag, which invokes handle_starttag and
handle_endtag.

I think there are two solutions: filter void elements in the library
and then invoke handle_startendtag, or filter void elements in the
application in handle_starttag and then invoke handle_endtag.
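
The second option (filtering in the application) might look like the sketch below. VoidAwareParser, events and VOID_ELEMENTS are hypothetical names. Note that handle_startendtag is overridden as well; otherwise the default implementation would call handle_starttag and then handle_endtag, producing a duplicate end event for void elements written with '/>'.

```python
from html.parser import HTMLParser

VOID_ELEMENTS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
])


class VoidAwareParser(HTMLParser):
    """Reports a start/end pair for void elements in either syntax."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        if tag in VOID_ELEMENTS:
            # Application-level filter: close void elements immediately.
            self.handle_endtag(tag)

    def handle_startendtag(self, tag, attrs):
        # Emit exactly one pair; do not go through handle_starttag here.
        self.events.append(("start", tag))
        self.events.append(("end", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def events(html):
    parser = VoidAwareParser()
    parser.feed(html)
    parser.close()
    return parser.events
```

With this in place, `events('<link><img>')` and `events('<link/><img/>')` produce the same sequence, at the cost of every client carrying its own copy of the void-element list.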

--
nosy: +xiang.zhang




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-09-28 Thread R. David Murray

Changes by R. David Murray :


--
nosy: +ezio.melotti




[issue25258] HtmlParser doesn't handle void element tags correctly

2015-09-28 Thread Chenyun Yang

New submission from Chenyun Yang:

Void elements such as <link> and <img> do not need the XHTML-style empty-tag
syntax (a trailing '/>'). HtmlParser, which relies on that XHTML syntax, fails
to handle this situation.

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

>>> parser = MyHTMLParser()
>>> parser.feed('<link><img>')
Encountered a start tag: link
Encountered a start tag: img
>>> parser.feed('<link/><img/>')
Encountered a start tag: link
Encountered an end tag : link
Encountered a start tag: img
Encountered an end tag : img
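
For readers on Python 3, the same observation can be reproduced with html.parser. This is a sketch with hypothetical helper names (EventRecorder, events) and minimal markup; it collects handler calls in a list and leaves the default handle_startendtag in place.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Collects handler calls; handle_startendtag keeps its default behaviour."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def events(html):
    recorder = EventRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.events


html_style = events("<link><img>")       # void elements, HTML syntax
xhtml_style = events("<link/><img/>")    # same elements, '/>' syntax
```

Here `html_style` contains only start events, while `xhtml_style` contains a start/end pair per element, which is the difference the report describes.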


Reference:
https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py
http://www.w3.org/TR/html5/syntax.html#void-elements

--
components: Library (Lib)
messages: 251792
nosy: Chenyun Yang
priority: normal
severity: normal
status: open
title: HtmlParser doesn't handle void element tags correctly
versions: Python 2.7
