[issue20288] HTMLParse handing of non-numeric charrefs broken
Ezio Melotti added the comment:

This is now fixed, thanks for the report!

> This should be fixed, and the behavior of _run_check should probably be
> changed too -- maybe it could test both the char-by-char and the regular
> feeding.

I created #20623 to track this.

--
resolution:  -> fixed
stage: needs patch -> committed/rejected
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue20288
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue20288] HTMLParse handing of non-numeric charrefs broken
Ezio Melotti added the comment:

Here's a patch against 2.7.

--
keywords: +patch
Added file: http://bugs.python.org/file33845/issue20288.diff
[issue20288] HTMLParse handing of non-numeric charrefs broken
Roundup Robot added the comment:

New changeset 0d50b5851f38 by Ezio Melotti in branch '2.7':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/0d50b5851f38

New changeset 32097f193892 by Ezio Melotti in branch '3.3':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/32097f193892

New changeset 92b3928bfde1 by Ezio Melotti in branch 'default':
#20288: merge with 3.3.
http://hg.python.org/cpython/rev/92b3928bfde1

--
nosy: +python-dev
[issue20288] HTMLParse handing of non-numeric charrefs broken
New submission from Anders Hammarquist:

Python 2.7 HTMLParser.py lines 185-199 (similar lines still exist in Python 3.4):

    match = charref.match(rawdata, i)
    if match:
        ...
    else:
        if ";" in rawdata[i:]: #bail by consuming &#
            self.handle_data(rawdata[0:2])
            i = self.updatepos(i, 2)
        break

If you feed it a broken charref, that is, a non-numeric one, it will pass whatever string happens to be at the start of rawdata to handle_data(). E.g.:

    p = HTMLParser()
    p.handle_data = lambda x: sys.stdout.write(x)
    p.feed('<p>&#foo;</p>')

will print '<p', which is clearly wrong. I think the intention of the code is to pass '&#', which seems saner.

--
components: Library (Lib)
messages: 208336
nosy: iko
priority: normal
severity: normal
status: open
title: HTMLParse handing of non-numeric charrefs broken
type: behavior
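
For readers who want to check their own interpreter, here is a minimal Python 3 sketch of the reproduction. It assumes the input in the report was '<p>&#foo;</p>'. DataCollector is a hypothetical helper name, not part of the standard library, and convert_charrefs=False (available since Python 3.4) is chosen here so that the parser's explicit "&#" branch is exercised.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records every chunk passed to handle_data()."""

    def __init__(self):
        # Assumption: convert_charrefs=False keeps the explicit "&#"
        # branch of the parser in play instead of unescaping text.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = DataCollector()
parser.feed('<p>&#foo;</p>')  # whole chunk at once, as in the report
parser.close()                # flush whatever is still buffered

# On an affected interpreter the first chunk would be '<p'.
print(parser.chunks)
```

If '<p' shows up among the printed chunks, the interpreter still has the bug described above.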
[issue20288] HTMLParse handing of non-numeric charrefs broken
Changes by Ezio Melotti <ezio.melo...@gmail.com>:

--
assignee:  -> ezio.melotti
nosy: +ezio.melotti
[issue20288] HTMLParse handing of non-numeric charrefs broken
Ezio Melotti added the comment:

Thanks for the report, this is indeed a bug.

This behavior was covered by a test (see Lib/test/test_htmlparser.py:164), but _run_check feeds the chars one by one to the parser, and in that case it works correctly. While feeding the parser a whole chunk I was able to reproduce the bug.

This should be fixed, and the behavior of _run_check should probably be changed too -- maybe it could test both the char-by-char and the regular feeding.

--
nosy: +r.david.murray
stage:  -> needs patch
versions: +Python 2.7, Python 3.3, Python 3.4
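
The idea of checking both feeding modes could look roughly like the sketch below. This is not the actual _run_check from the test suite. EventCollector and collect_events are hypothetical names used only for illustration, and adjacent data events are merged so that differences in chunking alone do not count as a mismatch.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical collector: records parser events in order."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        # Merge adjacent data events so chunking differences do not matter.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))


def collect_events(source, char_by_char):
    """Feed source either one character at a time or as a single chunk."""
    parser = EventCollector()
    if char_by_char:
        for ch in source:
            parser.feed(ch)
    else:
        parser.feed(source)
    parser.close()
    return parser.events


source = '<p>&#foo;</p>'
whole = collect_events(source, char_by_char=False)
piecewise = collect_events(source, char_by_char=True)

# A mismatch here would point at a bug that only one feeding mode exposes.
print(whole == piecewise)
```

A test that only feeds char by char would have passed on the affected versions, which is why comparing the two event lists is useful.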