[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2011-12-18 Thread Ezio Melotti

Changes by Ezio Melotti :


--
resolution:  -> fixed
stage: needs patch -> committed/rejected
status: open -> closed
type: behavior -> enhancement

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2011-12-18 Thread Roundup Robot

Roundup Robot  added the comment:

New changeset 978f45013c34 by Ezio Melotti in branch '2.7':
#3932: suggest passing unicode to HTMLParser.feed().
http://hg.python.org/cpython/rev/978f45013c34

--
nosy: +python-dev




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2011-11-28 Thread Ezio Melotti

Ezio Melotti  added the comment:

I'll change this in a doc issue then.

Any suggestions about the wording?
Adding "Passing unicode strings is suggested/advised/preferred." in the .feed() 
section is a bit vague, and mentioning the problem (with str it might break in 
some corner cases) while keeping a positive tone is somewhat difficult.

--
components: +Documentation -Library (Lib)
nosy: +rhettinger




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2011-11-22 Thread Éric Araujo

Éric Araujo  added the comment:

+1 on refusing the temptation to guess and to be half-working for some cases by 
accident.

--




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2011-11-14 Thread Ezio Melotti

Changes by Ezio Melotti :


--
assignee:  -> ezio.melotti




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2011-11-06 Thread Ezio Melotti

Changes by Ezio Melotti :


--
stage:  -> needs patch
Added file: http://bugs.python.org/file23621/issue3932-test.diff




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2011-11-06 Thread Ezio Melotti

Changes by Ezio Melotti :


--
nosy: +eric.araujo




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2011-11-06 Thread Ezio Melotti

Ezio Melotti  added the comment:

I'm not sure what the best solution is here.

unescape uses a regex with replaceEntities as callback to replace the entities 
in attribute values.
The problem is that replaceEntities currently returns unicode, and if unescape 
receives a str, an automatic coercion to unicode happens and an error is raised 
whenever the str is non-ascii.
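
The mechanism described above can be sketched as follows. This is a simplified illustration written for Python 3 (where all text is unicode), not the actual HTMLParser source; the name `unescape_sketch` is hypothetical and the pattern is a reduced form of the one that appears in the traceback of the original report.

```python
import re
from html.entities import name2codepoint


def unescape_sketch(s):
    """Replace character and entity references in s using a regex callback."""
    def replace_entities(match):
        ref = match.group(1)
        if ref.startswith("#"):
            # Numeric reference: &#233; (decimal) or &#xE9; (hexadecimal).
            digits = ref[1:]
            if digits[:1] in ("x", "X"):
                return chr(int(digits[1:], 16))
            return chr(int(digits))
        # Named reference: look it up, leave unknown names untouched.
        if ref in name2codepoint:
            return chr(name2codepoint[ref])
        return match.group(0)

    return re.sub(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w{1,8});", replace_entities, s)


result = unescape_sketch("caf&eacute; &amp; bar &#233; &#xE9; &bogus;")
```

In Python 2.6 the equivalent callback returned unicode, so when the input was a byte string the pieces had to be coerced to a common type, and that coercion used the ASCII codec.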

The possible solutions are:
 1) Document the status quo (i.e. replaceEntities always returns unicode, and an 
error is raised whenever a string that contains non-ascii chars is passed);
 2) Change replaceEntities to return str only for ascii chars (as the patch 
proposed by Zbigniew does).  This works as long as the entity resolves to an 
ascii character, but keeps failing for the other cases.

The first option is cleaner, and means that if you want to parse something you 
should always use unicode, otherwise it might fail (In case of ambiguity, 
refuse the temptation to guess).
The second option might allow you to parse a few more documents without 
converting them to unicode, but only if you are lucky (i.e. you don't get any 
unicode mixed with non-ascii str).  If most of the entities in attributes 
resolve to ascii (e.g. &quot; &amp; &apos; &gt; &lt;), it might be more 
practical to return str and avoid unnecessary errors, while still adding a note 
in documentation that passing unicode is better.
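
A minimal sketch of the "always pass unicode" advice, written against Python 3's html.parser so that it runs on a current interpreter. The `AttrCollector` class and the choice of UTF-8 are assumptions made for illustration; in Python 2.7 the same idea would presumably be spelled `HTMLParser.HTMLParser().feed(data.decode(encoding))`.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


# Bytes as they might arrive from a file or socket; UTF-8 is an assumption.
raw = '<a title="caf\u00e9 &amp; bar">link</a>'.encode("utf-8")

parser = AttrCollector()
parser.feed(raw.decode("utf-8"))  # decode first, then feed text
parser.close()
collected = parser.attrs
```

Because the caller decodes explicitly, the parser never has to guess an encoding when it unescapes the attribute value.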

--
nosy: +ezio.melotti, r.david.murray
type:  -> behavior
versions:  -Python 2.6




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2009-12-12 Thread Sérgio

Sérgio  added the comment:

The patch fixes parsing of a simple a tag whose title contains ?! and
accents, like this:

 

--
nosy: +sergiomb2




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2009-07-30 Thread Artur Frysiak

Changes by Artur Frysiak :


--
nosy: +wiget




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2009-07-30 Thread Zbigniew Chyla

Zbigniew Chyla  added the comment:

Since `HTMLParser.unescape` in 2.5 returns `str` for `str` input, 2.6
should remain compatible. Therefore I propose the attached patch
(`HTMLParser-unescape-fix.diff`). With this patch applied the result
will have the same type as the input.

--
keywords: +patch
nosy: +zchyla
Added file: http://bugs.python.org/file14606/HTMLParser-unescape-fix.diff




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2008-10-03 Thread Simon Cross

Simon Cross <[EMAIL PROTECTED]> added the comment:

I've tracked down the cause to the .unescape(...) method in HTMLParser.
The replaceEntities function passed to re.sub() always returns a unicode
character, even when matching string s is a byte string. Changing line
383 to:

  return self.entitydefs[s].encode("utf-8")

makes the test pass. Unfortunately this is obviously not a viable
solution in the general case. The problem is that there is no way to
know what character set to encode in without knowing both the HTTP
headers (which are not available to HTMLParser) and looking at the XML
and HTML headers.
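
To illustrate why no single encoding is safe, here is a small sketch (Python 3 syntax; the choice of `eacute` and of the two encodings is arbitrary) showing that the same entity maps to different byte sequences depending on the character set:

```python
from html.entities import name2codepoint

char = chr(name2codepoint["eacute"])  # the character that &eacute; stands for
as_utf8 = char.encode("utf-8")        # two bytes
as_latin1 = char.encode("latin-1")    # one byte
```

A parser that returned encoded bytes would have to pick one of these without knowing which one the rest of the document uses.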

Python 3.0 implicitly rejects non-unicode strings right at the start of
html.parser.HTMLParser.feed(...) by adding '' to the data passed in.
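
A short sketch of that behaviour, assuming a Python 3 interpreter: feeding bytes fails because the parser concatenates the data onto its internal text buffer.

```python
from html.parser import HTMLParser

parser = HTMLParser()
try:
    parser.feed(b'<a title="x">')  # bytes, not text
    rejected = False
except TypeError:
    # str + bytes is not allowed, so the bytes are refused up front.
    rejected = True
```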

Given Python 3.0's behaviour, the docs should perhaps be updated to say
HTMLParser does not support non-unicode strings? If it should support
byte strings, we'll have to figure out how to handle encoded entity issues.

It's a bit weird that character and entity references outside
tags/attributes result in calls to .handle_entityref(...) and
.handle_charref(...) while those inside get unescape called automatically.
I don't really see what can be done about that though.
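
The asymmetry can be observed with a small recorder class. This sketch targets Python 3's html.parser, where the callbacks are named handle_entityref and handle_charref and where `convert_charrefs=False` is needed to keep references in text as separate events; `EventRecorder` is a hypothetical name.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that logs every callback it receives."""

    def __init__(self):
        # convert_charrefs=False keeps references in text as separate events.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


recorder = EventRecorder()
recorder.feed('<a title="x &amp; y">p &amp; q &#233;</a>')
recorder.close()
events = recorder.events
```

The attribute value arrives already unescaped, while the references in the element text are reported by name and left for the caller to resolve.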

--
versions: +Python 2.7




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2008-10-03 Thread yanne

yanne <[EMAIL PROTECTED]> added the comment:

It seems that I managed to upload the wrong test file the first time.

The attached test should fail; I tested it with Python 2.6 final on both
Linux and Windows.

Added file: http://bugs.python.org/file11690/test.py




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2008-10-03 Thread yanne

Changes by yanne <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file11557/test.py




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2008-09-26 Thread Simon Cross

Simon Cross <[EMAIL PROTECTED]> added the comment:

I can't reproduce this on current trunk (r66633, 27 Sep 2008). I checked
sys.getdefaultencoding() but that returned 'ascii' as expected, and I
even tried running Python with "LANG=C ./python" but that didn't fail
either. Perhaps this has been fixed? Judging from the traceback, it might
originally have been a problem in the re module.

--
nosy: +hodgestar




[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

2008-09-22 Thread yanne

New submission from yanne <[EMAIL PROTECTED]>:

It seems that HTMLParser.feed throws an exception whenever an attribute
value contains both an ampersand entity ('&amp;') and non-ascii characters.

Running the attached test file with Python 2.5 succeeds, but with Python
2.6, the result is:

C:\Python26>python.exe test.py
Without & in attribute
OK
With & in attribute
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    HP().feed(s)
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\HTMLParser.py", line 249, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "C:\Python26\lib\HTMLParser.py", line 386, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "C:\Python26\lib\re.py", line 150, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
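
The final error is what the ASCII codec reports when asked to decode a non-ASCII byte. Python 2 performed that decode implicitly when str and unicode were mixed; the sketch below spells it out explicitly in Python 3 syntax, using the two UTF-8 bytes of an accented "e" as an assumed input.

```python
data = b"\xc3\xa9"  # UTF-8 bytes for an accented "e"
try:
    data.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
```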

I am running:

Python 2.6rc2 (r26rc2:66507, Sep 18 2008, 14:27:33) [MSC v.1500 32 bit
(Intel)] on win32

--
components: Library (Lib)
files: test.py
messages: 73571
nosy: yanne
severity: normal
status: open
title: HTMLParser cannot handle '&' and non-ascii characters in attribute names
versions: Python 2.6
Added file: http://bugs.python.org/file11557/test.py
