Roundup Robot added the comment:
New changeset 719ee60fc5e2 by Serhiy Storchaka in branch '2.7':
Issue #15866: The xmlcharrefreplace error handler no more produces two XML
http://hg.python.org/cpython/rev/719ee60fc5e2
--
nosy: +python-dev
Changes by Serhiy Storchaka storch...@gmail.com:
--
assignee:  -> serhiy.storchaka
resolution:  -> fixed
stage: patch review -> committed/rejected
status: open -> closed
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue15866
STINNER Victor added the comment:
Should we really invest time to fix bugs related to astral (non-BMP) characters
with rare codecs and error handlers (CJK codecs, xmlcharrefreplace error
handler)? Python 3.3 is released and has much better support for astral
characters (in many places). I
Ezio Melotti added the comment:
I tend to agree with Victor: if you want to fix 2.7 go ahead, but if that's too
much work it's OK with me to close this issue.
Changes by Serhiy Storchaka storch...@gmail.com:
--
versions: -Python 3.2
Serhiy Storchaka added the comment:
Here is a patch which fixes xmlcharrefreplace error handling in other places.
Unfortunately, the multibyte Asian encoders are still broken. I'll open a
separate issue for this.
--
Added file: http://bugs.python.org/file29378/issue15866_2.patch
Serhiy Storchaka added the comment:
> I think it's better to be compatible with 3.3+. This is anyway a rather
> obscure corner case.
Well, we should not introduce a new divergence between the 3.2 wide build and 3.3.
> Do you want to propose a new patch?
I will do it.
--
Serhiy Storchaka added the comment:
I prefer a slightly different (simpler for me) form:
    for (p = collstart; p < collend;) {
        Py_UCS4 ch = *p++;
        if ((0xD800 <= ch && ch <= 0xDBFF) &&
            (p < collend)
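The loop quoted above is cut off, so here is only a rough Python sketch of the same idea: walk a sequence of UTF-16 code units, join a high surrogate with the low surrogate that follows it, and emit one character reference per resulting code point. The names `iter_code_points` and `xmlcharrefs` are made up for this illustration, and the low-surrogate check and the combining arithmetic are the standard UTF-16 decoding rule, not text taken from the patch.

```python
def iter_code_points(units):
    """Yield code points from UTF-16 code units, joining valid surrogate pairs."""
    i = 0
    n = len(units)
    while i < n:
        ch = units[i]
        i += 1
        # High surrogate followed by a low surrogate: combine into one code point.
        if 0xD800 <= ch <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            ch = (((ch & 0x03FF) << 10) | (units[i] & 0x03FF)) + 0x10000
            i += 1
        # Lone surrogates fall through unchanged.
        yield ch


def xmlcharrefs(units):
    """Format every code point as a decimal XML character reference."""
    return "".join("&#%d;" % cp for cp in iter_code_points(units))


print(xmlcharrefs([0xD83D, 0xDC9D]))  # → &#128157;
```

A lone high surrogate is passed through as its own reference, which is one possible choice for that corner case and not necessarily what the committed patch does.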
Ezio Melotti added the comment:
> I have doubts about '\ud83d\udc9d' on a wide build. Is it right to encode it as
> b'&#128157;' and not as b'&#55357;&#56477;'?
I don't think so. IIRC surrogates are invalid in UTF-32, and certainly
shouldn't be recombined.
> This will be compatible with narrow build but
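For comparison, a small check of how Python 3.3 and later treat the two spellings discussed here. It assumes a 3.3+ interpreter, where a str holds code points rather than UTF-16 units; the variable names are only for illustration.

```python
# Assumes Python 3.3 or later (PEP 393 string representation).
surrogates = '\ud83d\udc9d'   # two separate surrogate code points
astral = '\U0001f49d'         # one non-BMP code point

surrogate_refs = surrogates.encode('ascii', 'xmlcharrefreplace')
astral_refs = astral.encode('ascii', 'xmlcharrefreplace')

print(len(surrogates), surrogate_refs)  # → 2 b'&#55357;&#56477;'
print(len(astral), astral_refs)         # → 1 b'&#128157;'
```

The explicit surrogates are not recombined there, so each gets its own character reference.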
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
nosy: +serhiy.storchaka
Ezio Melotti added the comment:
The attached patch against 3.2 seems to fix the problem.
--
keywords: +patch
stage:  -> patch review
versions: +Python 3.2
Added file: http://bugs.python.org/file27134/issue15866.diff
Ezio Melotti added the comment:
Note that there's similar code in charmap_encoding_error,
PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, and
PyUnicode_EncodeDecimal; however, I'm not sure how to reach these paths.
--
nosy: +lemburg
STINNER Victor added the comment:
Thanks to PEP 393, this issue is already fixed in Python 3.3.
$ ./python
Python 3.3.0rc1+ (default:ba2c1def3710+, Sep 3 2012, 23:20:25)
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux
>>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace')
New submission from Wim:
Encoding a (well-formed) Unicode string containing a non-BMP character, using
the xmlcharrefreplace error handler, will produce two XML entities for
surrogate codepoints instead of one entity for the actual character.
Here's a transcript (Python 2.7.3, x86_64):
b
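As a rough illustration of the report (this is not the original transcript), the sketch below runs on Python 3.3 or later and imitates a narrow build by formatting each UTF-16 code unit separately, then compares that with what the error handler gives for the character itself. The helper `utf16_units` is a name made up for this example.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a tuple of ints."""
    data = text.encode('utf-16-le')
    return struct.unpack('<%dH' % (len(data) // 2), data)


ch = '\U0001f49d'

# What the report describes: one reference per surrogate code unit.
per_unit = ''.join('&#%d;' % u for u in utf16_units(ch))

# What is expected: one reference for the actual character.
per_char = ch.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(per_unit)  # → &#55357;&#56477;
print(per_char)  # → &#128157;
```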