[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2010-06-18 Thread STINNER Victor


Changes by STINNER Victor victor.stin...@haypocalc.com:


--
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue3297
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2010-06-14 Thread STINNER Victor


STINNER Victor victor.stin...@haypocalc.com added the comment:

We are too close to the final 2.7 release; it's too late to backport. As I 
wrote, this feature is not important and there are many workarounds, so we don't 
need to backport to 3.1. Closing the issue: use Python 3.2 if you want better 
Unicode support ;-)

--
dependencies:  -Use Py_UCS4 instead of Py_UNICODE in unicodectype.c
resolution:  -> fixed


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2010-06-09 Thread Terry J. Reedy


Changes by Terry J. Reedy tjre...@udel.edu:


--
versions:  -Python 2.4, Python 2.5, Python 3.0


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2010-05-21 Thread STINNER Victor


STINNER Victor victor.stin...@haypocalc.com added the comment:

@benjamin.peterson: Do you plan to port r75928 to 2.7 and 3.1? If not, can you 
close this issue?

I think that this issue's priority is minor because few people write 
non-BMP characters directly in Python files (maybe only one, Ezio Melotti :-)). 
u"\uXXXX", u"\UXXXXXXXX" or unichr(xxx) can be used in Python 2.7 and 3.1 
(without u prefix for 3.1).
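
A minimal sketch of those workarounds, written in Python 3 syntax (no u prefix, and chr() is the Python 3 spelling of unichr()). The variable names are illustrative only, and the length check assumes Python 3.3 or later, where there are no narrow builds:

```python
# Two ways to obtain U+10123 without typing the character itself
# into the source file. Names are illustrative, not from the issue.
from_escape = "\U00010123"   # 8-digit \U escape
from_chr = chr(0x10123)      # chr() replaces unichr() in Python 3

print(from_escape == from_chr)   # both spell the same code point
print(hex(ord(from_chr)))        # 0x10123
print(len(from_escape))          # 1 on Python 3.3+
```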

--
nosy: +haypo


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2009-10-28 Thread Benjamin Peterson


Changes by Benjamin Peterson benja...@python.org:


--
dependencies: +UnicodeEncodeError - I can't even see license


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2009-10-28 Thread Benjamin Peterson


Benjamin Peterson benja...@python.org added the comment:

Committed Adam's patch in r75928.

--


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2009-10-04 Thread Amaury Forgeot d'Arc


Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

This last point is already tracked by issue5127.

--
nosy: +amaury.forgeotdarc


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2009-10-04 Thread Adam Olsen


Adam Olsen rha...@gmail.com added the comment:

Patch, which uses UTF-32-BE as indicated in my last comment.  Test included.

--
keywords: +patch
Added file: http://bugs.python.org/file15043/py3k-nonBMP-literal.diff


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2009-10-04 Thread Adam Olsen


Adam Olsen rha...@gmail.com added the comment:

With some further prodding I've noticed that although the test behaves
as expected in the py3k branch (fails on UTF-32 builds before the
patch), it doesn't fail using python 3.0.  I'm guessing there's
interactions with compile() vs import and the issue 3672 fix.  Still
good enough though, IMO.

--


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2009-10-03 Thread Adam Olsen


Adam Olsen rha...@gmail.com added the comment:

Looks like the failure mode has changed here, presumably due to issue
#3672 patches.  It now always fails, even after loading from a .pyc. 
This is using py3k via bzr, which reports itself as 3.2a0

$ rm unicodetest.pyc 
$ ./python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: '\ud800\udd23' '\U00010123'
[28877 refs]
$ ./python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: '\ud800\udd23' '\U00010123'
[28708 refs]

--
versions: +Python 2.7, Python 3.1, Python 3.2


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2009-10-03 Thread Adam Olsen


Adam Olsen rha...@gmail.com added the comment:

I've traced the biggest problem down to decode_unicode in ast.c.  It
needs to convert everything into a form of escapes so it becomes pure
ASCII, which then gets evaluated back into a unicode object. 
Unfortunately, it uses UTF-16-BE to do so, which always splits non-BMP
characters into surrogates.  Switching it to UTF-32-BE is fairly
straightforward, and works even on UTF-16 (or narrow) builds.
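
The difference between the two encodings can be seen from Python itself. This is only an illustration of the code units each encoding produces, not the code in ast.c; the variable names are made up for the example:

```python
ch = "\U00010123"

utf16 = ch.encode("utf-16-be")   # two 16-bit units: a surrogate pair
utf32 = ch.encode("utf-32-be")   # one 32-bit unit: the code point itself

# Split the byte strings back into their code units.
units16 = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
units32 = [int.from_bytes(utf32[i:i + 4], "big") for i in range(0, len(utf32), 4)]

print([hex(u) for u in units16])  # ['0xd800', '0xdd23']
print([hex(u) for u in units32])  # ['0x10123']
```

Escaping each 16-bit unit separately yields \ud800\udd23, whereas escaping the single 32-bit unit yields \U00010123.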

Incidentally, there's no point using the surrogatepass error handler
once we actually support surrogates.

Unfortunately there's a second problem in repr(). 
'\U0001010F'.isprintable() returns True on UTF-32 builds and False on
UTF-16 builds.  This causes repr() to escape it unnecessarily on UTF-16
builds.  repr() at least joins surrogate pairs before its internally
printable test (unlike .isprintable() or any other str method), but it
turns out all of the APIs in unicodectype.c only accept a single 16-bit
int in UTF-16 builds anyway.  That'll be a bigger patch than the first part.

--


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2009-04-28 Thread Lino Mastrodomenico


Changes by Lino Mastrodomenico l.mastrodomen...@gmail.com:


--
nosy: +l.mastrodomenico


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2009-04-25 Thread Jakub Wilk


Changes by Jakub Wilk uba...@users.sf.net:


--
nosy: +jwilk


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-12-21 Thread hippietrail


Changes by hippietrail hippytr...@gmail.com:


--
nosy: +hippietrail


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-09-02 Thread Adam Olsen


Adam Olsen [EMAIL PROTECTED] added the comment:

Marc, I don't understand what you're saying.  UTF-16's surrogates are
not optional.  Unicode 2.0 and later require them, and Python is
supposed to support it.

Likewise, UCS-4 originally allowed a much larger range of code points,
but it no longer does; allowing them would mean supporting only old,
archaic versions of the standards (which is clearly not desirable.)

You are right in that I shouldn't have said "a pair of ill-formed code
units".  I should have said "a pair of unassigned code points", which is
how UCS-2 always has and always will classify them.

Although python may allow ill-formed sequences to be created internally
(primarily lone surrogates on UTF-16 builds), it cannot encode or decode
them.  The standard is clear that these are to be treated as errors,
which the .decode()'s errors argument controls.  You could add a new
value for errors to pass-through the garbage, but I fail to see a use
case for it.
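
For reference, a small sketch of how the errors argument behaves for a lone surrogate in Python 3 as it later shipped. The 'surrogatepass' handler (added in Python 3.1) is one such pass-through value; the variable names are illustrative only:

```python
lone = "\ud800"  # a lone high surrogate; it can exist inside a str object

# The default (strict) handler treats the lone surrogate as an error.
try:
    lone.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# 'surrogatepass' lets the surrogate through in both directions.
passed_through = lone.encode("utf-8", errors="surrogatepass")
round_trip = passed_through.decode("utf-8", errors="surrogatepass")

print(strict_encode_failed)   # True
print(passed_through)         # b'\xed\xa0\x80'
print(round_trip == lone)     # True
```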


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-09-02 Thread Adam Olsen


Adam Olsen [EMAIL PROTECTED] added the comment:

I've got another report open about the codecs not properly reporting
errors relating to surrogates: issue 3672


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-09-01 Thread Marc-Andre Lemburg


Marc-Andre Lemburg [EMAIL PROTECTED] added the comment:

On 2008-08-29 23:33, Terry J. Reedy wrote:
> Terry J. Reedy [EMAIL PROTECTED] added the comment:
> 
> Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16
> vs. UTF-32)
> 
> I recently read most of the Unicode 5 standard and as near as I could
> tell it no longer uses the term UCS, if it ever did. 

UCS2 and UCS4 are terms which stem from the versions of Unicode
that were current at the time of adding Unicode support to Python,
ie. in the year 2000 when ISO 10646 and the Unicode spec co-existed.

See http://en.wikipedia.org/wiki/Universal_Character_Set for details.

UTF-16 is a transfer encoding that is based on UCS2 by adding
surrogate pair interpretations. UTF-32 is the same for UCS4,
but also restricting the range of valid code points to the range
covered by UTF-16.
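
The surrogate pair interpretation is plain arithmetic on the code point. A minimal sketch, with function names chosen for this example only:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10123)
print(hex(high), hex(low))                  # 0xd800 0xdd23
print(hex(from_surrogate_pair(high, low)))  # 0x10123
```

This is the pair u'\ud800\udd23' that appears in the reports for U+10123.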

Whether surrogates are supported or not and how they are supported
depends entirely on the codecs you use to convert the internal
format to some encoding.

> If it really was UCS-2, the repr wouldn't be u'\U00010123' on windows. 
> It'd be a pair of ill-formed code units instead.

You are mixing the internal representation of Unicode code points
with the result of passing those values through one of the codecs,
e.g. the unicode-escape codec is responsible for converting between
the string representation u'\U00010123' and the internal representation.

Also note that because Python can be built using two different internal
representations, the results of the codecs may vary depending on
platform.

BTW: There's no such thing as an ill-formed code unit. What you probably
mean is an ill-formed code unit sequence. However, those refer
to the output or accepted input values of a codec, not the internal
representation.

Please also note that because Python can be used to build valid
and parse possibly invalid Unicode encoding data, it has to have
the ability to work with Unicode code points regardless of whether
they can be interpreted as lone surrogates or not (hence the usage
of the terms UCS2/UCS4 which don't support surrogates).

Whether the codecs should raise exceptions and possibly let an
error handler decide whether or not to accept and/or generate
ill-formed code unit sequences is another question.

I hope that clears up the reasoning for using UCS2/UCS4 rather
than UTF-16/UTF-32 when referring to the internal Unicode representation
of Python.


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-08-29 Thread Terry J. Reedy


Terry J. Reedy [EMAIL PROTECTED] added the comment:

Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16
vs. UTF-32)

I recently read most of the Unicode 5 standard and as near as I could
tell it no longer uses the term UCS, if it ever did.  Chapter 3 has only
the following 3 hits.

1. D79 A Unicode encoding form assigns each Unicode scalar value to a
unique code unit sequence.
• For historical reasons, the Unicode encoding forms are also referred
to as Unicode (or UCS) transformation formats (UTF). That term is
actually ambiguous between its usage for encoding forms and encoding
schemes.

2. For a discussion of the relationship between UTF-32 and UCS-4
encoding form defined in ISO/IEC 10646, see Section C.2, Encoding Forms
in ISO/IEC 10646.

Section C.2 says UCS-4 can now be taken effectively as an alias for the
Unicode encoding form UTF-32 and mentions the restriction of UCS-2 to
the BMP.

3. ISO/IEC 10646 specifies an equivalent UTF-16 encoding form.
For details, see Section C.3, UCS Transformation Formats.

U5 has 3 coding formats, which it names UTF-8, UTF-16 and UTF-32, and 7
serialization formats: the same three names, plus the latter two with 'BE'
or 'LE' appended.  So, to me, use of 'UCS' is either confusing or misleading.

--
If it really was UCS-2, the repr wouldn't be u'\U00010123' on windows. 
It'd be a pair of ill-formed code units instead.

On WinXP, IDLE 3.0b2 
>>> repr('\U00010123') # u prefix no longer needed or valid
'𐄣'
>>> repr('\ud800\udd23')
'𐄣'
# Interesting: what I cut from IDLE has 2 empty boxes instead of the one
larger square with 010 and 123 I see on FireFox.  len(repr('\U00010123'))
is 4, not 3, so FireFox recognizes the surrogate pair and displays one symbol.

Entering either directly into the interpreter gives
Python 3.0b2 (r30b2:65106, Jul 18 2008, 18:44:17) [MSC v.1500 32 bit
(Intel)] on win32
>>> c='\U00010123'
>>> len(c)
2
>>> repr(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python30\lib\io.py", line 1428, in write
    b = encoder.encode(s)
  File "C:\Program Files\Python30\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position
2-3: character maps to <undefined>

2.5 gives instead u'\\U00010123' as reported, so I added 3.0 to the
list of versions with a problem.

I do wonder how repr() can work in IDLE but not in the underlying
interpreter.  Could IDLE change self.errors so that <undefined> is left
as is instead of raising an exception, with the display then replacing
those characters with empty boxes?
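
As a rough illustration of what different error handlers do with the cp437 console encoding, here is a Python 3 sketch. It does not claim to show what IDLE actually does; the variable names are illustrative only:

```python
text = "\U00010123"

# The default (strict) handler raises, as in the traceback above.
try:
    text.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Other handlers substitute something encodable instead of raising.
replaced = text.encode("cp437", errors="replace")
escaped = text.encode("cp437", errors="backslashreplace")

print(strict_failed)  # True
print(replaced)       # b'?'
print(escaped)        # b'\\U00010123'
```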

--
nosy: +tjreedy
versions: +Python 3.0


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-08-21 Thread Benjamin Peterson


Benjamin Peterson [EMAIL PROTECTED] added the comment:

Ping.

--
nosy: +benjamin.peterson


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-08-04 Thread Antoine Pitrou


Changes by Antoine Pitrou [EMAIL PROTECTED]:


--
priority:  -> critical
versions: +Python 2.6


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-07-12 Thread Marc-Andre Lemburg


Marc-Andre Lemburg [EMAIL PROTECTED] added the comment:

Adam, I do know what I'm talking about: I was the lead designer of the
Unicode integration you find in Python and implemented most of it.

What you see as repr() of a Unicode object is the result of applying a
codec to the internal representation. Please don't confuse the output of
the codec (unicode-escape) with the internal representation.

That said, Ezio did uncover a bug and we need to find the cause. It's
likely caused by the fact that the UTF-8 codec does not recombine
surrogates on UCS4 builds. See this comment in the codec implementation:

        case 3:
            if ((s[1] & 0xc0) != 0x80 ||
                (s[2] & 0xc0) != 0x80) {
                errmsg = "invalid data";
                startinpos = s-starts;
                endinpos = startinpos+3;
                goto utf8Error;
            }
            ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
            if (ch < 0x0800) {
                /* Note: UTF-8 encodings of surrogates are considered
                   legal UTF-8 sequences;

                   XXX For wide builds (UCS-4) we should probably try
                       to recombine the surrogates into a single code
                       unit.
                */
                errmsg = "illegal encoding";
                startinpos = s-starts;
                endinpos = startinpos+3;
                goto utf8Error;
            }
            else
                *p++ = (Py_UNICODE)ch;
            break;
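
The same 3-byte branch can be sketched in Python to show why a surrogate passes through: the only range check is for overlong forms below 0x0800. The function name and the lead-byte check are additions for this example, not part of the C code:

```python
def decode_three_byte(seq):
    """Decode one 3-byte UTF-8 sequence to a code point.

    Mirrors the C branch above: continuation bytes are validated and
    overlong forms rejected, but surrogates are not checked.
    """
    b0, b1, b2 = seq
    if (b0 & 0xF0) != 0xE0:
        raise ValueError("not a 3-byte lead byte")  # added for this sketch
    if (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("invalid data")
    ch = ((b0 & 0x0F) << 12) + ((b1 & 0x3F) << 6) + (b2 & 0x3F)
    if ch < 0x0800:
        raise ValueError("illegal encoding")
    return ch


print(hex(decode_three_byte(b"\xe2\x82\xac")))  # 0x20ac (EURO SIGN)
print(hex(decode_three_byte(b"\xed\xa0\x80")))  # 0xd800, a surrogate
```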


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-07-12 Thread Adam Olsen


Adam Olsen [EMAIL PROTECTED] added the comment:

Marc, perhaps Unicode has refined their definitions since you last looked?

Valid UTF-8 *cannot* contain surrogates[1].  If it does, you have
CESU-8[2][3], not UTF-8.

So there are two bugs: first, the UTF-8 codec should refuse to load
surrogates.  Second, since the original bug showed up before the .pyc is
created, something in the parse/compilation/whatever stage is producing
CESU-8.


[1] 4th bullet point of D92 in
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
[2] http://unicode.org/reports/tr26/
[3] http://en.wikipedia.org/wiki/CESU-8
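
As a byte-level illustration of the UTF-8 versus CESU-8 distinction, the sketch below compares the two byte sequences for U+10123 and shows that the strict UTF-8 decoder in current Python 3 refuses the CESU-8 form. The byte strings were worked out by hand for this example:

```python
utf8_bytes = b"\xf0\x90\x84\xa3"           # U+10123 as one 4-byte sequence
cesu8_bytes = b"\xed\xa0\x80\xed\xb4\xa3"  # U+D800 and U+DD23, 3 bytes each

# Well-formed UTF-8 decodes to the single character.
decoded = utf8_bytes.decode("utf-8")

# The surrogate-based form is rejected by the strict decoder.
try:
    cesu8_bytes.decode("utf-8")
    cesu8_rejected = False
except UnicodeDecodeError:
    cesu8_rejected = True

print(decoded == "\U00010123")  # True
print(cesu8_rejected)           # True
```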


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-07-12 Thread Adam Olsen


Adam Olsen [EMAIL PROTECTED] added the comment:

Err, to clarify, the parse/compile/whatever stage is producing broken
UTF-32 (surrogates are ill-formed there too), and that gets transformed
into CESU-8 when the .pyc is saved.

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3297
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-07-11 Thread Ezio Melotti


Ezio Melotti [EMAIL PROTECTED] added the comment:

On my Linux box sys.maxunicode == 1114111 and len(u'\U00010123') == 1,
so it should be a UTF-32 build.
On windows instead sys.maxunicode == 65535 and len(u'\U00010123') == 2,
so it should be a UTF-16 build.
The problem seems then related to UTF-32 builds.
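
The same check can be written as a short script. This is a sketch in Python 3 syntax; on Python 3.3 and later there are no narrow builds, so it should always report the wide case:

```python
import sys

# sys.maxunicode is 0xFFFF on narrow (UTF-16) builds and 0x10FFFF otherwise.
wide_build = sys.maxunicode > 0xFFFF
units = len("\U00010123")  # 1 on wide builds, 2 on narrow builds

print(sys.maxunicode, units, "wide" if wide_build else "narrow")
```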


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-07-11 Thread Adam Olsen


Adam Olsen [EMAIL PROTECTED] added the comment:

Simpler way to reproduce this (on linux):

$ rm unicodetest.pyc 
$ 
$ python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: u'\ud800\udd23' u'\U00010123'
$ 
$ python -c 'import unicodetest'
Result: True
Len: 1 1
Repr: u'\U00010123' u'\U00010123'

Storing surrogates in UTF-32 is ill-formed[1], so the first part
definitely shouldn't be failing on linux (with a UTF-32 build).

The repr could go either way, as unicode doesn't cover escape sequences.
 We could allow u'\ud800\udd23' literals to magically become
u'\U00010123' on UTF-32 builds.  We already allow repr(u'\ud800\udd23')
to magically become u'\U00010123' on UTF-16 builds (which is why the
repr test always passes there, rather than always failing).

The bigger problem is how much we prohibit ill-formed character
sequences.  We already prevent values above U+10FFFF, but not
inappropriate surrogates.


[1] Search for D90 in http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

--
nosy: +Rhamphoryncus
Added file: http://bugs.python.org/file10880/unicodetest.py


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-07-11 Thread Marc-Andre Lemburg


Marc-Andre Lemburg [EMAIL PROTECTED] added the comment:

Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16
vs. UTF-32).

The conversions done from the literal escaped representation to the
internal format are done using the unicode-escape and raw-unicode-escape
codecs.

PYC files are written using the marshal module, which uses UTF-8 as
encoding for Unicode objects.

All of these codecs know about surrogates, so there must be a bug
somewhere in the Python tokenizer or compiler.

I checked on Linux using a UCS2 and a UCS4 build of Python 2.5: the
problem only shows up with the UCS4 build.


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-07-11 Thread Adam Olsen


Adam Olsen [EMAIL PROTECTED] added the comment:

No, the configure options are wrong - we do use UTF-16 and UTF-32. 
Although modern UCS-4 has been restricted down to the range of UTF-32
(it used to be larger!), UCS-2 still doesn't support the supplementary
planes (ie no surrogates.)

If it really was UCS-2, the repr wouldn't be u'\U00010123' on windows. 
It'd be a pair of ill-formed code units instead.


[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2008-07-06 Thread Ezio Melotti


New submission from Ezio Melotti [EMAIL PROTECTED]:

Problem: when you have Unicode characters with a code point greater than
U+FFFF written directly in the source file (that is, not in the form
u'\UXXXXXXXX' but as normal chars in a u'' string), the interpreter uses
surrogate pairs to represent these characters only if the pyc
doesn't exist. Once the pyc is created it uses a normal character
(\UXXXXXXXX instead of the pair \uXXXX\uXXXX). This could lead to
unexpected behavior when comparing Unicode strings or in other
situations (even if it could be worked around in different
ways - using u'\Uxxx' or u'\uxxx' instead of the characters,
encoding them before comparing - there shouldn't be differences between
a py and its pyc).

Tested on:
Ubuntu 8.04 with python 2.4: Uses a surrogate pair.
Ubuntu 8.04 with python 2.5: Uses a surrogate pair.
Windows XP SP2 with python 2.4: Uses a normal character.

Steps to reproduce the problem:
 1a. download the attached file or create it following the next step;
 1b. in a UTF-8-aware console write `print unichr(int('10123', 16))`
(or any codepoint >= 0x10000), copy the printed
character (depending on the console it could be a box, two boxes or a
character) into a file with the lines `# -*- coding: utf-8 -*-`, `print
'Result:', u'paste here the char' == u'\U00010123'` and `print
'Repr:', repr(u'paste here the char'), repr(u'\U00010123')`. Save the
file in UTF-8;
 2. open a python interpreter and import the file (`import
unicodetest`). It should print `Result: False` and `Repr:
u'\ud800\udd23' u'\U00010123'` (the character is represented as a
surrogate pair). During this step the pyc file is created.
 3. from the python interpreter write `reload(unicodetest)`. Now it
should print `Result: True` and `Repr: u'\U00010123' u'\U00010123'` (the
char is represented as a normal character). Any other reload will
print True. If you delete the pyc and reload again it will print False.

(Instead of using reload() it is also possible to create a function and
call it from the module when it's loaded and again with
unicodetest.func(); the result will be the same.)
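
The comparison in the steps above can also be sketched without touching the filesystem, by compiling a source string that contains the literal character and round-tripping the code object through marshal, the module used to write pyc files. This is a Python 3 sketch, not the attached unicodetest.py; the helper name `run` is made up for the example, and the expected values assume Python 3.3 or later:

```python
import marshal

# The \U escape is resolved here, so the compiled source text contains
# the actual non-BMP character inside its string literal.
source = "s = '\U00010123'\n"

code = compile(source, "unicodetest.py", "exec")
reloaded = marshal.loads(marshal.dumps(code))  # roughly what a pyc stores


def run(code_obj):
    """Execute a module-level code object and return its variable s."""
    namespace = {}
    exec(code_obj, namespace)
    return namespace["s"]


fresh = run(code)       # corresponds to the first import (no pyc yet)
cached = run(reloaded)  # corresponds to loading from the pyc

print(fresh == cached == "\U00010123")  # True
print(len(fresh), len(cached))          # 1 1
```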

Expected behavior:
The interpreter should use the same representation in both situations
(and print True in both tests).
Another solution could be to change the behavior of == to return True if
a normal char is compared with its surrogate pair (if it makes sense).

Further information:
The character used for the test is part of the Unicode Plane 1 (see
http://en.wikipedia.org/wiki/Basic_Multilingual_Plane).
More information about the surrogate pairs can be found here:
http://en.wikipedia.org/wiki/Surrogate_pair#Encoding_of_characters_outside_the_BMP

--
components: Unicode
files: unicodetest.py
messages: 69321
nosy: ezio.melotti, lemburg
severity: normal
status: open
title: Python interpreter uses Unicode surrogate pairs only before the pyc is 
created
type: behavior
versions: Python 2.4, Python 2.5
Added file: http://bugs.python.org/file10826/unicodetest.py
