[issue5127] UnicodeEncodeError - I can't even see license

2009-10-06 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

So the discussion is now on 2 points:

1. Is the change backwards compatible? (at the code level, after
recompilation).  My answer is yes, because all known case
transformations stay in the same plane: if you pass a char in the BMP,
they return a char in the BMP; if you pass a code above the BMP
(> 0xFFFF), you get another code above the BMP. In other words: in
narrow builds, when you pass a Py_UNICODE, the answer will be correct
even when downcast to Py_UNICODE.  If you want, I can add checks to
makeunicodedata.py to verify that future Unicode standards don't break
this statement.

Naive code that simply walks the Py_UNICODE* buffer will have
identical behavior.  (The current unicode methods fall into this case;
they should be fixed later.)

2. Is this change acceptable for 3.2?  I'd say yes, because existing
extension modules that use these functions will need to be recompiled;
the function names change, so the modules won't load otherwise.  There is
no need to change the API number for this.
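
The "same plane" claim in point 1 can be spot-checked from Python. The sketch below is only an illustration, not part of any patch: it assumes Python 3.3 or later (where every code point is a single character), and the helper names are made up for this example. It uses str.lower()/upper()/title() and ignores results longer than one character, since those are full case mappings rather than the one-to-one mappings discussed here.

```python
def plane(code_point):
    """Return the Unicode plane of a code point (0 is the BMP)."""
    return code_point >> 16


def cross_plane_case_mappings(code_points):
    """Collect (source, mapped) pairs whose one-to-one case mapping changes plane."""
    crossing = []
    for cp in code_points:
        ch = chr(cp)
        for mapped in (ch.lower(), ch.upper(), ch.title()):
            # Skip multi-character results such as 'ß'.upper() == 'SS'.
            if len(mapped) == 1 and plane(ord(mapped)) != plane(cp):
                crossing.append((cp, ord(mapped)))
    return crossing


# Deseret (U+10400..U+1044F) is a cased script outside the BMP.
print(cross_plane_case_mappings(range(0x10400, 0x10450)))
```

Running the same function over range(0x110000) would check the whole code space against whatever Unicode version the interpreter ships.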

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5127
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue5127] UnicodeEncodeError - I can't even see license

2009-10-06 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

It's not as easy as that.

The functions for case conversion are used in a way that assumes they
never fail (and indeed, the existing functions cannot fail).

What we can do is change the input parameter to Py_UCS4, but not the
Py_UNICODE output parameter, since that would cause lots of compiler
warnings and implicit truncation on UCS2 builds, which would be a pretty
disruptive change.

However, this change would not really help anyone if there are no
mappings from BMP to non-BMP or vice-versa, so I'm not sure whether this
whole exercise is worth the effort.

It appears to be better to just leave the case mapping APIs unchanged -
or am I missing something ?

The situation is different for the various Py_UNICODE_IS*() APIs: for
these we can change the input parameter to Py_UCS4, remove the name
mangling and add UCS2 helper functions to maintain API compatibility on
UCS2 builds.




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-06 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> that would cause lots of compiler
> warnings and implicit truncation on UCS2 builds

Unfortunately, there is no such warning; otherwise the initial problem we are
trying to solve would have been spotted by it (unicode_repr() calls
Py_UNICODE_ISPRINTABLE() with a Py_UCS4 argument).

gcc has a -Wconversion flag (which I tried today on python), but it is far too
verbose before version 4.3, and this newer version still has some false
positives. http://gcc.gnu.org/wiki/NewWconversion

But the most important thing is that implicit truncation on UCS2 builds is what
happens already! The patch does not solve it, but at least it yields sensible
results to wary code.
Or can you imagine some (somewhat working) code whose behavior will be worse
after the change?
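
To make the "implicit truncation" concrete, here is a small Python model of what a C conversion from a 32-bit value to a 16-bit unsigned Py_UNICODE does. It is a sketch only; the function names are hypothetical and nothing here is taken from the patch.

```python
def downcast_to_py_unicode(code_point):
    """Model a C conversion of a 32-bit code point to a 16-bit Py_UNICODE."""
    return code_point & 0xFFFF


def survives_downcast(code_point):
    """True when the value is unchanged by the 16-bit conversion."""
    return downcast_to_py_unicode(code_point) == code_point


# A non-BMP code point silently loses its high bits.
print(hex(downcast_to_py_unicode(0x10400)))
print(survives_downcast(0x00E9), survives_downcast(0x10400))
```

The compiler does this without complaint unless something like -Wconversion is enabled, which is why the problem went unnoticed.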




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-06 Thread Amaury Forgeot d'Arc

Changes by Amaury Forgeot d'Arc amaur...@gmail.com:


Added file: http://bugs.python.org/file15058/unicodectype_ucs4_3.patch




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Adam Olsen wrote:
>
> Adam Olsen rha...@gmail.com added the comment:
>
> Surrogates aren't optional features of UTF-16, we really need to get
> this fixed.  That includes .isalpha().

We use UCS2 on narrow Python builds, not UTF-16.

> We might keep the old public API for compatibility, but it should be
> clearly marked as broken for non-BMP scalar values.

That has always been the case. UCS2 doesn't support surrogates.

However, we have been slowly moving in the direction of making
the UCS2 storage appear like UTF-16 to the Python programmer.

This process is not yet complete and will likely never complete
since it must still be possible to create things like lone
surrogates for processing purposes, so care has to be taken
when using non-BMP code points on narrow builds.

> I don't see a problem with changing 2.x.  The existing behaviour is
> broken for non-BMP scalar values, so surely nobody can claim dependence
> on it.

No, but changing the APIs from 16-bit integers to 32-bit integers
does require a recompile of all code using it. Otherwise you
end up with segfaults.

Also, the Unicode type database itself uses Py_UNICODE, so
case mapping would fail for non-BMP code points.

So if we want to support accessing non-BMP type information
on narrow builds, we'd need to change the complete
Unicode type database API to work with UCS4 code points and then
provide a backwards compatible C API using Py_UNICODE. Due
to the UCS2/UCS4 API renaming done in unicodeobject.h, this
would amount to exposing both the UCS2 and the UCS4 variants
of the APIs on narrow builds.

With such an approach we'd not break the binary API and
still get the full UCS4 range of code points in the type
database. The change would be possible in Python 2.x and
3.x (which now both use the same strategy with respect to change
management).

Would someone be willing to work on this ?




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> No, but changing the APIs from 16-bit integers to 32-bit integers
> does require a recompile of all code using it.

Is it acceptable between 3.1 and 3.2 for example? ISTM that other
changes already require recompilation of extension modules.

> Also, the Unicode type database itself uses Py_UNICODE, so
> case mapping would fail for non-BMP code points.

Where, please? In unicodedata.c, getuchar and _getrecord_ex use Py_UCS4.




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>
>> No, but changing the APIs from 16-bit integers to 32-bit integers
>> does require a recompile of all code using it.
>
> Is it acceptable between 3.1 and 3.2 for example? ISTM that other
> changes already require recompilation of extension modules.

With the proposed approach, we'll keep binary compatibility, so
this is not much of an issue.

Note: Changes to the binary interface can be done in minor releases,
but we should make sure that it's not possible to load an extension
compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

>> Also, the Unicode type database itself uses Py_UNICODE, so
>> case mapping would fail for non-BMP code points.
>
> Where, please? In unicodedata.c, getuchar and _getrecord_ex use Py_UCS4.

The change affects the Unicode type database which is implemented
in unicodectype.c, not the Unicode database, which already uses UCS4.




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> we should make sure that it's not possible to load an extension
> compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

This is the case with this patch: today all these functions
(_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to
_PyUnicodeUCS2_* or _PyUnicodeUCS4_*.
The patch removes the #defines: 3.1 modules that call
_PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter.

> The change affects the Unicode type database which is implemented
> in unicodectype.c, not the Unicode database, which already uses UCS4.

Are you referring to the _PyUnicode_TypeRecord structure?
The first three fields only contain values up to 65535, so they could
use unsigned short even for UCS4 builds.
All the other uses are precisely what the patch changes...




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

>> We might keep the old public API for compatibility, but it should be
>> clearly marked as broken for non-BMP scalar values.
>
> That has always been the case. UCS2 doesn't support surrogates.
>
> However, we have been slowly moving in the direction of making
> the UCS2 storage appear like UTF-16 to the Python programmer.

UCS2 died long ago; is there any reason why we keep using a UCS2 that
appears like UTF-16 instead of real UTF-16?

> This process is not yet complete and will likely never complete
> since it must still be possible to create things like lone
> surrogates for processing purposes, so care has to be taken
> when using non-BMP code points on narrow builds.

I don't exactly know all the details of the current implementation, but
-- from what I understand reading this (correct me if I'm wrong) -- it
seems that the implementation is half-UCS2 to allow things like the
processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to
work with surrogate pairs and hence with chars outside the BMP.

What are the use cases for processing the lone surrogates? Wouldn't it be
better to use UTF-16 and disallow them (since they are illegal) and
possibly provide some other way to deal with them (if it's really needed)?




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>
>> we should make sure that it's not possible to load an extension
>> compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.
>
> This is the case with this patch: today all these functions
> (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to
> _PyUnicodeUCS2_* or _PyUnicodeUCS4_*.
> The patch removes the #defines: 3.1 modules that call
> _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter.

True, but we can do better. For narrow builds, the API currently
exposes the UCS2 APIs. We'd need to expose the UCS4 APIs *in addition*
to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

For wide builds, we don't need to change anything.

>> The change affects the Unicode type database which is implemented
>> in unicodectype.c, not the Unicode database, which already uses UCS4.
>
> Are you referring to the _PyUnicode_TypeRecord structure?
> The first three fields only contain values up to 65535, so they could
> use unsigned short even for UCS4 builds.

I haven't checked, but it's certainly possible to have a code point
use a non-BMP lower/upper/title case mapping, so this should be
made possible as well, if we're going to make changes to the type
database.




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

This is off-topic for the tracker item, but I'll reply anyway:

Ezio Melotti wrote:
>
> Ezio Melotti ezio.melo...@gmail.com added the comment:
>
>>> We might keep the old public API for compatibility, but it should be
>>> clearly marked as broken for non-BMP scalar values.
>>
>> That has always been the case. UCS2 doesn't support surrogates.
>>
>> However, we have been slowly moving in the direction of making
>> the UCS2 storage appear like UTF-16 to the Python programmer.
>
> UCS2 died long ago; is there any reason why we keep using a UCS2 that
> appears like UTF-16 instead of real UTF-16?

UCS2 is how we store Unicode in Python for narrow builds internally.
It's a storage format, not an encoding.

However, on narrow builds such as the Windows builds, you will sometimes
want to create Unicode strings that use non-BMP code points. Since
both UCS2 and UCS4 can represent the UTF-16 encoding, it's handy to
expose a bit of automatic conversion at the Python level to make
things easier for the programmer.

>> This process is not yet complete and will likely never complete
>> since it must still be possible to create things like lone
>> surrogates for processing purposes, so care has to be taken
>> when using non-BMP code points on narrow builds.
>
> I don't exactly know all the details of the current implementation, but
> -- from what I understand reading this (correct me if I'm wrong) -- it
> seems that the implementation is half-UCS2 to allow things like the
> processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to
> work with surrogate pairs and hence with chars outside the BMP.
>
> What are the use cases for processing the lone surrogates? Wouldn't it be
> better to use UTF-16 and disallow them (since they are illegal) and
> possibly provide some other way to deal with them (if it's really needed)?

No, because Python is meant to be used for working on all Unicode
code points. Lone surrogates are not allowed in transfer encodings
such as UTF-16 or UTF-8, but they are valid Unicode code points and
you need to be able to work with them, since you may want to construct
surrogate pairs by hand or get lone surrogates as a result of slicing a
Unicode string.
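
The slicing case can be illustrated with a short Python sketch. It models narrow-build storage as a list of 16-bit code units obtained through the UTF-16 codec; the helper names are invented for this example, and modern Python strings no longer behave like narrow builds, hence the explicit list.

```python
def utf16_code_units(text):
    """Return the 16-bit code units that UTF-16 uses for text."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True for any code unit in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= unit <= 0xDFFF


units = utf16_code_units("a\U00010000b")
head = units[:2]   # the slice boundary falls inside the surrogate pair
print([hex(u) for u in units], is_surrogate(head[-1]))
```

The slice ends with a high surrogate that has lost its partner, which is exactly the kind of value a narrow build has to be able to store.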




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> We'd need to expose the UCS4 APIs *in addition*
> to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

Why have two names for the same function? It's Python 3, after all.
Or is this no-recompile feature so important (as long as changes are
clearly shown to the user)? It does not work on Windows, FWIW.

> I haven't checked, but it's certainly possible to have a code point
> use a non-BMP lower/upper/title case mapping, so this should be
> made possible as well, if we're going to make changes to the type
> database.

OK, here is a new patch.  Even if this does not happen with unicodedata
up to 5.1, the table has only 175 entries so memory usage is not
dramatically increased.
Py_UNICODE is no longer used at all in unicodectype.c.

--
Added file: http://bugs.python.org/file15047/unicodectype_ucs4-2.patch




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>
>> We'd need to expose the UCS4 APIs *in addition*
>> to those APIs and have the UCS2 APIs redirect to the UCS4 ones.
>
> Why have two names for the same function? It's Python 3, after all.

It's not the same function... the UCS2 version would take a
Py_UNICODE parameter, the UCS4 version a Py_UCS4 parameter.

I don't understand the comment about Python 3.x. FWIW, we're no
longer in the "backwards incompatible changes are allowed" mode
for 3.x.

> Or is this no-recompile feature so important (as long as changes are
> clearly shown to the user)? It does not work on Windows, FWIW.

There are generally two options for API changes within a
major release branch:

 1. the changes are API backwards compatible and only the Python API
    version is changed

 2. the changes are not API backwards compatible; in such a case,
    Python has to reject imports of old modules (as it always
    does on Windows), so the Python API version has to be changed
    *and* the import mechanism must reject the import

The second option was used when transitioning from 2.4 to 2.5 due
to the Py_ssize_t changes.

We could do the same for 2.7/3.2, but if it's just needed for this
one change, then I'd rather stick to implementing the first option.

>> I haven't checked, but it's certainly possible to have a code point
>> use a non-BMP lower/upper/title case mapping, so this should be
>> made possible as well, if we're going to make changes to the type
>> database.
>
> OK, here is a new patch.  Even if this does not happen with unicodedata
> up to 5.1, the table has only 175 entries so memory usage is not
> dramatically increased.
> Py_UNICODE is no longer used at all in unicodectype.c.

Sorry, but this doesn't work: the functions have to return Py_UNICODE
and raise an exception if the return value doesn't fit.

Otherwise, you'd get completely wrong values in code downcasting
the return value to Py_UNICODE on narrow builds.

Another good reason to use two sets of APIs. The new set could
indeed return Py_UCS4 values.
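
The two-API idea can be modelled in a few lines of Python. This is a rough sketch under stated assumptions, not the proposed C code: to_lower_ucs4 and to_lower_ucs2 are hypothetical names, str.lower() stands in for the type database, and the exception types are chosen only for illustration.

```python
BMP_MAX = 0xFFFF


def to_lower_ucs4(code_point):
    """Hypothetical wide API: accepts and returns any code point."""
    mapped = chr(code_point).lower()
    # Keep only one-to-one mappings; otherwise return the input unchanged.
    return ord(mapped) if len(mapped) == 1 else code_point


def to_lower_ucs2(code_unit):
    """Hypothetical narrow API: 16-bit in, 16-bit out, or an error."""
    if not 0 <= code_unit <= BMP_MAX:
        raise ValueError("argument does not fit in 16 bits")
    result = to_lower_ucs4(code_unit)
    if result > BMP_MAX:
        # Guard for a BMP to non-BMP mapping, should one ever be defined.
        raise OverflowError("result does not fit in 16 bits")
    return result


print(hex(to_lower_ucs4(0x10400)), hex(to_lower_ucs2(0x41)))
```

The narrow function redirects to the wide one and reports an error instead of silently handing back a truncated value.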




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Adam Olsen

Adam Olsen rha...@gmail.com added the comment:

On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg rep...@bugs.python.org wrote:
> We use UCS2 on narrow Python builds, not UTF-16.
>
>> We might keep the old public API for compatibility, but it should be
>> clearly marked as broken for non-BMP scalar values.
>
> That has always been the case. UCS2 doesn't support surrogates.
>
> However, we have been slowly moving in the direction of making
> the UCS2 storage appear like UTF-16 to the Python programmer.
>
> This process is not yet complete and will likely never complete
> since it must still be possible to create things like lone
> surrogates for processing purposes, so care has to be taken
> when using non-BMP code points on narrow builds.

Balderdash.  We expose UTF-16 code units, not UCS-2.  Guido has made
this quite clear.

UTF-16 was designed as an easy transition from UCS-2.  Indeed, if your
code only does searches or joins existing strings then it will Just
Work; declare it UTF-16 and you are done.  We have a lot more work to
do than that (as in this bug report), and we can't reasonably prevent
the user from splitting surrogate pairs via poor code, but a 95%
solution doesn't mean we suddenly revert all the way back to UCS-2.

If the intent really was to use UCS-2 then a correctly functioning
UTF-16 codec would join a surrogate pair into a single scalar value,
then raise an error because it's outside the range representable in
UCS-2.  That's not very helpful though; obviously, it's much better to
use UTF-16 internally.
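
The hypothetical strict decoder described in the paragraph above might look like the following Python sketch. It is an illustration of the argument only; the function name is invented and no such codec exists in Python.

```python
def decode_utf16_to_ucs2(units):
    """Decode UTF-16 code units, refusing anything strict UCS-2 cannot hold."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            nxt = units[i + 1] if i + 1 < len(units) else None
            if nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
                # Join the pair into one scalar value, then reject it.
                scalar = 0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00)
                raise ValueError("U+%04X is outside UCS-2" % scalar)
            raise ValueError("lone high surrogate")
        if 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("lone low surrogate")
        out.append(unit)
        i += 1
    return out


print(decode_utf16_to_ucs2([0x41, 0x20AC]))
```

Every non-BMP character would be an error under this model, which is the unhelpful behaviour the paragraph argues against.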

The alternative (no matter what the configure flag is called) is
UTF-16, not UCS-2 though: there is support for surrogate pairs in
various places, including the \U escape and the UTF-8 codec.
http://mail.python.org/pipermail/python-dev/2008-July/080892.html

If you find places where the Python core or standard library is doing
Unicode processing that would break when surrogates are present you
should file a bug. However this does not mean that every bit of code
that slices a string at an arbitrary point (and hence risks slicing in
the middle of a surrogate) is incorrect -- it all depends on what is
done next with the slice.
http://mail.python.org/pipermail/python-dev/2008-July/080900.html




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

All this is just nitpicking, really. UCS2 is a character set,
UTF-16 an encoding.

It so happens that when the Unicode consortium realized
that 16 bit would not be enough to represent all scripts of the
world, they added the concept of surrogates and reserved a few
ranges of code points in UCS2 to represent these extra code
points which are not part of UCS2, but of its extension, UCS4.

The conversion of these surrogate pairs to UCS4 code point
values is what you find defined in UTF-16.

If we were to implement Unicode using UTF-16 as storage format,
we would not be able to store single lone surrogates, since these
are not allowed in UTF-16. Ditto for unassigned ordinals, invalid
code points, etc.

PEP 100 really says it all:

http://www.python.org/dev/peps/pep-0100/

    This [internal] format will hold UTF-16 encodings of the corresponding
    Unicode ordinals.  The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    ...
    Future implementations can extend the 16 bit restriction to the
    full set of all UTF-16 addressable characters (around 1M
    characters).

Note that I wrote the PEP and worked on the implementation at a time
when Unicode 2.x was still in widespread use (mostly on Windows)
and 3.0 was just being released:

http://www.unicode.org/history/publicationdates.html

But all that is off-topic for this ticket, so please let's just
stop such discussions.




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Adam Olsen

Adam Olsen rha...@gmail.com added the comment:

On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg rep...@bugs.python.org wrote:
> All this is just nitpicking, really. UCS2 is a character set,
> UTF-16 an encoding.

UCS is a character set, for most purposes synonymous with the Unicode
character set.  UCS-2 and UTF-16 are both encodings of that character
set.  However, UCS-2 can only represent the BMP, while UTF-16 can
represent the full range.

> If we were to implement Unicode using UTF-16 as storage format,
> we would not be able to store single lone surrogates, since these
> are not allowed in UTF-16. Ditto for unassigned ordinals, invalid
> code points, etc.

No.  Internal usage may become temporarily ill-formed, but this is a
compromise, and acceptable so long as we never export them to other
systems.
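
The "ill-formed internally, never exported" compromise is roughly what later Python 3 releases ended up doing. The sketch below relies on modern Python 3 codec behaviour, which is an assumption relative to this 2009 discussion; export_utf8 is a made-up helper name.

```python
def export_utf8(text, allow_surrogates=False):
    """Encode text for another system; lone surrogates are rejected by default."""
    errors = "surrogatepass" if allow_surrogates else "strict"
    return text.encode("utf-8", errors)


lone = "\ud800"   # can be held in a str, but is not well-formed Unicode text
try:
    export_utf8(lone)
except UnicodeEncodeError as exc:
    print("refused:", exc.reason)

# An explicit opt-in is needed to let the surrogate through.
print(export_utf8(lone, allow_surrogates=True))
```

The string object can hold the lone surrogate for processing, while the strict codec stops it at the boundary.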

Not that I wouldn't *prefer* a system that wouldn't store lone
surrogates, but.. pragmatics prevail.

> Note that I wrote the PEP and worked on the implementation at a time
> when Unicode 2.x was still in widespread use (mostly on Windows)
> and 3.0 was just being released:
>
>     http://www.unicode.org/history/publicationdates.html

I think you hit the nail on the head there.  10 years ago, unicode
meant something different than it does today.  That's reflected in PEP
100 and in the code.  Now it's time to move on, switch to the modern
terminology, modern usage, and modern specs.

> But all that is off-topic for this ticket, so please let's just
> stop such discussions.

It needs to be discussed somewhere.  It's a distraction from fixing
the bug, but at least it's more private here.  Would you prefer email?




[issue5127] UnicodeEncodeError - I can't even see license

2009-10-04 Thread Adam Olsen

Adam Olsen rha...@gmail.com added the comment:

Surrogates aren't optional features of UTF-16, we really need to get
this fixed.  That includes .isalpha().

We might keep the old public API for compatibility, but it should be
clearly marked as broken for non-BMP scalar values.

I don't see a problem with changing 2.x.  The existing behaviour is
broken for non-BMP scalar values, so surely nobody can claim dependence
on it.

--
nosy: +Rhamphoryncus
type:  - behavior




[issue5127] UnicodeEncodeError - I can't even see license

2009-09-24 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
priority:  - normal
stage:  - patch review




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

FWIW, on Python3 it seems to work:
>>> import unicodedata
>>> unicodedata.category("\U00010000")
'Lo'
>>> unicodedata.category("\U00011000")
'Cn'
>>> unicodedata.category(chr(0x10000))
'Lo'
>>> unicodedata.category(chr(0x11000))
'Cn'
>>> ord(chr(0x10000)), 0x10000
(65536, 65536)
>>> ord(chr(0x11000)), 0x11000
(69632, 69632)

I'm using a narrow build too:
>>> import sys
>>> sys.maxunicode
65535
>>> len('\U00010000')
2
>>> ord('\U00010000')
65536

On Python2 unichr() is supposed to raise a ValueError on a narrow build
if the value is greater than 0xFFFF [1], but if the characters above
0xFFFF can be represented with \U escapes in unicode literals, there
should be a way to fix unichr so it can return them. Python3 already
does it with chr().

Maybe we should open a new issue for this if it's not present already.

[1]: http://docs.python.org/library/functions.html#unichr
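
What a surrogate-aware unichr()/ord() pair would have to compute on a narrow build is plain arithmetic. The Python sketch below models narrow storage as a list of 16-bit code units; narrow_unichr and narrow_ord are hypothetical names, not the CPython implementation.

```python
def narrow_unichr(code_point):
    """Return the 16-bit code units a narrow build could store for code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0xFFFF:
        return [code_point]
    offset = code_point - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def narrow_ord(units):
    """Inverse of narrow_unichr: accept one unit or one surrogate pair."""
    if len(units) == 1:
        return units[0]
    if (len(units) == 2 and 0xD800 <= units[0] <= 0xDBFF
            and 0xDC00 <= units[1] <= 0xDFFF):
        return 0x10000 + ((units[0] - 0xD800) << 10) + (units[1] - 0xDC00)
    raise TypeError("expected one code unit or one surrogate pair")


print([hex(u) for u in narrow_unichr(0x10000)], narrow_ord([0xD800, 0xDC00]))
```

This matches the session above, where len() of the non-BMP character is 2 and ord() still returns 65536.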




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

Since r56395, ord() and chr() accept and return surrogate pairs even in
narrow builds.

The goal is to remove most differences between narrow and wide unicode
builds (except for string lengths, indices or slices)

To address this problem, I suggest changing all functions in
unicodectype.c so that they accept Py_UCS4 characters (instead of
Py_UNICODE).
This would be a binary-incompatible change, and --with-wctype-functions
would have an effect only if sizeof(wchar_t)==4 (instead of the current
condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE)).




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

amaury> Since r56395, ord() and chr() accept and return surrogate pairs
amaury> even in narrow builds.

Note: My examples are made with Python 2.x.

> The goal is to remove most differences between narrow and wide unicode
> builds (except for string lengths, indices or slices)

It would be nice to get the same behaviour in Python 2.x and 3.x to help
migration from Python2 to Python3 ;-)

The unichr() (in Python 2.x) documentation is correct. But I would appreciate
support for surrogates in unichr(), which also means changing ord() behaviour.

> To address this problem, I suggest to change all functions in
> unicodectype.c so that they accept Py_UCS4 characters (instead of
> Py_UNICODE).

Why? Using surrogates, you can use 16-bit Py_UNICODE to store non-BMP
characters (code > 0xFFFF).

--

I can open a new issue if you agree that we can change unichr() / ord() 
behaviour on narrow build. We may ask on the mailing list?




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> That would cause major breakage in the C API

Not if you recompile. I don't see how this breaks the API at the C level.

> and is not in line with the intention of having a Py_UNICODE
> type in the first place.

Py_UNICODE is still used as the allocation unit for unicode strings.

To get correct results, we need a way to access the whole unicode
database even on ucs2 builds; it's possible with the unicodedata module,
why not from C?

My motivation for the change is this post:
http://mail.python.org/pipermail/python-dev/2008-July/080900.html




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-02-03 13:39, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>
> Since r56395, ord() and chr() accept and return surrogate pairs even in
> narrow builds.
>
> The goal is to remove most differences between narrow and wide unicode
> builds (except for string lengths, indices or slices)
>
> To address this problem, I suggest to change all functions in
> unicodectype.c so that they accept Py_UCS4 characters (instead of
> Py_UNICODE).

-1.

That would cause major breakage in the C API and is not in line with the
intention of having a Py_UNICODE type in the first place.

Users who are interested in UCS4 builds should simply use UCS4 builds.

> This would be a binary-incompatible change; and --with-wctype-functions
> would have an effect only if sizeof(wchar_t)==4 (instead of the current
> condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE))

--with-wctype-functions was scheduled for removal many releases ago,
but I never got around to it. The only reason it's still there is
that some Linux distributions use this config option (AFAIR, RedHat).
I'd be +1 on removing the option in 3.0.1 or deprecating it in
3.0.1 and removing it in 3.1.

It's not useful in any way, and causes compatibility problems
with regular builds.

--
nosy: +lemburg




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

I don't understand the behaviour of unichr():

Python 2.7a0 (trunk:68963M, Jan 30 2009, 00:49:28)
>>> import unicodedata
>>> unicodedata.category(u'\U00010000')
'Lo'
>>> unicodedata.category(u'\U00011000')
'Cn'
>>> unicodedata.category(unichr(0x10000))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Why does unichr() fail whereas \U works?

>>> len(u'\U00010000')
2
>>> ord(u'\U00010000')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

--
nosy: +haypo




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-02-03 14:14, STINNER Victor wrote:
> STINNER Victor victor.stin...@haypocalc.com added the comment:
>
> amaury> Since r56395, ord() and chr() accept and return surrogate pairs
> amaury> even in narrow builds.
>
> Note: My examples are made with Python 2.x.
>
>> The goal is to remove most differences between narrow and wide unicode
>> builds (except for string lengths, indices or slices)
>
> It would be nice to get the same behaviour in Python 2.x and 3.x to help
> migration from Python2 to Python3 ;-)
>
> unichr() (in Python 2.x) documentation is correct. But I would appreciate to
> support surrogates using unichr() which means also changing ord() behaviour.

This is not possible for unichr() in Python 2.x, since applications
always expect len(unichr(x)) == 1.

Changing ord() in Python 2.x would be possible and is easier, since
this would only extend the range of returned values for UCS2
builds.

>> To address this problem, I suggest to change all functions in
>> unicodectype.c so that they accept Py_UCS4 characters (instead of
>> Py_UNICODE).
>
> Why? Using surrogates, you can use 16-bit Py_UNICODE to store non-BMP
> characters (code > 0xFFFF).
>
> --
>
> I can open a new issue if you agree that we can change unichr() / ord()
> behaviour on narrow build. We may ask on the mailing list?




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

lemburg> This is not possible for unichr() in Python 2.x, since applications
lemburg> always expect len(unichr(x)) == 1

Oh, ok.

lemburg> Changing ord() would be possible in Python 2.x is easier, since
lemburg> this would only extend the range of returned values for UCS2
lemburg> builds.

ord() of Python3 (narrow build) rejects surrogate characters:

'\U00010000'
>>> len(chr(0x10000))
2
>>> ord(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected string of length 1, but int found

---

It looks like narrow builds with surrogates have some more problems...

Test with U+10000: LINEAR B SYLLABLE B008 A, category: Letter, Other.

Correct result (Python 2.5, wide build):

   $ python
   Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
   >>> unichr(0x10000)
   u'\U00010000'
   >>> unichr(0x10000).isalpha()
   True

Error in Python3 (narrow build):

   marge$ ./python
   Python 3.1a0 (py3k:69105M, Feb  3 2009, 15:04:35)
   >>> chr(0x10000).isalpha()
   False
   >>> list(chr(0x10000))
   ['\ud800', '\udc00']
   >>> chr(0xd800).isalpha()
   False
   >>> chr(0xdc00).isalpha()
   False

Unicode ranges, all in the category Other, Surrogate:
 - U+D800..U+DB7F: Non Private Use High Surrogate
 - U+DB80..U+DBFF: Private Use High Surrogate
 - U+DC00..U+DFFF: Low Surrogate range
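
A minimal sketch of how the ranges above could be used to join surrogate
pairs before testing a character property, so that the test applies to whole
code points rather than to 16-bit units. It is written for a current
Python 3 and simulates the narrow-build storage by encoding to UTF-16; all
function names here are hypothetical.

```python
import struct


def utf16_units(text):
    """Return the 16-bit code units a narrow build would have stored."""
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def classify_unit(unit):
    """Classify one 16-bit unit using the surrogate ranges listed above."""
    if 0xD800 <= unit <= 0xDB7F:
        return "non private use high surrogate"
    if 0xDB80 <= unit <= 0xDBFF:
        return "private use high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "not a surrogate"


def code_points(units):
    """Join high/low surrogate pairs; pass every other unit through."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            result.append(unit)
            i += 1
    return result


def isalpha_by_code_point(units):
    """isalpha() evaluated per code point instead of per 16-bit unit."""
    points = code_points(units)
    return bool(points) and all(chr(p).isalpha() for p in points)


if __name__ == "__main__":
    units = utf16_units("\U00010000")
    print([classify_unit(u) for u in units])
    print(all(chr(u).isalpha() for u in units))   # per unit, as in the narrow build
    print(isalpha_by_code_point(units))           # per code point
```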




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-02-03 14:50, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>
>> That would cause major breakage in the C API
>
> Not if you recompile. I don't see how this breaks the API at the C level.

Well, then try to look at such a change from a C extension
writer's perspective.

They'd have to change all their function calls and routines to work
with Py_UCS4.

Supporting both the old API and the new one would
be nearly impossible and require either an adapter API or a lot
of #ifdef'ery.

Please remember that the public Python C API is not only meant for
Python developers. Its main purpose is to be used by other developers
extending or embedding Python, and those developers use different
release cycles and want to support more than just the bleeding edge
Python version.

Python has a long history of providing very stable APIs, both in
C and in Python.

FWIW: The last major change in the C API (the change to Py_ssize_t
from Python 2.4 to 2.5) has not even propagated to all major C
extensions yet. It's only now that people start to realize problems
with this, since their extensions start failing with segfaults
on 64-bit machines.

That said, we can of course provide additional UCS4 APIs for
certain things and also provide conversion helpers between
Py_UNICODE and Py_UCS4 where needed.

>> and is not in line with the intention of having a Py_UNICODE
>> type in the first place.
>
> Py_UNICODE is still used as the allocation unit for unicode strings.
>
> To get correct results, we need a way to access the whole unicode
> database even on ucs2 builds; it's possible with the unicodedata module,
> why not from C?

I must be missing some detail, but what does the Unicode database
have to do with the unicodeobject.c C API?

> My motivation for the change is this post:
> http://mail.python.org/pipermail/python-dev/2008-July/080900.html

There are certainly other ways to make Python deal with surrogates
in more cases than the ones we already support.




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

haypo> ord() of Python3 (narrow build) rejects surrogate characters:
haypo> '\U00010000'
haypo> >>> len(chr(0x10000))
haypo> 2
haypo> >>> ord(0x10000)
haypo> TypeError: ord() expected string of length 1, but int found

ord() works fine on Py3; you probably meant to do
>>> ord('\U00010000')
65536
or
>>> ord(chr(0x10000))
65536

Py3 also states that it accepts surrogate pairs (see help(ord)).
Py2 instead doesn't support them:
>>> ord(u'\U00010000')
TypeError: ord() expected a character, but string of length 2 found




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> I must be missing some detail, but what does the Unicode database
> have to do with the unicodeobject.c C API ?

Ah, now I understand your concerns. My suggestion is to change only the 20
functions in unicodectype.c (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase...)
and to make no change in unicodeobject.c at all.
They all take a single code point as argument; some also return a single
code point.
Changing these functions is backwards compatible.

I attach a patch so we can discuss concrete code (tests are missing).

Another effect of the patch: unicodedata.numeric('\N{AEGEAN NUMBER TWO}') can
return 2.0.

The str.isalpha() methods (and others) did not change: they still split the
surrogate pairs.
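
As a rough Python-level illustration of property lookups that take one full
code point as argument, the sketch below queries the unicodedata module by
integer code point. It runs on a current Python 3 and says nothing about how
the C functions in the patch are written; the helper name describe is
hypothetical.

```python
import unicodedata


def describe(code_point):
    """Look up a few character properties for one full code point."""
    ch = chr(code_point)
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "isalpha": ch.isalpha(),
        "numeric": unicodedata.numeric(ch, None),  # None when not numeric
    }


if __name__ == "__main__":
    aegean_two = ord(unicodedata.lookup("AEGEAN NUMBER TWO"))
    print(hex(aegean_two), describe(aegean_two))
    print(hex(0x10000), describe(0x10000))
```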

--
keywords: +patch
Added file: http://bugs.python.org/file12934/unicodectype_ucs4.patch




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-02 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

There were non-ascii characters in the Windows license file. This was
corrected with r67860.


> I believe that chr(0x10000) and chr(0x11000) should have the
> opposite behavior.

This other problem occurs because on a narrow unicode build,
Py_UNICODE_ISPRINTABLE takes a 16-bit integer.
And indeed,

>>> unicodedata.category(chr(0x10000 % 65536))
'Cc'
>>> unicodedata.category(chr(0x11000 % 65536))
'Lo'
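
The same truncation can be reproduced on a current Python 3 with a small
sketch. The helper name truncated_category is hypothetical; it only mimics
what a function receiving a 16-bit argument would see and is not the C macro
itself.

```python
import unicodedata


def truncated_category(code_point):
    """Category of the code point after dropping everything above 16 bits."""
    return unicodedata.category(chr(code_point & 0xFFFF))


if __name__ == "__main__":
    # 0x10000 truncates to U+0000 (a control character),
    # 0x11000 truncates to U+1000 (a letter).
    for cp in (0x10000, 0x11000):
        print(hex(cp), "->", hex(cp & 0xFFFF), truncated_category(cp))
```

This is why the two characters appear to swap behaviour: the printable
check is answered for U+0000 and U+1000 instead of the intended code points.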

--
nosy: +amaury.forgeotdarc




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-01 Thread Venusaur

New submission from Venusaur bup...@hotmail.com:

>>> license
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\site.py", line 372, in __repr__
    self.__setup()
  File "C:\Python30\lib\site.py", line 359, in __setup
    data = fp.read()
  File "C:\Python30\lib\io.py", line 1724, in read
    decoder.decode(self.buffer.read(), final=True))
  File "C:\Python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
UnicodeDecodeError: 'cp949' codec can't decode bytes in position 15164-15165:
illegal multibyte sequence
>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
UnicodeEncodeError: 'cp949' codec can't encode character '\ud804' in
position 1: illegal multibyte sequence


I also can't understand why chr(0x10000) and chr(0x11000) have different
behavior.

--
components: Unicode
messages: 80924
nosy: bupjae
severity: normal
status: open
title: UnicodeEncodeError - I can't even see license
versions: Python 3.0




[issue5127] UnicodeEncodeError - I can't even see license

2009-02-01 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Here (winxpsp2, Py3, cp850-terminal) the license works fine:
>>> license
Type license() to see the full license text

and license() works as well.

I get this output for the chr()s:
>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Programs\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "C:\Programs\Python30\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position
1-2: character maps to <undefined>

I believe that chr(0x10000) and chr(0x11000) should have the opposite
behavior.
U+10000 (LINEAR B SYLLABLE B008 A) belongs to the 'Lo' category and
should be printed (and possibly raise a UnicodeError, see issue5110
[1]), U+11000 belongs to the 'Cn' category and should be escaped[2].

On Linux with Py3 and a UTF-8 terminal, chr(0x10000) prints '\U00010000'
and chr(0x11000) prints the char (actually I see two boxes, but that
shouldn't be a problem of Python). The license() works fine too.

Also note that with cp850 the error message is 'character maps to
<undefined>' and with cp949 it is 'illegal multibyte sequence'.

[1]: http://bugs.python.org/issue5110
[2]: http://www.python.org/dev/peps/pep-3138/#specification
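
A small sketch for comparing how different terminal encodings react to a
non-BMP character, and for producing an escaped fallback instead of a
traceback. It is written for a current Python 3, where the string holds one
code point rather than a surrogate pair, so reported positions differ from
the narrow-build tracebacks above. The helper names encode_failure and
safe_text are hypothetical.

```python
def encode_failure(text, encoding):
    """Return (codec name, reason) if text cannot be encoded, else None."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc.encoding, exc.reason
    return None


def safe_text(text, encoding):
    """Replace unencodable characters with backslash escapes."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


if __name__ == "__main__":
    sample = "\U00010000"
    for enc in ("cp850", "cp949", "utf-8"):
        print(enc, encode_failure(sample, enc), ascii(safe_text(sample, enc)))
```

The reason string comes from the codec family in use, which is why the
charmap-based cp850 and the multibyte cp949 word their errors differently.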

--
nosy: +ezio.melotti
