[issue5127] UnicodeEncodeError - I can't even see license
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

So the discussion is now on 2 points:

1. Is the change backwards compatible? (at the code level, after recompilation). My answer is yes, because all known case transformations stay in the same plane: if you pass a char in the BMP, they return a char in the BMP; if you pass a code 0x1000, you get another code 0x1000. In other words: in narrow builds, when you pass a Py_UNICODE, the answer will be correct even when downcast to Py_UNICODE. If you want, I can add checks to makeunicodedata.py to verify that future Unicode standards don't break this statement. Naive code that simply walks the Py_UNICODE* buffer will have identical behavior. (The current unicode methods fall into this category. They should be fixed later.)

2. Is this change acceptable for 3.2? I'd say yes, because existing extension modules that use these functions will need to be recompiled; the function names change, so the modules won't load otherwise. There is no need to change the API number for this.

--
___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue5127 ___
___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
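The kind of check mentioned for makeunicodedata.py could look roughly like the following. This is only a sketch written against Python 3's str methods, not the actual generator script; the helper names (plane, case_mappings_stay_in_plane, find_plane_crossings) are made up for illustration.

```python
def plane(code_point):
    """Return the Unicode plane of a code point (0 is the BMP)."""
    return code_point >> 16


def case_mappings_stay_in_plane(ch):
    """True if upper/lower/title casing of `ch` never leaves its plane."""
    source_plane = plane(ord(ch))
    for mapped in (ch.upper(), ch.lower(), ch.title()):
        # str case methods may return more than one character
        # (for example 'ß'.upper() == 'SS'), so check each of them.
        if any(plane(ord(c)) != source_plane for c in mapped):
            return False
    return True


def find_plane_crossings(start=0, stop=0x110000):
    """Collect code points whose case mappings leave their plane.

    Hypothetical helper: an empty list for the full range would
    support the claim made in point 1.
    """
    return [cp for cp in range(start, stop)
            if not case_mappings_stay_in_plane(chr(cp))]
```

Calling find_plane_crossings() with no arguments scans every code point known to the running interpreter's Unicode database; any entry in the result would be a counter-example to the claim.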
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

It's not as easy as that. The functions for case conversion are used in a way that assumes they never fail (and indeed, the existing functions cannot fail).

What we can do is change the input parameter to Py_UCS4, but not the Py_UNICODE output parameter, since that would cause lots of compiler warnings and implicit truncation on UCS2 builds, which would be a pretty disruptive change.

However, this change would not really help anyone if there are no mappings from BMP to non-BMP or vice versa, so I'm not sure whether this whole exercise is worth the effort. It appears to be better to just leave the case mapping APIs unchanged, or am I missing something?

The situation is different for the various Py_UNICODE_IS*() APIs: for these we can change the input parameter to Py_UCS4, remove the name mangling and add UCS2 helper functions to maintain API compatibility on UCS2 builds.
[issue5127] UnicodeEncodeError - I can't even see license
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> that would cause lots of compiler warnings and implicit truncation on UCS2 builds

Unfortunately, there is no such warning; otherwise the initial problem we are trying to solve would have been spotted by it (unicode_repr() calls Py_UNICODE_ISPRINTABLE() with a Py_UCS4 argument). gcc has a -Wconversion flag (which I tried today on python), but it is far too verbose before version 4.3, and this newer version still has some false positives.
http://gcc.gnu.org/wiki/NewWconversion

But the most important thing is that implicit truncation on UCS2 builds is what happens already! The patch does not solve it, but at least it yields sensible results to wary code. Or can you imagine some (somewhat working) code whose behavior will be worse after the change?
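To make the difference between naive and wary code concrete, here is a small Python 3 model of walking a buffer of 16-bit units. It is an illustration only: the list of integers stands in for a narrow-build Py_UNICODE buffer, and the function names are invented for this sketch, not taken from the patch.

```python
import unicodedata


def utf16_code_units(text):
    """Return the UTF-16 code units of `text` as a list of ints,
    mimicking the contents of a narrow-build Py_UNICODE buffer."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def naive_categories(units):
    """Classify each 16-bit unit on its own, as naive code would."""
    return [unicodedata.category(chr(u)) for u in units]


def wary_categories(units):
    """Join surrogate pairs into full code points before classifying."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by a low surrogate: one code point.
            unit = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1
        result.append(unicodedata.category(chr(unit)))
        i += 1
    return result
```

The naive walk sees two surrogates (category 'Cs') where the wary walk sees one letter, which is only useful if the classification function accepts the full 32-bit value.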
[issue5127] UnicodeEncodeError - I can't even see license
Changes by Amaury Forgeot d'Arc amaur...@gmail.com:

Added file: http://bugs.python.org/file15058/unicodectype_ucs4_3.patch
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

Adam Olsen wrote:
> Surrogates aren't optional features of UTF-16, we really need to get this fixed. That includes .isalpha().

We use UCS2 on narrow Python builds, not UTF-16.

> We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values.

That has always been the case. UCS2 doesn't support surrogates.

However, we have been slowly moving in the direction of making the UCS2 storage appear like UTF-16 to the Python programmer. This process is not yet complete and will likely never complete, since it must still be possible to create things like lone surrogates for processing purposes, so care has to be taken when using non-BMP code points on narrow builds.

> I don't see a problem with changing 2.x. The existing behaviour is broken for non-BMP scalar values, so surely nobody can claim dependence on it.

No, but changing the APIs from 16-bit integers to 32-bit integers does require a recompile of all code using it. Otherwise you end up with segfaults.

Also, the Unicode type database itself uses Py_UNICODE, so case mapping would fail for non-BMP code points.

So if we want to support accessing non-BMP type information on narrow builds, we'd need to change the complete Unicode type database API to work with UCS4 code points and then provide a backwards compatible C API using Py_UNICODE. Due to the UCS2/UCS4 API renaming done in unicodeobject.h, this would amount to exposing both the UCS2 and the UCS4 variants of the APIs on narrow builds.

With such an approach we'd not break the binary API and still get the full UCS4 range of code points in the type database. The change would be possible in Python 2.x and 3.x (which now both use the same strategy with respect to change management).

Would someone be willing to work on this?
[issue5127] UnicodeEncodeError - I can't even see license
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> No, but changing the APIs from 16-bit integers to 32-bit integers does require a recompile of all code using it.

Is it acceptable between 3.1 and 3.2 for example? ISTM that other changes already require recompilation of extension modules.

> Also, the Unicode type database itself uses Py_UNICODE, so case mapping would fail for non-BMP code points.

Where, please? In unicodedata.c, getuchar and _getrecord_ex use Py_UCS4.
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

Amaury Forgeot d'Arc wrote:
>> No, but changing the APIs from 16-bit integers to 32-bit integers does require a recompile of all code using it.
> Is it acceptable between 3.1 and 3.2 for example? ISTM that other changes already require recompilation of extension modules.

With the proposed approach, we'll keep binary compatibility, so this is not much of an issue.

Note: Changes to the binary interface can be done in minor releases, but we should make sure that it's not possible to load an extension compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

>> Also, the Unicode type database itself uses Py_UNICODE, so case mapping would fail for non-BMP code points.
> Where, please? in unicodedata.c, getuchar and _getrecord_ex use Py_UCS4.

The change affects the Unicode type database which is implemented in unicodectype.c, not the Unicode database, which already uses UCS4.
[issue5127] UnicodeEncodeError - I can't even see license
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> we should make sure that it's not possible to load an extension compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

This is the case with this patch: today all these functions (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to _PyUnicodeUCS2_* or _PyUnicodeUCS4_*. The patch removes the #defines: 3.1 modules that call _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter.

> The change affects the Unicode type database which is implemented in unicodectype.c, not the Unicode database, which already uses UCS4.

Are you referring to the _PyUnicode_TypeRecord structure? The first three fields only contain values up to 65535, so they could use unsigned short even for UCS4 builds. All the other uses are precisely changed by the patch...
[issue5127] UnicodeEncodeError - I can't even see license
Ezio Melotti ezio.melo...@gmail.com added the comment:

>> We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values.
> That has always been the case. UCS2 doesn't support surrogates. However, we have been slowly moving in the direction of making the UCS2 storage appear like UTF-16 to the Python programmer.

UCS2 died long ago; is there any reason why we keep using a UCS2 that appears like UTF-16 instead of real UTF-16?

> This process is not yet complete and will likely never complete since it must still be possible to create things like lone surrogates for processing purposes, so care has to be taken when using non-BMP code points on narrow builds.

I don't exactly know all the details of the current implementation, but -- from what I understand reading this (correct me if I'm wrong) -- it seems that the implementation is half-UCS2 to allow things like the processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to work with surrogate pairs and hence with chars outside the BMP.

What are the use cases for processing the lone surrogates? Wouldn't it be better to use UTF-16 and disallow them (since they are illegal) and possibly provide some other way to deal with them (if it's really needed)?
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

Amaury Forgeot d'Arc wrote:
>> we should make sure that it's not possible to load an extension compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.
> This is the case with this patch: today all these functions (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to _PyUnicodeUCS2_* or _PyUnicodeUCS4_*. The patch removes the #defines: 3.1 modules that call _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter.

True, but we can do better.

For narrow builds, the API currently exposes the UCS2 APIs. We'd need to expose the UCS4 APIs *in addition* to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

For wide builds, we don't need to change anything.

>> The change affects the Unicode type database which is implemented in unicodectype.c, not the Unicode database, which already uses UCS4.
> Are you referring to the _PyUnicode_TypeRecord structure? The first three fields only contains values up to 65535, so they could use unsigned short even for UCS4 builds.

I haven't checked, but it's certainly possible to have a code point use a non-BMP lower/upper/title case mapping, so this should be made possible as well, if we're going to make changes to the type database.
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

This is off-topic for the tracker item, but I'll reply anyway:

Ezio Melotti wrote:
>>> We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values.
>> That has always been the case. UCS2 doesn't support surrogates. However, we have been slowly moving in the direction of making the UCS2 storage appear like UTF-16 to the Python programmer.
> UCS2 died long ago, is there any reason why we keep using an UCS2 that appears like UTF-16 instead of real UTF-16?

UCS2 is how we store Unicode in Python for narrow builds internally. It's a storage format, not an encoding.

However, on narrow builds such as the Windows builds, you will sometimes want to create Unicode strings that use non-BMP code points. Since both UCS2 and UCS4 can represent the UTF-16 encoding, it's handy to expose a bit of automatic conversion at the Python level to make things easier for the programmer.

>> This process is not yet complete and will likely never complete since it must still be possible to create things like lone surrogates for processing purposes, so care has to be taken when using non-BMP code points on narrow builds.
> I don't exactly know all the details of the current implementation, but -- from what I understand reading this (correct me if I'm wrong) -- it seems that the implementation is half-UCS2 to allow things like the processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to work with surrogate pairs and hence with chars outside the BMP.
> What are the use cases for processing the lone surrogates? Wouldn't be better to use UTF-16 and disallow them (since they are illegal) and possibly provide some other way to deal with them (if it's really needed)?

No, because Python is meant to be used for working on all Unicode code points. Lone surrogates are not allowed in transfer encodings such as UTF-16 or UTF-8, but they are valid Unicode code points and you need to be able to work with them, since you may want to construct surrogate pairs by hand or get lone surrogates as a result of slicing a Unicode string.
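The distinction between "valid code point" and "allowed in a transfer encoding" can be seen from Python 3's codecs. The snippet below is a sketch of current Python 3 behaviour, not of the 2009 narrow builds under discussion; can_encode is a helper invented for the example.

```python
def can_encode(text, encoding):
    """Return True if `text` encodes under `encoding` with strict errors."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


pair = "\U00010000"   # one non-BMP code point
lone = "\ud800"       # a lone high surrogate, still a valid str

# The string object holds the lone surrogate without complaint ...
lone_length = len(lone)

# ... but the strict UTF-8 and UTF-16 codecs refuse to emit it,
# unless the caller opts in with the 'surrogatepass' error handler.
lone_as_bytes = lone.encode("utf-8", "surrogatepass")
```

On interpreters that index strings by code point, a slice can no longer split a pair, so a lone surrogate has to be written explicitly as above.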
[issue5127] UnicodeEncodeError - I can't even see license
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> We'd need to expose the UCS4 APIs *in addition* to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

Why have two names for the same function? It's Python 3, after all. Or is this "no recompile" feature so important (as long as changes are clearly shown to the user)? It does not work on Windows, FWIW.

> I haven't checked, but it's certainly possible to have a code point use a non-BMP lower/upper/title case mapping, so this should be made possible as well, if we're going to make changes to the type database.

OK, here is a new patch. Even if this does not happen with unicodedata up to 5.1, the table has only 175 entries so memory usage is not dramatically increased. Py_UNICODE is no longer used at all in unicodectype.c.

Added file: http://bugs.python.org/file15047/unicodectype_ucs4-2.patch
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

Amaury Forgeot d'Arc wrote:
>> We'd need to expose the UCS4 APIs *in addition* to those APIs and have the UCS2 APIs redirect to the UCS4 ones.
> Why have two names for the same function? it's Python 3, after all.

It's not the same function... the UCS2 version would take a Py_UNICODE parameter, the UCS4 version a Py_UCS4 parameter.

I don't understand the comment about Python 3.x. FWIW, we're no longer in the "backwards incompatible changes are allowed" mode for 3.x.

> Or is this no recompile feature so important (as long as changes are clearly shown to the user)? It does not work on Windows, FWIW.

There are generally two options for API changes within a major release branch:

1. the changes are API backwards compatible and only the Python API version is changed

2. the changes are not API backwards compatible; in such a case, Python has to reject imports of old modules (as it always does on Windows), so the Python API version has to be changed *and* the import mechanism must reject the import

The second option was used when transitioning from 2.4 to 2.5 due to the Py_ssize_t changes. We could do the same for 2.7/3.2, but if it's just needed for this one change, then I'd rather stick to implementing the first option.

>> I haven't checked, but it's certainly possible to have a code point use a non-BMP lower/upper/title case mapping, so this should be made possible as well, if we're going to make changes to the type database.
> OK, here is a new patch. Even if this does not happen with unicodedata up to 5.1, the table has only 175 entries so memory usage is not dramatically increased. Py_UNICODE is no more used at all in unicodectype.c.

Sorry, but this doesn't work: the functions have to return Py_UNICODE and raise an exception if the return value doesn't fit. Otherwise, you'd get completely wrong values in code downcasting the return value to Py_UNICODE on narrow builds.

Another good reason to use two sets of APIs. The new set could indeed return Py_UCS4 values.
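The "two sets of APIs" idea can be modelled in a few lines of Python. This is a loose sketch, not the C API: to_upper_ucs4 and to_upper_ucs2 are hypothetical names, and Python integers stand in for Py_UCS4 and Py_UNICODE values. The 16-bit variant delegates to the 32-bit one and raises instead of silently truncating.

```python
def to_upper_ucs4(code_point):
    """Stand-in for a UCS4 case-mapping function: takes and returns
    a full code point."""
    mapped = chr(code_point).upper()
    # Keep the one-to-one model of the C API: fall back to the input
    # when the mapping expands to several characters.
    return ord(mapped) if len(mapped) == 1 else code_point


def to_upper_ucs2(code_unit):
    """Stand-in for the compatibility wrapper on a narrow build.

    It redirects to the UCS4 function and refuses any value that
    would not fit into 16 bits.
    """
    if not 0 <= code_unit <= 0xFFFF:
        raise ValueError("argument does not fit in 16 bits")
    result = to_upper_ucs4(code_unit)
    if result > 0xFFFF:
        raise ValueError("result does not fit in 16 bits")
    return result
```

Whether the wrapper should raise at all is exactly the open question in this exchange, since the existing C functions are used as if they could never fail.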
[issue5127] UnicodeEncodeError - I can't even see license
Adam Olsen rha...@gmail.com added the comment:

On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg rep...@bugs.python.org wrote:
> We use UCS2 on narrow Python builds, not UTF-16.
>
>> We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values.
>
> That has always been the case. UCS2 doesn't support surrogates.
>
> However, we have been slowly moving in the direction of making the UCS2 storage appear like UTF-16 to the Python programmer. This process is not yet complete and will likely never complete since it must still be possible to create things like lone surrogates for processing purposes, so care has to be taken when using non-BMP code points on narrow builds.

Balderdash. We expose UTF-16 code units, not UCS-2. Guido has made this quite clear.

UTF-16 was designed as an easy transition from UCS-2. Indeed, if your code only does searches or joins existing strings then it will Just Work; declare it UTF-16 and you are done. We have a lot more work to do than that (as in this bug report), and we can't reasonably prevent the user from splitting surrogate pairs via poor code, but a 95% solution doesn't mean we suddenly revert all the way back to UCS-2.

If the intent really was to use UCS-2 then a correctly functioning UTF-16 codec would join a surrogate pair into a single scalar value, then raise an error because it's outside the range representable in UCS-2. That's not very helpful though; obviously, it's much better to use UTF-16 internally.

"The alternative (no matter what the configure flag is called) is UTF-16, not UCS-2 though: there is support for surrogate pairs in various places, including the \U escape and the UTF-8 codec."
http://mail.python.org/pipermail/python-dev/2008-July/080892.html

"If you find places where the Python core or standard library is doing Unicode processing that would break when surrogates are present you should file a bug. However this does not mean that every bit of code that slices a string at an arbitrary point (and hence risks slicing in the middle of a surrogate) is incorrect -- it all depends on what is done next with the slice."
http://mail.python.org/pipermail/python-dev/2008-July/080900.html
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

Adam Olsen wrote:
> On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg rep...@bugs.python.org wrote:
>> We use UCS2 on narrow Python builds, not UTF-16.
>>
>>> We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values.
>>
>> That has always been the case. UCS2 doesn't support surrogates.
>>
>> However, we have been slowly moving in the direction of making the UCS2 storage appear like UTF-16 to the Python programmer. This process is not yet complete and will likely never complete since it must still be possible to create things like lone surrogates for processing purposes, so care has to be taken when using non-BMP code points on narrow builds.
>
> Balderdash. We expose UTF-16 code units, not UCS-2. Guido has made this quite clear.
>
> UTF-16 was designed as an easy transition from UCS-2. Indeed, if your code only does searches or joins existing strings then it will Just Work; declare it UTF-16 and you are done. We have a lot more work to do than that (as in this bug report), and we can't reasonably prevent the user from splitting surrogate pairs via poor code, but a 95% solution doesn't mean we suddenly revert all the way back to UCS-2.
>
> If the intent really was to use UCS-2 then a correctly functioning UTF-16 codec would join a surrogate pair into a single scalar value, then raise an error because it's outside the range representable in UCS-2. That's not very helpful though; obviously, it's much better to use UTF-16 internally.
>
> "The alternative (no matter what the configure flag is called) is UTF-16, not UCS-2 though: there is support for surrogate pairs in various places, including the \U escape and the UTF-8 codec."
> http://mail.python.org/pipermail/python-dev/2008-July/080892.html
>
> "If you find places where the Python core or standard library is doing Unicode processing that would break when surrogates are present you should file a bug. However this does not mean that every bit of code that slices a string at an arbitrary point (and hence risks slicing in the middle of a surrogate) is incorrect -- it all depends on what is done next with the slice."
> http://mail.python.org/pipermail/python-dev/2008-July/080900.html

All this is just nitpicking, really. UCS2 is a character set, UTF-16 an encoding.

It so happens that when the Unicode consortium realized that 16 bits would not be enough to represent all scripts of the world, they added the concept of surrogates and reserved a few ranges of code points in UCS2 to represent these extra code points, which are not part of UCS2 but of the extension UCS4. The conversion of these surrogate pairs to UCS4 code point values is what you find defined in UTF-16.

If we were to implement Unicode using UTF-16 as storage format, we would not be able to store single lone surrogates, since these are not allowed in UTF-16. Ditto for unassigned ordinals, invalid code points, etc.

PEP 100 really says it all:
http://www.python.org/dev/peps/pep-0100/

"This [internal] format will hold UTF-16 encodings of the corresponding Unicode ordinals. The Python Unicode implementation will address these values as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all currently defined Unicode character points. ... Future implementations can extend the 16 bit restriction to the full set of all UTF-16 addressable characters (around 1M characters)."

Note that I wrote the PEP and worked on the implementation at a time when Unicode 2.x was still in widespread use (mostly on Windows) and 3.0 was just being released:
http://www.unicode.org/history/publicationdates.html

But all that is off-topic for this ticket, so please let's just stop such discussions.
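For reference, the surrogate-pair conversion that UTF-16 defines is plain arithmetic. The sketch below spells it out in Python; the function names are illustrative and do not correspond to anything in CPython.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only non-BMP code points have surrogate pairs")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

The 20 bits carried by a pair cover exactly the range 0x10000 to 0x10FFFF, which is where the figure of roughly one million additional characters comes from.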
[issue5127] UnicodeEncodeError - I can't even see license
Adam Olsen rha...@gmail.com added the comment:

On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg rep...@bugs.python.org wrote:
> All this is just nitpicking, really. UCS2 is a character set, UTF-16 an encoding.

UCS is a character set, for most purposes synonymous with the Unicode character set. UCS-2 and UTF-16 are both encodings of that character set. However, UCS-2 can only represent the BMP, while UTF-16 can represent the full range.

> If we were to implement Unicode using UTF-16 as storage format, we would not be able to store single lone surrogates, since these are not allowed in UTF-16. Ditto for unassigned ordinals, invalid code points, etc.

No. Internal usage may become temporarily ill-formed, but this is a compromise, and acceptable so long as we never export them to other systems.

Not that I wouldn't *prefer* a system that wouldn't store lone surrogates, but.. pragmatics prevail.

> Note that I wrote the PEP and worked on the implementation at a time when Unicode 2.x was still in widespread use (mostly on Windows) and 3.0 was just being released:
> http://www.unicode.org/history/publicationdates.html

I think you hit the nail on the head there. 10 years ago, unicode meant something different than it does today. That's reflected in PEP 100 and in the code. Now it's time to move on, switch to the modern terminology, modern usage, and modern specs.

> But all that is off-topic for this ticket, so please let's just stop such discussions.

It needs to be discussed somewhere. It's a distraction from fixing the bug, but at least it's more private here. Would you prefer email?
[issue5127] UnicodeEncodeError - I can't even see license
Adam Olsen rha...@gmail.com added the comment:

Surrogates aren't optional features of UTF-16, we really need to get this fixed. That includes .isalpha().

We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values.

I don't see a problem with changing 2.x. The existing behaviour is broken for non-BMP scalar values, so surely nobody can claim dependence on it.

--
nosy: +Rhamphoryncus
type: - behavior
[issue5127] UnicodeEncodeError - I can't even see license
Changes by Ezio Melotti ezio.melo...@gmail.com:

--
priority: - normal
stage: - patch review
[issue5127] UnicodeEncodeError - I can't even see license
Ezio Melotti ezio.melo...@gmail.com added the comment:

FWIW, on Python3 it seems to work:

    >>> import unicodedata
    >>> unicodedata.category(\U0001)
    'Lo'
    >>> unicodedata.category(\U00011000)
    'Cn'
    >>> unicodedata.category(chr(0x1))
    'Lo'
    >>> unicodedata.category(chr(0x11000))
    'Cn'
    >>> ord(chr(0x1)), 0x1
    (65536, 65536)
    >>> ord(chr(0x11000)), 0x11000
    (69632, 69632)

I'm using a narrow build too:

    >>> import sys
    >>> sys.maxunicode
    65535
    >>> len('\U0001')
    2
    >>> ord('\U0001')
    65536

On Python2 unichr() is supposed to raise a ValueError on a narrow build if the value is greater than 0x [1], but if the characters above 0x can be represented with u\U there should be a way to fix unichr so it can return them. Python3 already does it with chr().

Maybe we should open a new issue for this if it's not present already.

[1]: http://docs.python.org/library/functions.html#unichr
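The length of 2 reported on the narrow build is the number of UTF-16 code units. On an interpreter where len() counts code points, the same figure can be recovered as sketched below; utf16_length and is_narrow_build are helper names invented here, and U+10000 is used as the test character on the assumption that it is the one whose ord() is reported as 65536 above.

```python
import sys


def utf16_length(text):
    """Number of 16-bit code units `text` would occupy as UTF-16,
    i.e. what len() reports on a narrow build."""
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


def is_narrow_build():
    """True if this interpreter stores strings as 16-bit units."""
    return sys.maxunicode == 0xFFFF


sample = "\U00010000"
code_points = len(sample)            # depends on the build
code_units = utf16_length(sample)    # always 2 for one non-BMP character
```

On a wide build (and on any interpreter with sys.maxunicode == 0x10FFFF) len(sample) is 1 while utf16_length(sample) is still 2.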
[issue5127] UnicodeEncodeError - I can't even see license
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

Since r56395, ord() and chr() accept and return surrogate pairs even in narrow builds.

The goal is to remove most differences between narrow and wide unicode builds (except for string lengths, indices or slices).

To address this problem, I suggest changing all functions in unicodectype.c so that they accept Py_UCS4 characters (instead of Py_UNICODE). This would be a binary-incompatible change; and --with-wctype-functions would have an effect only if sizeof(wchar_t)==4 (instead of the current condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE)).
[issue5127] UnicodeEncodeError - I can't even see license
STINNER Victor victor.stin...@haypocalc.com added the comment:

amaury> Since r56395, ord() and chr() accept and return surrogate pairs
amaury> even in narrow builds.

Note: My examples are made with Python 2.x.

> The goal is to remove most differences between narrow and wide unicode builds (except for string lengths, indices or slices)

It would be nice to get the same behaviour in Python 2.x and 3.x to help migration from Python2 to Python3 ;-)

The unichr() documentation (in Python 2.x) is correct. But I would appreciate support for surrogates in unichr(), which also means changing the behaviour of ord().

> To address this problem, I suggest to change all functions in unicodectype.c so that they accept Py_UCS4 characters (instead of Py_UNICODE).

Why? Using surrogates, you can use 16-bit Py_UNICODE to store non-BMP characters (code 0x).

--

I can open a new issue if you agree that we can change unichr() / ord() behaviour on narrow builds. Should we ask on the mailing list?
[issue5127] UnicodeEncodeError - I can't even see license
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> That would cause major breakage in the C API

Not if you recompile. I don't see how this breaks the API at the C level.

> and is not in line with the intention of having a Py_UNICODE type in the first place.

Py_UNICODE is still used as the allocation unit for unicode strings.

To get correct results, we need a way to access the whole unicode database even on ucs2 builds; it's possible with the unicodedata module, why not from C?

My motivation for the change is this post:
http://mail.python.org/pipermail/python-dev/2008-July/080900.html
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-02-03 13:39, Amaury Forgeot d'Arc wrote:
> Since r56395, ord() and chr() accept and return surrogate pairs even in narrow builds.
>
> The goal is to remove most differences between narrow and wide unicode builds (except for string lengths, indices or slices)
>
> To address this problem, I suggest to change all functions in unicodectype.c so that they accept Py_UCS4 characters (instead of Py_UNICODE).

-1.

That would cause major breakage in the C API and is not in line with the intention of having a Py_UNICODE type in the first place.

Users who are interested in UCS4 builds should simply use UCS4 builds.

> This would be a binary-incompatible change; and --with-wctype-functions would have an effect only if sizeof(wchar_t)==4 (instead of the current condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE))

--with-wctype-functions was scheduled for removal many releases ago, but I never got around to it. The only reason it's still there is that some Linux distributions use this config option (AFAIR, RedHat).

I'd be +1 on removing the option in 3.0.1 or deprecating it in 3.0.1 and removing it in 3.1.

It's not useful in any way, and causes compatibility problems with regular builds.

--
nosy: +lemburg
[issue5127] UnicodeEncodeError - I can't even see license
STINNER Victor victor.stin...@haypocalc.com added the comment:

I don't understand the behaviour of unichr():

Python 2.7a0 (trunk:68963M, Jan 30 2009, 00:49:28)
>>> import unicodedata
>>> unicodedata.category(u'\U00010000')
'Lo'
>>> unicodedata.category(u'\U00011000')
'Cn'
>>> unicodedata.category(unichr(0x10000))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Why does unichr() fail whereas \U works?

>>> len(u'\U00010000')
2
>>> ord(u'\U00010000')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

--
nosy: +haypo
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-02-03 14:14, STINNER Victor wrote:
> STINNER Victor victor.stin...@haypocalc.com added the comment:
>
> amaury> Since r56395, ord() and chr() accept and return surrogate pairs
> amaury> even in narrow builds.
>
> Note: My examples are made with Python 2.x.
>
> amaury> The goal is to remove most differences between narrow and wide
> amaury> unicode builds (except for string lengths, indices or slices)
>
> It would be nice to get the same behaviour in Python 2.x and 3.x to help migration from Python2 to Python3 ;-)
>
> unichr() (in Python 2.x) documentation is correct. But I would appreciate support for surrogates in unichr(), which also means changing ord() behaviour.

This is not possible for unichr() in Python 2.x, since applications always expect len(unichr(x)) == 1.

Changing ord() in Python 2.x would be easier, since this would only extend the range of returned values for UCS2 builds.

> amaury> To address this problem, I suggest to change all functions in
> amaury> unicodectype.c so that they accept Py_UCS4 characters (instead of
> amaury> Py_UNICODE).
>
> Why? Using surrogates, you can use 16-bits Py_UNICODE to store non-BMP characters (code 0x).
>
> --
> I can open a new issue if you agree that we can change unichr() / ord() behaviour on narrow build. We may ask on the mailing list?
[issue5127] UnicodeEncodeError - I can't even see license
STINNER Victor victor.stin...@haypocalc.com added the comment:

lemburg> This is not possible for unichr() in Python 2.x, since applications
lemburg> always expect len(unichr(x)) == 1

Oh, ok.

lemburg> Changing ord() in Python 2.x would be easier, since
lemburg> this would only extend the range of returned values for UCS2
lemburg> builds.

ord() of Python3 (narrow build) rejects surrogate characters:

'\U00010000'
>>> len(chr(0x10000))
2
>>> ord(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected string of length 1, but int found

---

It looks like narrow builds with surrogates have some more problems...

Test with U+10000: LINEAR B SYLLABLE B008 A, category: Letter, Other.

Correct result (Python 2.5, wide build):

$ python
Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
>>> unichr(0x10000)
u'\U00010000'
>>> unichr(0x10000).isalpha()
True

Error in Python3 (narrow build):

marge$ ./python
Python 3.1a0 (py3k:69105M, Feb 3 2009, 15:04:35)
>>> chr(0x10000).isalpha()
False
>>> list(chr(0x10000))
['\ud800', '\udc00']
>>> chr(0xd800).isalpha()
False
>>> chr(0xdc00).isalpha()
False

Unicode ranges, all in the category Other, Surrogate:
 - U+D800..U+DB7F: Non Private Use High Surrogate
 - U+DB80..U+DBFF: Private Use High Surrogate
 - U+DC00..U+DFFF: Low Surrogate range
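The ['\ud800', '\udc00'] result above follows from the standard UTF-16 arithmetic. Here is a minimal sketch of that split, not code from the tracker or from CPython; the function name to_surrogate_pair is hypothetical and only used for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+10000 (LINEAR B SYLLABLE B008 A), the character tested in this thread
print([hex(unit) for unit in to_surrogate_pair(0x10000)])
```

On a narrow build each of the two units is what str methods such as isalpha() see, which is why both halves report False.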
[issue5127] UnicodeEncodeError - I can't even see license
Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-02-03 14:50, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>
>> That would cause major breakage in the C API
>
> Not if you recompile. I don't see how this breaks the API at the C level.

Well, then try to look at such a change from a C extension writer's perspective. They'd have to change all their function calls and routines to work with Py_UCS4. Supporting both the old API and the new one would be nearly impossible and require either an adapter API or a lot of #ifdef'ery.

Please remember that the public Python C API is not only meant for Python developers. Its main purpose is to be used by other developers extending or embedding Python, and those developers use different release cycles and want to support more than just the bleeding edge Python version.

Python has a long history of providing very stable APIs, both in C and in Python.

FWIW: The last major change in the C API (the change to Py_ssize_t from Python 2.4 to 2.5) has not even propagated to all major C extensions yet. It's only now that people start to realize problems with this, since their extensions start failing with segfaults on 64-bit machines.

That said, we can of course provide additional UCS4 APIs for certain things and also provide conversion helpers between Py_UNICODE and Py_UCS4 where needed.

>> and is not in line with the intention of having a Py_UNICODE type in the first place.
>
> Py_UNICODE is still used as the allocation unit for unicode strings. To get correct results, we need a way to access the whole unicode database even on ucs2 builds; it's possible with the unicodedata module, why not from C?

I must be missing some detail, but what does the Unicode database have to do with the unicodeobject.c C API?

> My motivation for the change is this post: http://mail.python.org/pipermail/python-dev/2008-July/080900.html

There are certainly other ways to make Python deal with surrogates in more cases than the ones we already support.
[issue5127] UnicodeEncodeError - I can't even see license
Ezio Melotti ezio.melo...@gmail.com added the comment:

haypo> ord() of Python3 (narrow build) rejects surrogate characters:
haypo> '\U00010000'
haypo> >>> len(chr(0x10000))
haypo> 2
haypo> >>> ord(0x10000)
haypo> TypeError: ord() expected string of length 1, but int found

ord() works fine on Py3; you probably meant to do

>>> ord('\U00010000')
65536

or

>>> ord(chr(0x10000))
65536

In Py3 it is also stated that it accepts surrogate pairs (help(ord)). Py2 instead doesn't support them:

>>> ord(u'\U00010000')
TypeError: ord() expected a character, but string of length 2 found
[issue5127] UnicodeEncodeError - I can't even see license
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

> I must be missing some detail, but what does the Unicode database have to do with the unicodeobject.c C API?

Ah, now I understand your concerns. My suggestion is to change only the 20 functions in unicodectype.c: _PyUnicode_IsAlpha, _PyUnicode_ToLowercase... and no change in unicodeobject.c at all. They all take a single code point as argument; some also return a single code point. Changing these functions is backwards compatible.

I attach a patch so we can argue on concrete code (tests are missing).

Another effect of the patch: unicodedata.numeric('\N{AEGEAN NUMBER TWO}') can return 2.0.

The str.isalpha() (and others) methods did not change: they still split the surrogate pairs.

--
keywords: +patch
Added file: http://bugs.python.org/file12934/unicodectype_ucs4.patch
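To make the difference between per-unit and per-code-point lookups concrete, here is a rough Python sketch. It is not the attached patch (which is C code in unicodectype.c); the helpers utf16_units and iter_code_points are hypothetical names, and the 16-bit buffer is simulated by encoding to UTF-16 because current Python builds no longer store strings that way.

```python
import unicodedata


def utf16_units(text):
    """Return the 16-bit code units a narrow build would store for text."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield full code points, joining high/low surrogate pairs."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


units = utf16_units("\N{AEGEAN NUMBER TWO}")

# Naive walk over the buffer: every unit is looked up on its own.
per_unit = [unicodedata.category(chr(u)) for u in units]

# Lookup with a full (UCS4-sized) code point, as the proposal would allow.
per_code_point = [unicodedata.category(chr(cp)) for cp in iter_code_points(units)]

print(per_unit, per_code_point)
```

The naive walk only ever sees surrogates, while the joined lookup reaches the real database entry, which is the behaviour the str methods would need in a later fix.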
[issue5127] UnicodeEncodeError - I can't even see license
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

There were non-ascii characters in the Windows license file. This was corrected with r67860.

> I believe that chr(0x10000) and chr(0x11000) should have the opposite behavior.

This other problem is because on a narrow unicode build, Py_UNICODE_ISPRINTABLE takes a 16bit integer. And indeed,

>>> unicodedata.category(chr(0x10000 % 65536))
'Cc'
>>> unicodedata.category(chr(0x11000 % 65536))
'Lo'

--
nosy: +amaury.forgeotdarc
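The truncation described above can be reproduced on any current Python with a short sketch. The helper name truncate_to_16_bits is hypothetical; it only mimics what passing a code point through a 16-bit Py_UNICODE parameter would do. The category reported for the full code point U+11000 depends on the Unicode version bundled with the interpreter, so no expected value is given for it.

```python
import unicodedata


def truncate_to_16_bits(code_point):
    """Mimic passing a code point through a 16-bit parameter."""
    return code_point % 65536  # equivalent to code_point & 0xFFFF


for cp in (0x10000, 0x11000):
    truncated = truncate_to_16_bits(cp)
    print(
        "U+%04X" % cp, unicodedata.category(chr(cp)),
        "-> U+%04X" % truncated, unicodedata.category(chr(truncated)),
    )
```

U+10000 collapses to U+0000 (a control character, hence not printable) and U+11000 collapses to U+1000 (a letter, hence printable), which explains why the two repr() results look swapped.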
[issue5127] UnicodeEncodeError - I can't even see license
New submission from Venusaur bup...@hotmail.com:

>>> license
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\site.py", line 372, in __repr__
    self.__setup()
  File "C:\Python30\lib\site.py", line 359, in __setup
    data = fp.read()
  File "C:\Python30\lib\io.py", line 1724, in read
    decoder.decode(self.buffer.read(), final=True))
  File "C:\Python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
UnicodeDecodeError: 'cp949' codec can't decode bytes in position 15164-15165: illegal multibyte sequence
>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
UnicodeEncodeError: 'cp949' codec can't encode character '\ud804' in position 1: illegal multibyte sequence

I also can't understand why chr(0x10000) and chr(0x11000) have different behavior.

--
components: Unicode
messages: 80924
nosy: bupjae
severity: normal
status: open
title: UnicodeEncodeError - I can't even see license
versions: Python 3.0
[issue5127] UnicodeEncodeError - I can't even see license
Ezio Melotti ezio.melo...@gmail.com added the comment:

Here (winxpsp2, Py3, cp850-terminal) the license works fine:

>>> license
Type license() to see the full license text

and license() works as well.

I get this output for the chr()s:

>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Programs\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "C:\Programs\Python30\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: character maps to <undefined>

I believe that chr(0x10000) and chr(0x11000) should have the opposite behavior. U+10000 (LINEAR B SYLLABLE B008 A) belongs to the 'Lo' category and should be printed (and possibly raise a UnicodeError, see issue5110 [1]); U+11000 belongs to the 'Cn' category and should be escaped [2].

On Linux with Py3 and a UTF-8 terminal, chr(0x10000) prints '\U00010000' and chr(0x11000) prints the char (actually I see two boxes, but that shouldn't be a problem with Python). The license() works fine too.

Also note that with cp850 the error message is 'character maps to <undefined>' and with cp949 it is 'illegal multibyte sequence'.

[1]: http://bugs.python.org/issue5110
[2]: http://www.python.org/dev/peps/pep-3138/#specification

--
nosy: +ezio.melotti
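The encode errors above come from writing a character the terminal codec cannot represent. One possible way to look at such strings without a traceback is sketched below; it is a workaround for inspection only, not the fix discussed in this issue, and the helper name safe_for_terminal is hypothetical. It relies on the standard "backslashreplace" error handler of str.encode().

```python
def safe_for_terminal(text, encoding):
    """Replace characters the encoding cannot represent with backslash escapes."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Strict encoding fails, as in the tracebacks above.
try:
    chr(0x10000).encode("cp850")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

# With backslashreplace the character is escaped instead.
print("escaped:", safe_for_terminal(chr(0x10000), "cp850"))
print("escaped:", safe_for_terminal(chr(0x10000), "cp949"))
```

The same idea can be applied process-wide with the PYTHONIOENCODING environment variable (for example a value ending in ":backslashreplace"), assuming the interpreter version in use supports it.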