Re: [Python-Dev] New Py_UNICODE doc
On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote: Wow, what an inane way of looking at it. I don't know what world you live in, but in my world, users read the configure options and suppose that they mean something. In fact, they *have* to go off on their own to assume something, because even the documentation you refer to above doesn't say what happens if they choose UCS-2 or UCS-4. A logical assumption would be that python would use those CEFs internally, and that would be incorrect. Certainly. That's why the documentation should be improved. Changing the option breaks existing packaging systems, and should not be done lightly. I'm perfectly happy to continue supporting --enable-unicode=ucs2, but not displaying it as an option. Is that acceptable to you? -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
M.-A. Lemburg wrote: If all you're interested in is the lexical class of the code points in a string, you could use such a codec to map each code point to a code point representing the lexical class. How can I efficiently implement such a codec? The whole point is doing that in pure Python (because if I had to write an extension module, I could just as well do the entire lexical analysis in C, without any regular expressions). Any kind of associative/indexed table for this task consumes a lot of memory, and takes quite some time to initialize. Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
On May 10, 2005, at 2:48 PM, Nicholas Bastin wrote: On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote: Wow, what an inane way of looking at it. I don't know what world you live in, but in my world, users read the configure options and suppose that they mean something. In fact, they *have* to go off on their own to assume something, because even the documentation you refer to above doesn't say what happens if they choose UCS-2 or UCS-4. A logical assumption would be that python would use those CEFs internally, and that would be incorrect. Certainly. That's why the documentation should be improved. Changing the option breaks existing packaging systems, and should not be done lightly. I'm perfectly happy to continue supporting --enable-unicode=ucs2, but not displaying it as an option. Is that acceptable to you?
If you're going to call python's implementation UTF-16, I'd consider all these very serious deficiencies:
- unicodedata doesn't work for 2-char strings containing a surrogate pair, nor for integers. Therefore it is impossible to get any data on chars 0x.
- there are no methods for determining if something is a surrogate pair and turning it into an integer codepoint.
- Given that unicodedata doesn't work, I also doubt that .toupper/etc work right on surrogate pairs, although I haven't tested.
- As has been noted before, the regexp engine doesn't properly treat surrogate pairs as a single unit.
- Is there a method that is like unichr but that will work for codepoints 0x?
I'm sure there's more as well. I think it's a mistake to consider python to be implementing UTF-16 just because it properly encodes/decodes surrogate pairs in the UTF-8 codec. James
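For reference, the surrogate-pair arithmetic that the list above says is missing can be sketched in a few lines. This is only an illustration of the UTF-16 encoding form, not an API that Python provided at the time; the function names `is_surrogate_pair`, `combine_surrogates` and `split_codepoint` are made up for this note.

```python
def is_surrogate_pair(high, low):
    """True if (high, low) are a lead surrogate followed by a trail surrogate."""
    return 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF


def combine_surrogates(high, low):
    """Turn a UTF-16 surrogate pair into the integer code point it encodes."""
    if not is_surrogate_pair(high, low):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_codepoint(codepoint):
    """Inverse operation: split a non-BMP code point into (high, low)."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

For example, `split_codepoint(0x12345)` gives the pair `(0xD808, 0xDF45)`, and `combine_surrogates` maps that pair back to `0x12345`.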
Re: [Python-Dev] New Py_UNICODE doc
M.-A. Lemburg wrote: On sre character classes: I don't think that these provide a good approach to XML lexical classes - custom functions or methods or maybe even a codec mapping the characters to their XML lexical class are much more efficient in practice. That isn't my experience: functions that scan XML strings are much slower than regular expressions. I can't imagine how a custom codec could work, so I cannot comment on that. Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
M.-A. Lemburg wrote: I believe that it would be more appropriate to adjust the _tkinter module to adapt to the TCL Unicode size rather than forcing the complete Python system to adapt to TCL - I don't really see the point in an optional extension module defining the default for the interpreter core. _tkinter currently supports, for a UCS-2 Tcl, both UCS-2 and UCS-4 Python. For a UCS-4 Tcl, it requires Python also to be UCS-4. Contributions to support the missing case are welcome. At the very least, this should be a user controlled option. It is: by passing --enable-unicode=ucs2, you can force Python to use UCS-2 even if Tcl is UCS-4, with the result that _tkinter cannot be built anymore (and compilation fails with an #error). Otherwise, we might as well use sizeof(wchar_t) as basis for the default Unicode size. This at least, would be a much more reasonable choice than whatever TCL uses. The goal of the build process is to provide as many extension modules as possible (given the set of headers and libraries installed), and _tkinter is an important extension module because IDLE depends on it. Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote: -1. This breaks existing documentation and usage, and provides only minimum value. Have you been missing this conversation? UTF-16 is *WHAT PYTHON CURRENTLY IMPLEMENTS*. The current documentation is flat out wrong. Breaking that isn't a big problem in my book. The documentation I refer to is the one that says the equivalent of 'configure takes an option --enable-unicode, with the possible values ucs2, ucs4, yes (equivalent to no argument), and no (equivalent to --disable-unicode)' *THIS* documentation would break. This documentation is factually correct at the moment (configure does indeed take these options), and people rely on them in automatic build processes. Changing configure options should not be taken lightly, even if they may result from a wrong mental model. By that rule, --with-suffix should be renamed to --enable-suffix, --with-doc-strings to --enable-doc-strings, and so on. However, the nitpicking that underlies the desire to rename the option should be ignored in favour of backwards compatibility. Changing the documentation that goes along with the option would be fine. It provides more than minimum value - it provides the truth. No. It is just a command line option. It could be named --enable-quirk=(quork|quark), and would still select UTF-16. Command line options provide no truth - they don't even provide statements. With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start supporting the full Unicode ccs the same way it supports UCS-2. I can't understand what you mean by this. My point is that if you configure python to support UCS-2, then it SHOULD NOT support surrogate pairs. Supporting surrogate pairs is the purview of variable width encodings, and UCS-2 is not among them. So you suggest renaming it to --enable-unicode=utf16, right? My point is that a Unicode type with UTF-16 would correctly support all assigned Unicode code points, which the current 2-byte implementation doesn't. 
So --enable-unicode=utf16 would *not* be the truth. Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote: All of my proposals for what to change the documentation to have been shot down by Martin. If someone has better verbiage that they'd like to see, I'd be perfectly happy to patch the doc. I don't look into the specific wording - you speak English much better than I do. What I care about is that this part of the documentation should be complete and precise. I.e. statements like "should not make assumptions" might be fine, as long as they are still followed by a precise description of what the code currently does. So it should mention that the representation can be either 2 or 4 bytes, that the strings "ucs2" and "ucs4" can be used to select one of them, that it is always 2 bytes on Windows, that 2 bytes means that non-BMP characters can be represented as surrogate pairs, and so on. Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
M.-A. Lemburg wrote: All this talk about UTF-16 vs. UCS-2 is not very useful and strikes me as purely academic. The reference to possible breakage by slicing a Unicode and breaking a surrogate pair is valid, the idea of UCS-4 being less prone to breakage is a myth: Fair enough. The original point is that the documentation is unclear about what a Py_UNICODE[] contains. I deduced that it contains either UCS2 or UCS4 and implemented accordingly. Not only did I guess wrong, but others will probably guess wrong too. Something in the docs needs to spell this out. Shane
Re: [Python-Dev] New Py_UNICODE doc
Shane Hathaway wrote: Fair enough. The original point is that the documentation is unclear about what a Py_UNICODE[] contains. I deduced that it contains either UCS2 or UCS4 and implemented accordingly. Not only did I guess wrong, but others will probably guess wrong too. Something in the docs needs to spell this out. Again, patches are welcome. I was opposed to Nick's proposed changes, since they explicitly said that you are not supposed to know what is in a Py_UNICODE. Integrating the essence of PEP 261 into the main documentation would be a worthwhile task. Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
On May 8, 2005, at 5:15 AM, Martin v. Löwis wrote: 'configure takes an option --enable-unicode, with the possible values ucs2, ucs4, yes (equivalent to no argument), and no (equivalent to --disable-unicode)' *THIS* documentation would break. This documentation is factually correct at the moment (configure does indeed take these options), and people rely on them in automatic build processes. Changing configure options should not be taken lightly, even if they may result from a wrong mental model. By that rule, --with-suffix should be renamed to --enable-suffix, --with-doc-strings to --enable-doc-strings, and so on. However, the nitpicking that underlies the desire to rename the option should be ignored in favour of backwards compatibility. Changing the documentation that goes along with the option would be fine. That is exactly what I proposed originally, which you shot down. Please actually read the contents of my messages. What I said was change the configure option and related documentation. It provides more than minimum value - it provides the truth. No. It is just a command line option. It could be named --enable-quirk=(quork|quark), and would still select UTF-16. Command line options provide no truth - they don't even provide statements. Wow, what an inane way of looking at it. I don't know what world you live in, but in my world, users read the configure options and suppose that they mean something. In fact, they *have* to go off on their own to assume something, because even the documentation you refer to above doesn't say what happens if they choose UCS-2 or UCS-4. A logical assumption would be that python would use those CEFs internally, and that would be incorrect. With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start supporting the full Unicode ccs the same way it supports UCS-2. I can't understand what you mean by this. My point is that if you configure python to support UCS-2, then it SHOULD NOT support surrogate pairs. 
Supporting surrogate pairs is the purview of variable width encodings, and UCS-2 is not among them. So you suggest renaming it to --enable-unicode=utf16, right? My point is that a Unicode type with UTF-16 would correctly support all assigned Unicode code points, which the current 2-byte implementation doesn't. So --enable-unicode=utf16 would *not* be the truth. The current implementation supports the UTF-16 CEF, i.e., it supports a variable width encoding form capable of representing all of the unicode space using surrogate pairs. Please point out a code point that the current 2 byte implementation does not support, either directly, or through the use of surrogate pairs. -- Nick
Re: [Python-Dev] New Py_UNICODE doc
On May 8, 2005, at 1:44 PM, Martin v. Löwis wrote: Shane Hathaway wrote: Fair enough. The original point is that the documentation is unclear about what a Py_UNICODE[] contains. I deduced that it contains either UCS2 or UCS4 and implemented accordingly. Not only did I guess wrong, but others will probably guess wrong too. Something in the docs needs to spell this out. Again, patches are welcome. I was opposed to Nick's proposed changes, since they explicitly said that you are not supposed to know what is in a Py_UNICODE. Integrating the essence of PEP 261 into the main documentation would be a worthwhile task. You can't possibly assume you know specifically what's in a Py_UNICODE in any given python installation. If someone thinks this statement is untrue, please explain why. I realize you might not *want* that to be true, but it is. Users are free to configure their python however they desire, and if that means --enable-unicode=ucs2 on RH9, then that is perfectly valid. -- Nick
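Since the width cannot be assumed, code that cares has to ask the running interpreter. From the Python side this can be done with `sys.maxunicode`; the sketch below is an illustration only, and the helper name `unicode_build_width` is invented for this note. On the interpreters discussed in this thread a ucs2 build reports 0xFFFF; on current Python 3 (3.3 and later) the value is always 0x10FFFF, so the function returns 4 there.

```python
import sys


def unicode_build_width():
    """Return 2 for a narrow (16-bit) build and 4 for a wide build."""
    # sys.maxunicode is the largest code point a single string item can hold.
    return 2 if sys.maxunicode == 0xFFFF else 4
```

An extension distributor could use a check like this at import time to refuse loading a binary built for the other width, instead of failing in a less obvious way later.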
Re: [Python-Dev] New Py_UNICODE doc
Martin v. Löwis wrote: Define correctly. Python, in ucs2 mode, will allow to address individual surrogate codes, e.g. in indexing. So you get u\U00012345[0] When Python encodes characters internally in UCS-2, I would expect u\U00012345 to produce a UnicodeError(character can not be encoded in UCS-2). u'\ud808' This will never work correctly, and never should, because an efficient implementation isn't possible. If you want safe indexing and slicing, you need ucs4. I agree that UCS4 is needed. There is a balancing act here; UTF-16 is widely used and takes less space, while UCS4 is easier to treat as an array of characters. Maybe we can have both: unicode objects start with an internal representation in UTF-16, but get promoted automatically to UCS4 when you index or slice them. The difference will not be visible to Python code. A compile-time switch will not be necessary. What do you think? Shane ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
Martin v. Löwis wrote: Shane Hathaway wrote: I agree that UCS4 is needed. There is a balancing act here; UTF-16 is widely used and takes less space, while UCS4 is easier to treat as an array of characters. Maybe we can have both: unicode objects start with an internal representation in UTF-16, but get promoted automatically to UCS4 when you index or slice them. The difference will not be visible to Python code. A compile-time switch will not be necessary. What do you think? This breaks backwards compatibility with existing extension modules. Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and can use that to directly access the characters. Py_UNICODE would always be 32 bits wide. PyUnicode_AsUnicode would cause the unicode object to be promoted automatically. Extensions that break as a result are technically broken already, aren't they? They're not supposed to depend on the size of Py_UNICODE. Shane ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
Shane Hathaway wrote: Py_UNICODE would always be 32 bits wide. This would break PythonWin, which relies on Py_UNICODE being the same as WCHAR_T. PythonWin is not broken, it just hasn't been ported to UCS-4, yet (and porting this is difficult and will cause a performance loss). Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
Shane Hathaway wrote: Martin v. Löwis wrote: Shane Hathaway wrote: I agree that UCS4 is needed. There is a balancing act here; UTF-16 is widely used and takes less space, while UCS4 is easier to treat as an array of characters. Maybe we can have both: unicode objects start with an internal representation in UTF-16, but get promoted automatically to UCS4 when you index or slice them. The difference will not be visible to Python code. A compile-time switch will not be necessary. What do you think? This breaks backwards compatibility with existing extension modules. Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and can use that to directly access the characters. Py_UNICODE would always be 32 bits wide. PyUnicode_AsUnicode would cause the unicode object to be promoted automatically. Extensions that break as a result are technically broken already, aren't they? They're not supposed to depend on the size of Py_UNICODE. -1. You are free to compile Python with --enable-unicode=ucs4 if you prefer this setting. I don't see any reason why we should force users to invest 4 bytes of storage for each Unicode code point - 2 bytes work just fine and can represent all Unicode characters that are currently defined (using surrogates if necessary). As more and more Unicode objects are used in a process, choosing UCS2 vs. UCS4 does make a huge difference in terms of used memory. All this talk about UTF-16 vs. UCS-2 is not very useful and strikes me as purely academic. The reference to possible breakage by slicing a Unicode and breaking a surrogate pair is valid, the idea of UCS-4 being less prone to breakage is a myth: Unicode has many code points that are meant only for composition and don't have any standalone meaning, e.g. a combining acute accent (U+0301), yet they are perfectly valid code points - regardless of UCS-2 or UCS-4. 
It is easily possible to break such a combining sequence using slicing, so the most often presented argument for using UCS-4 instead of UCS-2 (+ surrogates) is rather weak if seen by daylight. Some may now say that combining sequences are not used all that often. However, they play a central role in Unicode normalization (http://www.unicode.org/reports/tr15/), which is needed whenever you want to semantically compare Unicode objects and are -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 07 2005) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
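The combining-sequence point made in this message can be demonstrated with the standard `unicodedata` module. The sketch below is written for current Python 3 and uses the U+0301 example from the message; the helper name `same_text` is an assumption for illustration.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT (U+0301): two code points, one visible character.
decomposed = "e\u0301"

# NFC normalization composes the sequence into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Slicing separates the base letter from its accent, whatever the storage width.
sliced = decomposed[:1]


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here `decomposed == composed` is false even though both render as the same character, while `same_text(decomposed, composed)` is true; `sliced` is a bare "e" with the accent lost.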
Re: [Python-Dev] New Py_UNICODE doc
Martin v. Löwis wrote: M.-A. Lemburg wrote: Hmm, looking at the configure.in script, it seems you're right. I wonder why this weird dependency on TCL was added. If Python is configured for UCS-2, and Tcl for UCS-4, then Tkinter would not work out of the box. Hence the weird dependency. I believe that it would be more appropriate to adjust the _tkinter module to adapt to the TCL Unicode size rather than forcing the complete Python system to adapt to TCL - I don't really see the point in an optional extension module defining the default for the interpreter core. At the very least, this should be a user controlled option. Otherwise, we might as well use sizeof(wchar_t) as basis for the default Unicode size. This at least, would be a much more reasonable choice than whatever TCL uses. - Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 07 2005) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
On May 7, 2005, at 9:29 AM, Martin v. Löwis wrote: Nicholas Bastin wrote: --enable-unicode=ucs2 be replaced with: --enable-unicode=utf16 and the docs be updated to reflect more accurately the variance of the internal storage type. -1. This breaks existing documentation and usage, and provides only minimum value. Have you been missing this conversation? UTF-16 is *WHAT PYTHON CURRENTLY IMPLEMENTS*. The current documentation is flat out wrong. Breaking that isn't a big problem in my book. It provides more than minimum value - it provides the truth. With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start supporting the full Unicode ccs the same way it supports UCS-2. Individual surrogate values remain accessible, and supporting non-BMP characters is left to the application (with the exception of the UTF-8 codec). I can't understand what you mean by this. My point is that if you configure python to support UCS-2, then it SHOULD NOT support surrogate pairs. Supporting surrogate pairs is the purview of variable width encodings, and UCS-2 is not among them. -- Nick
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote: On May 7, 2005, at 9:29 AM, Martin v. Löwis wrote: With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start supporting the full Unicode ccs the same way it supports UCS-2. Individual surrogate values remain accessible, and supporting non-BMP characters is left to the application (with the exception of the UTF-8 codec). I can't understand what you mean by this. My point is that if you configure python to support UCS-2, then it SHOULD NOT support surrogate pairs. Supporting surrogate pairs is the purview of variable width encodings, and UCS-2 is not among them. Surrogate pairs are only supported by the UTF-8 and UTF-16 codecs (and a few others), not the Python Unicode implementation itself - this treats surrogate code points just like any other Unicode code point. This allows us to be flexible and efficient in the implementation while guaranteeing the round-trip safety of Unicode data processed through Python. Your complaint about the documentation (which started this thread) is valid. However, I don't understand all the excitement about Py_UNICODE: if you don't like the way this Python typedef works, you are free to interface to Python using any of the supported encodings using PyUnicode_Encode() and PyUnicode_Decode(). I'm sure you'll find one that fits your needs and if not, you can even write your own codec and register it with Python, e.g. UTF-32 which we currently don't support ;-) Please upload your doc-patch to SF. Thanks, -- Marc-Andre Lemburg
Re: [Python-Dev] New Py_UNICODE doc
On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote: However, I don't understand all the excitement about Py_UNICODE: if you don't like the way this Python typedef works, you are free to interface to Python using any of the supported encodings using PyUnicode_Encode() and PyUnicode_Decode(). I'm sure you'll find one that fits your needs and if not, you can even write your own codec and register it with Python, e.g. UTF-32 which we currently don't support ;-) My concerns about Py_UNICODE are completely separate from my frustration that the documentation is wrong about this type. It is much more important that the documentation be correct, first, and then we can discuss the reasons why it can be one of two values, rather than just a uniform value across all python implementations. This makes distributing binary extension modules hard. It has become clear to me that no one on this list gives a *%^ about people attempting to distribute binary extension modules, or they would have cared about this problem, so I'll just drop that point. However, somehow, what keeps getting lost in the mix is that --enable-unicode=ucs2 is a lie, and we should change what this configure option says. Martin seems to disagree with me, for reasons that I don't understand. I would be fine with calling the option utf16, or just 2 and 4, but not ucs2, as that means things that Python doesn't intend it to mean. -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote: On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote: However, I don't understand all the excitement about Py_UNICODE: if you don't like the way this Python typedef works, you are free to interface to Python using any of the supported encodings using PyUnicode_Encode() and PyUnicode_Decode(). I'm sure you'll find one that fits your needs and if not, you can even write your own codec and register it with Python, e.g. UTF-32 which we currently don't support ;-) My concerns about Py_UNICODE are completely separate from my frustration that the documentation is wrong about this type. It is much more important that the documentation be correct, first, and then we can discuss the reasons why it can be one of two values, rather than just a uniform value across all python implementations. This makes distributing binary extension modules hard. It has become clear to me that no one on this list gives a *%^ about people attempting to distribute binary extension modules, or they would have cared about this problem, so I'll just drop that point. Actually, many of us know about the problem of having to ship UCS2 and UCS4 builds of binary extensions and the troubles this causes with users. It just adds one more dimension to the number of builds you have to make - one for the Python version, another for the platform and in the case of Linux another one for the Unicode width. Nowadays most Linux distros ship UCS4 builds (after RedHat started this quest), so things start to normalize again. However, somehow, what keeps getting lost in the mix is that --enable-unicode=ucs2 is a lie, and we should change what this configure option says. Martin seems to disagree with me, for reasons that I don't understand. I would be fine with calling the option utf16, or just 2 and 4, but not ucs2, as that means things that Python doesn't intend it to mean. 
It's not a lie: the Unicode implementation does work with UCS2 code points (surrogate values are Unicode code points as well - they happen to live in a special zone of the BMP). Only the codecs add support for surrogates in a way that allows round-trip safety regardless of whether you used UCS2 or UCS4 as compile time option. -- Marc-Andre Lemburg
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote: On May 4, 2005, at 6:20 PM, Shane Hathaway wrote: Nicholas Bastin wrote: This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform. But people want to know "Is Python's Unicode 16-bit or 32-bit?" So the documentation should explicitly say "it depends". On a related note, it would be helpful if the documentation provided a little more background on unicode encoding. Specifically, that UCS-2 is not the same as UTF-16, even though they're both two bytes wide and most of the characters are the same. UTF-16 can encode 4 byte characters, while UCS-2 can't. A Py_UNICODE is either UCS-2 or UCS-4. It took me I'm not sure the Python documentation is the place to teach someone about unicode. The ISO 10646 pretty clearly defines UCS-2 as only containing characters in the BMP (plane zero). On the other hand, I don't know why python lets you choose UCS-2 anyhow, since it's almost always not what you want. You've got that wrong: Python lets you choose UCS-4 - UCS-2 is the default. Note that Python's Unicode codecs UTF-8 and UTF-16 are surrogate aware and thus support non-BMP code points regardless of the build type: A UCS2-build of Python will store a non-BMP code point as a UTF-16 surrogate pair in the Py_UNICODE buffer while a UCS4 build will store it as a single value. Decoding is surrogate aware too, so a UTF-16 surrogate pair in a UCS2 build will get treated as a single Unicode code point. Ideally, the Python programmer should not really need to know all this and I think we've achieved that up to a certain point (Unicode can be complicated - there's nothing to hide there). However, the C programmer using the Python C API to interface to some other Unicode implementation will need to know these details. 
-- Marc-Andre Lemburg
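The storage difference described in this message can be made visible through the codecs themselves. The sketch below, written for current Python 3, lists the 16-bit units the UTF-16 codec produces for a string, which is roughly what a UCS2 build would hold in its Py_UNICODE buffer according to the message; the helper name `utf16_units` is invented for this note. It also uses the `utf-32-be` codec, which did not exist when this thread was written but is part of today's standard library.

```python
def utf16_units(text):
    """Return the 16-bit code units of text as encoded by the UTF-16 codec."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# One non-BMP code point: two 16-bit units (a surrogate pair) ...
narrow_view = utf16_units("\U00012345")

# ... but a single 32-bit value in a 4-byte representation.
wide_view = "\U00012345".encode("utf-32-be")

# Decoding is surrogate aware: the pair comes back as one code point.
round_trip = "\U00012345".encode("utf-16-be").decode("utf-16-be")
```

`narrow_view` is the pair 0xD808, 0xDF45, while `wide_view` is the four bytes 00 01 23 45.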
Re: [Python-Dev] New Py_UNICODE doc
Fredrik Lundh wrote:
> Thomas Heller wrote:
>> AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars, independent of the size of wchar_t. The HAVE_USABLE_WCHAR_T macro can be used by extension writers to determine if Py_UNICODE is the same as wchar_t.
>
> note that "usable" is more than just "same size"; it also implies that widechar predicates (iswalnum etc) work properly with Unicode characters, under all locales.

Only if you intend to use --with-wctypes, a configure option which will go away soon (for exactly the reason you are referring to: the widechar predicates don't work properly under all locales).

-- Marc-Andre Lemburg
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote: On May 4, 2005, at 6:03 PM, Martin v. Löwis wrote: Nicholas Bastin wrote: This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform. But people want to know Is Python's Unicode 16-bit or 32-bit? So the documentation should explicitly say it depends. The important piece of information is that it is not guaranteed to be a particular one of those sizes. Once you can't guarantee the size, no one really cares what size it is. The documentation should discourage developers from attempting to manipulate Py_UNICODE directly, which, other than trivia, is the only reason why someone would care what size the internal representation is. I don't see why you shouldn't use Py_UNICODE buffer directly. After all, the reason why we have that typedef is to make it possible to program against an abstract type - regardless of its size on the given platform. In that respect it is similar to wchar_t (and all the other *_t typedefs in C). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 06 2005) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
On May 6, 2005, at 3:25 AM, M.-A. Lemburg wrote: I don't see why you shouldn't use Py_UNICODE buffer directly. After all, the reason why we have that typedef is to make it possible to program against an abstract type - regardless of its size on the given platform. Because the encoding of that buffer appears to be different depending on the configure options. If that isn't true, then someone needs to change the doc, and the configure options. Right now, it seems *very* clear that Py_UNICODE may either be UCS-2 or UCS-4 encoded if you read the configure help, and you can't use the buffer directly if the encoding is variable. However, you seem to be saying that this isn't true. -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:
> You've got that wrong: Python lets you choose UCS-4 - UCS-2 is the default.

No, that's not true. Python lets you choose UCS-4 or UCS-2. What the default is depends on your platform. If you run raw configure, some systems will choose UCS-4, and some will choose UCS-2. This is how the conversation came about in the first place - running ./configure on RHL9 gives you UCS-4.

-- Nick
Re: [Python-Dev] New Py_UNICODE doc
On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote: If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and it doesn't support that. I would assume that any UTF-16 string I would read would be transcoded into the internal type (UCS-2), and information would be lost. If this is not the case, then what does the configure option mean? It means all the string operations treat strings as if they were UCS-2, but that in actuality, they are UTF-16. Same as the case in the windows APIs and Java. That is, all string operations are essentially broken, because they're operating on encoded bytes, not characters, but claim to be operating on characters. James ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc (Another Attempt)
After reading through the code and the comments in this thread, I propose the following in the documentation as the definition of Py_UNICODE:

  This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size or native encoding of this type on any given platform.

The main point here is that extension developers cannot safely slam Py_UNICODE (which it appeared was true when the documentation stated that it was always 16 bits).

I don't propose that we put this information in the doc, but the possible internal representations are:

  - 2-byte wchar_t or unsigned short, encoded as UTF-16
  - 4-byte wchar_t, encoded as UTF-32 (UCS-4)

If you do not explicitly set the configure option, you cannot guarantee which you will get. Python also does not normalize the byte order of unicode strings passed into it from C (via PyUnicode_EncodeUTF16, for example), so it is possible to have UTF-16LE and UTF-16BE strings in the system at the same time, which is a bit confusing. This may or may not be worth a mention in the doc (or a patch).

-- Nick
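Editorial illustration, not part of the original message: the UTF-16LE versus UTF-16BE distinction mentioned above is easiest to see on encoded bytes. This sketch uses the standard Python 3 codecs and an arbitrarily chosen code point; it shows codec output only and makes no claim about what the interpreter's internal buffer contains.

```python
import codecs

text = "\U00012345"  # arbitrary non-BMP code point, picked for illustration

little = text.encode("utf-16-le")  # explicit little-endian, no BOM
big = text.encode("utf-16-be")     # explicit big-endian, no BOM
native = text.encode("utf-16")     # native byte order, BOM prepended

# Same two code units (0xD808, 0xDF45); only the byte order of each differs.
assert little == b"\x08\xd8\x45\xdf"
assert big == b"\xd8\x08\xdf\x45"
assert native.startswith(codecs.BOM_UTF16)

# Without a BOM, the consumer has to be told which order was used.
assert little.decode("utf-16-le") == big.decode("utf-16-be") == text
print(little.hex(), big.hex())
```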
Re: [Python-Dev] New Py_UNICODE doc
On May 6, 2005, at 3:42 PM, James Y Knight wrote: On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote: If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and it doesn't support that. I would assume that any UTF-16 string I would read would be transcoded into the internal type (UCS-2), and information would be lost. If this is not the case, then what does the configure option mean? It means all the string operations treat strings as if they were UCS-2, but that in actuality, they are UTF-16. Same as the case in the windows APIs and Java. That is, all string operations are essentially broken, because they're operating on encoded bytes, not characters, but claim to be operating on characters. Well, this is a completely separate issue/problem. The internal representation is UTF-16, and should be stated as such. If the built-in methods actually don't work with surrogate pairs, then that should be fixed. -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote: On May 6, 2005, at 3:42 PM, James Y Knight wrote: It means all the string operations treat strings as if they were UCS-2, but that in actuality, they are UTF-16. Same as the case in the windows APIs and Java. That is, all string operations are essentially broken, because they're operating on encoded bytes, not characters, but claim to be operating on characters. Well, this is a completely separate issue/problem. The internal representation is UTF-16, and should be stated as such. If the built-in methods actually don't work with surrogate pairs, then that should be fixed. Wait... are you saying a Py_UNICODE array contains either UTF-16 or UTF-32 characters, but never UCS-2? That's a big surprise to me. I may need to change my PyXPCOM patch to fit this new understanding. I tried hard to not care how Python encodes unicode characters, but details like this are important when combining two frameworks with different unicode APIs. Shane ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
On May 6, 2005, at 5:21 PM, Shane Hathaway wrote: Nicholas Bastin wrote: On May 6, 2005, at 3:42 PM, James Y Knight wrote: It means all the string operations treat strings as if they were UCS-2, but that in actuality, they are UTF-16. Same as the case in the windows APIs and Java. That is, all string operations are essentially broken, because they're operating on encoded bytes, not characters, but claim to be operating on characters. Well, this is a completely separate issue/problem. The internal representation is UTF-16, and should be stated as such. If the built-in methods actually don't work with surrogate pairs, then that should be fixed. Wait... are you saying a Py_UNICODE array contains either UTF-16 or UTF-32 characters, but never UCS-2? That's a big surprise to me. I may need to change my PyXPCOM patch to fit this new understanding. I tried hard to not care how Python encodes unicode characters, but details like this are important when combining two frameworks with different unicode APIs. Yes. Well, in as much as a large part of UTF-16 directly overlaps UCS-2, then sometimes unicode strings contain UCS-2 characters. However, characters which would not be legal in UCS-2 are still encoded properly in python, in UTF-16. And yes, I feel your pain, that's how I *got* into this position. Mapping from external unicode types is an important aspect of writing extension modules, and the documentation does not help people trying to do this. The fact that python's internal encoding is variable is a huge problem in and of itself, even if that was documented properly. This is why tools like Xerces and ICU will be happy to give you whatever form of unicode strings you want, but internally they always use UTF-16 - to avoid having to write two internal implementations of the same functionality. 
If you look up and down Objects/unicodeobject.c you'll see a fair amount of code written a couple of different ways (using #ifdef's) because of the variability in the internal representation. -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote: On May 6, 2005, at 5:21 PM, Shane Hathaway wrote: Wait... are you saying a Py_UNICODE array contains either UTF-16 or UTF-32 characters, but never UCS-2? That's a big surprise to me. I may need to change my PyXPCOM patch to fit this new understanding. I tried hard to not care how Python encodes unicode characters, but details like this are important when combining two frameworks with different unicode APIs. Yes. Well, in as much as a large part of UTF-16 directly overlaps UCS-2, then sometimes unicode strings contain UCS-2 characters. However, characters which would not be legal in UCS-2 are still encoded properly in python, in UTF-16. And yes, I feel your pain, that's how I *got* into this position. Mapping from external unicode types is an important aspect of writing extension modules, and the documentation does not help people trying to do this. The fact that python's internal encoding is variable is a huge problem in and of itself, even if that was documented properly. This is why tools like Xerces and ICU will be happy to give you whatever form of unicode strings you want, but internally they always use UTF-16 - to avoid having to write two internal implementations of the same functionality. If you look up and down Objects/unicodeobject.c you'll see a fair amount of code written a couple of different ways (using #ifdef's) because of the variability in the internal representation. Ok. Thanks for helping me understand where Python is WRT unicode. I can work around the issues (or maybe try to help solve them) now that I know the current state of affairs. If Python correctly handled UTF-16 strings internally, we wouldn't need the UCS-4 configuration switch, would we? Shane ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote:
> The important piece of information is that it is not guaranteed to be a particular one of those sizes. Once you can't guarantee the size, no one really cares what size it is.

Please trust many years of experience: This is just not true. People do care, and they want to know. If we tell them "it depends", they ask "how can I find out?".

> The documentation should discourage developers from attempting to manipulate Py_UNICODE directly, which, other than trivia, is the only reason why someone would care what size the internal representation is.

Why is that? Of *course* people will have to manipulate Py_UNICODE* buffers directly. What else can they use?

Regards, Martin
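Editorial illustration, not part of the original message: one way to answer "how can I find out?" from Python code is `sys.maxunicode`. The sketch below assumes the historical behaviour of interpreters older than 3.3, where the value was 0xFFFF on a 16-bit build and 0x10FFFF on a 32-bit build; from 3.3 on it is always 0x10FFFF, so the hint is only meaningful on older builds. The function name is made up for this example.

```python
import sys


def unicode_storage_hint():
    """Guess the width in bytes of one Unicode storage unit.

    Assumption: 0xFFFF means a 16-bit (narrow) build; anything else is
    treated as a 32-bit (wide) build.
    """
    return 2 if sys.maxunicode == 0xFFFF else 4


print("sys.maxunicode = %#x" % sys.maxunicode)
print("storage width hint: %d bytes" % unicode_storage_hint())
```

A C extension would not use this; it would rely on the Py_UNICODE typedef itself, as discussed in the thread.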
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote:
> If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and it doesn't support that.

What do you mean by that? That the interpreter crashes if you try to store a low surrogate into a Py_UNICODE?

> I would assume that any UTF-16 string I would read would be transcoded into the internal type (UCS-2), and information would be lost. If this is not the case, then what does the configure option mean?

It tells you whether you have the two-octet form of the Universal Character Set, or the four-octet form.

Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote:
> Because the encoding of that buffer appears to be different depending on the configure options.

What makes it appear so? sizeof(Py_UNICODE) changes when you change the option - does that, in your mind, mean that the encoding changes?

> If that isn't true, then someone needs to change the doc, and the configure options. Right now, it seems *very* clear that Py_UNICODE may either be UCS-2 or UCS-4 encoded if you read the configure help, and you can't use the buffer directly if the encoding is variable. However, you seem to be saying that this isn't true.

It's a compile-time option (as are all configure options). So at run-time, it isn't variable.

Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote:
> No, that's not true. Python lets you choose UCS-4 or UCS-2. What the default is depends on your platform.

The truth is more complicated. If your Tcl is built for UCS-4, then Python will also be built for UCS-4 (unless overridden by command line). Otherwise, Python will default to UCS-2.

Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
M.-A. Lemburg wrote:
> Hmm, looking at the configure.in script, it seems you're right. I wonder why this weird dependency on TCL was added.

If Python is configured for UCS-2, and Tcl for UCS-4, then Tkinter would not work out of the box. Hence the weird dependency.

Regards, Martin
Re: [Python-Dev] New Py_UNICODE doc
On May 6, 2005, at 7:43 PM, Martin v. Löwis wrote: Nicholas Bastin wrote: If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and it doesn't support that. What do you mean by that? That the interpreter crashes if you try to store a low surrogate into a Py_UNICODE? What I mean is pretty clear. UCS-2 does *NOT* support surrogate pairs. If it did, it would be called UTF-16. If Python really supported UCS-2, then surrogate pairs from UTF-16 inputs would either get turned into two garbage characters, or the I couldn't transcode this UCS-2 code point (I don't remember which on that is off the top of my head). I would assume that any UTF-16 string I would read would be transcoded into the internal type (UCS-2), and information would be lost. If this is not the case, then what does the configure option mean? It tells you whether you have the two-octet form of the Universal Character Set, or the four-octet form. It would, if that were the case, but it's not. Setting UCS-2 in the configure script really means UTF-16, and as such, the documentation should reflect that. -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
On May 6, 2005, at 7:45 PM, Martin v. Löwis wrote: Nicholas Bastin wrote: Because the encoding of that buffer appears to be different depending on the configure options. What makes it appear so? sizeof(Py_UNICODE) changes when you change the option - does that, in your mind, mean that the encoding changes? Yes. Not only in my mind, but in the Python source code. If Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4), otherwise the encoding is UTF-16 (*not* UCS-2). If that isn't true, then someone needs to change the doc, and the configure options. Right now, it seems *very* clear that Py_UNICODE may either be UCS-2 or UCS-4 encoded if you read the configure help, and you can't use the buffer directly if the encoding is variable. However, you seem to be saying that this isn't true. It's a compile-time option (as all configure options). So at run-time, it isn't variable. What I mean by 'variable' is that you can't make any assumption as to what the size will be in any given python when you're writing (and building) an extension module. This breaks binary compatibility of extensions modules on the same platform and same version of python across interpreters which may have been built with different configure options. -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
Shane Hathaway wrote:
> Ok. Thanks for helping me understand where Python is WRT unicode. I can work around the issues (or maybe try to help solve them) now that I know the current state of affairs. If Python correctly handled UTF-16 strings internally, we wouldn't need the UCS-4 configuration switch, would we?

Define "correctly". Python, in ucs2 mode, will allow you to address individual surrogate codes, e.g. in indexing. So you get

  >>> u"\U00012345"[0]
  u'\ud808'

This will never work "correctly", and never should, because an efficient implementation isn't possible. If you want safe indexing and slicing, you need ucs4.

Regards, Martin
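Editorial illustration, not part of the original message: the efficiency point above can be sketched by emulating a 16-bit buffer as a list of UTF-16 code units. Indexing the units is a constant-time lookup but can land on half of a pair; indexing by code point has to scan from the start. This assumes Python 3.3 or later, and the helper names `utf16_units` and `codepoint_at` are invented for the example.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def codepoint_at(units, index):
    """Return the index-th code point by scanning from the start (O(n))."""
    pos = 0
    seen = 0
    while pos < len(units):
        unit = units[pos]
        is_pair = (0xD800 <= unit <= 0xDBFF
                   and pos + 1 < len(units)
                   and 0xDC00 <= units[pos + 1] <= 0xDFFF)
        if is_pair:
            # Recombine high and low surrogate into one code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[pos + 1] - 0xDC00)
            step = 2
        else:
            cp = unit
            step = 1
        if seen == index:
            return cp
        seen += 1
        pos += step
    raise IndexError("code point index out of range")


units = utf16_units("a\U00012345b")
print(hex(units[1]))                 # raw unit lookup: half of a pair
print(hex(codepoint_at(units, 1)))   # code point lookup: needs the scan
```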
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote: What I mean is pretty clear. UCS-2 does *NOT* support surrogate pairs. If it did, it would be called UTF-16. If Python really supported UCS-2, then surrogate pairs from UTF-16 inputs would either get turned into two garbage characters, or the I couldn't transcode this UCS-2 code point (I don't remember which on that is off the top of my head). OTOH, if Python really supported UTF-16, then unichr(0x1) would work, and len(u\U0001) would be 1. It is primarily just the UTF-8 codec which supports UTF-16. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
On May 6, 2005, at 8:25 PM, Martin v. Löwis wrote: Nicholas Bastin wrote: Yes. Not only in my mind, but in the Python source code. If Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4), otherwise the encoding is UTF-16 (*not* UCS-2). I see. Some people equate encoding with encoding scheme; neither UTF-32 nor UTF-16 is an encoding scheme. You were That's not true. UTF-16 and UTF-32 are both CES and CEF (although this is not true of UTF-16LE and BE). UTF-32 is a fixed-width encoding form within a code space of (0..10) and UTF-16 is a variable-width encoding form which provides a mix of one of two 16-bit code units in the code space of (0..). However, you are perhaps right to point out that people should be more explicit as to which they are referring to. UCS-2, however, is only a CEF, and thus I thought it was obvious that I was referring to UTF-16 as a CEF. I would point anyone who is confused as this point to Unicode Technical Report #17 on the Character Encoding Model, which is much more clear than trying to piece together the relevant parts out of the entire standard. In any event, Python's use of the term UCS-2 is incorrect. I quote from the TR: The UCS-2 encoding form, which is associated with ISO/IEC 10646 and can only express characters in the BMP, is a fixed-width encoding form. immediately followed by: In contrast, UTF-16 uses either one or two code units and is able to cover the entire code space of Unicode. If Python is capable of representing the entire code space of Unicode when you choose --unicode=ucs2, then that is a bug. It either should not be called UCS-2, or the interpreter should be bound by the limitations of the UCS-2 CEF. What I mean by 'variable' is that you can't make any assumption as to what the size will be in any given python when you're writing (and building) an extension module. 
This breaks binary compatibility of extensions modules on the same platform and same version of python across interpreters which may have been built with different configure options. True. The breakage will be quite obvious, in most cases: the module fails to load because not only sizeof(Py_UNICODE) changes, but also the names of all symbols change. Yes, but the important question here is why would we want that? Why doesn't Python just have *one* internal representation of a Unicode character? Having more than one possible definition just creates problems, and provides no value. -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
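Editorial illustration, not part of the original message: the fixed-width versus variable-width distinction quoted from the technical report can be sketched as two encoders. `encode_ucs2` is a hypothetical helper written for this example (Python has no built-in UCS-2 codec); it refuses anything outside the BMP, while the standard `utf-16-be` codec emits a surrogate pair instead. Lone surrogate code points are not handled specially here.

```python
def encode_ucs2(text):
    """Encode text as big-endian UCS-2, rejecting non-BMP code points."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(
                "U+%04X is outside the BMP; UCS-2 cannot express it" % cp)
        out += cp.to_bytes(2, "big")  # always exactly one 16-bit unit
    return bytes(out)


# For BMP-only text the two forms produce identical bytes.
assert encode_ucs2("Py") == "Py".encode("utf-16-be")

# Outside the BMP they diverge: UTF-16 uses two units, UCS-2 has no answer.
assert len("\U00012345".encode("utf-16-be")) == 4
try:
    encode_ucs2("\U00012345")
except ValueError as exc:
    print("UCS-2 refused:", exc)
```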
Re: [Python-Dev] New Py_UNICODE doc
On May 6, 2005, at 8:11 PM, Martin v. Löwis wrote: Nicholas Bastin wrote: Well, this is a completely separate issue/problem. The internal representation is UTF-16, and should be stated as such. If the built-in methods actually don't work with surrogate pairs, then that should be fixed. Yes to the former, no to the latter. PEP 261 specifies what should and shouldn't work. This PEP has several textual errors and ambiguities (which, admittedly, may have been a necessary state given the unicode standard in 2001). However, putting that aside, I would recommend that: --enable-unicode=ucs2 be replaced with: --enable-unicode=utf16 and the docs be updated to reflect more accurately the variance of the internal storage type. I would also like the community to strongly consider standardizing on a single internal representation, but I will leave that fight for another day. -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
On May 4, 2005, at 6:03 PM, Martin v. Löwis wrote: Nicholas Bastin wrote: This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform. But people want to know Is Python's Unicode 16-bit or 32-bit? So the documentation should explicitly say it depends. The important piece of information is that it is not guaranteed to be a particular one of those sizes. Once you can't guarantee the size, no one really cares what size it is. The documentation should discourage developers from attempting to manipulate Py_UNICODE directly, which, other than trivia, is the only reason why someone would care what size the internal representation is. -- Nick ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote:
> On May 4, 2005, at 6:20 PM, Shane Hathaway wrote:
>> On a related note, it would be helpful if the documentation provided a little more background on unicode encoding. Specifically, that UCS-2 is not the same as UTF-16, even though they're both two bytes wide and most of the characters are the same. UTF-16 can encode 4 byte characters, while UCS-2 can't. A Py_UNICODE is either UCS-2 or UCS-4. It took me
>
> I'm not sure the Python documentation is the place to teach someone about unicode. The ISO 10646 pretty clearly defines UCS-2 as only containing characters in the BMP (plane zero). On the other hand, I don't know why python lets you choose UCS-2 anyhow, since it's almost always not what you want.

Then something in the Python docs ought to say why UCS-2 is not what you want. I still don't know; I've heard differing opinions on the subject. Some say you'll never need more than what UCS-2 provides. Is that incorrect?

More generally, how should a non-unicode-expert writing Python extension code find out the minimum they need to know about unicode to use the Python unicode API? The API reference [1] ought to at least have a list of background links. I had to hunt everywhere.

.. [1] http://docs.python.org/api/unicodeObjects.html

Shane
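Editorial illustration, not part of the original message: on the question of whether anything beyond UCS-2 is ever needed, assigned characters do exist above U+FFFF. A small sketch in Python 3, using two sample code points chosen for this example and the standard `unicodedata` module:

```python
import unicodedata

# Two assigned characters outside the BMP, chosen only as samples.
samples = ["\U0001D11E", "\U00010400"]

for ch in samples:
    cp = ord(ch)
    in_bmp = cp <= 0xFFFF
    print("U+%05X %s (fits in UCS-2: %s)" % (cp, unicodedata.name(ch), in_bmp))
```

Whether a given application ever meets such characters is a separate question that this sketch does not settle.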
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin writes:
> The current documentation for Py_UNICODE states:
>
>   This type represents a 16-bit unsigned storage type which is used by Python internally as basis for holding Unicode ordinals. On platforms where wchar_t is available and also has 16-bits, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for unsigned short.
>
> I propose changing this to:
>
>   This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. On platforms where wchar_t is available, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility.

This just isn't true. Have you read ./configure --help recently?

>   On all other platforms, Py_UNICODE is a typedef alias for unsigned short. Extension module developers should make no assumptions about the size of this type on any given platform.

I like this last sentence, though.

> If no one has a problem with that, I'll make the change in CVS.

I have a problem with replacing one lie with another :)

Cheers,
mwh

--
Just put the user directories on a 486 with deadrat7.1 and turn the Octane into the aforementioned beer fridge and keep it in your office. The lusers won't notice the difference, except that you're more cheery during office hours. -- Pim van Riezen, asr
Re: [Python-Dev] New Py_UNICODE doc
On May 4, 2005, at 1:02 PM, Michael Hudson wrote:
> Nicholas Bastin writes:
>> The current documentation for Py_UNICODE states:
>>
>>   This type represents a 16-bit unsigned storage type which is used by Python internally as basis for holding Unicode ordinals. On platforms where wchar_t is available and also has 16-bits, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for unsigned short.
>>
>> I propose changing this to:
>>
>>   This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. On platforms where wchar_t is available, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility.
>
> This just isn't true. Have you read ./configure --help recently?

Ok, so the above statement is true if the user does not set --enable-unicode=ucs[24] (I was reading the wchar_t test in configure.in, and not the generated configure help).

Alternatively, we shouldn't talk about the size at all, and just leave the first and last sentences:

  This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform.

-- Nick
Re: [Python-Dev] New Py_UNICODE doc
Thomas Heller wrote:
> AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars, independent of the size of wchar_t. The HAVE_USABLE_WCHAR_T macro can be used by extension writers to determine if Py_UNICODE is the same as wchar_t.

note that "usable" is more than just "same size"; it also implies that widechar predicates (iswalnum etc) work properly with Unicode characters, under all locales.

/F
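Editorial illustration, not part of the original message: the size of the platform's wchar_t, which the thread treats as independent of the Unicode build option, can be inspected from Python through `ctypes`. This sketch only reports the size; it says nothing about whether wchar_t is "usable" in the sense described above (the locale behaviour of the widechar predicates), which HAVE_USABLE_WCHAR_T is meant to capture on the C side.

```python
import ctypes

# Size of the C wchar_t type on this platform, in bytes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print("sizeof(wchar_t) = %d bytes" % wchar_size)
```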
Re: [Python-Dev] New Py_UNICODE doc
Nicholas Bastin wrote:
> This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform.

But people want to know "Is Python's Unicode 16-bit or 32-bit?" So the documentation should explicitly say "it depends".

Regards, Martin