Re: [Python-Dev] len(chr(i)) = 2?

2010-11-27 Thread Stephen J. Turnbull
Glyph Lefkowitz writes: But I don't think that anyone is filling up main memory with gigantic piles of character indexes and need to squeeze out that extra couple of bytes of memory on such a tiny object. How do you think editors and browsers represent the regions that they highlight,

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-25 Thread M.-A. Lemburg
Terry Reedy wrote: On 11/24/2010 3:06 PM, Alexander Belopolsky wrote: Any non-trivial text processing is likely to be broken in presence of surrogates. Producing them on input is just trading known issue for an unknown one. Processing surrogate pairs in python code is hard. Software that

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-25 Thread M.-A. Lemburg
Alexander Belopolsky wrote: On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull step...@xemacs.org wrote: .. I note that an opinion has been raised on this thread that if we want compressed internal representation for strings, we should use UTF-8. I tend to agree, but UTF-8 has been

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-25 Thread Victor Stinner
On Friday 19 November 2010 23:25:03 you wrote: Python is unclear about non-BMP characters: narrow build was called ucs2 for long time, even if it is UTF-16 (each character is encoded to one or two UTF-16 words). No, no, no :-) UCS2 and UCS4 are more appropriate than narrow and wide or

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-25 Thread Stephen J. Turnbull
M.-A. Lemburg writes: That would be a possibility as well... but I doubt that many users are going to bother, since slicing surrogates is just as bad as slicing combining code points and the latter are much more common in real life and they do happen to mostly live in the BMP. That's
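
The comparison between slicing surrogates and slicing combining code points can be illustrated with a small sketch. It assumes nothing beyond the standard `unicodedata` module; the sample string is an illustration and does not come from the thread.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT; both code points are in the BMP.
decomposed = "e\u0301"

# Slicing by index separates the base letter from its accent,
# even though no surrogate pair is involved.
base_only = decomposed[:1]

# A non-zero combining class marks the second item as a combining mark.
is_combining = unicodedata.combining(decomposed[1]) != 0

# NFC normalization folds the pair into the single precomposed code point.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), repr(base_only), is_combining, len(composed))
```

Normalization only helps where a precomposed form exists, so this is an illustration of the problem rather than a general fix.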

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-25 Thread Stephen J. Turnbull
M.-A. Lemburg writes: Please note that we can only provide one way of string indexing in Python using the standard s[1] notation and since we do want that operation to be fast and no more than O(1), using the code units as items is the only reasonable way to implement it. AFAICT, the

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-25 Thread Glyph Lefkowitz
On Nov 24, 2010, at 4:03 AM, Stephen J. Turnbull wrote: You end up proliferating types that all do the same kind of thing. Judicious use of inheritance helps, but getting the fundamental abstraction right is hard. Or at least, Emacs hasn't found it in 20 years of trying. Emacs hasn't even

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-25 Thread Glyph Lefkowitz
On Nov 24, 2010, at 10:55 PM, Stephen J. Turnbull wrote: Greg Ewing writes: On 24/11/10 22:03, Stephen J. Turnbull wrote: But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Stephen J. Turnbull
James Y Knight writes: a) You seem to be hung up on implementation details of emacs. Hung up? No. It's the program whose text model I know best, and even if its design could theoretically be a lot better for this purpose, I can't say I've seen a real program whose model is obviously better for

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Stephen J. Turnbull
James Y Knight writes: But, now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly superior [...] because it is an ASCII superset, and thus more easily compatible with other software. That also makes it most commonly used for internet communication. Sure, UTF-8 is very nice as a

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Antoine Pitrou
On Wed, 24 Nov 2010 18:51:49 +0900 Stephen J. Turnbull step...@xemacs.org wrote: James Y Knight writes: But, now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly superior [...] because it is an ASCII superset, and thus more easily compatible with other software. That also makes

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Alexander Belopolsky
On Tue, Nov 23, 2010 at 2:18 PM, Amaury Forgeot d'Arc amaur...@gmail.com wrote: .. Given the apparent difficulty of writing even basic text processing algorithms in presence of surrogate pairs, I wonder how wise it is to expose Python users to them. This was already discussed two years ago:

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread M.-A. Lemburg
Alexander Belopolsky wrote: To conclude, I feel that rather than trying to fully support non-BMP characters as surrogate pairs in narrow builds, we should make it easier for application developers to avoid them. I don't understand what you're after here. Programmers can easily avoid them by

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Alexander Belopolsky
On Wed, Nov 24, 2010 at 1:50 PM, M.-A. Lemburg m...@egenix.com wrote: .. add an option for decoders that currently produce surrogate pairs to treat non-BMP characters as errors and handle them according to user's choice. But what do you gain by doing this ? You'd lose the round-trip safety

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Greg Ewing
On 24/11/10 13:22, James Y Knight wrote: Instead, provide bidirectional iterators which can traverse the string by byte, codepoint, or by grapheme Maybe it would be a good idea to add some iterators like this to Python. (Or has the time machine beaten me there?) -- Greg

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Stephen J. Turnbull
Alexander Belopolsky writes: Any non-trivial text processing is likely to be broken in presence of surrogates. If you're worried about this, write a UCS-2-producing codec that rejects surrogates or stuffs them into the private zone of the BMP. Maybe such a codec should be default, but so far

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Greg Ewing
On 24/11/10 22:03, Stephen J. Turnbull wrote: But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward way (just copying the whole iterator object to hang on to its state) becomes

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Greg Ewing
On 25/11/10 06:37, Alexander Belopolsky wrote: I don't think there is a recipe on how to fix legacy character-by-character processing loop such as for c in string: ... to make it iterate over code points consistently in wide and narrow builds. A couple of possibilities: 1) Make
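
The snippet above is cut off before the proposed possibilities, so the following is not Greg Ewing's recipe. It is only a minimal sketch of one way to iterate over code points when the underlying items are 16-bit code units, as in a narrow build. The helper names `code_units` and `iter_code_points` are hypothetical.

```python
import struct

def code_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def iter_code_points(units):
    """Yield code points from 16-bit code units, joining surrogate pairs."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one non-BMP code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code point, or a lone surrogate passed through unchanged.
            yield u
            i += 1

units = code_units("a\U00010140b")
print([hex(u) for u in units])
print([hex(cp) for cp in iter_code_points(units)])
```

Passing lone surrogates through unchanged is a design choice made here for simplicity; a stricter variant could raise an error instead.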

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Stephen J. Turnbull
Greg Ewing writes: On 24/11/10 22:03, Stephen J. Turnbull wrote: But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward way (just copying the whole iterator object to

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Alexander Belopolsky
On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull step...@xemacs.org wrote: .. I note that an opinion has been raised on this thread that if we want compressed internal representation for strings, we should use UTF-8. I tend to agree, but UTF-8 has been repeatedly rejected as too

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread Terry Reedy
On 11/24/2010 3:06 PM, Alexander Belopolsky wrote: Any non-trivial text processing is likely to be broken in presence of surrogates. Producing them on input is just trading known issue for an unknown one. Processing surrogate pairs in python code is hard. Software that has to support non-BMP

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Stephen J. Turnbull
Terry Reedy writes: Yes. As I read the standard, UCS-2 is limited to BMP chars. Et tu, Terry? OK, I change my vote on the suggestion of UCS2 to -1. If a couple of conscientious blokes like you and David both understand it that way, I can't see any way to fight it. FWIW, ISO/IEC 10646 (which

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Stephen J. Turnbull
If you don't care about the ISO standard, but only about Python, Martin's right, I was wrong. You can stop reading now. <wink> Martin v. Löwis writes: I could only find the FCD of 10646:2010, where annex H was integrated into section 10: Thank you for the reference. I referred to two older

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Stephen J. Turnbull
Martin v. Löwis writes: I disagree: Quoting from Unicode 5.0, section 5.4: # The individual components of implementations may have different # levels of support for surrogates, as long as those components are # assembled and communicate correctly. Assembly is the problem. If chr() or

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Stephen J. Turnbull
Nick Coghlan writes: For practical purposes, UCS2/UCS4 convey far more inherent information than narrow/wide: That was my stance, but in fact (1) the ISO JTC1/SC2 has deliberately made them ambiguous by changing their definitions over the years[1], and (2) the more recent definitions and

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Alexander Belopolsky
On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger raymond.hettin...@gmail.com wrote: .. Any explanation we give users needs to let them know two things: * that we cover the entire range of unicode not just BMP * that sometimes len(chr(i)) is one and sometimes two This discussion motivated me

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Amaury Forgeot d'Arc
2010/11/23 Alexander Belopolsky alexander.belopol...@gmail.com: This discussion motivated me to start looking into how well Python library itself is prepared to deal with len(chr(i)) = 2.  I was not surprised to find that textwrap does not handle the issue that well: len(wrap(' \U00010140' *

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread M.-A. Lemburg
Alexander Belopolsky wrote: On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger raymond.hettin...@gmail.com wrote: .. Any explanation we give users needs to let them know two things: * that we cover the entire range of unicode not just BMP * that sometimes len(chr(i)) is one and sometimes two

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Terry Reedy
On 11/23/2010 2:11 PM, Alexander Belopolsky wrote: This discussion motivated me to start looking into how well Python library itself is prepared to deal with len(chr(i)) = 2. I was not Good idea! surprised to find that textwrap does not handle the issue that well: len(wrap(' \U00010140'

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Greg Ewing
Alexander Belopolsky wrote: Because the most commonly used characters are all in the Basic Multilingual Plane, converting between surrogate pairs and the original values is often not tested thoroughly. This leads to persistent bugs, and potential security holes, even in popular and

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread James Y Knight
On Nov 23, 2010, at 6:49 PM, Greg Ewing wrote: Maybe Python should have used UTF-8 as its internal unicode representation. Then people who were foolish enough to assume one character per string item would have their programs break rather soon under only light unicode testing. :-) You put a

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Glyph Lefkowitz
On Nov 23, 2010, at 7:22 PM, James Y Knight wrote: On Nov 23, 2010, at 6:49 PM, Greg Ewing wrote: Maybe Python should have used UTF-8 as its internal unicode representation. Then people who were foolish enough to assume one character per string item would have their programs break rather

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Stephen J. Turnbull
Alexander Belopolsky writes: Yet finding a bug in a str object method after a 5 min review was a bit discouraging:

>>> 'xyz'.center(20, '\U00010140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Stephen J. Turnbull
James Y Knight writes: You put a smiley, but, in all seriousness, I think that's actually the right thing to do if anyone writes a new programming language. It is clearly the right thing if you don't have to be concerned with backwards-compatibility: nobody really needs to be able to

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Glyph Lefkowitz
On Nov 23, 2010, at 9:44 PM, Stephen J. Turnbull wrote: James Y Knight writes: You put a smiley, but, in all seriousness, I think that's actually the right thing to do if anyone writes a new programming language. It is clearly the right thing if you don't have to be concerned with

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread Stephen J. Turnbull
Note that I'm not saying that there shouldn't be a UTF-8 string type; I'm just saying that for some purposes it might be a good idea to keep UTF-16 and UTF-32 string types around. Glyph Lefkowitz writes: The theory is that accessing the first character of a region in a string often occurs

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread James Y Knight
On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote: Or you can give user programs memory indices, and enjoy the fun as the poor developers do things like pos += 1 which works fine on the ASCII data they have lying around, then wonder why they get Unicode errors when they take substrings.
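
The `pos += 1` hazard described in the quoted text can be reproduced with UTF-8 bytes. This is a minimal sketch under the assumption that "memory indices" means byte offsets into UTF-8 data. The helper `decodes_cleanly` and the sample word are hypothetical.

```python
def decodes_cleanly(data, pos):
    """True if the UTF-8 bytes from offset pos onward decode without error."""
    try:
        data[pos:].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

ascii_data = "naive".encode("utf-8")
accented = "na\u00efve".encode("utf-8")   # U+00EF takes two bytes in UTF-8

pos = 2     # byte offset of the third character in both strings
pos += 1    # fine for ASCII, but lands mid-character in the accented data

print(decodes_cleanly(ascii_data, pos))   # every byte is a character boundary
print(decodes_cleanly(accented, pos))     # offset 3 is a continuation byte
```

The same increment is harmless on the ASCII data and breaks on the accented data, which matches the failure mode the quoted text describes.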

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread James Y Knight
On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote: By the way, to send the ball back into your court, I have this feeling that the demand for UTF-8 is once again driven by native English speakers who are very shortly going to find themselves, and the data they are most familiar with,

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Martin v. Löwis
Unicode 5.0, Chapter 3, verse C9: When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code sequences. A Unicode-conforming Python implementation would error at the chr() call, or perhaps

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Stephen J. Turnbull
Martin v. Löwis writes: More interestingly (and to the subject) is chr: how did you arrive at C9 banning Python3's definition of chr? This chr function puts the code sequence into well-formed UTF-16; that's the whole point of UTF-16. No, it doesn't, in the specific case of surrogate code
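
The disputed case, chr() applied to a surrogate code point, can be probed directly. The sketch below reflects current CPython (3.4 and later), where the UTF-16 encoders reject surrogate code points. It does not reproduce the 2010 narrow-build behaviour under discussion, and the helper name `is_well_formed` is hypothetical.

```python
def is_well_formed(s):
    """True if s can be encoded as UTF-16, i.e. it holds no lone surrogates."""
    try:
        s.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False

lone = chr(0xD800)       # chr() accepts a surrogate code point without error
astral = chr(0x10140)    # an ordinary non-BMP character

print(is_well_formed(lone))
print(is_well_formed(astral))
```

In this sketch the check happens at encoding time rather than at the chr() call, which is one of the alternatives the conformance argument turns on.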

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Stephen J. Turnbull
Raymond Hettinger writes: Neither UTF-16 nor UCS-2 is exactly correct anyway. From a standards lawyer point of view, UCS-2 is exactly correct, as far as I can tell upon rereading ISO 10646-1, especially Annexes H (retransmitting devices) and Q (UTF-16). Annex Q makes it clear that UTF-16 was

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Martin v. Löwis
Am 22.11.2010 11:47, schrieb Stephen J. Turnbull: Martin v. Löwis writes: More interestingly (and to the subject) is chr: how did you arrive at C9 banning Python3's definition of chr? This chr function puts the code sequence into well-formed UTF-16; that's the whole point of UTF-16.

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Martin v. Löwis
Am 22.11.2010 11:48, schrieb Stephen J. Turnbull: Raymond Hettinger writes: Neither UTF-16 nor UCS-2 is exactly correct anyway. From a standards lawyer point of view, UCS-2 is exactly correct, as far as I can tell upon rereading ISO 10646-1, especially Annexes H (retransmitting devices)

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread M.-A. Lemburg
Martin, it is really irrelevant whether the standards have decided to no longer use the terms UCS-2 and UCS-4 in their latest standard documents. The definitions still stand (just like Unicode 2.0 is still a valid standard, even if it's ten years old): * UCS-2 is defined as Universal Character

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread James Y Knight
Why don't y'all just call them --unichar-width=16/32? That describes precisely what the options do, and doesn't invite any quibbling over definitions. James

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Nick Coghlan
On Mon, Nov 22, 2010 at 10:47 PM, M.-A. Lemburg m...@egenix.com wrote: Please also note that we have used the terms UCS-2 and UCS-4 in Python2 for 9+ years now and users are just starting to learn the difference and get acquainted with the fact that Python uses these two forms. Confronting

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Alexander Belopolsky
On Mon, Nov 22, 2010 at 10:37 AM, Nick Coghlan ncogh...@gmail.com wrote: .. *(The first Google hit for ucs2 is the UTF-16/UCS-2 article on Wikipedia, the first hit for ucs4 is the UTF-32/UCS-4 article) Do you think these articles are helpful for someone learning how to use chr() and ord() in

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Nick Coghlan
On Tue, Nov 23, 2010 at 2:03 AM, Alexander Belopolsky alexander.belopol...@gmail.com wrote: On Mon, Nov 22, 2010 at 10:37 AM, Nick Coghlan ncogh...@gmail.com wrote: .. *(The first Google hit for ucs2 is the UTF-16/UCS-2 article on Wikipedia, the first hit for ucs4 is the UTF-32/UCS-4 article)

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Alexander Belopolsky
On Mon, Nov 22, 2010 at 11:13 AM, Nick Coghlan ncogh...@gmail.com wrote: .. Do you think these articles are helpful for someone learning how to use chr() and ord() in Python for the first time? No, that's what the documentation of chr() and ord() is for. For that use case, it doesn't matter

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread R. David Murray
On Mon, 22 Nov 2010 12:00:14 -0500, Alexander Belopolsky alexander.belopol...@gmail.com wrote: I recently updated chr() and ord() documentation and used narrow/wide terms. I thought USC2/4 proponents objected to that on the basis that these terms are imprecise. For reference, a grep in

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Alexander Belopolsky
On Mon, Nov 22, 2010 at 12:30 PM, R. David Murray rdmur...@bitdance.com wrote: .. For reference, a grep in py3k/Doc reveals that there are currently exactly 23 lines mentioning UCS2 or UCS4 in the docs. Did you grep for USC-2 and USC-4 as well? I have to admit that my aversion to these terms

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Terry Reedy
On 11/22/2010 5:48 AM, Stephen J. Turnbull wrote: I disagree. I do see a problem with UCS-2, because it fails to tell us that Python implements a large number of features that make it easy to do a very good job of working with non-BMP data in 16-bit builds of Yes. As I read the standard,

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Raymond Hettinger
On Nov 22, 2010, at 2:48 AM, Stephen J. Turnbull wrote: Raymond Hettinger writes: Neither UTF-16 nor UCS-2 is exactly correct anyway. From a standards lawyer point of view, UCS-2 is exactly correct, You're twisting yourself into definitional knots. Any explanation we give users needs to

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Raymond Hettinger
On Nov 22, 2010, at 9:41 AM, Terry Reedy wrote: On 11/22/2010 5:48 AM, Stephen J. Turnbull wrote: I disagree. I do see a problem with UCS-2, because it fails to tell us that Python implements a large number of features that make it easy to do a very good job of working with non-BMP data

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread M.-A. Lemburg
Raymond Hettinger wrote: Any explanation we give users needs to let them know two things: * that we cover the entire range of unicode not just BMP * that sometimes len(chr(i)) is one and sometimes two The term UCS-2 is a complete communications failure in that regard. If someone looks up

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread Alexander Belopolsky
On Mon, Nov 22, 2010 at 12:41 PM, Terry Reedy tjre...@udel.edu wrote: .. What Python does might be called USC-2+ or UCS-2e (xtended). Wow! I am not the only one who can't get the order of letters right in these acronyms. (I am usually consistent within one sentence, though.) :-)

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread R. David Murray
On Mon, 22 Nov 2010 12:37:59 -0500, Alexander Belopolsky alexander.belopol...@gmail.com wrote: On Mon, Nov 22, 2010 at 12:30 PM, R. David Murray rdmur...@bitdance.com wrote: .. For reference, a grep in py3k/Doc reveals that there are currently exactly 23 lines mentioning UCS2 or UCS4 in

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Stephen J. Turnbull
Martin v. Löwis writes: Am 20.11.2010 05:11, schrieb Stephen J. Turnbull: Martin v. Löwis writes: The term UCS-2 is a character set that can encode only 65536 characters; it thus refers to Unicode 1.1. According to the Unicode Consortium's FAQ, the term UCS-2 should

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread R. David Murray
On Sun, 21 Nov 2010 21:55:12 +0900, Stephen J. Turnbull step...@xemacs.org wrote: Martin v. Löwis writes: Am 20.11.2010 05:11, schrieb Stephen J. Turnbull: Martin v. Löwis writes: The term UCS-2 is a character set that can encode only 65536 characters; it thus

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Raymond Hettinger
On Nov 21, 2010, at 9:38 AM, R. David Murray wrote: I'm sorry, but I have to disagree. As a relative unicode ignoramus, UCS-2 and UCS-4 convey almost no information to me, and the bits I have heard about them on this list have only confused me. From the users point of view, it doesn't

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Martin v. Löwis
I disagree. Python does conform to UTF-16 I'm sure the codecs do. But the Unicode standard doesn't care about the parts of the process, it cares about what it does as a whole. Chapter and verse? Python's internal coding does not conform to UTF-16, and that internal coding can, under

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread R. David Murray
On Sun, 21 Nov 2010 10:17:57 -0800, Raymond Hettinger raymond.hettin...@gmail.com wrote: On Nov 21, 2010, at 9:38 AM, R. David Murray wrote: I'm sorry, but I have to disagree. As a relative unicode ignoramus, UCS-2 and UCS-4 convey almost no information to me, and the bits I have heard

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Alexander Belopolsky
On Fri, Nov 19, 2010 at 4:43 PM, Martin v. Löwis mar...@v.loewis.de wrote: In my opinion, the question is more why it was not fixed in Python2. I suppose that the answer is something ugly like backward compatibility or historical reasons :-) No, there was a deliberate decision to not

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Stephen J. Turnbull
Martin v. Löwis writes: Chapter and verse? Unicode 5.0, Chapter 3, verse C9: When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code sequences. I think anything called UTF-8 something is likely to

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Stephen J. Turnbull
R. David Murray writes: I'm sorry, but I have to disagree. As a relative unicode ignoramus, UCS-2 and UCS-4 convey almost no information to me, and the bits I have heard about them on this list have only confused me. OK, point taken. On the other hand, I understand that 'narrow' means

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-20 Thread Martin v. Löwis
Am 20.11.2010 05:11, schrieb Stephen J. Turnbull: Martin v. Löwis writes: The term UCS-2 is a character set that can encode only 65536 characters; it thus refers to Unicode 1.1. According to the Unicode Consortium's FAQ, the term UCS-2 should be avoided these days. So what do

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-20 Thread Alexander Belopolsky
On Sat, Nov 20, 2010 at 4:05 AM, Martin v. Löwis mar...@v.loewis.de wrote: .. A technical correct description would be to say that Python uses either 16-bit code units or 32-bit code units; for brevity, these can be called narrow and wide code units. +1 PEP 261 introduced terms wide

[Python-Dev] len(chr(i)) = 2?

2010-11-19 Thread Alexander Belopolsky
I was recently surprised to learn that chr(i) can produce a string of length 2 in python 3.x. I suspect that I am not alone finding this behavior non-obvious given that a mistake in Python manual stating the contrary survived several releases. [1] Note that I am not arguing that the change was
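
The length-2 result comes from counting UTF-16 code units rather than code points. The following minimal sketch counts code units explicitly, so it gives the same answer on any build. The helper name `utf16_len` is hypothetical and the sample code points are illustrative.

```python
def utf16_len(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

bmp = chr(0x20AC)       # EURO SIGN, inside the BMP
astral = chr(0x10140)   # a code point outside the BMP

print(utf16_len(bmp))     # 1
print(utf16_len(astral))  # 2
```

On interpreters from Python 3.3 onward, `len()` itself counts code points, so `len(chr(0x10140))` is 1 there; the helper shows what a build that stores 16-bit code units would report.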

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-19 Thread Antoine Pitrou
On Fri, 19 Nov 2010 11:53:58 -0500 Alexander Belopolsky alexander.belopol...@gmail.com wrote: Since this feature will be first documented in the Library Reference in 3.2, I wonder if it will be appropriate to mention it in What's new in 3.2? No, since it's not new in 3.2. No need to further

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-19 Thread Victor Stinner
Hi, On Friday 19 November 2010 17:53:58 Alexander Belopolsky wrote: I was recently surprised to learn that chr(i) can produce a string of length 2 in python 3.x. Yes, but only on narrow build. E.g. Debian and Ubuntu compile Python 3.1 in wide mode (sys.maxunicode == 1114111). I suspect that
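
The `sys.maxunicode` check mentioned above can be wrapped in a small helper. This is a minimal sketch; the function name `build_kind` is hypothetical, and the "narrow" and "wide" labels follow the terminology used in the thread.

```python
import sys

def build_kind():
    """Classify the running interpreter by its largest supported code point."""
    # Narrow builds reported 0xFFFF; wide builds report 0x10FFFF (1114111).
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(sys.maxunicode, build_kind())
```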

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-19 Thread Martin v. Löwis
In my opinion, the question is more why it was not fixed in Python2. I suppose that the answer is something ugly like backward compatibility or historical reasons :-) No, there was a deliberate decision to not support that, see http://www.python.org/dev/peps/pep-0261/ There had been a

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-19 Thread M.-A. Lemburg
Victor Stinner wrote: Hi, On Friday 19 November 2010 17:53:58 Alexander Belopolsky wrote: I was recently surprised to learn that chr(i) can produce a string of length 2 in python 3.x. Yes, but only on narrow build. E.g. Debian and Ubuntu compile Python 3.1 in wide mode (sys.maxunicode ==

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-19 Thread Martin v. Löwis
It's rather common to confuse a transfer encoding with a storage format. UCS2 and UCS4 refer to code units (the storage format). Actually, they don't. Instead, they refer to coded character sets, in W3C terminology: mapping of characters to natural numbers. See

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-19 Thread Stephen J. Turnbull
Martin v. Löwis writes: The term UCS-2 is a character set that can encode only 65536 characters; it thus refers to Unicode 1.1. According to the Unicode Consortium's FAQ, the term UCS-2 should be avoided these days. So what do you propose we call the Python implementation? You can