Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-09 Thread fwierzbi...@gmail.com
On Thu, Sep 8, 2011 at 10:39 PM, Terry Reedy tjre...@udel.edu wrote: On 9/8/2011 6:15 PM, fwierzbi...@gmail.com wrote: Oops, forgot to add the link for the gory details for Java and 2 byte unicode: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ This is dated 2004.

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-09 Thread Terry Reedy
On 9/9/2011 12:12 PM, fwierzbi...@gmail.com wrote: On Thu, Sep 8, 2011 at 10:39 PM, Terry Reedy tjre...@udel.edu wrote: On 9/8/2011 6:15 PM, fwierzbi...@gmail.com wrote: Oops, forgot to add the link for the gory details for Java and 2 byte unicode:

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-09 Thread fwierzbi...@gmail.com
On Fri, Sep 9, 2011 at 10:16 AM, Terry Reedy tjre...@udel.edu wrote: I am curious how you index by code point rather than code unit with 16-bit code units and how it compares with the method I posted. Is there anything I can read? Reply off list if you want. I'll post on-list until someone

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-09 Thread Guido van Rossum
I, for one, am very interested. It sounds like the 'unicode' datatype in Jython does not in fact have O(1) indexing characteristics if the string contains any characters in the astral plane. Interesting. I wonder if you have heard from anyone about this affecting their app's performance? --Guido

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-09 Thread fwierzbi...@gmail.com
On Fri, Sep 9, 2011 at 2:21 PM, Guido van Rossum gu...@python.org wrote: I, for one, am very interested. It sounds like the 'unicode' datatype in Jython does not in fact have O(1) indexing characteristics if the string contains any characters in the astral plane. Interesting. I wonder if you

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-09 Thread Guido van Rossum
Well, I'd be interested in how it goes, since if Jython users find this acceptable then maybe we shouldn't be quite so concerned about it for CPython... On the third hand we don't have working code for this approach in CPython, while we do have working code for the PEP 393 solution... --Guido On

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-09 Thread Terry Reedy
On 9/9/2011 5:21 PM, Guido van Rossum wrote: I, for one, am very interested. It sounds like the 'unicode' datatype in Jython does not in fact have O(1) indexing characteristics if the string contains any characters in the astral plane. Interesting. I wonder if you have heard from anyone about

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-08 Thread fwierzbi...@gmail.com
On Fri, Aug 26, 2011 at 3:00 PM, Guido van Rossum gu...@python.org wrote: I have a different question about IronPython and Jython now. Do their regular expression libraries support Unicode better than CPython's? E.g. does . match a surrogate pair? Tom C suggests that Java's regex libraries get

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-08 Thread fwierzbi...@gmail.com
Oops, forgot to add the link for the gory details for Java and 2 byte unicode: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-08 Thread Terry Reedy
On 9/8/2011 6:15 PM, fwierzbi...@gmail.com wrote: Oops, forgot to add the link for the gory details for Java and 2 byte unicode: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ This is dated 2004. Basically, they considered several options, tried out 4, and ended up

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-05 Thread Éric Araujo
On 02/09/2011 05:59, Stephen J. Turnbull wrote: I believe that the deprecation of the digraphs as separate letters occurred as the telephone became widely used in Spain, and the telephone company demanded an official proclamation from whatever Ministry is responsible for culture that it was

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-02 Thread Stefan Behnel
Greg Ewing, 02.09.2011 02:36: Guido van Rossum wrote: But in a word like coëxistentie (coexistence) the o and e do not form the oe-sound, and to emphasize this to Dutch readers (who believe their spelling is very logical :-), the official spelling puts the umlaut on the e. Sometimes this is

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-02 Thread Terry Reedy
On 9/1/2011 11:59 PM, Stephen J. Turnbull wrote: I believe that the deprecation of the digraphs as separate letters occurred as the telephone became widely used in Spain, and the telephone company demanded an official proclamation from whatever Ministry is responsible for culture that it was OK

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-02 Thread Zvezdan Petkovic
On Sep 1, 2011, at 9:30 PM, Steven D'Aprano wrote: Antoine Pitrou wrote: On Thursday, 1 September 2011 at 08:45 -0700, Guido van Rossum wrote: This is definitely thought of as a separate mark added to the e; ë is not a new letter. I have a feeling it's the same way for the French and

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-02 Thread Tres Seaver
On 09/01/2011 11:59 PM, Stephen J. Turnbull wrote: Tres Seaver writes: FWIW, I was taught that Spanish had 30 letters in the alfabeto: the 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters. That was always a Castellano

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-02 Thread Greg Ewing
Terry Reedy wrote: While it has apparently been criticized as 'conservative' (which it well ought to be), it has been rather progressive in promoting changes such as 'ph' to 'f' (fisica, fone) and dropping silent 'p' in leading 'psi' (sicologia) and silent 's' in leading 'sci' (ciencia). I

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-02 Thread Stephen J. Turnbull
Greg Ewing writes: I find it curious that pronunciation always seems to take precedence over spelling in campaigns like this. Nowadays, especially with the internet increasingly taking over from personal interaction, we probably see words written a lot more often than we hear them

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Where I cut your words, we are in 100% agreement. (FWIW :-) Guido van Rossum writes: On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull step...@xemacs.org wrote: Well, that's why I wrote "intended to be suggestive". The Unicode Standard does not specify at all what the internal

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Neil Hodgson
Glenn Linderman: How many different iterators into the same text would be concurrently needed by an application?  And why? Seems like if it is dealing with text at the level of grapheme clusters, it needs that type of iterator.  Of course, if it does I/O it needs codec access, but that is by
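
To make the idea of a grapheme-level iterator concrete, here is a minimal Python sketch. It is an illustration only, not code from the thread: the helper name `clusters` is hypothetical, and the rule it applies (attach each combining mark to the preceding base character) is a deliberate simplification of the full grapheme cluster rules in Unicode Standard Annex #29.

```python
import unicodedata


def clusters(text):
    """Yield simplified grapheme clusters: a base character plus any
    combining marks that follow it. This is NOT full UAX #29 segmentation."""
    cluster = ""
    for ch in text:
        # A character with combining class 0 starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


# "coe" + COMBINING DIAERESIS + "x": five code points, four clusters.
word = "coe\u0308x"
parts = list(clusters(word))
```

An iterator of this kind walks the string sequentially, which is why the thread treats it as a separate concern from O(1) indexing of the underlying array.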

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Glenn Linderman writes: I found your discussion of streams versus arrays, as separate concepts related to Unicode, along with Terry's bisect indexing implementation, to be rather inspiring. Just because Unicode defines streams of codeunits of various sizes (UTF-8, UTF-16, UTF-32) to

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Neil Hodgson
Stephen J. Turnbull: ...  Eg, this is why the common GUIs for Unix (X.org, GTK+, and Qt) either provide or require UTF-8 coding for their text. Qt uses UTF-16 for its basic QString type. While QString is mostly treated as a black box which you can create from input buffers in any encoding,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Glenn Linderman writes: How many different iterators into the same text would be concurrently needed by an application? And why? A WYSIWYG editor for structured text (TeX, HTML) might want two (at least), one for the source window and one for the rendered window. One might want to save the

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Glenn Linderman
On 9/1/2011 2:15 AM, Stephen J. Turnbull wrote: Glenn Linderman writes: How many different iterators into the same text would be concurrently needed by an application? And why? A WYSIWYG editor for structured text (TeX, HTML) might want two (at least), one for the source window and

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Glenn Linderman
On 9/1/2011 12:59 AM, Stephen J. Turnbull wrote: Glenn Linderman writes: We can either artificially constrain ourselves to minor tweaks of the legal conforming bytestreams, It's not artificial. Having the internal representation be the same as a standard encoding is very useful for a

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Hagen Fürstenau
Ok, I thought there was also a form normalized (denormalized?) to decomposed form. But I'll take your word. If I understood the example correctly, he needs a mixed form, with some characters decomposed and some composed (depending on which one looks better in the given font). I agree that this

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Guido van Rossum
On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull step...@xemacs.org wrote: Where I cut your words, we are in 100% agreement.  (FWIW :-) Not quite the same here, but I don't feel the need to have the last word. Most of what you say makes sense, in some cases we'll quibble later, but there are

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou
On Thursday, 1 September 2011 at 08:45 -0700, Guido van Rossum wrote: This is definitely thought of as a separate mark added to the e; ë is not a new letter. I have a feeling it's the same way for the French and Germans, but I really don't know. (Antoine? Georg?) Indeed, they are not separate

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Guido van Rossum
On Thu, Sep 1, 2011 at 9:03 AM, Antoine Pitrou solip...@pitrou.net wrote: On Thursday, 1 September 2011 at 08:45 -0700, Guido van Rossum wrote: This is definitely thought of as a separate mark added to the e; ë is not a new letter. I have a feeling it's the same way for the French and Germans,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou
The example given was someone who didn't agree with how a particular font rendered those accented characters. I agree that's obscure though. I recall long ago that when the French wrote words in all caps they would drop the accents, e.g. ECOLE. I even recall (through the mists of time)

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stefan Behnel
Guido van Rossum, 01.09.2011 18:31: On Thu, Sep 1, 2011 at 9:03 AM, Antoine Pitrou wrote: On Thursday, 1 September 2011 at 08:45 -0700, Guido van Rossum wrote: This is definitely thought of as a separate mark added to the e; ë is not a new letter. I have a feeling it's the same way for the

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stefan Behnel
Antoine Pitrou, 01.09.2011 18:46: AFAIR, road signs also keep the accents, but I'm no driver Right, I noticed that, too. That's certainly not uncommon. I think it's mostly because of local pride (after all, the road signs are all that many drivers ever see of a city), but sometimes also

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Terry Reedy
On 9/1/2011 11:45 AM, Guido van Rossum wrote: typewriter). Dutch does have one native use of the umlaut (though it has a different name, I forget which, maybe trema :-), You remember correctly. According to https://secure.wikimedia.org/wikipedia/en/wiki/Trema_%28diacritic%29 'trema' (Greek

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Glenn Linderman writes: Windows 7 64-bit on one of my computers happily crashes several times a day when it detects inconsistent internal state... under the theory, I guess, that losing work is better than saving bad work. You sound the opposite. Definitely. Windows apps habitually

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Guido van Rossum writes: On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull step...@xemacs.org wrote: while at least this Spanish-as-a-second-language learner was taught that `ñ' is an atomic character represented by a discontiguous glyph, like `i', and it is no more related to

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou
Finally, my guess is that the Spanish emphasis on ñ as a separate letter has to do with teaching how it has a separate position in the localized collation sequence, doesn't it? You'd have to ask Mr. Gonzalez. I suspect he may have taught that way less because of his Castellano

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Tres Seaver
On 09/01/2011 02:54 PM, Antoine Pitrou wrote: If you look at Wikipedia, it says: “El alfabeto español consta de 27 letras”. The Ñ is separate from the N (and so it is in my French-Spanish dictionary). The accented letters, however, are not

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Ethan Furman
Tres Seaver wrote: On 09/01/2011 02:54 PM, Antoine Pitrou wrote: If you look at Wikipedia, it says: “El alfabeto español consta de 27 letras”. The Ñ is separate from the N (and so it is in my French-Spanish dictionary). The accented letters,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou
On Thu, 01 Sep 2011 12:38:07 -0700 Ethan Furman et...@stoneleaf.us wrote: FWIW, I was taught that Spanish had 30 letters in the alfabeto: the 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters. Kids-these-days'ly, Not sure what's going on, but according to the

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Greg Ewing
Guido van Rossum wrote: I recall long ago that when the French wrote words in all caps they would drop the accents, e.g. ECOLE. I even recall (through the mists of time) observing this in Paris on public signs. Is this still the convention? This page features a number of French street signs

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Greg Ewing
Guido van Rossum wrote: But in a word like coëxistentie (coexistence) the o and e do not form the oe-sound, and to emphasize this to Dutch readers (who believe their spelling is very logical :-), the official spelling puts the umlaut on the e. Sometimes this is done in English too --

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou
On Fri, 02 Sep 2011 12:30:12 +1200 Greg Ewing greg.ew...@canterbury.ac.nz wrote: Guido van Rossum wrote: I recall long ago that when the French wrote words in all caps they would drop the accents, e.g. ECOLE. I even recall (through the mists of time) observing this in Paris on public

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Greg Ewing
Terry Reedy wrote: Too bad, since people sometimes use '-' to serve the same purpose. Which actually seems more logical to me -- a separating symbol is better placed between the things being separated, rather than over the top of one of them! Maybe we could compromise by turning the diaeresis

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Steven D'Aprano
Antoine Pitrou wrote: On Thursday, 1 September 2011 at 08:45 -0700, Guido van Rossum wrote: This is definitely thought of as a separate mark added to the e; ë is not a new letter. I have a feeling it's the same way for the French and Germans, but I really don't know. (Antoine? Georg?) Indeed,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Tres Seaver writes: FWIW, I was taught that Spanish had 30 letters in the alfabeto: the 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters. That was always a Castellano vs. Americano issue, IIRC. As I wrote, Mr. Gonzalez was Castellano. I believe that the deprecation

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Greg Ewing
Antoine Pitrou wrote: I don't think some American souvenir shop is a good reference, though :) (for example, there's no Paris street named château de Versailles) Hmmm, I'd assumed they were reproductions of actual street signs found in Paris, but maybe not. :-( -- Greg

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman
On 8/30/2011 11:03 PM, Stephen J. Turnbull wrote: Guido van Rossum writes: On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull step...@xemacs.org wrote: For starters, one that doesn't ever return lone surrogates, but rather interprets surrogate pairs as Unicode code points as

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Stephen J. Turnbull
Glenn Linderman writes: From comments Guido has made, he is not interested in changing the efficiency or access methods of the str type to raise the level of support of Unicode to the composed character, or grapheme cluster concepts. IMO, that would be a bad idea, as higher-level Unicode

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull step...@xemacs.org wrote: [me] That sounds like a contradiction -- it wouldn't be a UTF-16 array if you couldn't tell that it was using UTF-16. Well, that's why I wrote "intended to be suggestive". The Unicode Standard does not specify

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote: So from reading all this discussion, I think this point is rather a key one... and it has been made repeatedly in different ways:  Arrays are not suitable for manipulating Unicode character sequences, and the str type

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote: The str type itself can presently be used to process other character encodings: if they are fixed width 32-bit elements those encodings might be considered Unicode encodings, but there is no requirement that they

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman
On 8/31/2011 10:12 AM, Guido van Rossum wrote: On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote: So from reading all this discussion, I think this point is rather a key one... and it has been made repeatedly in different ways: Arrays are not suitable for

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 11:51 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote: On 8/31/2011 10:12 AM, Guido van Rossum wrote: On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote: So from reading all this discussion, I think this point is

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman
On 8/31/2011 11:56 AM, Guido van Rossum wrote: On Wed, Aug 31, 2011 at 11:51 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote: On 8/31/2011 10:12 AM, Guido van Rossum wrote: On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman v+pyt...@g.nevcal.com

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman
On 8/31/2011 10:20 AM, Guido van Rossum wrote: On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote: The str type itself can presently be used to process other character encodings: if they are fixed width 32-bit elements those encodings might be considered Unicode

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman
On 8/31/2011 5:21 AM, Stephen J. Turnbull wrote: Glenn Linderman writes: From comments Guido has made, he is not interested in changing the efficiency or access methods of the str type to raise the level of support of Unicode to the composed character, or grapheme cluster

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman
On 8/31/2011 10:10 AM, Guido van Rossum wrote: On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull step...@xemacs.org wrote: [me] That sounds like a contradiction -- it wouldn't be a UTF-16 array if you couldn't tell that it was using UTF-16. Well, that's why I wrote intended to be

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Terry Reedy
On 8/31/2011 1:10 PM, Guido van Rossum wrote: This is why I find the issue of Python, the language (and stdlib), as a whole conforming to the Unicode standard such a troublesome concept -- I think it is something that an application may claim, but the language should make much more modest

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Nick Coghlan
On Thu, Sep 1, 2011 at 8:02 AM, Terry Reedy tjre...@udel.edu wrote: On 8/31/2011 1:10 PM, Guido van Rossum wrote: Ok, I dig this, to some extent. However saying it is UCS-2 is equally bad. As I said on the tracker, our narrow builds are in-between (while moving closer to UTF-16), and both

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Neil Hodgson
Glenn Linderman: That said, regexp, or some sort of cursor on a string, might be a workable solution.  Will it have adequate performance?  Perhaps, at least for some applications.  Will it be as conceptually simple as indexing an array of graphemes?  No.  Will it ever reach the efficiency of

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 5:58 PM, Neil Hodgson nyamaton...@gmail.com wrote: [...] some text drawing engines draw decomposed characters (o followed by ̈ - ö) differently compared to their composite equivalents (ö) and this may be perceived as better or worse. I'd like to offer an option to
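
The composed and decomposed forms mentioned here can be converted with the standard library's `unicodedata.normalize`. The snippet below is a small illustration added for readers, not code from the thread; it uses "o" followed by U+0308 COMBINING DIAERESIS as the example from the message.

```python
import unicodedata

# Decomposed form: "o" followed by U+0308 COMBINING DIAERESIS.
decomposed = "o\u0308"

# NFC composes the pair into the single code point U+00F6 (ö).
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and decomposes U+00F6 again.
round_trip = unicodedata.normalize("NFD", composed)
```

Note that NFC and NFD normalize the whole string uniformly. The use case described in the message (composing only some characters, depending on how a font renders them) would need per-character logic on top of this.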

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Hagen Fürstenau
[...] some text drawing engines draw decomposed characters (o followed by ̈ - ö) differently compared to their composite equivalents (ö) and this may be perceived as better or worse. I'd like to offer an option to replace some decomposed characters with their composite equivalent before

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Neil Hodgson
Guido van Rossum: On Wed, Aug 31, 2011 at 5:58 PM, Neil Hodgson nyamaton...@gmail.com wrote: [...] some text drawing engines draw decomposed characters (o followed by ̈ - ö) differently compared to their composite equivalents (ö) and this may be perceived as better or worse. I'd like to

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 6:29 PM, Neil Hodgson nyamaton...@gmail.com wrote: Guido van Rossum: On Wed, Aug 31, 2011 at 5:58 PM, Neil Hodgson nyamaton...@gmail.com wrote: [...] some text drawing engines draw decomposed characters (o followed by ̈ - ö) differently compared to their composite

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman
On 8/31/2011 5:58 PM, Neil Hodgson wrote: Glenn Linderman: That said, regexp, or some sort of cursor on a string, might be a workable solution. Will it have adequate performance? Perhaps, at least for some applications. Will it be as conceptually simple as indexing an array of graphemes?

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-30 Thread Stephen J. Turnbull
Antoine Pitrou writes: On Mon, 29 Aug 2011 12:43:24 +0900 Stephen J. Turnbull step...@xemacs.org wrote: Since when can s[0] represent a code point outside the BMP, for s a Unicode string in a narrow build? Remember, the UCS-2/narrow vs. UCS-4/wide distinction is *not* about

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-30 Thread Antoine Pitrou
The problem with a narrow build (whether for space efficiency in CPython or for platform compatibility in Jython and IronPython) is not that we have no UTF-16 codecs. It's that array ops aren't UTF-16 conformant. Sorry, what is a conformant UTF-16 array op? Thanks Antoine.

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-30 Thread Stephen J. Turnbull
Antoine Pitrou writes: Sorry, what is a conformant UTF-16 array op? For starters, one that doesn't ever return lone surrogates, but rather interprets surrogate pairs as Unicode code points as in UTF-16. (This is not a Unicode standard definition, it's intended to be suggestive of why many app
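
As a rough illustration of the distinction being drawn, the Python sketch below contrasts raw UTF-16 code units with the code points obtained by combining surrogate pairs. It is not code from the thread; the helper names `utf16_units` and `code_points` are hypothetical, and the arithmetic is the standard UTF-16 surrogate formula.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def code_points(units):
    """Yield code points from UTF-16 code units, combining each
    high/low surrogate pair into one supplementary code point."""
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character (or an unpaired surrogate, passed through as-is).
            yield u
            i += 1


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
units = utf16_units("a\U0001D11Eb")
points = list(code_points(units))
```

Indexing `units` directly can return 0xD834 or 0xDD1E on their own, which is the lone-surrogate behaviour the message objects to; iterating with `code_points` never splits a valid pair.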

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-30 Thread Guido van Rossum
On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull step...@xemacs.org wrote: Antoine Pitrou writes: Sorry, what is a conformant UTF-16 array op? For starters, one that doesn't ever return lone surrogates, but rather interprets surrogate pairs as Unicode code points as in UTF-16. (This

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-30 Thread Stephen J. Turnbull
Guido van Rossum writes: On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull step...@xemacs.org wrote: For starters, one that doesn't ever return lone surrogates, but rather interprets surrogate pairs as Unicode code points as in UTF-16.  (This is not a Unicode standard definition,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-29 Thread Antoine Pitrou
On Mon, 29 Aug 2011 12:43:24 +0900 Stephen J. Turnbull step...@xemacs.org wrote: Since when can s[0] represent a code point outside the BMP, for s a Unicode string in a narrow build? Remember, the UCS-2/narrow vs. UCS-4/wide distinction is *not* about what Python supports vs. the outside

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-28 Thread Stephen J. Turnbull
Paul Moore writes: IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units, if they are to conform. [...] They lose the O(1)

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-28 Thread Stephen J. Turnbull
Guido van Rossum writes: I don't think anyone else has that impression. Please cite chapter and verse if you really think this is important. IIUC, UCS-2 does not allow surrogate pairs, In the original definition of UCS-2 in draft ISO 10646 (1990), everything in the BMP except for 0x

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-28 Thread Stephen J. Turnbull
Raymond Hettinger writes: The naming convention for codecs is that the UTF prefix is used for lossless encodings that cover the entire range of Unicode. Sure. The operative word here is codec, not str, though. The first amendment to the original edition of the UCS defined UTF-16, an

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-27 Thread Raymond Hettinger
On Aug 26, 2011, at 8:51 PM, Terry Reedy wrote: On 8/26/2011 8:42 PM, Guido van Rossum wrote: On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy tjre...@udel.edu wrote: My impression is that a UTF-16 implementation, to be properly called such, must do len and [] in terms of code points,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-27 Thread Terry Reedy
On 8/26/2011 8:23 PM, Antoine Pitrou wrote: I would only agree as long as it wasn't too much worse than O(1). O(log n) might be all right, but O(n) would be unacceptable, I think. It also depends a lot on *actual* measured performance Amen. Some regard O(n*n) sorts to be, by definition,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-27 Thread Steven D'Aprano
Terry Reedy wrote: On 8/26/2011 8:23 PM, Antoine Pitrou wrote: I would only agree as long as it wasn't too much worse than O(1). O(log n) might be all right, but O(n) would be unacceptable, I think. It also depends a lot on *actual* measured performance Amen. Some regard O(n*n) sorts to

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-27 Thread Martin v. Löwis
On 27.08.2011 09:40, Steven D'Aprano wrote: Terry Reedy wrote: On 8/26/2011 8:23 PM, Antoine Pitrou wrote: I would only agree as long as it wasn't too much worse than O(1). O(log n) might be all right, but O(n) would be unacceptable, I think. It also depends a lot on *actual* measured

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Paul Moore
On 26 August 2011 03:52, Guido van Rossum gu...@python.org wrote: I know that by now I am repeating myself, but I think it would be really good if we could get rid of this ambiguity. PEP 393 seems the best way forward, even if it doesn't directly address what to do for IronPython or Jython,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread M.-A. Lemburg
Stefan Behnel wrote: Isaac Morland, 26.08.2011 04:28: On Thu, 25 Aug 2011, Guido van Rossum wrote: I'm not sure what should happen with UTF-8 when it (in flagrant violation of the standard, I presume) contains two separately-encoded surrogates forming a valid surrogate pair; probably whatever

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Ezio Melotti
On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum gu...@python.org wrote: On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland ijmor...@uwaterloo.ca wrote: On Thu, 25 Aug 2011, Guido van Rossum wrote: I'm not sure what should happen with UTF-8 when it (in flagrant violation of the standard, I

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Martin v. Löwis
IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units, if they are to conform. That means that they won't conform, period. There

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Stefan Behnel
Martin v. Löwis, 26.08.2011 11:29: You seem to assume it is ok for Jython/IronPython to provide indexing in O(n). It is not. I think we can leave this discussion aside. Jython and IronPython have their own platform specific constraints to which they need to adapt their implementation. For a

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Antoine Pitrou
Why would PEP 393 apply to other implementations than CPython? Regards Antoine. On Fri, 26 Aug 2011 00:01:42 + Dino Viehland di...@microsoft.com wrote: Guido wrote: Which reminds me. The PEP does not say what other Python implementations besides CPython should do. presumably Jython

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Stefan Behnel
Antoine Pitrou, 26.08.2011 12:51: Why would PEP 393 apply to other implementations than CPython? Not the PEP itself, just the implications of the result. The question was whether the language specification in a post PEP-393 can (and if so, should) be changed into requiring unicode objects to

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 2:29 AM, Martin v. Löwis mar...@v.loewis.de wrote: IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel stefan...@behnel.de wrote: Martin v. Löwis, 26.08.2011 11:29: You seem to assume it is ok for Jython/IronPython to provide indexing in O(n). It is not. I think we can leave this discussion aside. (And yet, you keep arguing. :-) Jython and

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Paul Moore
On 26 August 2011 17:51, Guido van Rossum gu...@python.org wrote: On Fri, Aug 26, 2011 at 2:29 AM, Martin v. Löwis mar...@v.loewis.de wrote: (Regarding my comments on code point semantics) You seem to assume it is ok for Jython/IronPython to provide indexing in O(n). It is not. Indeed. On

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 10:13 AM, Paul Moore p.f.mo...@gmail.com wrote: On 26 August 2011 18:02, Guido van Rossum gu...@python.org wrote: Eek. No, please. Those platforms' native string types have length and slicing operations that are O(1) and work in terms of 16-bit code points. Python

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Stefan Behnel
Guido van Rossum, 26.08.2011 19:02: On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel wrote: Besides, what if these implementations provided indexing in, say, O(log N) instead of O(1) or O(N), e.g. by building a tree index into each string? You could have an index that simply marks runs of
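
A minimal sketch of the kind of index discussed above, under stated assumptions: the string is held as UTF-16 code units, and a sorted list of the code point positions of surrogate pairs is searched with binary search. That gives O(log N) code point indexing. It is a simplification for illustration, not a tree index and not the scheme any implementation actually uses; the class name `Utf16String` and its attributes are hypothetical.

```python
import bisect
import struct


class Utf16String:
    """Hypothetical UTF-16 backed string with code point indexing."""

    def __init__(self, text):
        # Store the text as a list of 16-bit code units.
        data = text.encode("utf-16-le", errors="surrogatepass")
        self.units = list(struct.unpack("<%dH" % (len(data) // 2), data))
        # Sorted code point indices at which a surrogate pair occurs.
        self.astral = []
        cp_index = 0
        pos = 0
        while pos < len(self.units):
            if self._is_pair(pos):
                self.astral.append(cp_index)
                pos += 2
            else:
                pos += 1
            cp_index += 1
        self.length = cp_index

    def _is_pair(self, pos):
        # True if a high surrogate at pos is followed by a low surrogate.
        u = self.units
        return (0xD800 <= u[pos] <= 0xDBFF
                and pos + 1 < len(u)
                and 0xDC00 <= u[pos + 1] <= 0xDFFF)

    def __len__(self):
        # Length in code points, not code units.
        return self.length

    def __getitem__(self, i):
        if i < 0:
            i += self.length
        if not 0 <= i < self.length:
            raise IndexError("string index out of range")
        # Each surrogate pair before position i shifts the offset by one unit.
        pos = i + bisect.bisect_left(self.astral, i)
        if self._is_pair(pos):
            hi, lo = self.units[pos], self.units[pos + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(self.units[pos])
```

For a string with no astral characters the list is empty and the lookup degenerates to a plain array access, so only strings that actually contain surrogate pairs pay the binary search cost.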

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
I have a different question about IronPython and Jython now. Do their regular expression libraries support Unicode better than CPython's? E.g. does . match a surrogate pair? Tom C suggests that Java's regex libraries get this and many other details right despite Java's use of UTF-16 to represent

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Terry Reedy
On 8/26/2011 5:29 AM, Martin v. Löwis wrote: IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units, if they are to conform. My

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Greg Ewing
Paul Moore wrote: IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units, if they are to conform. ... They lose the O(1) guarantee,

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Antoine Pitrou
On Sat, 27 Aug 2011 12:17:18 +1200 Greg Ewing greg.ew...@canterbury.ac.nz wrote: Paul Moore wrote: IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Greg Ewing
M.-A. Lemburg wrote: Simply going with UCS-4 does not solve the problem, since even with UCS-4 storage, you can still have surrogates in your Python Unicode string. Yes, but in that case, you presumably *intend* them to be treated as separate indexing units. If you didn't, there would be no

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy tjre...@udel.edu wrote: On 8/26/2011 5:29 AM, Martin v. Löwis wrote: IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Terry Reedy
On 8/26/2011 8:42 PM, Guido van Rossum wrote: On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy tjre...@udel.edu wrote: My impression is that a UTF-16 implementation, to be properly called such, must do len and [] in terms of code points, which is why Python's narrow builds are called UCS-2 and
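
To make the code unit versus code point distinction concrete, here is a small sketch. It assumes Python 3.3 or later, where `len` on a `str` counts code points; the helper name `utf16_code_units` is hypothetical and simply measures the UTF-16 encoded size.

```python
def utf16_code_units(s):
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def code_points(s):
    """Number of code points in s (what len reports on Python 3.3+)."""
    return len(s)


# One astral character (U+10400) between two ASCII letters.
sample = "a\U00010400b"
units = utf16_code_units(sample)
points = code_points(sample)
```

Here `units` is 4 and `points` is 3: the astral character takes a surrogate pair in UTF-16, so an implementation that answers `len` in code units reports one more than one that answers in code points.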

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Greg Ewing
On 25/08/11 14:29, Guido van Rossum wrote: Let's get things right so users won't have to worry about code points vs. code units any more. What about things like the surrogateescape codec that deliberately use code units in non-standard ways? Will tricks like that still be possible if the

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Martin v. Löwis
Strings contain Unicode code units, which for most purposes can be treated as Unicode characters. However, even as simple an operation as s1[0] == s2[0] cannot be relied upon to give Unicode-conforming results. The second sentence remains true under PEP 393. Really? If

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Martin v. Löwis
What is non-conforming about comparing two code points? Unicode conformance means treating characters correctly. Re-read the text. You are interpreting something that isn't there. Seriously, what does Unicode-conforming mean here? Chapter 3, all verses. Here, specifically C6, p.
