Re: [pypy-dev] PyPy 2 unicode class

2014-01-24 Thread Armin Rigo
Hi all,
Thanks everybody for your comments on this topic. Our initial motivation for doing that is to simplify RPython by getting rid of the RPython unicode type. I think that the outcome of these mails is that there is no single obvious answer as to whether the change would benefit or hurt Pyth

Re: [pypy-dev] PyPy 2 unicode class

2014-01-23 Thread Oscar Benjamin
On 23 January 2014 20:54, Steven D'Aprano wrote:
> On Thu, Jan 23, 2014 at 01:27:50PM +, Oscar Benjamin wrote:
>
>> Steven wrote:
>> > With a UTF-8 implementation, won't that mean that string indexing
>> > operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
>> > byte(s) to l

Re: [pypy-dev] PyPy 2 unicode class

2014-01-23 Thread Nathan Hurst
On Thu, Jan 23, 2014 at 10:45:25PM +0200, Elefterios Stamatogiannakis wrote:
> >But having said all this, I know that using UTF-8 internally for strings
> >is quite common (e.g. Haskell does it, without even an index cache, and
> >documents that indexing operations can be slow). CPython's FSR has
>

Re: [pypy-dev] PyPy 2 unicode class

2014-01-23 Thread Elefterios Stamatogiannakis
On 23/1/2014 10:54 PM, Steven D'Aprano wrote:
On Thu, Jan 23, 2014 at 01:27:50PM +, Oscar Benjamin wrote:
Steven wrote:
With a UTF-8 implementation, won't that mean that string indexing operations are O(N) rather than O(1)? E.g. how do you know which UTF-8 byte(s) to look at to get the cha

Re: [pypy-dev] PyPy 2 unicode class

2014-01-23 Thread Dan Stromberg
On Tue, Jan 21, 2014 at 11:01 PM, Johan Råde wrote:
> At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode
> class. He gave two versions of the design:
Why spend brain cycles on a PyPy unicode class, when you could just move on to PyPy3? The majority of the Python community is

Re: [pypy-dev] PyPy 2 unicode class

2014-01-23 Thread Steven D'Aprano
On Thu, Jan 23, 2014 at 01:27:50PM +, Oscar Benjamin wrote:
> Steven wrote:
> > With a UTF-8 implementation, won't that mean that string indexing
> > operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
> > byte(s) to look at to get the character at index 42 without having to

Re: [pypy-dev] PyPy 2 unicode class

2014-01-23 Thread Armin Rigo
Hi Oscar,
Thanks for explaining the caching in detail :-)
On Thu, Jan 23, 2014 at 2:27 PM, Oscar Benjamin wrote:
> big saving. If the string comes from anything other than utf-8 the indexing
> cache can be built while decoding (and reencoding as utf-8 under the hood).
Actually, you need to walk
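For readers following the thread, here is a rough sketch of the kind of indexing cache being discussed: a UTF-8 byte buffer plus a sparse table of byte offsets, so that indexing walks at most a few characters instead of the whole string. The class name `Utf8Str`, the `stride` parameter, and the layout of `offsets` are all made up for illustration; this is not PyPy's implementation, and it is written for Python 3, where iterating over `bytes` yields integers.

```python
class Utf8Str:
    """Hypothetical UTF-8 backed string with a sparse index cache."""

    def __init__(self, text, stride=4):
        self.data = text.encode("utf-8")
        self.stride = stride
        # offsets[k] is the byte offset of character number k * stride.
        self.offsets = []
        count = 0
        for pos, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # not a continuation byte: a character starts here
                if count % stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError("string index out of range")
        # Jump to the nearest cached offset at or before the requested character.
        pos = self.offsets[index // self.stride]
        # Walk forward at most stride - 1 characters from there.
        for _ in range(index % self.stride):
            pos += 1
            while self.data[pos] & 0xC0 == 0x80:
                pos += 1
        # Find the end of the character that starts at pos.
        end = pos + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")
```

With a fixed stride the walk is bounded, so indexing costs a constant amount of work per lookup at the price of one cached offset per `stride` characters. In this sketch the cache is built in a single pass over the bytes at construction time.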

Re: [pypy-dev] PyPy 2 unicode class

2014-01-23 Thread Johan Råde
On 2014-01-22 08:01, Johan Råde wrote:
Next, would such a change break any existing Python 2 code on Windows? Yes it will. For instance the following code for counting characters in a string:
    f = [0] * (1 << 16)
    for c in s:
        f[ord(c)] += 1
I would like to qualify this statement. Get
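To make the concern concrete, here is a small sketch of why the table-based count quoted in this message depends on 16-bit code units. It assumes an interpreter where iterating over a string yields whole code points (as Python 3 does, and as a UTF-32 interface would); the function names and the `Counter`-based alternative are illustrative only and are not taken from the thread.

```python
from collections import Counter


def count_with_table(s):
    # One slot per 16-bit code unit, as in the quoted snippet.
    f = [0] * (1 << 16)
    for c in s:
        f[ord(c)] += 1
    return f


def count_with_counter(s):
    # Hypothetical alternative: makes no assumption about the range of ord(c).
    return Counter(ord(c) for c in s)


sample = u"a\U0001F600"  # U+1F600 lies outside the Basic Multilingual Plane

try:
    count_with_table(sample)
    table_ok = True
except IndexError:
    # With a code-point interface, ord(c) == 0x1F600 exceeds the table size.
    table_ok = False
```

On a build with a UTF-16 interface the same character would be seen as two surrogate code units, each below `1 << 16`, so the table version would not raise there.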

Re: [pypy-dev] PyPy 2 unicode class

2014-01-23 Thread Oscar Benjamin
On Wed, Jan 22, 2014 at 06:56:32PM +0100, Armin Rigo wrote:
> Hi Johan,
>
> On Wed, Jan 22, 2014 at 8:01 AM, Johan Råde wrote:
> > (I hope this makes more sense than my ramblings on IRC last night.)
>
> All versions you gave make sense as far as I'm concerned :-) But this
> last one is the clea

Re: [pypy-dev] PyPy 2 unicode class

2014-01-23 Thread Steven D'Aprano
On Wed, Jan 22, 2014 at 08:01:31AM +0100, Johan Råde wrote:
> At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode
> class. He gave two versions of the design:
>
> A: unicode with a UTF-8 implementation and a UTF-32 interface.
>
> B: unicode with a UTF-8 implementation, a UT

Re: [pypy-dev] PyPy 2 unicode class

2014-01-22 Thread Armin Rigo
Hi Johan,
On Wed, Jan 22, 2014 at 8:01 AM, Johan Råde wrote:
> (I hope this makes more sense than my ramblings on IRC last night.)
All versions you gave make sense as far as I'm concerned :-) But this last one is the clearest indeed. It seems that Python 3 went that way anyway too, and exposes

[pypy-dev] PyPy 2 unicode class

2014-01-21 Thread Johan Råde
At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode class. He gave two versions of the design:
A: unicode with a UTF-8 implementation and a UTF-32 interface.
B: unicode with a UTF-8 implementation, a UTF-16 interface on Windows and a UTF-32 interface on UNIX-like systems.
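As a rough illustration of what the two interfaces would mean for user code, the sketch below takes one string and counts its units three ways: the UTF-8 bytes that both designs would store, the UTF-32 units that design A would expose, and the UTF-16 units that design B would expose on Windows. It only uses the standard codecs to do the counting; the variable names are made up here and nothing in it reflects how PyPy would implement either design.

```python
# One BMP character followed by one character outside the BMP.
text = u"a\U0001F600"

# Storage shared by designs A and B: the UTF-8 encoded bytes.
utf8_storage = text.encode("utf-8")

# Design A (UTF-32 interface): one unit per code point.
utf32_units = len(text.encode("utf-32-le")) // 4

# Design B on Windows (UTF-16 interface): non-BMP characters
# appear as a surrogate pair, i.e. two units.
utf16_units = len(text.encode("utf-16-le")) // 2

print(len(utf8_storage), utf32_units, utf16_units)  # prints "5 2 3"
```

So under design B the value of `len(s)` and the result of indexing would differ between Windows and UNIX-like systems for any string containing non-BMP characters, while under design A they would agree everywhere.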