Re: Performance of int/long in Python 3
On 07.04.13 00:24, Chris Angelico wrote: On Sat, Apr 6, 2013 at 8:09 PM, Serhiy Storchaka storch...@gmail.com wrote: See also the discussion at http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with rejection. This is an implementation detail and different Python implementations (including future CPython versions) can have different internal string implementations. I really don't see why this means that there can't be a function in sys, or something. I mean, other Pythons aren't expected to return the exact same values from sys.getsizeof, are they? But clearly the weight of opinion is against me, so fine, I don't care that much. The strongest argument for adding this feature to the stdlib is that it has O(1) complexity, as opposed to the O(N) complexity of any manual implementation. But this argument is not valid for other implementations. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Sat, 06 Apr 2013 19:58:02 -0600, Ian Kelly wrote: On Sat, Apr 6, 2013 at 7:29 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: For some definition of easily.

    if implementation == CPython:
        if version < 3.3:
            if sys.maxunicode exists:
                use it to decide whether this is a wide or narrow build
                if a wide build: return 4
                else: return 2
            else: ???
        elif version == 3.3:
            scan the string, in some efficient or inefficient way
            return 1, 2, 4 depending on the largest character you find
        else: ???
    else: ???

None of which goes away if a char width function is added to 3.4 and you still want to support earlier versions as this does. It just adds another if. I grant you that for supporting earlier versions. But it will help with *future* versions. In principle, by Python 3.9, there could be six different checks just in the CPython section, to say nothing of PyPy, Jython, IronPython, and any other implementation. An officially supported way of querying the kind of strings used will future-proof Python. In this regard, it's no different from (say) sys.float_info. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 06/04/2013 22:24, Chris Angelico wrote: On Sat, Apr 6, 2013 at 8:09 PM, Serhiy Storchaka storch...@gmail.com wrote: 04.04.13 00:57, Chris Angelico написав(ла): http://bugs.python.org/issue17629 opened. See also the discussion at http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with rejection. This is an implementation detail and different Python implementations (including future CPython versions) can have different internal string implementations. I really don't see why this means that there can't be a function in sys, or something. I mean, other Pythons aren't expected to return the exact same values from sys.getsizeof, are they? But clearly the weight of opinion is against me, so fine, I don't care that much. ChrisA There is nothing to stop anybody providing a patch to give this functionality. The downside is long term someone has to maintain it. I strongly prefer having python devs spending their time looking after the 3905 open issues of which 1729 have patches, see http://comments.gmane.org/gmane.comp.python.devel/138310 -- If you're using GoogleCrap™ please read this http://wiki.python.org/moin/GoogleGroupsPython. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
04.04.13 00:57, Chris Angelico написав(ла): On Thu, Apr 4, 2013 at 2:07 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 04 Apr 2013 01:17:28 +1100, Chris Angelico wrote: Probably, but it still has to scan the body of the string. It'd not be too bad if it's all astral, but if it's all BMP, it has to scan the whole string. In the max() case, it has to scan the whole string anyway, as there's no other way to determine the maximum. I'm thinking here of this function: http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/ width.html It's implemented as a simple lookup into the header. (Pike strings, like PEP 393 strings, are stored in the most compact way possible - 1, 2, or 4 bytes per character - with a conceptually similar header structure.) Is this something that would be worth having available? Should I post an issue about it? I'm not really sure why I would want to know, apart from pure intellectual curiosity, but sure, post a feature request. Be sure to mention that Pike supports this feature. http://bugs.python.org/issue17629 opened. See also the discussion at http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with rejection. This is an implementation detail and different Python implementations (including future CPython versions) can have different internal string implementations. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Sat, Apr 6, 2013 at 8:09 PM, Serhiy Storchaka storch...@gmail.com wrote: 04.04.13 00:57, Chris Angelico написав(ла): http://bugs.python.org/issue17629 opened. See also the discussion at http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with rejection. This is an implementation detail and different Python implementations (including future CPython versions) can have different internal string implementations. I really don't see why this means that there can't be a function in sys, or something. I mean, other Pythons aren't expected to return the exact same values from sys.getsizeof, are they? But clearly the weight of opinion is against me, so fine, I don't care that much. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 04/06/2013 02:24 PM, Chris Angelico wrote: On Sat, Apr 6, 2013 at 8:09 PM, Serhiy Storchaka storch...@gmail.com wrote: 04.04.13 00:57, Chris Angelico написав(ла): http://bugs.python.org/issue17629 opened. See also the discussion at http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with rejection. This is an implementation detail and different Python implementations (including future CPython versions) can have different internal string implementations. I really don't see why this means that there can't be a function in sys, or something. I mean, other Pythons aren't expected to return the exact same values from sys.getsizeof, are they? What it boils down to is: - it can easily be done by hand now - it's a very uncommon need ergo: - it's not worth the time and on-going effort required -- ~Ethan~ -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Sat, 06 Apr 2013 14:58:23 -0700, Ethan Furman wrote: On 04/06/2013 02:24 PM, Chris Angelico wrote: On Sat, Apr 6, 2013 at 8:09 PM, Serhiy Storchaka storch...@gmail.com wrote: 04.04.13 00:57, Chris Angelico wrote: http://bugs.python.org/issue17629 opened. See also the discussion at http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with rejection. This is an implementation detail and different Python implementations (including future CPython versions) can have different internal string implementations. I really don't see why this means that there can't be a function in sys, or something. I mean, other Pythons aren't expected to return the exact same values from sys.getsizeof, are they? What it boils down to is: - it can easily be done by hand now For some definition of easily.

    if implementation == CPython:
        if version < 3.3:
            if sys.maxunicode exists:
                use it to decide whether this is a wide or narrow build
                if a wide build: return 4
                else: return 2
            else: ???
        elif version == 3.3:
            scan the string, in some efficient or inefficient way
            return 1, 2, 4 depending on the largest character you find
        else: ???
    else: ???

- it's a very uncommon need Well, that at least is true. But then, needing to know the platform you're running under, the size of objects, the id of an object, the largest integer, the largest float, or the number of references seen by the garbage collector are also uncommon needs. What really matters is not how often you need it, but what you can do when you need it if you don't have it. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
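For concreteness, a rough sketch of the dispatch described above, covering CPython only; the name char_width is made up for illustration, the version checks are collapsed, and other implementations would need branches of their own:

    import sys

    def char_width(s):
        # CPython-only sketch; this is an assumption, not an official API.
        if sys.version_info >= (3, 3):
            # PEP 393: storage depends on the widest code point present,
            # so the only portable option is to scan the string.
            n = max(map(ord, s)) if s else 0
            return 4 if n > 0xFFFF else 2 if n > 0xFF else 1
        # Before 3.3, the build (narrow or wide) decides it for every string.
        return 4 if sys.maxunicode > 0xFFFF else 2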
Re: Performance of int/long in Python 3
On Sat, Apr 6, 2013 at 7:29 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: For some definition of easily.

    if implementation == CPython:
        if version < 3.3:
            if sys.maxunicode exists:
                use it to decide whether this is a wide or narrow build
                if a wide build: return 4
                else: return 2
            else: ???
        elif version == 3.3:
            scan the string, in some efficient or inefficient way
            return 1, 2, 4 depending on the largest character you find
        else: ???
    else: ???

None of which goes away if a char width function is added to 3.4 and you still want to support earlier versions as this does. It just adds another if. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Sat, Apr 6, 2013 at 3:24 PM, Chris Angelico ros...@gmail.com wrote: On Sat, Apr 6, 2013 at 8:09 PM, Serhiy Storchaka storch...@gmail.com wrote: 04.04.13 00:57, Chris Angelico написав(ла): http://bugs.python.org/issue17629 opened. See also the discussion at http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with rejection. This is an implementation detail and different Python implementations (including future CPython versions) can have different internal string implementations. I really don't see why this means that there can't be a function in sys, or something. I mean, other Pythons aren't expected to return the exact same values from sys.getsizeof, are they? But clearly the weight of opinion is against me, so fine, I don't care that much. If you want it, nobody is stopping you from writing it yourself as an extension module. But I don't think the use case is strong enough to warrant the devs adding it and then having to maintain it. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
In article mailman.222.1365299932.3114.python-l...@python.org, Ian Kelly ian.g.ke...@gmail.com wrote: On Sat, Apr 6, 2013 at 7:29 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: For some definition of easily. if implementation == CPython: if version 3.3: if sys.maxunicode exists: use it to decide whether this is a wide or narrow build if a wide build: return 4 else: return 2 else: ??? elif version == 3.3: scan the string, in some efficient or inefficient way return 1, 2, 4 depending on the largest character you find else: ??? else: ??? None of which goes away if a char width function is added to 3.4 and you still want to support earlier versions as this does. It just adds another if. The same is true of any new feature. That doesn't mean we shouldn't add new features. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Sat, Apr 6, 2013 at 8:18 PM, Roy Smith r...@panix.com wrote: In article mailman.222.1365299932.3114.python-l...@python.org, Ian Kelly ian.g.ke...@gmail.com wrote: On Sat, Apr 6, 2013 at 7:29 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: For some definition of easily. if implementation == CPython: if version 3.3: if sys.maxunicode exists: use it to decide whether this is a wide or narrow build if a wide build: return 4 else: return 2 else: ??? elif version == 3.3: scan the string, in some efficient or inefficient way return 1, 2, 4 depending on the largest character you find else: ??? else: ??? None of which goes away if a char width function is added to 3.4 and you still want to support earlier versions as this does. It just adds another if. The same is true of any new feature. That doesn't mean we shouldn't add new features. If you're interested in backward compatibility, then as noted the feature doesn't really make things any simpler for you. Otherwise, the only implementation that matters from the above is the 3.3 one, which isn't much more complex. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 4/3/2013 1:32 AM, Steven D'Aprano wrote: On Wed, 03 Apr 2013 14:31:03 +1100, Neil Hodgson wrote: Sorting a million string list (all the file paths on a particular computer) went from 0.4 seconds with Python 3.2 to 0.78 with 3.3 so we're out of the 'not noticeable by humans' range. Perhaps this is still a 'micro-benchmark' - I'd just like to avoid adding email access to get this over the threshold. What system *and* what compiler and compiler options? Unless 3.2 and 3.3 are both compiled with the same compiler and settings, we do not know the source of the difference. I cannot confirm this performance regression. On my laptop (Debian Linux, not Windows), I can sort a million file names in approximately 1.2 seconds in both Python 3.2 and 3.3. There is no meaningful difference in speed between the two versions. I am guessing that Neil's undisclosed system (that I can see) is Windows, since other benchmarks have been more different on Windows than on *nix. Given that we *know* that the 3.2 and 3.3 distributions are compiled with different compilers and run with different C runtimes, it is possible that some of the difference is from that and not from Python at all. tjr -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, Apr 3, 2013 at 3:03 PM, Neil Hodgson nhodg...@iinet.net.au wrote: rusi wrote: Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can. In my personal experience, it's calculators. I put command-line calculators into *everything*... often in the form of more general executors, and thus restricted to admins, but it's still a calculator. For some reason, the ability to type calc 1+2 and get back 3 is very satisfying to me. You know, in case I ever forget what one plus two makes. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, Apr 3, 2013 at 4:32 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Wed, 03 Apr 2013 14:31:03 +1100, Neil Hodgson wrote: Sorting a million string list (all the file paths on a particular computer) went from 0.4 seconds with Python 3.2 to 0.78 with 3.3 so we're out of the 'not noticeable by humans' range. Perhaps this is still a 'micro-benchmark' - I'd just like to avoid adding email access to get this over the threshold. I cannot confirm this performance regression. On my laptop (Debian Linux, not Windows), I can sort a million file names in approximately 1.2 seconds in both Python 3.2 and 3.3. There is no meaningful difference in speed between the two versions. I'd be curious to know the sorts of characters used. Given that it's probably a narrow-vs-wide Python difference we're talking here, the actual distribution of codepoints may well make a difference. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
Chris Angelico: I'd be curious to know the sorts of characters used. Given that it's probably a narrow-vs-wide Python difference we're talking here, the actual distribution of codepoints may well make a difference. I was going to upload it but then I thought of potential client-confidentiality problems and the need to audit a list that long. Neil -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
Terry Jan Reedy: What system *and* what compiler and compiler options. Unless 3.2 and 3.3 are both compiler with the same compiler and settings, we do not know the source of the difference. The version signatures are: 3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)] 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] The machine is running Windows 8 64-bit (the Python installations are 32-bit though) and the processor is an i3 2350M running at 2.3 GHz. Neil -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, Apr 3, 2013 at 5:29 PM, Neil Hodgson nhodg...@iinet.net.au wrote: Chris Angelico: I'd be curious to know the sorts of characters used. Given that it's probably a narrow-vs-wide Python difference we're talking here, the actual distribution of codepoints may well make a difference. I was going to upload it but then I thought of potential client -confidentiality problems and the need to audit a list that long. Hmm. I was about to say Can you just do a quick collections.Counter() of the string widths in 3.3, as an easy way of seeing which ones use BMP or higher characters, but I can't find a simple way to query a string's width. Can't see it as a method of the string object, nor in the string or sys modules. It ought to be easy enough at the C level - just look up the two bits representing 'kind' - but I've not found it exposed to Python. Is there anything? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, Apr 3, 2013 at 12:52 AM, Chris Angelico ros...@gmail.com wrote: Hmm. I was about to say "Can you just do a quick collections.Counter() of the string widths in 3.3, as an easy way of seeing which ones use BMP or higher characters", but I can't find a simple way to query a string's width. Can't see it as a method of the string object, nor in the string or sys modules. It ought to be easy enough at the C level - just look up the two bits representing 'kind' - but I've not found it exposed to Python. Is there anything?

    4 if max(map(ord, s)) > 0xFFFF else 2 if max(map(ord, s)) > 0xFF else 1

-- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, Apr 3, 2013 at 6:06 PM, Ian Kelly ian.g.ke...@gmail.com wrote: On Wed, Apr 3, 2013 at 12:52 AM, Chris Angelico ros...@gmail.com wrote: Hmm. I was about to say "Can you just do a quick collections.Counter() of the string widths in 3.3, as an easy way of seeing which ones use BMP or higher characters", but I can't find a simple way to query a string's width. Can't see it as a method of the string object, nor in the string or sys modules. It ought to be easy enough at the C level - just look up the two bits representing 'kind' - but I've not found it exposed to Python. Is there anything?

    4 if max(map(ord, s)) > 0xFFFF else 2 if max(map(ord, s)) > 0xFF else 1

Yeah, that's iterating over the whole string (twice, if it isn't width 4). The system already knows what the size is, I was hoping for an uber-quick inspection of the string header. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
Reran the programs taking a bit more care with the encoding of the file. This had no effect on the speeds. There are only a small number of paths that don't fit into ASCII:

    ASCII  1076101
    Latin1 218
    BMP    113
    Astral 0

    # encoding:utf-8
    import codecs, os, time
    from os.path import join, getsize
    with codecs.open("filelist.txt", "r", "utf-8") as f:
        paths = f.read().split("\n")
    bucket = [0, 0, 0, 0]
    for p in paths:
        b = 0
        maxChar = max([ord(ch) for ch in p])
        if maxChar >= 65536:
            b = 3
        elif maxChar >= 256:
            b = 2
        elif maxChar >= 128:
            b = 1
        bucket[b] = bucket[b] + 1
    print("ASCII", bucket[0])
    print("Latin1", bucket[1])
    print("BMP", bucket[2])
    print("Astral", bucket[3])

Neil -- http://mail.python.org/mailman/listinfo/python-list
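Chris's earlier collections.Counter suggestion could do the same bucketing more compactly. A sketch only, assuming paths already holds the decoded path strings as in the script above:

    from collections import Counter

    def bucket(s):
        n = max(map(ord, s)) if s else 0
        if n >= 0x10000:
            return "Astral"
        if n >= 0x100:
            return "BMP"
        if n >= 0x80:
            return "Latin1"
        return "ASCII"

    print(Counter(bucket(p) for p in paths))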
Re: Performance of int/long in Python 3
On Wed, 03 Apr 2013 18:24:25 +1100, Chris Angelico wrote: On Wed, Apr 3, 2013 at 6:06 PM, Ian Kelly ian.g.ke...@gmail.com wrote: On Wed, Apr 3, 2013 at 12:52 AM, Chris Angelico ros...@gmail.com wrote: Hmm. I was about to say "Can you just do a quick collections.Counter() of the string widths in 3.3, as an easy way of seeing which ones use BMP or higher characters", but I can't find a simple way to query a string's width. Can't see it as a method of the string object, nor in the string or sys modules. It ought to be easy enough at the C level - just look up the two bits representing 'kind' - but I've not found it exposed to Python. Is there anything?

    4 if max(map(ord, s)) > 0xFFFF else 2 if max(map(ord, s)) > 0xFF else 1

Yeah, that's iterating over the whole string (twice, if it isn't width 4). Then don't write it as a one-liner :-P

    n = max(map(ord, s))
    4 if n > 0xFFFF else 2 if n > 0xFF else 1

Here's another way:

    (sys.getsizeof(s) - sys.getsizeof(''))/len(s)

should work. There's probably also a way to do it using ctypes. The system already knows what the size is, I was hoping for an uber-quick inspection of the string header. I'm not sure that I would want strings to have a method reporting this, but it might be nice to have a function in the inspect module to do so. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, Apr 3, 2013 at 6:53 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Here's another way: (sys.getsizeof(s) - sys.getsizeof(''))/len(s) should work. Hmm, I had been under the impression that there was a certain base length below which strings all had the same size. Yes, that also works; though again, it's something that can be directly queried, at the C level. There's probably also a way to do it using ctypes. The system already knows what the size is, I was hoping for an uber-quick inspection of the string header. I'm not sure that I would want strings to have a method reporting this, but it might be nice to have a function in the inspect module to do so. Yeah, that's why I also looked in 'sys'; 'inspect' might well be a good place for it, too. But it seems such a function doesn't exist, which is what I was asking. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Apr 3, 12:37 pm, Neil Hodgson nhodg...@iinet.net.au wrote: Reran the programs taking a bit more care with the encoding of the file. This had no effect on the speeds. There are only a small number of paths that don't fit into ASCII:

    ASCII  1076101
    Latin1 218
    BMP    113
    Astral 0

    # encoding:utf-8
    import codecs, os, time
    from os.path import join, getsize
    with codecs.open("filelist.txt", "r", "utf-8") as f:
        paths = f.read().split("\n")
    bucket = [0, 0, 0, 0]
    for p in paths:
        b = 0
        maxChar = max([ord(ch) for ch in p])
        if maxChar >= 65536:
            b = 3
        elif maxChar >= 256:
            b = 2
        elif maxChar >= 128:
            b = 1
        bucket[b] = bucket[b] + 1
    print("ASCII", bucket[0])
    print("Latin1", bucket[1])
    print("BMP", bucket[2])
    print("Astral", bucket[3])

Neil

Can you please try one more experiment Neil? Knock off all non-ASCII strings (paths) from your dataset and try again. [It should take little more than converting your above code to a filter: if b == 0, print; if b > 0, ignore.] -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
This FSR is wrong by design. A naive way to embrace Unicode. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
Roy Smith: On the other hand, how long did it take you to do the directory tree walk required to find those million paths? I'll bet a lot longer than 0.78 seconds, so this gets lost in the noise. About 2 minutes. But that's just getting an example data set. Other data sets may be loaded more quickly from databases or files or be created by processing. Reading the example data from a file takes around the same time as sorting. Neil -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
rusi: Can you please try one more experiment Neil? Knock off all non-ASCII strings (paths) from your dataset and try again. Results are the same 0.40 (well, 0.001 less but I don't think the timer is that accurate) for Python 3.2 and 0.78 for Python 3.3. Neil -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 04/03/2013 04:22 AM, Neil Hodgson wrote: rusi: Can you please try one more experiment Neil? Knock off all non-ASCII strings (paths) from your dataset and try again. Results are the same 0.40 (well, 0.001 less but I don't think the timer is that accurate) for Python 3.2 and 0.78 for Python 3.3. Neil That would seem to imply that the speed regression on your data is NOT caused by the differing size encodings. Perhaps it is the difference in MSC compiler version, or other changes made between 3.2 and 3.3 Of course, I can't then explain why Steven didn't get the same results. Perhaps the difference between 32bit Python and 64 on Windows? Or perhaps you have significantly more (or significantly fewer) collisions than Steven did. Before I saw this message, I was thinking of suggesting that you supply a key= parameter to sort, specifying as a key the Unicode character 65536 higher than the one supplied. That way all the keys to be sorted would be 32 bits in size. If this made the timings change noticeably, it could be a big clue. -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
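Dave's key= suggestion could look something like the sketch below. It is only an illustration: it assumes a list called paths and data with no astral characters (shifting one of those past U+10FFFF would raise an error):

    def widen(s):
        # Shift every code point up by 0x10000 so the key string is forced
        # into the 4-byte representation; the relative ordering is preserved.
        return ''.join(chr(ord(c) + 0x10000) for c in s)

    sorted_paths = sorted(paths, key=widen)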
Re: Performance of int/long in Python 3
Dave Angel: That would seem to imply that the speed regression on your data is NOT caused by the differing size encodings. Perhaps it is the difference in MSC compiler version, or other changes made between 3.2 and 3.3

It's not caused by there actually being different size encodings but by the code checking the encoding size 2-4 times for each character. Back in 3.2 the comparison loop looked like:

    while (len1 > 0 && len2 > 0) {
        Py_UNICODE c1, c2;
        c1 = *s1++;
        c2 = *s2++;
        if (c1 != c2)
            return (c1 < c2) ? -1 : 1;
        len1--;
        len2--;
    }

For 3.3 this has changed to:

    for (i = 0; i < len1 && i < len2; ++i) {
        Py_UCS4 c1, c2;
        c1 = PyUnicode_READ(kind1, data1, i);
        c2 = PyUnicode_READ(kind2, data2, i);
        if (c1 != c2)
            return (c1 < c2) ? -1 : 1;
    }

with PyUnicode_READ being:

    #define PyUnicode_READ(kind, data, index) \
        ((Py_UCS4) \
        ((kind) == PyUnicode_1BYTE_KIND ? \
            ((const Py_UCS1 *)(data))[(index)] : \
            ((kind) == PyUnicode_2BYTE_KIND ? \
                ((const Py_UCS2 *)(data))[(index)] : \
                ((const Py_UCS4 *)(data))[(index)] \
            ) \
        ))

There are either 1 or 2 kind checks in each call to PyUnicode_READ and 2 calls to PyUnicode_READ inside the loop. A compiler may decide to move the kind checks out of the loop and specialize the loop, but MSVC 2010 appears not to do so. The assembler (32-bit build) for each PyUnicode_READ looks like:

    mov     ecx, DWORD PTR _kind1$[ebp]
    cmp     ecx, 1
    jne     SHORT $LN17@unicode_co@2
    lea     ecx, DWORD PTR [ebx+eax]
    movzx   edx, BYTE PTR [ecx+edx]
    jmp     SHORT $LN16@unicode_co@2
$LN17@unicode_co@2:
    cmp     ecx, 2
    jne     SHORT $LN15@unicode_co@2
    movzx   edx, WORD PTR [ebx+edi]
    jmp     SHORT $LN16@unicode_co@2
$LN15@unicode_co@2:
    mov     edx, DWORD PTR [ebx+esi]
$LN16@unicode_co@2:

The kind1/kind2 variables aren't even going into registers and at least one test+branch and a jump are executed for every character. Two tests for 2 and 4 byte kinds. len1 and len2 don't get to go into registers either.
Here's the full assembler output for unicode_compare: ; COMDAT _unicode_compare _TEXT SEGMENT _kind2$ = -20 ; size = 4 _kind1$ = -16 ; size = 4 _len2$ = -12; size = 4 _len1$ = -8 ; size = 4 _data2$ = -4; size = 4 _unicode_compare PROC ; COMDAT ; _str1$ = ecx ; _str2$ = eax ; 10417: { pushebp mov ebp, esp sub esp, 20 ; 0014H pushebx pushesi mov esi, eax ; 10418: int kind1, kind2; ; 10419: void *data1, *data2; ; 10420: Py_ssize_t len1, len2, i; ; 10421: ; 10422: kind1 = PyUnicode_KIND(str1); mov eax, DWORD PTR [ecx+16] mov edx, eax shr edx, 2 and edx, 7 pushedi mov DWORD PTR _kind1$[ebp], edx ; 10423: kind2 = PyUnicode_KIND(str2); mov edx, DWORD PTR [esi+16] mov edi, edx shr edi, 2 and edi, 7 mov DWORD PTR _kind2$[ebp], edi ; 10424: data1 = PyUnicode_DATA(str1); testal, 32 ; 0020H je SHORT $LN9@unicode_co@2 testal, 64 ; 0040H je SHORT $LN7@unicode_co@2 lea ebx, DWORD PTR [ecx+24] jmp SHORT $LN10@unicode_co@2 $LN7@unicode_co@2: lea ebx, DWORD PTR [ecx+36] jmp SHORT $LN10@unicode_co@2 $LN9@unicode_co@2: mov ebx, DWORD PTR [ecx+36] $LN10@unicode_co@2: ; 10425: data2 = PyUnicode_DATA(str2); testdl, 32 ; 0020H je SHORT $LN13@unicode_co@2 testdl, 64 ; 0040H je SHORT $LN11@unicode_co@2 lea edx, DWORD PTR [esi+24] jmp SHORT $LN30@unicode_co@2 $LN11@unicode_co@2: lea eax, DWORD PTR [esi+36] mov DWORD PTR _data2$[ebp], eax mov edx, eax jmp SHORT $LN14@unicode_co@2 $LN13@unicode_co@2: mov edx, DWORD PTR [esi+36] $LN30@unicode_co@2: mov DWORD PTR _data2$[ebp], edx $LN14@unicode_co@2: ; 10426: len1 = PyUnicode_GET_LENGTH(str1); mov edi, DWORD PTR [ecx+8] ; 10427: len2 = PyUnicode_GET_LENGTH(str2); mov ecx, DWORD PTR [esi+8] ; 10428: ; 10429: for (i = 0; i len1 i len2; ++i) { xor eax, eax mov DWORD PTR _len1$[ebp], edi mov DWORD
Re: Performance of int/long in Python 3
On 03/04/2013 09:08, jmfauth wrote: This FSR is wrong by design. A naive way to embrace Unicode. jmf The hole you're digging for yourself is getting bigger and bigger and I'm loving it :) -- If you're using GoogleCrap™ please read this http://wiki.python.org/moin/GoogleGroupsPython. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 04/03/2013 07:05 AM, Neil Hodgson wrote: Dave Angel: That would seem to imply that the speed regression on your data is NOT caused by the differing size encodings. Perhaps it is the difference in MSC compiler version, or other changes made between 3.2 and 3.3 Its not caused by there actually being different size encodings but that the code is checking encoding size 2-4 times for each character. Back in 3.2 the comparison loop looked like: while (len1 0 len2 0) { Py_UNICODE c1, c2; c1 = *s1++; c2 = *s2++; if (c1 != c2) return (c1 c2) ? -1 : 1; len1--; len2--; } For 3.3 this has changed to for (i = 0; i len1 i len2; ++i) { Py_UCS4 c1, c2; c1 = PyUnicode_READ(kind1, data1, i); c2 = PyUnicode_READ(kind2, data2, i); if (c1 != c2) return (c1 c2) ? -1 : 1; } with PyUnicode_READ being #define PyUnicode_READ(kind, data, index) \ ((Py_UCS4) \ ((kind) == PyUnicode_1BYTE_KIND ? \ ((const Py_UCS1 *)(data))[(index)] : \ ((kind) == PyUnicode_2BYTE_KIND ? \ ((const Py_UCS2 *)(data))[(index)] : \ ((const Py_UCS4 *)(data))[(index)] \ ) \ )) There are either 1 or 2 kind checks in each call to PyUnicode_READ and 2 calls to PyUnicode_READ inside the loop. A compiler may decide to move the kind checks out of the loop and specialize the loop but MSVC 2010 appears to not do so. I don't know how good MSC's template logic is, but it seems this would be a good case for an explicit template, typed on the 'kind's values. Or are all C++ features disabled when compiling Python? Failing that, just code up 9 cases, and do a switch on the kinds. I'm also puzzled. I thought that the sort algorithm used a hash of all the items to be sorted, and only reverted to a raw comparison of the original values when the hash collided. Is that not the case? Or is the code you post here only used when the hash collides? The assembler (32-bit build) for each PyUnicode_READ looks like movecx, DWORD PTR _kind1$[ebp] cmpecx, 1 jneSHORT $LN17@unicode_co@2 leaecx, DWORD PTR [ebx+eax] movzxedx, BYTE PTR [ecx+edx] jmpSHORT $LN16@unicode_co@2 $LN17@unicode_co@2: cmpecx, 2 jneSHORT $LN15@unicode_co@2 movzxedx, WORD PTR [ebx+edi] jmpSHORT $LN16@unicode_co@2 $LN15@unicode_co@2: movedx, DWORD PTR [ebx+esi] $LN16@unicode_co@2: It appears that the compiler is keeping the three pointers in three separate registers (eax, esi and edi) even though those are 3 aliases for the same pointer. This is preventing it from putting other values in those registers. It'd probably do better if the C code manipulated the pointers, rather than using an index i each time. But if it did, perhaps gcc would generate worse code. If I were coding the assembler by hand (Intel only), I'd be able to avoid the multiple cmp operations, simply by comparing first to 2, then doing a jne and a ja. I dunno whether the compiler would notice if I coded the equivalent in C. (make both comparisons to 2, one for less, and one for more) The kind1/kind2 variables aren't even going into registers and at least one test+branch and a jump are executed for every character. Two tests for 2 and 4 byte kinds. len1 and len2 don't get to go into registers either. 
Here's the full assembler output for unicode_compare: ;COMDAT _unicode_compare _TEXTSEGMENT _kind2$ = -20; size = 4 _kind1$ = -16; size = 4 _len2$ = -12; size = 4 _len1$ = -8; size = 4 _data2$ = -4; size = 4 _unicode_compare PROC; COMDAT ; _str1$ = ecx ; _str2$ = eax ; 10417: { pushebp movebp, esp subesp, 20; 0014H pushebx pushesi movesi, eax ; 10418: int kind1, kind2; ; 10419: void *data1, *data2; ; 10420: Py_ssize_t len1, len2, i; ; 10421: ; 10422: kind1 = PyUnicode_KIND(str1); moveax, DWORD PTR [ecx+16] movedx, eax shredx, 2 andedx, 7 pushedi movDWORD PTR _kind1$[ebp], edx ; 10423: kind2 = PyUnicode_KIND(str2); movedx, DWORD PTR [esi+16] movedi, edx shredi, 2 andedi, 7 movDWORD PTR _kind2$[ebp], edi ; 10424: data1 = PyUnicode_DATA(str1); testal, 32; 0020H jeSHORT $LN9@unicode_co@2 testal, 64; 0040H jeSHORT $LN7@unicode_co@2 leaebx, DWORD PTR [ecx+24] jmpSHORT $LN10@unicode_co@2 $LN7@unicode_co@2: leaebx, DWORD PTR [ecx+36] jmpSHORT $LN10@unicode_co@2 $LN9@unicode_co@2: movebx, DWORD PTR
Re: Performance of int/long in Python 3
In article 1f2dnfpbhy54embmnz2dnuvz_osdn...@westnet.com.au, Neil Hodgson nhodg...@iinet.net.au wrote: Roy Smith: On the other hand, how long did it take you to do the directory tree walk required to find those million paths? I'll bet a lot longer than 0.78 seconds, so this gets lost in the noise. About 2 minutes. But that's just getting an example data set. Other data sets may be loaded more quickly from databases or files or be created by processing. Reading the example data from a file takes around the same time as sorting. Fair enough. In fact, given that reading the file from disk is O(n) and sorting it is O(n log n), at some point, the sort will totally swamp the input time. Your original example just happened to be one of the unusual cases where the sort time is not the rate limiting factor in the overall process. I remember reading somewhere that more CPU cycles in the entire history of computing have been spent doing sorting than anything else. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
In article mailman.37.1364970149.3114.python-l...@python.org, Chris Angelico ros...@gmail.com wrote: On Wed, Apr 3, 2013 at 3:03 PM, Neil Hodgson nhodg...@iinet.net.au wrote: rusi wrote: Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can. In my personal experience, it's calculators. I put command-line calculators into *everything*... often in the form of more general executors, and thus restricted to admins, but it's still a calculator. For some reason, the ability to type calc 1+2 and get back 3 is very satisfying to me. You know, in case I ever forget what one plus two makes. I discovered recently that Spotlight (the OSX built-in search engine) can do this. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Thu, Apr 4, 2013 at 12:28 AM, Roy Smith r...@panix.com wrote: In article mailman.37.1364970149.3114.python-l...@python.org, Chris Angelico ros...@gmail.com wrote: On Wed, Apr 3, 2013 at 3:03 PM, Neil Hodgson nhodg...@iinet.net.au wrote: rusi wrote: Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can. In my personal experience, it's calculators. I put command-line calculators into *everything*... often in the form of more general executors, and thus restricted to admins, but it's still a calculator. For some reason, the ability to type calc 1+2 and get back 3 is very satisfying to me. You know, in case I ever forget what one plus two makes. I discovered recently that Spotlight (the OSX built-in search engine) can do this. Good feature, not surprising. Google Search has had that feature for a while, and it just feels right to be able to look up information the same way regardless of its source. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Thu, Apr 4, 2013 at 12:25 AM, Roy Smith r...@panix.com wrote: Fair enough. In fact, given that reading the file from disk is O(n) and sorting it is O(n log n), at some point, the sort will totally swamp the input time. But given the much larger fixed cost of disk access, that might take an awful lot of strings... ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
In article 515be00e$0$29891$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Wed, 03 Apr 2013 18:24:25 +1100, Chris Angelico wrote: On Wed, Apr 3, 2013 at 6:06 PM, Ian Kelly ian.g.ke...@gmail.com wrote: On Wed, Apr 3, 2013 at 12:52 AM, Chris Angelico ros...@gmail.com wrote: Hmm. I was about to say "Can you just do a quick collections.Counter() of the string widths in 3.3, as an easy way of seeing which ones use BMP or higher characters", but I can't find a simple way to query a string's width. Can't see it as a method of the string object, nor in the string or sys modules. It ought to be easy enough at the C level - just look up the two bits representing 'kind' - but I've not found it exposed to Python. Is there anything?

    4 if max(map(ord, s)) > 0xFFFF else 2 if max(map(ord, s)) > 0xFF else 1

Yeah, that's iterating over the whole string (twice, if it isn't width 4). Then don't write it as a one-liner :-P

    n = max(map(ord, s))
    4 if n > 0xFFFF else 2 if n > 0xFF else 1

This has to inspect the entire string, no? I posted (essentially) this a few days ago:

    if all(ord(c) <= 0xFFFF for c in s):
        return "it's all bmp"
    else:
        return "it's got astral crap in it"

I'm reasonably sure all() is smart enough to stop at the first False value.

    (sys.getsizeof(s) - sys.getsizeof(''))/len(s)

I wouldn't trust getsizeof() to return exactly what you're looking for. -- http://mail.python.org/mailman/listinfo/python-list
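As a quick illustration of that short-circuiting, a sketch only, assuming s is a str:

    def has_astral(s):
        # any() stops at the first code point above the BMP, so a string with
        # an early astral character is cheap to classify; an all-BMP string
        # still forces a full scan.
        return any(ord(c) > 0xFFFF for c in s)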
Re: Performance of int/long in Python 3
On 02/04/2013 10:28, Neil Hodgson wrote: jmfauth: 3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)] [0.8343414906182101, 0.8336184057396241, 0.8330473419738562] 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit [1.3840254166697845, 1.3933888932429768, 1.391664674507438] That's a larger performance decrease than the 64-bit version. Reported the issue as http://bugs.python.org/issue17615 Neil FTR this has been closed as fixed see http://bugs.python.org/issue17615#msg185862 -- If you're using GoogleCrap™ please read this http://wiki.python.org/moin/GoogleGroupsPython. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Thu, Apr 4, 2013 at 12:43 AM, Roy Smith r...@panix.com wrote: This has to inspect the entire string, no? I posted (essentially) this a few days ago: if all(ord(c) = 0x for c in s): return it's all bmp else: return it's got astral crap in it I'm reasonably sure all() is smart enough to stop at the first False value. Probably, but it still has to scan the body of the string. It'd not be too bad if it's all astral, but if it's all BMP, it has to scan the whole string. In the max() case, it has to scan the whole string anyway, as there's no other way to determine the maximum. I'm thinking here of this function: http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/width.html It's implemented as a simple lookup into the header. (Pike strings, like PEP 393 strings, are stored in the most compact way possible - 1, 2, or 4 bytes per character - with a conceptually similar header structure.) Is this something that would be worth having available? Should I post an issue about it? ChrisA more for self-ref than anyone else's: source of Pike's String.width(): http://pike-git.lysator.liu.se/gitweb.cgi?p=pike.git;a=blob;f=src/builtin.cmod;hb=HEAD#l1077 -- http://mail.python.org/mailman/listinfo/python-list
Sorting [was Re: Performance of int/long in Python 3]
On Wed, 03 Apr 2013 07:52:42 -0400, Dave Angel wrote: I thought that the sort algorithm used a hash of all the items to be sorted, and only reverted to a raw comparison of the original values when the hash collided. Is that not the case? Or is the code you post here only used when the hash collides? Sorting does not require that the elements being sorted are hashable. If I have understood the implementation here: http://hg.python.org/releasing/3.3.1/file/2ab2a09901f9/Objects/listobject.c sorting in Python only requires that objects implement the less-than comparison.

    py> class Funny:
    ...     def __init__(self, x):
    ...         self.x = x
    ...     def __lt__(self, other):
    ...         return self.x < other.x
    ...     def __gt__(self, x):
    ...         raise AttributeError
    ...     __le__ = __ge__ = __eq__ = __ne__ = __gt__
    ...
    py> L = [Funny(i) for i in range(10)]
    py> random.shuffle(L)
    py> [f.x for f in L]
    [8, 5, 7, 0, 9, 2, 3, 6, 1, 4]
    py> [f.x for f in sorted(L)]
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

but if I change Funny.__lt__ to also raise, sorting fails. I seem to recall that "sort relies only on <" is a language promise, but I can't seem to find it documented anywhere official. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote: [...]

    n = max(map(ord, s))
    4 if n > 0xFFFF else 2 if n > 0xFF else 1

This has to inspect the entire string, no? Correct. A more efficient implementation would be:

    def char_size(s):
        for n in map(ord, s):
            if n > 0xFFFF:
                return 4
            if n > 0xFF:
                return 2
        return 1

I posted (essentially) this a few days ago:

    if all(ord(c) <= 0xFFFF for c in s):
        return "it's all bmp"
    else:
        return "it's got astral crap in it"

It's not astral crap. People use it, and they'll use it more in the future. Just because you don't, doesn't give you leave to make disparaging remarks about it. Honestly, it's really painful to see how history repeats itself:

Bah humbug, why do we need to support the SMP astral crap? The Unicode BMP is more than enough for everybody.

Bah humbug, why do we need to support Unicode crap? Latin1 is more than enough for everybody.

Bah humbug, why do we need to support Latin1 crap? ASCII is more than enough for everybody.

Bah humbug, why do we need to support ASCII crap? Uppercase A-Z is more than enough for everybody.

Seriously. Go back long enough, to the telegraph days, and you have people arguing that there was no need for upper and lower case letters. I'm reasonably sure all() is smart enough to stop at the first False value. Yes, all() and any() are guaranteed to be short-circuit functions. They will stop as soon as they see a False or a True value respectively. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Sorting [was Re: Performance of int/long in Python 3]
In article 515c400e$0$29966$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: I seem to recall that sort relies only on operator is a language promise, but I can't seem to find it documented anywhere official. That's pretty typical for sort implementations in all languages. Except for those which rely on less than and equal to :-) -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Thu, 04 Apr 2013 01:17:28 +1100, Chris Angelico wrote: Probably, but it still has to scan the body of the string. It'd not be too bad if it's all astral, but if it's all BMP, it has to scan the whole string. In the max() case, it has to scan the whole string anyway, as there's no other way to determine the maximum. I'm thinking here of this function: http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/ width.html It's implemented as a simple lookup into the header. (Pike strings, like PEP 393 strings, are stored in the most compact way possible - 1, 2, or 4 bytes per character - with a conceptually similar header structure.) Is this something that would be worth having available? Should I post an issue about it? I'm not really sure why I would want to know, apart from pure intellectual curiosity, but sure, post a feature request. Be sure to mention that Pike supports this feature. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Apr 3, 6:43 pm, Roy Smith r...@panix.com wrote: This has to inspect the entire string, no? I posted (essentially) this a few days ago:

    if all(ord(c) <= 0xFFFF for c in s):
        return "it's all bmp"
    else:
        return "it's got astral crap in it"

Astral crap? CRAP? Verily sir I am offended! You don't play with Mahjong characters? How crude! You don't know about cuneiform? How illiterate! You don't compose poetry with Egyptian hieroglyphs? How rude! Shavian has not reformed you? How backward! In short you are a complete philistine. No… On second thoughts I take that back. For all we know, philistine may be one of the blessings of the Unicode gods? So following the illustrious example of jmf, I shall pronounce upon you the ultimate curse: You are American! -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, Apr 3, 2013 at 5:52 AM, Dave Angel da...@davea.name wrote: I'm also puzzled. I thought that the sort algorithm used a hash of all the items to be sorted, and only reverted to a raw comparison of the original values when the hash collided. Is that not the case? Or is the code you post here only used when the hash collides? I think you are mistaken, because I don't see how that could work. If the hashes of two items are different then you can assume they are not equal, but sorting requires a partial ordering comparison, not simply an equality comparison. You cannot determine which item is less or greater than the other from the hash values alone. -- http://mail.python.org/mailman/listinfo/python-list
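A tiny illustration of the point, sketch only:

    # Hashes carry an equality hint but no ordering information.
    a, b = "apple", "banana"
    print(hash(a) < hash(b))  # says nothing about how a and b compare
    print(a < b)              # True: ordering has to look at the actual text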
Re: Performance of int/long in Python 3
On Wed, Apr 3, 2013 at 9:02 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote: [...]

    n = max(map(ord, s))
    4 if n > 0xFFFF else 2 if n > 0xFF else 1

This has to inspect the entire string, no? Correct. A more efficient implementation would be:

    def char_size(s):
        for n in map(ord, s):
            if n > 0xFFFF:
                return 4
            if n > 0xFF:
                return 2
        return 1

That's an incorrect implementation, as it would return 2 at the first non-Latin-1 BMP character, even if there were SMP characters later in the string. It's only safe to short-circuit return 4, not 2 or 1. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, Apr 3, 2013 at 1:53 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: (sys.getsizeof(s) - sys.getsizeof(''))/len(s)

    >>> s = '\x80\x81\x82\x83\x84\x85'
    >>> len(s)
    6
    >>> import sys
    >>> sys.getsizeof(s)
    43
    >>> sys.getsizeof(s) - sys.getsizeof('')
    18
    >>> (sys.getsizeof(s) - sys.getsizeof('')) / len(s)
    3.0

I didn't know there was a 3-byte-width representation. :-) More seriously, it fails because '' is ASCII and s is not, and the overhead for the two strings is different. -- http://mail.python.org/mailman/listinfo/python-list
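One way to sidestep the differing fixed overheads is to grow the string by one code point and measure the delta. Still only a heuristic sketch: it relies on CPython 3.3's compact string layout, which is an implementation detail, and a cached UTF-8 buffer on the original string can skew getsizeof:

    import sys

    def width_from_sizeof(s):
        if not s:
            return 1
        # Appending a copy of an existing character keeps the 'kind' the same,
        # so the object should grow by exactly one code unit.
        return sys.getsizeof(s + s[0]) - sys.getsizeof(s)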
Re: Performance of int/long in Python 3
On 04/03/2013 09:10 AM, rusi wrote: On Apr 3, 6:43 pm, Roy Smith r...@panix.com wrote: This has to inspect the entire string, no? I posted (essentially) this a few days ago: if all(ord(c) = 0x for c in s): return it's all bmp else: return it's got astral crap in it Astral crap? CRAP? Verily sir I am offended! You dont play with Mahjong characters? How crude! You dont know about cuneiform? How illiterate! You dont compose poetry with Egyptian hieroglyphs? How rude! Shavian has not reformed you? How backward! In short you are a complete philistine No… On second thoughts I take that back. For all we know philistine may be one of the blessings of the Unicode gods? So following the ilustrious example of jmf, I shall pronounce upon you the ultimate curse: You are American! LOL! -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, 03 Apr 2013 10:38:20 -0600, Ian Kelly wrote: On Wed, Apr 3, 2013 at 9:02 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote: [...] n = max(map(ord, s)) 4 if n 0x else 2 if n 0xff else 1 This has to inspect the entire string, no? Correct. A more efficient implementation would be: def char_size(s): for n in map(ord, s): if n 0x: return 4 if n 0xFF: return 2 return 1 That's an incorrect implementation, as it would return 2 at the first non-Latin-1 BMP character, even if there were SMP characters later in the string. It's only safe to short-circuit return 4, not 2 or 1. Doh! I mean, well done sir, you have successfully passed my little test! -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 04/03/2013 12:30 PM, Ian Kelly wrote: On Wed, Apr 3, 2013 at 5:52 AM, Dave Angel da...@davea.name wrote: I'm also puzzled. I thought that the sort algorithm used a hash of all the items to be sorted, and only reverted to a raw comparison of the original values when the hash collided. Is that not the case? Or is the code you post here only used when the hash collides? I think you are mistaken, because I don't see how that could work. If the hashes of two items are different then you can assume they are not equal, but sorting requires a partial ordering comparison, not simply an equality comparison. You cannot determine which item is less or greater than the other from the hash values alone. You are of course correct. The particular data that Neil had provided might well have had many duplicates, but that won't be the typical case, so there's not much point in doing an unordered hash. I guess I was confusing it with the key= argument for modifying sort order, where the key function might replace a slow-to-compare data type with something faster. -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Thu, Apr 4, 2013 at 4:43 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Wed, 03 Apr 2013 10:38:20 -0600, Ian Kelly wrote: On Wed, Apr 3, 2013 at 9:02 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote: [...]

    n = max(map(ord, s))
    4 if n > 0xFFFF else 2 if n > 0xFF else 1

This has to inspect the entire string, no? Correct. A more efficient implementation would be:

    def char_size(s):
        for n in map(ord, s):
            if n > 0xFFFF:
                return 4
            if n > 0xFF:
                return 2
        return 1

That's an incorrect implementation, as it would return 2 at the first non-Latin-1 BMP character, even if there were SMP characters later in the string. It's only safe to short-circuit return 4, not 2 or 1. Doh! I mean, well done sir, you have successfully passed my little test! Try this:

    def str_width(s):
        width = 1
        for ch in map(ord, s):
            if ch > 0xFFFF:
                return 4
            if ch > 0xFF:
                width = 2
        return width

ChrisA -- http://mail.python.org/mailman/listinfo/python-list
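A few quick sanity checks for that sketch, assuming the function above under Python 3.3+:

    assert str_width('abc') == 1
    assert str_width('caf\xe9') == 1            # Latin-1 still fits in one byte
    assert str_width('abc\u0101') == 2          # BMP character above U+00FF
    assert str_width('abc\U0001F388') == 4      # astral character forces four bytes
    assert str_width('\u0101\U0001F388') == 4   # the widest character wins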
Re: Performance of int/long in Python 3
On Thu, Apr 4, 2013 at 2:07 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 04 Apr 2013 01:17:28 +1100, Chris Angelico wrote: Probably, but it still has to scan the body of the string. It'd not be too bad if it's all astral, but if it's all BMP, it has to scan the whole string. In the max() case, it has to scan the whole string anyway, as there's no other way to determine the maximum. I'm thinking here of this function: http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/ width.html It's implemented as a simple lookup into the header. (Pike strings, like PEP 393 strings, are stored in the most compact way possible - 1, 2, or 4 bytes per character - with a conceptually similar header structure.) Is this something that would be worth having available? Should I post an issue about it? I'm not really sure why I would want to know, apart from pure intellectual curiosity, but sure, post a feature request. Be sure to mention that Pike supports this feature. http://bugs.python.org/issue17629 opened. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 03/04/2013 22:55, Chris Angelico wrote: On Thu, Apr 4, 2013 at 4:43 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Wed, 03 Apr 2013 10:38:20 -0600, Ian Kelly wrote: On Wed, Apr 3, 2013 at 9:02 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote: [...] n = max(map(ord, s)) 4 if n 0x else 2 if n 0xff else 1 This has to inspect the entire string, no? Correct. A more efficient implementation would be: def char_size(s): for n in map(ord, s): if n 0x: return 4 if n 0xFF: return 2 return 1 That's an incorrect implementation, as it would return 2 at the first non-Latin-1 BMP character, even if there were SMP characters later in the string. It's only safe to short-circuit return 4, not 2 or 1. Doh! I mean, well done sir, you have successfully passed my little test! Try this: def str_width(s): width=1 for ch in map(ord,s): if ch 0x: return 4 if cn 0xFF: width=2 return width ChrisA Given the quality of some code posted here recently this patch can't be accepted until there are some unit tests :) -- If you're using GoogleCrap™ please read this http://wiki.python.org/moin/GoogleGroupsPython. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
Neil Hodgson, replying to self: The assembler (32-bit build) for each PyUnicode_READ looks like Don't have 64-bit MSVC 2010 set up but the code from 64-bit MSVC 2012 is better since there are an extra 8 registers in 64-bit mode:

    ; 10431: c1 = PyUnicode_READ(kind1, data1, i);
        cmp     rsi, 1
        jne     SHORT $LN17@unicode_co
        lea     rax, QWORD PTR [r9+rcx]
        movzx   r8d, BYTE PTR [rax+rbx]
        jmp     SHORT $LN16@unicode_co
    $LN17@unicode_co:
        cmp     rsi, 2
        jne     SHORT $LN15@unicode_co
        movzx   r8d, WORD PTR [r9+r11]
        jmp     SHORT $LN16@unicode_co
    $LN15@unicode_co:
        mov     r8d, DWORD PTR [r9+r10]
    $LN16@unicode_co:

All the variables used in the loop are now in registers but the tests and branches are the same. This lines up with 64-bit being better than 32-bit on Windows but not as good as Python 3.2 or Unix. Neil -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
In article aa3b500f-bebf-4d77-9855-3d90b07ea...@y7g2000pbu.googlegroups.com, rusi rustompm...@gmail.com wrote: On Apr 3, 6:43 pm, Roy Smith r...@panix.com wrote: This has to inspect the entire string, no? I posted (essentially) this a few days ago: if all(ord(c) = 0x for c in s): return it's all bmp else: return it's got astral crap in it Astral crap? CRAP? Verily sir I am offended! [...] You are American! This is true. But, to be fair, in the (I don't have the exact number here) roughly 200 million records in our recent big data import job, I found exactly FOUR strings with astral characters. Which boiled down to two versions of each of two different song titles. One had a Unicode Character 'BALLOON' (U+1F388). The other had some heart symbol (sorry, I don't remember the exact code point). These hardly seem a matter of national pride. And, if you don't believe there is astral crap, how do you explain U+1F4A9? -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
In article 515c448c$0$29966$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote: [...] n = max(map(ord, s)) 4 if n 0x else 2 if n 0xff else 1 This has to inspect the entire string, no? Correct. A more efficient implementation would be: def char_size(s): for n in map(ord, s): if n 0x: return 4 if n 0xFF: return 2 return 1 I posted (essentially) this a few days ago: if all(ord(c) = 0x for c in s): return it's all bmp else: return it's got astral crap in it It's not astral crap. People use it, and they'll use it more in the future. Just because you don't, doesn't give you leave to make disparaging remarks about it. Honestly, it's really painful to see how history repeats itself: Bah humbug, why do we need to support the SMP astral crap? The Unicode BMP is more than enough for everybody. Come on, guys. It was a joke. I'm the guy who was complaining that my database doesn't support non-BMP, remember? -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 2 avr, 01:43, Neil Hodgson nhodg...@iinet.net.au wrote: Mark Lawrence: You've given many examples of the same type of micro benchmark, not many examples of different types of benchmark. Trying to work out what jmfauth is on about, I found what appears to be a performance regression with '<' string comparisons on Windows 64-bit. It's around 30% slower on a 25 character string that differs in the last character and 70-100% slower on a 100 character string that differs at the end. Can someone else please try this to see if it's reproducible? Linux doesn't show this problem.

c:\python32\python -u charwidth.py
3.2 (r32:88445, Feb 20 2011, 21:30:00) [MSC v.1500 64 bit (AMD64)]
a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/z']176
[0.7116295577956576, 0.7055591343157613, 0.7203483026429418]
a=['C:/Users/Neil/Documents/λ','C:/Users/Neil/Documents/η']176
[0.7664397841378787, 0.7199902325464409, 0.713719289812504]
a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/η']176
[0.7341851791817691, 0.6994205901833599, 0.7106807593741005]
a=['C:/Users/Neil/Documents/ ','C:/Users/Neil/Documents/']180
[0.7346812372666784, 0.699543377914, 0.7064768417728411]

c:\python33\python -u charwidth.py
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)]
a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/z']108
[0.9913326076446045, 0.9455845241056282, 0.9459076605341776]
a=['C:/Users/Neil/Documents/λ','C:/Users/Neil/Documents/η']192
[1.0472289217234318, 1.0362342484091207, 1.0197109728048384]
a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/η']192
[1.0439643704533834, 0.9878581050301687, 0.9949265834034335]
a=['C:/Users/Neil/Documents/ ','C:/Users/Neil/Documents/']312
[1.0987483965446412, 1.0130257167690004, 1.024832248526499]

Here is the code:

# encoding:utf-8
import os, sys, timeit
print(sys.version)
examples = [
    "a=['$b','$z']",
    "a=['$λ','$η']",
    "a=['$b','$η']",
    "a=['$\U00020000','$\U00020001']"]
baseDir = "C:/Users/Neil/Documents/"
#~ baseDir = "C:/Users/Neil/Documents/Visual Studio 2012/Projects/Sigma/QtReimplementation/HLFKBase/Win32/x64/Debug"
for t in examples:
    t = t.replace("$", baseDir)
    # Using os.write as a simple way to get UTF-8 to stdout
    os.write(sys.stdout.fileno(), t.encode("utf-8"))
    print(sys.getsizeof(t))
    print(timeit.repeat("a[0] < a[1]", t, number=500))
    print()

For a more significant performance difference try replacing the baseDir setting with (may be wrapped):
baseDir = "C:/Users/Neil/Documents/Visual Studio 2012/Projects/Sigma/QtReimplementation/HLFKBase/Win32/x64/Debug"

Neil

Hi,

c:\python32\pythonw -u charwidth.py
3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]
a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchz']168
[0.8343414906182101, 0.8336184057396241, 0.8330473419738562]
a=['D:\jm\jmpy\py3app\stringbenchλ','D:\jm\jmpy\py3app\stringbenchη']168
[0.818378092261062, 0.8180854713107406, 0.8192279926793571]
a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchη']168
[0.8131353330542339, 0.8126985677326912, 0.8122744051977042]
a=['D:\jm\jmpy\py3app\stringbenchð €€','D:\jm\jmpy\py3app\stringbenchð €']172
[0.8271094603211102, 0.82704053883214, 0.8265781741004083]
Exit code: 0

c:\Python33\pythonw -u charwidth.py
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)]
a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchz']94
[1.3840254166697845, 1.3933888932429768, 1.391664674507438]
a=['D:\jm\jmpy\py3app\stringbenchλ','D:\jm\jmpy\py3app\stringbenchη']176
[1.6217970707185678, 1.6279369907932706, 1.6207041728220117]
a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchη']176
[1.5150522562729396, 1.5130369919353992, 1.5121890607025037]
a=['D:\jm\jmpy\py3app\stringbenchð €€','D:\jm\jmpy\py3app\stringbenchð €']316
[1.6135375194801664, 1.6117739170366434, 1.6134331526540109]
Exit code: 0

- win7 32-bits
- The file is in utf-8
- Do not be put off by this output; it is just a copy/paste from your excellent editor, whose output pane is configured to use the locale coding.
- Of course, and as expected, similar behaviour from a console. (Which, btw, shows how good your application is.)

== Something different. From a previous msg, on this thread. ---
Sure. And over a different set of samples, it is less compact. If you write a lot of Latin-1, Python will use one byte per character, while UTF-8 will use two bytes per character.
I think you mean writing a lot of Latin-1 characters outside ASCII. However, even people writing texts in, say, French will find that only a small proportion of their text is outside ASCII and so the cost of UTF-8 is correspondingly small. The counter-problem is that a French document that needs to include one mathematical symbol (or emoji) outside Latin-1 will double in size as a Python string. ---
Re: Performance of int/long in Python 3
On Tue, Apr 2, 2013 at 6:24 PM, jmfauth wxjmfa...@gmail.com wrote: An editor may reflect very well the example a gave. You enter thousand ascii chars, then - boum - as you enter a non ascii char, your editor (assuming is uses a mechanism like the FSR), has to internally reencode everything! That assumes that the editor stores the entire buffer as a single Python string. Frankly, I think this unlikely; the nature of insertions and deletions makes this impractical. (I've known editors that do function this way. They're utterly unusable on large files.) ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Tue, 02 Apr 2013 19:03:17 +1100, Chris Angelico wrote: On Tue, Apr 2, 2013 at 6:24 PM, jmfauth wxjmfa...@gmail.com wrote: An editor may reflect very well the example a gave. You enter thousand ascii chars, then - boum - as you enter a non ascii char, your editor (assuming is uses a mechanism like the FSR), has to internally reencode everything! That assumes that the editor stores the entire buffer as a single Python string. Frankly, I think this unlikely; the nature of insertions and deletions makes this impractical. (I've known editors that do function this way. They're utterly unusable on large files.) Nevertheless, for *some* size of text block (a word? line? paragraph?) an implementation may need to re-encode the block as characters are inserted or deleted. So what? Who cares if it takes 0.2 second to insert a character instead of 0.1 second? That's still a hundred times faster than you can type. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
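The re-encoding cost is bounded by whatever storage unit the editor chooses. A minimal sketch, purely illustrative and assuming nothing about how any real editor is written: keep the text as a list of modest-sized chunks, so that typing one astral (or otherwise "wide") character only forces one small chunk to a wider representation rather than the whole document.

CHUNK = 4096

class ChunkedBuffer:
    def __init__(self, text=""):
        self.chunks = [text[i:i + CHUNK] for i in range(0, len(text), CHUNK)] or [""]

    def insert(self, pos, ch):
        # Walk to the chunk containing pos and rebuild only that chunk;
        # every other chunk keeps its existing (possibly 1-byte-per-char) storage.
        for i, chunk in enumerate(self.chunks):
            if pos <= len(chunk):
                self.chunks[i] = chunk[:pos] + ch + chunk[pos:]
                return
            pos -= len(chunk)
        self.chunks[-1] += ch  # position past the end: append

    def text(self):
        return "".join(self.chunks)

buf = ChunkedBuffer("a" * 1000000)
buf.insert(500000, "\U0001D11E")   # only one 4096-character chunk is rebuilt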
Re: Performance of int/long in Python 3
On 2 avr, 10:03, Chris Angelico ros...@gmail.com wrote: On Tue, Apr 2, 2013 at 6:24 PM, jmfauth wxjmfa...@gmail.com wrote: An editor may reflect very well the example a gave. You enter thousand ascii chars, then - boum - as you enter a non ascii char, your editor (assuming is uses a mechanism like the FSR), has to internally reencode everything! That assumes that the editor stores the entire buffer as a single Python string. Frankly, I think this unlikely; the nature of insertions and deletions makes this impractical. (I've known editors that do function this way. They're utterly unusable on large files.) ChrisA

No, no, no, no, ... as we say in French (this is a kindly form). The length of a string may have its importance. This bad behaviour may happen on every char. The most complicated chars are the chars with diacritics and ligatured [1, 2] chars, eg the chars used in Arabic script [2]. It is somehow funny to see that the FSR fails precisely on the problems Unicode is supposed to solve/handle, eg normalization or sorting [3]. Not really a problem for those who are endorsing the good work Unicode does [5].

[1] A point which was not, in my mind, very well understood when I read the PEP 393 discussion.
[2] Take a Unicode-compliant TeX engine and toy with the decomposed form of these chars. A very good way to understand what a char can really be, when you wish to process text seriously.
[3] I only test and tested these chars blindly with the help of the doc I have. Btw, when I test complicated Arabic chars, I noticed Py33 crashes; it does not really crash, it gets stuck in some kind of infinite loop (or is it due to timeit?).
[4] Am I the only one who tests this kind of stuff?
[5] Unicode is a fascinating construction.

jmf -- http://mail.python.org/mailman/listinfo/python-list
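For anyone who wants to test the normalization claim concretely rather than blindly, the stdlib unicodedata module is the place to poke; a small probe (illustrative only, and independent of how the interpreter stores the string internally):

import unicodedata

composed = "\u00e9"        # é as a single code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT
assert composed != decomposed                                 # different code point sequences
assert unicodedata.normalize("NFC", decomposed) == composed   # compose
assert unicodedata.normalize("NFD", composed) == decomposed   # decompose
print(len(composed), len(decomposed))                         # 1 2: lengths count code points

This should behave the same under 3.2 and 3.3; PEP 393 is invisible here.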
Re: Performance of int/long in Python 3
On 2 avr, 10:35, Steven D'Aprano steve +comp.lang.pyt...@pearwood.info wrote: On Tue, 02 Apr 2013 19:03:17 +1100, Chris Angelico wrote: So what? Who cares if it takes 0.2 second to insert a character instead of 0.1 second? That's still a hundred times faster than you can type. - This is not the problem. The interesting point is that there are good and less good Unicode implementations. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
jmfauth: 3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)] [0.8343414906182101, 0.8336184057396241, 0.8330473419738562] 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit [1.3840254166697845, 1.3933888932429768, 1.391664674507438] That's a larger performance decrease than the 64-bit version. Reported the issue as http://bugs.python.org/issue17615 Neil -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 02/04/2013 10:24, jmfauth wrote: On 2 avr, 10:35, Steven D'Aprano steve +comp.lang.pyt...@pearwood.info wrote: On Tue, 02 Apr 2013 19:03:17 +1100, Chris Angelico wrote: So what? Who cares if it takes 0.2 second to insert a character instead of 0.1 second? That's still a hundred times faster than you can type. - This not the problem. The interesting point is that they are good and less good Unicode implementations. jmf The interesting point is that the Python 3.3 unicode implementation is correct, that of most other languages is buggy. Or have I fallen victim to the vicious propaganda of the various Pythonistas who frequent this list? -- If you're using GoogleCrap™ please read this http://wiki.python.org/moin/GoogleGroupsPython. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 02/04/2013 10:43, Mark Lawrence wrote: On 02/04/2013 10:24, jmfauth wrote: On 2 avr, 10:35, Steven D'Aprano steve +comp.lang.pyt...@pearwood.info wrote: On Tue, 02 Apr 2013 19:03:17 +1100, Chris Angelico wrote: So what? Who cares if it takes 0.2 second to insert a character instead of 0.1 second? That's still a hundred times faster than you can type. - This not the problem. The interesting point is that they are good and less good Unicode implementations. jmf The interesting point is that the Python 3.3 unicode implementation is correct, that of most other languages is buggy. Or have I fallen victim to the vicious propaganda of the various Pythonistas who frequent this list? Mark, Thanks for asking this question. It seems to me that jmf *might* be moving towards a vindicated position. There is some interest now in duplicating, understanding and (hopefully!) extending his test results, which can only be a Good Thing - whatever the outcome and wherever the facepalm might land. However, as you rightly point out, there is only value in following this through if the functionality is (at least near) 100% correct. I am sure there are some that will disagree but in most cases, functionality is the primary requirement and poor performance can be managed initially and fixed in due time. Steve -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Apr 2, 3:58 pm, Steve Simmons square.st...@gmail.com wrote: On 02/04/2013 10:43, Mark Lawrence wrote: On 02/04/2013 10:24, jmfauth wrote: On 2 avr, 10:35, Steven D'Aprano steve +comp.lang.pyt...@pearwood.info wrote: On Tue, 02 Apr 2013 19:03:17 +1100, Chris Angelico wrote: So what? Who cares if it takes 0.2 second to insert a character instead of 0.1 second? That's still a hundred times faster than you can type. - This not the problem. The interesting point is that they are good and less good Unicode implementations. jmf The interesting point is that the Python 3.3 unicode implementation is correct, that of most other languages is buggy. Or have I fallen victim to the vicious propaganda of the various Pythonistas who frequent this list? Mark, Thanks for asking this question. It seems to me that jmf *might* be moving towards a vindicated position. There is some interest now in duplicating, understanding and (hopefully!) extending his test results, which can only be a Good Thing - whatever the outcome and wherever the facepalm might land. Whew! Very reassuring to hear some sanity in this discussion at long last! -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Tue, 02 Apr 2013 11:58:11 +0100, Steve Simmons wrote: It seems to me that jmf *might* be moving towards a vindicated position. There is some interest now in duplicating, understanding and (hopefully!) extending his test results, which can only be a Good Thing - whatever the outcome and wherever the facepalm might land. Some interest now? Oh please. http://mail.python.org/pipermail/python-list/2012-September/629810.html Mark Lawrence even created a bug report to track this, also back in September. http://bugs.python.org/issue16061 I'm sure you didn't intend to be insulting, but some of us *have* taken JMF seriously, at least at first. His repeated overblown claims of how Python is destroying Unicode, his lack of acknowledgement that other people have seen string handling *speed up* not slow down, and his refusal to assist in diagnosing this performance regression except to repeatedly quote the same artificial micro-benchmarks over and over again have lost him whatever credibility he started with. This feature is a *memory optimization*, not a speed optimization, and yet as a side-effect of saving memory, it also saves time. Real-world benchmarks of actual applications demonstrate this. One or two trivial slowdowns of artificial micro-benchmarks simply are not important, even if they are genuine. I believe they are genuine, but likely operating system and hardware dependent. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
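The memory half of that trade-off is easy to check directly; a rough illustration (the exact byte counts vary with build, platform and Python version, so print them rather than assert them):

import sys

for s in ["a" * 1000,            # ASCII: 1 byte per character under PEP 393
          "\u00e9" * 1000,       # Latin-1 range: still 1 byte per character
          "\u20ac" * 1000,       # other BMP: 2 bytes per character
          "\U0001d11e" * 1000]:  # astral: 4 bytes per character
    print(repr(s[0]), len(s), sys.getsizeof(s))

On a 3.2 narrow build every one of these costs 2 bytes per code unit (plus surrogate pairs for the astral case); on a 3.2 wide build, 4 bytes per character across the board.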
Re: Performance of int/long in Python 3
On 02/04/2013 11:58, Steve Simmons wrote: On 02/04/2013 10:43, Mark Lawrence wrote: On 02/04/2013 10:24, jmfauth wrote: On 2 avr, 10:35, Steven D'Aprano steve +comp.lang.pyt...@pearwood.info wrote: On Tue, 02 Apr 2013 19:03:17 +1100, Chris Angelico wrote: So what? Who cares if it takes 0.2 second to insert a character instead of 0.1 second? That's still a hundred times faster than you can type. - This not the problem. The interesting point is that they are good and less good Unicode implementations. jmf The interesting point is that the Python 3.3 unicode implementation is correct, that of most other languages is buggy. Or have I fallen victim to the vicious propaganda of the various Pythonistas who frequent this list? Mark, Thanks for asking this question. It seems to me that jmf *might* be moving towards a vindicated position. There is some interest now in duplicating, understanding and (hopefully!) extending his test results, which can only be a Good Thing - whatever the outcome and wherever the facepalm might land. The position that is already documented in PEP393, how so? However, as you rightly point out, there is only value in following this through if the functionality is (at least near) 100% correct. I am sure there are some that will disagree but in most cases, functionality is the primary requirement and poor performance can be managed initially and fixed in due time. I've already raised an issue about performance and Neil Hodgson has raised a new one. To balance this out perhaps we should have counter issues asking for the amount of memory being used to be increased to old levels and the earlier buggier behaviour of Python to be reintroduced? Swings and roundabouts? Steve -- If you're using GoogleCrap™ please read this http://wiki.python.org/moin/GoogleGroupsPython. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 02/04/2013 15:03, Steven D'Aprano wrote: On Tue, 02 Apr 2013 11:58:11 +0100, Steve Simmons wrote: It seems to me that jmf *might* be moving towards a vindicated position. There is some interest now in duplicating, understanding and (hopefully!) extending his test results, which can only be a Good Thing - whatever the outcome and wherever the facepalm might land. Some interest now? Oh please. http://mail.python.org/pipermail/python-list/2012-September/629810.html Mark Lawrence even created a bug report to track this, also back in September. http://bugs.python.org/issue16061 I'm sure you didn't intend to be insulting, but some of us *have* taken JMF seriously, at least at first. His repeated overblown claims of how Python is destroying Unicode, his lack of acknowledgement that other people have seen string handling *speed up* not slow down, and his refusal to assist in diagnosing this performance regression except to repeatedly quote the same artificial micro-benchmarks over and over again have lost him whatever credibility he started with. This feature is a *memory optimization*, not a speed optimization, and yet as a side-effect of saving memory, it also saves time. Real-world benchmarks of actual applications demonstrate this. One or two trivial slowdowns of artificial micro-benchmarks simply are not important, even if they are genuine. I believe they are genuine, but likely operating system and hardware dependent. First off, no insult intended and I haven't been part of this list long enough to be fully immersed in the history of this so I'm sure there are events of which I am unaware. However, it seems to me that, for whatever reason, JMF has reached the end of his capacity (time, capability, patience, ...) to extend his benchmarks into a more credible test set - i.e. one that demonstrates an acceptably 'real life' profile with a marked drop in performance. As a community we have choices. We can brand him a Troll - and some of his behaviour may mandate that - or we can put some additional energy into drawing this 'disagreement' to a more amicable and constructive conclusion. My post was primarily aimed at recognising the work that people like Mark, Neil and others have done to move the problem forward and was intended to help shift the focus to a more productive approach. Again, my apologies if it was ill timed or ill-directed. Steve Simmons -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 02/04/2013 15:39, Steve Simmons wrote: My post was primarily aimed at recognising the work that people like Mark, Neil and others have done to move the problem forward and was intended to help shift the focus to a more productive approach. Again, my apologies if it was ill timed or ill-directed. Steve Simmons I must point out that I only raised issue 16061 based on data provided by Steven D'Aprano and Serhiy Storchaka. -- If you're using GoogleCrap™ please read this http://wiki.python.org/moin/GoogleGroupsPython. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 02/04/2013 15:12, Mark Lawrence wrote: I've already raised an issue about performance and Neil Hodgson has raised a new one. Recognised in a separate post To balance this out perhaps we should have counter issues asking for the amount of memory being used to be increased to old levels and the earlier buggier behaviour of Python to be reintroduced? Swings and roundabouts? I don't think I came anywhere near suggesting that we should regress correct functionality or memory usage improvements. I just don't believe that we can't have good performance alongside it. Steve S -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 2 avr, 16:03, Steven D'Aprano steve +comp.lang.pyt...@pearwood.info wrote: On Tue, 02 Apr 2013 11:58:11 +0100, Steve Simmons wrote: I'm sure you didn't intend to be insulting, but some of us *have* taken JMF seriously, at least at first. His repeated overblown claims of how Python is destroying Unicode ... Sorry, I never claimed this; I'm just seeing how Python is becoming less Unicode friendly. This feature is a *memory optimization*, not a speed optimization, I totally agree, and utf-8 does that with great art. (See Neil Hodgson's comment.) (Do not interpret this as if I'm saying Python should use utf-8, as I have read.) jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 04/02/2013 08:03 AM, Steve Simmons wrote: On 02/04/2013 15:12, Mark Lawrence wrote: I've already raised an issue about performance and Neil Hodgson has raised a new one. Recognised in a separate post To balance this out perhaps we should have counter issues asking for the amount of memory being used to be increased to old levels and the earlier buggier behaviour of Python to be reintroduced? Swings and roundabouts? I don't think I came anywhere near suggesting that we should regress correct functionality or memory usage improvements. I just don't believe that we can't have good performance alongside it. It's always a trade-off between time and memory. However, as it happens, there are plenty of instances where the new FSR is faster -- and this in real world code, not useless micro-benchmarks. Simmons (too many Steves!), I know you're new so don't have all the history with jmf that many of us do, but consider that the original post was about numbers, had nothing to do with characters or unicode *in any way*, and yet jmf still felt the need to bring unicode up. -- ~Ethan~ -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 02/04/2013 16:12, jmfauth wrote: Sorrry I never claimed this, I'm just seeing on how Python is becoming less Unicode friendly. Please explain this. I see no justification for this comment. How can an implementation that fixes bugs be less Unicode friendly than its earlier, buggier equivalents? jmf -- If you're using GoogleCrap™ please read this http://wiki.python.org/moin/GoogleGroupsPython. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 04/02/2013 07:39 AM, Steve Simmons wrote: On 02/04/2013 15:03, Steven D'Aprano wrote: On Tue, 02 Apr 2013 11:58:11 +0100, Steve Simmons wrote: It seems to me that jmf *might* be moving towards a vindicated position. There is some interest now in duplicating, understanding and (hopefully!) extending his test results, which can only be a Good Thing - whatever the outcome and wherever the facepalm might land. Some interest now? Oh please. http://mail.python.org/pipermail/python-list/2012-September/629810.html Mark Lawrence even created a bug report to track this, also back in September. http://bugs.python.org/issue16061 I'm sure you didn't intend to be insulting, but some of us *have* taken JMF seriously, at least at first. His repeated overblown claims of how Python is destroying Unicode, his lack of acknowledgement that other people have seen string handling *speed up* not slow down, and his refusal to assist in diagnosing this performance regression except to repeatedly quote the same artificial micro-benchmarks over and over again have lost him whatever credibility he started with. This feature is a *memory optimization*, not a speed optimization, and yet as a side-effect of saving memory, it also saves time. Real-world benchmarks of actual applications demonstrate this. One or two trivial slowdowns of artificial micro-benchmarks simply are not important, even if they are genuine. I believe they are genuine, but likely operating system and hardware dependent. First off, no insult intended and I haven't been part of this list long enough to be fully immersed in the history of this so I'm sure there are events of which I am unaware. Yes, that would be his months of trollish behavior on this subject. However, it seems to me that, for whatever reason, JMF has reached the end of his capacity His capacity, maybe; his time? Not by a long shot. I am positive we will continue to see his uncooperative, bratty* behavior continue ad nauseum. -- ~Ethan~ *I was going to say childish, but I know plenty of better behaved children. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Apr 2, 8:17 pm, Ethan Furman et...@stoneleaf.us wrote: Simmons (too many Steves!), I know you're new so don't have all the history with jmf that many of us do, but consider that the original post was about numbers, had nothing to do with characters or unicode *in any way*, and yet jmf still felt the need to bring unicode up. Just for reference, here is the starting para of Chris' original mail that started this thread. The Python 3 merge of int and long has effectively penalized small-number arithmetic by removing an optimization. As we've seen from PEP 393 strings (jmf aside), there can be huge benefits from having a single type with multiple representations internally. Is there value in making the int type have a machine-word optimization in the same way? ie it mentions numbers, strings, PEP 393 *AND jmf.* So while it is true that jmf has been butting in with trollish behavior into completely unrelated threads with his unicode rants, that cannot be said for this thread. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Apr 2, 8:12 pm, jmfauth wxjmfa...@gmail.com wrote: Sorrry I never claimed this, I'm just seeing on how Python is becoming less Unicode friendly. jmf: I suggest you try to use less emotionally loaded and more precise language if you want people to pay heed to your technical observations/ contributions. In particular, while you say unicode, your examples always (as far as I remember) refer to BMP. Also words like 'friendly' are so emotionally charged that people stop being friendly :-) So may I suggest that you rephrase your complaint as I am seeing python is becoming poorly performant on BMP-chars at the expense of correct support for the whole (6.0?) charset (assuming thats what you want to say) In any case PLEASE note that 'performant' and 'correct' are different for most practical purposes. If you dont respect this semantics, people are unlikely to pay heed to your complaints. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 2 avr, 18:57, rusi rustompm...@gmail.com wrote: On Apr 2, 8:17 pm, Ethan Furman et...@stoneleaf.us wrote: Simmons (too many Steves!), I know you're new so don't have all the history with jmf that many of us do, but consider that the original post was about numbers, had nothing to do with characters or unicode *in any way*, and yet jmf still felt the need to bring unicode up. Just for reference, here is the starting para of Chris' original mail that started this thread. The Python 3 merge of int and long has effectively penalized small-number arithmetic by removing an optimization. As we've seen from PEP 393 strings (jmf aside), there can be huge benefits from having a single type with multiple representations internally. Is there value in making the int type have a machine-word optimization in the same way? ie it mentions numbers, strings, PEP 393 *AND jmf.* So while it is true that jmf has been butting in with trollish behavior into completely unrelated threads with his unicode rants, that cannot be said for this thread.

- That's because you did not understand the analogy, int/long - FSR. One more illustration:

def AddOne(i):
    if 0 < i <= 100:
        return i + 10 + 10 + 10 - 10 - 10 - 10 + 1
    elif 100 < i <= 1000:
        return i + 100 + 100 + 100 + 100 - 100 - 100 - 100 - 100 + 1
    else:
        return i + 1

Does it work? Yes. Is it correct? That can be discussed.

Now replace i by a char, a representative of each subset of the FSR, select a method where this FSR behaves badly, and take a look at what happens.

timeit.repeat("'a' * 1000 + 'z'")
[0.6532032148133153, 0.6407248807756699, 0.6407264561239894]
timeit.repeat("'a' * 1000 + '9'")
[0.6429508479509245, 0.6242782443215589, 0.6240490311410927]
timeit.repeat("'a' * 1000 + '€'")
[1.095694927496563, 1.0696347279235603, 1.0687741939041082]
timeit.repeat("'a' * 1000 + 'ẞ'")
[1.0796421281222877, 1.0348612767961853, 1.035325216876231]
timeit.repeat("'a' * 1000 + '\u2345'")
[1.0855414137412112, 1.0694677410017164, 1.0688096392412945]
timeit.repeat("'œ' * 1000 + '\U00010001'")
[1.237314015362017, 1.2226262553064657, 1.21994619397816]
timeit.repeat("'œ' * 1000 + '\U00010002'")
[1.245773635836997, 1.2303978424029651, 1.2258257877430765]

Where does it come from? Simple: the FSR breaks the simple rules used in all coding schemes (unicode or not): 1) a unique set of chars, 2) the same algorithm for all chars. And again, that's why utf-8 works very smoothly. The corporates which understood this very well and wanted to incorporate, let's say, the characters used in the French language had only the choice to create new coding schemes (eg mac-roman, cp1252). In unicode, the latin-1 range is a real plague. After years of experience, I'm still fascinated to see that the corporates have solved this issue easily and the free software is still relying on latin-1. I never succeeded in finding an explanation. Even the TeX folks, when they shifted to the Cork encoding in 199?, were aware of this and consequently provided special package(s).

No offense, but this is, in my mind, why corporate software will always be corporate software and hobbyist software will always stay at the level of hobbyist software. A French Windows user, understanding nothing of the coding of characters, assuming he is even aware of its existence (!), certainly has no problem.

Fascinating how it is possible to use Python to teach, to illustrate, to explain the coding of characters. No?

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Apr 2, 11:22 pm, jmfauth wxjmfa...@gmail.com wrote: On 2 avr, 18:57, rusi rustompm...@gmail.com wrote: On Apr 2, 8:17 pm, Ethan Furman et...@stoneleaf.us wrote: Simmons (too many Steves!), I know you're new so don't have all the history with jmf that many of us do, but consider that the original post was about numbers, had nothing to do with characters or unicode *in any way*, and yet jmf still felt the need to bring unicode up. Just for reference, here is the starting para of Chris' original mail that started this thread. The Python 3 merge of int and long has effectively penalized small-number arithmetic by removing an optimization. As we've seen from PEP 393 strings (jmf aside), there can be huge benefits from having a single type with multiple representations internally. Is there value in making the int type have a machine-word optimization in the same way? ie it mentions numbers, strings, PEP 393 *AND jmf.* So while it is true that jmf has been butting in with trollish behavior into completely unrelated threads with his unicode rants, that cannot be said for this thread. - That's because you did not understand the analogy, int/long - FSR. One another illustration, def AddOne(i): ... if 0 i = 100: ... return i + 10 + 10 + 10 - 10 - 10 - 10 + 1 ... elif 100 i = 1000: ... return i + 100 + 100 + 100 + 100 - 100 - 100 - 100 - 100 + 1 ... else: ... return i + 1 ... Do it work? yes. Is is correct? this can be discussed. Now replace i by a char, a representent of each subset of the FSR, select a method where this FST behave badly and take a look of what happen. timeit.repeat('a' * 1000 + 'z') [0.6532032148133153, 0.6407248807756699, 0.6407264561239894] timeit.repeat('a' * 1000 + '9') [0.6429508479509245, 0.6242782443215589, 0.6240490311410927] timeit.repeat('a' * 1000 + '€') [1.095694927496563, 1.0696347279235603, 1.0687741939041082] timeit.repeat('a' * 1000 + 'ẞ') [1.0796421281222877, 1.0348612767961853, 1.035325216876231] timeit.repeat('a' * 1000 + '\u2345') [1.0855414137412112, 1.0694677410017164, 1.0688096392412945] timeit.repeat('œ' * 1000 + '\U00010001') [1.237314015362017, 1.2226262553064657, 1.21994619397816] timeit.repeat('œ' * 1000 + '\U00010002') [1.245773635836997, 1.2303978424029651, 1.2258257877430765] Where does it come from? Simple, the FSR breaks the simple rules used in all coding schemes (unicode or not). 1) a unique set of chars 2) the same algorithm for all chars. Can you give me a source for this requirement? Numbers are after all numbers. SO we should use the same code/ algorithms/machine-instructions for floating-point and integers? And again that's why utf-8 is working very smoothly. How wonderful. Heres a suggestion. Code up the UTF-8 and any of the python string reps in C and profile them. Please come back and tell us if UTF-8 outperforms any of the python representations for strings on any operation (except straight copy). The corporates which understood this very well and wanted to incorporate, let say, the used characters of the French language had only the choice to create new coding schemes (eg mac-roman, cp1252). In unicode, the latin-1 range is real plague. After years of experience, I'm still fascinated to see the corporates has solved this issue easily and the free software is still relying on latin-1. I never succeed to find an explanation. Even, the TeX folks, when they shifted to the Cork encoding in 199?, were aware of this and consequently provides special package(s). 
No offense, this is in my mind why corporate software will always be corporate software and hobbyist software will always stay at the level of hobbyist software. A French windows user, understanding nothing in the coding of characters, assuming he is aware of its existence (!), has certainly no problem. Fascinating how it is possible to use Python to teach, to illustrate, to explain the coding of the characters. No? jmf You troll with eclat and elan! -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Tue, Apr 2, 2013 at 3:20 AM, jmfauth wxjmfa...@gmail.com wrote: It is somehow funny to see, the FSR fails precisely on problems Unicode will solve/handle, eg normalization or sorting [3]. Neither of these problems have anything to do with the FSR. Can you give us an example of normalization or sorting where Python 3.3 fails and Python 3.2 does not? [3] I only test and tested these chars blindly with the help of the doc I have. Btw, when I test complicated Arabic chars, I noticed, Py33 crashes, it does not really crash, it get stucked in some king of infinite loop (or is it due to timeit?). Without knowing what the actual test that you ran was, we have no way of answering that. Unless you give us more detail, my assumption would be that the number of repetitions that you passed to timeit was excessively large for the particular test case. [4] Am I the only one who test this kind of stuff? No, you're just the only one who considers it important. Micro-benchmarks like the ones you have been reporting are *useful* when it comes to determining what operations can be better optimized, but they are not *important* in and of themselves. What is important is that actual, real-world programs are not significantly slowed by these kinds of optimizations. Until you can demonstrate that real programs are adversely affected by PEP 393, there is not in my opinion any regression that is worth worrying over. -- http://mail.python.org/mailman/listinfo/python-list
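A cheap way to tell whether string operations matter at all in a given real program, before arguing from micro-benchmarks, is to profile that program; a minimal sketch with the stdlib profiler (main() here is a stand-in for whatever the program's real entry point is):

import cProfile, pstats

cProfile.run("main()", "prof.out")      # time the real workload, not a snippet
stats = pstats.Stats("prof.out")
stats.sort_stats("cumulative").print_stats(20)   # top 20 places the time actually goes

If string concatenation or comparison does not show up near the top under 3.2, a 3.3 slowdown in those operations cannot dominate the program either.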
Re: Performance of int/long in Python 3
On 4/2/2013 11:12 AM, jmfauth wrote: On 2 avr, 16:03, Steven D'Aprano steve +comp.lang.pyt...@pearwood.info wrote: I'm sure you didn't intend to be insulting, but some of us *have* taken JMF seriously, at least at first. His repeated overblown claims of how Python is destroying Unicode ... ... = 'usability in Python or some variation on that. Sorrry I never claimed this, I'm just seeing on how Python is becoming less Unicode friendly. Let us see what Jim has claimed, starting in 2012 August. http://mail.python.org/pipermail/python-list/2012-August/628826.html Devs are developing sophisticed tools based on a non working basis. http://mail.python.org/pipermail/python-list/2012-August/629514.html This Flexible String Representation fails. http://mail.python.org/pipermail/python-list/2012-August/629554.html This flexible representation is working absurdly. Reader can decide whether 'non-working', 'fails', 'working absurdly' are closer to 'destroying Unicode usability or just 'less friendly'. On speed: http://mail.python.org/pipermail/python-list/2012-August/628781.html Python 3.3 is slower than Python 3.2. http://mail.python.org/pipermail/python-list/2012-August/628762.html I can open IDLE with Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is always slower. Period. False. Period. Here is my followup at the time. python.org/pipermail/python-list/2012-August/628779.html You have not tried enough tests ;-). On my Win7-64 system: from timeit import timeit print(timeit( 'a'*1 )) 3.3.0b2: .5 3.2.3: .8 print(timeit(c in a, c = '…'; a = 'a'*1)) 3.3: .05 (independent of len(a)!) 3.2: 5.8 100 times slower! Increase len(a) and the ratio can be made as high as one wants! print(timeit(a.encode(), a = 'a'*1000)) 3.2: 1.5 3.3: .26 If one runs stringbency.ph with its 40 or so tests, 3.2 is sometimes faster and 3.3 is sometimes faster. http://mail.python.org/pipermail/python-list/2012-September/630736.html On to September: http://mail.python.org/pipermail/python-list/2012-September/630736.html; Avoid Py3.3 In other words, ignore all the benefits and reject because a couple of selected microbenchmarks show a slowdown. http://mail.python.org/pipermail/python-list/2012-September/631730.html Py 3.3 succeeded to somehow kill unicode I will stop here and let Jim explain how 'kill unicode' is different from 'destroy unicode'. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
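A runnable form of the membership-test benchmark quoted above, the one with the roughly 100x gap (the string length here is illustrative; the exact figures Terry used aren't recoverable from the quote):

from timeit import timeit

# Searching a long 1-byte (ASCII) string for a character that needs 2 bytes:
# 3.3 appears to answer without scanning, which would explain the timing
# being independent of len(a), while a 3.2 narrow build scans the whole string.
print(timeit("c in a", setup="c = '\u2026'; a = 'a' * 10000", number=100000))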
Re: Performance of int/long in Python 3
The initial post posited: The Python 3 merge of int and long has effectively penalized small-number arithmetic by removing an optimization. As we've seen from PEP 393 strings (jmf aside), there can be huge benefits from having a single type with multiple representations internally. Is there value in making the int type have a machine-word optimization in the same way? Thanks to the fervent response jmf has gotten, the point above has been mostly abandoned. May I request that next time such an obvious diversion (aka jmf) occurs, responses happen in a different thread? -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
jmfauth wxjmfa...@gmail.com writes: Now replace i by a char, a representent of each subset of the FSR, select a method where this FST behave badly and take a look of what happen. You insist in cherry-picking a single method where this FST behave badly, even when it is so obviously a corner case (IMHO it is not reasonably a common case when you have relatively big chunks of ASCII characters where you are adding one single non-ASCII char...) Anyway, these are my results on the opposite case, where you have a big chunk of non-ASCII characters and a single ASCII char added: Python 2.7.3 (default, Jan 2 2013, 13:56:14) [GCC 4.7.2] on linux2 Type help, copyright, credits or license for more information. import timeit timeit.repeat('€' * 1000 + 'z') [0.2817099094390869, 0.2811391353607178, 0.2811310291290283] timeit.repeat(u'œ' * 1000 + u'\U00010001') [0.549591064453125, 0.5502040386199951, 0.5490291118621826] timeit.repeat(u'\U00010001' * 1000 + u'œ') [0.3823568820953369, 0.3823089599609375, 0.3820679187774658] timeit.repeat(u'\U00010002' * 1000 + 'a') [0.45046305656433105, 0.45000195503234863, 0.44980502128601074] Python 3.3.0 (default, Mar 18 2013, 12:00:52) [GCC 4.7.2] on linux Type help, copyright, credits or license for more information. import timeit timeit.repeat('€' * 1000 + 'z') [0.23264244200254325, 0.23299441300332546, 0.2325888039995334] timeit.repeat('œ' * 1000 + '\U00010001') [0.3760241370036965, 0.37552819900156464, 0.3755163860041648] timeit.repeat('\U00010001' * 1000 + 'œ') [0.28259182300098473, 0.2825558360054856, 0.2824251129932236] timeit.repeat('\U00010002' * 1000 + 'a') [0.28227063300437294, 0.2815949220021139, 0.2829978369991295] IIUC, while it may be true that Py3 is slightly slower than Py2 when the string operation involves an internal representation change (all your examples, and the second operation above), in the much more common case it is considerably faster. This, and the fact that Py3 actually handles the whole Unicode space without glitches, make it a better environment in my eyes. Kudos to the core team! Just my 0.45-0.28 cents, ciao, lele. -- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia. l...@metapensiero.it | -- Fortunato Depero, 1929. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
Ian Kelly: Micro-benchmarks like the ones you have been reporting are *useful* when it comes to determining what operations can be better optimized, but they are not *important* in and of themselves. What is important is that actual, real-world programs are not significantly slowed by these kinds of optimizations. Until you can demonstrate that real programs are adversely affected by PEP 393, there is not in my opinion any regression that is worth worrying over.

The problem with only responding to issues with real-world programs is that real-world programs are complex and their performance issues often difficult to diagnose. See, for example, scons, which is written in Python and which has not been able to overcome performance problems over several years. (http://www.electric-cloud.com/blog/2010/07/21/a-second-look-at-scons-performance/)

Bottom-up performance work has advantages in that a narrow focus area can be more easily analyzed and tested, and can produce widely applicable benefits. The choice of comparison for the script wasn't arbitrary. Comparison is one of the main building blocks of higher-level code. Sorting, for example, depends strongly on comparison performance, with a decrease in comparison speed multiplied when applied to sorting. It's unfortunate that stringbench.py does not contain any comparison or sorting tests.

Sorting a million-string list (all the file paths on a particular computer) went from 0.4 seconds with Python 3.2 to 0.78 with 3.3, so we're out of the 'not noticeable by humans' range. Perhaps this is still a 'micro-benchmark' - I'd just like to avoid adding email access to get this over the threshold.

Here's some code. Replace the if 1 with if 0 on subsequent runs to avoid the costly file system walk.

import os, time
from os.path import join, getsize

paths = []
if 1:
    for root, dirs, files in os.walk('c:\\'):
        for name in files:
            paths.append(join(root, name))
    with open("filelist.txt", "w") as f:
        f.write("\n".join(paths))
else:
    with open("filelist.txt", "r") as f:
        paths = f.read().split("\n")
print(len(paths))
timeStart = time.time()
paths.sort()
timeEnd = time.time()
print("Time taken=", timeEnd - timeStart)

Neil -- http://mail.python.org/mailman/listinfo/python-list
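For anyone without a convenient million files to walk, a self-contained variant with synthetic path-like strings (an assumption on my part: generated ASCII names only approximate real path data, which may matter if the effect depends on character width) should show whether the sort itself regressed:

import random, string, time

random.seed(42)
parts = [''.join(random.choice(string.ascii_lowercase) for _ in range(8))
         for _ in range(1000)]
# A million synthetic "paths"; all ASCII, so 3.3 stores them one byte per char.
paths = ['C:/Users/test/' + random.choice(parts) + '/' + random.choice(parts)
         for _ in range(1000000)]

timeStart = time.time()
paths.sort()
print("Time taken=", time.time() - timeStart)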
Re: Performance of int/long in Python 3
On Apr 3, 8:31 am, Neil Hodgson nhodg...@iinet.net.au wrote: Sorting a million string list (all the file paths on a particular computer) went from 0.4 seconds with Python 3.2 to 0.78 with 3.3 so we're out of the 'not noticeable by humans' range. Perhaps this is still a 'micro-benchmark' - I'd just like to avoid adding email access to get this over the threshold. What does that last statement mean? -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
rusi wrote: ... a 'micro-benchmark' - I'd just like to avoid adding email access to get this over the threshold. What does that last statement mean? It's a reference to a comment by Jamie Zawinski (relatively famous developer of Netscape Navigator and other things): "Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can." One of the games played in bug reporting and avoidance is to deny that the report is a real problem. A short script is dismissed as unrepresentative of actual programs. Once it can read email, though, it has to be a real program. Neil -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
In article 5f8ed721-7c89-4ffd-8f2b-21979cc33...@kk11g2000pbb.googlegroups.com, rusi rustompm...@gmail.com wrote: On Apr 3, 8:31 am, Neil Hodgson nhodg...@iinet.net.au wrote: Sorting a million string list (all the file paths on a particular computer) went from 0.4 seconds with Python 3.2 to 0.78 with 3.3 so we're out of the 'not noticeable by humans' range. On the other hand, how long did it take you to do the directory tree walk required to find those million paths? I'll bet a lot longer than 0.78 seconds, so this gets lost in the noise. Still, it is unfortunate if sort performance got hurt significantly. My mind was blown a while ago when I discovered that Python could sort a file of strings faster than the unix command-line sort utility. That's pretty impressive. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Apr 3, 9:03 am, Neil Hodgson nhodg...@iinet.net.au wrote: rusi wrote: ... a 'micro-benchmark' - I'd just like to avoid adding email access to get this over the threshold. What does that last statement mean? Its a reference to a comment by Jamie Zawinski (relatively famous developer of Netscape Navigator and other things): And Xemacs (which is famous in the free sw world for other things!) Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can. :-) Ok got it -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Wed, 03 Apr 2013 14:31:03 +1100, Neil Hodgson wrote: Sorting a million string list (all the file paths on a particular computer) went from 0.4 seconds with Python 3.2 to 0.78 with 3.3 so we're out of the 'not noticeable by humans' range. Perhaps this is still a 'micro-benchmark' - I'd just like to avoid adding email access to get this over the threshold. I cannot confirm this performance regression. On my laptop (Debian Linux, not Windows), I can sort a million file names in approximately 1.2 seconds in both Python 3.2 and 3.3. There is no meaningful difference in speed between the two versions. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Mon, Apr 1, 2013 at 4:33 PM, rusi rustompm...@gmail.com wrote: So I really wonder: Is python losing more by supporting SMP with performance hit on BMP? If your strings fit entirely within the BMP, then you should see no penalty compared to previous versions of Python. If they happen to fit inside ASCII, then there may well be significant improvements. But regardless, what you gain is the ability to work with *any* string, regardless of its content, without worrying about it. You can count characters regardless of their content. Imagine if a tuple of integers behaved differently if some of those integers flipped to being long ints: x = (1, 2, 4, 8, 130, 1300, 110) Wouldn't you be surprised if len(x) returned 8? I certainly would be. And that's what a narrow build of Python does with Unicode. Unicode strings are approximately comparable to tuples of integers. In fact, they can be interchanged fairly readily: string = Treble clef: \U0001D11E array = tuple(map(ord,string)) assert(len(array) == 14) out_string = ''.join(map(chr,array)) assert(out_string == string) This doesn't work in Python 2.6 on Windows, partly because of surrogates, but also because chr() isn't designed for Unicode strings. There's probably a solution to the second, but not really to the first. The tuple of ords should match the way the characters are laid out to a human. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Sun, 31 Mar 2013 22:33:45 -0700, rusi wrote: On Mar 31, 5:55 pm, Mark Lawrence breamore...@yahoo.co.uk wrote: snipped jmf's broken-record whine I'm feeling very sorry for this horse, it's been flogged so often it's down to bare bones. While I am now joining the camp of those fed up with jmf's whining, I do wonder if we are shooting the messenger… No. The trouble is that the messenger is shouting that the Unicode world is ending on December 21st 2012, and hasn't noticed that was over three months ago and the world didn't end. [...] OK, that leads to the next question. Is there anyway I can (in Python 2.7) detect when a string is not entirely in the BMP? If I could find all the non-BMP characters, I could replace them with U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough). Of course you can do this, but you should not. If your input data includes character C, you should deal with character C and not just throw it away unnecessarily. That would be rude, and in Python 3.3 it should be unnecessary. Although, since the person you are quoting is stuck in Python 2.7, it may be less bad than having to deal with potentially broken Unicode strings. Steven's: But it means that if you're one of the 99.9% of users who mostly use characters in the BMP, … Yes. Mostly does not mean exclusively, and given (say) a billion computer users, that leaves about a million users who have significant need for non-BMP characters. If you don't agree with my estimate, feel free to invent your own :-) And from http://www.tlg.uci.edu/~opoudjis/unicode/unicode_astral.html The informal name for the supplementary planes of Unicode is astral planes, since (especially in the late '90s) their use seemed to be as remote as the theosophical great beyond. … That was nearly two decades ago. Two decades ago, the idea that the entire computing world could standardize on a single character set, instead of having to deal with dozens of different code pages, seemed as likely as people landing on the Moon seemed in 1940. Today, the entire computing world has standardized on such a system, code pages (encodings) are mostly only needed for legacy data and shitty applications, but most implementations don't support the entire Unicode range. A couple of programming languages, including Pike and Python, support Unicode fully and correctly. Pike has never had the same high-profile as Python, but now that Python can support the entire Unicode range without broken surrogate support, maybe users of other languages will start to demand the same. So I really wonder: Is python losing more by supporting SMP with performance hit on BMP? No. As many people have demonstrated, both with code snippets and whole- program benchmarks, Python 3.3 is *as fast* or *faster* than Python 3.2 narrow builds. In practice, Python 3.3 saves enough memory by using sensible string implementations that real world software is faster in Python 3.3 than in 3.2. The problem as I see it is that a choice that is sufficiently skew is no more a straightforward choice. An example will illustrate: I can choose to drive or not -- a choice. Statistics tell me that on average there are 3 fatalities every day; I am very concerned that I could get killed so I choose not to drive. Which neglects that there are a couple of million safe-drives at the same time as the '3 fatalities' Clear as mud. What does this have to do with supporting Unicode? -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
In article 515941d8$0$29967$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: [...] OK, that leads to the next question. Is there anyway I can (in Python 2.7) detect when a string is not entirely in the BMP? If I could find all the non-BMP characters, I could replace them with U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough). Of course you can do this, but you should not. If your input data includes character C, you should deal with character C and not just throw it away unnecessarily. That would be rude, and in Python 3.3 it should be unnecessary.

The import job isn't done yet, but so far we've processed 116 million records and had to clean up four of them. I can live with that. Sometimes practicality trumps correctness.

It turns out, the problem is that the version of MySQL we're using doesn't support non-BMP characters. Newer versions do (but you have to declare the column to use the utf8mb4 character set). I could upgrade to a newer MySQL version, but it's just not worth it. Actually, I did try spinning up a 5.5 instance (one of the nice things of being in the cloud) and experimented with that, but couldn't get it to work there either. I'll admit that I didn't invest a huge amount of effort to make that work before just writing this:

def bmp_filter(self, s):
    """Filter a unicode string to remove all non-BMP (basic
    multilingual plane) characters.  All such characters are
    replaced with U+FFFD (Unicode REPLACEMENT CHARACTER)."""
    if all(ord(c) <= 0xffff for c in s):
        return s
    else:
        self.logger.warning("making %r BMP-clean", s)
        bmp_chars = [(c if ord(c) <= 0xffff else u'\ufffd') for c in s]
        return ''.join(bmp_chars)

-- http://mail.python.org/mailman/listinfo/python-list
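One caveat worth noting: a check based on ord(c) assumes a wide (UCS-4) 2.7 build. On a narrow build (the standard Windows installers, for instance), astral characters are stored as surrogate pairs, every ord(c) is at most 0xFFFF, and the filter never fires. A build-agnostic sketch (hypothetical code, reusing the same U+FFFD replacement idea) would have to look for surrogates as well:

import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build: code points above the BMP appear as single characters.
    _astral = re.compile(u'[\U00010000-\U0010FFFF]')
else:
    # Narrow build: they appear as a high surrogate followed by a low surrogate.
    _astral = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

def bmp_clean(s):
    return _astral.sub(u'\ufffd', s)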
Re: Performance of int/long in Python 3
On Apr 1, 5:15 pm, Roy Smith r...@panix.com wrote: In article 515941d8$0$29967$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: [...] OK, that leads to the next question. Is there anyway I can (in Python 2.7) detect when a string is not entirely in the BMP? If I could find all the non-BMP characters, I could replace them with U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough). Of course you can do this, but you should not. If your input data includes character C, you should deal with character C and not just throw it away unnecessarily. That would be rude, and in Python 3.3 it should be unnecessary. The import job isn't done yet, but so far we've processed 116 million records and had to clean up four of them. I can live with that. Sometimes practicality trumps correctness. That works out to 0.03%. Of course I assume it is US only data. Still its good to know how skew the distribution is. -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Mon, 01 Apr 2013 06:11:50 -0700, rusi wrote: On Apr 1, 5:15 pm, Roy Smith r...@panix.com wrote: The import job isn't done yet, but so far we've processed 116 million records and had to clean up four of them. I can live with that. Sometimes practicality trumps correctness. That works out to 0.03%. Of course I assume it is US only data. Still its good to know how skew the distribution is. If the data included Japanese names, or used Emoji, it would be much closer to 100% than 0.03%. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote: In article 515941d8$0$29967$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: [...] OK, that leads to the next question. Is there anyway I can (in Python 2.7) detect when a string is not entirely in the BMP? If I could find all the non-BMP characters, I could replace them with U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough). Of course you can do this, but you should not. If your input data includes character C, you should deal with character C and not just throw it away unnecessarily. That would be rude, and in Python 3.3 it should be unnecessary. The import job isn't done yet, but so far we've processed 116 million records and had to clean up four of them. I can live with that. Sometimes practicality trumps correctness. Well, true. It has to be said that few programming languages (and databases) make it easy to do the right thing. On the other hand, you're a programmer. Your job is to write correct code, not easy code. It turns out, the problem is that the version of MySQL we're using Well there you go. Why don't you use a real database? http://www.postgresql.org/docs/9.2/static/multibyte.html :-) Postgresql has supported non-broken UTF-8 since at least version 8.1. doesn't support non-BMP characters. Newer versions do (but you have to declare the column to use the utf8bm4 character set). I could upgrade to a newer MySQL version, but it's just not worth it. My brain just broke. So-called UTF-8 in MySQL only includes up to a maximum of three-byte characters. There has *never* been a time where UTF-8 excluded four-byte characters. What were the developers thinking, arbitrarily cutting out support for 50% of UTF-8? Actually, I did try spinning up a 5.5 instance (one of the nice things of being in the cloud) and experimented with that, but couldn't get it to work there either. I'll admit that I didn't invest a huge amount of effort to make that work before just writing this: def bmp_filter(self, s): Filter a unicode string to remove all non-BMP (basic multilingual plane) characters. All such characters are replaced with U+FFFD (Unicode REPLACEMENT CHARACTER). I expect that in 5-10 years, applications that remove or mangle non-BMP characters will be considered as unacceptable as applications that mangle BMP characters. Or for that matter, applications that cannot handle names with apostrophes. Hell, if your customer base is in Asia, chances are that mangling non-BMP characters is *already* considered unacceptable. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Tue, Apr 2, 2013 at 4:07 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote: It turns out, the problem is that the version of MySQL we're using Well there you go. Why don't you use a real database? http://www.postgresql.org/docs/9.2/static/multibyte.html :-) Postgresql has supported non-broken UTF-8 since at least version 8.1. Not only that, but I *rely* on PostgreSQL to test-or-reject stuff that comes from untrustworthy languages, like PHP. If it's malformed in any way, it won't get past the database. doesn't support non-BMP characters. Newer versions do (but you have to declare the column to use the utf8bm4 character set). I could upgrade to a newer MySQL version, but it's just not worth it. My brain just broke. So-called UTF-8 in MySQL only includes up to a maximum of three-byte characters. There has *never* been a time where UTF-8 excluded four-byte characters. What were the developers thinking, arbitrarily cutting out support for 50% of UTF-8? Steven, you punctuated that wrongly. What, were the developers *thinking*? Arbitrarily etc? It really is brain-breaking. I could understand a naive UTF-8 codec being too permissive (allowing over-long encodings, allowing codepoints above what's allocated (eg FA 80 80 80 80, which would notionally represent U+200), etc), but why should it arbitrarily stop short? There must have been some internal limitation - that, perhaps, collation was defined only within the BMP. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
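For what it's worth, CPython's own UTF-8 codec is strict about exactly those cases; a quick sanity check under Python 3:

# Overlong encoding: U+0000 encoded in two bytes instead of one.
try:
    b'\xc0\x80'.decode('utf-8')
except UnicodeDecodeError as e:
    print('overlong rejected:', e)

# A lone surrogate (U+D800) encoded as three bytes is also rejected.
try:
    b'\xed\xa0\x80'.decode('utf-8')
except UnicodeDecodeError as e:
    print('surrogate rejected:', e)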
Re: Performance of int/long in Python 3
On 01/04/2013 18:07, Steven D'Aprano wrote: On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote: In article 515941d8$0$29967$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: [...] OK, that leads to the next question. Is there anyway I can (in Python 2.7) detect when a string is not entirely in the BMP? If I could find all the non-BMP characters, I could replace them with U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough). Of course you can do this, but you should not. If your input data includes character C, you should deal with character C and not just throw it away unnecessarily. That would be rude, and in Python 3.3 it should be unnecessary. The import job isn't done yet, but so far we've processed 116 million records and had to clean up four of them. I can live with that. Sometimes practicality trumps correctness. Well, true. It has to be said that few programming languages (and databases) make it easy to do the right thing. On the other hand, you're a programmer. Your job is to write correct code, not easy code. It turns out, the problem is that the version of MySQL we're using Well there you go. Why don't you use a real database? http://www.postgresql.org/docs/9.2/static/multibyte.html :-) Postgresql has supported non-broken UTF-8 since at least version 8.1. doesn't support non-BMP characters. Newer versions do (but you have to declare the column to use the utf8bm4 character set). I could upgrade to a newer MySQL version, but it's just not worth it. My brain just broke. So-called UTF-8 in MySQL only includes up to a maximum of three-byte characters. There has *never* been a time where UTF-8 excluded four-byte characters. What were the developers thinking, arbitrarily cutting out support for 50% of UTF-8? [snip] 50%? The BMP is one of 17 planes, so wouldn't that be 94%? -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
I'm not whining and I'm not complaining (and never did). I have always presented facts. I'm not especially interested in Python, I'm interested in Unicode. Usually when I post examples, they are confirmed.

What I see is this (standard downloadable Pythons on Windows 7, and on other Windows platforms/machines):

Py32
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
timeit.repeat("'a' * 1000 + 'z'")
[0.7105829560031083, 0.6904999426964764, 0.6938637184431968]

Py33
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
timeit.repeat("'a' * 1000 + 'z'")
[0.6640958193635527, 0.6469043692851528, 0.645896142397]

I systematically see such behaviour in 99.9% of my tests. When there is something better, it is usually because something else (3.2/3.3) has been modified. I have my idea where this is coming from.

Question: when it is claimed that this has been tested, do you mean stringbench.py, as proposed many times by Terry? (Thanks for an answer.)

jmf
-- http://mail.python.org/mailman/listinfo/python-list
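[A minimal sketch (Python 3; names are illustrative) of how such a microbenchmark can be normalised before comparing interpreters: take the best of several repeats and report a per-call figure, so run-to-run noise and the iteration count drop out.]

    import timeit

    def best_ns_per_call(stmt, number=1000000, repeat=5):
        # min() over the repeats is the least noisy estimate of the true cost
        return min(timeit.repeat(stmt, number=number, repeat=repeat)) / number * 1e9

    for stmt in ("'a' * 1000 + 'z'", "'a' * 1000 + 'ẞ'"):
        print('%-22s %8.1f ns/call' % (stmt, best_ns_per_call(stmt)))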
Re: Performance of int/long in Python 3
On Tue, Apr 2, 2013 at 6:15 AM, jmfauth wxjmfa...@gmail.com wrote:
> Py32
> import timeit
> timeit.repeat("'a' * 1000 + 'ẞ'")
> [0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
> timeit.repeat("'a' * 1000 + 'z'")
> [0.7105829560031083, 0.6904999426964764, 0.6938637184431968]
>
> Py33
> import timeit
> timeit.repeat("'a' * 1000 + 'ẞ'")
> [1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
> timeit.repeat("'a' * 1000 + 'z'")
> [0.6640958193635527, 0.6469043692851528, 0.645896142397]

This is what's called a microbenchmark. Can you show me any instance in production code where an operation like this is done repeatedly, in a time-critical place? It's a contrived example, and it's usually possible to find regressions in any system if you fiddle enough with the example. Do you have, for instance, a web server that can handle 1000 tps on 3.2 and only 600 tps on 3.3, all other things being equal?

ChrisA
-- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On 01/04/2013 20:15, jmfauth wrote:
> I'm not whining and I'm not complaining (and never did). I have always presented facts.

The only fact I'm aware of is an edge case that is being addressed on the Python bug tracker, sorry I'm too lazy to look up the number again.

> I'm not especially interested in Python, I'm interested in Unicode.

So why do you keep harping on about the same old edge case?

> Usually when I post examples, they are confirmed.

The only thing you've ever posted are the same old boring micro benchmarks. You never, ever comment on the memory savings that are IIRC extremely popular with the Django folks amongst others. Neither do you comment on the fact that the unicode implementation in Python 3.3 is correct. I can only assume that you prefer a fast but buggy implementation to a correct but slow one. Except that in many cases the 3.3 implementation is actually faster, so you conveniently ignore this.

> What I see is this (standard downloadable Pythons on Windows 7, and on other Windows platforms/machines):
>
> Py32
> import timeit
> timeit.repeat("'a' * 1000 + 'ẞ'")
> [0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
> timeit.repeat("'a' * 1000 + 'z'")
> [0.7105829560031083, 0.6904999426964764, 0.6938637184431968]
>
> Py33
> import timeit
> timeit.repeat("'a' * 1000 + 'ẞ'")
> [1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
> timeit.repeat("'a' * 1000 + 'z'")
> [0.6640958193635527, 0.6469043692851528, 0.645896142397]
>
> I systematically see such behaviour in 99.9% of my tests.

Always run on your micro benchmarks, never anything else.

> When there is something better, it is usually because something else (3.2/3.3) has been modified. I have my idea where this is coming from.

I know where this is coming from as it's been stated umpteen times on numerous threads. As usual you simply ignore any facts that you feel like, particularly with respect to any real world use cases.

> Question: when it is claimed that this has been tested, do you mean stringbench.py, as proposed many times by Terry? (Thanks for an answer.)

I find it amusing that you ask for an answer but refuse point blank to provide answers yourself. I suspect that you've bitten off more than you can chew.

> jmf

-- If you're using GoogleCrap™ please read this http://wiki.python.org/moin/GoogleGroupsPython. Mark Lawrence
-- http://mail.python.org/mailman/listinfo/python-list
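[On the memory point, a quick sketch (Python 3.3+ with PEP 393; exact byte counts vary between builds, so treat the printed numbers as illustrative) of how the flexible string representation sizes a string by the widest character it contains.]

    import sys

    ascii_only = 'a' * 1000                 # 1 byte per character (ASCII)
    latin1     = 'a' * 999 + 'é'            # still 1 byte per character
    bmp        = 'a' * 999 + 'ẞ'            # widens to 2 bytes per character
    astral     = 'a' * 999 + '\U0001F600'   # widens to 4 bytes per character

    for s in (ascii_only, latin1, bmp, astral):
        print(len(s), sys.getsizeof(s))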
Re: Performance of int/long in Python 3
On 1 avr, 21:28, Chris Angelico ros...@gmail.com wrote:
> On Tue, Apr 2, 2013 at 6:15 AM, jmfauth wxjmfa...@gmail.com wrote:
>> Py32
>> import timeit
>> timeit.repeat("'a' * 1000 + 'ẞ'")
>> [0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
>> timeit.repeat("'a' * 1000 + 'z'")
>> [0.7105829560031083, 0.6904999426964764, 0.6938637184431968]
>>
>> Py33
>> import timeit
>> timeit.repeat("'a' * 1000 + 'ẞ'")
>> [1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
>> timeit.repeat("'a' * 1000 + 'z'")
>> [0.6640958193635527, 0.6469043692851528, 0.645896142397]
> This is what's called a microbenchmark. Can you show me any instance in production code where an operation like this is done repeatedly, in a time-critical place? It's a contrived example, and it's usually possible to find regressions in any system if you fiddle enough with the example. Do you have, for instance, a web server that can handle 1000 tps on 3.2 and only 600 tps on 3.3, all other things being equal?
> ChrisA

Of course this is an example, like many others I gave. Examples you may find in apps. Can you point to and give at least a bunch of examples showing there is no regression, at least to contradict me? The only one I have succeeded in seeing (in months) is the one given by Steven, a status quo. I will happily accept them. The only thing I read is "this is faster, it has been tested, ...".

jmf
-- http://mail.python.org/mailman/listinfo/python-list
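[One way to go beyond a single concatenation would be something along these lines - a sketch only; the operations and counts are my own choices, and the numbers it prints say nothing by themselves until the script is run on both interpreters side by side.]

    import timeit

    # A handful of common string operations, timed per call in microseconds.
    CASES = {
        'concat latin-1': "'a' * 1000 + 'z'",
        'concat astral':  "'a' * 1000 + '\\U0001F600'",
        'find':           "('a' * 1000 + 'z').find('z')",
        'upper':          "('a' * 1000).upper()",
        'encode utf-8':   "('a' * 1000).encode('utf-8')",
    }

    for name, stmt in sorted(CASES.items()):
        per_call = min(timeit.repeat(stmt, number=100000, repeat=3)) / 100000
        print('%-15s %8.2f us/call' % (name, per_call * 1e6))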
Re: Performance of int/long in Python 3
In article 5159beb6$0$29967$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
>> The import job isn't done yet, but so far we've processed 116 million records and had to clean up four of them. I can live with that. Sometimes practicality trumps correctness.
> Well, true. It has to be said that few programming languages (and databases) make it easy to do the right thing. On the other hand, you're a programmer. Your job is to write correct code, not easy code.

This is really getting off topic, but fundamentally, I'm an engineer. My job is to build stuff that makes money for my company. That means making judgement calls about what's not worth fixing, because the cost to fix it exceeds the value.
-- http://mail.python.org/mailman/listinfo/python-list
Re: Performance of int/long in Python 3
On Tue, Apr 2, 2013 at 7:28 AM, jmfauth wxjmfa...@gmail.com wrote:
> Of course this is an example, like many others I gave. Examples you may find in apps.

Show me an *entire app* that suffers. Show me one.

ChrisA
-- http://mail.python.org/mailman/listinfo/python-list