[Numpy-discussion] poor performance of sum with sub-machine-word integer types

2011-06-21 Thread Zachary Pincus
Hello all,

As a result of the fast greyscale conversion thread, I noticed an anomaly 
with numpy.ndarray.sum(): summing along certain axes is much slower with 
sum() than doing it explicitly, but only with integer dtypes and when the 
size of the dtype is less than the machine word. I checked in 32-bit and 
64-bit modes, and in both cases the speed difference went away only once 
the dtype was as large as the machine word. See below...

Is this something to do with numpy or something inexorable about machine / 
memory architecture?

Zach

Timings -- 64-bit mode:
--
In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
In [3]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop
In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 2.57 ms per loop

In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
In [6]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop
In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 4.75 ms per loop

In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
In [9]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop
In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 6.37 ms per loop

In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
In [12]: timeit i.sum(axis=-1)
100 loops, best of 3: 16.6 ms per loop
In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 15.1 ms per loop



Timings -- 32-bit mode:
--
In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
In [3]: timeit i.sum(axis=-1)
10 loops, best of 3: 138 ms per loop
In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 3.68 ms per loop

In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
In [6]: timeit i.sum(axis=-1)
10 loops, best of 3: 140 ms per loop
In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 4.17 ms per loop

In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
In [9]: timeit i.sum(axis=-1)
10 loops, best of 3: 22.4 ms per loop
In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 12.2 ms per loop

In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
In [12]: timeit i.sum(axis=-1)
10 loops, best of 3: 29.2 ms per loop
In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
10 loops, best of 3: 23.8 ms per loop

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] poor performance of sum with sub-machine-word integer types

2011-06-21 Thread Charles R Harris
On Tue, Jun 21, 2011 at 10:46 AM, Zachary Pincus zachary.pin...@yale.edu wrote:

 Hello all,

 As a result of the fast greyscale conversion thread, I noticed an anomaly
 with numpy.ndarray.sum(): summing along certain axes is much slower with
 sum() than doing it explicitly, but only with integer dtypes and when
 the size of the dtype is less than the machine word. I checked in 32-bit and
 64-bit modes, and in both cases the speed difference went away only once
 the dtype was as large as the machine word. See below...

 Is this something to do with numpy or something inexorable about machine /
 memory architecture?


It's because of the type conversion sum uses by default for greater
precision.

 In [8]: timeit i.sum(axis=-1)
10 loops, best of 3: 140 ms per loop

In [9]: timeit i.sum(axis=-1, dtype=int8)
100 loops, best of 3: 16.2 ms per loop
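The dtype behavior behind this is easy to check directly. A minimal sketch (timings aside; the exact default accumulator depends on the platform's default int):

```python
import numpy as np

i = np.ones((1024, 1024, 4), np.int8)

# By default sum() accumulates in (and returns) a wider integer type,
# which is what triggers the slower conversion path.
print(i.sum(axis=-1).dtype)  # typically int64 on a 64-bit build

# Passing dtype= keeps the accumulator at int8 and avoids the upcast,
# but int8 overflows past 127, so this is only safe for small sums.
s = i.sum(axis=-1, dtype=np.int8)
print(s.dtype)  # int8
```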

If you have numpy 1.6, einsum is faster and also conserves the type:

In [10]: timeit einsum('ijk->ij', i)
100 loops, best of 3: 5.95 ms per loop


We could probably make better loops for summing within kinds, i.e.,
accumulate in higher precision, then cast to specified precision.
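That idea can be sketched at the Python level (a hypothetical helper, just to illustrate the strategy, not how the C loops would actually be written):

```python
import numpy as np

def sum_within_kind(a, axis=-1):
    """Illustrative sketch: accumulate in a wide integer type, then
    cast the (much smaller) result array back to the input dtype."""
    return a.sum(axis=axis, dtype=np.int64).astype(a.dtype)

i = np.ones((1024, 1024, 4), np.int8)
out = sum_within_kind(i)
print(out.dtype)  # int8; values wrap if the true sum overflows int8
```

The cast touches only the reduced output, which is a factor of four smaller here than the input, so its cost is minor compared with the reduction itself.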

snip

Chuck


Re: [Numpy-discussion] poor performance of sum with sub-machine-word integer types

2011-06-21 Thread Keith Goodman
On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus zachary.pin...@yale.edu wrote:
 Hello all,

 As a result of the fast greyscale conversion thread, I noticed an anomaly 
 with numpy.ndarray.sum(): summing along certain axes is much slower with 
 sum() than doing it explicitly, but only with integer dtypes and when 
 the size of the dtype is less than the machine word. I checked in 32-bit and 
 64-bit modes, and in both cases the speed difference went away only once 
 the dtype was as large as the machine word. See below...

 Is this something to do with numpy or something inexorable about machine / 
 memory architecture?

 Zach

 snip

One difference is that i.sum() changes the output dtype when the input's
int dtype is smaller than the default int dtype:

>>> i.dtype
dtype('int32')
>>> i.sum(axis=-1).dtype
dtype('int64')  # <-- dtype changed
>>> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype
dtype('int32')
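That upcast exists for a reason: accumulating in the input dtype can silently overflow. A quick illustration of the hazard (assuming numpy imported as np):

```python
import numpy as np

# 100 * 2 = 200, which does not fit in int8 (max 127).
a = np.full(100, 2, dtype=np.int8)

print(a.sum(dtype=np.int8))  # 200 wraps around to -56
print(a.sum())               # 200, accumulated in a wider int
```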

Here are my timings:

>>> i = numpy.ones((1024,1024,4), numpy.int32)
>>> timeit i.sum(axis=-1)
1 loops, best of 3: 278 ms per loop
>>> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 12.1 ms per loop
>>> import bottleneck as bn
>>> timeit bn.func.nansum_3d_int32_axis2(i)
100 loops, best of 3: 8.27 ms per loop

Does making an extra copy of the input explain all of the speed
difference (is this what np.sum does internally?):

>>> timeit i.astype(numpy.int64)
10 loops, best of 3: 29.2 ms per loop

No.

Initializing the output also adds some time:

>>> timeit np.empty((1024,1024,4), dtype=np.int32)
10 loops, best of 3: 2.67 us per loop
>>> timeit np.empty((1024,1024,4), dtype=np.int64)
10 loops, best of 3: 12.8 us per loop

Switching back and forth between the input and output arrays also costs
more memory time with int64 arrays than with int32.
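The byte-count difference behind that memory-traffic point can be checked directly (a sketch, not a benchmark):

```python
import numpy as np

a32 = np.ones((1024, 1024, 4), np.int32)
a64 = a32.astype(np.int64)

# An int64 array of the same shape holds twice the bytes of an int32
# one, so every pass over it costs roughly twice the memory traffic.
print(a32.nbytes)  # 16777216 bytes (16 MiB)
print(a64.nbytes)  # 33554432 bytes (32 MiB)
```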


Re: [Numpy-discussion] poor performance of sum with sub-machine-word integer types

2011-06-21 Thread Charles R Harris
On Tue, Jun 21, 2011 at 11:17 AM, Keith Goodman kwgood...@gmail.com wrote:

 On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus zachary.pin...@yale.edu
 wrote:
  Hello all,
 
  As a result of the fast greyscale conversion thread, I noticed an
 anomaly with numpy.ndarray.sum(): summing along certain axes is much
 slower with sum() than doing it explicitly, but only with integer
 dtypes and when the size of the dtype is less than the machine word. I
 checked in 32-bit and 64-bit modes, and in both cases the speed difference
 went away only once the dtype was as large as the machine word. See below...
 
  Is this something to do with numpy or something inexorable about machine
 / memory architecture?
 
  Zach
 
  Timings -- 64-bit mode:
  --
  In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
  In [3]: timeit i.sum(axis=-1)
  10 loops, best of 3: 131 ms per loop
  In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
  100 loops, best of 3: 2.57 ms per loop
 
  In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
  In [6]: timeit i.sum(axis=-1)
  10 loops, best of 3: 131 ms per loop
  In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
  100 loops, best of 3: 4.75 ms per loop
 
  In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
  In [9]: timeit i.sum(axis=-1)
  10 loops, best of 3: 131 ms per loop
  In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
  100 loops, best of 3: 6.37 ms per loop
 
  In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
  In [12]: timeit i.sum(axis=-1)
  100 loops, best of 3: 16.6 ms per loop
  In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
  100 loops, best of 3: 15.1 ms per loop
 
 
 
  Timings -- 32-bit mode:
  --
  In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
  In [3]: timeit i.sum(axis=-1)
  10 loops, best of 3: 138 ms per loop
  In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
  100 loops, best of 3: 3.68 ms per loop
 
  In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
  In [6]: timeit i.sum(axis=-1)
  10 loops, best of 3: 140 ms per loop
  In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
  100 loops, best of 3: 4.17 ms per loop
 
  In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
  In [9]: timeit i.sum(axis=-1)
  10 loops, best of 3: 22.4 ms per loop
  In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
  100 loops, best of 3: 12.2 ms per loop
 
  In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
  In [12]: timeit i.sum(axis=-1)
  10 loops, best of 3: 29.2 ms per loop
  In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
  10 loops, best of 3: 23.8 ms per loop

 One difference is that i.sum() changes the output dtype of int input
 when the int dtype is less than the default int dtype:

  snip

 Does making an extra copy of the input explain all of the speed
 difference (is this what np.sum does internally?):

 timeit i.astype(numpy.int64)
 10 loops, best of 3: 29.2 ms per loop

 No.


I think you can see the overhead here:

In [14]: timeit einsum('ijk->ij', i, dtype=int32)
100 loops, best of 3: 17.6 ms per loop

In [15]: timeit einsum('ijk->ij', i, dtype=int64)
100 loops, best of 3: 18 ms per loop

In [16]: timeit einsum('ijk->ij', i, dtype=int16)
100 loops, best of 3: 18.3 ms per loop

In [17]: timeit einsum('ijk->ij', i, dtype=int8)
100 loops, best of 3: 5.87 ms per loop


 Initializing the output also adds some time:

 timeit np.empty((1024,1024,4), dtype=np.int32)
10 loops, best of 3: 2.67 us per loop
 timeit np.empty((1024,1024,4), dtype=np.int64)
10 loops, best of 3: 12.8 us per loop

 Switching back and forth between the input and output array takes more
 memory time too with int64 arrays compared to int32.


Chuck


Re: [Numpy-discussion] poor performance of sum with sub-machine-word integer types

2011-06-21 Thread Zachary Pincus
On Jun 21, 2011, at 1:16 PM, Charles R Harris wrote:

 It's because of the type conversion sum uses by default for greater precision.

Aah, makes sense. Thanks for the detailed explanations and timings!