[Numpy-discussion] poor performance of sum with sub-machine-word integer types
Hello all,

As a result of the fast greyscale conversion thread, I noticed an anomaly with numpy.ndarray.sum(): summing along certain axes is much slower with sum() than doing it explicitly, but only with integer dtypes, and only when the size of the dtype is less than the machine word. I checked in 32-bit and 64-bit modes, and in both cases the speed difference only went away once the dtype was as large as the machine word. See below...

Is this something to do with numpy, or something inexorable about the machine / memory architecture?

Zach

Timings -- 64-bit mode:
-----------------------
In [2]: i = numpy.ones((1024,1024,4), numpy.int8)

In [3]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop

In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 2.57 ms per loop

In [5]: i = numpy.ones((1024,1024,4), numpy.int16)

In [6]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop

In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 4.75 ms per loop

In [8]: i = numpy.ones((1024,1024,4), numpy.int32)

In [9]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop

In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 6.37 ms per loop

In [11]: i = numpy.ones((1024,1024,4), numpy.int64)

In [12]: timeit i.sum(axis=-1)
100 loops, best of 3: 16.6 ms per loop

In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 15.1 ms per loop

Timings -- 32-bit mode:
-----------------------
In [2]: i = numpy.ones((1024,1024,4), numpy.int8)

In [3]: timeit i.sum(axis=-1)
10 loops, best of 3: 138 ms per loop

In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 3.68 ms per loop

In [5]: i = numpy.ones((1024,1024,4), numpy.int16)

In [6]: timeit i.sum(axis=-1)
10 loops, best of 3: 140 ms per loop

In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 4.17 ms per loop

In [8]: i = numpy.ones((1024,1024,4), numpy.int32)

In [9]: timeit i.sum(axis=-1)
10 loops, best of 3: 22.4 ms per loop

In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 12.2 ms per loop

In [11]: i = numpy.ones((1024,1024,4), numpy.int64)

In [12]: timeit i.sum(axis=-1)
10 loops, best of 3: 29.2 ms per loop

In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
10 loops, best of 3: 23.8 ms per loop

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
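The comparison above can also be reproduced outside IPython with the stdlib timeit module. This is just a sketch using the same shape and dtype as the post; the absolute numbers will of course differ by machine and numpy version, and on recent numpy the gap may be much smaller than reported here:

```python
import timeit

import numpy as np

# Same shape and dtype as in the post: a 1024x1024 image with 4 int8 channels.
i = np.ones((1024, 1024, 4), np.int8)

# Reduce over the last axis two ways and compare wall-clock time.
t_sum = timeit.timeit(lambda: i.sum(axis=-1), number=10)
t_explicit = timeit.timeit(
    lambda: i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3], number=10
)

print(f"i.sum(axis=-1): {t_sum / 10 * 1e3:.2f} ms per loop")
print(f"explicit adds:  {t_explicit / 10 * 1e3:.2f} ms per loop")
```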
Re: [Numpy-discussion] poor performance of sum with sub-machine-word integer types
On Tue, Jun 21, 2011 at 10:46 AM, Zachary Pincus <zachary.pin...@yale.edu> wrote:

> Hello all,
>
> As a result of the fast greyscale conversion thread, I noticed an anomaly with numpy.ndarray.sum(): summing along certain axes is much slower with sum() than doing it explicitly, but only with integer dtypes, and only when the size of the dtype is less than the machine word. I checked in 32-bit and 64-bit modes, and in both cases the speed difference only went away once the dtype was as large as the machine word. See below...
>
> Is this something to do with numpy, or something inexorable about the machine / memory architecture?

It's because of the type conversion sum uses by default for greater precision.

In [8]: timeit i.sum(axis=-1)
10 loops, best of 3: 140 ms per loop

In [9]: timeit i.sum(axis=-1, dtype=int8)
100 loops, best of 3: 16.2 ms per loop

If you have 1.6, einsum is faster but also conserves the type:

In [10]: timeit einsum('ijk->ij', i)
100 loops, best of 3: 5.95 ms per loop

We could probably make better loops for summing within kinds, i.e., accumulate in higher precision, then cast to the specified precision.

<snip>

Chuck
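The point above can be checked with a small self-contained snippet: sum() accumulates in the platform default integer unless dtype= is given, while einsum (numpy >= 1.6) conserves the input dtype. This is a sketch; note that forcing dtype=int8 risks overflow for data whose row sums do not fit in int8:

```python
import numpy as np

i = np.ones((1024, 1024, 4), np.int8)

# Default: sum() upcasts small integer dtypes to the platform default int,
# so every element is converted before it is accumulated.
full = i.sum(axis=-1)

# Forcing the accumulator dtype skips the widening conversion
# (at the risk of overflow: row sums must fit in int8).
narrow = i.sum(axis=-1, dtype=np.int8)

# einsum conserves the input type.
es = np.einsum('ijk->ij', i)

print(full.dtype, narrow.dtype, es.dtype)
```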
Re: [Numpy-discussion] poor performance of sum with sub-machine-word integer types
On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus <zachary.pin...@yale.edu> wrote:

> Hello all,
>
> As a result of the fast greyscale conversion thread, I noticed an anomaly with numpy.ndarray.sum(): summing along certain axes is much slower with sum() than doing it explicitly, but only with integer dtypes and when the size of the dtype is less than the machine word. [...]
>
> [original timings snipped]

One difference is that i.sum() changes the output dtype of int input when the int dtype is less than the default int dtype:

>>> i.dtype
dtype('int32')
>>> i.sum(axis=-1).dtype
dtype('int64')   # <-- dtype changed
>>> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype
dtype('int32')

Here are my timings:

>>> i = numpy.ones((1024,1024,4), numpy.int32)
>>> timeit i.sum(axis=-1)
1 loops, best of 3: 278 ms per loop
>>> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 12.1 ms per loop
>>> import bottleneck as bn
>>> timeit bn.func.nansum_3d_int32_axis2(i)
100 loops, best of 3: 8.27 ms per loop

Does making an extra copy of the input explain all of the speed difference (is this what np.sum does internally?):

>>> timeit i.astype(numpy.int64)
10 loops, best of 3: 29.2 ms per loop

No. Initializing the output also adds some time:

>>> timeit np.empty((1024,1024,4), dtype=np.int32)
10 loops, best of 3: 2.67 us per loop
>>> timeit np.empty((1024,1024,4), dtype=np.int64)
10 loops, best of 3: 12.8 us per loop

Switching back and forth between the input and output array takes more memory time too with int64 arrays compared to int32.
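The dtype promotion and the extra memory traffic it implies can be verified directly. A minimal sketch (shown for a 64-bit build, where the default integer is 64-bit; the exact promoted dtype is platform-dependent):

```python
import numpy as np

i = np.ones((1024, 1024, 4), np.int32)

# sum() widens int dtypes smaller than the default int, while explicit
# adds keep the input dtype.
print(i.dtype)                                                # int32
print(i.sum(axis=-1).dtype)                                   # widened on 64-bit builds
print((i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3]).dtype)  # int32

# Widening int32 to int64 means touching twice as many bytes per element.
print(i.astype(np.int64).nbytes // i.nbytes)                  # 2
```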
Re: [Numpy-discussion] poor performance of sum with sub-machine-word integer types
On Tue, Jun 21, 2011 at 11:17 AM, Keith Goodman <kwgood...@gmail.com> wrote:

> [quoted message and timings snipped]
>
> Does making an extra copy of the input explain all of the speed difference (is this what np.sum does internally?):
>
> >>> timeit i.astype(numpy.int64)
> 10 loops, best of 3: 29.2 ms per loop
>
> No.

I think you can see the overhead here:

In [14]: timeit einsum('ijk->ij', i, dtype=int32)
100 loops, best of 3: 17.6 ms per loop

In [15]: timeit einsum('ijk->ij', i, dtype=int64)
100 loops, best of 3: 18 ms per loop

In [16]: timeit einsum('ijk->ij', i, dtype=int16)
100 loops, best of 3: 18.3 ms per loop

In [17]: timeit einsum('ijk->ij', i, dtype=int8)
100 loops, best of 3: 5.87 ms per loop

> Initializing the output also adds some time:
>
> >>> timeit np.empty((1024,1024,4), dtype=np.int32)
> 10 loops, best of 3: 2.67 us per loop
> >>> timeit np.empty((1024,1024,4), dtype=np.int64)
> 10 loops, best of 3: 12.8 us per loop
>
> Switching back and forth between the input and output array takes more memory time too with int64 arrays compared to int32.

Chuck
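Chuck's suggestion upthread (accumulate at higher precision within a kind, then cast to the requested precision) can be sketched in pure numpy. This is a hypothetical illustration of the idea, not how numpy's C loops are actually implemented; `sum_cast` is a made-up helper name:

```python
import numpy as np

def sum_cast(a, axis=-1, out_dtype=None):
    # Hypothetical helper: accumulate in a wide integer to avoid overflow,
    # then cast down so the result keeps the input's (or requested) dtype.
    out_dtype = a.dtype if out_dtype is None else out_dtype
    acc = a.sum(axis=axis, dtype=np.int64)
    return acc.astype(out_dtype)

i = np.ones((1024, 1024, 4), np.int8)
s = sum_cast(i)
print(s.dtype)  # int8, matching the dtype-conserving einsum('ijk->ij', i)
```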
Re: [Numpy-discussion] poor performance of sum with sub-machine-word integer types
On Jun 21, 2011, at 1:16 PM, Charles R Harris wrote:

> It's because of the type conversion sum uses by default for greater precision.

Aah, makes sense. Thanks for the detailed explanations and timings!