Re: [Numpy-discussion] aligned / unaligned structured dtype behavior (was: GSOC 2013)

2013-03-07 Thread Francesc Alted
On 3/6/13 7:42 PM, Kurt Smith wrote:
 And regarding performance, doing simple timings shows a 30%-ish
 slowdown for unaligned operations:

 In [36]: %timeit packed_arr['b']**2
 100 loops, best of 3: 2.48 ms per loop

 In [37]: %timeit aligned_arr['b']**2
 1000 loops, best of 3: 1.9 ms per loop

Hmm, that clearly depends on the architecture.  On my machine:

In [1]: import numpy as np

In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)

In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)

In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)

In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)

In [6]: baligned = aligned_arr['b']

In [7]: bpacked = packed_arr['b']

In [8]: %timeit baligned**2
1000 loops, best of 3: 1.96 ms per loop

In [9]: %timeit bpacked**2
100 loops, best of 3: 7.84 ms per loop

That is, the unaligned column is 4x slower (!).  numexpr gives somewhat 
better results:

In [11]: %timeit numexpr.evaluate('baligned**2')
1000 loops, best of 3: 1.13 ms per loop

In [12]: %timeit numexpr.evaluate('bpacked**2')
1000 loops, best of 3: 865 us per loop

Yes, in this case the unaligned array goes faster (by as much as 30%).  I 
think the reason is that numexpr optimizes the unaligned access by 
copying the different chunks into internal buffers that fit in L1 
cache.  Apparently this is very beneficial in this case (not sure why, 
though).


 Whereas summing shows just a 10%-ish slowdown:

 In [38]: %timeit packed_arr['b'].sum()
 1000 loops, best of 3: 1.29 ms per loop

 In [39]: %timeit aligned_arr['b'].sum()
 1000 loops, best of 3: 1.14 ms per loop

On my machine:

In [14]: %timeit baligned.sum()
1000 loops, best of 3: 1.03 ms per loop

In [15]: %timeit bpacked.sum()
100 loops, best of 3: 3.79 ms per loop

Again, the 4x slowdown is here.  Using numexpr:

In [16]: %timeit numexpr.evaluate('sum(baligned)')
100 loops, best of 3: 2.16 ms per loop

In [17]: %timeit numexpr.evaluate('sum(bpacked)')
100 loops, best of 3: 2.08 ms per loop

Again, the unaligned case is slightly better.  In this case numexpr is 
a bit slower than NumPy because sum() is not parallelized internally.  
Hmm, given that, I'm wondering if some internal copies to L1 buffers in 
NumPy could help improve unaligned performance. Worth a try?
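That blocked-copy idea can be sketched in a few lines of Python; the chunk size and the buffering scheme below are illustrative assumptions for this thread, not numexpr's actual internals:

```python
import numpy as np

packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
packed_arr = np.ones(10**6, dtype=packed_dt)
bpacked = packed_arr['b']  # strided, unaligned view into the records

# Blocked evaluation: copy each chunk into a small contiguous,
# aligned buffer (sized to fit comfortably in L1/L2 cache) and
# run the vectorized operation on the copy.
CHUNK = 4096  # illustrative chunk size
out = np.empty(bpacked.shape, dtype=bpacked.dtype)
buf = np.empty(CHUNK, dtype=bpacked.dtype)
for start in range(0, len(bpacked), CHUNK):
    stop = min(start + CHUNK, len(bpacked))
    n = stop - start
    buf[:n] = bpacked[start:stop]   # contiguous, aligned copy
    out[start:stop] = buf[:n] ** 2  # fast op on the aligned buffer
```

Whether the copy overhead pays for itself will depend on the CPU and on how expensive the operation is, which would be consistent with the mixed timings above.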

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] aligned / unaligned structured dtype behavior (was: GSOC 2013)

2013-03-07 Thread Kurt Smith
On Thu, Mar 7, 2013 at 11:47 AM, Francesc Alted franc...@continuum.io wrote:
 On 3/6/13 7:42 PM, Kurt Smith wrote:

 Hmm, that clearly depends on the architecture.  On my machine:
 ...
 That is, the unaligned column is 4x slower (!).  numexpr gives somewhat
 better results:
 ...
 Yes, in this case the unaligned array goes faster (by as much as 30%).  I
 think the reason is that numexpr optimizes the unaligned access by
 copying the different chunks into internal buffers that fit in L1
 cache.  Apparently this is very beneficial in this case (not sure why,
 though).

 On my machine:
 ...
 Again, the 4x slowdown is here.  Using numexpr:
 ...
 Again, the unaligned case is slightly better.  In this case numexpr is
 a bit slower than NumPy because sum() is not parallelized internally.
 Hmm, given that, I'm wondering if some internal copies to L1 buffers in
 NumPy could help improve unaligned performance. Worth a try?


Very interesting -- thanks for sharing.

 --
 Francesc Alted


[Numpy-discussion] aligned / unaligned structured dtype behavior (was: GSOC 2013)

2013-03-06 Thread Kurt Smith
On Wed, Mar 6, 2013 at 12:12 PM, Kurt Smith kwmsm...@gmail.com wrote:
 On Wed, Mar 6, 2013 at 4:29 AM, Francesc Alted franc...@continuum.io wrote:

 I would not rush too much.  The example above takes 9 bytes to host the
 structure, while an `aligned=True` one will take 16 bytes.  I'd rather
 leave the default as it is, and in case performance is critical, you can
 always copy the unaligned field to a new (homogeneous) array.

 Yes, I can absolutely see the case you're making here, and I made my
 vote with the understanding that `aligned=False` will almost
 certainly stay the default.  Adding 'aligned=True' is simple for me to
 do, so no harm done.

 My case is based on what's the least surprising behavior: C structs /
 all C compilers, the builtin `struct` module, and ctypes `Structure`
 subclasses all use padding to ensure aligned fields by default.  You
 can turn this off to get packed structures, but the default behavior
 in these other places is alignment, which is why I was surprised when
 I first saw that NumPy structured dtypes are packed by default.
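Those defaults are easy to check directly; the sizes in the comments below assume a typical 64-bit platform, where an 8-byte integer wants 8-byte alignment:

```python
import struct
import ctypes
import numpy as np

# stdlib struct: '@' (native) pads fields, '=' (standard) packs them
print(struct.calcsize('@bq'))  # 16 on typical 64-bit platforms
print(struct.calcsize('=bq'))  # 9

# ctypes: aligned by default; _pack_ = 1 turns padding off
class Aligned(ctypes.Structure):
    _fields_ = [('a', ctypes.c_int8), ('b', ctypes.c_int64)]

class Packed(ctypes.Structure):
    _pack_ = 1
    _fields_ = [('a', ctypes.c_int8), ('b', ctypes.c_int64)]

print(ctypes.sizeof(Aligned))  # 16 on typical 64-bit platforms
print(ctypes.sizeof(Packed))   # 9

# NumPy is the odd one out: packed unless align=True is requested
print(np.dtype([('a', 'i1'), ('b', 'i8')]).itemsize)              # 9
print(np.dtype([('a', 'i1'), ('b', 'i8')], align=True).itemsize)  # 16
```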


Some surprises with aligned / unaligned arrays:

#-

import numpy as np

packed_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=False)
aligned_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=True)

packed_arr = np.ones((10**6,), dtype=packed_dt)
aligned_arr = np.ones((10**6,), dtype=aligned_dt)

print all(packed_arr['a'] == aligned_arr['a']), np.all(packed_arr['a'] == aligned_arr['a'])  # True
print all(packed_arr['b'] == aligned_arr['b']), np.all(packed_arr['b'] == aligned_arr['b'])  # True
print all(packed_arr == aligned_arr), np.all(packed_arr == aligned_arr)  # False (!!)

#-

I can understand what's likely going on under the covers that makes
these arrays not compare equal, but I'd expect that if all columns of
two structured arrays are everywhere equal, then the arrays themselves
would be everywhere equal.  Bug?
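Until that is resolved, a field-by-field comparison gives the expected answer. `structured_equal` below is a hypothetical helper written for this thread, not a NumPy API (and newer NumPy releases do implement elementwise `==` for matching structured dtypes):

```python
import numpy as np

def structured_equal(x, y):
    """Field-wise equality for structured arrays with the same field names."""
    if x.dtype.names != y.dtype.names:
        return False
    return all(np.array_equal(x[name], y[name]) for name in x.dtype.names)

packed_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=False)
aligned_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=True)
packed_arr = np.ones(10**6, dtype=packed_dt)
aligned_arr = np.ones(10**6, dtype=aligned_dt)

print(structured_equal(packed_arr, aligned_arr))  # True
```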

And regarding performance, doing simple timings shows a 30%-ish
slowdown for unaligned operations:

In [36]: %timeit packed_arr['b']**2
100 loops, best of 3: 2.48 ms per loop

In [37]: %timeit aligned_arr['b']**2
1000 loops, best of 3: 1.9 ms per loop

Whereas summing shows just a 10%-ish slowdown:

In [38]: %timeit packed_arr['b'].sum()
1000 loops, best of 3: 1.29 ms per loop

In [39]: %timeit aligned_arr['b'].sum()
1000 loops, best of 3: 1.14 ms per loop


Re: [Numpy-discussion] aligned / unaligned structured dtype behavior (was: GSOC 2013)

2013-03-06 Thread Sebastian Berg
On Wed, 2013-03-06 at 12:42 -0600, Kurt Smith wrote:
 On Wed, Mar 6, 2013 at 12:12 PM, Kurt Smith kwmsm...@gmail.com wrote:
  On Wed, Mar 6, 2013 at 4:29 AM, Francesc Alted franc...@continuum.io 
  wrote:
 
  I would not rush too much.  The example above takes 9 bytes to host the
  structure, while an `aligned=True` one will take 16 bytes.  I'd rather
  leave the default as it is, and in case performance is critical, you can
  always copy the unaligned field to a new (homogeneous) array.
 
  Yes, I can absolutely see the case you're making here, and I made my
  vote with the understanding that `aligned=False` will almost
  certainly stay the default.  Adding 'aligned=True' is simple for me to
  do, so no harm done.
 
  My case is based on what's the least surprising behavior: C structs /
  all C compilers, the builtin `struct` module, and ctypes `Structure`
  subclasses all use padding to ensure aligned fields by default.  You
  can turn this off to get packed structures, but the default behavior
  in these other places is alignment, which is why I was surprised when
  I first saw that NumPy structured dtypes are packed by default.
 
 
 Some surprises with aligned / unaligned arrays:
 
 #-
 
 import numpy as np
 
 packed_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=False)
 aligned_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=True)
 
 packed_arr = np.ones((10**6,), dtype=packed_dt)
 aligned_arr = np.ones((10**6,), dtype=aligned_dt)
 
 print all(packed_arr['a'] == aligned_arr['a']), np.all(packed_arr['a'] == aligned_arr['a'])  # True
 print all(packed_arr['b'] == aligned_arr['b']), np.all(packed_arr['b'] == aligned_arr['b'])  # True
 print all(packed_arr == aligned_arr), np.all(packed_arr == aligned_arr)  # False (!!)
 
 #-
 
 I can understand what's likely going on under the covers that makes
 these arrays not compare equal, but I'd expect that if all columns of
 two structured arrays are everywhere equal, then the arrays themselves
 would be everywhere equal.  Bug?
 

Yes and no... equality for structured types seems not to be implemented;
you get the same (wrong) False also with (packed_arr == packed_arr). But
if the types are equivalent and np.equal is not implemented, just
returning False is a bit dangerous, I agree. Not sure what the solution
is exactly; I think the == operator could raise an error instead of
swallowing these cases, though.

- Sebastian

 And regarding performance, doing simple timings shows a 30%-ish
 slowdown for unaligned operations:
 
 In [36]: %timeit packed_arr['b']**2
 100 loops, best of 3: 2.48 ms per loop
 
 In [37]: %timeit aligned_arr['b']**2
 1000 loops, best of 3: 1.9 ms per loop
 
 Whereas summing shows just a 10%-ish slowdown:
 
 In [38]: %timeit packed_arr['b'].sum()
 1000 loops, best of 3: 1.29 ms per loop
 
 In [39]: %timeit aligned_arr['b'].sum()
 1000 loops, best of 3: 1.14 ms per loop
 

