Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread David Cournapeau
On Wed, Dec 19, 2012 at 6:03 PM, Francesc Alted  wrote:
> On 12/19/12 5:47 PM, Henry Gomersall wrote:
>> On Wed, 2012-12-19 at 15:57 +, Nathaniel Smith wrote:
>>> Not sure which interface is more useful to users. On the one hand,
>>> using funny dtypes makes regular non-SIMD access more cumbersome, and
>>> it forces your array size to be a multiple of the SIMD word size,
>>> which might be inconvenient if your code is smart enough to handle
>>> arbitrary-sized arrays with partial SIMD acceleration (i.e., using
>>> SIMD for most of the array, and then a slow path to handle any partial
>>> word at the end). OTOH, if your code *is* that smart, you should
>>> probably just make it smart enough to handle a partial word at the
>>> beginning as well and then you won't need any special alignment in the
>>> first place, and representing each SIMD word as a single numpy scalar
>>> is an intuitively appealing model of how SIMD works. OTOOH, just
>>> adding a single argument to np.array() is much simpler to explain than
>>> some elaborate scheme involving the creation of special custom dtypes.
>> If it helps, my use-case is in wrapping the FFTW library. This _is_
>> smart enough to deal with unaligned arrays, but it just results in a
>> performance penalty. In the case of an FFT, there are clearly going to
>> be issues with the powers of two indices in the array not lying on a
>> suitable n-byte boundary (which would be the case with a misaligned
>> array), but I imagine it's not unique.
>>
>> The other point is that it's easy to create a suitable power of two
>> array that should always bypass any special case unaligned code (e.g.
>> with floats, any multiple of 4 array length will fill every 16-byte
>> word).
>>
>> Finally, I think there is significant value in auto-aligning the array
>> based on an appropriate inspection of the cpu capabilities (or
>> alternatively, a function that reports back the appropriate SIMD
>> alignment). Again, this makes it easier to wrap libraries that may
>> function with any alignment, but benefit from optimum alignment.
>
> Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on
> systems (Linux and Mac OSX):

Only by accident, at least on Linux. The pointers returned by the GNU
libc malloc are at least 8-byte aligned, but they may not be 16-byte
aligned when you're above the threshold where mmap is used for malloc.

The difference between aligned and unaligned RAM <-> SSE register moves
(e.g. movaps vs. movups) used to be significant. I don't know if that's
still the case for recent CPUs.

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Nathaniel Smith
On Wed, Dec 19, 2012 at 6:25 PM, Henry Gomersall  wrote:
> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
> 
>> > Finally, I think there is significant value in auto-aligning the array
>> > based on an appropriate inspection of the cpu capabilities (or
>> > alternatively, a function that reports back the appropriate SIMD
>> > alignment). Again, this makes it easier to wrap libraries that may
>> > function with any alignment, but benefit from optimum alignment.
>>
>> Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on
>> systems (Linux and Mac OSX):
> 
>
> That is not true at least under Windows 32-bit. I think also it's not
> true for Linux 32-bit from my vague recollections of testing in a
> virtual machine. (disclaimer: both those statements _may_ be out of
> date).
>
> But yes, under Linux 64-bit I always get my arrays aligned to 16 bytes.

Currently numpy just uses whatever the system malloc() returns, so the
alignment guarantees are entirely determined by your libc.

-n


Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Henry Gomersall
On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:

> > Finally, I think there is significant value in auto-aligning the array
> > based on an appropriate inspection of the cpu capabilities (or
> > alternatively, a function that reports back the appropriate SIMD
> > alignment). Again, this makes it easier to wrap libraries that may
> > function with any alignment, but benefit from optimum alignment.
>
> Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on
> systems (Linux and Mac OSX):


That is not true, at least under 32-bit Windows. I think it's also not
true for 32-bit Linux, from my vague recollections of testing in a
virtual machine. (Disclaimer: both of those statements _may_ be out of
date.)

But yes, under Linux 64-bit I always get my arrays aligned to 16 bytes.
> 
> The only scenario I can see where this would produce unaligned arrays is
> on machines with AVX (which prefers 32-byte alignment). But given that
> the Intel architecture is making great strides in fetching unaligned
> data, I'd be surprised if the difference in performance were even
> noticeable.
>
> Can you tell us what difference in performance you are seeing between an
> AVX-aligned array and one that is not AVX-aligned? Just curious.

I don't know; I don't own a machine with AVX ;)

It might be that the difference is negligible, though I do think it
would be _nice_ to have the arrays properly aligned if it's not too
difficult.

Cheers,

Henry



Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Francesc Alted
On 12/19/12 5:47 PM, Henry Gomersall wrote:
> On Wed, 2012-12-19 at 15:57 +, Nathaniel Smith wrote:
>> Not sure which interface is more useful to users. On the one hand,
>> using funny dtypes makes regular non-SIMD access more cumbersome, and
>> it forces your array size to be a multiple of the SIMD word size,
>> which might be inconvenient if your code is smart enough to handle
>> arbitrary-sized arrays with partial SIMD acceleration (i.e., using
>> SIMD for most of the array, and then a slow path to handle any partial
>> word at the end). OTOH, if your code *is* that smart, you should
>> probably just make it smart enough to handle a partial word at the
>> beginning as well and then you won't need any special alignment in the
>> first place, and representing each SIMD word as a single numpy scalar
>> is an intuitively appealing model of how SIMD works. OTOOH, just
>> adding a single argument to np.array() is much simpler to explain than
>> some elaborate scheme involving the creation of special custom dtypes.
> If it helps, my use-case is in wrapping the FFTW library. This _is_
> smart enough to deal with unaligned arrays, but it just results in a
> performance penalty. In the case of an FFT, there are clearly going to
> be issues with the powers of two indices in the array not lying on a
> suitable n-byte boundary (which would be the case with a misaligned
> array), but I imagine it's not unique.
>
> The other point is that it's easy to create a suitable power of two
> array that should always bypass any special case unaligned code (e.g.
> with floats, any multiple of 4 array length will fill every 16-byte
> word).
>
> Finally, I think there is significant value in auto-aligning the array
> based on an appropriate inspection of the cpu capabilities (or
> alternatively, a function that reports back the appropriate SIMD
> alignment). Again, this makes it easier to wrap libraries that may
> function with any alignment, but benefit from optimum alignment.

Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on 
systems (Linux and Mac OSX):

In []: np.empty(1).data
Out[]: 

In []: np.empty(1).data
Out[]: 

In []: np.empty(1).data
Out[]: 

In []: np.empty(1).data
Out[]: 

[Check that the last digit in the addresses above is always 0]
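
For anyone reproducing this check, the address can also be read from
`ndarray.ctypes.data` and tested numerically rather than by eyeballing hex
digits. The following is only a minimal sketch: the helper name
`alignment_of` is made up for illustration and is not part of NumPy, and the
printed values depend entirely on the platform's allocator.

```python
import numpy as np

def alignment_of(arr, max_alignment=4096):
    """Largest power of two (capped at max_alignment) dividing arr's data address.

    Hypothetical helper for illustration; not a NumPy function.
    """
    address = arr.ctypes.data
    alignment = 1
    while alignment < max_alignment and address % (alignment * 2) == 0:
        alignment *= 2
    return alignment

# Sample a few fresh allocations of different sizes. Small and large
# requests may be served differently by malloc, so results can vary.
arrays = [np.empty(n) for n in (1, 10, 1000, 100000)]
for a in arrays:
    print(a.size, hex(a.ctypes.data), alignment_of(a))
```

An array counts as 16-byte aligned here when the reported value is 16 or more.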

The only scenario I can see where this would produce unaligned arrays is
on machines with AVX (which prefers 32-byte alignment). But given that
the Intel architecture is making great strides in fetching unaligned
data, I'd be surprised if the difference in performance were even
noticeable.

Can you tell us what difference in performance you are seeing between an
AVX-aligned array and one that is not AVX-aligned? Just curious.

-- 
Francesc Alted



Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Henry Gomersall
On Wed, 2012-12-19 at 15:57 +, Nathaniel Smith wrote:
> Not sure which interface is more useful to users. On the one hand,
> using funny dtypes makes regular non-SIMD access more cumbersome, and
> it forces your array size to be a multiple of the SIMD word size,
> which might be inconvenient if your code is smart enough to handle
> arbitrary-sized arrays with partial SIMD acceleration (i.e., using
> SIMD for most of the array, and then a slow path to handle any partial
> word at the end). OTOH, if your code *is* that smart, you should
> probably just make it smart enough to handle a partial word at the
> beginning as well and then you won't need any special alignment in the
> first place, and representing each SIMD word as a single numpy scalar
> is an intuitively appealing model of how SIMD works. OTOOH, just
> adding a single argument to np.array() is much simpler to explain than
> some elaborate scheme involving the creation of special custom dtypes.

If it helps, my use-case is in wrapping the FFTW library. This _is_
smart enough to deal with unaligned arrays, but it just results in a
performance penalty. In the case of an FFT, there are clearly going to
be issues with the powers of two indices in the array not lying on a
suitable n-byte boundary (which would be the case with a misaligned
array), but I imagine it's not unique.

The other point is that it's easy to create a suitable power of two
array that should always bypass any special case unaligned code (e.g.
with floats, any multiple of 4 array length will fill every 16-byte
word).

Finally, I think there is significant value in auto-aligning the array
based on an appropriate inspection of the cpu capabilities (or
alternatively, a function that reports back the appropriate SIMD
alignment). Again, this makes it easier to wrap libraries that may
function with any alignment, but benefit from optimum alignment.
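
To make the "function that reports back the appropriate SIMD alignment" idea
concrete, here is a rough sketch. It assumes the CPU feature flags have
already been obtained by some other means (for example by parsing
/proc/cpuinfo on Linux). The function name, the flag spellings, and the
lookup table are all illustrative assumptions; none of this is NumPy or
pyFFTW API.

```python
# Hypothetical table: CPU feature flag -> SIMD register width in bytes.
# Flag spellings follow Linux /proc/cpuinfo; other platforms name them differently.
_SIMD_WIDTH_BYTES = {
    "sse": 16,
    "sse2": 16,
    "neon": 16,
    "avx": 32,
    "avx2": 32,
    "avx512f": 64,
}

def simd_alignment(cpu_flags, default=1):
    """Return the widest known SIMD register width (in bytes) among cpu_flags.

    Falls back to `default` (no special alignment) when no known flag is present.
    """
    widths = [_SIMD_WIDTH_BYTES[f] for f in cpu_flags if f in _SIMD_WIDTH_BYTES]
    return max(widths, default=default)

print(simd_alignment({"fpu", "sse", "sse2"}))
print(simd_alignment({"sse", "sse2", "avx"}))
print(simd_alignment({"fpu"}))
```

Taking the flags as an argument keeps the platform-specific detection
separate from the policy of which alignment to request.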

Cheers,

Henry



Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Nathaniel Smith
On Wed, Dec 19, 2012 at 3:27 PM, Charles R Harris
 wrote:
>
>
> On Wed, Dec 19, 2012 at 8:10 AM, Nathaniel Smith  wrote:
>> Right, my intuition is that it's like order="C" -- if you make a new
>> array by, say, indexing, then it may or may not have order="C", no
>> guarantees. So when you care, you call asarray(a, order="C") and that
>> either makes a copy or not as needed. Similarly for base alignment.
>>
>> I guess to push this analogy even further we could define a set of
>> array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2
>> alignment matters, I think, so the number of flags would remain
>> manageable?) That would make the C API easier to deal with too, no
>> need to add PyArray_FromAnyAligned.
>>
>
> Another possibility is an aligned datatype, basically an aligned structured
> array with floats/ints in chunks of the appropriate size. IIRC, gcc support
> for sse is something like that.

True; right now it looks like structured dtypes have no special alignment:

In [13]: np.dtype("f4,f4").alignment
Out[13]: 1

So for this approach we'd need a way to create structured dtypes with
.alignment == .itemsize, and we'd need some way to request
dtype-aligned memory from array allocation functions. I guess the existing
NPY_ALIGNED is a good enough public interface for the latter, but
AFAICT the current implementation is to just assume that whatever
malloc() returns will always be ALIGNED. This is true for all base C
types, but not for more exotic record types with larger alignment
requirements -- that would require some fancier allocation scheme.
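
As a small illustration of the gap, the item size and alignment of a
structured dtype can be compared directly. This sketch uses only `np.dtype`
and its `align` argument. Note that `align=True` gives C-struct-style
alignment (that of the widest field), which is still smaller than the item
size, so it does not by itself provide the `.alignment == .itemsize` property
discussed above.

```python
import numpy as np

# Four packed float32 fields: 16 bytes per item, the size of one SSE word.
packed = np.dtype("f4,f4,f4,f4")
# Same fields with C-struct-style padding and alignment requested.
aligned = np.dtype("f4,f4,f4,f4", align=True)

for name, dt in (("packed", packed), ("align=True", aligned)):
    print(name, "itemsize:", dt.itemsize, "alignment:", dt.alignment)
```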

Not sure which interface is more useful to users. On the one hand,
using funny dtypes makes regular non-SIMD access more cumbersome, and
it forces your array size to be a multiple of the SIMD word size,
which might be inconvenient if your code is smart enough to handle
arbitrary-sized arrays with partial SIMD acceleration (i.e., using
SIMD for most of the array, and then a slow path to handle any partial
word at the end). OTOH, if your code *is* that smart, you should
probably just make it smart enough to handle a partial word at the
beginning as well and then you won't need any special alignment in the
first place, and representing each SIMD word as a single numpy scalar
is an intuitively appealing model of how SIMD works. OTOOH, just
adding a single argument to np.array() is much simpler to explain than
some elaborate scheme involving the creation of special custom dtypes.

-n


Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Charles R Harris
On Wed, Dec 19, 2012 at 8:10 AM, Nathaniel Smith  wrote:

> On Wed, Dec 19, 2012 at 2:57 PM, Charles R Harris
>  wrote:
> >
> >
> > On Wed, Dec 19, 2012 at 7:43 AM, Nathaniel Smith  wrote:
> >>
> >> On Wed, Dec 19, 2012 at 8:40 AM, Henry Gomersall wrote:
> >> > I've written a few simple cython routines for assisting in creating
> >> > byte-aligned numpy arrays. The point being for the arrays to work with
> >> > SSE/AVX code.
> >> >
> >> > https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
> >> >
> >> > The change recently has been to add a check on the CPU as to what flags
> >> > are supported (though it's not complete, I should make the default
> >> > return 0 or something).
> >> >
> >> > It occurred to me that this is something that (a) other people almost
> >> > certainly need and are solving themselves and (b) I lack the necessary
> >> > platforms to test all the possible CPU/OS combinations to make sure
> >> > something sensible happens in all cases.
> >> >
> >> > Is this something that can be rolled into Numpy (the feature, not my
> >> > particular implementation or interface - though I'd be happy for it to
> >> > be so)?
> >> >
> >> > Regarding (b), I've written a test case that works for Linux on x86-64
> >> > with GCC (my platform!). I can test it on 32-bit windows, but that's it.
> >> > Is ARM supported by Numpy? Neon would be great to include as well. What
> >> > other platforms might need this?
> >>
> >> Your code looks simple and portable to me (at least the alignment
> >> part). I can see a good argument for adding this sort of functionality
> >> directly to numpy with a nice interface, though, since these kinds of
> >> requirements seem quite common these days. Maybe an interface like
> >>   a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in
> >> bits or in bytes?
> >>   b = np.empty((10, 10), order="C", base_alignment=32)
> >>   # etc.
> >>   assert a.base_alignment == 32
> >> which underneath tries to use posix_memalign/_aligned_malloc when
> >> possible, or falls back on the overallocation trick otherwise?
> >>
> >
> > There is a thread about this from several years back. IIRC, David Cournapeau
> > was interested in the same problem. At first glance, the alignment keyword
> > looks interesting. One possible concern is keeping alignment for rows,
> > views, etc., which is probably not possible in any sensible way. But people
> > who need this most likely know what they are doing and just need memory
> > allocated on the proper boundary.
>
> Right, my intuition is that it's like order="C" -- if you make a new
> array by, say, indexing, then it may or may not have order="C", no
> guarantees. So when you care, you call asarray(a, order="C") and that
> either makes a copy or not as needed. Similarly for base alignment.
>
> I guess to push this analogy even further we could define a set of
> array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2
> alignment matters, I think, so the number of flags would remain
> manageable?) That would make the C API easier to deal with too, no
> need to add PyArray_FromAnyAligned.
>
>
Another possibility is an aligned datatype, basically an aligned structured
array with floats/ints in chunks of the appropriate size. IIRC, gcc support
for sse is something like that.

Chuck


Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Nathaniel Smith
On Wed, Dec 19, 2012 at 2:57 PM, Charles R Harris
 wrote:
>
>
> On Wed, Dec 19, 2012 at 7:43 AM, Nathaniel Smith  wrote:
>>
>> On Wed, Dec 19, 2012 at 8:40 AM, Henry Gomersall  wrote:
>> > I've written a few simple cython routines for assisting in creating
>> > byte-aligned numpy arrays. The point being for the arrays to work with
>> > SSE/AVX code.
>> >
>> > https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
>> >
>> > The change recently has been to add a check on the CPU as to what flags
>> > are supported (though it's not complete, I should make the default
>> > return 0 or something).
>> >
>> > It occurred to me that this is something that (a) other people almost
>> > certainly need and are solving themselves and (b) I lack the necessary
>> > platforms to test all the possible CPU/OS combinations to make sure
>> > something sensible happens in all cases.
>> >
>> > Is this something that can be rolled into Numpy (the feature, not my
>> > particular implementation or interface - though I'd be happy for it to
>> > be so)?
>> >
>> > Regarding (b), I've written a test case that works for Linux on x86-64
>> > with GCC (my platform!). I can test it on 32-bit windows, but that's it.
>> > Is ARM supported by Numpy? Neon would be great to include as well. What
>> > other platforms might need this?
>>
>> Your code looks simple and portable to me (at least the alignment
>> part). I can see a good argument for adding this sort of functionality
>> directly to numpy with a nice interface, though, since these kinds of
>> requirements seem quite common these days. Maybe an interface like
>>   a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in
>> bits or in bytes?
>>   b = np.empty((10, 10), order="C", base_alignment=32)
>>   # etc.
>>   assert a.base_alignment == 32
>> which underneath tries to use posix_memalign/_aligned_malloc when
>> possible, or falls back on the overallocation trick otherwise?
>>
>
> There is a thread about this from several years back. IIRC, David Cournapeau
> was interested in the same problem. At first glance, the alignment keyword
> looks interesting. One possible concern is keeping alignment for rows,
> views, etc., which is probably not possible in any sensible way. But people
> who need this most likely know what they are doing and just need memory
> allocated on the proper boundary.

Right, my intuition is that it's like order="C" -- if you make a new
array by, say, indexing, then it may or may not have order="C", no
guarantees. So when you care, you call asarray(a, order="C") and that
either makes a copy or not as needed. Similarly for base alignment.

I guess to push this analogy even further we could define a set of
array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2
alignment matters, I think, so the number of flags would remain
manageable?) That would make the C API easier to deal with too, no
need to add PyArray_FromAnyAligned.
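
The copy-only-if-needed semantics described here can be sketched in pure
Python: return the input untouched when its data pointer already satisfies
the requested alignment, and copy into an over-allocated buffer otherwise.
This is only an illustration of the idea. `asarray_aligned` is a made-up
name, not an existing NumPy function, and the sketch assumes plain numeric
dtypes and an alignment that is a power of two no smaller than the dtype's
own alignment.

```python
import numpy as np

def asarray_aligned(a, alignment=16):
    """Return `a` itself if its data is `alignment`-byte aligned, else an aligned copy.

    Hypothetical helper; mirrors how asarray(a, order="C") copies only when needed.
    """
    a = np.asarray(a)
    if a.ctypes.data % alignment == 0:
        return a
    # Over-allocate raw bytes, then start the view at the first aligned address.
    raw = np.empty(a.nbytes + alignment, dtype=np.uint8)
    offset = -raw.ctypes.data % alignment
    out = raw[offset:offset + a.nbytes].view(a.dtype).reshape(a.shape)
    out[...] = a
    return out

# Build an input that sits 8 bytes past a 16-byte boundary.
raw = np.zeros(64, dtype=np.uint8)
start = -raw.ctypes.data % 16 + 8
x = raw[start:start + 32].view(np.float64)

y = asarray_aligned(x, 16)   # misaligned for 16 bytes, so this copies
z = asarray_aligned(y, 16)   # already aligned, so this returns y itself
print(x.ctypes.data % 16, y.ctypes.data % 16, z is y)
```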

-n


Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Charles R Harris
On Wed, Dec 19, 2012 at 7:43 AM, Nathaniel Smith  wrote:

> On Wed, Dec 19, 2012 at 8:40 AM, Henry Gomersall  wrote:
> > I've written a few simple cython routines for assisting in creating
> > byte-aligned numpy arrays. The point being for the arrays to work with
> > SSE/AVX code.
> >
> > https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
> >
> > The change recently has been to add a check on the CPU as to what flags
> > are supported (though it's not complete, I should make the default
> > return 0 or something).
> >
> > It occurred to me that this is something that (a) other people almost
> > certainly need and are solving themselves and (b) I lack the necessary
> > platforms to test all the possible CPU/OS combinations to make sure
> > something sensible happens in all cases.
> >
> > Is this something that can be rolled into Numpy (the feature, not my
> > particular implementation or interface - though I'd be happy for it to
> > be so)?
> >
> > Regarding (b), I've written a test case that works for Linux on x86-64
> > with GCC (my platform!). I can test it on 32-bit windows, but that's it.
> > Is ARM supported by Numpy? Neon would be great to include as well. What
> > other platforms might need this?
>
> Your code looks simple and portable to me (at least the alignment
> part). I can see a good argument for adding this sort of functionality
> directly to numpy with a nice interface, though, since these kinds of
> requirements seem quite common these days. Maybe an interface like
>   a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in
> bits or in bytes?
>   b = np.empty((10, 10), order="C", base_alignment=32)
>   # etc.
>   assert a.base_alignment == 32
> which underneath tries to use posix_memalign/_aligned_malloc when
> possible, or falls back on the overallocation trick otherwise?
>
>
There is a thread about this from several years back. IIRC, David
Cournapeau was interested in the same problem. At first glance, the
alignment keyword looks interesting. One possible concern is keeping
alignment for rows, views, etc., which is probably not possible in any
sensible way. But people who need this most likely know what they are doing
and just need memory allocated on the proper boundary.

Chuck


Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Nathaniel Smith
On Wed, Dec 19, 2012 at 8:40 AM, Henry Gomersall  wrote:
> I've written a few simple cython routines for assisting in creating
> byte-aligned numpy arrays. The point being for the arrays to work with
> SSE/AVX code.
>
> https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
>
> The change recently has been to add a check on the CPU as to what flags
> are supported (though it's not complete, I should make the default
> return 0 or something).
>
> It occurred to me that this is something that (a) other people almost
> certainly need and are solving themselves and (b) I lack the necessary
> platforms to test all the possible CPU/OS combinations to make sure
> something sensible happens in all cases.
>
> Is this something that can be rolled into Numpy (the feature, not my
> particular implementation or interface - though I'd be happy for it to
> be so)?
>
> Regarding (b), I've written a test case that works for Linux on x86-64
> with GCC (my platform!). I can test it on 32-bit windows, but that's it.
> Is ARM supported by Numpy? Neon would be great to include as well. What
> other platforms might need this?

Your code looks simple and portable to me (at least the alignment
part). I can see a good argument for adding this sort of functionality
directly to numpy with a nice interface, though, since these kinds of
requirements seem quite common these days. Maybe an interface like
  a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in bits or in bytes?
  b = np.empty((10, 10), order="C", base_alignment=32)
  # etc.
  assert a.base_alignment == 32
which underneath tries to use posix_memalign/_aligned_malloc when
possible, or falls back on the overallocation trick otherwise?
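
For readers unfamiliar with the overallocation trick, it can be sketched in
pure Python with today's NumPy: allocate `alignment` extra bytes, find the
first suitably aligned address inside the buffer, and view the array from
there. The name `aligned_empty` and its signature are assumptions made for
this sketch; no `base_alignment` keyword exists in NumPy, and a real
implementation would more likely live in C.

```python
import numpy as np

def aligned_empty(shape, dtype=np.float64, alignment=32):
    """Uninitialised array whose data pointer is a multiple of `alignment` bytes.

    Hypothetical helper; `alignment` is in bytes and assumed to be a power of
    two no smaller than the dtype's own alignment.
    """
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape, dtype=np.intp)) * dtype.itemsize
    # Over-allocate so that an aligned start address must exist in the buffer.
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = -raw.ctypes.data % alignment
    # The returned view keeps `raw` alive through its .base chain.
    return raw[offset:offset + nbytes].view(dtype).reshape(shape)

b = aligned_empty((10, 10), alignment=32)
print(b.shape, b.dtype, b.ctypes.data % 32)
```

The alignment only holds for the base array; slices and other views of `b`
may start at addresses that are not aligned.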

-n


[Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Henry Gomersall
I've written a few simple Cython routines to help create byte-aligned
NumPy arrays, the point being for the arrays to work with SSE/AVX
code.

https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi

The recent change has been to add a check on the CPU as to which flags
are supported (though it's not complete; I should make the default
return 0 or something).

It occurred to me that this is something that (a) other people almost
certainly need and are solving themselves and (b) I lack the necessary
platforms to test all the possible CPU/OS combinations to make sure
something sensible happens in all cases.

Is this something that can be rolled into Numpy (the feature, not my
particular implementation or interface - though I'd be happy for it to
be so)?

Regarding (b), I've written a test case that works for Linux on x86-64
with GCC (my platform!). I can test it on 32-bit Windows, but that's it.
Is ARM supported by NumPy? NEON would be great to include as well. What
other platforms might need this?

Cheers,

Henry
