Re: [PyCUDA] Elementwise operations on noncontiguous arrays

2016-12-05 Thread Keegan Owsley
I've created the PR here:
https://gitlab.tiker.net/inducer/pycuda/merge_requests/1

Regarding the regex, I have two versions: a simpler one that uses the
standard `re` module but has trouble with nested array accesses, and a more
robust one that uses the third-party `regex` module and handles nested
accesses correctly. The following kernel fails with the former but succeeds
with the latter:

z[i] = y[x[i]]

I didn't see any kernels in pycuda that use this kind of nested access, so
the first regex is sufficient for all of the examples in there, but the
second is necessary to support arbitrary kernels. It's possible that there
are other situations in which the regex(es) fail, but I haven't run into
them yet.
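
To illustrate the difference, here is a schematic sketch (not the actual
pattern from the PR) of why the nested case needs the `regex` module's
recursive patterns, while a naive pattern built on the standard `re` module
splits y[x[i]] at the wrong bracket:

import re
import regex

src = "z[i] = y[x[i]]"

# Naive pattern: the index is "anything up to the first ']'", so the
# nested access y[x[i]] is cut short at the inner bracket.
naive = re.compile(r"(\w+)\[([^\]]*)\]")
print(naive.findall(src))
# [('z', 'i'), ('y', 'x[i')]

# Recursive pattern (regex module): group 2 matches a balanced [...]
# block, so the full index expression of y, including x[i], is captured.
balanced = regex.compile(r"(\w+)(\[(?:[^\[\]]+|(?2))*\])")
print(balanced.findall(src))
# [('z', '[i]'), ('y', '[x[i]]')]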

Keegan

On Sun, Dec 4, 2016 at 3:30 PM, Andreas Kloeckner wrote:

> Keegan Owsley  writes:
> > Something that I don't think I made clear before: the kernels generated
> by
> > get_elwise_module_noncontig are modified using regular expressions, so
> that
> > you don't need to change your code downstream to get strided array
> support.
> > I'm not convinced yet that this is the best approach, but it works okay
> so
> > far.
>
> To make it easier to see your changes and comment on them (such as
> exactly how robust the Regex stuff is), could you put this up as a pull
> request here?
>
> https://gitlab.tiker.net/inducer/pycuda
>
> This will automatically run tests, so it's easier for me to handle.
>
> Thanks!
> Andreas
>
___
PyCUDA mailing list
PyCUDA@tiker.net
https://lists.tiker.net/listinfo/pycuda


Re: [PyCUDA] Elementwise operations on noncontiguous arrays

2016-12-04 Thread Andreas Kloeckner
Keegan Owsley  writes:
> Something that I don't think I made clear before: the kernels generated by
> get_elwise_module_noncontig are modified using regular expressions, so that
> you don't need to change your code downstream to get strided array support.
> I'm not convinced yet that this is the best approach, but it works okay so
> far.

To make it easier to see your changes and comment on them (such as
exactly how robust the Regex stuff is), could you put this up as a pull
request here?

https://gitlab.tiker.net/inducer/pycuda

This will automatically run tests, so it's easier for me to handle.

Thanks!
Andreas

___
PyCUDA mailing list
PyCUDA@tiker.net
https://lists.tiker.net/listinfo/pycuda


Re: [PyCUDA] Elementwise operations on noncontiguous arrays

2016-12-01 Thread Keegan Owsley
Alright, I've updated the code in my repository. It now passes all of the
unit tests in test_gpuarray (at least, the ones that were passing before).
The old behavior is the default, so hopefully that should fix compatibility
with existing code (still largely untested). To create a noncontiguous
kernel, pass use_strides=True into get_elwise_kernel, and pass
gpuarray.device_shape_and_strides
into func.prepared_async_call. I've updated all of the methods in gpuarray.py
to check for any noncontiguous inputs/outputs, and call the appropriate
function.

Something that I don't think I made clear before: the kernels generated by
get_elwise_module_noncontig are modified using regular expressions, so that
you don't need to change your code downstream to get strided array support.
I'm not convinced yet that this is the best approach, but it works okay so
far.

So, for example...

Code like this:

@context_dependent_memoize
def get_binary_op_kernel(dtype_x, dtype_y, dtype_z, operator,
        use_strides=False):
    return get_elwise_kernel(
            "%(tp_x)s *x, %(tp_y)s *y, %(tp_z)s *z" % {
                "tp_x": dtype_to_ctype(dtype_x),
                "tp_y": dtype_to_ctype(dtype_y),
                "tp_z": dtype_to_ctype(dtype_z),
                },
            "z[i] = x[i] %s y[i]" % operator,
            "multiply", use_strides=use_strides)


Generates kernels like this:

#include <pycuda-complex.hpp>


typedef struct
{
    unsigned n[2];
    long stride[2];
} dim;

__device__ unsigned i2m(unsigned i, dim d)
{
    unsigned m = 0;
    unsigned j = i;
    for(int k = 0; k < 2; k++)
    {
        m += d.stride[k] * (j % d.n[k]);
        j = j / d.n[k];
    }
    return m;
}




__global__ void multiply(double *x, double *y, double *z,
    unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2)
{

  unsigned tid = threadIdx.x;
  unsigned total_threads = gridDim.x*blockDim.x;
  unsigned cta_start = blockDim.x*blockIdx.x;
  unsigned i;

  ;

  for (i = cta_start + tid; i < n; i += total_threads)
  {
z[i2m(i,*__dim2)] = x[i2m(i,*__dim0)] / y[i2m(i,*__dim1)];
  }

  ;
}


and you use them like this:


def is_noncontig(*vecs):
    r = False
    for v in vecs:
        r = r or not v.flags.forc
    return r


def _div(self, other, out, stream=None):
    """Divides an array by another array."""

    assert self.shape == other.shape

    noncontig = is_noncontig(self, other, out)

    func = elementwise.get_binary_op_kernel(self.dtype, other.dtype,
            out.dtype, "/", use_strides=noncontig)

    if noncontig:
        func.prepared_async_call(self._grid, self._block, stream,
                self.gpudata, other.gpudata,
                out.gpudata, self.mem_size,
                self.device_shape_and_strides,
                other.device_shape_and_strides,
                out.device_shape_and_strides)
    else:
        func.prepared_async_call(self._grid, self._block, stream,
                self.gpudata, other.gpudata,
                out.gpudata, self.mem_size)
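
From the user's side, the point is that elementwise operations now accept
noncontiguous views directly. A minimal usage sketch (assuming the patched
gpuarray; the array contents are arbitrary):

import numpy as np
import pycuda.autoinit            # noqa: F401  (creates the CUDA context)
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.arange(12, dtype=np.float64).reshape(3, 4))
v = a[:, :-1]                     # noncontiguous view (last column dropped)

# With the patch, this dispatches through is_noncontig()/_div() above to
# the strided kernel, instead of requiring an explicit .copy() first.
w = v / v
print(w.get())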


In principle, similar updates can be made down the road to modules that
depend on pycuda.elementwise, so that they also support strided arrays
without big rewrites.

I'm open to suggestions for improving performance, but I'd like to focus on
compatibility for now. Frédéric's suggestion is a good one, but I suspect
there's another source of slowdowns somewhere, because the slowdowns I'm
seeing don't scale with array size.
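
For reference, a rough sketch of the dimension-fusing idea (Frédéric's
message is quoted in full further down; this is just an illustration, not
the Theano/libgpuarray code, and it ignores broadcasting):

def fuse_dims(shape, strides):
    # Collapse adjacent axes that are laid out contiguously (C order).
    shape, strides = list(shape), list(strides)
    k = 0
    while k < len(shape) - 1:
        # Outer axis k and inner axis k+1 can be merged into one axis
        # when one step along k equals shape[k+1] steps along k+1.
        if strides[k] == strides[k + 1] * shape[k + 1]:
            shape[k + 1] *= shape[k]
            del shape[k]
            del strides[k]
        else:
            k += 1
    return tuple(shape), tuple(strides)

# e.g. a C-contiguous (4, 5, 6) float64 array sliced along axis 0 only:
# shape (2, 5, 6), strides (480, 48, 8) in bytes.  Axes 1 and 2 fuse, so
# the kernel only pays 2d indexing:
print(fuse_dims((2, 5, 6), (480, 48, 8)))   # ((2, 30), (480, 8))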

Keegan

On Thu, Dec 1, 2016 at 11:54 AM, Keegan Owsley  wrote:

> It seems striding in kernels can indeed incur a large penalty, though the
> result is dependent on the array size. I've just updated the code to only
> use the noncontiguous kernel if necessary.
>
> Using a GTX 970, I get the following results using the pow kernel (I'm
> using ipython's %timeit here, because for some reason python appears to
> hang on my machine if I use the standard timeit module):
>
> >>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray
> >>> b1 = gpuarray.arange(100000, dtype='float64').reshape(1000, 100)
> >>> b2 = b1[::2, :-1].copy()
> >>> # force CUDA to compile the kernels before we time
> >>> b1[::2,:-1]**2
> >>> b2**2
>
> >>> %timeit b1[::2,:-1]**2
> 1000 loops, best of 3: 654 µs per loop
> >>> %timeit b2**2
> 1 loops, best of 3: 147 µs per loop
>
> >>> b1 = gpuarray.arange(1000000, dtype='float64').reshape(1000, 1000)
> >>> b2 = b1[::2, :-1].copy()
>
> >>> %timeit b1[::2,:-1]**2
> 100 loops, best of 3: 2.05 ms per loop
> >>> %timeit b2**2
> 1000 loops, best of 3: 1.66 ms per loop
>
> I'll try to clean up the code and get it into my repo today.
>
> Keegan
>
>
> On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien <
> frederic.bast...@gmail.com> wrote:
>
>> I can share our experience in Theano related to that.
>>
>> I added code to "fuse" dimensions that are contiguous. So if you have a
>> 3d 

Re: [PyCUDA] Elementwise operations on noncontiguous arrays

2016-12-01 Thread Keegan Owsley
It seems striding in kernels can indeed incur a large penalty, though the
result is dependent on the array size. I've just updated the code to only
use the noncontiguous kernel if necessary.

Using a GTX 970, I get the following results using the pow kernel (I'm
using ipython's %timeit here, because for some reason python appears to
hang on my machine if I use the standard timeit module):

>>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray
>>> b1 = gpuarray.arange(100000, dtype='float64').reshape(1000, 100)
>>> b2 = b1[::2, :-1].copy()
>>> # force CUDA to compile the kernels before we time
>>> b1[::2,:-1]**2
>>> b2**2

>>> %timeit b1[::2,:-1]**2
1000 loops, best of 3: 654 µs per loop
>>> %timeit b2**2
1 loops, best of 3: 147 µs per loop

>>> b1 = gpuarray.arange(1000000, dtype='float64').reshape(1000, 1000)
>>> b2 = b1[::2, :-1].copy()

>>> %timeit b1[::2,:-1]**2
100 loops, best of 3: 2.05 ms per loop
>>> %timeit b2**2
1000 loops, best of 3: 1.66 ms per loop

I'll try to clean up the code and get it into my repo today.

Keegan


On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien <
frederic.bast...@gmail.com> wrote:

> I can share our experience in Theano related to that.
>
> I added code to "fuse" dimensions that are contiguous. So if you have a 3d
> tensor and only one dimension is non-contiguous, you won't pay the indexing
> cost of a 3d tensor, but only of a 2d tensor.
>
> I saw a 2x speed-up in such cases. That was with an old GPU and an old
> CUDA version, so maybe this has changed.
>
> In the new Theano back-end, we have rewritten that in a more readable
> version:
>
> https://github.com/Theano/libgpuarray/blob/master/src/gpuarray_util.c#L153
>
> This also takes broadcasting into account.
>
> This could be done before calling the kernel.
>
> On Wed, Nov 30, 2016 at 8:24 PM, Andreas Kloeckner <
> li...@informa.tiker.net> wrote:
>
>> Keegan,
>>
>> Keegan Owsley  writes:
>> > I've just slapped together a patch to pycuda that makes most elementwise
>> > operations work with noncontiguous arrays. There are a bunch of hacks in
>> > there, and the code needs some reorg before it's ready to be considered
>> for
>> > upstream (I made these changes while learning the pycuda codebase, so
>> > there's a bunch of crud that can be cleaned out), but I figure I might
>> as
>> > well put it out there in its current state and see what you guys think.
>> > It's also not extremely well-tested (I have no idea if it interferes
>> with
>> > skcuda, for example), but all of the main functions appear to work.
>> >
>> > You can check out the code at https://bitbucket.org/owsleyk_
>> omega/pycuda.
>> >
>> > Briefly, this works by adding new parameters into elementwise kernels
>> that
>> > describe the stride and shape of your arrays, then using a function that
>> > computes the location in memory from the stride, shape, and index.
>> > Elementwise kernel ops are modified so that they use the proper
>> indexing.
>> > See an example of a kernel that's generated below:
>>
>> Thanks for putting this together and sharing it! I have one main
>> question about this, regarding performance:
>>
>> Modulo (especially variable-denominator modulo) has a habit of being
>> fantastically slow on GPUs. Could you time contiguous
>> vs. noncontiguous for various levels of "gappiness" and number of
>> axes? I'm asking this because I'd be OK with a 50% slowdown, but not
>> necessarily a factor of 5 slowdown on actual GPU hardware.
>>
>> Thanks!
>> Andreas
>>
>> ___
>> PyCUDA mailing list
>> PyCUDA@tiker.net
>> https://lists.tiker.net/listinfo/pycuda
>>
>
>
___
PyCUDA mailing list
PyCUDA@tiker.net
https://lists.tiker.net/listinfo/pycuda


Re: [PyCUDA] Elementwise operations on noncontiguous arrays

2016-11-30 Thread Andreas Kloeckner
Keegan,

Keegan Owsley  writes:
> I've just slapped together a patch to pycuda that makes most elementwise
> operations work with noncontiguous arrays. There are a bunch of hacks in
> there, and the code needs some reorg before it's ready to be considered for
> upstream (I made these changes while learning the pycuda codebase, so
> there's a bunch of crud that can be cleaned out), but I figure I might as
> well put it out there in its current state and see what you guys think.
> It's also not extremely well-tested (I have no idea if it interferes with
> skcuda, for example), but all of the main functions appear to work.
>
> You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda.
>
> Briefly, this works by adding new parameters into elementwise kernels that
> describe the stride and shape of your arrays, then using a function that
> computes the location in memory from the stride, shape, and index.
> Elementwise kernel ops are modified so that they use the proper indexing.
> See an example of a kernel that's generated below:

Thanks for putting this together and sharing it! I have one main
question about this, regarding performance:

Modulo (especially variable-denominator modulo) has a habit of being
fantastically slow on GPUs. Could you time contiguous
vs. noncontiguous for various levels of "gappiness" and number of
axes? I'm asking this because I'd be OK with a 50% slowdown, but not
necessarily a factor of 5 slowdown on actual GPU hardware.

Thanks!
Andreas
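
A sketch of the kind of comparison being asked for (assuming the patched
gpuarray handles these strided views; the sizes and cases are illustrative):

import time
import numpy as np
import pycuda.autoinit            # noqa: F401  (creates the CUDA context)
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

def time_square(x, repeats=200):
    x ** 2                        # warm up / force kernel compilation
    drv.Context.synchronize()
    t0 = time.time()
    for _ in range(repeats):
        x ** 2
    drv.Context.synchronize()
    return (time.time() - t0) / repeats

base = gpuarray.to_gpu(np.random.rand(2000, 2000))
cases = [
    ("contiguous", base),
    ("gap in one axis", base[:, :-1]),
    ("gaps in both axes", base[::2, ::3]),
]
for name, view in cases:
    print("%-18s %.3f ms" % (name, 1e3 * time_square(view)))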

___
PyCUDA mailing list
PyCUDA@tiker.net
https://lists.tiker.net/listinfo/pycuda


[PyCUDA] Elementwise operations on noncontiguous arrays

2016-11-30 Thread Keegan Owsley
Hello,

I've just slapped together a patch to pycuda that makes most elementwise
operations work with noncontiguous arrays. There are a bunch of hacks in
there, and the code needs some reorg before it's ready to be considered for
upstream (I made these changes while learning the pycuda codebase, so
there's a bunch of crud that can be cleaned out), but I figure I might as
well put it out there in its current state and see what you guys think.
It's also not extremely well-tested (I have no idea if it interferes with
skcuda, for example), but all of the main functions appear to work.

You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda.

Briefly, this works by adding new parameters into elementwise kernels that
describe the stride and shape of your arrays, then using a function that
computes the location in memory from the stride, shape, and index.
Elementwise kernel ops are modified so that they use the proper indexing.
See an example of a kernel that's generated below:

#include <pycuda-complex.hpp>


typedef struct
{
    unsigned n[2];
    long stride[2];
} dim;

__device__ unsigned i2m(unsigned i, dim d)
{
    unsigned m = 0;
    unsigned j = i;
    for(int k = 0; k < 2; k++)
    {
        m += d.stride[k] * (j % d.n[k]);
        j = j / d.n[k];
    }
    return m;
}




__global__ void axpbyz(float a, float *x, float b, float *y, float *z,
    unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2)
{

  unsigned tid = threadIdx.x;
  unsigned total_threads = gridDim.x*blockDim.x;
  unsigned cta_start = blockDim.x*blockIdx.x;
  unsigned i;

  ;

  for (i = cta_start + tid; i < n; i += total_threads)
  {
z[i2m(i,*__dim2)] = a*x[i2m(i,*__dim0)] + b*y[i2m(i,*__dim1)];
  }

  ;
}
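
For reference, a minimal NumPy sketch of what the generated i2m() computes:
it maps a flat element index to an offset using the array's shape and
strides. The fastest-axis-first packing and element-unit strides below are
assumptions about what device_shape_and_strides provides; the real layout
may differ:

import numpy as np

def i2m(i, n, stride):
    # Python mirror of the generated __device__ i2m(): flat index -> offset
    m, j = 0, i
    for k in range(len(n)):
        m += stride[k] * (j % n[k])
        j //= n[k]
    return m

a = np.arange(12.0).reshape(3, 4)
view = a[::2, 1:]                 # noncontiguous view, shape (2, 3)

# Pack shape and strides fastest-varying axis first, strides in elements.
n = view.shape[::-1]
stride = tuple(s // a.itemsize for s in view.strides[::-1])
base = (view.__array_interface__["data"][0]
        - a.__array_interface__["data"][0]) // a.itemsize

offsets = [i2m(i, n, stride) for i in range(view.size)]
assert np.allclose([a.ravel()[base + off] for off in offsets], view.ravel())
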

I've also attached a patch file that should take you from latest git to the
version in my repo. All of the changes are in elementwise.py and
gpuarray.py.


0001-Allow-noncontiguous-arrays-in-elementwise-ops.patch
Description: Binary data
___
PyCUDA mailing list
PyCUDA@tiker.net
https://lists.tiker.net/listinfo/pycuda