Re: [PyCUDA] Elementwise operations on noncontiguous arrays
I've created the PR here: https://gitlab.tiker.net/inducer/pycuda/merge_requests/1 Regarding the regex, I have two versions: one that's simpler and uses the standard `re` module but has issues handling nested array access, and a more robust one that uses the `regex` module that correctly handles nested accesses. The following kernel fails with the former, but succeeds with the latter: z[i] = y[x[i]] I didn't see any kernels in pycuda that use this kind of nested access, so the first regex is sufficient for all of the examples in there, but the second is necessary to support arbitrary kernels. It's possible that there are other situations in which the regex(es) fail, but I haven't run into them yet. Keegan On Sun, Dec 4, 2016 at 3:30 PM, Andreas Kloecknerwrote: > Keegan Owsley writes: > > Something that I don't think I made clear before: the kernels generated > by > > get_elwise_module_noncontig are modified using regular expressions, so > that > > you don't need to change your code downstream to get strided array > support. > > I'm not convinced yet that this is the best approach, but it works okay > so > > far. > > To make it easier to see your changes and comment on them (such as > exactly how robust the Regex stuff is), could you put this up as a pull > request here? > > https://gitlab.tiker.net/inducer/pycuda > > This will automatically run tests, so it's easier for me to handle. > > Thanks! > Andreas > ___ PyCUDA mailing list PyCUDA@tiker.net https://lists.tiker.net/listinfo/pycuda
Re: [PyCUDA] Elementwise operations on noncontiguous arrays
Keegan Owsleywrites: > Something that I don't think I made clear before: the kernels generated by > get_elwise_module_noncontig are modified using regular expressions, so that > you don't need to change your code downstream to get strided array support. > I'm not convinced yet that this is the best approach, but it works okay so > far. To make it easier to see your changes and comment on them (such as exactly how robust the Regex stuff is), could you put this up as a pull request here? https://gitlab.tiker.net/inducer/pycuda This will automatically run tests, so it's easier for me to handle. Thanks! Andreas ___ PyCUDA mailing list PyCUDA@tiker.net https://lists.tiker.net/listinfo/pycuda
Re: [PyCUDA] Elementwise operations on noncontiguous arrays
Alright, I've updated the code in my repository. It now passes all of the unit tests in test_gpuarray (at least, the ones that were passing before). The old behavior is the default, so hopefully that should fix compatibility with existing code (still largely untested). To create a noncontiguous kernel, pass use_strides=True into get_elwise_kernel, and pass gpuarray.device_shape_and_strides into func.prepared_async_call. I've updated all of the methods in gpuarray.py to check for any noncontiguous inputs/outputs, and call the appropriate function. Something that I don't think I made clear before: the kernels generated by get_elwise_module_noncontig are modified using regular expressions, so that you don't need to change your code downstream to get strided array support. I'm not convinced yet that this is the best approach, but it works okay so far. So, per example... Code like this: @context_dependent_memoize def get_binary_op_kernel(dtype_x, dtype_y, dtype_z, operator, use_strides=False): return get_elwise_kernel( "%(tp_x)s *x, %(tp_y)s *y, %(tp_z)s *z" % { "tp_x": dtype_to_ctype(dtype_x), "tp_y": dtype_to_ctype(dtype_y), "tp_z": dtype_to_ctype(dtype_z), }, "z[i] = x[i] %s y[i]" % operator, "multiply", use_strides=use_strides) Generates kernels like this: #include typedef struct { unsigned n[2]; long stride[2]; } dim; __device__ unsigned i2m(unsigned i, dim d) { unsigned m = 0; unsigned j = i; for(int k = 0; k < 2; k++) { m += d.stride[k] * (j%d.n[k]); j = j / d.n[k]; } return m; } __global__ void multiply(double *x, double *y, double *z, unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2) { unsigned tid = threadIdx.x; unsigned total_threads = gridDim.x*blockDim.x; unsigned cta_start = blockDim.x*blockIdx.x; unsigned i; ; for (i = cta_start + tid; i < n; i += total_threads) { z[i2m(i,*__dim2)] = x[i2m(i,*__dim0)] / y[i2m(i,*__dim1)]; } ; } and you use them like this: def is_noncontig(*vecs): r = False for v in vecs: r = r or not v.flags.forc return r def _div(self, other, out, stream=None): """Divides an array by another array.""" assert self.shape == other.shape noncontig = is_noncontig(self, other, out) func = elementwise.get_binary_op_kernel(self.dtype, other.dtype, out.dtype, "/", use_strides=noncontig) if noncontig: func.prepared_async_call(self._grid, self._block, stream, self.gpudata, other.gpudata, out.gpudata, self.mem_size, self.device_shape_and_strides, other.device_shape_and_strides, out.device_shape_and_strides) else: func.prepared_async_call(self._grid, self._block, stream, self.gpudata, other.gpudata, out.gpudata, self.mem_size) In principle, similar updates can be made to modules that depend on pycuda.elementwise down the road, to also support strided arrays without big rewrites. I'm open to suggestions for improving performance, but I'd like to focus on compatibility for now. Frédéric's suggestion is a good one, but I suspect there's another source of slowdowns somewhere, because the slowdowns I'm seeing don't scale with array size. Keegan On Thu, Dec 1, 2016 at 11:54 AM, Keegan Owsleywrote: > It seems striding in kernels can indeed incur a large penalty, though the > result is dependent on the array size. I've just updated the code to only > use the noncontiguous kernel if necessary. > > Using a GTX 970, I get the following results using the pow kernel (I'm > using ipython's %timeit here, because for some reason python appears to > hang on my machine if I use the standard timeit module): > > >>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray > >>> b1 = gpuarray.arange(10, dtype='float64').reshape(1000, 100) > >>> b2 = b1[::2, :-1].copy() > >>> # force CUDA to compile the kernels before we time > >>> b1[::2,:-1]**2 > >>> b2**2 > > >>> %timeit b1[::2,:-1]**2 > 1000 loops, best of 3: 654 µs per loop > >>> %timeit b2**2 > 1 loops, best of 3: 147 µs per loop > > >>> b1 = gpuarray.arange(100, dtype='float64').reshape(1000, 1000) > >>> b2 = b1[::2, :-1].copy() > > >>> %timeit b1[::2,:-1]**2 > 100 loops, best of 3: 2.05 ms per loop > >>> %timeit b2**2 > 1000 loops, best of 3: 1.66 ms per loop > > I'll try clean up the code and get it into my repo tpday. > > Keegan > > > On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien < > frederic.bast...@gmail.com> wrote: > >> I can share our experience in Theano related to that. >> >> I added code to "fuse" dimensions that are contiguous. So if you have a >> 3d
Re: [PyCUDA] Elementwise operations on noncontiguous arrays
It seems striding in kernels can indeed incur a large penalty, though the result is dependent on the array size. I've just updated the code to only use the noncontiguous kernel if necessary. Using a GTX 970, I get the following results using the pow kernel (I'm using ipython's %timeit here, because for some reason python appears to hang on my machine if I use the standard timeit module): >>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray >>> b1 = gpuarray.arange(10, dtype='float64').reshape(1000, 100) >>> b2 = b1[::2, :-1].copy() >>> # force CUDA to compile the kernels before we time >>> b1[::2,:-1]**2 >>> b2**2 >>> %timeit b1[::2,:-1]**2 1000 loops, best of 3: 654 µs per loop >>> %timeit b2**2 1 loops, best of 3: 147 µs per loop >>> b1 = gpuarray.arange(100, dtype='float64').reshape(1000, 1000) >>> b2 = b1[::2, :-1].copy() >>> %timeit b1[::2,:-1]**2 100 loops, best of 3: 2.05 ms per loop >>> %timeit b2**2 1000 loops, best of 3: 1.66 ms per loop I'll try clean up the code and get it into my repo tpday. Keegan On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien < frederic.bast...@gmail.com> wrote: > I can share our experience in Theano related to that. > > I added code to "fuse" dimensions that are contiguous. So if you have a 3d > tensor, and only 1 dimensions is non contiguous, you won't pay the indexing > of a 3d tensor, but only of a 2d tensor. > > I saw 2x speed up in such cases. It was old gpu and old cuda version, > maybe this changed. > > In the new Theano back-end, we have rewriting that in a more readable > version: > > https://github.com/Theano/libgpuarray/blob/master/src/gpuarray_util.c#L153 > > This also take into account the broadcasting. > > This could be done before call the kernel. > > On Wed, Nov 30, 2016 at 8:24 PM, Andreas Kloeckner < > li...@informa.tiker.net> wrote: > >> Keegan, >> >> Keegan Owsleywrites: >> > I've just slapped together a patch to pycuda that makes most elementwise >> > operations work with noncontiguous arrays. There are a bunch of hacks in >> > there, and the code needs some reorg before it's ready to be considered >> for >> > upstream (I made these changes while learning the pycuda codebase, so >> > there's a bunch of crud that can be cleaned out), but I figure I might >> as >> > well put it out there in its current state and see what you guys think. >> > It's also not extremely well-tested (I have no idea if it interferes >> with >> > skcuda, for example), but all of the main functions appear to work. >> > >> > You can check out the code at https://bitbucket.org/owsleyk_ >> omega/pycuda. >> > >> > Briefly, this works by adding new parameters into elementwise kernels >> that >> > describe the stride and shape of your arrays, then using a function that >> > computes the location in memory from the stride, shape, and index. >> > Elementwise kernel ops are modified so that they use the proper >> indexing. >> > See an example of a kernel that's generated below: >> >> Thanks for putting this together and sharing it! I have one main >> question about this, regarding performance: >> >> Modulo (especially variable-denominator modulo) has a habit of being >> fantastically slow on GPUs. Could you time contiguous >> vs. noncontiguous for various levels of "gappiness" and number of >> axes? I'm asking this because I'd be OK with a 50% slowdown, but not >> necessarily a factor of 5 slowdown on actual GPU hardware. >> >> Thanks! >> Andreas >> >> ___ >> PyCUDA mailing list >> PyCUDA@tiker.net >> https://lists.tiker.net/listinfo/pycuda >> > > ___ PyCUDA mailing list PyCUDA@tiker.net https://lists.tiker.net/listinfo/pycuda
Re: [PyCUDA] Elementwise operations on noncontiguous arrays
Keegan, Keegan Owsleywrites: > I've just slapped together a patch to pycuda that makes most elementwise > operations work with noncontiguous arrays. There are a bunch of hacks in > there, and the code needs some reorg before it's ready to be considered for > upstream (I made these changes while learning the pycuda codebase, so > there's a bunch of crud that can be cleaned out), but I figure I might as > well put it out there in its current state and see what you guys think. > It's also not extremely well-tested (I have no idea if it interferes with > skcuda, for example), but all of the main functions appear to work. > > You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda. > > Briefly, this works by adding new parameters into elementwise kernels that > describe the stride and shape of your arrays, then using a function that > computes the location in memory from the stride, shape, and index. > Elementwise kernel ops are modified so that they use the proper indexing. > See an example of a kernel that's generated below: Thanks for putting this together and sharing it! I have one main question about this, regarding performance: Modulo (especially variable-denominator modulo) has a habit of being fantastically slow on GPUs. Could you time contiguous vs. noncontiguous for various levels of "gappiness" and number of axes? I'm asking this because I'd be OK with a 50% slowdown, but not necessarily a factor of 5 slowdown on actual GPU hardware. Thanks! Andreas ___ PyCUDA mailing list PyCUDA@tiker.net https://lists.tiker.net/listinfo/pycuda
[PyCUDA] Elementwise operations on noncontiguous arrays
Hello, I've just slapped together a patch to pycuda that makes most elementwise operations work with noncontiguous arrays. There are a bunch of hacks in there, and the code needs some reorg before it's ready to be considered for upstream (I made these changes while learning the pycuda codebase, so there's a bunch of crud that can be cleaned out), but I figure I might as well put it out there in its current state and see what you guys think. It's also not extremely well-tested (I have no idea if it interferes with skcuda, for example), but all of the main functions appear to work. You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda. Briefly, this works by adding new parameters into elementwise kernels that describe the stride and shape of your arrays, then using a function that computes the location in memory from the stride, shape, and index. Elementwise kernel ops are modified so that they use the proper indexing. See an example of a kernel that's generated below: #include typedef struct { unsigned n[2]; long stride[2]; } dim; __device__ unsigned i2m(unsigned i, dim d) { unsigned m = 0; unsigned j = i; for(int k = 0; k < 2; k++) { m += d.stride[k] * (j%d.n[k]); j = j / d.n[k]; } return m; } __global__ void axpbyz(float a, float *x, float b, float *y, float *z, unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2) { unsigned tid = threadIdx.x; unsigned total_threads = gridDim.x*blockDim.x; unsigned cta_start = blockDim.x*blockIdx.x; unsigned i; ; for (i = cta_start + tid; i < n; i += total_threads) { z[i2m(i,*__dim2)] = a*x[i2m(i,*__dim0)] + b*y[i2m(i,*__dim1)]; } ; } I've also attached a patch file that should take you from latest git to the version in my repo. All of the changes are in elementwise.py and gpuarray.py. 0001-Allow-noncontiguous-arrays-in-elementwise-ops.patch Description: Binary data ___ PyCUDA mailing list PyCUDA@tiker.net https://lists.tiker.net/listinfo/pycuda