Hey,

> The integration of the generator is going slowly, but "surely". All the
> OpenCL vector kernels (including plane rotations, multi-inner_prod,
> etc.) should be device-specific in a few days. This will be a good leap
> forward in terms of maintainability, peak performance and
> performance-portability.
great, glad to hear that!

> However, I feel like this is gonna get much more complicated for BLAS3.
> Here's why.
>
> Ranges do not work well with zero-padding. The kernel may perform
> operations on nonzero out-of-bound elements, thus giving an incorrect
> result...! Of course, for performance it is mandatory to have each
> work-group process its blocks without any expensive size-checking. I
> think that we should drop zero-padding, because it forces us to treat
> vectors and ranges differently, which we shouldn't have to do, since
> ranges are vectors, too.

Yes, it would be better to treat this with one single framework. More below.

> I see two ways of handling GEMM without zero-padding, both of which
> have advantages and drawbacks:
>
> (1) Have an optimized kernel for ideal cases (size is a proper multiple
> of 64/128/whatever; stride{A,B,C} can be incorporated into LDA/LDB;
> start{A,B,C} are multiples of the vector length used in the kernel),
> and a fallback kernel for all the other cases. This fallback is
> super-safe (size-checking, vector_length=1, ...).
>
> (2) The same as above, but the optimized kernel is always used for
> performing the large sub-matrix multiplication (rounding the size down
> to the best previous multiple), and the fallback is just used to
> finish the job.
>
> I would go for (2), as (1) is simpler to implement but disastrous for
> large odd matrices.

I also consider (2) to be the better approach.

> However, for small matrices (2) will have a large overhead, and it may
> be significantly worse than zero-padding in some corner cases (consider
> a matrix 60x100000 with either zero-padding to make it 64x100032 or a
> crappy kernel...). Do you have any idea of how typical BLAS
> implementations handle this issue with the offset? (strides are rare
> enough to justify a slow but safe kernel, I believe)

Even for such very tall/thin matrices the overhead isn't that bad for most
use cases: the zero-padded 64x100032 matrix holds only about 7% more
elements than the original 60x100000 one. Yes, in a worst-case scenario
the overhead is 64x (think of a single row padded up to 64 rows), but I
claim that this is corner-case enough to not worry about. We could think
about making the alignment customizable, but I doubt that it provides
significant value for the effort required (kernel selection, etc.). I
don't know how other BLAS implementations deal with this, yet I doubt
that they worry much about it. This whole memory alignment business only
became this important with GPGPU.

Best regards,
Karli
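P.S.: To make sure we picture option (2) the same way, here is a rough
sketch of the host-side dispatch in plain C++, purely for illustration.
All names (run_gemm_optimized, run_gemm_fallback, BLOCK) are made up, and
the K-remainder as well as the LDA/start/stride plumbing are left out for
brevity:

#include <cstddef>
#include <cstdio>

const std::size_t BLOCK = 64;  // granularity required by the optimized kernel

// fast path: assumes row/column counts are multiples of BLOCK (stub)
void run_gemm_optimized(std::size_t rows, std::size_t cols, std::size_t K)
{
  std::printf("optimized: %zu x %zu, K = %zu\n", rows, cols, K);
}

// safe path: arbitrary sub-range [r0,r1) x [c0,c1),
// size-checked, vector_length = 1 (stub)
void run_gemm_fallback(std::size_t r0, std::size_t r1,
                       std::size_t c0, std::size_t c1, std::size_t K)
{
  std::printf("fallback : rows [%zu,%zu), cols [%zu,%zu), K = %zu\n",
              r0, r1, c0, c1, K);
}

void gemm(std::size_t M, std::size_t N, std::size_t K)
{
  std::size_t M0 = (M / BLOCK) * BLOCK;  // round down to the best
  std::size_t N0 = (N / BLOCK) * BLOCK;  // previous multiple of BLOCK

  if (M0 > 0 && N0 > 0)
    run_gemm_optimized(M0, N0, K);       // bulk of the work, fast kernel

  if (M0 > 0 && N0 < N)
    run_gemm_fallback(0, M0, N0, N, K);  // right border strip

  if (M0 < M)
    run_gemm_fallback(M0, M, 0, N, K);   // bottom border, full width
}

int main()
{
  gemm(1000, 999, 500);  // 960x960 via the fast kernel, two strips via the fallback
}

For a small matrix the two strips dominate, which is exactly the overhead
you are worried about; for large odd sizes they are negligible.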