Hey,

> The integration of the generator is going slowly, but "surely". All the
> OpenCL vector kernels (including plane rotations, multi-inner_prod,
> etc.) should be device-specific in a few days. This will be a good leap
> forward in terms of maintainability, peak performance and
> performance-portability.
great, glad to hear that!

> However, I feel like this is gonna get much more complicated for BLAS3.
> Here's why.
>
> Ranges do not work well with zero-padding. The kernel may perform
> operations on nonzero out-of-bound elements, thus giving an incorrect
> result...! Of course, for performance it is mandatory to have each
> work-group process its blocks without any expensive size-checking. I
> think that we should drop zero-padding, because it forces us to treat
> vectors and ranges differently, which we shouldn't have to do, since
> ranges are vectors, too.

Yes, it would be better to treat this with one single framework. More below.

> I see two ways of handling GEMM without zero-padding, both of which
> have advantages and drawbacks:
>
> (1) Have an optimized kernel for ideal cases (size is a proper multiple
> of 64/128/whatever; stride{A,B,C} can be incorporated into LDA/LDB;
> start{A,B,C} are multiples of the vector length used in the kernel),
> and a fallback kernel for all the other cases. This fallback is
> super-safe (size-checking, vector_length=1, ...).
>
> (2) The same as above, but the optimized kernel is always used for
> performing the large sub-matrix multiplication (rounding the size down
> to the best previous multiple), and the fallback is just used to
> finish the job.
>
> I would go for (2), as (1) is simpler to implement but disastrous for
> large odd matrices.

I also consider (2) to be the better approach.

> However, for small matrices (2) will have a large overhead, and it may
> be significantly worse than zero-padding in some corner cases (consider
> a matrix 60x100000 with either zero-padding to make it 64x100032 or a
> crappy kernel...). Do you have any idea of how typical BLAS
> implementations handle this issue with the offset? (strides are rare
> enough to justify a slow but safe kernel, I believe)

Even for such very tall/thin matrices the overhead isn't that bad for most
use cases: the zero-padded 64x100032 matrix holds only about 7% more
elements than the original 60x100000 one. Yes, in a worst-case scenario
the overhead is 64x (think of a single row padded up to 64 rows), but I
claim that this is corner-case enough to not worry about. We could think
about making the alignment customizable, but I doubt that it provides
significant value for the effort required (kernel selection, etc.). I
don't know how other BLAS implementations deal with this, yet I doubt
that they worry much about it. This whole memory alignment business only
became this important with GPGPU.

Best regards,
Karli
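P.S.: To make sure we picture option (2) the same way, here is a rough
sketch of the host-side dispatch in plain C++, purely for illustration.
All names (run_gemm_optimized, run_gemm_fallback, BLOCK) are made up, and
the K-remainder as well as the LDA/start/stride plumbing are left out for
brevity:

#include <cstddef>
#include <cstdio>

const std::size_t BLOCK = 64;  // granularity required by the optimized kernel

// fast path: assumes row/column counts are multiples of BLOCK (stub)
void run_gemm_optimized(std::size_t rows, std::size_t cols, std::size_t K)
{
  std::printf("optimized: %zu x %zu, K = %zu\n", rows, cols, K);
}

// safe path: arbitrary sub-range [r0,r1) x [c0,c1),
// size-checked, vector_length = 1 (stub)
void run_gemm_fallback(std::size_t r0, std::size_t r1,
                       std::size_t c0, std::size_t c1, std::size_t K)
{
  std::printf("fallback : rows [%zu,%zu), cols [%zu,%zu), K = %zu\n",
              r0, r1, c0, c1, K);
}

void gemm(std::size_t M, std::size_t N, std::size_t K)
{
  std::size_t M0 = (M / BLOCK) * BLOCK;  // round down to the best
  std::size_t N0 = (N / BLOCK) * BLOCK;  // previous multiple of BLOCK

  if (M0 > 0 && N0 > 0)
    run_gemm_optimized(M0, N0, K);       // bulk of the work, fast kernel

  if (M0 > 0 && N0 < N)
    run_gemm_fallback(0, M0, N0, N, K);  // right border strip

  if (M0 < M)
    run_gemm_fallback(M0, M, 0, N, K);   // bottom border, full width
}

int main()
{
  gemm(1000, 999, 500);  // 960x960 via the fast kernel, two strips via the fallback
}

For a small matrix the two strips dominate, which is exactly the overhead
you are worried about; for large odd sizes they are negligible.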