I forgot to add that, right now, the kernel generator is used only when
start1=start2=start3=0 and stride1=stride2=stride3=1. Otherwise, we fall
back to the good old kernel. I'd like to change this because I think that
ranges are more common than strides (ranges are part of the BLAS3 API,
whereas strides are only partially supported through LD{A,B,C}). If we
want an efficient viennacl_blas, we'll have to investigate the issue
sooner or later :P
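
To make the current dispatch rule concrete, here is a minimal sketch. It
assumes that the indices 1, 2, 3 refer to the three operands of the
operation; the names (slice, use_generator) are hypothetical and are not
ViennaCL's actual API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical view descriptor for one operand (not a ViennaCL type).
struct slice {
  std::size_t start;   // offset of the first element
  std::size_t stride;  // step between consecutive elements, at least 1
};

// An operand is "plain" when it is neither a range nor a strided slice.
inline bool is_plain(slice const & s) {
  assert(s.stride >= 1);
  return s.start == 0 && s.stride == 1;
}

// The generated kernel is used only if all three operands are plain;
// any other combination goes to the old kernel.
inline bool use_generator(slice const & x1, slice const & x2, slice const & x3) {
  return is_plain(x1) && is_plain(x2) && is_plain(x3);
}
```

With this rule, a single operand with a nonzero start is enough to send the
whole operation to the old kernel, which is why ranges currently get no
benefit from the generator.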

Philippe


2014-06-02 23:09 GMT+02:00 Philippe Tillet <phil.til...@gmail.com>:

> Hi,
>
> The integration of the generator is going slowly, but "surely". All the
> OpenCL vector kernels (including plane rotations, multi-inner_prod, etc.)
> should be device-specific in a few days. This will be a good leap forward,
> in terms of maintainability, peak performance and performance-portability.
> However, I feel like this is gonna get much more complicated for BLAS3.
> Here's why.
>
> Ranges do not work well with zero-padding. The kernel may operate on
> nonzero out-of-bounds elements, thus giving an incorrect result ...! Of
> course, for performance it is mandatory that each work-group processes
> whole blocks without any costly size-checking. I think that we should
> drop zero-padding, because it forces us to treat vectors and ranges
> differently, which we shouldn't have to do, since ranges are vectors,
> too. I see two ways of handling GEMM without zero-padding, both of which
> have advantages and drawbacks:
>
> (1) Have an optimized kernel for ideal cases (size is a proper multiple of
> 64/128/Friday/Whatever ; stride{A,B,C} can be incorporated into LDA/LDB ;
> start{A,B,C} are multiples of the vector length used in the kernel), and a
> fallback kernel for all the other cases. This fallback is super-safe
> (size-checking, vector_length=1...)
>
> (2) The same as above, but the optimized kernel is always used for
> performing the large sub-matrix multiplication (rounding the size down to
> the nearest multiple), and the fallback is just used to finish the job.
>
> I would go for (2), as (1) is simpler to implement but disastrous for
> large odd matrices. However, for small matrices (2) will have a large
> overhead, and it may be significantly worse than zero-padding in some
> corner cases (consider a 60x100000 matrix with either zero-padding to make
> it 64x100032 or a crappy kernel...). Do you have any idea of how typical
> BLAS implementations handle this issue with the offset? (I believe strides
> are rare enough that a slow but safe kernel is acceptable for them.)
>
> Philippe
>
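
The size arithmetic behind option (2) and behind zero-padding can be
sketched as follows. This is only an illustration under assumptions: the
block size of 64 is one of the values mentioned in the quoted message, and
the names (extent_split, split_extent, padded_extent) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// How one matrix dimension is divided between the two kernels in option (2).
struct extent_split {
  std::size_t bulk;       // leading part, a multiple of the block size
  std::size_t remainder;  // trailing part left to the fallback kernel
};

// Round the size down to a multiple of block; the rest is the remainder.
inline extent_split split_extent(std::size_t size, std::size_t block) {
  assert(block > 0);
  std::size_t bulk = (size / block) * block;
  return extent_split{bulk, size - bulk};
}

// Zero-padding instead rounds the size up to a multiple of block.
inline std::size_t padded_extent(std::size_t size, std::size_t block) {
  assert(block > 0);
  return ((size + block - 1) / block) * block;
}
```

For the 60x100000 example with a block size of 64, split_extent(60, 64)
gives a bulk of 0, so the optimized kernel has nothing to work on and the
whole product goes to the fallback, whereas padding only grows the matrix
to 64x100032.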
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel