Also, should we use multiple templates to test the portability of the
device-specific code? (testing all the local/global combinations should be
enough)


2014-08-14 21:07 GMT+02:00 Philippe Tillet <[email protected]>:

> Hey,
>
> The GEMM kernel(s) are getting pretty tricky, with quite a few fallbacks
> involved. This gets hard to test, so I thought it could be a good idea to
> discuss this. Basically, here is how it works:
>
> A = [A1 A2; A3 A4]
> B = [B1 B2; B3 B4]
> C = [C1 C2; C3 C4]
>
> Where each block is divided according to the corresponding block size of
> the template. For example, the dimensions of A1 are the largest multiples
> of the size tuple (ML, KL) that fit in A, where ML is the number of rows
> computed by each work group, and KL is the "width step" for computing the
> inner products (if the kernel uses local memory, it will load successive
> blocks of size ML*KL in each work group).
>
> A few kernels are enqueued so that:
> C1 = A1*B1 [optimized kernel]
> C1 += A2*B3 [fallback] if needed
> C2 = A1*B2 [fallback] if needed
> C2 += A2*B4 [fallback] if needed
> etc...
>
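
To make the launch sequence above concrete, here is a rough Python sketch of
how the list of kernels could be derived from the sizes. It is only an
illustration, not the actual ViennaCL code: the function names are made up,
NL (the column block size) is my assumption by analogy with ML and KL, and
the block sizes in the usage note are arbitrary.

```python
def split(size, block):
    """Return (bulk, rest), where bulk is the largest multiple of block <= size."""
    bulk = (size // block) * block
    return bulk, size - bulk


def block_name(prefix, row, col):
    """Map a 2x2 block position to the A1..A4 / B1..B4 / C1..C4 naming."""
    return f"{prefix}{2 * row + col + 1}"


def plan_gemm(M, N, K, ML, NL, KL):
    """List the launches for C = A*B, with A of size M x K and B of size K x N.

    NL is a hypothetical column block size, not named in the original mail.
    """
    m, n, k = split(M, ML), split(N, NL), split(K, KL)
    launches = []
    for i in range(2):          # row part of C (bulk, remainder)
        for j in range(2):      # column part of C (bulk, remainder)
            op = "="            # first product assigns, later ones accumulate
            for p in range(2):  # part of the inner dimension
                if 0 in (m[i], n[j], k[p]):
                    continue    # empty block: nothing to enqueue ("if needed")
                kernel = "optimized" if (i, j, p) == (0, 0, 0) else "fallback"
                launches.append((block_name("C", i, j), op,
                                 block_name("A", i, p),
                                 block_name("B", p, j), kernel))
                op = "+="
    return launches


if __name__ == "__main__":
    for launch in plan_gemm(143, 395, 284, ML=64, NL=64, KL=16):
        print(*launch)
```

With sizes that are already multiples of the block sizes, e.g.
plan_gemm(128, 384, 256, 64, 64, 16), only the optimized launch remains.
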
> Basically, one optimized kernel doing the bulk of the work, and the other
> ones doing the "clean-up". This works well for full matrices and ranges.
> When slices are involved, things get more complicated. If the stride is on
> the non-leading dimension (stride2 for column-major matrices), then it can
> be incorporated in the optimized kernel (by inserting ld *= stride2 at the
> beginning of the kernel). However, if stride1 > 1, then we need to use the
> fallback kernel. This is a reasonable thing to do: in most applications I
> know of, only one dimension is strided at a time (we want a subset of the
> rows or columns of a given matrix).
>
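
A small sketch of why the stride2 case can be folded into the leading
dimension, assuming the usual column-major addressing with per-dimension
start offsets. The helper names and the start1/start2 parameters are
hypothetical, chosen only for this illustration.

```python
def slice_index(i, j, start1, start2, stride1, stride2, ld):
    """Linear index of element (i, j) of a column-major slice."""
    return (start1 + i * stride1) + (start2 + j * stride2) * ld


def folded_index(i, j, start1, start2, stride2, ld):
    """Same element when stride1 == 1, using a scaled leading dimension."""
    base = start1 + start2 * ld   # offset of element (0, 0)
    ld_eff = ld * stride2         # the "ld *= stride2" trick
    return base + i + j * ld_eff


def choose_kernel(stride1):
    """Dispatch rule from the text: any stride1 > 1 forces the fallback."""
    return "fallback" if stride1 > 1 else "optimized"


if __name__ == "__main__":
    # Both addressing schemes agree whenever stride1 == 1.
    ok = all(
        slice_index(i, j, 2, 3, 1, 4, 100) == folded_index(i, j, 2, 3, 4, 100)
        for i in range(8) for j in range(8)
    )
    print(ok, choose_kernel(1), choose_kernel(2))
```
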
> However, this becomes really messy to test! I think that, to have a
> sufficiently exhaustive test suite, we should go for:
>
> - Matrices of complicated arbitrary sizes (143, 284, 395). It is important
> to space them by more than 128, to be sure that A1, B1 and C1 are not
> square.
> - Ranges of similar complicated sizes.
> - "Optimized" range: (128, 256, 384) for example
> - matrix row-wise slices, matrix col-wise slices, matrix slice in both
> directions.
>
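
One possible way to enumerate the cases listed above, as a sketch only.
Crossing every size set with every view kind is my assumption (the list does
not say whether all combinations are needed), the labels are made up, and the
block size of 128 is taken from the spacing remark in the first bullet.

```python
from itertools import product

# Size triples from the list above; the dictionary keys are hypothetical labels.
SIZES = {
    "arbitrary": (143, 284, 395),
    "aligned": (128, 256, 384),
}

# View kinds from the list above.
VIEWS = ("matrix", "range", "slice_rows", "slice_cols", "slice_both")


def gemm_test_cases():
    """Return every (size label, view kind) pair to run the GEMM check on."""
    return list(product(SIZES, VIEWS))


def bulk_sizes(sizes, block=128):
    """Largest multiple of block that fits in each size (assumed block of 128)."""
    return tuple((s // block) * block for s in sizes)


if __name__ == "__main__":
    for case in gemm_test_cases():
        print(case)
    # Distinct bulk sizes mean the bulk blocks A1, B1 and C1 are not square.
    print(bulk_sizes(SIZES["arbitrary"]))
```
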
> I am ready to rewrite the GEMM tests accordingly, but any thought on the
> procedure would be appreciated!
>
> Philippe
>
------------------------------------------------------------------------------
_______________________________________________
ViennaCL-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-devel