Hi again,

> Also, should we use multiple templates to test the portability of the
> device-specific code? (testing all the local/global combinations should
> be enough)
Yes, it would be good to have command line options to select a different
device set. On some older systems we can't run more than the default
settings, but on newer (faster) machines there is enough power available
to test multiple profiles (e.g. all NVIDIA profiles on the Tesla).

Best regards,
Karli


> 2014-08-14 21:07 GMT+02:00 Philippe Tillet <phil.til...@gmail.com>:
>
> Hey,
>
> The GEMM kernel(s) are getting pretty tricky, with quite a few
> fallbacks involved. This gets hard to test, so I thought it could be
> a good idea to discuss this. Basically, here is how it works:
>
> A = [A1 A2; A3 A4]
> B = [B1 B2; B3 B4]
> C = [C1 C2; C3 C4]
>
> where each block is divided according to the corresponding block size
> of the template. For example, A1 is the largest block of A whose size
> is a multiple of the size tuple (ML, KL), where ML is the number of
> rows computed by each work group, and KL the "width step" for
> computing the inner products (if the kernel uses local memory, it
> will load successive blocks of size ML*KL in each work group).
>
> A few kernels are enqueued so that:
>
> C1 = A1*B1 [optimized kernel]
> C1 += A2*B3 [fallback, if needed]
> C2 = A1*B2 [fallback, if needed]
> C2 += A2*B4 [fallback, if needed]
> etc.
>
> Basically, there is one optimized kernel doing the bulk of the work,
> and the other ones do the "clean-up". This works well for full
> matrices and ranges. When slices are involved, things get more
> complicated. If the stride is on the non-leading dimension (stride2
> for column-major matrices), then it can be incorporated into the
> optimized kernel (by appending ld *= stride2 at the beginning of the
> kernel). However, if stride1 > 1, then we need to use the fallback
> kernel. This is a reasonable thing to do: in most applications I know
> of, only one stride is accessed at a time (we want a subset of the
> rows/columns of a given matrix).
>
> However, this becomes really messy to test! Basically, I think that,
> to have an exhaustive enough testing suite, we should go for:
>
> - Matrices of complicated, arbitrary sizes (143, 284, 395). It is
>   important to space them by more than 128 to be sure that A1, B1
>   and C1 are not square.
> - Ranges of similar complicated sizes.
> - An "optimized" range: (128, 256, 384), for example.
> - Matrix row-wise slices, matrix column-wise slices, and matrix
>   slices in both directions.
>
> I am ready to rewrite the GEMM tests accordingly, but any thought on
> the procedure would be appreciated!
>
> Philippe
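
To make the split above concrete, here is a rough standalone sketch
(illustration only, not ViennaCL's internal code; the ML/KL/NL values
are made-up example template sizes) of which kernels end up being
enqueued for the proposed matrix sizes:

// Illustration only: compute the size of the "optimized" block C1 = A1*B1
// and list which fallback kernels are needed for the remaining blocks.
// ML/NL are rows/columns computed per work group, KL is the width step.
#include <cstddef>
#include <iostream>

int main()
{
  std::size_t M = 143, K = 284, N = 395;   // proposed "complicated" sizes
  std::size_t ML = 64, KL = 16, NL = 128;  // example template block sizes

  std::size_t M1 = (M / ML) * ML;  // rows of A1 and C1
  std::size_t K1 = (K / KL) * KL;  // columns of A1, rows of B1
  std::size_t N1 = (N / NL) * NL;  // columns of B1 and C1

  std::cout << "C1 (" << M1 << "x" << N1 << ") = A1*B1   [optimized kernel]\n";
  if (K1 < K)           std::cout << "C1 += A2*B3          [fallback]\n";
  if (N1 < N)           std::cout << "C2  = A1*B2 + A2*B4  [fallback]\n";
  if (M1 < M)           std::cout << "C3  = A3*B1 + A4*B3  [fallback]\n";
  if (M1 < M && N1 < N) std::cout << "C4  = A3*B2 + A4*B4  [fallback]\n";
  return 0;
}

With these example numbers, C1 only covers a 128x384 block of the
143x395 result, so all four fallback updates get exercised.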
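
As for the tests themselves, a minimal sketch of the size/range/slice
combinations through the public interface (viennacl::matrix,
viennacl::range, viennacl::slice, viennacl::project,
viennacl::linalg::prod); the offsets and sub-sizes below are just
placeholders, and filling the operands with reference data and
checking against a host GEMM is left out:

// Sketch of the proposed coverage: full matrices, ranges, and slices with
// a non-unit stride on the first index (the case needing the fallback).
// Filling A and B (e.g. via viennacl::copy()) and result checks are omitted.
#include <cstddef>

#include "viennacl/matrix.hpp"
#include "viennacl/matrix_proxy.hpp"
#include "viennacl/linalg/prod.hpp"

int main()
{
  typedef float T;
  std::size_t M = 143, K = 284, N = 395;  // sizes spaced by more than 128

  viennacl::matrix<T> A(M, K), B(K, N), C(M, N);

  // full matrices
  C = viennacl::linalg::prod(A, B);

  // ranges: contiguous sub-blocks of "complicated" sizes
  viennacl::range rowsA(7, 7 + 64), colsA(11, 11 + 96);
  viennacl::range rowsB(11, 11 + 96), colsB(3, 3 + 72);
  viennacl::range rowsC(7, 7 + 64), colsC(3, 3 + 72);
  viennacl::project(C, rowsC, colsC)
    = viennacl::linalg::prod(viennacl::project(A, rowsA, colsA),
                             viennacl::project(B, rowsB, colsB));

  // slices: stride 2 on the rows of A and C, i.e. the fallback path
  viennacl::slice srowsA(0, 2, 64), scolsA(0, 1, 96);
  viennacl::slice srowsB(0, 1, 96), scolsB(0, 1, 72);
  viennacl::slice srowsC(0, 2, 64), scolsC(0, 1, 72);
  viennacl::project(C, srowsC, scolsC)
    = viennacl::linalg::prod(viennacl::project(A, srowsA, scolsA),
                             viennacl::project(B, srowsB, scolsB));

  return 0;
}

Row-wise-only, column-wise-only and both-direction slices would then
just be the obvious variations of the last block.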