Hey,

Actually, we could go even further: matrix products are computed 2592 more times using uBLAS in the test blas3_prod_float-cpu. Could we consider using the filesystem for the tests? It would also allow us to get rid of Boost in the test suite, wouldn't it?
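To make the filesystem idea a bit more concrete, here is a minimal sketch, assuming the reference result is simply stored as raw floats in a binary file and that the test inputs are deterministic. The names reference_product and load_or_compute and the file layout are hypothetical, chosen for illustration only; they are not taken from ViennaCL.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: naive row-major reference product C = A * B,
// with A of size M x K and B of size K x N. No uBLAS/Boost involved.
std::vector<float> reference_product(std::vector<float> const & A,
                                     std::vector<float> const & B,
                                     std::size_t M, std::size_t K, std::size_t N)
{
  std::vector<float> C(M * N, 0.0f);
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t k = 0; k < K; ++k)
      for (std::size_t j = 0; j < N; ++j)
        C[i * N + j] += A[i * K + k] * B[k * N + j];
  return C;
}

// Hypothetical helper: return the result cached in 'path' if the file holds
// at least M*N floats; otherwise compute the product and write it to 'path'.
// Assumption: the file name uniquely identifies the (deterministic) inputs.
std::vector<float> load_or_compute(std::string const & path,
                                   std::vector<float> const & A,
                                   std::vector<float> const & B,
                                   std::size_t M, std::size_t K, std::size_t N)
{
  std::vector<float> C(M * N);
  std::streamsize const bytes = static_cast<std::streamsize>(C.size() * sizeof(float));

  std::ifstream in(path.c_str(), std::ios::binary);
  if (in && in.read(reinterpret_cast<char *>(C.data()), bytes))
    return C;  // cache hit: the full result was read from disk

  // Cache miss (file absent or too short): compute once and store.
  C = reference_product(A, B, M, K, N);
  std::ofstream out(path.c_str(), std::ios::binary);
  out.write(reinterpret_cast<char const *>(C.data()), bytes);
  return C;
}
```

With something along these lines, a call such as load_or_compute("gemm_ref.bin", A, B, M, K, N) would only pay for the CPU product on the first run. The obvious catch is that the cached files have to be regenerated whenever the test inputs or sizes change.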
Philippe


2014-08-14 22:20 GMT+02:00 Philippe Tillet <[email protected]>:

> Hey,
>
> 2014-08-14 22:10 GMT+02:00 Karl Rupp <[email protected]>:
>
>> Hi,
>>
>>> The GEMM kernel(s) are getting pretty tricky, with quite a few fallbacks
>>> involved. This gets hard to test, so I thought it could be a good idea
>>> to discuss this. Basically, here is how it works:
>>>
>>> A = [A1 A2; A3 A4]
>>> B = [B1 B2; B3 B4]
>>> C = [C1 C2; C3 C4]
>>>
>>> Where each block is divided according to the corresponding block size of
>>> the template. For example, A1 is the closest multiple of the size tuple
>>> (ML, KL), where ML is the number of rows computed by each work group,
>>> and KL the "width step" for computing the inner products (if the kernel
>>> uses local memory, it will load successive blocks of size ML*KL in each
>>> work group).
>>>
>>> A few kernels are enqueued so that:
>>> C1 = A1*B1 [optimized kernel]
>>> C1 += A2*B3 [fallback] if needed
>>> C2 = A1*B2 [fallback] if needed
>>> C2 += A2*B4 [fallback] if needed
>>> etc...
>>>
>>> Basically, one optimized kernel doing the bulk of the work, and the
>>> other ones doing the "clean-up". This works well for full matrices and
>>> ranges. When slices are involved, things get more complicated. If the
>>> stride is on the non-leading dimension (stride2 for column-major
>>> matrices), then it can be incorporated in the optimized kernel (by
>>> appending ld *= stride2 at the beginning of the kernel). However, if
>>> stride1 > 1, then we need to use the fallback kernel. This is a
>>> reasonable thing to do: in most applications I know of, only one stride
>>> is accessed at a time (we want a set of the rows/columns of a given
>>> matrix).
>>>
>>> However, this becomes really messy to test! Basically, I think that, to
>>> have an exhaustive enough testing suite, we should go for:
>>>
>>> - Matrices of complicated arbitrary sizes (143, 284, 395). It is
>>> important to space them by more than 128, to be sure that A1, B1 and C1
>>> are not square.
>>> - Ranges of similar complicated sizes.
>>> - "Optimized" range: (128, 256, 384) for example
>>> - matrix row-wise slices, matrix col-wise slices, matrix slice in both
>>> directions.
>>
>> As far as I can tell, all you need to do is to adjust the matrix sizes in
>> the existing gemm tests? It covers all this already. What am I missing?
>
> Well, essentially it's about readjusting the size, yes. But the tests
> should be slightly different and allow for multiple passes on multiple size
> tuples.
>
>>> I am ready to rewrite the GEMM tests accordingly, but any thought on the
>>> procedure would be appreciated!
>>
>> The GEMM tests are quite an issue already, because they consume a lot of
>> time, particularly on weaker systems. A substantial part of the problem is
>> the verification on the CPU with uBLAS, which both adds a uBLAS dependency
>> and is also rather slow. The current test sizes are pretty much the minimum
>> possible, but still they take minutes to complete. Without a proper
>> strategy to deal with this, chances are high that we make our test system
>> almost unmanageable... Any clever approaches appreciated!
>
> Well, with the current approach I've noticed that something a bit silly is
> being done, in that products are computed many, many more times than
> necessary. For all row/col layout combinations, A*B has to be computed only
> once for full/range/stride. Then, C += A*B, C -= A*B can be tested on the GPU
> without recomputing A*B on the CPU.
>
> Right now, the CPU product is computed something like 8*27*12 = 2592
> times. We could equally test our GEMM implementation with only 27*4 = 108
> clever computations (all the full/stride/range combinations for all the
> transposition possibilities). Also, the test file is like 800 lines long,
> which is a bit discouraging to modify :-p I'll refurbish it using macros
> and such. As a side note, most tests could really benefit from using
> macros. I've lost a couple of hours a few days ago because the vector tests
> report a failure on dot product when plane rotation is faulty. There are a
> couple of similar glitches here and there. Perhaps we should do this during
> the large code refactoring session we've planned for a couple of weeks
> already :-p
>
> Philippe
>
>> Best regards,
>> Karli
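For the block decomposition quoted above, the split of one dimension into an "optimized" part and a "clean-up" part can be sketched as follows. This is only an illustration under two assumptions: that "closest multiple" means the largest multiple of the block size not exceeding the dimension, and that the block size is 128 (suggested by the sizes mentioned in the thread, but not stated as the actual template parameter). The names block_split and split are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper type: the extent handled by the optimized kernel and
// the remainder left to the fallback kernels, for one matrix dimension.
struct block_split
{
  std::size_t optimized;  // largest multiple of 'block' that fits in 'size'
  std::size_t remainder;  // what is left for the fallback kernels
};

// Split 'size' according to 'block' (block must be nonzero).
block_split split(std::size_t size, std::size_t block)
{
  block_split s;
  s.optimized = (size / block) * block;
  s.remainder = size - s.optimized;
  return s;
}
```

Under these assumptions, the sizes (143, 284, 395) split into 128 + 15, 256 + 28 and 384 + 11, so every fallback path gets exercised, whereas (128, 256, 384) leave a remainder of zero and only hit the optimized kernel.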
------------------------------------------------------------------------------
_______________________________________________
ViennaCL-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
