Hey,

Actually, we could go even further and notice that matrix products are
computed 2592 more times using ublas in the test blas3_prod_float-cpu.
Could we consider using the filesystem for the tests? It would also allow
us to get rid of boost in the test suite, wouldn't it?
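
To make the reuse idea from my previous mail (quoted below) concrete, here
is a minimal sketch. It assumes a naive host-side product replaces uBLAS;
HostMatrix, reference_prod and expected_update are hypothetical names, not
existing ViennaCL code. The point is only that A*B is computed once and the
expected results of C += A*B and C -= A*B are derived from that cached
product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side matrix: dense, row-major, flat storage.
struct HostMatrix {
  std::size_t rows, cols;
  std::vector<float> data;
  HostMatrix(std::size_t r, std::size_t c, float v = 0.0f)
    : rows(r), cols(c), data(r * c, v) {}
  float& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Naive CPU reference product; meant to be called once per (A, B) pair.
inline HostMatrix reference_prod(const HostMatrix& A, const HostMatrix& B) {
  assert(A.cols == B.rows);
  HostMatrix P(A.rows, B.cols);
  for (std::size_t i = 0; i < A.rows; ++i)
    for (std::size_t k = 0; k < A.cols; ++k)
      for (std::size_t j = 0; j < B.cols; ++j)
        P(i, j) += A(i, k) * B(k, j);
  return P;
}

// Expected value of C0 + sign * P, reusing the cached product P
// (sign = +1 for C += A*B, sign = -1 for C -= A*B).
inline HostMatrix expected_update(const HostMatrix& C0, const HostMatrix& P, float sign) {
  assert(C0.rows == P.rows && C0.cols == P.cols);
  HostMatrix E = C0;
  for (std::size_t n = 0; n < E.data.size(); ++n)
    E.data[n] += sign * P.data[n];
  return E;
}
```

The same cached product could also be written to disk and reloaded by the
test, which is what I mean by using the filesystem.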

Philippe


2014-08-14 22:20 GMT+02:00 Philippe Tillet <[email protected]>:

> Hey,
>
>
> 2014-08-14 22:10 GMT+02:00 Karl Rupp <[email protected]>:
>
>> Hi,
>>
>>
>>> The GEMM kernel(s) are getting pretty tricky, with quite a few fallbacks
>>> involved. This gets hard to test, so I thought it could be a good idea
>>> to discuss this. Basically, here is how it works:
>>>
>>> A = [A1 A2; A3 A4]
>>> B = [B1 B2; B3 B4]
>>> C = [C1 C2; C3 C4]
>>>
>>> Where each block is divided according to the corresponding block size of
>>> the template. For example, the size of A1 is the closest multiple of the
>>> size tuple (ML, KL), where ML is the number of rows computed by each work
>>> group, and KL the "width step" for computing the inner products (if the
>>> kernel uses local memory, it will load successive blocks of size ML*KL in
>>> each work group).
>>>
>>> A few kernels are enqueued so that:
>>> C1 = A1*B1 [optimized kernel]
>>> C1 += A2*B3 [fallback] if needed
>>> C2 = A1*B2 [fallback] if needed
>>> C2 += A2*B4 [fallback] if needed
>>> etc...
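
To make the split concrete, here is a minimal sketch. It assumes that
"closest multiple" means rounding down, and it uses a block size of 128
only because of the 128 spacing mentioned below; the real ML/KL values come
from the template. Split and split_dimension are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Sizes of the two parts one dimension is split into (hypothetical helper type).
struct Split {
  std::size_t bulk;      // part handled by the optimized kernel
  std::size_t remainder; // part handled by the fallback kernels
};

// Assumption: "closest multiple" is the largest multiple of `block` <= n.
inline Split split_dimension(std::size_t n, std::size_t block) {
  assert(block > 0);
  std::size_t bulk = (n / block) * block;
  return Split{bulk, n - bulk};
}
```

Under that assumption, sizes 143, 284 and 395 give bulk parts 128, 256 and
384 with remainders 15, 28 and 11, so every fallback path gets exercised.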
>>>
>>> Basically, one optimized kernel doing the bulk of the work, and the
>>> other ones doing the "clean-up". This works well for full matrices and
>>> ranges. When slices are involved, things get more complicated. If the
>>> stride is on the non-leading dimension (stride2 for column-major
>>> matrices), then it can be incorporated in the optimized kernel (by
>>> adding ld *= stride2 at the beginning of the kernel). However, if
>>> stride1 > 1, then we need to use the fallback kernel. This is a
>>> reasonable thing to do: in most applications I know of, only one stride
>>> is accessed at a time (we want a set of the rows/columns of a given
>>> matrix).
>>>
>>> However, this becomes really messy to test! Basically, I think that, to
>>> have an exhaustive enough testing suite, then we should go for:
>>>
>>> - Matrices of complicated arbitrary sizes (143, 284, 395). It is
>>> important to space them by more than 128, to be sure that A1, B1 and C1
>>> are not square.
>>> - Ranges of similar complicated sizes.
>>> - "Optimized" range: (128, 256, 384) for example
>>> - matrix row-wise slices, matrix col-wise slices, matrix slice in both
>>> directions.
>>>
>>
>> As far as I can tell, all you need to do is to adjust the matrix sizes in
>> the existing gemm tests? It covers all this already. What am I missing?
>
>
> Well, essentially it's about readjusting the sizes, yes. But the tests
> should be slightly different and allow for multiple passes on multiple size
> tuples.
>
>
>>
>>
>>> I am ready to rewrite the GEMM tests accordingly, but any thought on the
>>> procedure would be appreciated!
>>>
>>
>> The GEMM tests are quite an issue already, because they consume a lot of
>> time particularly on weaker systems. A substantial part of the problem is
>> the verification on the CPU with uBLAS, which both adds a uBLAS dependency
>> and is also rather slow. The current test sizes are pretty much the minimum
>> possible, but still they take minutes to complete. Without a proper
>> strategy to deal with this, chances are high that we make our test system
>> almost unmanageable... Any clever approaches appreciated!
>>
>>
> Well, with the current approach I've noticed that something a bit silly is
> being done, in that products are computed many, many more times than
> necessary. For each row/col layout combination, A*B has to be computed only
> once for full/range/stride. Then, C += A*B and C -= A*B can be tested on the
> GPU without recomputing A*B on the CPU.
>
> Right now, the CPU product is computed something like 8*27*12 = 2592
> times. We could equally test our GEMM implementation with only 27*4 = 108
> clever computations (all the full/stride/range combinations for all the
> transposition possibilities). Also, the test file is like 800 lines long,
> which is a bit discouraging to modify :-p I'll refurbish it using macros
> and such. As a side note, most tests could really benefit from using
> macros. I lost a couple of hours a few days ago because the vector tests
> report a failure on dot product when plane rotation is faulty. There are a
> couple of similar glitches here and there. Perhaps we should do this during
> the large code refactoring session we've planned for a couple of weeks
> already :-p
>
> Philippe
>
>> Best regards,
>> Karli
>>
>>
>
------------------------------------------------------------------------------
_______________________________________________
ViennaCL-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
