Hi,

2013/7/28 Karl Rupp <r...@iue.tuwien.ac.at>

> Hey,
>
>
>  I'm proud to announce that after about 3 weeks, I've recoded from
>> scratch the OpenCL code generator to integrate it fully with
>> viennacl::scheduler::statement.
>>
>
> hurray :-) With the changes to the generator I pushed yesterday there is
> now a clear spot on where to hand the expression over to the generator.
>
>
>
>  That being said, I'm reaching the point where I need to ask your
>> opinion on (many) further design choices. Sorted by priority:
>>
>> 1 > How to handle padding? For example, the best kernels for a given
>> operation may use float4, in which case an alignment of 4 is required.
>> For GEMM, though, the kernel internally used blocking. Since the
>> iteration over the blocks is unrolled, I prefer to keep the loop
>> boundary static (known at the OpenCL compile time), so padding inside a
>> kernel is not really an option here. How to handle this?
>> Should we have a plethora of kernels optimized for a large number of
>> block sizes? If yes, how do we choose the block sizes?
>>
>
> My preferred option is to pad by default and to make the padding a
> multiple of either four or sixteen. However, we need to maintain a full
> set of unpadded operations, because user-provided buffers need not be
> padded (and a subsequent padding may be too expensive).


I think always making it a multiple of 16 is a good option, because we can
reasonably assume that optimal performance is rarely obtained when a work
item performs (unrolls) more than 16*16 operations, for most of the
kernels. However, we need a clear and easily extensible dispatch mechanism
that dispatches certain sizes to certain specific kernels, which is what I
was talking about:
Best {m, k, n} big block sizes for the GEMM kernel:

GEMM Row-Major * Row-Major
AMD : 16 * 64 * 256
NVidia : 16 * 128 * 128
Intel CPU : 64 * 64 * 128.

Of course, this is bound to be device-specific rather than vendor-specific,
and once the autotuning procedure works better we might have block sizes
such as 96, 112, etc. Furthermore, for the kernel to be correct, each size
has to be a multiple of the corresponding block size (3 constraints). We
can never expect the user to call the kernel with the proper sizes. The
problem is that the padding in ViennaCL is static, while these block sizes
are only known at runtime... Should we just write somewhere in the
documentation what the best kernels are?



>
>
>
>  2 > For each operation (BLAS1/BLAS2/BLAS3 for now), an infinite number
>> of kernels can be generated. Designing a proper test suite in such a
>> situation is a challenging task. I've thought about testing a fixed
>> number of randomly chosen kernels.
>>
>
> Please no random tests. This makes it awfully complicated to fix, because
> eventually one may not even be able to reproduce a previous failure.
>
> Even though the number of possible kernel variations is large (though
> finite), there's only a limited set which actually gives good performance.
> These are the important kernels to be tested thoroughly.


Yes, but this limited set is device- and program-specific, and it is hard
to know in advance (that's what autotuning is for). I don't think anyone
could tell me explicitly which combination of {alignment, ml, kl, nl, ms,
ks, ns, use_lhs_shared, use_rhs_shared, unroll} gives good performance ;)
And even if I choose only two values for each parameter, that leads to
2¹⁰ = 1024 tests per layout/transposition combination = 32 768 tests...
which is ridiculously high :D
What about integrating the test procedure into the autotuning procedure?
It's not intuitive, but I see no better way.


> Finally, multiple operations can be packed together (multiple SAXPY,
> multiple scalar reduction/inner product, multiple vector
> reduction/gemv). If the number of packed operations is too high, the
> local memory usage will be too high and the OpenCL kernel may not
> *compile*. Should we provide a mechanism to evaluate this upper bound at
> runtime (doable), or just use a very conservative value for now (the
> OpenCL standard guarantees 16 kB of local memory, and the kernel
> generator guarantees an upper bound on the amount of local memory
> used)? I prefer the second option.
>

> Sooner or later we will have to go for the runtime option anyway. I don't
> see any benefit of being overly pessimistic with 16 kB if we have the true
> local memory available at runtime.


Right, it's not overly complicated to do. The problem is more about knowing
the optimization profile used at runtime (i.e. the local memory used by the
to-be-compiled kernel). OK, this means that the optimization profile should
not change (since I think we cannot really use global objects), so that the
local memory value is consistent over time. Only the autotuner will be
allowed to play with optimization profiles, then, which is fine with me.


>
>
>
>  3 > There are several expression nodes that should be supported only by
>> the generator for now (even though not yet implemented):
>>     - reduce<op>(vector_expression)
>>     - reduce_rows<op>(matrix_expression)
>>     - reduce_cols<op>(matrix_expression)
>>     - elementwise relational operators: operator<, operator<=,
>> operator>, operator>=, operator==, operator!=
>>     - repmat(mat or vector, row_tiling, col_tiling)
>>     - vector expression: diag(Mat)
>>     - matrix expression: diag(vec)
>> My question is: how do we provide access for the user to OpenCL-specific
>> content that is not (yet) available for the other backends?
>> Another possibility is to postpone this issue until ViennaCL version > 1.5.
>>
>
> After the 1.5.0 release. There's too much other new functionality, so the
> release is already over-due. This gives us more time to design the API
> properly rather than coming up with some quick-fix.


OK :) However, I need these for my research, so I'll make them work for
OpenCL just after the 1.5.0 release :)

>
>
>
>  4 > I want to maintain explicit specifications of the generator (apart
>> from the hard-coded bool-returning C++ function): what operations it
>> supports and what it doesn't. Are you interested? If yes, what
>> format would you prefer?
>>
>
> I'm not sure about what you mean by 'explicit specifications'. Could you
> please elaborate?
>

Hmm, something like a set of all the formal restrictions:

- nested {inner/mat-vec/mat-mat}-products are not allowed
- composite operations are not allowed as LHS or RHS of a matrix-matrix
product node.
- matrix-matrix product kernels can only take the standard GEMM form
...



> Best regards,
> Karli
>
>
Best regards,
Philippe
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
