Hello everybody,

I'm proud to announce that, after about 3 weeks, I have rewritten the OpenCL
code generator from scratch to integrate it fully with

That being said, I've reached the point where I need to ask for your
opinion on (many) further design choices. Sorted by priority:

1 > How to handle padding? For example, the best kernels for a given
operation may use float4, in which case an alignment of 4 is required. For
GEMM, though, the kernel internally uses blocking. Since the iteration over
the blocks is unrolled, I prefer to keep the loop bounds static (known at
OpenCL compile time), so handling padding inside the kernel is not really
an option here. How should we handle this?
Should we have a plethora of kernels optimized for a large number of
block sizes? If yes, how do we choose the block sizes?
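
To make one option for 1> concrete, here is a minimal sketch of padding on
the host side, so that the kernel only ever sees sizes that are multiples of
its block size and its loop bounds can stay static. The name round_up is
hypothetical (not an existing ViennaCL function), and I assume the block
size is itself a multiple of the vector width (e.g. 16 for float4):

```cpp
#include <cstddef>

// Hypothetical helper, not part of ViennaCL:
// smallest multiple of `multiple` that is >= n. Assumes multiple > 0.
inline std::size_t round_up(std::size_t n, std::size_t multiple)
{
  return ((n + multiple - 1) / multiple) * multiple;
}
```

With a block size of 16, a dimension of 1000 would be stored internally as
round_up(1000, 16) = 1008, while 1024 stays untouched. The padded entries
would have to be kept at zero so that they do not change the result.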

2 > For each operation (BLAS1/BLAS2/BLAS3 for now), a practically unlimited
number of kernels can be generated. Designing a proper test suite in such a
situation is a challenging task. I've thought about testing a fixed number
of randomly chosen kernels.
We also have to choose multiple sizes for the test (because of 1>)...
Finally, multiple operations can be packed together (multiple SAXPY,
multiple scalar reduction/inner product, multiple vector reduction/gemv).
If the number of packed operations is too high, the local memory usage
will be too high and the OpenCL kernel may not *compile*. Should we provide
a mechanism to evaluate this upper bound at runtime (doable), or just use a
very conservative value for now? (The OpenCL standard guarantees 16kB of
local memory, and the kernel generator guarantees an upper bound on the
amount of local memory used.) I prefer the second option.
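
A rough sketch of what the conservative option could look like, assuming
each packed operation needs a fixed number of bytes of local memory per
work-group. The names conservative_local_mem and max_packed_operations are
hypothetical, and the budget is simply the 16kB figure quoted above:

```cpp
#include <cstddef>
#include <limits>

// Conservative local-memory budget in bytes (the 16kB quoted above).
const std::size_t conservative_local_mem = 16 * 1024;

// Hypothetical helper, not part of ViennaCL:
// largest number of operations that can be packed into one kernel if each
// of them needs `bytes_per_op` bytes of local memory.
// Returns 0 if even a single operation does not fit into the budget.
inline std::size_t max_packed_operations(std::size_t bytes_per_op,
                                         std::size_t budget = conservative_local_mem)
{
  if (bytes_per_op == 0) // no local memory needed: no limit from this side
    return std::numeric_limits<std::size_t>::max();
  return budget / bytes_per_op;
}
```

For example, a reduction using 1024 bytes of local memory per operation
could be packed at most 16 times. For the runtime variant, the budget could
instead be obtained from clGetDeviceInfo() with CL_DEVICE_LOCAL_MEM_SIZE and
passed as the second argument.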

3 > There are several expression nodes that should be supported only by the
generator for now (even though not yet implemented):
   - reduce<op>(vector_expression)
   - reduce_rows<op>(matrix_expression)
   - reduce_cols<op>(matrix_expression)
   - elementwise relational operators: operator<, operator<=, operator>,
operator>=, operator==, operator!=.
   - repmat(mat or vector, row_tiling, col_tiling)
   - vector expression : diag(Mat)
   - matrix expression : diag(vec)
My question is: how should we give the user access to OpenCL-specific
functionality that is not (yet) available for the other backends?
Another possibility is to defer this issue to ViennaCL versions after 1.5.
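
To make the discussion of these nodes more concrete, here is a host-side
reference sketch of what some of them might compute. It could also serve as
ground truth in the tests. Everything here is an assumption on my side: the
names (ref_matrix, ref_reduce, ref_diag, ref_repmat) are hypothetical, the
matrix is a plain row-major array, and repmat is assumed to tile like the
MATLAB function of the same name:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical dense row-major matrix, only used for this sketch.
struct ref_matrix
{
  std::size_t rows, cols;
  std::vector<float> data;
  ref_matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
  float & operator()(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
  float   operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// reduce<op>(vector_expression): fold all entries with a binary operation.
// Assumes a non-empty vector.
template <typename Op>
float ref_reduce(std::vector<float> const & v, Op op)
{
  float result = v[0];
  for (std::size_t i = 1; i < v.size(); ++i)
    result = op(result, v[i]);
  return result;
}

// diag(vec): square matrix with vec on the main diagonal, zero elsewhere.
inline ref_matrix ref_diag(std::vector<float> const & v)
{
  ref_matrix result(v.size(), v.size());
  for (std::size_t i = 0; i < v.size(); ++i)
    result(i, i) = v[i];
  return result;
}

// diag(Mat): vector holding the main diagonal of the matrix.
inline std::vector<float> ref_diag(ref_matrix const & m)
{
  std::vector<float> result(std::min(m.rows, m.cols));
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = m(i, i);
  return result;
}

// repmat(mat, row_tiling, col_tiling): tile the matrix in both directions.
inline ref_matrix ref_repmat(ref_matrix const & m,
                             std::size_t row_tiling, std::size_t col_tiling)
{
  ref_matrix result(m.rows * row_tiling, m.cols * col_tiling);
  for (std::size_t i = 0; i < result.rows; ++i)
    for (std::size_t j = 0; j < result.cols; ++j)
      result(i, j) = m(i % m.rows, j % m.cols);
  return result;
}
```

With this, ref_reduce(v, std::plus<float>()) is a plain sum, and
ref_diag(ref_diag(v)) gives back v.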

4 > I want to maintain an explicit specification of the generator (apart
from the hard-coded bool-returning C++ function): which operations it
supports and which it doesn't. Are you interested? If yes, what format would you

Best regards,
