Hi,

> I was in fact wondering why one passes reciprocal_alpha and flip_sign
> into the kernel. After thinking about it some more, I noticed that this
> permits us to do the corresponding inversion/multiplication within the
> kernel, and therefore avoid some latency penalty / kernel launch
> overhead when the scalar is passed by pointer. That's smart!
>
> On the other hand, modifying the generator to not actually generate a
> specific kernel would be absurd imho. This raises another question,
> then: how could ambm benefit from the auto-tuning environment? I
> propose the following solution:
>
> Check the size of the matrices/vectors. If the computation is dominated
> by the kernel launch time (say, less than 100,000 elements), then we
> use the current ambm kernel. Otherwise, we transfer the scalars to the
> CPU, perform the corresponding a' = +- OP a, b' = +- OP b, and either
> generate the kernel or use a BLAS library. This way, we benefit from
> kernel-launch-time optimization for small data and from high bandwidth
> for large data. Does this sound good?
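For illustration, the proposed size-based dispatch could be sketched as below. This is only a sketch of the heuristic described above, not ViennaCL code; the function and enum names are made up, and the 100,000-element threshold is the value suggested in the quoted text (a real implementation would presumably let the auto-tuner pick it).

```cpp
#include <cstddef>

// Hypothetical dispatch sketch: below the threshold, the fused ambm kernel
// wins because the cost is dominated by kernel launch overhead; above it,
// we pay the scalar transfer once and use a generated kernel or BLAS call
// to get full bandwidth. Names and threshold are illustrative only.
enum class AmbmPath { FusedKernel, GeneratedOrBlas };

inline AmbmPath choose_ambm_path(std::size_t num_elements,
                                 std::size_t threshold = 100000)
{
  return (num_elements < threshold) ? AmbmPath::FusedKernel
                                    : AmbmPath::GeneratedOrBlas;
}
```

A call site would then branch on the returned path, e.g. `choose_ambm_path(v.size())`.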
In terms of execution time, this is probably the best solution. On the other hand, it does not solve the problem of compilation overhead: if we only dispatch into the generator for large data, we still have to generate the respective kernels and go through the OpenCL JIT compiler each time. This compilation overhead is likely to dominate any gains we get from faster execution.

Instead, what about opening up the generator a bit? It would be enough to have some mechanism for batch-generation of axpy-like operations; for all other operations the generator can remain as-is. Another option is to move only the axpy template from the generator over to linalg/opencl/kernels/*, because the generation of these kernels is fairly light-weight. Sure, it is a little bit of code duplication, but it will keep the generator clean.

Another possible improvement is to separate operations on full vectors from operations on ranges and slices. For full vectors we can use the built-in vector types in OpenCL, which allow further optimizations that are not possible with ranges and strides, where we cannot use vector types in general.

What do you think?

Best regards,
Karli

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
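P.S.: To make the "fairly light-weight generation" point concrete, here is a minimal sketch of what a stand-alone axpy-like kernel template in linalg/opencl/kernels/* might look like. The function name, signature, and kernel layout are assumptions for illustration, not ViennaCL's actual API; the generated source is plain string concatenation, so there is essentially no generation cost, only the one-time JIT compile.

```cpp
#include <sstream>
#include <string>

// Hypothetical light-weight generator for an axpy-like OpenCL kernel
// (y += alpha * x) over a full, contiguous vector. A real version would
// also emit variants for ranges/slices with start/stride parameters,
// and could switch to vector types (e.g. float4) for the full-vector case.
inline std::string generate_axpy_kernel(std::string const & numeric_type)
{
  std::ostringstream src;
  src << "__kernel void axpy(__global " << numeric_type << " * y,\n"
      << "                   __global const " << numeric_type << " * x,\n"
      << "                   " << numeric_type << " alpha,\n"
      << "                   unsigned int n)\n"
      << "{\n"
      << "  for (unsigned int i = get_global_id(0); i < n; i += get_global_size(0))\n"
      << "    y[i] += alpha * x[i];\n"
      << "}\n";
  return src.str();
}
```

The returned string would then be handed to the usual OpenCL program build path (clCreateProgramWithSource / clBuildProgram), just like the other kernels in linalg/opencl/kernels/*.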