Hi,

> I prefer option 3. This would allow for something like :
>
> if(size(x)>1e5 && stride==1 && start==0){

Here we also need to check that internal_size fits the vector width.

>
>   //The following steps are costly for small vectors
>   NumericT cpu_alpha = alpha; //copy back to host when the scalar is on
> global device memory
>   if(alpha_flip) cpu_alpha *= -1;
>   if(reciprocal) cpu_alpha = 1/cpu_alpha;
>   //... same for beta
>
>   //Optimized routines
>   if(external_blas)
>     call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
>   else
>     generate_execute(x = cpu_alpha*y + cpu_beta*z);
> }
> else{
>   //fallback
> }
>
> This way, we generate at most two kernels: one for small vectors,
> designed to optimize latency, and one for big vectors, designed to
> optimize bandwidth. Are we converging? :)
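The host-side part of the quoted pseudocode can be sketched as plain C++; the names NumericT, alpha_flip, reciprocal and the 1e5 threshold mirror the snippet above and are assumptions, not actual ViennaCL identifiers:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the host-side scalar preparation from the pseudocode above.
// If alpha lives in global device memory, the assignment below would be a
// (costly) read-back to the host first.
template <typename NumericT>
NumericT prepare_host_scalar(NumericT alpha, bool alpha_flip, bool reciprocal)
{
  NumericT cpu_alpha = alpha;
  if (alpha_flip)  cpu_alpha *= -1;                      // x = -alpha * y + ...
  if (reciprocal)  cpu_alpha = NumericT(1) / cpu_alpha;  // x = y / alpha + ...
  return cpu_alpha;
}

// Size-based dispatch: large, contiguous vectors go to the
// bandwidth-optimized path, everything else to the latency-optimized
// fallback. Threshold taken from the quoted snippet (1e5).
bool use_fast_path(std::size_t size, std::size_t stride, std::size_t start)
{
  return size > 100000 && stride == 1 && start == 0;
}
```

This keeps all flip/reciprocal handling out of the kernels themselves, which is what allows a single generated kernel per vector-size regime.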

Convergence depends on what is inside generate_execute() ;-) How is the 
problem with alpha and beta residing on the GPU addressed? And what will 
the batch compilation look like? The important point is that for the 
default axpy kernels we really don't want to go through the jit-compiler 
for each of them individually.
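One way to avoid a jit-compile per kernel is to concatenate all default axpy variants into a single program source and build it once. A minimal sketch of that batching step, with illustrative placeholder kernels (the names axpy_cpu_alpha/axpy_gpu_alpha are assumptions, not ViennaCL's generated code):

```cpp
#include <string>
#include <vector>

// Collect the sources of all default axpy variants into one program string,
// so a single clCreateProgramWithSource + clBuildProgram covers the whole
// family instead of one jit-compile per kernel.
std::string batch_axpy_sources()
{
  std::vector<std::string> variants;
  variants.push_back(
    "__kernel void axpy_cpu_alpha(__global float *x, float a,\n"
    "                             __global const float *y)\n"
    "{ size_t i = get_global_id(0); x[i] += a * y[i]; }\n");
  variants.push_back(
    "__kernel void axpy_gpu_alpha(__global float *x, __global const float *a,\n"
    "                             __global const float *y)\n"
    "{ size_t i = get_global_id(0); x[i] += *a * y[i]; }\n");

  std::string program_source;
  for (std::size_t i = 0; i < variants.size(); ++i)
    program_source += variants[i];
  return program_source;
}
```

After the one build, the individual kernels are fetched by name (clCreateKernel), so per-operation dispatch pays no compilation cost.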

Note to self: Collect some numbers on the costs of jit-compilation for 
different OpenCL SDKs.

Best regards,
Karli



_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
