Hi,

> This makes the assumption that the 2-way reduction will always be the
> best way to compute an inner-product on any OpenCL device. We want the
> reduction-based programs to be device-specific, so these "sometimes
> truncated operations" will have to be forwarded somehow to the kernel
> generator, and therefore the expression tree. Does it mean that we need
> an additional parameter in the statement which basically says "don't
> execute the last kernel!". This would introduce a lot of complexity in
> the scheduler and the generator, for too little benefit imho.

You are right, this is indeed a bit tricky. The 'standard' vector kernels
are already prepared for this case: each GPU scalar argument may include
an additional 'mini reduction' that runs before the actual operation is
computed. However, this functionality is currently unused. It was
motivated by operations of the form
  z = inner_prod(u,v) * w;
where the second reduction could be folded into the z <- alpha * w
assignment.
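
To make the idea concrete, here is a minimal host-side sketch in plain C++
that only emulates the two stages on the CPU. It is not the actual ViennaCL
kernel code; the function names partial_inner_prod and
scale_with_mini_reduction are hypothetical and chosen for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stage 1 (emulates the first reduction kernel): each "work group" reduces
// one contiguous chunk of u[i] * v[i] to a single partial sum.
// Hypothetical helper, not part of ViennaCL.
std::vector<double> partial_inner_prod(const std::vector<double>& u,
                                       const std::vector<double>& v,
                                       std::size_t num_groups)
{
  assert(u.size() == v.size() && num_groups > 0);
  std::vector<double> partials(num_groups, 0.0);
  const std::size_t chunk = (u.size() + num_groups - 1) / num_groups;
  for (std::size_t g = 0; g < num_groups; ++g) {
    const std::size_t begin = std::min(g * chunk, u.size());
    const std::size_t end   = std::min(begin + chunk, u.size());
    for (std::size_t i = begin; i < end; ++i)
      partials[g] += u[i] * v[i];
  }
  return partials;
}

// Stage 2 fused into the vector operation: the 'mini reduction' over the
// partial sums yields alpha, which is then used directly in z <- alpha * w.
// In this CPU emulation the mini reduction is computed once up front.
// Hypothetical helper, not part of ViennaCL.
std::vector<double> scale_with_mini_reduction(const std::vector<double>& partials,
                                              const std::vector<double>& w)
{
  const double alpha = std::accumulate(partials.begin(), partials.end(), 0.0);
  std::vector<double> z(w.size());
  for (std::size_t i = 0; i < w.size(); ++i)
    z[i] = alpha * w[i];
  return z;
}
```

Calling scale_with_mini_reduction(partial_inner_prod(u, v, num_groups), w)
then corresponds to z = inner_prod(u,v) * w without a separate second
reduction step.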


> What about input-dependent kernels? For small inputs where the second
> kernel would not be negligible, we would actually be better off
> performing the full reduction computation in one, big, work group. I
> think that, for small vectors, this is also more cache-efficient than the
> first kernel of the dual-reduction approach plus a final reduction on
> the CPU... This would preserve the benefit of saving one kernel launch,
> and at the same time more smoothly integrate within the
> scheduler/generator framework...

Yes, I thought about that already. I think we don't need separate
kernels, only proper kernel calling logic. The tricky part is getting
the 'cross-over' point right, because it depends not only on the
device performance but also on the latency, which is OS-specific.
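
As a rough illustration of such calling logic, the following C++ sketch
picks a strategy from the vector size. Everything in it is an assumption:
the names ReductionTuning, crossover_size and choose_strategy are
hypothetical, and the cost model (one extra launch latency versus a higher
multi-group throughput) is a toy model, not a measured one. In practice the
tuning values would have to be determined per device and per OS.

```cpp
#include <cassert>
#include <cstddef>

enum class ReductionStrategy { SingleWorkGroup, TwoStage };

// Hypothetical tuning values; these would be measured, not hard-coded.
struct ReductionTuning {
  double launch_latency_s;   // cost of one additional kernel launch (seconds)
  double single_group_rate;  // elements per second with one work group
  double multi_group_rate;   // elements per second with many work groups
};

// Toy cost model (an assumption for illustration):
//   t_single(n) = n / r1           (one launch, one work group)
//   t_two(n)    = L + n / rN       (one extra launch, many work groups)
// Both are equal at n* = L / (1/r1 - 1/rN).
std::size_t crossover_size(const ReductionTuning& t)
{
  assert(t.single_group_rate > 0.0);
  assert(t.multi_group_rate > t.single_group_rate);
  const double per_element_gain =
      1.0 / t.single_group_rate - 1.0 / t.multi_group_rate;
  return static_cast<std::size_t>(t.launch_latency_s / per_element_gain);
}

// Below the cross-over size the full reduction runs in one work group;
// at or above it the two-stage reduction is used.
ReductionStrategy choose_strategy(std::size_t n, std::size_t crossover)
{
  return n < crossover ? ReductionStrategy::SingleWorkGroup
                       : ReductionStrategy::TwoStage;
}
```

The same kernels are used in both cases; only the launch configuration and
the number of launches change, which is why the latency term enters the
cross-over estimate.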

Best regards,
Karli


_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
