Hi,

 > Now that I'm back to some C++ coding, I want to finish the integration
> of viennacl::op_reduce<>.
> I've noticed a lot of different operator overloads for
> viennacl::scalar_expression<>, with basically different implicit
> conversions to raw scalar. I'm a bit skeptical here :)
> This allows to handle the (imho unpractical) cases such as :
> cpu_scal = inner_prod(x1,x2)*5.0

This is *very* practical. Without implicit conversion, this would
  a) not work at all and instead require
       gpu_scal = inner_prod(x1, x2);
       copy(gpu_scal, cpu_scal);
       cpu_scal *= 5.0;
     Clearly, this would not result in generic code at all...
  b) be less efficient: With the approach above, two reduction kernels 
must run on the GPU before a single value is copied to the host. With 
the implicit conversion, only one reduction kernel runs on the GPU; the 
partially reduced values are then copied to the host (no extra overhead, 
this transfer is purely latency-limited) and the final reduction runs on 
the CPU at negligible cost. While the extra kernel launch hardly matters 
for large vector sizes, it is an issue for sizes between ~10k and ~500k, 
particularly on AMD and Intel accelerators, where the latency is higher.


> I think that such expressions should be forbidden. I think that every
> conversion involving host<->device data movement should be explicit,
> since they trigger a flush of the scheduler's queue. Furthermore, we are
> heading towards multi-devices computations, and these implicit
> conversions will then become even more troublesome : an implicit
> inner_prod<->scalar conversion would then need to sum the results
> obtained for each device...

Hmm, I don't see a reason why this should not work for multi-device 
scenarios...


> Basically, I think that we should forbid any other implicit conversions
> than the viennacl::scalar<T>  <=> T one... Do you agree?

I don't want to give away the benefit of saving one kernel launch for 
reduction operations when the result is needed on the host...



> This would force to rewrite the examples above :
>
> gpu_scal = (vcl_scal1 + vcl_scal2)*5.0;
> cpu_scal = gpu_scal;
>
> Which is I think more explicit and efficient than the previous approach :)

For pure scalar operations there is no efficiency to be gained either 
way. Yes, it is more explicit, but at the same time less convenient. 
Overall, we would trade convenience for ... what? ;-) A simpler 
implementation?

Best regards,
Karli


_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel