Hello,

I had not noticed that only the first reduction would be executed in this
case, so my arguments were indeed invalid :)
However, I am now even more worried than before ;)
This assumes that the 2-way reduction will always be the best way to
compute an inner product on any OpenCL device. We want the
reduction-based programs to be device-specific, so these "sometimes
truncated operations" will have to be forwarded somehow to the kernel
generator, and therefore to the expression tree. Does this mean we need
an additional parameter in the statement which basically says "don't
execute the last kernel!"? That would introduce a lot of complexity in
the scheduler and the generator, for too little benefit imho.
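To make the two stages concrete, here is a plain host-side C++ sketch (the
function names are mine, not ViennaCL's, and the loops stand in for the OpenCL
kernels): stage 1 is the kernel that always runs, stage 2 is the "last kernel"
that such a hypothetical flag would suppress, leaving the partial sums for the
next consumer.

```cpp
#include <cstddef>
#include <vector>

// Stage 1: each of num_groups "work groups" accumulates a partial sum of
// x[i] * y[i]. A "truncated" inner product would stop here and hand the
// partial sums onward.
std::vector<double> reduce_stage1(const std::vector<double>& x,
                                  const std::vector<double>& y,
                                  std::size_t num_groups) {
  std::vector<double> partial(num_groups, 0.0);
  for (std::size_t i = 0; i < x.size(); ++i)
    partial[i % num_groups] += x[i] * y[i];
  return partial;
}

// Stage 2: the final kernel that combines the partial sums into one
// scalar -- the step a "don't execute the last kernel" flag would skip.
double reduce_stage2(const std::vector<double>& partial) {
  double sum = 0.0;
  for (double p : partial)
    sum += p;
  return sum;
}
```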

What about input-dependent kernels? For small inputs, where the cost of
the second kernel is not negligible, we would actually be better off
performing the full reduction in one big work group. I think that, for
small vectors, this is also more cache-efficient than the first kernel of
the dual-reduction approach plus a final reduction on the CPU... This would
preserve the benefit of saving one kernel launch, and at the same time
integrate more smoothly within the scheduler/generator framework...
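A minimal sketch of that dispatch, again in plain host-side C++ (the cutoff
value and all names are hypothetical, to be tuned per device, and the loops
stand in for the kernels):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cutoff below which one work group handles the whole
// reduction; a real value would be tuned for each device.
constexpr std::size_t SINGLE_WORKGROUP_LIMIT = 10000;

// Small inputs: one "big work group" reduces everything in a single
// kernel, saving the second launch entirely.
double inner_prod_single_kernel(const std::vector<double>& x,
                                const std::vector<double>& y) {
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    sum += x[i] * y[i];
  return sum;
}

// Large inputs: the usual dual-reduction path -- partial sums per work
// group, then a final combine.
double inner_prod_dual_reduction(const std::vector<double>& x,
                                 const std::vector<double>& y,
                                 std::size_t num_groups) {
  std::vector<double> partial(num_groups, 0.0);
  for (std::size_t i = 0; i < x.size(); ++i)
    partial[i % num_groups] += x[i] * y[i];
  double sum = 0.0;
  for (double p : partial)
    sum += p;
  return sum;
}

// Input-dependent dispatch: pick the strategy from the vector size.
double inner_prod(const std::vector<double>& x,
                  const std::vector<double>& y) {
  if (x.size() < SINGLE_WORKGROUP_LIMIT)
    return inner_prod_single_kernel(x, y);
  return inner_prod_dual_reduction(x, y, 128);
}
```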

Philippe



2013/10/27 Karl Rupp <r...@iue.tuwien.ac.at>

> Hi,
>
>
>
>> Now that I'm back to some C++ coding, I want to finish the integration
>> of viennacl::op_reduce<>.
>> I've noticed a lot of different operator overloads for
>> viennacl::scalar_expression<>, with basically different implicit
>> conversions to raw scalar. I'm a bit skeptical here :)
>> This allows handling the (imho impractical) cases such as:
>> cpu_scal = inner_prod(x1,x2)*5.0
>>
>
> This is *very* practical. Without implicit conversion, this would
>  a) not work at all and require instead
>       gpu_scalar = inner_prod(x1, x2);
>       copy(gpu_scalar, cpu_scalar);
>       cpu_scal *= 5.0;
>     Clearly, this would not result in generic code at all...
>  b) be less efficient: With the above, there are two reductions on the GPU
> required in order to then copy a single value to the host. With the
> implicit conversion, this is just one reduction on the GPU, then copy the
> reduced values (no extra overhead, this is only latency limited) and
> finally run the reduction on the CPU at no significant cost. While the
> extra kernel launch does not really matter for large sizes, it is an issue
> for vector sizes between ~10k and ~500k, particularly for AMD and Intel
> accelerators where the latency is high(er).
>
>
>
>  I think that such expressions should be forbidden. I think that every
>> conversion involving host<->device data movement should be explicit,
>> since it triggers a flush of the scheduler's queue. Furthermore, we are
>> heading towards multi-device computations, and these implicit
>> conversions will then become even more troublesome: an implicit
>> inner_prod<->scalar conversion would then need to sum the results
>> obtained for each device...
>>
>
> Hmm, I don't see a reason why this should not work for multi-device
> scenarios...
>
>
>
>  Basically, I think that we should forbid any implicit conversions
>> other than the viennacl::scalar<T> <=> T one... Do you agree?
>>
>
> I don't want to give away the benefit of saving one kernel launch for
> reduction operations when the result is needed on the host...
>
>
>
>
>  This would force rewriting the examples above:
>>
>> gpu_scal = (vcl_scal1 + vcl_scal2)*5.0;
>> cpu_scal = gpu_scal;
>>
>> This is, I think, more explicit and efficient than the previous approach :)
>>
>
> For pure scalar operations there is no chance of gaining any efficiency
> this way. Yes, it is more explicit, but at the same time less convenient.
> Overall, we would trade convenience for ... what? ;-) Simpler
> implementation?
>
> Best regards,
> Karli
>
>
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
