Hey,

for the record: Philippe and I had a chat about this on IRC some time 
back. To sum up, there are a few remedies/ideas on the table to tackle 
the problems:

* Split the operations on {full vectors and suitably ranged vectors} 
from the general strided case (see the first sketch below). This would 
result in more OpenCL programs to manage, but most users probably don't 
need strided vector operations anyway and thus won't be hit by this.

* For memory-bandwidth-limited operations, use shared (local) memory to 
deal with strided accesses: For example, to operate on the indices 0, 
3, 6, 9 of a vector, load these elements into shared memory at 
locations 0, 1, 2, 3 and then feed the data to the compute units using 
e.g. a float4 load from this shared memory location (see the second 
sketch below).

* The shared-memory trick won't necessarily work well with GEMM. It is 
presumably okay to provide a fallback kernel here and advise users to 
copy the strided submatrices to full (temporary) matrices if the 
operation is performance critical. Sounds like a good compromise to me, 
but maybe Philippe is more ambitious ;-)
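
To make the first point a bit more concrete, here is a rough sketch of 
what the two generated flavors of an AXPY could look like (kernel names 
and signatures are hypothetical, this is not the code the generator 
actually emits): a vectorized program for contiguous vectors whose size 
is a multiple of 4, and a plain scalar fallback for the general strided 
case.

// Sketch only: hypothetical kernels, not ViennaCL's generated code.
// Fast path: contiguous data, size assumed to be a multiple of 4.
__kernel void axpy_contiguous(__global       float4 *y,
                              __global const float4 *x,
                              float alpha,
                              unsigned int size4)   // = size / 4
{
  for (unsigned int i = get_global_id(0); i < size4; i += get_global_size(0))
    y[i] += alpha * x[i];
}

// Fallback: arbitrary offsets and strides, scalar accesses only.
__kernel void axpy_strided(__global float *y,
                           unsigned int y_off, unsigned int y_inc,
                           __global const float *x,
                           unsigned int x_off, unsigned int x_inc,
                           float alpha,
                           unsigned int size)
{
  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0))
    y[y_off + i * y_inc] += alpha * x[x_off + i * x_inc];
}

The fast path would only be picked at dispatch time if size, offsets 
and strides actually satisfy the requirements; everything else goes 
through the fallback.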
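And here is a rough sketch of the shared-memory idea from the second 
point (again hypothetical, just to illustrate the mechanism): each 
work-group first gathers its strided elements into local memory, then 
processes them with vectorized accesses.

// Sketch only: gather strided elements into local memory, operate via float4.
// Assumes WG_SIZE is a multiple of 4 and size is a multiple of WG_SIZE.
#define WG_SIZE 128

__kernel void scale_strided_gather(__global float *x,
                                   unsigned int x_off, unsigned int x_inc,
                                   float alpha,
                                   unsigned int size)
{
  __local float4 buf4[WG_SIZE / 4];             // float4-aligned local buffer
  __local float *buf = (__local float *)buf4;   // scalar view for the gather
  unsigned int lid = get_local_id(0);

  for (unsigned int base = get_group_id(0) * WG_SIZE; base < size;
       base += get_num_groups(0) * WG_SIZE)
  {
    // strided gather: global indices 0, inc, 2*inc, ... land at 0, 1, 2, ... locally
    buf[lid] = x[x_off + (base + lid) * x_inc];
    barrier(CLK_LOCAL_MEM_FENCE);

    // vectorized processing from local memory
    if (lid < WG_SIZE / 4)
      buf4[lid] *= alpha;
    barrier(CLK_LOCAL_MEM_FENCE);

    // strided write-back
    x[x_off + (base + lid) * x_inc] = buf[lid];
  }
}

Whether this actually pays off depends on the device; the global loads 
and stores are still strided, so the benefit is mainly on the compute 
side (and on devices that favor 16-byte accesses).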

I guess that Philippe will write some more on this when he's reasonably 
satisfied with his approach :-)

Best regards,
Karli



On 07/31/2014 06:40 PM, Philippe Tillet wrote:
> Hi,
>
> It's horrible! As soon as I want to introduce some vectorized types in
> an OpenCL template as simple as AXPY, everything starts exploding.
>
> Well, first things first, I probably need to justify why I think that we
> cannot do without double2/float4 in all of our dense kernel templates:
> - From my own experience, it turns out that some element-wise
> expressions can easily be compute-bound. In statistics it is pretty
> easy to encounter complicated element-wise transforms when evaluating a
> probability density function. I've personally had to use SSE on my CPU a
> couple of times to alleviate this problem.
> - Some vendors explicitly state in their optimization guides that loads
> of 16 bytes result in better bandwidth.
>
> On the other hand, using stride!=1 prevents the use of vectorized
> loads in any kernel (AXPY, GEMM, etc.). We're definitely facing a
> dilemma here, where we have to choose between higher JIT overhead (the
> programs can be cached, however) and potentially higher execution time.
> My belief is that we should provide a fallback program for stride!=1,
> which would be compiled only if strided accesses are actually used.
>
> Note that even this wouldn't solve all our problems. How do we handle
> offsets that are not multiples of 4? How do we handle sizes that are
> not multiples of 4? We could use the same fallback, or provide a
> different optimized kernel.
> http://paste.ubuntu.com/7915787/
> optimized_1 should be able to handle the remaining cases quite well,
> while optimized_0 should be faster because, unlike vload4, it doesn't
> have to check for alignment and doesn't have to do any clean-up.
> In the case of AXPY, I'd expect optimized_1 to be the better option. For
> GEMM, however, I'd prefer the "cleanup" to be done in some other kernel
> calls.
>
> Seriously, what a headache!! But discarding vector types for everything
> but GEMM just sounds wrong to me...
>
> Philippe
>
>


_______________________________________________
ViennaCL-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
