Hey again,
Some data to back up my claims. I've run the auto-tuner on a few
operations of varying arithmetic intensity. I'll report the best
effective bandwidth found by the auto-tuner (total bytes read and
written divided by kernel time), even though some of the cases are no
longer bandwidth-limited. As a reminder, the HD 5850 peaks at 123GB/s
and my RAM peaks at 21.5GB/s.
===========================
Size: 10⁶
(1) z = x + y
--------------
vector-width = 1 | HD 5850 : 96GB/s
vector-width = 4 | HD 5850 : 96GB/s
vector-width = 1 | Core i7 4770 Intel SDK : 14GB/s
vector-width = 4 | Core i7 4770 Intel SDK : 14GB/s
vector-width = 8 | Core i7 4770 Intel SDK : 14GB/s
(2) z = element_exp(x + y)
--------------
vector-width = 1 | HD 5850 : 86GB/s
vector-width = 4 | HD 5850 : *96*GB/s
vector-width = 1 | Core i7 4770 Intel SDK : 14GB/s
vector-width = 4 | Core i7 4770 Intel SDK : 14GB/s
vector-width = 8 | Core i7 4770 Intel SDK : 14GB/s
(3) z = element_prod(element_exp(x + y), element_cos(x+y))
--------------
vector-width = 1 | HD 5850 : 45GB/s
vector-width = 4 | HD 5850 : *91*GB/s
vector-width = 1 | Core i7 4770 Intel SDK : 9.7GB/s
vector-width = 4 | Core i7 4770 Intel SDK : 12.5GB/s
vector-width = 8 | Core i7 4770 Intel SDK : *14*GB/s
==========================
Conclusion: even the Intel compiler, which has a pretty good reputation
for auto-vectorization, doesn't seem to translate the optimal kernel
into SSE on its own. In the last case, which involves a mixture of
exp() and cos() (not that uncommon an expression), explicit
vectorization doubles the performance on the GPU's vector units, and
using the CPU's AVX2 width brings close to a 50% increase.
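For reference, the vector-width = 4 kernel for (3) looks roughly like
the sketch below (the kernel name and the grid-stride loop are
illustrative, and the generated code also carries offsets and strides,
omitted here). The built-in exp() and cos() apply component-wise to
vector types:

__kernel void elementwise3_w4(__global const float4 *x,
                              __global const float4 *y,
                              __global float4 *z,
                              unsigned int n4)   /* n4 = N/4 */
{
  for (unsigned int i = get_global_id(0); i < n4; i += get_global_size(0))
  {
    float4 t = x[i] + y[i];     /* one 16-byte load per operand */
    z[i] = exp(t) * cos(t);     /* component-wise on float4 */
  }
}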
It looks like we have to do something about it. I suggest that we
auto-tune AXPY-like kernels on compute-intensive element-wise
operations such as (3), and add a mechanism to fall back on a (much
slower) profile when aligned memory transactions are not possible.
I'll run some more benchmarks to see whether vloadn/vstoren behaves
similarly to floatn* when the memory address is aligned. If not, I'd
rather use floatn*, and fall back when the offset is not a multiple
of n.
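Concretely, the two variants to compare would look roughly like the
AXPY sketch below (kernel names are mine; the grid-stride loop and the
assumption that the size is a multiple of 4 are illustrative):

/* (a) vload4/vstore4: valid for any float-aligned address */
__kernel void axpy_vload4(__global const float *x, __global float *y,
                          float a, unsigned int n4)   /* n4 = N/4 */
{
  for (unsigned int i = get_global_id(0); i < n4; i += get_global_size(0))
    vstore4(a * vload4(i, x) + vload4(i, y), i, y);
}

/* (b) float4* reinterpretation: the base pointer must be 16-byte
   aligned and the offset a multiple of 4, otherwise we must fall back */
__kernel void axpy_float4(__global const float4 *x, __global float4 *y,
                          float a, unsigned int n4)   /* n4 = N/4 */
{
  for (unsigned int i = get_global_id(0); i < n4; i += get_global_size(0))
    y[i] = a * x[i] + y[i];
}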
Anyway, apart from this, it is pretty satisfying to see that (3)
performs at least as well as a dedicated expert function coded using
OpenMP + AVX2. I think we can expect the same behavior for more
complicated kernels, such as elaborate row-wise reductions. Vector
types are a real hassle to handle, but doing it properly could really
give the user awesome performance.
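(For the record, the expert baseline is essentially the obvious loop;
a minimal sketch of what such a function looks like, assuming the
compiler vectorizes the libm calls when built with something like
-O3 -fopenmp -mavx2 -ffast-math:)

#include <stddef.h>
#include <math.h>

/* Hypothetical reference for (3): z = exp(x + y) * cos(x + y) */
void elementwise3_ref(const float *x, const float *y, float *z,
                      size_t n)
{
  #pragma omp parallel for simd
  for (size_t i = 0; i < n; ++i) {
    float t = x[i] + y[i];
    z[i] = expf(t) * cosf(t);
  }
}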
Philippe
2014-07-31 18:40 GMT+02:00 Philippe Tillet <[email protected]>:
> Hi,
>
> It's horrible! As soon as I try to introduce vector types in an
> OpenCL template as simple as AXPY, everything starts exploding.
>
> Well, first things first, I probably need to justify why I think that
> we cannot do without double2/float4 in all of our dense kernel
> templates:
> - From my own experience, some element-wise expressions can easily be
> compute-bound. In statistics, it is pretty easy to run into
> complicated element-wise transforms when evaluating a probability
> density function. I've personally had to use SSE on my CPU a couple of
> times to alleviate this problem.
> - Some vendors explicitly state in their optimization guides that
> 16-byte loads result in better bandwidth.
>
> On the other hand, using stride != 1 prevents the use of vectorized
> loads in any kernel (AXPY, GEMM, etc.). We're definitely facing a
> dilemma here, where we have to choose between higher JIT overhead (the
> programs can be cached, however) and potentially higher execution
> time. My belief is that we should provide a fallback program for
> stride != 1, compiled only if strided accesses are actually used.
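> Host-side, the dispatch could then be as simple as the hypothetical
> helper below (the fallback program would only be built the first time
> this returns false; whether offset and size should use the same
> fallback is the question below):
>
> #include <stddef.h>
> #include <stdbool.h>
>
> /* Hypothetical: can we run the float4-based program, or must we fall
>    back to the scalar, stride-aware one? */
> static bool use_vectorized_program(size_t stride, size_t offset,
>                                    size_t size)
> {
>   return stride == 1 && offset % 4 == 0 && size % 4 == 0;
> }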
>
> Note that even this wouldn't solve all our problems. How do we handle
> offsets that are not multiples of 4? How do we handle sizes that are
> not multiples of 4? We could use the same fallback, or provide a
> different optimized kernel:
> http://paste.ubuntu.com/7915787/
> optimized_1 should handle the remaining cases quite well, while
> optimized_0 should be faster because, unlike the vload4-based version,
> it doesn't have to check for alignment and doesn't have to do any
> cleanup. In the case of AXPY, I'd expect optimized_1 to be the better
> option. For GEMM, however, I'd prefer the cleanup to be done in
> separate kernel calls.
>
> Seriously, what a headache!! But discarding vector types for
> everything but GEMM just sounds wrong to me...
>
> Philippe
>