Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?

Philippe Tillet Tue, 17 Dec 2013 23:00:21 -0800

Hi,

A short update: I've implemented linkage to CBLAS and cuBLAS with dynamic
selection.
If activated through VIENNACL_WITH_CUBLAS, one can switch back and forth
between cuBLAS and the original backend with:


A.blas().gemm(NULL);
A.blas().gemm(viennacl::backend::blas::cublas_functions<value_type>::gemm);

(and similarly for cblas.)
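
For readers unfamiliar with this kind of runtime switch, here is a minimal
sketch of how a per-object function-pointer slot with a NULL fallback could
work. It is not the actual ViennaCL code: matrix, blas_slot, gemm_fn,
builtin_gemm, external_gemm and multiply are hypothetical names, and the
simplified square row-major signature is an assumption made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified signature: C = A * B for square row-major n x n matrices.
typedef void (*gemm_fn)(std::size_t n, const float* A, const float* B, float* C);

// Stand-in for the library's own kernel.
inline void builtin_gemm(std::size_t n, const float* A, const float* B, float* C) {
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < n; ++k)
        acc += A[i * n + k] * B[k * n + j];
      C[i * n + j] = acc;
    }
}

// Stand-in for an external BLAS routine; the counter only records that this path ran.
static int external_calls = 0;
inline void external_gemm(std::size_t n, const float* A, const float* B, float* C) {
  ++external_calls;
  builtin_gemm(n, A, B, C);
}

// Holds the optional override. NULL means "use the built-in kernel".
class blas_slot {
public:
  blas_slot() : gemm_(NULL) {}
  void gemm(gemm_fn f) { gemm_ = f; }
  gemm_fn gemm() const { return gemm_; }
private:
  gemm_fn gemm_;
};

class matrix {
public:
  explicit matrix(std::size_t n) : n_(n), data_(n * n, 0.0f) {}
  std::size_t size() const { return n_; }
  float& at(std::size_t i, std::size_t j) { return data_[i * n_ + j]; }
  const float* data() const { return &data_[0]; }
  float* data() { return &data_[0]; }
  blas_slot& blas() { return blas_; }
  const blas_slot& blas() const { return blas_; }
private:
  std::size_t n_;
  std::vector<float> data_;
  blas_slot blas_;
};

// Dispatch on the slot of the left operand, falling back to the built-in kernel.
inline void multiply(const matrix& A, const matrix& B, matrix& C) {
  assert(A.size() == B.size() && A.size() == C.size());
  gemm_fn f = A.blas().gemm();
  if (f == NULL) f = builtin_gemm;
  f(A.size(), A.data(), B.data(), C.data());
}
```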

There is some trickery going on with transpositions and layout, but it
works for every transpose/layout combination. One can also link A's blas to
one's own gemm function, provided a tiny wrapper is supplied (essentially to
ensure signature compatibility).
The very good news is that this allows ViennaCL to work very well on very
recent NVIDIA hardware until our autotuning engine is fully operational.
On my laptop, cublasSgemm is about 5 times faster than the current CUDA
implementation, and 20% faster than the OpenCL kernel found by the
autotuner (120 GFLOP/s vs 25 GFLOP/s vs 95 GFLOP/s). Also, linking with
OpenBLAS leads to a HUGE performance boost on the CPU (0.02 GFLOP/s vs
70 GFLOP/s)...!
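
To illustrate the "tiny wrapper" and the layout trickery mentioned above,
here is a sketch that adapts a column-major, BLAS-style gemm to a row-major
calling convention. It relies on the identity (A*B)^T = B^T * A^T: a
row-major buffer read as column-major is the transposed matrix, so swapping
the operands gives the row-major product without copying. All names
(my_colmajor_gemm, rowmajor_gemm_fn, my_gemm_wrapper) and both signatures
are assumptions for illustration, not the ViennaCL interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical user routine with a BLAS-like column-major signature:
// C = alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n).
inline void my_colmajor_gemm(int m, int n, int k, float alpha,
                             const float* A, int lda,
                             const float* B, int ldb,
                             float beta, float* C, int ldc) {
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p)
        acc += A[i + p * lda] * B[p + j * ldb];
      // As in BLAS, beta == 0 means C is not read on input.
      float old = (beta == 0.0f) ? 0.0f : beta * C[i + j * ldc];
      C[i + j * ldc] = alpha * acc + old;
    }
}

// Hypothetical signature expected by the dispatch slot:
// row-major C = A * B, with A (m x k), B (k x n), C (m x n).
typedef void (*rowmajor_gemm_fn)(std::size_t m, std::size_t n, std::size_t k,
                                 const float* A, const float* B, float* C);

// The wrapper: compute C^T = B^T * A^T in column-major terms, which is
// exactly C = A * B in row-major terms. Operands and dimensions are swapped.
inline void my_gemm_wrapper(std::size_t m, std::size_t n, std::size_t k,
                            const float* A, const float* B, float* C) {
  assert(A != NULL && B != NULL && C != NULL);
  my_colmajor_gemm(static_cast<int>(n), static_cast<int>(m), static_cast<int>(k),
                   1.0f,
                   B, static_cast<int>(n),   // B^T is n x k, leading dimension n
                   A, static_cast<int>(k),   // A^T is k x m, leading dimension k
                   0.0f,
                   C, static_cast<int>(n));  // C^T is n x m, leading dimension n
}
```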

One small question remains. For now, the behavior is really weird when one
defines both VIENNACL_WITH_CBLAS and VIENNACL_WITH_CUBLAS. How should this
be handled? I am not very familiar with the multiple backends and I don't
know to what extent they can be combined. Therefore, I see multiple options,
but can't tell which one is better.

1 -> trigger a preprocessor error when both macros are defined together
2 -> slightly modify the API : A.cuda_blas(), A.host_blas(), A.cl_blas()
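
The two options can be sketched as follows. This is only an illustration
under assumptions: SKETCH_FORBID_BOTH, gemm_fn, blas_slot, memory_domain,
active_gemm and the two fake_* routines are hypothetical names, and only
the accessor names cuda_blas(), host_blas() and cl_blas() come from the
proposal above. Option 1 is the guarded #error; option 2 keeps one slot per
backend and picks the one matching where the data lives.

```cpp
#include <cassert>
#include <cstddef>

// Option 1 (enabled here only through the hypothetical SKETCH_FORBID_BOTH switch):
#if defined(SKETCH_FORBID_BOTH) && defined(VIENNACL_WITH_CBLAS) && defined(VIENNACL_WITH_CUBLAS)
#  error "VIENNACL_WITH_CBLAS and VIENNACL_WITH_CUBLAS cannot be defined together"
#endif

// Option 2: one override slot per backend.
typedef void (*gemm_fn)(std::size_t n, const float* A, const float* B, float* C);

class blas_slot {
public:
  blas_slot() : gemm_(NULL) {}
  void gemm(gemm_fn f) { gemm_ = f; }   // NULL restores the built-in kernel
  gemm_fn gemm() const { return gemm_; }
private:
  gemm_fn gemm_;
};

enum memory_domain { HOST_MEMORY, CUDA_MEMORY, OPENCL_MEMORY };

class matrix {
public:
  explicit matrix(memory_domain d) : domain_(d) {}
  blas_slot& host_blas() { return host_; }
  blas_slot& cuda_blas() { return cuda_; }
  blas_slot& cl_blas()   { return cl_; }

  // Return the override for the backend that currently holds the data,
  // or NULL if none was registered for that backend.
  gemm_fn active_gemm() const {
    switch (domain_) {
      case HOST_MEMORY: return host_.gemm();
      case CUDA_MEMORY: return cuda_.gemm();
      default:          return cl_.gemm();
    }
  }
private:
  memory_domain domain_;
  blas_slot host_, cuda_, cl_;
};

// Empty stand-ins for the external routines; a real build would forward
// to a CBLAS or cuBLAS gemm here.
inline void fake_cblas_gemm(std::size_t, const float* A, const float* B, float* C) {
  assert(A != NULL && B != NULL && C != NULL);
}
inline void fake_cublas_gemm(std::size_t, const float* A, const float* B, float* C) {
  assert(A != NULL && B != NULL && C != NULL);
}
```

With separate slots, defining both macros is no longer ambiguous: a host
matrix only ever consults host_blas(), and a CUDA matrix only cuda_blas().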

I think that option 2 is better, considering that there is already
cuda_handle(), opencl_handle(), cpu_handle() or something similar, if I'm
correct. Any advice?

Best regards,
Philippe


2013/12/15 Philippe Tillet <phil.til...@gmail.com>

> Hi,
>
>
>
>
> 2013/12/15 Karl Rupp <r...@iue.tuwien.ac.at>
>
>> Hi,
>>
>>
>>>     Yeah, it certainly is a bit tedious. Feel free to only do this for
>>>     matrix-matrix multiplications for now, a full operation table is
>>>     presumably too much of a refactoring for ViennaCL 1.x.y, but much
>>>     better suited for 2.0.0.
>>>
>>>
>>> Yes. It's actually a pretty complicated problem, because of the
>>> different signatures of the different BLAS functions... It seems like
>>> the cleanest way to do it would be using std::function<>, and
>>> std::bind<>, which may indeed be widely available at the time ViennaCL
>>> 2.0.0 comes out. I hadn't seen this coming.
>>>
>>
>> The interfacing problem is just a matter of wrapping everything behind a
>> common function interface and then use function pointers appropriately.
>> C++11 is not an option for me for a few more years to come, mostly because
>> this is the usual timeframe on large-scale clusters. (Our test system now
>> includes a CentOS 5.10 machine with GCC 4.1.2...)
>
>
> Yap, sometimes reinventing the wheel makes sense because the car is too
> old :D
>
>
>>
>>> Wouldn't a classic preprocessor directive but with better BLAS support
>>> (as I have it implemented now : cpy, swap, asum, norm2, gemv, gemm) be
>>> more interesting feature-wise than a dynamic gemm only dispatch, in the
>>> end?
>>>
>>
>> How would that look like? Do you mean a classic #ifdef? If right now we
>> are only interested in GEMM, then yes, a simple static dispatch is enough.
>> It just shouldn't start growing if we don't see this as the right way to go
>> in the future.
>
>
> Oh, something like :
>
> #define VIENNACL_WITH_CBLAS
>
> or
>
> #define VIENNACL_WITH_CUDA
> #define VIENNACL_WITH_CUBLAS
>
> which would dispatch cpy, swap, asum, norm2, gemv, gemm (for the other
> one, I think that the temporary saving of ViennaCL is beneficial) for float
> and double, and when the non-leading dimension of a matrix is strided.  I
> can add a set of more specific switches if necessary:
>
> #define VIENNACL_WITH_CUBLAS_GEMV
> #define VIENNACL_WITH_CUBLAS_GEMM
> etc...
>
>
>>
>>
>>> Plus, it seems like the dynamic dispatch will be much more
>>> interesting in the context of ViennaCL 2.0.0 where more things will be
>>> dynamic, with possibly already kernel-dispatch for the generator based
>>> on the input sizes (I'm thinking about it)...
>>>
>>
>> Absolutely. I think it's important to have directions for the future
>> (being more dynamic is apparently one of them), but from the 1.5.0 delay I
>> have learned the hard way that one should not start too many changes at the
>> same time... ;-)
>>
>>
> Well, yes, I had the same problems on a couple of projects... However
> kernel generation should be the main topic of my internship and my
> (hopefully) Ph.D, so I hope I'll have time for these things!
>
>> Best regards,
>> Karli
>>
>>
> Best regards,
> Philippe
>

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
