2013/12/19 Philippe Tillet <phil.til...@gmail.com>
> Hey,
>
>
>
>
> 2013/12/18 Karl Rupp <r...@iue.tuwien.ac.at>
>
>> Hi.
>>
>>
>> > A short update: I've implemented linkage to CBlas and CuBlas with
>>
>>> dynamic selection.
>>> If activated through VIENNACL_WITH_CUBLAS, one can go back and forth
>>> between cublas and the original backend by doing:
>>>
>>> A.blas().gemm(NULL);
>>> A.blas().gemm(viennacl::backend::blas::cublas_functions<value_type>::gemm);
>>>
>>> (and similarly for cblas.)
>>>
>>
>> Nice, thanks! I think we can shorten the second call to something like
>> A.blas().gemm(viennacl::backend::cublas);
>> for convenience.
>>
>>
>>
>> There is some trickery going on with transpositions and layout, but it
>>> works for every transpose/layout combination. One can also link A's blas
>>> to one's own gemm function, provided a tiny wrapper (essentially to
>>> ensure signature compatibility).
>>>
>>
>> Cool!
>
>
>
> It is actually interesting to point out that only four GEMM kernels are
> needed for any implementation: NN, NT, TN and TT. One can then use the
> equivalences Row-Major+N <=> Col-Major+T and C = AB <=> C^T = B^T A^T.
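> To illustrate, here is a minimal self-contained sketch (not ViennaCL code) of
> that reduction: a row-major NN GEMM is forwarded to a single column-major NN
> kernel via C^T = B^T A^T, because a row-major buffer is bit-for-bit the
> column-major storage of the transposed matrix:

```cpp
#include <cassert>
#include <cstddef>

// Naive column-major GEMM: C (m x n) = A (m x k) * B (k x n), no transposes.
// This is the only kernel we keep; other layout cases are mapped onto it.
void gemm_colmajor_nn(std::size_t m, std::size_t n, std::size_t k,
                      const double* A, std::size_t lda,
                      const double* B, std::size_t ldb,
                      double* C, std::size_t ldc)
{
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t i = 0; i < m; ++i)
    {
      double s = 0;
      for (std::size_t l = 0; l < k; ++l)
        s += A[i + l * lda] * B[l + j * ldb];
      C[i + j * ldc] = s;
    }
}

// Row-major C = A * B via the identity C^T = B^T * A^T: a row-major (m x n)
// buffer with row stride ld is exactly a column-major (n x m) buffer with
// leading dimension ld, so we call the column-major NN kernel with A and B
// swapped and m and n exchanged.
void gemm_rowmajor_nn(std::size_t m, std::size_t n, std::size_t k,
                      const double* A, std::size_t lda,
                      const double* B, std::size_t ldb,
                      double* C, std::size_t ldc)
{
  gemm_colmajor_nn(n, m, k, B, ldb, A, lda, C, ldc);
}
```

> With the same trick applied to the T cases, the four column-major kernels
> NN, NT, TN and TT cover all layout/transpose combinations.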
>
>>
>>
>> A very good piece of news is that this allows ViennaCL to work very well
>>> on very recent NVidia hardware until our autotuning engine is fully
>>> operational. On my laptop, cublasSgemm is about 5 times faster than the
>>> current CUDA implementation and 20% faster than the OpenCL kernel found
>>> by the autotuner (120 GFLOP/s vs 25 GFLOP/s vs 95 GFLOP/s). Also,
>>> linking with OpenBLAS leads to a HUGE performance boost on the CPU
>>> (0.02 GFLOP/s vs 70 GFLOP/s)...!
>>>
>>
>> For our native CUDA implementation it's probably only a matter of porting
>> the results from the OpenCL tuner over. Unfortunately, I don't see a good
>> way of doing this with CUDA without a significant penalty on compilation
>> times, because there is no concept of runtime kernel selection in CUDA so
>> far. The performance difference for GEMM in our CPU backend is not
>> surprising; it was never subject to optimization ;-)
>
>
> That's exactly the point of this feature! Optimizing GEMM for CPUs is
> pretty complicated, and linking with external BLAS libraries allows us not
> to focus too much on these problems and to just provide a fallback
> implementation for the sake of code portability.
>
>>
>>
>>
>>
>> A little question remains. For now, the behavior is really weird when
>>> one defines both VIENNACL_WITH_CBLAS and VIENNACL_WITH_CUBLAS. How should
>>> we handle this? I am not very familiar with the multiple backends, and I
>>> don't know to what extent they can be combined. Therefore, I see
>>> multiple options but can't tell which one is better.
>>>
>>> 1 -> trigger a preprocessor error when both commands are defined together
>>> 2 -> slightly modify the API : A.cuda_blas(), A.host_blas(), A.cl_blas()
>>>
>>> I think that option 2 is better, considering that there is already
>>> cuda_handle(), opencl_handle(), cpu_handle() or something similar, if
>>> I'm correct. Any advice?
>>>
>>
>> The reason why cuda_handle(), opencl_handle() and cpu_handle() exist
>> under different names is that they return different types (i.e. the memory
>> buffer). For the BLAS backends I don't want to have different member names,
>> because this gets annoying for users. For example, if a user wants to cycle
>> through the backends, e.g. for benchmarking purposes, she would have to write
>>
>> if (my_constant == CUDA)
>> A.cuda_blas()...
>> else if (my_constant == HOST)
>> A.host_blas()...
>> else
>> A.cl_blas()...
>>
>
> Yes, you're right. However, the types for .blas() currently differ
> across the backends. This is because I chose a low-level interface for the
> BLAS wrappers, so the signatures of the functions are slightly different
> (T const * A, vcl_size_t A_internal_size1, ... versus cl_mem const A,
> vcl_size_t A_internal_size1, ...). I can easily change the signature to a
> higher-level one (viennacl::matrix<T> A, ...). This is probably better,
> right?
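> For illustration, here is a hedged sketch of what such a higher-level
> signature could look like (all names are hypothetical, not the actual
> ViennaCL types): a small matrix handle hides whether the buffer is a raw
> host pointer or an opaque device handle, so a single function-pointer type
> can serve all backends:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical matrix handle: hides the backend-specific buffer behind a
// void*, alongside padded and logical sizes (column-major convention here).
struct matrix_ref
{
  void*       buffer;         // T* on the host, cl_mem / device ptr elsewhere
  std::size_t internal_size1; // padded leading dimension
  std::size_t size1, size2;   // logical dimensions
};

// One function-pointer type for every backend's gemm implementation.
typedef void (*gemm_fn_t)(const matrix_ref& A, const matrix_ref& B,
                          matrix_ref& C);

// A trivial host implementation of that signature (column-major, NN case):
// each backend casts the void* buffer back to its own handle type.
void host_gemm(const matrix_ref& A, const matrix_ref& B, matrix_ref& C)
{
  const double* a = static_cast<const double*>(A.buffer);
  const double* b = static_cast<const double*>(B.buffer);
  double*       c = static_cast<double*>(C.buffer);
  for (std::size_t j = 0; j < C.size2; ++j)
    for (std::size_t i = 0; i < C.size1; ++i)
    {
      double s = 0;
      for (std::size_t l = 0; l < A.size2; ++l)
        s += a[i + l * A.internal_size1] * b[l + j * B.internal_size1];
      c[i + j * C.internal_size1] = s;
    }
}
```

> An OpenCL or CUDA implementation would store its cl_mem or device pointer
> behind the same void* member, with the backend-specific cast done inside
> each implementation rather than in the signature.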
>
>>
>> thus making the code longer than necessary. I suggest querying some central
>> registry where the backends are registered and then cycling through them:
>>
>> SomeListType blas_list = viennacl::blas_implementations_available();
>> for ( it = blas_list.begin(); ... )
>> {
>> A.blas(*it);
>> do_something(A);
>> }
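>> As a rough self-contained sketch of such a registry (hypothetical names,
>> stubbed gemm pointers; only backends compiled in get registered):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a backend's gemm entry point; returns its name for clarity.
typedef const char* (*gemm_fn_t)();
const char* host_gemm_entry()   { return "host"; }
const char* cublas_gemm_entry() { return "cublas"; }

// Minimal matrix stand-in holding the currently selected implementation.
struct matrix
{
  gemm_fn_t gemm_impl;
  void blas(gemm_fn_t f) { gemm_impl = f; }
};

// Central registry, in the spirit of the proposed
// viennacl::blas_implementations_available(): the host fallback is always
// present; other backends are appended only when compiled in.
std::vector<gemm_fn_t> blas_implementations_available()
{
  std::vector<gemm_fn_t> list;
  list.push_back(host_gemm_entry);
#ifdef VIENNACL_WITH_CUBLAS
  list.push_back(cublas_gemm_entry);
#endif
  return list;
}
```

>> The benchmark loop above then stays the same no matter which backends were
>> enabled at compile time.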
>>
>> I don't know whether .blas() is the best name for this, because in the
>> future we might also have non-BLAS operations such as sorting or FFT -
>> maybe we should use .operations() to better reflect the operations table?
>>
>
> Yes, I also thought about that... I'm not sure how to handle the default
> case, A.operations().gemm(NULL), but I guess
> A.operations().gemm(viennacl::backend::default()) would do, where a proper
> overload would set the pointer to NULL internally.
>
It seems like I'm too tired to write proper sentences tonight. What I mean
is that the argument of A.operations().gemm() should be a function
pointer, but since A.operations().gemm(NULL) is confusing, we should
probably go for an additional overload:
A.operations().gemm(viennacl::backend::default()), shouldn't we?
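A rough sketch of how that overload could look (all names hypothetical; note
that `default` is a C++ keyword, so the real API would need something like a
tag type instead):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tag type standing in for viennacl::backend::default(),
// since `default` itself cannot be used as a function or type name in C++.
struct default_tag {};

typedef void (*gemm_fn_t)(); // stand-in for the real gemm signature

void cublas_gemm_stub() {}   // dummy backend entry point for illustration

// Per-matrix operations table, as in the proposed A.operations() interface.
struct operations_table
{
  gemm_fn_t gemm_impl;
  operations_table() : gemm_impl(NULL) {}

  // Function-pointer overload: install a user- or backend-provided gemm.
  void gemm(gemm_fn_t f) { gemm_impl = f; }

  // Tag overload: fall back to the built-in kernels by resetting to NULL,
  // so users never have to pass a raw NULL themselves.
  void gemm(default_tag) { gemm_impl = NULL; }
};
```

Calling gemm with the tag then simply resets the stored pointer to NULL,
which the dispatcher would treat as "use the built-in kernels".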
>
>
>> ---
>>
>> It seems to me that this is going in a very fruitful direction. Any
>> objections to pushing and extending this for the 1.6.0 release? 1.5.0 is
>> essentially done; I'm currently writing the last bits of documentation and
>> resolving some minor warnings on Visual Studio.
>>
>
> Yes. This is already pushed in a feature branch; I can try to extend it
> to allow for the list implementation you suggested. There are also a couple
> of changes to the generator on another feature branch, so I'll have a lot
> of stuff to merge :P
>
>
>> Best regards,
>> Karli
>>
>>
> Best regards,
> Philippe
>
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel