Hey,



2013/12/18 Karl Rupp <r...@iue.tuwien.ac.at>

> Hi.
>
>
>> A short update: I've implemented linkage to CBLAS and cuBLAS with
>> dynamic selection.
>> If activated through VIENNACL_WITH_CUBLAS, one can go back and forth
>> between cublas and the original backend by doing:
>>
>> A.blas().gemm(NULL);
>> A.blas().gemm(viennacl::backend::blas::cublas_functions<value_type>::gemm);
>>
>> (and similarly for cblas.)
>>
>
> Nice, thanks! I think we can shorten the second call to something like
>  A.blas().gemm(viennacl::backend::cublas);
> for convenience.
>
>
>
>  There is some trickery going on with transpositions and layout, but it
>> works for every transpose/layout combination. One can also link A's blas
>> to one's own gemm function, provided a tiny wrapper is written
>> (essentially to ensure signature compatibility).
>>
>
> Cool!



It is actually interesting to point out that only 4 GEMM kernels are needed
for any implementation: NN, NT, TN, TT. Then, one can use the equivalences
Row-Major+N <=> Col-Major+T and C = AB <=> C^T = B^T.A^T.
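For illustration, here is a rough sketch of this reduction (names assumed,
not what's in the branch): reinterpreting a row-major buffer as
column-major flips the transpose flag, and C = AB <=> C^T = B^T.A^T flips
it back, so a row-major GEMM maps onto a column-major kernel with the
operands swapped, the dimensions swapped, and the original flags kept:

  enum transpose { N, T };

  struct col_major_gemm   // arguments of the canonical column-major kernel
  {
    transpose trans_a, trans_b;
    vcl_size_t m, n;
    float const * a;
    float const * b;
  };

  // C (m x n, row-major) = op(A) * op(B) occupies the same memory as
  // C^T (n x m, col-major) = op(B) * op(A), on the same buffers:
  col_major_gemm row_major_to_col_major(transpose trans_a, transpose trans_b,
                                        vcl_size_t m, vcl_size_t n,
                                        float const * a, float const * b)
  {
    col_major_gemm result = { trans_b, trans_a, n, m, b, a };
    return result;
  }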

>
>
>  The very good news is that this allows ViennaCL to work very well on very
>> recent NVIDIA hardware until our autotuning engine is fully operational.
>> On my laptop, cublasSgemm is about 5 times faster than the current CUDA
>> implementation, and 20% faster than the OpenCL kernel found by the
>> autotuner (120 GFLOPs vs 25 GFLOPs vs 95 GFLOPs). Also, linking with
>> OpenBLAS leads to a HUGE performance boost on the CPU (0.02 GFLOP/s vs
>> 70 GFLOP/s)!
>>
>
> For our native CUDA implementation it's probably only a matter of porting
> the results from the OpenCL tuner over. Unfortunately I don't see a good
> way of doing this with CUDA without a significant penalty on compilation
> times, because there is no concept of runtime kernel selection in CUDA so
> far. The performance difference for GEMM in our CPU backend is not
> surprising; it was never subject to optimization ;-)


That's exactly the point of this feature! Optimizing GEMM for the CPU is
pretty complicated, and linking with external BLAS libraries allows us not
to focus too much on these problems and to just provide a fallback
implementation for the sake of code portability.
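To illustrate the "tiny wrapper" mentioned above, here is a hypothetical
example (the exact signature expected by A.blas().gemm() is an assumption
here, not copied from the branch) that forwards the raw-pointer arguments
to OpenBLAS's cblas_sgemm:

  #include <cblas.h>

  // assumed signature; adapts the arguments to the CBLAS convention
  void my_sgemm(bool trans_a, bool trans_b,
                vcl_size_t m, vcl_size_t n, vcl_size_t k,
                float alpha, float const * A, vcl_size_t lda,
                float const * B, vcl_size_t ldb,
                float beta, float * C, vcl_size_t ldc)
  {
    cblas_sgemm(CblasColMajor,
                trans_a ? CblasTrans : CblasNoTrans,
                trans_b ? CblasTrans : CblasNoTrans,
                m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
  }

  // ... and later:
  A.blas().gemm(my_sgemm);   // link A's gemm to OpenBLAS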

>
>
>
>
>  A little question remains. For now, the behavior is really weird when
>> one defines both VIENNACL_WITH_CBLAS and VIENNACL_WITH_CUBLAS. How to
>> handle this? I am not very familiar with the multiple backends and I
>> don't know to which extent they can be combined. Therefore, I see
>> multiple options, but can't tell which one is better.
>>
>> 1 -> trigger a preprocessor error when both macros are defined together
>> 2 -> slightly modify the API: A.cuda_blas(), A.host_blas(), A.cl_blas()
>>
>> I think that option 2 is better, considering that there are already
>> cuda_handle(), opencl_handle(), cpu_handle() or something similar, if
>> I'm correct. Any advice?
>>
>
> The reason why cuda_handle(), opencl_handle() and cpu_handle() exist
> under different names is that they return different types (i.e. the memory
> buffer). For the BLAS backends I don't want to have different member names,
> because this gets annoying for users. For example, if a user wants to cycle
> through the backends for e.g. benchmark purposes, she would have to write
>
>   if (my_constant == CUDA)
>     A.cuda_blas()...
>   else if (my_constant == HOST)
>     A.host_blas()...
>   else
>     A.cl_blas()...
>

Yes, you're right. However, the types for .blas() are, as of now, different
across the backends. This is because I chose a low-level interface for the
BLAS wrappers; therefore, the signatures of the functions are slightly
different (T const * A, vcl_size_t A_internal_size1, ... versus cl_mem const
A, vcl_size_t A_internal_size1, ...). I can easily change the signature to a
higher-level one (viennacl::matrix<T> A, ...). This is probably better,
right?
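To make the difference concrete (the parameter lists here are illustrative
assumptions, not the actual wrapper signatures), the low-level form differs
per backend, while a matrix-level signature would be identical everywhere:

  // low-level, host backend (raw pointers):
  typedef void (*host_gemm_t)(float const * A, vcl_size_t A_internal_size1,
                              float const * B, vcl_size_t B_internal_size1,
                              float * C, vcl_size_t C_internal_size1);

  // low-level, OpenCL backend (cl_mem handles):
  typedef void (*ocl_gemm_t)(cl_mem A, vcl_size_t A_internal_size1,
                             cl_mem B, vcl_size_t B_internal_size1,
                             cl_mem C, vcl_size_t C_internal_size1);

  // high-level, backend-independent:
  typedef void (*matrix_gemm_t)(viennacl::matrix<float> const & A,
                                viennacl::matrix<float> const & B,
                                viennacl::matrix<float> & C);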

>
> thus making the code longer than necessary. I suggest querying some central
> registry where the backends are registered and then cycling through them:
>
>   SomeListType blas_list = viennacl::blas_implementations_available();
>   for ( it = blas_list.begin(); ... )
>   {
>     A.blas(*it);
>     do_something(A);
>   }
>
> I don't know whether .blas() is the best name for this, because in the
> future we might also have more non-BLAS operations such as sorting or FFT -
> maybe we should use .operations() to better reflect the operations table?
>

Yes, I also thought about it... I'm not sure how to handle the default
case, A.operations().gemm(NULL), but I guess
A.operations().gemm(viennacl::backend::default()) would do, where a proper
overload would set the pointer to NULL internally.
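A minimal sketch of that overload (names assumed; note that a function
literally named default() is not possible since "default" is a C++
keyword, so a tag type, called default_tag here, would be needed instead):

  namespace viennacl { namespace backend {
    struct default_tag {};  // hypothetical tag selecting the built-in kernels
  }}

  template <typename GemmFn>
  class operations_table
  {
  public:
    operations_table() : gemm_(NULL) {}
    void gemm(GemmFn fn) { gemm_ = fn; }   // user-provided implementation
    void gemm(viennacl::backend::default_tag) { gemm_ = NULL; }  // built-in
  private:
    GemmFn gemm_;  // NULL means: fall back to the native ViennaCL kernel
  };

  // usage: A.operations().gemm(viennacl::backend::default_tag());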


> ---
>
> It seems to me that this is going in a very fruitful direction. Any
> objections to pushing and extending this for the 1.6.0 release? 1.5.0 is
> essentially done; I'm currently writing the last bits of documentation and
> resolving some minor warnings on Visual Studio.
>

Yes. This is already pushed in a feature branch; I can try to extend it
to allow for the list implementation you suggested. There are also a couple
of changes to the generator on another feature branch, so I'll have a lot
of stuff to merge :P


> Best regards,
> Karli
>
>
Best regards,
Philippe