Thank you, Karl,

I am glad that I can at least understand why I am seeing this difference.
I absolutely think the CUDA 'port' should be added to ViennaCL.  Some may
well prefer calling the cuBLAS routines directly, but I am in favor of
trying to find a balance between speed and 'ease-of-use'.  From my point
of view, having both optimized OpenCL and CUDA kernels would be a great
selling point for ViennaCL.
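
To make the 'ease-of-use' side concrete, the kind of user-level code I have
in mind looks roughly like the sketch below.  This is only an illustration
written from memory of the ViennaCL API (it is not code from my packages,
and details such as the std::vector<std::vector<float> > copy overload are
my assumption), but the point is that the very same source compiles against
either backend, with only the compile-time switch changing:

    // Build the same source two ways (sketch only, error handling omitted):
    //   nvcc -DVIENNACL_WITH_CUDA   gemm_example.cu             (CUDA backend)
    //   g++  -DVIENNACL_WITH_OPENCL gemm_example.cpp -lOpenCL   (OpenCL backend)
    #include "viennacl/matrix.hpp"
    #include "viennacl/linalg/prod.hpp"

    #include <cstddef>
    #include <vector>

    int main()
    {
      std::size_t const N = 4096;

      // Host data; std::vector<std::vector<T> > is one of the CPU matrix
      // types viennacl::copy() accepts.
      std::vector<std::vector<float> > host_A(N, std::vector<float>(N, 1.0f));
      std::vector<std::vector<float> > host_B(N, std::vector<float>(N, 2.0f));
      std::vector<std::vector<float> > host_C(N, std::vector<float>(N, 0.0f));

      viennacl::matrix<float> A(N, N), B(N, N);
      viennacl::copy(host_A, A);   // host -> device
      viennacl::copy(host_B, B);

      // Dispatches to whichever backend ViennaCL was compiled with.
      viennacl::matrix<float> C = viennacl::linalg::prod(A, B);

      viennacl::copy(C, host_C);   // device -> host
      return 0;
    }

With the OpenCL backend those prod() calls already hit the runtime-generated
kernels; my hope is that the CUDA path can eventually be just as fast
without users having to touch cuBLAS themselves.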

Regards,
Charles

On Mon, Aug 3, 2015 at 7:37 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:

> Hi Charles,
>
>> I was benchmarking 4096x4096 matrices (again, with my R bindings).  By
>> 'slower' I mean that I am observing OpenCL at this size beating the
>> OpenBLAS CPU implementation by over 2X, while the CUDA implementation is
>> nearly 5X slower than the CPU.  It seemed odd to me that CUDA would be so
>> much slower than OpenCL, hence my initial thought to invite others to
>> review my code in case I am making some sort of silly mistake.  Otherwise
>> I was intending to begin pursuing direct cuBLAS calls, but I would very
>> much prefer to use ViennaCL.
>>
>
> okay, in this case what Philippe wrote was already the full answer.  Our
> OpenCL kernels are highly GPU-specific; a 'good' kernel is generated at
> runtime.  We haven't 'ported' these kernels to the CUDA backend yet
> (i.e. done a one-to-one translation from OpenCL to CUDA), so only a
> fallback kernel is used there.  It should be possible to carry these over
> without too much effort, but in that case it makes more sense to just
> call the cuBLAS routines instead.  Adding this for ViennaCL 1.7.1 is
> certainly possible if that is something you would be happy with.
>
> Best regards,
> Karli
>
>
>
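
For reference, 'just call the cuBLAS routines' on the user side would look
roughly like the sketch below.  This is only a minimal illustration against
the cuBLAS v2 API (cublasSgemm on raw device buffers), written from memory
rather than taken from ViennaCL or from my packages, and error checking is
omitted:

    // Single-precision GEMM via cuBLAS: C = alpha * A * B + beta * C.
    // Note that cuBLAS assumes column-major storage.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    #include <cstddef>
    #include <vector>

    int main()
    {
      int const N = 4096;
      std::size_t const bytes = std::size_t(N) * N * sizeof(float);

      std::vector<float> h_A(std::size_t(N) * N, 1.0f);
      std::vector<float> h_B(std::size_t(N) * N, 2.0f);
      std::vector<float> h_C(std::size_t(N) * N, 0.0f);

      float *d_A = 0, *d_B = 0, *d_C = 0;
      cudaMalloc((void**)&d_A, bytes);
      cudaMalloc((void**)&d_B, bytes);
      cudaMalloc((void**)&d_C, bytes);
      cudaMemcpy(d_A, h_A.data(), bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_B, h_B.data(), bytes, cudaMemcpyHostToDevice);

      cublasHandle_t handle;
      cublasCreate(&handle);

      float const alpha = 1.0f;
      float const beta  = 0.0f;
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  N, N, N,
                  &alpha, d_A, N, d_B, N,
                  &beta,  d_C, N);

      cudaMemcpy(h_C.data(), d_C, bytes, cudaMemcpyDeviceToHost);

      cublasDestroy(handle);
      cudaFree(d_A);
      cudaFree(d_B);
      cudaFree(d_C);
      return 0;
    }

If ViennaCL's CUDA gemm were wired to something like this internally, users
would keep the viennacl::linalg::prod() interface while getting cuBLAS
speed, which is exactly the balance I was hoping for.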
>> On Sat, Aug 1, 2015 at 3:56 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:
>>
>>     Hi Charles,
>>
>>     can you please quantify what you mean by 'slower'? How does 'slower'
>>     change as you increase the problem size? I would not be surprised if
>>     you see no performance gains below matrices of size 500-by-500. With
>>     the extra back-and-forth through PCI-Express you may even need
>>     matrices of at least 1000-by-1000.
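
To put rough numbers on that last point (my own back-of-envelope, assuming
a realistic ~10 GB/s over PCI-Express 3.0 x16): a 1000x1000 float matrix is
about 4 MB, so shipping A, B and C across the bus costs on the order of a
millisecond, which is comparable to the roughly 2e9 floating-point
operations of the multiplication itself on a GTX 970.  Below that size the
transfers dominate.  At 4096x4096 the multiply is about 1.4e11 operations
against roughly 200 MB of traffic, so data movement alone should not
explain a 5x slowdown.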
>>
>>     Best regards,
>>     Karli
>>
>>
>>
>>     On 07/31/2015 09:04 PM, Charles Determan wrote:
>>
>>         Greetings,
>>
>>         Brief background, I am developing a series of R packages to
>>         bring ViennaCL to the R community.  I have had success with the
>>         development of my gpuR package (https://github.com/cdeterman/gpuR)
>>         which relies on the OpenCL backend of ViennaCL (which is housed
>>         in the package RViennaCL).  I am hoping to submit to CRAN in the
>>         coming weeks now that the latest stable ViennaCL version has just
>>         been released.
>>
>>         Naturally, I wanted a companion package for a CUDA backend.
>>         This is now the gpuRcuda package
>>         (https://github.com/cdeterman/gpuRcuda).  This has appeared to
>>         work successfully as most of the code is the same.  However, my
>>         initial benchmarks are showing very dismal performance with the
>>         CUDA backend.
>>
>>         I was wondering if someone from this list would be willing to
>>         have a look at my code to see why the CUDA code would be so much
>>         worse.  I had thought that, given I am working with an NVIDIA
>>         card (GeForce GTX 970), CUDA would provide improved speed, but
>>         the benchmarks are showing performance at least 5-fold slower
>>         than the CPU-based R multiplication.  Even the 'float' type
>>         matrix multiplication is slower than R (which only has double
>>         type support!).
>>
>>         The sgemm CUDA file is
>>         https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu
>>         and the associated C++ file is
>>         https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp
>>
>>         One other note: I have tried making the two packages completely
>>         independent, and the performance is still very poor with CUDA.
>>
>>         I really appreciate any help others could provide
>>         troubleshooting this.
>>         I have truly run out of ideas as to why the code has such poor
>>         performance.
>>
>>         Regards,
>>         Charles
>>
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
