Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-03 Thread Karl Rupp

 I am glad that I can at least understand why I am seeing this
 difference.  I absolutely think the CUDA 'port' should be added to
 ViennaCL.  Some may certainly prefer to call the cuBLAS routines
 directly, but I am in favor of trying to find a balance between speed
 and 'ease-of-use'.  From my point of view, having both optimized
 OpenCL and CUDA kernels would be a great selling point for ViennaCL.

well, we would actually call the cuBLAS routines internally, so a user
would not come into contact with them at all. Performance *and*
ease-of-use, so to speak ;-)

Best regards,
Karli
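To make concrete what calling the cuBLAS routines internally involves: cuBLAS expects column-major storage, explicit leading dimensions, and alpha/beta scaling, i.e. C = alpha*A*B + beta*C. Below is a plain-Python stand-in that mirrors the core cublasSgemm parameter order for the no-transpose case (a sketch for illustration only; the function name is made up, the handle and transpose flags are omitted, and no GPU is required):

```python
def sgemm_colmajor(m, n, k, alpha, A, lda, B, ldb, beta, C, ldc):
    """CPU stand-in mirroring the cublasSgemm parameter order
    (no-transpose case): C = alpha*A*B + beta*C, with all matrices
    stored column-major in flat lists and explicit leading dimensions."""
    for col in range(n):
        for row in range(m):
            acc = 0.0
            for i in range(k):
                acc += A[row + i * lda] * B[i + col * ldb]
            C[row + col * ldc] = alpha * acc + beta * C[row + col * ldc]

# 2x2 check: A = [[1, 2], [3, 4]] stored column-major, B = identity.
A = [1.0, 3.0, 2.0, 4.0]   # columns (1, 3) and (2, 4)
B = [1.0, 0.0, 0.0, 1.0]   # identity, column-major
C = [0.0] * 4
sgemm_colmajor(2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2)
assert C == A              # multiplying by the identity returns A
```

The same layout bookkeeping (lda/ldb/ldc against the library's internal padding) is what an internal cuBLAS dispatch would have to get right.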





Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-03 Thread Charles Determan
Karl,

I was benchmarking 4096x4096 matrices (again, with my R bindings).  By
'slower' I mean that at this size OpenCL beats the OpenBLAS CPU
implementation by over 2X, while the CUDA implementation is nearly 5X
slower than the CPU.  It seemed odd to me that CUDA would be so much
slower than OpenCL, hence my initial thought to invite others to review
my code in case I am making some sort of silly mistake.  Otherwise I was
intending to begin pursuing direct cuBLAS methods, but I would very much
prefer to use ViennaCL.

Regards,
Charles
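For anyone reproducing these numbers: GPU timings are easy to distort, since the first call typically pays one-off context setup and kernel-compilation costs, and kernel launches are asynchronous. Here is a hedged sketch of a timing harness that discards warm-up runs (pure Python, with a toy CPU matmul standing in for the GPU call; with a real backend you would also synchronize the device before stopping the clock):

```python
import time

def bench(fn, *args, warmup=2, reps=5):
    """Median wall-clock time of fn(*args), discarding warm-up runs."""
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

def matmul(A, B):
    """Toy n-by-n matrix product standing in for the GPU call."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 32
M = [[1.0] * n for _ in range(n)]
t = bench(matmul, M, M)
assert t >= 0.0
```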

On Sat, Aug 1, 2015 at 3:56 AM, Karl Rupp r...@iue.tuwien.ac.at wrote:

 Hi Charles,

 can you please quantify what you mean by 'slower'? How does 'slower'
 change as you increase the problem size? I would not be surprised if you
 see no performance gains below matrices of size 500-by-500. With the extra
 back-and-forth through PCI-Express you may even need matrices of at least
 1000-by-1000.

 Best regards,
 Karli






--
___
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel


Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-03 Thread Charles Determan
Thank you Karl,

I am glad that I can at least understand why I am seeing this difference.
I absolutely think the CUDA 'port' should be added to ViennaCL.  Some may
certainly prefer to call the cuBLAS routines directly, but I am in favor
of trying to find a balance between speed and 'ease-of-use'.  From my
point of view, having both optimized OpenCL and CUDA kernels would be a
great selling point for ViennaCL.

Regards,
Charles

On Mon, Aug 3, 2015 at 7:37 AM, Karl Rupp r...@iue.tuwien.ac.at wrote:

 Hi Charles,


 Okay, in this case what Philippe said is the full answer. Our OpenCL
 kernels are highly GPU-specific and generate a 'good' kernel at runtime.
 We haven't 'ported' (i.e., done a one-to-one translation from OpenCL to
 CUDA) these kernels to the CUDA backend yet, so only a fallback kernel
 is used for the CUDA backend. It should be possible to carry these over
 without too much effort, but in that case it makes more sense to just
 call the cuBLAS routines instead. Adding this for ViennaCL 1.7.1 is
 certainly possible if that is what you would be happy with.

 Best regards,
 Karli





Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-03 Thread Charles Determan
Ah, that works.  I was thinking you were going to use some form of
auto-tuned CUDA code, as you do with OpenCL.  Calling the cuBLAS routines
is just fine, and having them 'behind the scenes' sounds good to me :)

Cheers,
Charles

On Mon, Aug 3, 2015 at 8:06 AM, Karl Rupp r...@iue.tuwien.ac.at wrote:




 well, we would actually call the cuBLAS routines internally, so a user
 would not come into contact with them at all. Performance *and*
 ease-of-use, so to speak ;-)

 Best regards,
 Karli




Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-01 Thread Karl Rupp
Hi Charles,

can you please quantify what you mean by 'slower'? How does 'slower' 
change as you increase the problem size? I would not be surprised if you 
see no performance gains below matrices of size 500-by-500. With the 
extra back-and-forth through PCI-Express you may even need matrices of 
at least 1000-by-1000.

Best regards,
Karli
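Karl's size thresholds can be motivated with a back-of-the-envelope model: an n-by-n GEMM performs about 2n^3 floating-point operations but only moves 3n^2 elements across PCI-Express, so compute grows one power of n faster than transfer. The throughput and bandwidth figures below are illustrative assumptions, not measurements of any particular card:

```python
def gemm_flops(n):
    # an n-by-n GEMM performs roughly 2*n^3 floating-point operations
    return 2.0 * n ** 3

def transfer_bytes(n, bytes_per_elem=4):
    # two inputs down, one result back: 3 matrices of n*n elements
    return 3.0 * n * n * bytes_per_elem

# Illustrative assumptions, not measurements: ~1 TFLOP/s sustained
# single-precision GEMM on the GPU, ~8 GB/s effective PCIe bandwidth.
GPU_FLOPS = 1.0e12
PCIE_BW = 8.0e9

def ratio(n):
    """Compute time divided by transfer time; transfers dominate below 1."""
    return (gemm_flops(n) / GPU_FLOPS) / (transfer_bytes(n) / PCIE_BW)

# The ratio grows linearly with n, which is why small matrices cannot
# amortize the round trip while 1000-by-1000 and larger ones can.
assert ratio(500) < ratio(1000) < ratio(4096)
```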


On 07/31/2015 09:04 PM, Charles Determan wrote:
 Greetings,

 Brief background, I am developing a series of R packages to bring
 ViennaCL to the R community.  I have had success with the development of
 my gpuR package (https://github.com/cdeterman/gpuR) which relies on the
 OpenCL backend of ViennaCL (which is housed in the package RViennaCL).
 I am hoping to submit to CRAN in the coming weeks now that the latest
 stable ViennaCL version has just been released.

 Naturally, I wanted a companion package for a CUDA backend.  This is now
 the gpuRcuda package (https://github.com/cdeterman/gpuRcuda).  This has
 appeared to work successfully as most of the code is the same.  However,
 my initial benchmarks are showing very dismal performance with the CUDA
 backend.

 I was wondering if someone from this list would be willing to have a
 look at my code to see why the CUDA code would be so much worse.  I had
 thought that, given I am working with an NVIDIA card (GeForce GTX 970),
 CUDA would provide improved speed, but the benchmarks are showing
 performance at least 5-fold slower than the CPU-based R multiplication.
 Even the 'float' type matrix multiplication is slower than R (which only
 has double type support!).

 The sgemm CUDA file is
 (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu) and
 the associated C++ file is
 (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp).

 On another note, I have tried making the two packages completely
 independent and the performance is still very poor with CUDA.

 I really appreciate any help others could provide troubleshooting this.
 I have truly run out of ideas as to why the code has such poor performance.

 Regards,
 Charles




Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-07-31 Thread Philippe Tillet
Hi Charles :)

The BLAS kernels for CUDA and OpenCL are entirely different, actually.
The OpenCL kernels rely on a code generator and have been auto-tuned. As
far as I know, the CUDA kernels have not been auto-tuned and don't rely
on the same generation engine as the OpenCL ones. While for BLAS levels
1-2 the difference should not be very significant, for GEMM it is
entirely possible to observe a huge difference.

Philippe
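Philippe's point, that GEMM performance hinges on how the loops are scheduled rather than on the arithmetic itself, can be illustrated with a toy CPU example. This is only a sketch, not ViennaCL's actual generator: both functions compute the same product, and only the loop organization (tiling) differs, which is exactly the kind of parameter an auto-tuner picks per device.

```python
import random

def gemm_naive(A, B, n):
    """Textbook triple loop, analogous to an untuned fallback kernel."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def gemm_blocked(A, B, n, bs=4):
    """Same arithmetic reorganized into bs-by-bs tiles. On real hardware
    tiling improves cache or shared-memory reuse; here it only shows that
    the schedule, not the math, is what gets tuned."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C

random.seed(0)
n = 8
A = [[random.random() for _ in range(n)] for _ in range(n)]
B = [[random.random() for _ in range(n)] for _ in range(n)]
C1, C2 = gemm_naive(A, B, n), gemm_blocked(A, B, n)
assert all(abs(C1[i][j] - C2[i][j]) < 1e-9
           for i in range(n) for j in range(n))
```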
