Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?
> I am glad that I can at least understand why I am seeing this difference. I absolutely think the CUDA 'port' should be added to ViennaCL. It may certainly be preferable for some to call the cuBLAS routines directly, but I am in favor of trying to find a balance between speed and ease of use. From my point of view, having both optimized OpenCL and CUDA kernels would be a great selling point for ViennaCL.

Well, we would actually call the cuBLAS routines internally, so a user would not get in touch with them at all. Performance *and* ease of use, so to say ;-)

Best regards,
Karli
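To put the reported ratios in perspective, here is a back-of-envelope sketch of how large the benchmarked problem actually is and what the "OpenCL 2X faster than CPU / CUDA 5X slower than CPU" observations would imply for throughput. The CPU time used below is purely illustrative, not a measurement from the thread:

```python
# Back-of-envelope check: size of a 4096x4096 SGEMM and the throughput
# implied by the ratios reported in this thread. The assumed CPU time
# is hypothetical; only the 2x/5x ratios come from the discussion.

def gemm_flops(n):
    # A dense n x n matrix product costs ~2*n^3 floating-point
    # operations (n multiplies and n-1 adds per output element).
    return 2 * n ** 3

n = 4096
flops = gemm_flops(n)
print(f"SGEMM at n={n}: {flops / 1e9:.1f} GFLOP per multiply")

cpu_time = 1.0  # assumed OpenBLAS wall time in seconds (illustrative)
print(f"CPU:          {flops / cpu_time / 1e9:6.1f} GFLOP/s")
# 'OpenCL 2x faster than CPU' and 'CUDA 5x slower than CPU' imply:
print(f"OpenCL (2x):  {flops / (cpu_time / 2) / 1e9:6.1f} GFLOP/s")
print(f"CUDA (1/5x):  {flops / (cpu_time * 5) / 1e9:6.1f} GFLOP/s")
```

At roughly 137 GFLOP per product, a gap of this size cannot be explained by transfer overhead alone, which is consistent with the fallback-kernel explanation given above.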
Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?
Karl,

I was benchmarking 4096x4096 matrices (again, with my R bindings). By 'slower' I mean that I am observing OpenCL at this size beating the OpenBLAS CPU implementation by over 2X, but the CUDA implementation is nearly 5X slower than the CPU. It seemed odd to me that CUDA would be so much slower than OpenCL, hence my initial thought to invite others to review my code in case I am making some sort of silly mistake. Otherwise I was intending to begin pursuing direct cuBLAS methods, but I would very much prefer to use ViennaCL.

Regards,
Charles

On Sat, Aug 1, 2015 at 3:56 AM, Karl Rupp r...@iue.tuwien.ac.at wrote:
> can you please quantify what you mean by 'slower'? How does 'slower'
> change as you increase the problem size?

--
___
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?
Thank you Karl,

I am glad that I can at least understand why I am seeing this difference. I absolutely think the CUDA 'port' should be added to ViennaCL. It may certainly be preferable for some to call the cuBLAS routines directly, but I am in favor of trying to find a balance between speed and ease of use. From my point of view, having both optimized OpenCL and CUDA kernels would be a great selling point for ViennaCL.

Regards,
Charles

On Mon, Aug 3, 2015 at 7:37 AM, Karl Rupp r...@iue.tuwien.ac.at wrote:
> Hi Charles,
>
> okay, in this case what Philippe wrote was already the full answer. Our
> OpenCL kernels are highly GPU-specific and generate a 'good' kernel at
> runtime. We haven't 'ported' these kernels (i.e. done a one-to-one
> translation from OpenCL to CUDA) to the CUDA backend yet, so only a
> fallback kernel is used for the CUDA backend. It should be possible to
> carry these over without too much effort, but in that case it makes more
> sense to just call the cuBLAS routines instead. Adding this for
> ViennaCL 1.7.1 is certainly possible, if that is what you would be
> happy with.
>
> Best regards,
> Karli
Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?
Ah, that works. I was thinking you were going to use some form of auto-tuned CUDA code as you do with OpenCL. Calling the cuBLAS routines is just fine. Having them 'behind the scenes' sounds good to me :)

Cheers,
Charles

On Mon, Aug 3, 2015 at 8:06 AM, Karl Rupp r...@iue.tuwien.ac.at wrote:
> well, we would actually call the cuBLAS routines internally, so a user
> would not get in touch with them at all. Performance *and* ease of use,
> so to say ;-)
Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?
Hi Charles,

can you please quantify what you mean by 'slower'? How does 'slower' change as you increase the problem size? I would not be surprised if you see no performance gains below matrices of size 500-by-500. With the extra back-and-forth through PCI-Express you may even need matrices of at least 1000-by-1000.

Best regards,
Karli

On 07/31/2015 09:04 PM, Charles Determan wrote:
> Greetings,
>
> Brief background: I am developing a series of R packages to bring
> ViennaCL to the R community. I have had success with the development of
> my gpuR package (https://github.com/cdeterman/gpuR), which relies on the
> OpenCL backend of ViennaCL (housed in the package RViennaCL). I am
> hoping to submit to CRAN in the coming weeks now that the latest stable
> ViennaCL version has just been released.
>
> Naturally, I wanted a companion package for a CUDA backend. This is now
> the gpuRcuda package (https://github.com/cdeterman/gpuRcuda). This
> appeared to work successfully, as most of the code is the same. However,
> my initial benchmarks are showing very dismal performance with the CUDA
> backend. I was wondering if someone from this list would be willing to
> have a look at my code to see why the CUDA code would be so much worse.
>
> I had thought that, given I am working with an NVIDIA card (GeForce GTX
> 970), CUDA would provide improved speed, but the benchmarks are showing
> performance at least 5-fold slower than the CPU-based R multiplication.
> Even the 'float' type matrix multiplication is slower than R (which only
> has double type support!). The sgemm CUDA file is
> https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu and
> the associated C++ file is
> https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp.
>
> On another note, I have tried making the two packages completely
> independent and the performance is still very poor with CUDA. I really
> appreciate any help others could provide troubleshooting this. I have
> truly run out of ideas as to why the code has such poor performance.
>
> Regards,
> Charles
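Karl's size thresholds can be sketched with a rough cost model. All constants below are assumptions for illustration (effective PCIe bandwidth, sustained GEMM rates, fixed per-call overhead), not measurements of any particular card or library:

```python
# Rough model of when an n x n single-precision GEMM on the GPU beats
# the CPU despite the PCI-Express round trip. All constants are
# assumptions, chosen only to illustrate the scaling behavior.

PCIE_GBS = 6e9      # assumed effective PCIe bandwidth, bytes/s
GPU_FLOPS = 100e9   # assumed sustained GPU SGEMM rate, FLOP/s
CPU_FLOPS = 50e9    # assumed sustained CPU SGEMM rate, FLOP/s
OVERHEAD = 1e-3     # assumed fixed launch/driver overhead per call, s

def transfer_time(n):
    # Send A and B to the device and read C back: 3 n*n float buffers.
    return 3 * n * n * 4 / PCIE_GBS

def compute_time(n, flops):
    # ~2*n^3 floating-point operations for a dense n x n product.
    return 2 * n ** 3 / flops

for n in (256, 512, 1024, 2048, 4096):
    gpu = OVERHEAD + transfer_time(n) + compute_time(n, GPU_FLOPS)
    cpu = compute_time(n, CPU_FLOPS)
    winner = "GPU" if gpu < cpu else "CPU"
    print(f"n={n:5d}  gpu={gpu * 1e3:8.2f} ms  "
          f"cpu={cpu * 1e3:8.2f} ms  -> {winner}")
```

Because transfer grows as n^2 and compute as n^3, the fixed overhead and the round trip dominate at small n, which is why the crossover only appears somewhere in the hundreds, matching the 500-to-1000 range suggested above.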
Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?
Hi Charles :)

The BLAS kernels for CUDA and OpenCL are actually entirely different. The OpenCL kernels rely on a code generator and have been auto-tuned. As far as I know, the CUDA kernels have not been auto-tuned and don't rely on the same generation engine as the OpenCL ones. While for BLAS 1 and 2 routines the difference should not be so significant, for GEMM it is entirely possible to observe a huge difference.

Philippe

2015-07-31 12:04 GMT-07:00 Charles Determan cdeterma...@gmail.com:
> I had thought, given an NVIDIA card (GeForce GTX 970), CUDA would
> provide improved speed, but the benchmarks are showing performance at
> least 5-fold slower than the CPU-based R multiplication.
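Philippe's point about auto-tuning can be made concrete with a simple traffic count. The model below is illustrative and does not describe ViennaCL's actual kernels: it only counts global-memory loads for a naive GEMM kernel versus one that stages tile-by-tile blocks in on-chip (local/shared) memory, which is the main trick a tuned kernel exploits:

```python
# Illustrative model (not ViennaCL's actual kernels): global-memory
# loads for a naive GEMM kernel versus a tiled kernel that stages
# blocks of A and B in on-chip memory.

def naive_loads(n):
    # Each of the n*n output elements reads a full row of A and a full
    # column of B from global memory: 2*n loads per element.
    return n * n * 2 * n

def tiled_loads(n, tile):
    # With tile x tile work-groups cooperatively staging blocks of A
    # and B, each thread's global loads drop by a factor of ~tile.
    return n * n * 2 * n // tile

n = 4096
for tile in (1, 8, 16, 32):
    loads = tiled_loads(n, tile)
    print(f"tile={tile:2d}: {loads / 1e9:7.1f} G loads "
          f"({naive_loads(n) / loads:.0f}x less than naive)")
```

An untuned fallback kernel that sits near the naive end of this spectrum is memory-bound, so a 5X-slower-than-CPU result for a 4096x4096 GEMM is plausible even on capable hardware.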