Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-03 Thread Charles Determan
Ah, that works. I was thinking you were going to use some form of
auto-tuned CUDA code, as you do with OpenCL.  Calling the cuBLAS
routines is just fine.  Having them 'behind the scenes' sounds good to me :)

Cheers,
Charles

On Mon, Aug 3, 2015 at 8:06 AM, Karl Rupp wrote:

>
> I am glad that I can at least understand why I am seeing this
>> difference.  I absolutely think the CUDA 'port' should be added to
>> ViennaCL.  It certainly may be preferable to some to call the direct
>> cuBLAS routines but I am in favor of trying to find a balance between
>> speed and 'ease-of-use'.  From my point of view, having both optimized
>> OpenCL and CUDA kernels would be a great selling point for ViennaCL.
>>
>
> well, we would actually call the cuBLAS routines internally, so a user
> would never come into contact with it at all. Performance *and*
> ease-of-use, so to speak ;-)
>
> Best regards,
> Karli

Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-03 Thread Karl Rupp

> I am glad that I can at least understand why I am seeing this
> difference.  I absolutely think the CUDA 'port' should be added to
> ViennaCL.  It certainly may be preferable to some to call the direct
> cuBLAS routines but I am in favor of trying to find a balance between
> speed and 'ease-of-use'.  From my point of view, having both optimized
> OpenCL and CUDA kernels would be a great selling point for ViennaCL.

well, we would actually call the cuBLAS routines internally, so a user
would never come into contact with it at all. Performance *and*
ease-of-use, so to speak ;-)

Best regards,
Karli

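The "behind the scenes" idea amounts to a thin dispatch layer: the user-facing product call never changes, and the library decides internally whether to hand the work to a vendor BLAS. A toy sketch in Python, not ViennaCL's actual dispatch code; all names here (`Backend`, `prod`, the `_gemm_*` helpers) are invented for illustration:

```python
from enum import Enum

class Backend(Enum):
    CUDA = "cuda"
    OPENCL = "opencl"

def _gemm_fallback(a, b):
    # Plain triple-loop GEMM, standing in for a generated/tuned kernel.
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            for j in range(m):
                c[i][j] += aip * b[p][j]
    return c

def _gemm_cublas(a, b):
    # Stand-in for a cublasSgemm call; here it just reuses the fallback.
    return _gemm_fallback(a, b)

def prod(a, b, backend=Backend.CUDA):
    # The user always calls prod(); the routing is invisible to them.
    if backend is Backend.CUDA:
        return _gemm_cublas(a, b)   # would dispatch to cuBLAS internally
    return _gemm_fallback(a, b)     # would use the tuned OpenCL kernel
```

The point of the pattern is that swapping the fallback kernel for cuBLAS later changes nothing in user code.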



Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-03 Thread Charles Determan
Thank you Karl,

I am glad that I can at least understand why I am seeing this difference.
I absolutely think the CUDA 'port' should be added to ViennaCL.  It
certainly may be preferable to some to call the direct cuBLAS routines but
I am in favor of trying to find a balance between speed and 'ease-of-use'.
From my point of view, having both optimized OpenCL and CUDA kernels would
be a great selling point for ViennaCL.

Regards,
Charles

On Mon, Aug 3, 2015 at 7:37 AM, Karl Rupp wrote:

> Hi Charles,
>
> > I was benchmarking 4096x4096 matrices (again, with my R bindings).  By
>
>> 'slower' I mean that I am observing OpenCL at this size beating the
>> OpenBLAS CPU implementation by over 2X but the CUDA implementation is
>> nearly 5X slower than the CPU.  This seemed odd to me that the CUDA
>> would be so much slower than the OpenCL, hence my initial thought to
>> invite others to review my code if I am making some sort of silly
>> mistake.  Otherwise I was intending to begin trying to pursue direct
>> cublas methods but I would very much prefer to use ViennaCL.
>>
>
> Okay, in this case what Philippe wrote was just the full answer. Our OpenCL
> kernels are highly GPU-specific and generate a 'good' kernel at runtime. We
> haven't 'ported' these kernels (i.e. done a one-to-one translation from
> OpenCL to CUDA) to the CUDA backend yet, so only a fallback kernel is used
> for the CUDA backend. It should be possible to carry these over without too
> much effort, but in that case it makes more sense to just call the cuBLAS
> routines instead. Adding this for ViennaCL 1.7.1 is certainly possible if
> that is what you would be happy with.
>
> Best regards,
> Karli
>
--
___
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel

Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-03 Thread Karl Rupp
Hi Charles,

> I was benchmarking 4096x4096 matrices (again, with my R bindings).  By
> 'slower' I mean that I am observing OpenCL at this size beating the
> OpenBLAS CPU implementation by over 2X but the CUDA implementation is
> nearly 5X slower than the CPU.  This seemed odd to me that the CUDA
> would be so much slower than the OpenCL, hence my initial thought to
> invite others to review my code if I am making some sort of silly
> mistake.  Otherwise I was intending to begin trying to pursue direct
> cublas methods but I would very much prefer to use ViennaCL.

Okay, in this case what Philippe wrote was just the full answer. Our OpenCL
kernels are highly GPU-specific and generate a 'good' kernel at runtime.
We haven't 'ported' these kernels (i.e. done a one-to-one translation from
OpenCL to CUDA) to the CUDA backend yet, so only a fallback kernel is used
for the CUDA backend. It should be possible to carry these over without too
much effort, but in that case it makes more sense to just call the cuBLAS
routines instead. Adding this for ViennaCL 1.7.1 is certainly possible if
that is what you would be happy with.

Best regards,
Karli





Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-03 Thread Charles Determan
Karl,

I was benchmarking 4096x4096 matrices (again, with my R bindings).  By
'slower' I mean that I am observing OpenCL at this size beating the
OpenBLAS CPU implementation by over 2X but the CUDA implementation is
nearly 5X slower than the CPU.  It seemed odd to me that CUDA would be
so much slower than OpenCL, hence my initial thought to invite others to
review my code in case I am making some sort of silly mistake.
Otherwise I was intending to pursue direct cuBLAS methods, but I would
very much prefer to use ViennaCL.

Regards,
Charles

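When comparing backends like this, the timing harness itself matters: warm-up runs absorb one-time costs such as OpenCL's runtime kernel compilation, and on a real GPU the device must be synchronized before the clock stops. A minimal sketch of such a harness (the names are invented; `work` stands in for any matrix-multiplication call, and a real GPU version would also call e.g. cudaDeviceSynchronize / clFinish inside the timed region):

```python
import time
from statistics import median

def bench(work, warmup=2, reps=5):
    """Time a callable: warm up first, then report the median of reps runs."""
    for _ in range(warmup):
        work()                       # absorbs JIT/kernel-compilation cost
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        work()                       # on a GPU: also synchronize here
        times.append(time.perf_counter() - t0)
    return median(times)             # median is robust to scheduler noise
```

Without the warm-up, an OpenCL backend would be charged for code generation on its first call, skewing the comparison.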
On Sat, Aug 1, 2015 at 3:56 AM, Karl Rupp wrote:

> Hi Charles,
>
> can you please quantify what you mean by 'slower'? How does 'slower'
> change as you increase the problem size? I would not be surprised if you
> see no performance gains below matrices of size 500-by-500. With the extra
> back-and-forth through PCI-Express you may even need matrices of at least
> 1000-by-1000.
>
> Best regards,
> Karli


Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-08-01 Thread Karl Rupp
Hi Charles,

can you please quantify what you mean by 'slower'? How does 'slower' 
change as you increase the problem size? I would not be surprised if you 
see no performance gains below matrices of size 500-by-500. With the 
extra back-and-forth through PCI-Express you may even need matrices of 
at least 1000-by-1000.

Best regards,
Karli

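The size threshold can be illustrated with a back-of-envelope model: an N-by-N single-precision GEMM costs about 2·N³ flops but must move 3·N² floats across PCI-Express, plus a fixed launch/driver overhead. The throughput and overhead figures below are illustrative assumptions, not measurements of any particular card:

```python
def gemm_times(n, gpu_gflops=1000.0, cpu_gflops=300.0,
               pcie_gb_s=8.0, overhead_s=1e-4):
    """Rough GPU-vs-CPU time model for an n-by-n float32 GEMM (assumed rates)."""
    flops = 2.0 * n ** 3                 # multiply-adds in GEMM
    bytes_moved = 3 * n * n * 4          # A, B in and C out, 4 bytes each
    t_gpu = (overhead_s
             + bytes_moved / (pcie_gb_s * 1e9)   # PCIe transfer
             + flops / (gpu_gflops * 1e9))       # device compute
    t_cpu = flops / (cpu_gflops * 1e9)
    return t_gpu, t_cpu

# Transfer scales as n**2 but compute as n**3, so the GPU only pays off
# once the matrices are large enough.
for n in (128, 1024, 4096):
    t_gpu, t_cpu = gemm_times(n)
    print(n, "GPU wins" if t_gpu < t_cpu else "CPU wins")
```

With these assumed numbers the crossover lands in the few-hundred to ~1000 range, consistent with the 500-by-500 to 1000-by-1000 estimate above.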

On 07/31/2015 09:04 PM, Charles Determan wrote:
> Greetings,
>
> Brief background, I am developing a series of R packages to bring
> ViennaCL to the R community.  I have had success with the development of
> my gpuR package (https://github.com/cdeterman/gpuR) which relies on the
> OpenCL backend of ViennaCL (which is housed in the package RViennaCL).
> I am hoping to submit to CRAN in the coming weeks now that the latest
> stable ViennaCL version has just been released.
>
> Naturally, I wanted a companion package for a CUDA backend.  This is now
> the gpuRcuda package (https://github.com/cdeterman/gpuRcuda).  This has
> appeared to work successfully as most of the code is the same.  However,
> my initial benchmarks are showing very dismal performance with the CUDA
> backend.
>
> I was wondering if someone from this list would be willing to have a
> look at my code to see why the CUDA code would be so much worse.  I had
> thought, given I am working with an NVIDIA card (GeForce GTX 970), CUDA would
> provide improved speed but the benchmarks are showing performance at
> least 5-fold slower than the CPU based R multiplication.  Even the
> 'float' type matrix multiplication is slower than R (which only has
> double type support!).
>
> The sgemm CUDA file is
> (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu) and
> the associated C++ file is
> (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp).
>
> One other note: I have tried making the two packages completely independent
> and the performance is still very poor with CUDA.
>
> I really appreciate any help others could provide troubleshooting this.
> I have truly run out of ideas as to why the code has such poor performance.
>
> Regards,
> Charles


Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?

2015-07-31 Thread Philippe Tillet
Hi Charles :)

The BLAS kernels for CUDA and OpenCL are entirely different, actually.
The OpenCL kernels rely on a code generator and have been auto-tuned. As
far as I know, the CUDA kernels have not been auto-tuned and don't rely
on the same generation engine as the OpenCL ones. While for BLAS levels
1 and 2 the difference should not be significant, for GEMM it is
entirely possible to observe a huge difference.

Philippe

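The auto-tuning mentioned above boils down to timing one kernel under many candidate configurations and keeping the fastest for the device at hand. A toy sketch of that loop (the "kernel" here is just chunked summation, and the names are invented; a real tuner such as ViennaCL's OpenCL generator explores tile sizes, work-group shapes, vector widths, and so on):

```python
import time

def kernel(data, chunk):
    """Toy tunable 'kernel': sum a list in blocks of the given chunk size."""
    total = 0
    for i in range(0, len(data), chunk):
        total += sum(data[i:i + chunk])
    return total

def autotune(data, candidates=(64, 256, 1024)):
    """Time the kernel for each candidate configuration, keep the fastest."""
    best_chunk, best_time = None, float("inf")
    for chunk in candidates:
        t0 = time.perf_counter()
        kernel(data, chunk)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_chunk, best_time = chunk, elapsed
    return best_chunk
```

Because the winning configuration depends on the device, a tuned-for-one-GPU kernel ported one-to-one to another backend can easily lose to a vendor library.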
2015-07-31 12:04 GMT-07:00 Charles Determan:

> Greetings,
>
> Brief background, I am developing a series of R packages to bring ViennaCL
> to the R community.  I have had success with the development of my gpuR
> package (https://github.com/cdeterman/gpuR) which relies on the OpenCL
> backend of ViennaCL (which is housed in the package RViennaCL).  I am
> hoping to submit to CRAN in the coming weeks now that the latest stable
> ViennaCL version has just been released.
>
> Naturally, I wanted a companion package for a CUDA backend.  This is now
> the gpuRcuda package (https://github.com/cdeterman/gpuRcuda).  This has
> appeared to work successfully as most of the code is the same.  However, my
> initial benchmarks are showing very dismal performance with the CUDA
> backend.
>
> I was wondering if someone from this list would be willing to have a look
> at my code to see why the CUDA code would be so much worse.  I had thought,
> given I am working with an NVIDIA card (GeForce GTX 970), CUDA would provide improved
> speed but the benchmarks are showing performance at least 5-fold slower
> than the CPU based R multiplication.  Even the 'float' type matrix
> multiplication is slower than R (which only has double type support!).
>
> The sgemm CUDA file is (
> https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu) and
> the associated C++ file is (
> https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp
> ).
>
> One other note: I have tried making the two packages completely independent
> and the performance is still very poor with CUDA.
>
> I really appreciate any help others could provide troubleshooting this.  I
> have truly run out of ideas as to why the code has such poor performance.
>
> Regards,
> Charles