Hey,

> So I expect ViennaCL 1.6 to offer some really good performance on CPUs
> with the OpenCL backend -- possibly 80% of OpenBLAS / MKL on a Core i7
> 4770, for example. As the OpenCL kernel generator and the auto-tuner
> will get better, we can hope for further improvements.
>
> This will create a huge gap with the fallback OpenMP version, which
> hardly reaches 0.5 GFLOP/s. What would you think about extracting the
> assembly output of the Intel OpenCL compiler? I'm not familiar *at all*
> with assembly code. How would we handle multi-threading in such a setting?

I have a bunch of simple improvements for OpenMP in the 1.6.0 release in 
mind, which should boost the performance to about 10-20 GFLOP/s. That's 
still not top-notch, but already much more usable than what we have.
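
To illustrate the kind of simple change that can help an OpenMP GEMM, here
is a minimal sketch of loop tiling (cache blocking) with one parallel loop
over row tiles. This is an assumption made for illustration: the concrete
improvements planned for 1.6.0 are not spelled out above. The function
name blocked_gemm and the tile size of 64 are hypothetical and not part of
ViennaCL.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not ViennaCL API): C += A * B for row-major
// matrices, with A of size M x K, B of size K x N and C of size M x N.
void blocked_gemm(std::size_t M, std::size_t N, std::size_t K,
                  const std::vector<double>& A,
                  const std::vector<double>& B,
                  std::vector<double>& C)
{
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);

  const long block = 64;  // tile edge length; assumed value, needs tuning
  const long m = static_cast<long>(M);
  const long n = static_cast<long>(N);
  const long k = static_cast<long>(K);
  const double* a = A.data();
  const double* b = B.data();
  double*       c = C.data();

  // Each thread owns whole row tiles of C, so no two threads write to the
  // same entry. Without OpenMP enabled the pragma is ignored and the loop
  // simply runs serially.
  #pragma omp parallel for schedule(static)
  for (long ii = 0; ii < m; ii += block)
    for (long kk = 0; kk < k; kk += block)
      for (long jj = 0; jj < n; jj += block)
      {
        const long i_end = std::min(ii + block, m);
        const long k_end = std::min(kk + block, k);
        const long j_end = std::min(jj + block, n);

        for (long i = ii; i < i_end; ++i)
          for (long p = kk; p < k_end; ++p)
          {
            const double a_ip = a[i * k + p];
            // Unit-stride inner loop over a row of B and a row of C.
            for (long j = jj; j < j_end; ++j)
              c[i * n + j] += a_ip * b[p * n + j];
          }
      }
}
```

Tiling keeps the working set of the inner loops in cache, and the
unit-stride inner loop gives the compiler a chance to vectorize. It stays
plain C++ plus one pragma, so it fits a header-only library.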

This week the NA-Digest mailing list carried an announcement of this
course on optimizing GEMM:
  https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html
I'll extract some ideas from there, but ultimately I don't want to
reimplement BLIS; I'd rather rely on external BLAS implementations.
Inline assembly is pretty painful and leads to quite a number of
compatibility and portability problems across different compilers, so
I don't want to use it as long as ViennaCL uses a header-only approach.
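
As a rough sketch of what relying on an external BLAS can look like in a
header-only setting, the function below forwards to the standard CBLAS
routine cblas_dgemm when a compile-time macro is defined, and otherwise
falls back to a plain portable loop. The macro EXAMPLE_WITH_CBLAS and the
function dense_gemm are hypothetical names chosen for this example; they
are not ViennaCL's actual configuration switches or API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef EXAMPLE_WITH_CBLAS   // hypothetical macro, set by the user's build
#include <cblas.h>          // provided by OpenBLAS, MKL, ATLAS, ...
#endif

// Hypothetical helper: C = A * B for row-major matrices, with A of size
// M x K, B of size K x N and C of size M x N.
inline void dense_gemm(std::size_t M, std::size_t N, std::size_t K,
                       const std::vector<double>& A,
                       const std::vector<double>& B,
                       std::vector<double>& C)
{
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);

#ifdef EXAMPLE_WITH_CBLAS
  // Hand the work to the tuned external library (alpha = 1, beta = 0).
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              static_cast<int>(M), static_cast<int>(N), static_cast<int>(K),
              1.0, A.data(), static_cast<int>(K),
              B.data(), static_cast<int>(N),
              0.0, C.data(), static_cast<int>(N));
#else
  // Portable fallback: no assembly, no compiler-specific extensions.
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
    {
      double sum = 0.0;
      for (std::size_t p = 0; p < K; ++p)
        sum += A[i * K + p] * B[p * N + j];
      C[i * N + j] = sum;
    }
#endif
}
```

With this split, the hand-tuned kernels live in the external library, and
the header itself stays free of inline assembly.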

Best regards,
Karli


_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel