Hey,

> So I expect ViennaCL 1.6 to offer some really good performance on CPUs
> with the OpenCL backend -- possibly 80% of OpenBLAS / MKL on a Core i7
> 4770, for example. As the OpenCL kernel generator and the auto-tuner
> will get better, we can hope for further improvements.
>
> This will create a huge gap with the fallback OpenMP version, which
> hardly reaches 0.5 GFLOP/s. What would you think about extracting the
> assembly output of the Intel OpenCL compiler? I'm not familiar *at all*
> with assembly code. How would we handle multi-threading in such a setting?

I have a bunch of simple improvements for OpenMP in mind for the 1.6.0 release, which should boost the performance to about 10-20 GFLOP/s. That's still not top-notch, but already much more usable than what we have.

This week there was an announcement on the NA-Digest mailing list regarding this course on optimizing GEMM:
https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html
I'll extract some ideas from there, but ultimately I don't want to reimplement BLIS; I'd rather rely on external BLAS implementations.

Using inline assembly is pretty painful and leads to quite a number of compatibility and portability problems across different compilers, so I don't want to use it as long as ViennaCL uses a header-only approach.

Best regards,
Karli

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel