> > Why do you expect to beat OpenBLAS? Their kernels are really well
> > optimized, and for large dense matrix-matrix you are always FLOP-limited.
>
> I don't expect, I experiment. I don't know why, but current results are such
> that stock Ubuntu BLAS takes about 88 seconds for the dense 10k
> multiplication test (with R, which is set up to use it; perhaps they also
> take a long time to convert to BLAS, but nevertheless it pins the CPU at 100%).
> If I compile Vienna with -march=haswell and -ffast-math then I get about
> 35 seconds. What's perplexing, the same test with BidMat's MatD matrices
> takes less than 10 seconds on my computer -- and they don't even
> saturate my CPU at 100%. Something is fishy about BidMat. I don't have a
> super-beefy CPU, only a 6-core/12-thread Haswell-E. I know that even MKL
> takes in the area of 16 seconds on 24 threads on Xeons, so 88 seconds
> for OpenBLAS on my platform looks plausible. 10 or even 8 seconds
> (BidMat + supposedly MKL) does not -- something is fishy there.
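
To put the timings quoted above on a common scale, here is a minimal sketch that converts wall-clock seconds into sustained GFLOP/s. It assumes the conventional count of 2n^3 floating-point operations for a dense n-by-n product (2 x 10^12 operations, i.e. 2000 GFLOP, for n = 10^4). The function names are hypothetical, and the labels only restate the timings mentioned in the message.

```python
def gemm_flop(n: int) -> float:
    """Operation count for a dense n-by-n product: n^3 multiplies plus n^3 adds."""
    return 2.0 * n ** 3


def effective_gflops(n: int, seconds: float) -> float:
    """Sustained rate in GFLOP/s for an n-by-n product that took `seconds`."""
    return gemm_flop(n) / seconds / 1e9


if __name__ == "__main__":
    n = 10_000
    # Wall-clock timings as quoted in the message (seconds).
    timings = {
        "R + stock Ubuntu BLAS": 88.0,
        "ViennaCL, -march=haswell -ffast-math": 35.0,
        "BidMat": 10.0,
    }
    for label, seconds in timings.items():
        print(f"{label}: {effective_gflops(n, seconds):.1f} GFLOP/s")
```

Dividing each rate by the number of cores actually in use gives a per-core figure that can be compared against a theoretical peak for the CPU in question.
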

It shouldn't be too hard to directly verify the correctness of the results :-)

> > Multiplication of 10k-by-10k matrices amounts to 200 GFLOP of
> > compute in double precision. A Haswell-E machine provides that
> > within a few seconds, depending on the number of cores (2.4 GHz * 4
> > doubles with AVX * 2 for FMA = 19.2 GFLOP/sec per core. MKL achieves
> > about 15 GFLOP/sec per core).
>
> So this sounds like a validation of BidMat's results. Interesting.
> Why is R+OpenBLAS so slow then? What is the expected performance of
> ViennaCL + OpenMP compared to MKL rates?

I don't know the internals of R+OpenBLAS. Maybe there is extensive debugging going on, or OpenBLAS is only used with a single thread. ViennaCL+OpenMP vs. MKL is hard to answer in general. It depends a lot on compiler flags, the underlying CPU, etc.

> How much of an improvement do you observe/expect from the new pull request?
> Is there any hope of getting closer to MKL dense dgemm?

The student reported about 50 percent of MKL on a laptop CPU. More important, though, is that the new code provides a good infrastructure for further improvements for different architectures, e.g. ARM-based CPUs.

> The primary reason against BLAS/MKL is that they are yet another
> platform which, most importantly, we cannot redistribute, being
> Apache 2 licensed. So we'd have to ask people to install a particular
> commercial product, but if ViennaCL would cover our sparse algorithm
> needs, we'd rather just have it all in one package (or at least leverage
> hardware/software support in steps). We are very limited in resources,
> which is why we are trying to get working with ViennaCL:
>
> -- it has sparse algorithms
> -- it supports host/OpenCL/CUDA without the need for new APIs/conversions
> -- it does not require installation of any shared libraries beyond what
> javacpp already does for us automagically. So we basically can drop a
> jar with javacpp in it into a Spark application and have it running on
> ViennaCL.
> Even netlib (BLAS) or the netlib-java API does not make it quite
> as easy (which btw we cannot redistribute either because of their
> licenses).

Ah, makes sense!

> This is hard to beat, especially if ViennaCL becomes well-rounded in
> performance in most areas of interest; we don't need to depend on a
> particular flavor of libblas.so being present (or any libblas.so for
> that matter).

Is DGEMM your performance-critical operation? Are there any other performance-critical operations?

> One more question: is it possible to copy one matrix into an OpenCL
> device while solving another?
> thank you!

Yes, that is possible using async_copy(). I recommend copying before the solver is started. You can also achieve a similar effect through a second OpenCL command queue. (Needless to say, you should first profile in order to find out whether it is worth the effort.)

Best regards,
Karli

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel