>     Why do you expect to beat OpenBLAS? Their kernels are really well
>     optimized, and for large dense matrix-matrix multiplication you are
>     always FLOP-limited.
>
>
> I don't expect, I experiment. I don't know why, but the current results
> are that the stock Ubuntu BLAS takes about 88 seconds for the dense 10k
> multiplication test (run from R, which is set up to use it; perhaps it
> also takes a long time to convert the data for BLAS, but it nevertheless
> pins the CPU at 100%). If I compile ViennaCL with -march=haswell and
> -ffast-math, then I get about 35 seconds. What's perplexing is that the
> same test with BidMat's MatD matrices takes less than 10 seconds on my
> computer -- and they don't even saturate my CPU at 100%. Something is
> fishy about BidMat. I don't have a super-beefy CPU, only a
> 6-core/12-thread Haswell-E. I know that even MKL takes in the area of
> 16 seconds on 24 threads on Xeons, so 88 seconds for OpenBLAS on my
> platform looks plausible. 10 or even 8 seconds (BidMat + supposedly MKL)
> does not -- something is fishy there.

It shouldn't be too hard to directly verify the correctness of the results :-)
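
For reference, a minimal timing and spot-check sketch against the host
(OpenMP) backend could look as follows (the 10k size, the random fill,
and the nested-std::vector copy overloads are illustrative assumptions;
adjust to your setup):

  // dgemm_bench.cpp -- rough sketch, not a rigorous benchmark.
  // Build e.g.: g++ -O3 -std=c++11 -march=haswell -ffast-math -fopenmp \
  //             -DVIENNACL_WITH_OPENMP -I/path/to/ViennaCL dgemm_bench.cpp
  #include <viennacl/matrix.hpp>
  #include <viennacl/linalg/prod.hpp>
  #include <chrono>
  #include <cstdlib>
  #include <iostream>
  #include <vector>

  int main()
  {
    std::size_t const N = 10000;
    std::vector<std::vector<double> > A_cpu(N, std::vector<double>(N)),
                                      B_cpu(N, std::vector<double>(N));
    for (std::size_t i = 0; i < N; ++i)
      for (std::size_t j = 0; j < N; ++j)
      {
        A_cpu[i][j] = double(std::rand()) / RAND_MAX;
        B_cpu[i][j] = double(std::rand()) / RAND_MAX;
      }

    viennacl::matrix<double> A(N, N), B(N, N), C(N, N);
    viennacl::copy(A_cpu, A);   // host container -> ViennaCL matrix
    viennacl::copy(B_cpu, B);

    auto t0 = std::chrono::steady_clock::now();
    C = viennacl::linalg::prod(A, B);   // the dense matrix-matrix product
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::cout << seconds << " s, ~" << 2.0 * N * N * N / seconds * 1e-9
              << " GFLOP/s" << std::endl;

    // Cheap correctness spot check: one entry of C vs. a naive dot product.
    std::vector<std::vector<double> > C_cpu(N, std::vector<double>(N));
    viennacl::copy(C, C_cpu);   // ViennaCL matrix -> host container
    double ref = 0;
    for (std::size_t k = 0; k < N; ++k)
      ref += A_cpu[0][k] * B_cpu[k][0];
    std::cout << "C(0,0) = " << C_cpu[0][0] << " vs. " << ref << std::endl;
  }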


>     Multiplication of two 10k-by-10k matrices amounts to about 2 TFLOP
>     of compute in double precision (2*n^3 with n = 10^4). A Haswell-E
>     machine provides that within a few seconds, depending on the number
>     of cores (2.4 GHz * 2 FMA units * 4 doubles with AVX * 2 FLOP per
>     FMA = 38.4 GFLOP/sec peak per core; MKL achieves about 15 GFLOP/sec
>     per core).
>
>
> So this sounds like a validation of BidMat's results. Interesting.
> Why is R+OpenBLAS so slow, then? And what performance should we expect
> from ViennaCL + OpenMP compared to MKL rates?

I don't know the internals of R+OpenBLAS. Maybe there is extensive 
debugging going on, or OpenBLAS is only used with a single thread. 
ViennaCL+OpenMP vs. MKL is hard to answer in general. It all depends a 
lot on compiler flags, the underlying CPU, etc.
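
As a back-of-the-envelope check with the numbers from this thread
(6 cores at 2.4 GHz, two AVX FMA units per Haswell core):

  2 * 10000^3 FLOP          = 2 TFLOP
  6 cores * 38.4 GFLOP/sec  = ~230 GFLOP/sec peak
  2 TFLOP / 230 GFLOP/sec   = ~9 sec   (consistent with the BidMat timing)
  2 TFLOP / 35 sec          = ~57 GFLOP/sec for the -march=haswell ViennaCL build
  2 TFLOP / 88 sec          = ~23 GFLOP/sec for the R+OpenBLAS run

So roughly 10 seconds for a well-tuned GEMM is plausible on that machine,
while 88 seconds points at something like a single-threaded run.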


> How much of an improvement do you observe/expect from the new pull
> request? Is there any hope of getting closer to MKL's dense DGEMM?

The student reported about 50 percent of MKL performance on a laptop 
CPU. More importantly, though, the new code provides a good 
infrastructure for further improvements on different architectures, 
e.g. ARM-based CPUs.
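
To give an idea of what such infrastructure means in practice, here is a
generic register-blocked micro-kernel sketch (not the pull request's
actual code); the tile sizes MR and NR are exactly the kind of knobs
that need retuning per CPU:

  #include <cstddef>

  // Generic GEMM micro-kernel sketch (NOT the pull request's code).
  // MR and NR must match the register file and vector width of the
  // target CPU, so a Haswell build and an ARM build want different
  // values even though the surrounding blocking code is shared.
  template <std::size_t MR, std::size_t NR>
  void gemm_micro_kernel(std::size_t K,
                         double const *A_panel,  // packed MR x K panel (column-major)
                         double const *B_panel,  // packed K x NR panel (row-major)
                         double *C, std::size_t ldc)  // row-major C tile
  {
    double acc[MR][NR] = {};  // accumulators the compiler should keep in registers
    for (std::size_t k = 0; k < K; ++k)
      for (std::size_t i = 0; i < MR; ++i)
        for (std::size_t j = 0; j < NR; ++j)
          acc[i][j] += A_panel[i + k * MR] * B_panel[j + k * NR];
    for (std::size_t i = 0; i < MR; ++i)
      for (std::size_t j = 0; j < NR; ++j)
        C[i * ldc + j] += acc[i][j];
  }

  // e.g. gemm_micro_kernel<6, 8>(...) might suit an AVX2 CPU, while an
  // ARM core would typically use different tile sizes.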


> The primary reason against BLAS/MKL is that they are yet another
> platform which, most importantly, we cannot redistribute, being an
> Apache 2.0 licensed project. So we'd have to ask people to install a
> particular commercial product; but if ViennaCL covers our sparse
> algorithm needs, we'd rather just have it all in one package (or at
> least leverage hardware/software support in steps). We are very limited
> in resources, which is the reason we are trying to get things working
> with ViennaCL:
>
> -- it has sparse algorithms
> -- it supports host/OpenCL/CUDA without the need for new APIs/conversions
> -- it does not require installation of any shared libraries beyond what
> JavaCPP already handles for us automagically. So we can basically drop
> a jar with JavaCPP in it into a Spark application and have it running
> on ViennaCL. Even netlib (BLAS) or the netlib-java API does not make it
> quite as easy (and, btw, we cannot redistribute those either because of
> their licenses).

ah, makes sense!


> This is hard to beat. In particular, if ViennaCL becomes well-rounded
> in performance in most areas of interest, we won't need to depend on a
> particular flavor of libblas.so being present (or on any libblas.so,
> for that matter).

Is DGEMM your performance-critical operation? Are there any other 
performance-critical operations?


> One more question: is it possible to copy one matrix into an OpenCL
> device while solving another?
> Thank you!

Yes, that is possible using async_copy(). I recommend issuing the copy 
before the solver is started. You can also achieve a similar effect 
through a second OpenCL command queue.

(Needless to say, you should first profile in order to find out whether 
it is worth the effort.)
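
A minimal sketch of the async_copy() variant (assuming the OpenCL
backend, a CG solve on a compressed_matrix, and an iterator-style
async_copy() overload; please check the overloads shipped with your
ViennaCL version):

  #include <viennacl/vector.hpp>
  #include <viennacl/compressed_matrix.hpp>
  #include <viennacl/linalg/cg.hpp>
  #include <vector>

  void overlapped_solve(viennacl::compressed_matrix<double> const &A,
                        viennacl::vector<double> const &rhs_current,
                        std::vector<double> const &host_rhs_next,
                        viennacl::vector<double> &rhs_next,
                        viennacl::vector<double> &x)
  {
    // Enqueue the transfer of the *next* right-hand side before the solve
    // starts; async_copy() returns without waiting for the transfer.
    viennacl::async_copy(host_rhs_next.begin(), host_rhs_next.end(),
                         rhs_next.begin());

    // The CG solve on the current system runs while the transfer is in flight.
    x = viennacl::linalg::solve(A, rhs_current, viennacl::linalg::cg_tag());

    // Synchronize before touching rhs_next so the copy is known to be complete.
    viennacl::backend::finish();
  }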

Best regards,
Karli

