Karl, thank you for your reply!
On Thu, Jul 14, 2016 at 1:45 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:
> Hi again,
>
>
>>
> 15 seconds of copying for a 10k-by-10k matrix looks way too much.
> 10k-by-10k is 800 MB of data for double precision, so this should not take
> much more than 100 ms on a low-range laptop (10 GB/sec memory bandwidth).
> Even with multiple matrices and copies you should stay in the 1 second
> regime.
You are right. I don't see a significant variation whether I use fast_copy
or the constructor. The time at this point is mostly consumed by moving
Mahout data structures into RM or CCS format, and it is really a POC now,
so we are working to get it faster. But Java is really slow, especially
when working with native buffers naively -- we will have to improve that.
For the record, the 15 seconds probably includes loading all necessary
classes and serializing two 10k-by-10k matrices in and one 10k-by-k matrix
out, including all the Scala-side conversions.
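To illustrate the naive-native-buffer cost (a minimal JDK-only sketch -- no
Mahout or JavaCPP types; `copyNaive`/`copyBulk` are names I made up):
pushing a matrix into an off-heap buffer one `putDouble` at a time is much
slower than a single bulk `put` through a `DoubleBuffer` view of the same
memory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class BufferCopy {
    // Naive: one bounds-checked putDouble call per element.
    static void copyNaive(double[] rowMajor, ByteBuffer dst) {
        dst.clear();
        for (double v : rowMajor) {
            dst.putDouble(v);
        }
    }

    // Bulk: a single put through a DoubleBuffer view of the same memory.
    static void copyBulk(double[] rowMajor, ByteBuffer dst) {
        dst.clear();
        DoubleBuffer view = dst.asDoubleBuffer();
        view.put(rowMajor);
    }

    public static void main(String[] args) {
        int n = 1000;                         // small stand-in for 10k
        double[] m = new double[n * n];
        for (int i = 0; i < m.length; i++) m[i] = i;

        ByteBuffer buf = ByteBuffer.allocateDirect(m.length * Double.BYTES)
                                   .order(ByteOrder.nativeOrder()); // match native layout
        copyBulk(m, buf);
        // read back one element to check the copy landed correctly
        System.out.println(buf.asDoubleBuffer().get(12345)); // prints 12345.0
    }
}
```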
> Why do you expect to beat OpenBLAS? Their kernels are really well
> optimized, and for large dense matrix-matrix you are always FLOP-limited.
I don't expect, I experiment. I don't know why, but the current results
are such that the stock Ubuntu BLAS takes about 88 seconds for the dense
10k multiplication test (with R, which is set up to use it; perhaps it
also takes a long time to convert to BLAS layout, but nevertheless it pins
the CPU at 100%). If I compile ViennaCL with -march=haswell and
-ffast-math, I get about 35 seconds. What's perplexing is that the same
test on BidMat's MatD matrices takes less than 10 seconds on my computer
-- and it doesn't even saturate my CPU at 100%. Something is fishy about
BidMat. I don't have a super-beefy CPU, only a 6-core/12-thread Haswell-E.
I know that even MKL takes in the area of 16 seconds on 24 threads on
Xeons, so 88 seconds for OpenBLAS on my platform looks plausible. 10 or
even 8 seconds (BidMat + supposedly MKL) does not -- something is fishy
there.
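For reference, here is the back-of-the-envelope arithmetic behind these
timings, assuming the standard 2*N^3 FLOP count for a dense N-by-N product
and the per-core Haswell peak of 2.4 GHz * 4 doubles (AVX) * 2 (FMA)
quoted later in this thread:

```java
public class DgemmRoofline {
    // FLOP count for a dense N x N matrix product: N^2 outputs, each a
    // length-N dot product (N multiplies + N adds) -> 2 * N^3.
    static double flops(long n) {
        return 2.0 * n * n * n;
    }

    // Seconds needed at a given aggregate rate in GFLOP/s.
    static double seconds(double flop, double gflopsPerSec) {
        return flop / (gflopsPerSec * 1e9);
    }

    public static void main(String[] args) {
        long n = 10_000;
        double flop = flops(n);                          // 2e12 FLOP
        // Assumed per-core Haswell peak: 2.4 GHz * 4 doubles (AVX) * 2 (FMA)
        double perCorePeak = 2.4 * 4 * 2;                // 19.2 GFLOP/s
        double sixCorePeak = 6 * perCorePeak;            // 115.2 GFLOP/s
        System.out.printf("peak time:     %.1f s%n", seconds(flop, sixCorePeak)); // ~17.4 s
        // ~15 GFLOP/s per core is the MKL-like rate quoted in the thread
        System.out.printf("MKL-like time: %.1f s%n", seconds(flop, 6 * 15.0));    // ~22.2 s
    }
}
```

By this arithmetic a double-precision 10k multiply cannot finish much under
~17 seconds on six Haswell cores even at theoretical peak, which is why a
sub-10-second BidMat run looks suspicious unless it is doing something
different (single precision, for instance).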
> On the other hand, bidmat (which allegedly uses mkl) does the same test,
>> double precision, in under 10 seconds. I can't fathom how, but it does.
>> I have a haswell-E platform.
>>
>
> Multiplication of 10k-by-10k matrices amounts to 200 GFLOP of compute in
> double precision. A Haswell-E machine provides that within a few seconds,
> depending on the number of cores (2.4 GHz * 4 doubles with AVX * 2 for FMA
> = 19.2 GFLOP/sec per core. MKL achieves about 15 GFLOP/sec per core).
>
So this sounds like a validation of BidMat's results. Interesting. Why is
R+OpenBLAS so slow then? And what rates should we expect from ViennaCL +
OpenMP compared to MKL?
How much of an improvement do you observe/expect from the new pull
request -- is there any hope of getting closer to MKL's dense dgemm?
The primary reasons against BLAS/MKL are that they are yet another
platform and, most importantly, that we cannot redistribute them, being an
Apache 2-licensed project. So we'd have to ask people to install a
particular commercial product; whereas if ViennaCL covered our sparse
algorithm needs, we'd rather just have it all in one package (or at least
leverage hardware/software support in steps). We are very limited in
resources; that's the reason we are trying to get working with ViennaCL:
-- it has sparse algorithms
-- it supports host/OpenCL/CUDA without the need for new APIs/conversions
-- it does not require installation of any shared libraries beyond what
JavaCPP already does for us automagically. So we can basically drop a jar
with JavaCPP in it into a Spark application and have it running on
ViennaCL. Even netlib (BLAS) or the netlib-java API does not make it quite
as easy (and, by the way, we cannot redistribute those either because of
their licenses).
This is hard to beat. Especially if ViennaCL becomes well-rounded in
performance in most areas of interest, we don't need to depend on a
particular flavor of libblas.so being present (or any libblas.so, for that
matter).
One more question: is it possible to copy one matrix to an OpenCL device
while solving with another?
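In case it helps frame the question, the pattern I am after looks like the
sketch below, with a second plain JDK thread standing in for a second
OpenCL command queue (`copyToDevice` and `solve` are hypothetical
stand-ins, not ViennaCL calls):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OverlapSketch {
    // Hypothetical stand-in for a host->device transfer of matrix B.
    static String copyToDevice(String name) {
        sleep(50);                       // pretend the PCIe transfer takes a while
        return name + "@device";
    }

    // Hypothetical stand-in for solving with matrix A already on the device.
    static String solve(String name) {
        sleep(50);                       // pretend the kernel runs for a while
        return "solution(" + name + ")";
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        // With OpenCL this would be two command queues on one context;
        // here a dedicated thread plays the role of the copy queue.
        ExecutorService copyQueue = Executors.newSingleThreadExecutor();
        CompletableFuture<String> copyB =
            CompletableFuture.supplyAsync(() -> copyToDevice("B"), copyQueue);

        String solved = solve("A");      // compute on A while B streams in
        String bOnDevice = copyB.join(); // wait for the transfer to finish

        copyQueue.shutdown();
        System.out.println(solved + ", " + bOnDevice); // prints solution(A), B@device
    }
}
```

In raw OpenCL terms this would be a non-blocking clEnqueueWriteBuffer on
one command queue while the solver's kernels run on another queue of the
same context.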
thank you!
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel