Re: Sparse Matrix Storage Consumption Issue

2017-05-08 Thread Matthias Boehm
Quick update: The poor runtime of scenario (3) is now fixed in master. The causes were an unnecessary shuffle and load imbalance in Spark rexpand operations with a small input vector and a large, ultra-sparse output matrix. Thanks for pointing this out, Mingyang. Regards, Matthias

Re: Sparse Matrix Storage Consumption Issue

2017-05-08 Thread Matthias Boehm
OK, thanks for sharing. I'll have a look later this week. Regards, Matthias (in reply to Mingyang Wang's message of Mon, May 8, 2017 at 2:20 PM)

Re: Sparse Matrix Storage Consumption Issue

2017-05-08 Thread Mingyang Wang
Hi Matthias, With a driver memory of 10GB, all operations were executed on CP, and I did observe that the version that reads FK as a vector and then converts it was faster: it took 8.337s (6.246s in GC), whereas the version that reads FK as a matrix took 31.680s (26.256s in GC). For the

Re: Sparse Matrix Storage Consumption Issue

2017-05-06 Thread Matthias Boehm
Yes, even with the previous patch for improved memory efficiency of ultra-sparse matrices in MCSR format, there is still some unnecessary overhead that leads to garbage collection. For this reason, I would recommend reading it as a vector and converting it in memory to an ultra-sparse matrix. I also
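
The recommendation above (read FK as a vector, then expand it in memory) can be illustrated with a small sketch. This is not SystemML code: it assumes FK is a 1-based foreign-key vector and that the target is an indicator matrix with exactly one non-zero per row, stored as plain CSR arrays. The function name `expand_to_csr` and its parameters are hypothetical and exist only for this illustration.

```python
def expand_to_csr(fk, num_cols):
    """Expand a 1-based key vector into a one-hot matrix in CSR form.

    Hypothetical helper for illustration only; row i of the result has a
    single 1.0 at column fk[i] - 1.
    Returns (row_ptr, col_idx, values).
    """
    row_ptr = [0]   # row_ptr[i]..row_ptr[i+1] delimits the entries of row i
    col_idx = []    # 0-based column index of each non-zero
    values = []     # value of each non-zero (always 1.0 here)
    for key in fk:
        if not 1 <= key <= num_cols:
            raise ValueError(f"key {key} outside 1..{num_cols}")
        col_idx.append(key - 1)
        values.append(1.0)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


# Three rows, keys pointing at columns 2, 1 and 3 of a 3-column matrix.
row_ptr, col_idx, values = expand_to_csr([2, 1, 3], num_cols=3)
```

The input vector holds one number per row, so it is far cheaper to read than a serialized sparse matrix with the same logical content; the expansion then happens in a single in-memory pass.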

Re: Sparse Matrix Storage Consumption Issue

2017-05-04 Thread Mingyang Wang
Out of curiosity, I increased the driver memory to 10GB, and then all operations were executed on CP. Execution took 37.166s, but JVM GC alone took 30.534s. Is this the expected behavior? Total elapsed time: 38.093 sec. Total compilation time: 0.926 sec. Total execution time: 37.166

Re: Sparse Matrix Storage Consumption Issue

2017-05-04 Thread Mingyang Wang
Hi Matthias, Thanks for the patch. I have re-run the experiment and observed that there was indeed no more memory pressure, but it still took ~90s for this simple script. What is the bottleneck in this case? Total elapsed time: 94.800 sec. Total compilation time: 1.826 sec.

Re: Sparse Matrix Storage Consumption Issue

2017-05-03 Thread Matthias Boehm
To summarize, this was an issue of selecting serialized representations for large ultra-sparse matrices. Thanks again for sharing your feedback with us. 1) In-memory representation: In CSR, every non-zero requires 12 bytes, which is 240MB in your case. The overall memory consumption,
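
The 12-bytes-per-non-zero figure can be checked with a quick calculation. The sketch below assumes the 12 bytes split into a 4-byte integer column index plus an 8-byte double value, and that 240MB means 240 * 10^6 bytes, which implies roughly 20 million non-zeros. The non-zero count is derived from those two numbers and is not stated in the thread. The optional row-pointer term (4 bytes per row plus one) is likewise an assumption about a typical CSR layout.

```python
BYTES_PER_COL_INDEX = 4  # assumption: 32-bit int column index
BYTES_PER_VALUE = 8      # assumption: 64-bit double value


def csr_nonzero_bytes(nnz):
    """Bytes used by the column-index and value arrays for nnz non-zeros."""
    return nnz * (BYTES_PER_COL_INDEX + BYTES_PER_VALUE)


def csr_total_bytes(nnz, rows, bytes_per_row_ptr=4):
    """Adds the row-pointer array (rows + 1 entries) to the non-zero cost."""
    return csr_nonzero_bytes(nnz) + (rows + 1) * bytes_per_row_ptr


nnz = 20_000_000  # assumption: derived from 240MB / 12 bytes
print(csr_nonzero_bytes(nnz) / 1e6, "MB")
```

For an ultra-sparse matrix with many rows, the row-pointer term in `csr_total_bytes` can be a noticeable addition to the per-non-zero cost.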