Guillaume seemed to be able to do this on a per-iteration basis, so it is
reasonable to expect that it can be done once. So it's a 50-50 call whether it
is indeed something that was "unknowingly changed". Also, are you
reading the data and parsing it on the slaves, or really serializing it
from one driver?
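
If it's the latter, the contrast looks roughly like this (a sketch; the names
and the line format are made up for illustration):

    import org.apache.spark.SparkContext
    import org.jblas.DoubleMatrix

    // (a) Parse on the workers: each executor reads its own split and builds
    // its matrices locally, so nothing large is serialized from the driver.
    def parseOnWorkers(sc: SparkContext, path: String) =
      sc.textFile(path).map { line =>
        new DoubleMatrix(line.split(' ').map(_.toDouble))  // column vector per line
      }

    // (b) Build on the driver and ship it: the whole matrix goes through the
    // serializer (broadcast it once rather than capturing it in every closure).
    def shipFromDriver(sc: SparkContext, big: DoubleMatrix) = {
      val bigBc = sc.broadcast(big)
      sc.parallelize(0 until big.columns).map(j => bigBc.value.getColumn(j).sum())
    }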

Guillaume, can you post the relevant code so we can help stare at it and
consider what's happening where? We've done a lot of Spark-JBLAS code, so we
are reasonably familiar with the memory utilization patterns. It may also
be relevant whether you're doing scalar, vector, or matrix-matrix operations,
although that bears more directly on native memory.
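
To be concrete about that distinction, in jblas terms (toy sizes, just for
illustration):

    import org.jblas.DoubleMatrix

    val a = new DoubleMatrix(1000, 1000)   // 1000x1000 doubles ~ 8 MB on the JVM heap
    val b = new DoubleMatrix(1000, 1000)
    val v = new DoubleMatrix(1000)         // column vector of length 1000

    val scaled = a.mul(2.0)    // scalar op: element-wise multiply, plain JVM loop
    val av     = a.mmul(v)     // matrix-vector product
    val ab     = a.mmul(b)     // matrix-matrix product, dispatched to native BLAS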

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Thu, Dec 19, 2013 at 10:33 AM, Matei Zaharia <[email protected]> wrote:

> Hi Guillaume,
>
> I haven’t looked at the serialization of DoubleMatrix, but I believe it
> just creates one big Array[Double] instead of many separate ones, and stores
> all the rows contiguously in it. I don’t think that would be slower to serialize.
> However, because the object is bigger overall, it might need to get
> allocated in another part of the heap (e.g. instead of in the new
> generation), which causes more GC and may cause out-of-memory sooner. How
> big are these matrices? You might want to calculate what exactly is taking
> up memory.
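
For a quick back-of-envelope on that (a rough sketch; the factor dimension
below is only an illustrative guess, not Guillaume's actual setting):

    // A dense m x n DoubleMatrix is backed by a single double[m * n]:
    // 8 bytes per entry in one contiguous array, plus a small object header.
    def denseMatrixBytes(rows: Long, cols: Long): Long = 8L * rows * cols

    denseMatrixBytes(4500000L, 100L)  // 4.5M rows x 100 factors ~= 3.6e9 bytes (~3.4 GiB)
    // An array that large is typically far too big for the new generation, so
    // it lands straight in the old generation -- hence Matei's point about GC.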
>
> Matei
>
> On Dec 19, 2013, at 2:22 AM, Guillaume Pitel <[email protected]>
> wrote:
>
>  Hi Sparkers,
>
> A bit of context :
>
> I'm working on a Fast SVD method for inclusion in MLlib (to perform
> latent semantic analysis on large corpora).
>
> I started using the same approach as the ALS algorithm, but this
> approach was unable to cope with the kind of data size I want to process
> (at least not on my little cluster of 5 nodes with 32GB RAM). For now I'm
> working with the English Wikipedia corpus, which produces sparse matrices
> of 4.5M documents x 2.5M terms. With the ALS approach it didn't even
> manage to finish a half-iteration; simply preparing the blocked sparse
> matrix was already a problem.
>
> So I've rewritten the whole thing, changed the approach, and reached
> decent performance (about 2 iterations in one hour).
>
> Then I realized that going from Array[Array[Double]] to jblas.DoubleMatrix
> actually creates a copy of the array, so I thought I could save a lot
> of memory and GC time by working with DoubleMatrix throughout and never
> going back and forth from/to Array[Array[Double]].
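
For reference, that copy and the direct construction look roughly like this
(a sketch; jblas keeps the data column-major in one flat array):

    import org.jblas.DoubleMatrix

    val rows: Array[Array[Double]] = Array(Array(1.0, 2.0), Array(3.0, 4.0))

    // This constructor copies every row into one flat, column-major double[]:
    val copied = new DoubleMatrix(rows)

    // Building and keeping a DoubleMatrix end-to-end avoids that extra copy;
    // the flat backing array is exposed as the public field data.
    val m = DoubleMatrix.zeros(2, 2)
    m.put(0, 0, 1.0); m.put(0, 1, 2.0)
    m.put(1, 0, 3.0); m.put(1, 1, 4.0)
    assert(java.util.Arrays.equals(copied.data, m.data))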
>
> But with this approach, the performance seems to be seriously degraded,
> and OOM errors happen where they didn't before. So my question is: is it
> possible that serializing DoubleMatrix instead of Array[Array[Double]]
> could really degrade performance, or did I unknowingly change
> something in my code?
>
> How can I debug the size and time of the serialization? In general, are
> there guidelines on the right choice of datatypes for the outputs of RDD
> maps/reduces?
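
One crude way to measure both, assuming the default Java serialization (i.e.
Kryo not configured):

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // Serialize any object and report (size in bytes, elapsed milliseconds).
    def measureSerialization(obj: AnyRef): (Int, Long) = {
      val buffer = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(buffer)
      val start = System.nanoTime()
      out.writeObject(obj)
      out.close()
      (buffer.size(), (System.nanoTime() - start) / 1000000L)
    }

    // e.g. compare measureSerialization(myDoubleMatrix) against
    // measureSerialization(myArrayOfArrays) on identical data.

As a rough guideline, flat primitive arrays (or objects wrapping one, like
DoubleMatrix) serialize compactly; enabling Kryo (spark.serializer set to
org.apache.spark.serializer.KryoSerializer) and registering your classes often
matters more than the choice between these two datatypes.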
>
> In case it can help, here is a stacktrace of the OOM error I got :
>
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
>       at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
>       at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:132)
>       at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:202)
>       at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
>       at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
>       at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
>       at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:74)
>       at org.apache.spark.storage.BlockManager.liftedTree1$1(BlockManager.scala:608)
>       at org.apache.spark.storage.BlockManager.put(BlockManager.scala:604)
>       at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:75)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:224)
>       at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>       at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>       at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:149)
>       at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
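
The top frames show the OOM hitting while the MemoryStore estimates the size
of a block being cached in deserialized form; that is just where the
allocation happened to fail, but if cached matrices are the pressure point, a
serialized storage level avoids the object-graph walk (a sketch; myMatrixRdd
is a stand-in name):

    import org.apache.spark.storage.StorageLevel

    // Store cached partitions as serialized byte buffers: the MemoryStore
    // then sizes the buffer directly instead of walking the object graph
    // with SizeEstimator, at the cost of deserializing on each access.
    val cached = myMatrixRdd.persist(StorageLevel.MEMORY_ONLY_SER)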
>
>
> Thanks in advance for your time & help
> Guillaume
> --
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80
>
> eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>
>
>
