Hi Guillaume,

I haven’t looked at the serialization of DoubleMatrix, but I believe it just 
creates one big Array[Double] instead of many small ones, and stores all the 
rows contiguously in it. I don’t think that would be slower to serialize. 
However, because the object is bigger overall, it might need to be allocated in 
a different part of the heap (e.g. outside the new generation), which causes 
more GC and may trigger out-of-memory errors sooner. How big are these 
matrices? You might want to calculate exactly what is taking up memory.
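As a starting point, here is a back-of-the-envelope sketch of the heap footprint of the two layouts. The per-object overheads below are typical 64-bit HotSpot values, used here as assumptions rather than exact figures:

```java
public class MatrixMemoryEstimate {
    // Typical 64-bit JVM overheads (assumptions, not exact figures):
    static final long ARRAY_HEADER = 16;   // object header + array length field
    static final long REFERENCE = 8;       // one object reference

    // One flat double[rows * cols], the way a single big array stores a matrix
    static long flatArrayBytes(long rows, long cols) {
        return ARRAY_HEADER + rows * cols * 8;
    }

    // double[rows][cols]: one outer pointer array plus one object per row
    static long nestedArrayBytes(long rows, long cols) {
        long outer = ARRAY_HEADER + rows * REFERENCE;
        long inner = rows * (ARRAY_HEADER + cols * 8);
        return outer + inner;
    }

    public static void main(String[] args) {
        long rows = 10_000, cols = 1_000;
        System.out.println("flat:   " + flatArrayBytes(rows, cols) + " bytes");
        System.out.println("nested: " + nestedArrayBytes(rows, cols) + " bytes");
    }
}
```

The per-row overhead of the nested layout is small relative to the data, but the flat layout’s single huge allocation is what can force the matrix out of the young generation.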

Matei

On Dec 19, 2013, at 2:22 AM, Guillaume Pitel <guillaume.pi...@exensa.com> wrote:

> Hi Sparkers,
> 
> A bit of context :
> 
> I'm working on a fast SVD method for inclusion in MLlib (to perform a latent 
> semantic analysis on a large corpus). 
> 
> I started using the same approach as the ALS algorithm, but this approach 
> was unable to cope with the kind of data size I want to process (at least not 
> on my little cluster of 5 nodes with 32GB RAM). For now I'm working with the 
> English Wikipedia corpus, which produces sparse matrices of 4.5M documents x 
> 2.5M terms. I think that with the ALS approach it didn't even manage to 
> finish a half-iteration; even simply preparing the blocked sparse matrix was 
> already a struggle.
> 
> So I've rewritten the whole thing, changed the approach, and I've reached 
> interesting performance (about 2 iterations in one hour). 
> 
> Then I realized that going from Array[Array[Double]] to jblas.DoubleMatrix 
> actually creates a copy of the array, so I thought I could save a lot of 
> memory and GC time by working with DoubleMatrix throughout and never 
> converting back and forth from/to Array[Array[Double]].
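For reference, the copy is inherent to the layout change: building a DoubleMatrix means flattening the row arrays into one column-major buffer. This is an illustrative sketch of that flattening, not the jblas implementation itself:

```java
public class FlattenCopy {
    // Flatten row arrays into one column-major buffer; every element is copied.
    static double[] toColumnMajor(double[][] rows) {
        int m = rows.length, n = rows[0].length;
        double[] flat = new double[m * n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                flat[j * m + i] = rows[i][j];   // column-major index
        return flat;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        // column-major order walks down each column: 1, 3, 2, 4
        System.out.println(java.util.Arrays.toString(toColumnMajor(a)));
    }
}
```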
> 
> But with this approach, performance seems to be seriously degraded, and OOM 
> errors happen where they didn't before. So my question is: is it possible 
> that serializing DoubleMatrix instead of Array[Array[Double]] could really 
> degrade performance, or did I unknowingly change something in my code?
> 
> How can I debug the size and time of the serialization? In general, are 
> there guidelines for choosing the right datatypes in the outputs of RDD 
> maps/reduces? 
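One quick way to gauge serialized size and time outside of Spark is to serialize a representative record into memory and measure it. This is a rough sketch using plain Java serialization; Spark's JavaSerializer behaves similarly, and with Kryo the numbers would differ:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationProbe {
    // Serialize obj into memory, print elapsed time, and return the byte count.
    static int serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        long start = System.nanoTime();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println(bytes.size() + " bytes in " + micros + " us");
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        double[] flat = new double[1000];          // one big array
        double[][] nested = new double[100][10];   // same elements, many small arrays
        serializedSize(flat);
        serializedSize(nested);
    }
}
```

Running this on the actual record type of your RDD (a row block, say) shows both how many bytes cross the wire and roughly how long each record takes to write.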
> 
> In case it can help, here is a stacktrace of the OOM error I got :
> 
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
>       at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
>       at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:132)
>       at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:202)
>       at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
>       at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
>       at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
>       at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:74)
>       at org.apache.spark.storage.BlockManager.liftedTree1$1(BlockManager.scala:608)
>       at org.apache.spark.storage.BlockManager.put(BlockManager.scala:604)
>       at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:75)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:224)
>       at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>       at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>       at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:149)
>       at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
> 
> Thanks in advance for your time & help
> Guillaume
> -- 
> Guillaume PITEL, Président 
> +33(0)6 25 48 86 80
> 
> eXenSa S.A.S. 
> 41, rue Périer - 92120 Montrouge - FRANCE 
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
