Hi Sparkers,

A bit of context:

I'm working on a fast SVD method for inclusion in MLlib (to perform latent semantic analysis on a large corpus).

I started with the same approach as the ALS algorithm, but it was unable to cope with the data size I want to process (at least not on my little cluster of 5 nodes with 32GB of RAM). For now I'm working with the English Wikipedia corpus, which produces sparse matrices of 4.5M documents x 2.5M terms. With the ALS approach it didn't even manage to finish half an iteration; simply preparing the blocked sparse matrix was already a problem.

So I rewrote the whole thing with a different approach and reached interesting performance (about 2 iterations in one hour).

Then I realized that going from Array[Array[Double]] to jblas.DoubleMatrix actually creates a copy of the array, so I thought I could save a lot of memory and GC time by working with DoubleMatrix everywhere and never going back and forth to/from Array[Array[Double]].
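
To make the copy concrete, here is a minimal standalone sketch (not my actual code) of the two variants:

import org.jblas.DoubleMatrix

// Going back and forth: each conversion allocates a fresh copy of the data.
val rows: Array[Array[Double]] = Array(Array(1.0, 2.0), Array(3.0, 4.0))
val m = new DoubleMatrix(rows)               // copies rows into a new flat buffer
val back: Array[Array[Double]] = m.toArray2  // copies again on the way out

// What I'm trying now: keep DoubleMatrix end to end so the copies disappear,
// but then the matrices themselves travel through Spark's serializer.
val product = m.mmul(m.transpose())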

But with this approach, performance seems seriously degraded, and OOM errors happen where they didn't before. So my question is: is it possible that serializing DoubleMatrix instead of Array[Array[Double]] could really degrade performance, or did I unknowingly change something else in my code?
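
In case the serializer configuration is relevant: my understanding is that DoubleMatrix could be registered with Kryo roughly like this (a sketch only; MatrixRegistrator is just a placeholder name, the properties are the standard Spark ones):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.jblas.DoubleMatrix

// Register DoubleMatrix so Kryo doesn't write the fully-qualified class name
// alongside every unregistered object it serializes.
class MatrixRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[DoubleMatrix])
  }
}

// Set before creating the SparkContext:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "MatrixRegistrator")

Would that be expected to make a difference here?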

How can I debug the size and time of the serialization? In general, are there guidelines on the right choice of datatypes for the outputs of RDD maps/reduces?
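
For instance, would something along these lines be a reasonable way to compare the per-record serialized size (a rough sketch using plain Java serialization)?

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.jblas.DoubleMatrix

// Serialize one object to a byte buffer and report its size.
def serializedSize(obj: AnyRef): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  bytes.size()
}

val m = DoubleMatrix.rand(1000, 100)
println("DoubleMatrix:         " + serializedSize(m) + " bytes")
println("Array[Array[Double]]: " + serializedSize(m.toArray2) + " bytes")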

In case it helps, here is a stack trace of the OOM error I got:

java.lang.OutOfMemoryError: Java heap space
	at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
	at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
	at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:132)
	at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:202)
	at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
	at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
	at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
	at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:74)
	at org.apache.spark.storage.BlockManager.liftedTree1$1(BlockManager.scala:608)
	at org.apache.spark.storage.BlockManager.put(BlockManager.scala:604)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:75)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:224)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
	at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:149)
	at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

Thanks in advance for your time & help
Guillaume
--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
