Hi Sparkers,

A bit of context: I'm working on a fast SVD method for inclusion in MLlib (to perform latent semantic analysis on a large corpus). I started with the same approach as the ALS algorithm, but that approach was unable to cope with the data sizes I want to process (at least not on my little cluster of 5 nodes with 32GB RAM). For now I'm working with the English Wikipedia corpus, which produces a sparse matrix of 4.5M documents x 2.5M terms. I think that with the ALS approach it didn't even manage to finish a half-iteration, and simply preparing the blocked sparse matrix ...

So I've rewritten the whole thing, changed the approach, and reached interesting performance (about 2 iterations in one hour).

Then I realized that going from Array[Array[Double]] to jblas.DoubleMatrix actually creates a copy of the array, so I thought I could save a lot of memory and GC time by working only with DoubleMatrix and never going back and forth to/from Array[Array[Double]]. But with this approach the performance seems to be seriously degraded, and OOM errors happen where they didn't before.

So my questions are: is it possible that serializing DoubleMatrix instead of Array[Array[Double]] could really degrade performance this much, or did I unknowingly change something else in my code? How can I debug the size and time of the serialization? In general, are there guidelines on the right choice of datatypes for the outputs of RDD maps/reduces?

In case it can help, here is a stack trace of the OOM error I got:

java.lang.OutOfMemoryError: Java heap space
        at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
        at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
        at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:132)
        at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:202)
        at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
        at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
        at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
        at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:74)
        at org.apache.spark.storage.BlockManager.liftedTree1$1(BlockManager.scala:608)
        at org.apache.spark.storage.BlockManager.put(BlockManager.scala:604)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:75)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:224)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:149)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

Thanks in advance for your time & help

Guillaume
--
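[For readers unfamiliar with jblas: below is a minimal, self-contained Scala sketch of the two representations being compared. It is not Guillaume's actual code; the RDD, block size and variable names are made up for illustration. The point it shows is that the jblas DoubleMatrix(double[][]) constructor copies the nested array into one flat column-major double[], so keeping DoubleMatrix in a cached RDD means Spark stores and size-estimates that jblas object, whereas building it lazily inside the map closure keeps only Array[Array[Double]] in the RDD.]

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.jblas.DoubleMatrix

object MatrixRepresentationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "doublematrix-vs-array")

    // Hypothetical dense block; in the real job each record would be one
    // block of the 4.5M x 2.5M sparse term-document matrix.
    val block: Array[Array[Double]] = Array.fill(1000, 100)(math.random)
    val blocks: RDD[Array[Array[Double]]] = sc.parallelize(Seq(block), 1)

    // Option A: keep Array[Array[Double]] in the RDD and build the
    // DoubleMatrix only inside the task; the copy made by the constructor
    // lives just for the duration of the task.
    val normsA = blocks.map { rows =>
      val m = new DoubleMatrix(rows) // copies rows into a flat double[]
      m.norm2()
    }.collect()

    // Option B: keep DoubleMatrix itself in a cached RDD; Spark now has to
    // store, size-estimate and serialize the jblas objects themselves.
    val matrices: RDD[DoubleMatrix] =
      blocks.map(rows => new DoubleMatrix(rows)).cache()
    val normsB = matrices.map(_.norm2()).collect()

    println(normsA.toSeq == normsB.toSeq)
    sc.stop()
  }
}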

