The Gramian is 8000 x 8000, dense, and full of 8-byte doubles, so about 512MB; because it's symmetric, you can get away with storing it in ~256MB. The catch is that Spark is going to send around copies of this ~256MB array. You may easily be running your driver out of memory given all the overheads and copies, or your executors, of which there are probably 2 by default splitting 1GB.
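For scale, here's the back-of-the-envelope arithmetic as a quick Scala snippet (just the numbers, not anything MLlib actually runs):

    // Rough size of the Gramian for n = 8000 columns of 8-byte doubles.
    val n = 8000L
    val denseBytes  = n * n * 8L             // 512,000,000 bytes, ~512 MB fully dense
    val packedBytes = n * (n + 1) / 2 * 8L   // ~256 MB storing only one triangle
    println(f"dense: ${denseBytes / 1e6}%.0f MB, packed: ${packedBytes / 1e6}%.0f MB")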
I think the answer is that this isn't meant for computing big PCA at the moment.

On Mon, Oct 13, 2014 at 8:10 PM, Yang <teddyyyy...@gmail.com> wrote:
> I got this error when trying to perform PCA on a sparse matrix. Each row has
> a nominal length of 8000, and there are 36k rows; each row has on average 3
> non-zero elements. I guess the total size is not that big.
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2271)
>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>     at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>     at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
>     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>     at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
>     at org.apache.spark.mllib.rdd.RDDFunctions.treeAggregate(RDDFunctions.scala:100)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:112)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:314)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:349)
>     at SimpleApp$.main(SimpleApp.scala:50)
>     at SimpleApp.main(SimpleApp.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> The command I used was:
>
>     ~/tools/spark-1.1.0-bin-hadoop2.4/bin/spark-submit --executor-memory 1G \
>       --driver-memory 1g --conf spark.executor.memory=1G --class SimpleApp \
>       --master spark://VirtualBox:7077 \
>       target/scala-test-1.0-SNAPSHOT-jar-with-dependencies.jar
>
> I tried using 2G or 3G, but my VirtualBox VM crashed.
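For reference, here's a minimal sketch of the kind of driver program that hits this code path. The data, constants, and row construction below are made up for illustration, since the original SimpleApp.scala isn't shown:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sparse PCA"))

        // Hypothetical data: 36k rows, 8000 columns, ~3 non-zeros per row.
        val rows = sc.parallelize(0 until 36000).map { i =>
          val indices = Array(i % 8000, (i * 7) % 8000, (i * 13) % 8000).distinct.sorted
          Vectors.sparse(8000, indices, indices.map(_ => 1.0))
        }

        // computePrincipalComponents goes through computeCovariance /
        // computeGramianMatrix, which materializes the dense 8000 x 8000
        // Gramian, so the sparsity of the input rows doesn't help here.
        val mat = new RowMatrix(rows)
        val pc = mat.computePrincipalComponents(10)
        println(pc.numRows + " x " + pc.numCols)
        sc.stop()
      }
    }

With only 1g of driver memory, the ~256MB aggregated Gramian plus serialization buffers and other overhead is plausibly enough to produce the OutOfMemoryError shown in the stack trace above.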