The Gramian is 8000 x 8000, dense, and full of 8-byte doubles. It's
symmetric, so it can get away with storing it packed in ~256MB. The catch
is that Spark is going to send around copies of this ~256MB array during
the aggregation. You may easily be running your driver out of memory given
all the overheads and copies, or your executors, of which there are
probably 2 by default splitting 1GB.
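
Back-of-the-envelope arithmetic (a sketch in Scala, assuming packed
upper-triangular storage for the symmetric half):

  val n = 8000L
  val packedBytes = n * (n + 1) / 2 * 8  // ~256 MB per serialized copy
  val denseBytes  = n * n * 8            // ~512 MB if fully materialized

Every copy that gets serialized on the driver or deserialized on an
executor needs roughly that much contiguous heap on top of everything else.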

I think the answer is that this isn't meant for computing PCA on matrices this big at the moment.
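
For context, this is roughly the call path from the stack trace below,
sketched against the Spark 1.1 RowMatrix API (the input and names here are
made up for illustration):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // e.g. in spark-shell, where sc is the SparkContext;
  // one illustrative sparse row of nominal length 8000 with ~3 non-zeros
  val rows = sc.parallelize(Seq(
    Vectors.sparse(8000, Array(1, 500, 6000), Array(1.0, 2.0, 3.0))
  ))
  val mat = new RowMatrix(rows)
  // computePrincipalComponents calls computeCovariance, which calls
  // computeGramianMatrix; that treeAggregate carries the packed 8000 x 8000
  // Gramian in its closure/result, which is what blows up during serialization.
  val pc = mat.computePrincipalComponents(10)

Sparsity of the input rows doesn't help here, because the Gramian itself is
materialized densely.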


On Mon, Oct 13, 2014 at 8:10 PM, Yang <teddyyyy...@gmail.com> wrote:
> I got this error when trying to perform PCA on a sparse matrix. Each row
> has a nominal length of 8000, and there are 36k rows; each row has on
> average 3 non-zero elements.
> I guess the total size is not that big.
>
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
> at org.apache.spark.mllib.rdd.RDDFunctions.treeAggregate(RDDFunctions.scala:100)
> at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:112)
> at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:314)
> at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:349)
> at SimpleApp$.main(SimpleApp.scala:50)
> at SimpleApp.main(SimpleApp.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
> The command I used was:
>  ~/tools/spark-1.1.0-bin-hadoop2.4/bin/spark-submit --executor-memory 1G \
>    --driver-memory 1g --conf spark.executor.memory=1G --class SimpleApp \
>    --master spark://VirtualBox:7077 \
>    target/scala-test-1.0-SNAPSHOT-jar-with-dependencies.jar
>
>
> I tried using 2G or 3G, but then my VirtualBox VM crashed.
>
