Did you cache `features`? Without caching it is slow, because the input is
recomputed on each of the O(k) iterations. The storage requirement on the
driver is about 2 * n * k doubles = 2 * 3 million * 200 * 8 bytes ~= 9 GB,
not counting any overhead.
Computing U is also an expensive task in your case. A randomized SVD
implementation would suit your data better, but MLlib does not have one
yet. I would recommend setting --driver-memory 25g, caching `features`,
and using a smaller k.
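
A minimal sketch of those suggestions applied to the code quoted below (the
25g value matches the advice above; k = 50 is only an illustrative smaller
value, not a tuned number):

  // Launch the shell with more driver memory, e.g.:
  //   spark-shell --executor-memory 15G --driver-memory 25g
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.mllib.util.MLUtils

  val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)
  val features = data.map(point => point.features)
  features.cache() // keep the input in memory across the O(k) iterations

  val mat = new RowMatrix(features)
  // a smaller k: driver-side storage grows as roughly 2 * n * k doubles
  val svd = mat.computeSVD(50, computeU = true)

-Xiangrui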

On Thu, Sep 18, 2014 at 1:02 PM, Glitch <atremb...@datacratic.com> wrote:
> I have a sparse matrix of about 2 million+ rows and 3 million+ columns in
> svm format*. As I understand it, running SVD on such a matrix shouldn't be
> a problem since version 1.1.
>
> I'm using 10 worker nodes on EC2, each with 30G of RAM (r3.xlarge). I was
> able to compute the SVD for 20 singular values, but it fails with a Java
> heap space error for 200 singular values. I'm currently trying 100.
>
> So my question is this: what kind of cluster do you need to perform this
> task?
> As I do not have much experience with Spark, I can't say whether this is
> normal: my test for 100 singular values has been running for over an hour.
>
> I'm using this dataset
> http://archive.ics.uci.edu/ml/datasets/URL+Reputation
>
> I'm using the spark-shell with --executor-memory 15G --driver-memory 15G
>
>
> And the few lines of code are:
>
>   import org.apache.spark.mllib.linalg.distributed.RowMatrix
>   import org.apache.spark.mllib.util.MLUtils
>
>   val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)
>   val features = data.map(line => line.features)
>   val mat = new RowMatrix(features)
>   val svd = mat.computeSVD(200, computeU = true)
>
>
> * svm format: <label> <column number>:value, e.g. 0 5:1 42:0.5
