I have a sparse matrix of about 2 million+ rows by 3 million+ columns, stored in
svm format*. As I understand it, running SVD on such a matrix shouldn't be a
problem since Spark 1.1.

I'm using 10 worker nodes on EC2, each with 30G of RAM (r3.xlarge). I was
able to compute the SVD for 20 singular values, but it fails with a Java
heap space error for 200 singular values. I'm currently trying 100.

So my question is: what kind of cluster is needed to perform this task?
As I don't have much experience with Spark, I can't say whether this is
normal: my test with 100 singular values has been running for over an hour.

I'm using this dataset
http://archive.ics.uci.edu/ml/datasets/URL+Reputation

I'm using the spark-shell with --executor-memory 15G --driver-memory 15G


And the few lines of code are

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils

// Load the sparse dataset in svm format, declaring the number of columns
val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)
// Drop the labels and keep only the sparse feature vectors
val features = data.map(point => point.features)
// Distributed row matrix over the feature vectors
val mat = new RowMatrix(features)
// Top 200 singular values/vectors, also materializing U
val svd = mat.computeSVD(200, computeU = true)
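
For reference, once computeSVD returns I'd read the factors roughly like this
(just a sketch following on from the code above; the println at the end is only
illustrative):

// U: the left singular vectors as a distributed RowMatrix (present because computeU = true)
val U = svd.U
// s: the top singular values, in descending order, as a local vector
val s = svd.s
// V: the right singular vectors as a local matrix, one row per column of the input
val V = svd.V
println(s.toArray.mkString(", "))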


* svm format: <label> <column number>:value
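
For example, a single row looks something like this (the values here are made
up, not taken from the actual dataset):

1 3:0.5 17:1.0 3231960:0.25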


