Hi All,

I hit performance issues running PCA on a matrix with a large number of features (2,504 samples x 15,000 features):
```scala
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors

val sampleCnt = 2504
val featureCnt = 15000

val gen = sc.parallelize(
  (1 to sampleCnt).map { r =>
    val rnd = new scala.util.Random()
    Vectors.dense((1 to featureCnt).map(k => rnd.nextInt(2).toDouble).toArray)
  }
)
val rowMat = new RowMatrix(gen)
val pc: Matrix = rowMat.computePrincipalComponents(10)
```

I'm running the above code on a standalone Spark cluster of 4 nodes with 128 cores in total. From what I observed, there is a final stage of the algorithm that is executed on the driver using a single thread, and it seems to be the bottleneck here - is there any way of tuning it? It takes ages (actually I was forced to kill it after 30 min or so), whereas the same computation written in R executes in ~6.5 minutes on my laptop (single-threaded):

```
> a <- replicate(2504, rnorm(5000))
> nrow(a)
[1] 5000
> ncol(a)
[1] 2504
> system.time(b <- prcomp(a))
   user  system elapsed
190.284   0.392 191.150
> a <- replicate(2504, rnorm(15000))
> system.time(b <- prcomp(a))
   user  system elapsed
386.520   0.384 386.933
```

I've compiled Spark with support for native matrix libs using the -Pnetlib-lgpl switch. Has anyone experienced such problems with the MLlib version of PCA?

Thanks,
Marek
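P.S. As I understand it, computePrincipalComponents builds the full featureCnt x featureCnt Gramian/covariance matrix and eigendecomposes it locally on the driver, which would explain the single-threaded stage. One workaround I'm considering (a sketch only, not yet tested at full scale - it assumes a local-mode SparkContext and reduced dimensions for illustration, and note that for a true PCA the columns would need to be mean-centered first) is asking RowMatrix.computeSVD for just the top-k singular vectors instead:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object SvdSketch {
  def main(args: Array[String]): Unit = {
    // local[*] and small dimensions here purely for illustration
    val sc = new SparkContext("local[*]", "svd-sketch")
    val sampleCnt = 100
    val featureCnt = 50
    val rnd = new scala.util.Random(42)
    val rows = sc.parallelize(
      (1 to sampleCnt).map(_ =>
        Vectors.dense(Array.fill(featureCnt)(rnd.nextInt(2).toDouble)))
    )
    val mat = new RowMatrix(rows)
    // Ask only for the top 10 right singular vectors; V is featureCnt x 10.
    // (Without centering this is an SVD of the raw data, not PCA proper.)
    val svd = mat.computeSVD(10, computeU = false)
    println(s"V: ${svd.V.numRows} x ${svd.V.numCols}")
    sc.stop()
  }
}
```

If anyone knows whether computeSVD avoids the driver-side bottleneck for a matrix of this shape, I'd appreciate a pointer.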