Hi, Which data structure are you using to train the model? If you haven't tried yet, you should consider the SparseVector
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann <daniel.siegm...@teamaol.com > wrote: > I recently tried to a model using > org.apache.spark.ml.classification.LogisticRegression on a data set where > the feature vector size was around ~20 million. It did *not* go well. It > took around 10 hours to train on a substantial cluster. Additionally, it > pulled a lot data back to the driver - I eventually set --conf > spark.driver.memory=128g --conf spark.driver.maxResultSize=112g when > submitting. > > Attempting the same application on the same cluster with the feature > vector size reduced to 100k took only ~ 9 minutes. Clearly there is an > issue with scaling to large numbers of features. I'm not doing anything > fancy in my app, here's the relevant code: > > val lr = new LogisticRegression().setRegParam(1) > val model = lr.fit(trainingSet.toDF()) > > In comparison, a coworker trained a logistic regression model on her > *laptop* using the Java library liblinear in just a few minutes. That's > with the ~20 million-sized feature vectors. This suggests to me there is > some issue with Spark ML's implementation of logistic regression which is > limiting its scalability. > > Note that my feature vectors are *very* sparse. The maximum feature is > around 20 million, but I think there are only 10's of thousands of features. > > Has anyone run into this? Any idea where the bottleneck is or how this > problem might be solved? > > One solution of course is to implement some dimensionality reduction. I'd > really like to avoid this, as it's just another thing to deal with - not so > hard to put it into the trainer, but then anything doing scoring will need > the same logic. Unless Spark ML supports this out of the box? An easy way > to save / load a model along with the dimensionality reduction logic so > when transform is called on the model it will handle the dimensionality > reduction transparently? > > Any advice would be appreciated. > > ~Daniel Siegmann >