Yes, it is a SparseVector. Most rows only have a few features, and all the rows together only have tens of thousands of distinct features, but the vector size is ~20 million because that is the largest feature index.
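For illustration, building one such row as a SparseVector might look something like this (the size, indices, and values below are just hypothetical placeholders):

import org.apache.spark.mllib.linalg.Vectors

// ~20 million possible dimensions, but only the handful of non-zero
// (index, value) pairs is actually stored.
val numFeatures = 20000000                   // hypothetical vector size
val indices = Array(3, 1024, 19999999)       // hypothetical active feature indices (ascending)
val values  = Array(1.0, 0.5, 2.0)           // hypothetical feature values
val row = Vectors.sparse(numFeatures, indices, values)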
On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones <devin.jo...@columbia.edu> wrote:

> Hi,
>
> Which data structure are you using to train the model? If you haven't
> tried it yet, you should consider the SparseVector:
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector
>
>
> On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann <daniel.siegm...@teamaol.com> wrote:
>
>> I recently tried to train a model using
>> org.apache.spark.ml.classification.LogisticRegression on a data set
>> where the feature vector size was around ~20 million. It did *not* go
>> well. It took around 10 hours to train on a substantial cluster.
>> Additionally, it pulled a lot of data back to the driver - I eventually set
>> --conf spark.driver.memory=128g --conf spark.driver.maxResultSize=112g
>> when submitting.
>>
>> Attempting the same application on the same cluster with the feature
>> vector size reduced to 100k took only ~9 minutes. Clearly there is an
>> issue with scaling to large numbers of features. I'm not doing anything
>> fancy in my app; here's the relevant code:
>>
>> val lr = new LogisticRegression().setRegParam(1)
>> val model = lr.fit(trainingSet.toDF())
>>
>> In comparison, a coworker trained a logistic regression model on her
>> *laptop* using the Java library liblinear in just a few minutes. That's
>> with the same ~20 million-sized feature vectors. This suggests to me there is
>> some issue with Spark ML's implementation of logistic regression which is
>> limiting its scalability.
>>
>> Note that my feature vectors are *very* sparse. The maximum feature index is
>> around 20 million, but I think there are only tens of thousands of distinct features.
>>
>> Has anyone run into this? Any idea where the bottleneck is or how this
>> problem might be solved?
>>
>> One solution of course is to implement some dimensionality reduction. I'd
>> really like to avoid this, as it's just another thing to deal with - not so
>> hard to put it into the trainer, but then anything doing scoring will need
>> the same logic. Unless Spark ML supports this out of the box? Is there an easy way
>> to save / load a model along with the dimensionality reduction logic, so
>> that when transform is called on the model it will handle the dimensionality
>> reduction transparently?
>>
>> Any advice would be appreciated.
>>
>> ~Daniel Siegmann
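Regarding the question at the end of the quoted thread: one way to keep the dimensionality reduction bundled with the model is to put both stages into an ML Pipeline and persist the fitted PipelineModel, so that transform applies the reduction transparently. A rough sketch follows - the choice of ChiSqSelector, the feature count, and the save path are assumptions, and saving/loading requires a Spark version where these pipeline stages support persistence:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.ChiSqSelector

// Hypothetical: keep only the top 100k features before the classifier.
val selector = new ChiSqSelector()
  .setNumTopFeatures(100000)
  .setFeaturesCol("features")
  .setOutputCol("selectedFeatures")

val lr = new LogisticRegression()
  .setRegParam(1)
  .setFeaturesCol("selectedFeatures")

val pipeline = new Pipeline().setStages(Array(selector, lr))
val model = pipeline.fit(trainingSet.toDF())   // trainingSet as in the quoted snippet

// Save and later reload the whole fitted pipeline; the loaded model's
// transform() runs the feature selection and the logistic regression together.
model.save("/path/to/model-dir")               // hypothetical path
val reloaded = PipelineModel.load("/path/to/model-dir")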