Just for the heck of it I tried the old MLlib implementation, but it had the same scalability problem.
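For reference, this is roughly what that attempt looked like - a minimal sketch rather than my exact code (trainingSet is the same RDD[LabeledPoint] I was already building, and the intercept / regularization settings here are illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// trainingSet: RDD[LabeledPoint] whose feature vectors are SparseVectors of size ~20 million
def trainWithOldMllib(trainingSet: RDD[LabeledPoint]) = {
  val lr = new LogisticRegressionWithLBFGS().setIntercept(true)
  lr.optimizer.setRegParam(1.0) // same regularization as the spark.ml version quoted below
  lr.run(trainingSet)           // takes the RDD directly - no toDF() involved
}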
Anyone familiar with the logistic regression implementation who could weigh in?

On Mon, Mar 7, 2016 at 5:35 PM, Michał Zieliński <zielinski.mich...@gmail.com> wrote:

> We're using SparseVector columns in a DataFrame, so they are definitely supported. But maybe for LR some implicit magic is happening inside.
>
> On 7 March 2016 at 23:04, Devin Jones <devin.jo...@columbia.edu> wrote:
>
>> I could be wrong, but it's possible that toDF populates a DataFrame, which I understand does not support SparseVectors at the moment.
>>
>> If you use the MLlib logistic regression implementation (not ml) you can pass the RDD[LabeledPoint] data type directly to the learner.
>>
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>>
>> Only downside is that you can't use the pipeline framework from spark ml.
>>
>> Cheers,
>> Devin
>>
>> On Mon, Mar 7, 2016 at 4:54 PM, Daniel Siegmann <daniel.siegm...@teamaol.com> wrote:
>>
>>> Yes, it is a SparseVector. Most rows only have a few features, and all the rows together only have tens of thousands of features, but the vector size is ~20 million because that is the largest feature index.
>>>
>>> On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones <devin.jo...@columbia.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> Which data structure are you using to train the model? If you haven't tried it yet, you should consider the SparseVector:
>>>>
>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector
>>>>
>>>> On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann <daniel.siegm...@teamaol.com> wrote:
>>>>
>>>>> I recently tried to train a model using org.apache.spark.ml.classification.LogisticRegression on a data set where the feature vector size was around ~20 million. It did *not* go well. It took around 10 hours to train on a substantial cluster. Additionally, it pulled a lot of data back to the driver - I eventually set --conf spark.driver.memory=128g --conf spark.driver.maxResultSize=112g when submitting.
>>>>>
>>>>> Attempting the same application on the same cluster with the feature vector size reduced to 100k took only ~9 minutes. Clearly there is an issue with scaling to large numbers of features. I'm not doing anything fancy in my app; here's the relevant code:
>>>>>
>>>>> val lr = new LogisticRegression().setRegParam(1)
>>>>> val model = lr.fit(trainingSet.toDF())
>>>>>
>>>>> In comparison, a coworker trained a logistic regression model on her *laptop* using the Java library liblinear in just a few minutes. That's with the ~20 million-sized feature vectors. This suggests to me there is some issue with Spark ML's implementation of logistic regression which is limiting its scalability.
>>>>>
>>>>> Note that my feature vectors are *very* sparse. The maximum feature index is around 20 million, but I think there are only tens of thousands of distinct features.
>>>>>
>>>>> Has anyone run into this? Any idea where the bottleneck is or how this problem might be solved?
>>>>>
>>>>> One solution of course is to implement some dimensionality reduction. I'd really like to avoid this, as it's just another thing to deal with - not so hard to put it into the trainer, but then anything doing scoring will need the same logic. Unless Spark ML supports this out of the box? Is there an easy way to save / load a model along with the dimensionality reduction logic, so when transform is called on the model it will handle the dimensionality reduction transparently?
>>>>>
>>>>> Any advice would be appreciated.
>>>>>
>>>>> ~Daniel Siegmann
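P.S. To make the dimensionality-reduction question at the end of the quoted thread more concrete, here is the kind of thing I had in mind - a rough sketch I have not actually run: bundle a feature selector with the logistic regression in a single spark.ml Pipeline, so the fitted PipelineModel carries the reduction logic through save / load and applies it transparently in transform(). ChiSqSelector is just one possible selector here; the column names, numTopFeatures value, and save path are placeholders, and this assumes a Spark version whose ML persistence covers these stages.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.ChiSqSelector

// trainingDF has "label" and sparse "features" columns, e.g. trainingSet.toDF()
val selector = new ChiSqSelector()
  .setNumTopFeatures(100000)           // placeholder - how far to reduce
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val lr = new LogisticRegression()
  .setRegParam(1.0)
  .setFeaturesCol("selectedFeatures")  // train on the reduced vectors

val pipeline = new Pipeline().setStages(Array(selector, lr))
val model = pipeline.fit(trainingDF)   // PipelineModel containing both stages

// Saving / loading the PipelineModel keeps the selector, so scoring code
// only calls model.transform(newData) and never sees the reduction step.
model.write.overwrite().save("/path/to/model")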