I'm attempting to run a logistic regression on a small data set: about 350 documents, 30 features.
I am using this toy data set for two reasons: 1) to confirm that my Mahout vector representation is sensible; 2) to confirm that Mahout logistic regression provides sensible results. My end goal is to run the same procedure on a very large data set, potentially billions of documents.

I began my investigation with OnlineLogisticRegression. The results were poor (described in greater detail below), so I then moved to AdaptiveLogisticRegression, which also gave poor results. For validation, I am comparing the Mahout results to those obtained using R glm (family=binomial). (Note: I previously validated the R results with other methods, and they agree on what is reasonable.)

Because I have so few documents, I run the set of documents through train() in epochs, up to 1000 times, shuffling the order of the documents on each epoch.

The Mahout results are poor. Mahout does a reasonable job of identifying the features with positive weights (the top third of the features when ranked by weight). However, it does a very poor job of assigning weights to the features in the middle third and bottom third of the weight rankings.

My questions:
1) Are these results surprising to you? Or should they be expected given the small size of my data set?
2) How might I tweak the OnlineLogisticRegression settings to accommodate my small data set?

Thank you for your feedback.
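
For reference, the epoch procedure described above (many passes over the same small set, reshuffled on each pass) can be sketched in plain Java as follows. This is a stand-in for illustration only, not Mahout code: the class name EpochSgdSketch, the train() signature, and the 1/sqrt(epoch) step-size decay are all assumptions, and OnlineLogisticRegression uses its own learning-rate schedule and prior. It may still be useful for checking whether plain SGD with shuffled epochs moves toward the glm coefficients on the same data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the epoch loop described in the post; not Mahout's API.
final class EpochSgdSketch {

    // Logistic (sigmoid) function.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Trains binary logistic regression by SGD over shuffled epochs.
    // Each row x[i] is assumed to carry a constant 1.0 at index 0 (the intercept).
    // The step-size decay below is an assumption chosen for this sketch.
    static double[] train(double[][] x, int[] y, int epochs,
                          double learningRate, double l2, long seed) {
        int n = x.length;
        int d = x[0].length;
        double[] w = new double[d];

        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Random rng = new Random(seed);

        for (int epoch = 0; epoch < epochs; epoch++) {
            // New document order on every pass over the data.
            Collections.shuffle(order, rng);
            double rate = learningRate / Math.sqrt(1.0 + epoch);

            for (int i : order) {
                double z = 0.0;
                for (int j = 0; j < d; j++) {
                    z += w[j] * x[i][j];
                }
                double err = y[i] - sigmoid(z);
                for (int j = 0; j < d; j++) {
                    // The L2 penalty is not applied to the intercept term.
                    double penalty = (j == 0) ? 0.0 : l2 * w[j];
                    w[j] += rate * (err * x[i][j] - penalty);
                }
            }
        }
        return w;
    }
}
```

With l2 set to 0.0 and enough epochs, the weights from a loop like this should drift toward the unregularized maximum-likelihood fit, which is what glm reports. A large remaining gap on the middle and bottom features would then point at regularization or the learning-rate schedule rather than at the vector encoding.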
