I'm attempting to run a logistic regression on a small data set: about 350
documents, 30 features.

I am using this toy data set for two reasons:
1) Confirm that my Mahout vector representation is sensible;
2) Confirm that Mahout logistic regression provides sensible results.

My end goal is to run the same procedure on a very large data set:
potentially billions of documents.

I began my investigation with OnlineLogisticRegression. The results were
poor (described in greater detail below), and I then stepped over to
AdaptiveLogisticRegression (again, poor results).

For validation, I am comparing the Mahout results to those obtained
using R glm (family=binomial). (Note: I previously validated the R results
with other methods, and I have a consensus on what is reasonable.)

Because I have so few documents, I run the set of documents through train()
in epochs -- up to 1000 times, shuffling the order of the documents on each
epoch.
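
To make the epoch procedure concrete, here is a minimal sketch of it in
plain Python. It is not my Mahout code and does not call any Mahout API:
train_epochs, rate, and seed are names made up for illustration, and the
update is ordinary SGD with a fixed learning rate, so it does not reproduce
whatever regularization or learning-rate schedule Mahout applies.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_epochs(rows, labels, epochs=1000, rate=0.1, seed=0):
    """Plain SGD logistic regression; rows are feature lists, labels are 0/1.

    All parameter names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # new document order on every epoch
        for i in order:
            x, y = rows[i], labels[i]
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = y - p  # gradient of the log-likelihood w.r.t. the logit
            bias += rate * err
            for j, v in enumerate(x):
                weights[j] += rate * err * v
    return bias, weights
```

Calling train_epochs(rows, labels, epochs=1000) corresponds to the "up to
1000 times" above: each epoch is one full pass over the documents in a
freshly shuffled order.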

The Mahout results are poor. Mahout does a reasonable job of identifying the
features with positive weights (the top third of the features). However, it
does a very poor job of assigning weights to the features in the middle third
and bottom third of the weight rankings.
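
One way to put a number on this kind of comparison is sketched below. It
ranks features by a reference set of weights (for example, the R glm
coefficients), splits the ranking into thirds, and reports the mean absolute
difference from a second set of weights within each third. The function
name, the dict-based input, and the choice of mean absolute difference are
all assumptions for illustration, not the exact measure behind the
description above.

```python
def compare_by_thirds(reference, candidate):
    """Mean absolute weight difference per third of the reference ranking.

    reference, candidate: dicts mapping feature name -> weight
    (hypothetical input format chosen for this sketch).
    """
    # Rank features from largest to smallest reference weight.
    names = sorted(reference, key=reference.get, reverse=True)
    n = len(names)
    cuts = [0, n // 3, 2 * n // 3, n]
    result = {}
    for label, lo, hi in zip(("top", "middle", "bottom"), cuts, cuts[1:]):
        diffs = [abs(reference[f] - candidate[f]) for f in names[lo:hi]]
        result[label] = sum(diffs) / len(diffs) if diffs else float("nan")
    return result
```

With 30 features, each third holds 10 features; a small "top" value next to
large "middle" and "bottom" values would match the pattern described above.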

My questions:
1) Are these results surprising to you? Or, should they be expected given
the small size of my data set?
2) How might I tweak the OnlineLogisticRegression settings to accommodate my
small data set?



Thank you for your feedback.
