Try looking into Stochastic Gradient Descent (SGD). You could use AdaptiveLogisticRegression to train multiple models simultaneously, then run your tests with the best model it produces.
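
To make the idea concrete, here is a rough sketch in plain Python (not Mahout code) of a one-vs-rest logistic regression trained with SGD on bag-of-words counts, followed by a look at the highest-weighted terms for one label. Everything in it is an assumption for illustration: the function names `train_one_vs_rest_sgd` and `top_terms`, the toy documents, and the learning rate and regularization values. AdaptiveLogisticRegression additionally searches over such hyperparameters, which this sketch does not attempt.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would use a proper analyzer.
    return text.lower().split()


def train_one_vs_rest_sgd(docs, target_label, epochs=50,
                          learning_rate=0.1, l2=0.001, seed=0):
    """Train a binary logistic regression (target_label vs. everything else).

    docs is a list of (label, text) pairs. Returns a dict term -> weight.
    """
    rng = random.Random(seed)
    examples = [(1.0 if label == target_label else 0.0, Counter(tokenize(text)))
                for label, text in docs]
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for y, counts in examples:
            score = bias + sum(weights.get(t, 0.0) * c for t, c in counts.items())
            score = max(-30.0, min(30.0, score))  # avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-score))
            error = y - p
            bias += learning_rate * error
            # Gradient step with L2 shrinkage on the terms present in this document.
            for t, c in counts.items():
                w = weights.get(t, 0.0)
                weights[t] = w + learning_rate * (error * c - l2 * w)
    return weights


def top_terms(weights, n=3):
    # Terms with the largest positive weights are the candidates for the label.
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Toy data, invented for this example.
docs = [
    ("dogs", "the canine chewed a bone"),
    ("dogs", "a canine buried the bone"),
    ("cats", "the feline chased a mouse"),
    ("cats", "a feline caught the mouse"),
]

weights = train_one_vs_rest_sgd(docs, "dogs")
print(top_terms(weights, 2))
```

Words that occur in every class, such as "the", get pushed up by some documents and down by others, so they end up below the terms that occur only in the target class. That is what makes the weights usable as a term ranking.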
________________________________
From: Yuval Feinstein <[email protected]>
To: [email protected]
Sent: Monday, November 14, 2011 2:11 AM
Subject: Terminology Extraction

Hi all.

I am trying to use Mahout for terminology extraction: I have ~140 classes, each of which contains ~100 text documents. The class categories are distinct but may overlap a bit. I want to extract terms related to the label. For example, if I have a "dogs" category, the terms "canine", "German Shepherd", "bone" may be related to the category.

What I have come up with in the meantime was:
1. Learn a classifier using Mahout.
2. Look at term weights for the classifier - terms with high weights are likely to represent the category.

I currently only use Naive Bayes, with ng=1.

My questions are:
a. Is this a good setting for the problem at hand? Or does Mahout have a better algorithm for this?
b. Which Mahout classifier is best for this? I chose Naive Bayes first because its parameters have a simple interpretation. Which other (stronger) classifiers also have this property?

TIA,
Yuval
