Also look at the ModelDissector class. All of the hashed vector encoders let you pass in a so-called trace dictionary, which records which terms end up in which vector locations. You can then explain the model weights using the ModelDissector. Significantly, you can (with a bit of extra work) pass the ModelDissector the internal state of the classifier *after* it has been multiplied by your input. That will tell you which features contributed to the particular classification of the current document.
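To make the trace dictionary idea concrete, here is a minimal, self-contained sketch in plain Java. It deliberately does not use Mahout's classes: TraceSketch, encode, explain and topTerms are hypothetical names, and the hashing scheme (Java's String.hashCode with a probe suffix) is only an assumption for illustration, not how Mahout's encoders hash.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for a hashed encoder plus a weight "dissector"; not Mahout's API. */
final class TraceSketch {

    /** Hashes a term into `probes` slots of vec and records those slots in the trace dictionary. */
    static void encode(String term, int probes, double[] vec, Map<String, Set<Integer>> trace) {
        for (int i = 0; i < probes; i++) {
            int slot = Math.floorMod((term + "#" + i).hashCode(), vec.length);
            vec[slot] += 1.0;
            trace.computeIfAbsent(term, k -> new TreeSet<>()).add(slot);
        }
    }

    /**
     * Attributes weights[slot] * input[slot] back to each traced term.
     * Terms that collide in a slot each receive that slot's full product.
     */
    static Map<String, Double> explain(double[] weights, double[] input,
                                       Map<String, Set<Integer>> trace) {
        Map<String, Double> contributions = new HashMap<>();
        for (Map.Entry<String, Set<Integer>> e : trace.entrySet()) {
            double sum = 0.0;
            for (int slot : e.getValue()) {
                sum += weights[slot] * input[slot];
            }
            contributions.put(e.getKey(), sum);
        }
        return contributions;
    }

    /** Returns the n terms with the largest absolute contribution, largest first. */
    static List<String> topTerms(Map<String, Double> contributions, int n) {
        List<String> terms = new ArrayList<>(contributions.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> -Math.abs(contributions.get(t))));
        return terms.subList(0, Math.min(n, terms.size()));
    }
}
```

For a linear model, calling explain with one category's weight vector and a document's encoded vector gives per-term contributions for that document, while passing an input of all ones gives the raw weight attributed to each term.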
Tracing this way will be a lot slower than normal classification, mostly due to the overhead of tracing the hashed feature encoding, but it can be made to work.

On Sun, Nov 13, 2011 at 11:57 PM, Suneel Marthi <[email protected]> wrote:

> Try looking into Stochastic Gradient Descent (SGD), you could use
> AdaptiveLogisticRegression to simultaneously create multiple training
> models and try running your tests with the best model as spewed out by
> AdaptiveLogisticRegression.
>
> ________________________________
> From: Yuval Feinstein <[email protected]>
> To: [email protected]
> Sent: Monday, November 14, 2011 2:11 AM
> Subject: Terminology Extraction
>
> Hi all.
> I am trying to use Mahout for terminology extraction:
> I have ~140 classes, each of which contains ~100 text documents.
> The class categories are distinct but may overlap a bit.
> I want to extract terms related to the label, for example if I have a
> "dogs" category,
> the terms "canine", "German Shepherd", "bone" may be related to the
> category.
> What I have come up with in the meantime was:
> 1. Learn a classifier using Mahout.
> 2. Look at term weights for the classifier - terms with high weights are
> suspect as representing the category.
> I currently only use Naive Bayes, with ng=1.
> My questions are:
> a. Is this a good setting for the problem at hand? Or does Mahout have a
> better algorithm for this?
> b. Which Mahout classifier is best for this? I chose Naive Bayes first
> because its parameters have a simple interpretation.
> Which other (stronger) classifiers also have this property?
> TIA,
> Yuval
