Thanks Suneel and Ted.
@Ted - IIUC, the ModelDissector explains the classification *of an individual document*. I only need general terms - i.e., word i is a strong predictor of label j. Therefore, to get aggregated results, it sounds like I have to run the ModelDissector on all the documents and then average. Please tell me whether this is correct or whether I missed the point.
Cheers,
Yuval
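For what it's worth, the averaging step I have in mind could look something like the following sketch in Python. The per-document dicts here are a hypothetical stand-in for whatever per-feature contributions a ModelDissector-style trace would yield; only the aggregation logic is the point:

```python
from collections import defaultdict

def aggregate_contributions(per_doc_contributions):
    """Average per-document feature contributions into one score per term.

    per_doc_contributions: list of dicts, one per document, each mapping
    term -> that term's contribution toward the document's label
    (a hypothetical stand-in for a per-document dissection result).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for contrib in per_doc_contributions:
        for term, weight in contrib.items():
            totals[term] += weight
            counts[term] += 1
    # Average over only the documents in which the term actually appeared.
    return {term: totals[term] / counts[term] for term in totals}

# Toy example: three documents from a "dogs" category.
docs = [
    {"canine": 2.0, "bone": 1.0},
    {"canine": 1.0, "leash": 0.5},
    {"bone": 3.0},
]
scores = aggregate_contributions(docs)
top = sorted(scores, key=scores.get, reverse=True)
```

Averaging only over documents containing the term (rather than over all documents) is a design choice; dividing by the total document count instead would favor frequent terms.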
On Mon, Nov 14, 2011 at 5:33 PM, Ted Dunning <[email protected]> wrote:
> Look also at the ModelDissector class.
>
> The idea is that all of the hashed vector encoders allow you to pass in a
> so-called trace dictionary. This records which terms are in which
> locations. Then you can explain model weightings using the ModelDissector.
> Significantly, you can (with a bit of extra work) pass the ModelDissector
> the internal state of the classifier *after* multiplying by your input.
> That will tell you which features contributed to the particular
> classification the current document has.
>
> This will be a lot slower than normal classification, mostly due to the
> overhead of tracing the hashed feature encoding, but it can be made to
> work.
>
> On Sun, Nov 13, 2011 at 11:57 PM, Suneel Marthi <[email protected]> wrote:
>
> > Try looking into Stochastic Gradient Descent (SGD); you could use
> > AdaptiveLogisticRegression to simultaneously create multiple training
> > models and try running your tests with the best model as spewed out by
> > AdaptiveLogisticRegression.
> >
> > ________________________________
> > From: Yuval Feinstein <[email protected]>
> > To: [email protected]
> > Sent: Monday, November 14, 2011 2:11 AM
> > Subject: Terminology Extraction
> >
> > Hi all.
> > I am trying to use Mahout for terminology extraction:
> > I have ~140 classes, each of which contains ~100 text documents.
> > The class categories are distinct but may overlap a bit.
> > I want to extract terms related to the label; for example, if I have a
> > "dogs" category, the terms "canine", "German Shepherd", and "bone" may
> > be related to the category.
> > What I have come up with in the meantime is:
> > 1. Learn a classifier using Mahout.
> > 2. Look at term weights for the classifier - terms with high weights
> > are suspect as representing the category.
> > I currently only use Naive Bayes, with ng=1.
> > My questions are:
> > a. Is this a good setting for the problem at hand? Or does Mahout have
> > a better algorithm for this?
> > b. Which Mahout classifier is best for this? I chose Naive Bayes first
> > because its parameters have a simple interpretation.
> > Which other (stronger) classifiers also have this property?
> > TIA,
> > Yuval
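As a side note, the Naive Bayes intuition in the original question (high-weight terms represent the category) can be made concrete without a classifier at all, by scoring each term with a smoothed log-likelihood ratio between the target class and all other classes. This is a generic sketch in Python, not Mahout's implementation; the corpus and function names are illustrative:

```python
import math
from collections import Counter

def class_term_scores(docs_by_label, label):
    """Score each term by log P(term | label) - log P(term | other labels),
    with add-one smoothing. High scores suggest category-defining terms.

    docs_by_label: dict mapping label -> list of token lists.
    """
    in_class, out_class = Counter(), Counter()
    for lab, docs in docs_by_label.items():
        target = in_class if lab == label else out_class
        for doc in docs:
            target.update(doc)
    vocab = set(in_class) | set(out_class)
    # Add-one smoothing: every vocabulary term gets one pseudo-count.
    n_in = sum(in_class.values()) + len(vocab)
    n_out = sum(out_class.values()) + len(vocab)
    return {
        t: math.log((in_class[t] + 1) / n_in)
           - math.log((out_class[t] + 1) / n_out)
        for t in vocab
    }

# Toy corpus with two categories.
corpus = {
    "dogs": [["canine", "bone", "bark"], ["canine", "leash"]],
    "cats": [["feline", "whiskers"], ["feline", "purr", "bone"]],
}
scores = class_term_scores(corpus, "dogs")
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

Here "canine" scores highest for "dogs" because it never occurs in "cats", while "bone" scores near zero because it appears in both categories; that discriminative (rather than merely frequent) behavior is what makes a term a candidate category descriptor.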
