Actually, the ModelDissector is intended to explain the model in general as
you would like.  I was jumping through hoops in my explanation to show how
it could be used for a single document.

For the use you want, you can just encode some moderate number of documents
(not all by any means) using the same trace dictionary.  Then dissect the
model in the standard way.

This will not surface some of the rarer features, but it should give you a
general outline of what the model is doing.
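A minimal sketch of what I mean, assuming you already have a trained SGD
classifier (any AbstractVectorClassifier).  The class and method names are
from Mahout's classifier.sgd and vectorizer.encoders packages; the
whitespace tokenization and the "body" encoder name are just placeholders
for whatever your own pipeline does:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

import org.apache.mahout.classifier.AbstractVectorClassifier;
import org.apache.mahout.classifier.sgd.ModelDissector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class DissectExample {
  public static void dissect(AbstractVectorClassifier model,
                             List<String> documents,  // a moderate sample, not the whole corpus
                             int cardinality) {       // size of the hashed feature vectors
    // The trace dictionary records which term hashed into which locations.
    Map<String, Set<Integer>> traceDictionary = new TreeMap<String, Set<Integer>>();
    FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");
    encoder.setTraceDictionary(traceDictionary);

    ModelDissector md = new ModelDissector();
    for (String doc : documents) {
      traceDictionary.clear();
      Vector v = new RandomAccessSparseVector(cardinality);
      for (String token : doc.split("\\s+")) {  // stand-in for real tokenization
        encoder.addToVector(token, v);
      }
      // Accumulate per-term weight evidence from the model.
      md.update(v, traceDictionary, model);
    }

    // Print the most heavily weighted features seen in the sample.
    for (ModelDissector.Weighted w : md.summary(50)) {
      System.out.printf("%s\t%.3f%n", w.getFeature(), w.getWeight());
    }
  }
}
```

This is slow because of the tracing overhead, which is why you only feed it
a sample of documents rather than the full training set.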

On Mon, Nov 14, 2011 at 10:44 PM, Yuval Feinstein <[email protected]> wrote:

> Thanks Sunil and Ted.
> @Ted - IIUC, the ModelDissector explains the classification *of an
> individual document*. I only need general terms - i.e. word i is a strong
> predictor of label j.
> Therefore, to get aggregated results, it sounds like I have to run the
> ModelDissector for all the documents
> and then average.
> Please tell me whether this is correct or whether I have missed the point.
> Cheers,
> Yuval
>
> On Mon, Nov 14, 2011 at 5:33 PM, Ted Dunning <[email protected]>
> wrote:
>
> > Look also at the ModelDissector class.
> >
> > The idea is that all of the hashed vector encoders allow you to pass in a
> > so-called trace dictionary.  This records which terms are in which
> > locations.  Then you can explain model weightings using the
> > ModelDissector.  Significantly, you can (with a bit of extra work) pass
> > the ModelDissector the internal state of the classifier *after*
> > multiplying by your input.  That will tell you which features contributed
> > to the particular classification the current document has.
> >
> > This will be a lot slower than normal classification, mostly due to the
> > overhead of tracing the hashed feature encoding, but it can be made to
> > work.
> >
> > On Sun, Nov 13, 2011 at 11:57 PM, Suneel Marthi <[email protected]>
> > wrote:
> >
> > > Try looking into Stochastic Gradient Descent (SGD). You could use
> > > AdaptiveLogisticRegression to train multiple models simultaneously,
> > > then run your tests with the best model produced by
> > > AdaptiveLogisticRegression.
> > >
> > >
> > >
> > > ________________________________
> > > From: Yuval Feinstein <[email protected]>
> > > To: [email protected]
> > > Sent: Monday, November 14, 2011 2:11 AM
> > > Subject: Terminology Extraction
> > >
> > > Hi all.
> > > I am trying to use Mahout for terminology extraction:
> > > I have ~140 classes, each of which contains ~100 text documents.
> > > The class categories are distinct but may overlap a bit.
> > > I want to extract terms related to the label; for example, if I have a
> > > "dogs" category,
> > > the terms "canine", "German Shepherd", "bone" may be related to the
> > > category.
> > > What I have come up with so far is:
> > > 1. Learn a classifier using Mahout.
> > > 2. Look at term weights for the classifier - terms with high weights
> > > are suspect as representing the category.
> > > I currently only use Naive Bayes, with ng=1.
> > > My questions are:
> > > a. Is this a good setting for the problem at hand? Or does Mahout have
> > > a better algorithm for this?
> > > b. Which Mahout classifier is best for this? I chose Naive Bayes first
> > > because its parameters have a simple interpretation.
> > > Which other (stronger) classifiers also have this property?
> > > TIA,
> > > Yuval
> > >
> >
>