Thanks Suneel and Ted.
@Ted - IIUC, the ModelDissector explains the classification *of an individual document*. I only need general terms - i.e., word i is a strong predictor of label j. Therefore, to get aggregated results, it sounds like I have to run the ModelDissector on all the documents and then average. Please tell me whether this is correct or whether I missed the point.
Cheers,
Yuval
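For what it's worth, the averaging step I have in mind could look something like the following sketch in Python. The per-document dicts here are a hypothetical stand-in for whatever per-feature contributions a ModelDissector-style trace would yield; only the aggregation logic is the point:

```python
from collections import defaultdict

def aggregate_contributions(per_doc_contributions):
    """Average per-document feature contributions into one score per term.

    per_doc_contributions: list of dicts, one per document, each mapping
    term -> that term's contribution toward the document's label
    (a hypothetical stand-in for a per-document dissection result).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for contrib in per_doc_contributions:
        for term, weight in contrib.items():
            totals[term] += weight
            counts[term] += 1
    # Average over only the documents in which the term actually appeared.
    return {term: totals[term] / counts[term] for term in totals}

# Toy example: three documents from a "dogs" category.
docs = [
    {"canine": 2.0, "bone": 1.0},
    {"canine": 1.0, "leash": 0.5},
    {"bone": 3.0},
]
scores = aggregate_contributions(docs)
top = sorted(scores, key=scores.get, reverse=True)
```

Averaging only over documents containing the term (rather than over all documents) is a design choice; dividing by the total document count instead would favor frequent terms.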
On Mon, Nov 14, 2011 at 5:33 PM, Ted Dunning <[email protected]> wrote:
> Look also at the ModelDissector class.
>
> The idea is that all of the hashed vector encoders allow you to pass in a
> so-called trace dictionary. This records which terms are in which
> locations. Then you can explain model weightings using the ModelDissector.
> Significantly, you can (with a bit of extra work) pass the ModelDissector
> the internal state of the classifier *after* multiplying by your input.
> That will tell you which features contributed to the particular
> classification the current document has.
>
> This will be a lot slower than normal classification, mostly due to the
> overhead of tracing the hashed feature encoding, but it can be made to
> work.
>
> On Sun, Nov 13, 2011 at 11:57 PM, Suneel Marthi <[email protected]> wrote:
>
> > Try looking into Stochastic Gradient Descent (SGD); you could use
> > AdaptiveLogisticRegression to simultaneously create multiple training
> > models and try running your tests with the best model as spewed out by
> > AdaptiveLogisticRegression.
> >
> > ________________________________
> > From: Yuval Feinstein <[email protected]>
> > To: [email protected]
> > Sent: Monday, November 14, 2011 2:11 AM
> > Subject: Terminology Extraction
> >
> > Hi all.
> > I am trying to use Mahout for terminology extraction:
> > I have ~140 classes, each of which contains ~100 text documents.
> > The class categories are distinct but may overlap a bit.
> > I want to extract terms related to the label; for example, if I have a
> > "dogs" category, the terms "canine", "German Shepherd", and "bone" may
> > be related to the category.
> > What I have come up with in the meantime is:
> > 1. Learn a classifier using Mahout.
> > 2. Look at term weights for the classifier - terms with high weights
> > are suspect as representing the category.
> > I currently only use Naive Bayes, with ng=1.
> > My questions are:
> > a. Is this a good setting for the problem at hand? Or does Mahout have
> > a better algorithm for this?
> > b. Which Mahout classifier is best for this? I chose Naive Bayes first
> > because its parameters have a simple interpretation.
> > Which other (stronger) classifiers also have this property?
> > TIA,
> > Yuval
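As a side note, the Naive Bayes intuition in the original question (high-weight terms represent the category) can be made concrete without a classifier at all, by scoring each term with a smoothed log-likelihood ratio between the target class and all other classes. This is a generic sketch in Python, not Mahout's implementation; the corpus and function names are illustrative:

```python
import math
from collections import Counter

def class_term_scores(docs_by_label, label):
    """Score each term by log P(term | label) - log P(term | other labels),
    with add-one smoothing. High scores suggest category-defining terms.

    docs_by_label: dict mapping label -> list of token lists.
    """
    in_class, out_class = Counter(), Counter()
    for lab, docs in docs_by_label.items():
        target = in_class if lab == label else out_class
        for doc in docs:
            target.update(doc)
    vocab = set(in_class) | set(out_class)
    # Add-one smoothing: every vocabulary term gets one pseudo-count.
    n_in = sum(in_class.values()) + len(vocab)
    n_out = sum(out_class.values()) + len(vocab)
    return {
        t: math.log((in_class[t] + 1) / n_in)
           - math.log((out_class[t] + 1) / n_out)
        for t in vocab
    }

# Toy corpus with two categories.
corpus = {
    "dogs": [["canine", "bone", "bark"], ["canine", "leash"]],
    "cats": [["feline", "whiskers"], ["feline", "purr", "bone"]],
}
scores = class_term_scores(corpus, "dogs")
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

Here "canine" scores highest for "dogs" because it never occurs in "cats", while "bone" scores near zero because it appears in both categories; that discriminative (rather than merely frequent) behavior is what makes a term a candidate category descriptor.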
