Re: confidence values of one (or more) feature(s)

Ted Dunning Thu, 03 Nov 2011 11:59:07 -0700

I am sorry for being dense, but I don't really understand what you are
trying to do.


As I see it,

- the input is documents

- the output is a category

You want one or more of the following,

- to have the model explain why it classified documents a certain way

or

- to classify non-document phrases a certain way

or

- to have the model show its internal structure to you

or

- something else entirely

Can you say what you want in these terms?

On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <[email protected]> wrote:

> Hi Ted,
>
> thank you for the explanation.
> For example, imagine a term cloud in which terms are presented. Some terms
> are bigger than others because they are more likely than the other terms. I
> would need those results for analysis. We want to compare different
> ML algorithms and methods and/or combinations of them. And first I have to
> gain some basic knowledge about Mahout.
>
> For example, when I take the word 'social' as input, I'd like to get a
> result like this:
>
> social               1.0
> social media         0.8
> social networking    0.65
> social news          0.6
> facebook             0.5
> ...
>
> (ignore those values; they are not correct, but they should show what I need)
>
> The 20Newsgroup example uses the summary(int n) method to show the most
> likely categorisation of a term (--> the most important feature). I would
> like to have a list with the second, third, and so on most important
> features as well. I imagine that, while computing the features, only the
> most important ones are added to the list and the less important features
> are rejected.
>
> Thanks and regards,
> David
>
> 2011/11/3 Ted Dunning <[email protected]>
>
> > There are no confidence values per se in the models computed by Mahout at
> > this time.
> >
> > There are several issues here,
> >
> > 1) Naive Bayes doesn't have such a concept.  'Nuff said there.
> >
> > 2) SGD logistic regression could compute confidence intervals, but I am
> > not quite sure how to do that with stochastic gradient descent.
> >
> > 3) in most uses of Mahout's logistic regression, the issues are data size
> > and feature set size.  Confidence values are typically used for selecting
> > features which is typically not a viable strategy for problems with very
> > large feature sets.  That is what the L1 regularization is all about.
> >
> > 4) with an extremely large number of features, the noise on confidence
> > intervals makes them very hard to understand.
> >
> > 5) with hashed features and feature collisions it is hard enough to
> > understand which feature is doing what, much less what the confidence
> > interval means.
> >
> > Can you say more about your problem?  Is it small enough to use bayesglm
> > in R?
> >
> > On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <[email protected]> wrote:
> >
> > > Me again,
> > >
> > > can someone point me in the right direction? How can I access these
> > > features? I looked into the summary(int n) method located in
> > > org.apache.mahout.classifier.sgd.ModelDissector.java, but somehow I
> > > don't understand how it works.
> > >
> > > Could someone explain to me how it works? As I understand it, it
> > > returns just the max-value of a feature.
> > >
> > > Thanks and regards,
> > > David
> > >
> > > 2011/10/20 David Rahman <[email protected]>
> > >
> > > > Hi,
> > > >
> > > > how can I access the confidence values of one (or more) feature(s)
> > > > with its possibilities?
> > > >
> > > > In the 20Newsgroup example there is the dissect method, which uses
> > > > summary(int n) to return the n most important features with their
> > > > weights. I also want the features which are placed second or third
> > > > (or lower). How can I access those?
> > > >
> > > > Regards,
> > > > David
> > > >
> > >
> >
>
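
The ranking asked about in the thread (not only the single largest weight
of a feature, but also the second, third, and so on) can be sketched as
follows. This is a minimal illustration, assuming the per-category weights
of each feature have already been pulled out of a trained model into a plain
dictionary. It is not Mahout's API: ranked_categories, top_features, and
example_weights are hypothetical names, and the numbers are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical layout (an assumption, not Mahout's data structure):
# weights[feature][category] -> learned weight
Weights = Dict[str, Dict[str, float]]


def ranked_categories(weights: Weights, feature: str) -> List[Tuple[str, float]]:
    """Return every (category, weight) pair for one feature, largest weight first."""
    return sorted(weights[feature].items(), key=lambda kv: kv[1], reverse=True)


def top_features(weights: Weights, n: int) -> List[Tuple[str, str, float]]:
    """Return the n features with the largest single weight.

    Each row is (feature, category holding that weight, weight).
    """
    best = []
    for feature, per_category in weights.items():
        category, weight = max(per_category.items(), key=lambda kv: kv[1])
        best.append((feature, category, weight))
    best.sort(key=lambda row: row[2], reverse=True)
    return best[:n]


# Made-up weights, purely for illustration.
example_weights: Weights = {
    "social": {"media": 0.8, "sports": 0.1, "politics": 0.3},
    "goal": {"media": 0.2, "sports": 0.9, "politics": 0.1},
    "vote": {"media": 0.1, "sports": 0.0, "politics": 0.7},
}

if __name__ == "__main__":
    # Full ranking for one feature, not just its maximum.
    for category, weight in ranked_categories(example_weights, "social"):
        print(f"{category:10s} {weight:.2f}")
    # The n features with the largest single weight.
    for row in top_features(example_weights, 2):
        print(row)
```

Note that these are raw model weights, not confidence values; as said
earlier in the thread, the models do not provide confidence values per se.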
