Here are some hints. https://cwiki.apache.org/MAHOUT/how-to-contribute.html
It is really easy and we would be happy to help.

On Thu, Nov 3, 2011 at 1:48 PM, David Rahman <[email protected]> wrote:

> Never done that before, but I will look into it. As an alternative I could
> send it to your email. But first I have to implement it successfully.
>
> Thanks again and regards,
> David
>
> 2011/11/3 Ted Dunning <[email protected]>
>
> > If you do get to that, could you write up a JIRA and attach a patch?
> >
> > On Thu, Nov 3, 2011 at 1:33 PM, David Rahman <[email protected]> wrote:
> >
> > > Thank you Ted,
> > >
> > > I will test the methods next week, when I'm back in the office, and
> > > let you know how it went.
> > >
> > > Thank you and best regards,
> > > David
> > >
> > > 2011/11/3 Ted Dunning <[email protected]>
> > >
> > > > OK.
> > > >
> > > > So the simplest design in Mahout terms is a binary classifier for
> > > > each keyword (if the keywords are not mutually exclusive). If you
> > > > can define a useful ordering for terms or have some logical
> > > > entailment, you may want to allow the presence of some terms to be
> > > > features for certain other terms.
> > > >
> > > > So the question boils down to how to ask a binary logistic
> > > > regression how it came to its conclusion.
> > > >
> > > > You are correct to look to the model dissector for the function you
> > > > want, but you will have to call it in a slightly unusual way because
> > > > it is really intended to describe a model rather than a single
> > > > decision. The logistic regression classes in Mahout don't actually
> > > > expose quite as much information as you need for this, but if you
> > > > add this method, you should get the basic information you need:
> > > >
> > > > /**
> > > >  * Returns the element-wise product of the feature vector with each
> > > >  * column of the beta matrix. This can then be used to extract the
> > > >  * most interesting features behind a decision for each alternative
> > > >  * output.
> > > >  * @param instance A feature vector
> > > >  * @return A matrix like beta, but with each column multiplied by
> > > >  *         instance.
> > > >  */
> > > > public Matrix explain(Vector instance) {
> > > >   regularize(instance);
> > > >   Matrix r = beta.like().assign(beta);
> > > >   for (int column = 0; column < r.columnSize(); column++) {
> > > >     r.viewColumn(column).assign(instance, Functions.MULT);
> > > >   }
> > > >   return r;
> > > > }
> > > >
> > > > Then to explain your binary model, you probably want some code like
> > > > this:
> > > >
> > > > Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
> > > > Vector instance = encode(data, traceDictionary);
> > > > Matrix b = model.explain(instance);
> > > >
> > > > ModelDissector md = new ModelDissector();
> > > > // get positive terms
> > > > md.update(b.viewColumn(0), traceDictionary, model);
> > > > // scan through the top terms
> > > > ...
> > > >
> > > > md = new ModelDissector();
> > > > md.update(b.viewColumn(0).assign(Functions.NEGATE), traceDictionary, model);
> > > > // scan through the most negative terms
> > > > ...
> > > >
> > > > Note that all of this code is untested and I could be out to lunch
> > > > here.
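For reference, a minimal end-to-end sketch of the idea above. The encode() helper and the DecisionExplainer class are illustrations, not Mahout APIs; encode() is assumed here to hash whitespace-separated tokens with a StaticWordValueEncoder while filling in the trace dictionary, and explain() is the method proposed in the message above, not an existing Mahout method. Like the snippet above, this is untested:

    import java.util.Map;
    import java.util.Set;

    import com.google.common.collect.Maps;

    import org.apache.mahout.classifier.sgd.ModelDissector;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class DecisionExplainer {
      private static final int FEATURES = 10000;

      // Hash whitespace-separated tokens into a sparse feature vector,
      // recording in traceDictionary which vector slots each token touched
      // so that ModelDissector can map weights back to readable names.
      static Vector encode(String data, Map<String, Set<Integer>> traceDictionary) {
        StaticWordValueEncoder encoder = new StaticWordValueEncoder("words");
        encoder.setTraceDictionary(traceDictionary);
        Vector v = new RandomAccessSparseVector(FEATURES);
        for (String token : data.split("\\s+")) {
          encoder.addToVector(token, v);
        }
        return v;
      }

      // Print the features that push this one decision hardest toward the
      // positive class, following the pattern in the message above.
      static void explainDecision(OnlineLogisticRegression model, String data) {
        Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
        Vector instance = encode(data, traceDictionary);
        Matrix b = model.explain(instance);  // the proposed method, not in Mahout

        ModelDissector md = new ModelDissector();
        md.update(b.viewColumn(0), traceDictionary, model);
        for (ModelDissector.Weight w : md.summary(10)) {
          System.out.printf("%s\t%.4f%n", w.getFeature(), w.getWeight());
        }
      }
    }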
> > > > On Thu, Nov 3, 2011 at 12:19 PM, David Rahman <[email protected]> wrote:
> > > >
> > > > > Hi Ted,
> > > > >
> > > > > I want to have the model explain why it classified documents in a
> > > > > certain way. That should be enough at first.
> > > > >
> > > > > I want to classify documents; each document has a corresponding
> > > > > set of keywords. The model should be able to classify unknown
> > > > > documents and provide a number of suggestions for keywords. Later
> > > > > on it should be possible to build a search term recommender for a
> > > > > search engine with classified documents as a basis.
> > > > >
> > > > > At first we wanted to use the Lucene data, but the existing data
> > > > > was built with an older Lucene version, so the data is provided as
> > > > > XML for now. It's like the wikipedia example, only with more
> > > > > possible keywords.
> > > > >
> > > > > Hope it's understandable.
> > > > >
> > > > > Thanks for your endurance and regards,
> > > > > David
> > > > >
> > > > > 2011/11/3 Ted Dunning <[email protected]>
> > > > >
> > > > > > I am sorry for being dense, but I don't really understand what
> > > > > > you are trying to do.
> > > > > >
> > > > > > As I see it,
> > > > > >
> > > > > > - the input is documents
> > > > > > - the output is a category
> > > > > >
> > > > > > You want one or more of the following,
> > > > > >
> > > > > > - to have the model explain why it classified documents a
> > > > > >   certain way, or
> > > > > > - to classify non-document phrases a certain way, or
> > > > > > - to have the model show its internal structure to you, or
> > > > > > - something else entirely.
> > > > > >
> > > > > > Can you say what you want in these terms?
> > > > > >
> > > > > > On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Ted,
> > > > > > >
> > > > > > > thank you for the explanation.
> > > > > > > For example, imagine a term cloud in which terms are
> > > > > > > presented. Some terms are bigger than others because they are
> > > > > > > more likely than the other terms. I would need those results
> > > > > > > for analysis. We want to compare different ML algorithms and
> > > > > > > methods and/or combinations of them. And first I have to gain
> > > > > > > some basic knowledge about Mahout.
> > > > > > >
> > > > > > > For example, when I take the word 'social' as input, I'd like
> > > > > > > to have this result:
> > > > > > >
> > > > > > > social 1.0
> > > > > > > social media 0.8
> > > > > > > social networking 0.65
> > > > > > > social news 0.6
> > > > > > > facebook 0.5
> > > > > > > ...
> > > > > > >
> > > > > > > (ignore those values, they are not correct, but they should
> > > > > > > show what I need)
> > > > > > >
> > > > > > > The 20Newsgroups example shows, via the summary(int n) method,
> > > > > > > the most likely categorisation of a term (--> the most
> > > > > > > important feature). I would like to have a list with the
> > > > > > > second, third, and so on most important features. I imagine
> > > > > > > that while the features are computed, only the most important
> > > > > > > ones are added to the list and the less important features are
> > > > > > > rejected.
> > > > > > >
> > > > > > > Thanks and regards,
> > > > > > > David
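As an aside, the ranked keyword list described here falls out of the one-classifier-per-keyword design from earlier in the thread fairly directly. A minimal sketch, assuming a hypothetical map from each candidate keyword to a trained two-category OnlineLogisticRegression, and a feature vector built as in the earlier sketch:

    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    import com.google.common.collect.Lists;
    import com.google.common.collect.Maps;

    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class KeywordSuggester {
      // Score a document against every keyword model and return the keywords
      // in descending order of score. For a two-category model,
      // classifyScalar() returns the probability of the positive class.
      public static List<String> suggest(Map<String, OnlineLogisticRegression> models,
                                         Vector instance, int limit) {
        final Map<String, Double> scores = Maps.newHashMap();
        for (Map.Entry<String, OnlineLogisticRegression> e : models.entrySet()) {
          scores.put(e.getKey(), e.getValue().classifyScalar(instance));
        }
        List<String> ranked = Lists.newArrayList(scores.keySet());
        Collections.sort(ranked, new Comparator<String>() {
          @Override
          public int compare(String a, String b) {
            return scores.get(b).compareTo(scores.get(a));
          }
        });
        return ranked.subList(0, Math.min(limit, ranked.size()));
      }
    }

Printed with their scores, the top entries would give exactly the kind of list shown above (social media 0.8, social networking 0.65, and so on).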
> > > > > > > 2011/11/3 Ted Dunning <[email protected]>
> > > > > > >
> > > > > > > > There are no confidence values per se in the models computed
> > > > > > > > by Mahout at this time.
> > > > > > > >
> > > > > > > > There are several issues here,
> > > > > > > >
> > > > > > > > 1) Naive Bayes doesn't have such a concept. 'Nuff said
> > > > > > > > there.
> > > > > > > >
> > > > > > > > 2) SGD logistic regression could compute confidence
> > > > > > > > intervals, but I am not quite sure how to do that with
> > > > > > > > stochastic gradient descent.
> > > > > > > >
> > > > > > > > 3) in most uses of Mahout's logistic regression, the issues
> > > > > > > > are data size and feature set size. Confidence values are
> > > > > > > > typically used for selecting features, which is typically
> > > > > > > > not a viable strategy for problems with very large feature
> > > > > > > > sets. That is what the L1 regularization is all about.
> > > > > > > >
> > > > > > > > 4) with an extremely large number of features, the noise on
> > > > > > > > confidence intervals makes them very hard to understand.
> > > > > > > >
> > > > > > > > 5) with hashed features and feature collisions it is hard
> > > > > > > > enough to understand which feature is doing what, much less
> > > > > > > > what the confidence interval means.
> > > > > > > >
> > > > > > > > Can you say more about your problem? Is it small enough to
> > > > > > > > use bayesglm in R?
> > > > > > > >
> > > > > > > > On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Me again,
> > > > > > > > >
> > > > > > > > > can someone point me in the right direction? How can I
> > > > > > > > > access these features? I looked into the summary(int n)
> > > > > > > > > method located in
> > > > > > > > > org.apache.mahout.classifier.sgd.ModelDissector.java, but
> > > > > > > > > somehow I don't understand how it works.
> > > > > > > > >
> > > > > > > > > Could someone explain to me how it works? As I understand
> > > > > > > > > it, it returns just the max value of a feature.
> > > > > > > > >
> > > > > > > > > Thanks and regards,
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > 2011/10/20 David Rahman <[email protected]>
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > how can I access the confidence values of one (or more)
> > > > > > > > > > feature(s) with their possibilities?
> > > > > > > > > >
> > > > > > > > > > In the 20Newsgroups example there is the dissect method,
> > > > > > > > > > within which summary(int n) is used; it returns the n
> > > > > > > > > > most important features with their weights. I also want
> > > > > > > > > > the features which are placed second or third (or
> > > > > > > > > > lower). How can I access those?
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > David
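On the final question: summary(n) does not return just a single max value. For each traced feature, ModelDissector keeps the largest-magnitude weight across the output categories, and summary(n) returns the n features ranked by that value in descending order, so the second and third most important features are simply the later entries of the returned list. A minimal sketch, assuming md has already been updated with a trace dictionary and model as in the 20 newsgroups example:

    import java.util.List;

    import org.apache.mahout.classifier.sgd.ModelDissector;

    public class SummaryDump {
      // Print the n most important features in rank order; entry 0 is the
      // most important, entry 1 the second most important, and so on.
      public static void dump(ModelDissector md, int n) {
        List<ModelDissector.Weight> weights = md.summary(n);
        for (int i = 0; i < weights.size(); i++) {
          ModelDissector.Weight w = weights.get(i);
          System.out.printf("%d\t%s\t%.4f%n", i, w.getFeature(), w.getWeight());
        }
      }
    }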
