Here are some hints. https://cwiki.apache.org/MAHOUT/how-to-contribute.html
It is really easy and we would be happy to help.

On Thu, Nov 3, 2011 at 1:48 PM, David Rahman <[email protected]> wrote:

> Never done that before, but I will look into it. As an alternative I could
> send it to your email. But first I have to implement it successfully.
>
> Thanks again and regards,
> David
>
> 2011/11/3 Ted Dunning <[email protected]>
>
> > If you do get to that, could you write up a JIRA and attach a patch?
> >
> > On Thu, Nov 3, 2011 at 1:33 PM, David Rahman <[email protected]> wrote:
> >
> > > Thank you Ted,
> > >
> > > I will test the methods next week, when I'm back in the office, and
> > > let you know how it went.
> > >
> > > Thank you and best regards,
> > > David
> > >
> > > 2011/11/3 Ted Dunning <[email protected]>
> > >
> > > > OK.
> > > >
> > > > So the simplest design in Mahout terms is a binary classifier for
> > > > each keyword (if the keywords are not mutually exclusive). If you
> > > > can define a useful ordering for terms or have some logical
> > > > entailment, you may want to allow the presence of some terms to be
> > > > features for certain other terms.
> > > >
> > > > So the question boils down to how to ask a binary logistic
> > > > regression how it came to its conclusion.
> > > >
> > > > You are correct to look to the model dissector for the function you
> > > > want, but you will have to call it in a slightly unusual way because
> > > > it is really intended to describe a model rather than a single
> > > > decision. The logistic regression classes in Mahout don't actually
> > > > expose quite as much information as you need for this, but if you
> > > > add this method, you should get the basic information you need:
> > > >
> > > > /**
> > > >  * Returns the element-wise product of the feature vector with each
> > > >  * column of the beta matrix. This can then be used to extract the
> > > >  * most interesting features behind a decision for each alternative
> > > >  * output.
> > > >  * @param instance A feature vector
> > > >  * @return A matrix like beta, but with each column multiplied by
> > > >  *         instance.
> > > >  */
> > > > public Matrix explain(Vector instance) {
> > > >   regularize(instance);
> > > >   Matrix r = beta.like().assign(beta);
> > > >   for (int column = 0; column < r.columnSize(); column++) {
> > > >     r.viewColumn(column).assign(instance, Functions.MULT);
> > > >   }
> > > >   return r;
> > > > }
> > > >
> > > > Then to explain your binary model, you probably want some code like
> > > > this:
> > > >
> > > > Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
> > > > Vector instance = encode(data, traceDictionary);
> > > > Matrix b = model.explain(instance);
> > > >
> > > > ModelDissector md = new ModelDissector();
> > > > // get positive terms
> > > > md.update(b.viewColumn(0), traceDictionary, model);
> > > > // scan through the top terms
> > > > ...
> > > >
> > > > md = new ModelDissector();
> > > > md.update(b.viewColumn(0).assign(Functions.NEGATE), traceDictionary, model);
> > > > // scan through the most negative terms
> > > > ...
> > > >
> > > > Note that all of this code is untested and I could be out to lunch
> > > > here.
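For reference, a minimal end-to-end sketch of the idea above. The encode() helper and the DecisionExplainer class are illustrations, not Mahout APIs; encode() is assumed here to hash whitespace-separated tokens with a StaticWordValueEncoder while filling in the trace dictionary, and explain() is the method proposed in the message above, not an existing Mahout method. Like the snippet above, this is untested:

    import java.util.Map;
    import java.util.Set;

    import com.google.common.collect.Maps;

    import org.apache.mahout.classifier.sgd.ModelDissector;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class DecisionExplainer {
      private static final int FEATURES = 10000;

      // Hash whitespace-separated tokens into a sparse feature vector,
      // recording in traceDictionary which vector slots each token touched
      // so that ModelDissector can map weights back to readable names.
      static Vector encode(String data, Map<String, Set<Integer>> traceDictionary) {
        StaticWordValueEncoder encoder = new StaticWordValueEncoder("words");
        encoder.setTraceDictionary(traceDictionary);
        Vector v = new RandomAccessSparseVector(FEATURES);
        for (String token : data.split("\\s+")) {
          encoder.addToVector(token, v);
        }
        return v;
      }

      // Print the features that push this one decision hardest toward the
      // positive class, following the pattern in the message above.
      static void explainDecision(OnlineLogisticRegression model, String data) {
        Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
        Vector instance = encode(data, traceDictionary);
        Matrix b = model.explain(instance);  // the proposed method, not in Mahout

        ModelDissector md = new ModelDissector();
        md.update(b.viewColumn(0), traceDictionary, model);
        for (ModelDissector.Weight w : md.summary(10)) {
          System.out.printf("%s\t%.4f%n", w.getFeature(), w.getWeight());
        }
      }
    }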
> > > > On Thu, Nov 3, 2011 at 12:19 PM, David Rahman <[email protected]> wrote:
> > > >
> > > > > Hi Ted,
> > > > >
> > > > > I want to have the model explain why it classified documents in a
> > > > > certain way. That should be enough at first.
> > > > >
> > > > > I want to classify documents; each document has a corresponding
> > > > > set of keywords. The model should be able to classify unknown
> > > > > documents and provide a number of suggestions for keywords. Later
> > > > > on it should be possible to build a search term recommender for a
> > > > > search engine with classified documents as a basis.
> > > > >
> > > > > At first we wanted to use the Lucene data, but the existing data
> > > > > was built with an older Lucene version, so the data is provided as
> > > > > XML for now. It's like the wikipedia example, only with more
> > > > > possible keywords.
> > > > >
> > > > > Hope it's understandable.
> > > > >
> > > > > Thanks for your endurance and regards,
> > > > > David
> > > > >
> > > > > 2011/11/3 Ted Dunning <[email protected]>
> > > > >
> > > > > > I am sorry for being dense, but I don't really understand what
> > > > > > you are trying to do.
> > > > > >
> > > > > > As I see it,
> > > > > >
> > > > > > - the input is documents
> > > > > > - the output is a category
> > > > > >
> > > > > > You want one or more of the following,
> > > > > >
> > > > > > - to have the model explain why it classified documents a
> > > > > >   certain way, or
> > > > > > - to classify non-document phrases a certain way, or
> > > > > > - to have the model show its internal structure to you, or
> > > > > > - something else entirely.
> > > > > >
> > > > > > Can you say what you want in these terms?
> > > > > >
> > > > > > On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Ted,
> > > > > > >
> > > > > > > thank you for the explanation.
> > > > > > > For example, imagine a term cloud in which terms are
> > > > > > > presented. Some terms are bigger than others because they are
> > > > > > > more likely than the other terms. I would need those results
> > > > > > > for analysis. We want to compare different ML algorithms and
> > > > > > > methods and/or combinations of them. And first I have to gain
> > > > > > > some basic knowledge about Mahout.
> > > > > > >
> > > > > > > For example, when I take the word 'social' as input, I'd like
> > > > > > > to have this result:
> > > > > > >
> > > > > > > social 1.0
> > > > > > > social media 0.8
> > > > > > > social networking 0.65
> > > > > > > social news 0.6
> > > > > > > facebook 0.5
> > > > > > > ...
> > > > > > >
> > > > > > > (ignore those values, they are not correct, but they should
> > > > > > > show what I need)
> > > > > > >
> > > > > > > The 20Newsgroups example shows, via the summary(int n) method,
> > > > > > > the most likely categorisation of a term (--> the most
> > > > > > > important feature). I would like to have a list with the
> > > > > > > second, third, and so on most important features. I imagine
> > > > > > > that while the features are computed, only the most important
> > > > > > > ones are added to the list and the less important features are
> > > > > > > rejected.
> > > > > > >
> > > > > > > Thanks and regards,
> > > > > > > David
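As an aside, the ranked keyword list described here falls out of the one-classifier-per-keyword design from earlier in the thread fairly directly. A minimal sketch, assuming a hypothetical map from each candidate keyword to a trained two-category OnlineLogisticRegression, and a feature vector built as in the earlier sketch:

    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    import com.google.common.collect.Lists;
    import com.google.common.collect.Maps;

    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class KeywordSuggester {
      // Score a document against every keyword model and return the keywords
      // in descending order of score. For a two-category model,
      // classifyScalar() returns the probability of the positive class.
      public static List<String> suggest(Map<String, OnlineLogisticRegression> models,
                                         Vector instance, int limit) {
        final Map<String, Double> scores = Maps.newHashMap();
        for (Map.Entry<String, OnlineLogisticRegression> e : models.entrySet()) {
          scores.put(e.getKey(), e.getValue().classifyScalar(instance));
        }
        List<String> ranked = Lists.newArrayList(scores.keySet());
        Collections.sort(ranked, new Comparator<String>() {
          @Override
          public int compare(String a, String b) {
            return scores.get(b).compareTo(scores.get(a));
          }
        });
        return ranked.subList(0, Math.min(limit, ranked.size()));
      }
    }

Printed with their scores, the top entries would give exactly the kind of list shown above (social media 0.8, social networking 0.65, and so on).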
> > > > > > > 2011/11/3 Ted Dunning <[email protected]>
> > > > > > >
> > > > > > > > There are no confidence values per se in the models computed
> > > > > > > > by Mahout at this time.
> > > > > > > >
> > > > > > > > There are several issues here,
> > > > > > > >
> > > > > > > > 1) Naive Bayes doesn't have such a concept. 'Nuff said
> > > > > > > > there.
> > > > > > > >
> > > > > > > > 2) SGD logistic regression could compute confidence
> > > > > > > > intervals, but I am not quite sure how to do that with
> > > > > > > > stochastic gradient descent.
> > > > > > > >
> > > > > > > > 3) in most uses of Mahout's logistic regression, the issues
> > > > > > > > are data size and feature set size. Confidence values are
> > > > > > > > typically used for selecting features, which is typically
> > > > > > > > not a viable strategy for problems with very large feature
> > > > > > > > sets. That is what the L1 regularization is all about.
> > > > > > > >
> > > > > > > > 4) with an extremely large number of features, the noise on
> > > > > > > > confidence intervals makes them very hard to understand.
> > > > > > > >
> > > > > > > > 5) with hashed features and feature collisions it is hard
> > > > > > > > enough to understand which feature is doing what, much less
> > > > > > > > what the confidence interval means.
> > > > > > > >
> > > > > > > > Can you say more about your problem? Is it small enough to
> > > > > > > > use bayesglm in R?
> > > > > > > >
> > > > > > > > On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Me again,
> > > > > > > > >
> > > > > > > > > can someone point me in the right direction? How can I
> > > > > > > > > access these features? I looked into the summary(int n)
> > > > > > > > > method located in
> > > > > > > > > org.apache.mahout.classifier.sgd.ModelDissector.java, but
> > > > > > > > > somehow I don't understand how it works.
> > > > > > > > >
> > > > > > > > > Could someone explain to me how it works? As I understand
> > > > > > > > > it, it returns just the max value of a feature.
> > > > > > > > >
> > > > > > > > > Thanks and regards,
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > 2011/10/20 David Rahman <[email protected]>
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > how can I access the confidence values of one (or more)
> > > > > > > > > > feature(s) with their possibilities?
> > > > > > > > > >
> > > > > > > > > > In the 20Newsgroups example there is the dissect method,
> > > > > > > > > > within which summary(int n) is used; it returns the n
> > > > > > > > > > most important features with their weights. I also want
> > > > > > > > > > the features which are placed second or third (or
> > > > > > > > > > lower). How can I access those?
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > David
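On the final question: summary(n) does not return just a single max value. For each traced feature, ModelDissector keeps the largest-magnitude weight across the output categories, and summary(n) returns the n features ranked by that value in descending order, so the second and third most important features are simply the later entries of the returned list. A minimal sketch, assuming md has already been updated with a trace dictionary and model as in the 20 newsgroups example:

    import java.util.List;

    import org.apache.mahout.classifier.sgd.ModelDissector;

    public class SummaryDump {
      // Print the n most important features in rank order; entry 0 is the
      // most important, entry 1 the second most important, and so on.
      public static void dump(ModelDissector md, int n) {
        List<ModelDissector.Weight> weights = md.summary(n);
        for (int i = 0; i < weights.size(); i++) {
          ModelDissector.Weight w = weights.get(i);
          System.out.printf("%d\t%s\t%.4f%n", i, w.getFeature(), w.getWeight());
        }
      }
    }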
