I am sorry for being dense, but I don't really understand what you are trying to do.

As I see it,

- the input is documents

- the output is a category

You want one or more of the following,

- to have the model explain why it classified documents a certain way

or

- to classify non-document phrases a certain way

or

- to have the model show its internal structure to you

or

- something else entirely

Can you say what you want in these terms?

On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <[email protected]> wrote:

> Hi Ted,
>
> thank you for the explanation.
> For example imagine a term cloud, in which terms are presented. Some terms
> are bigger than others, because they are more likely than the other terms. I
> would need those results for analysis. We want to compare different
> ML-algorithms and methods and/or combinations of them. And first I have to
> gain some basic knowledge about Mahout.
>
> For example, when I take the word 'social' as input I'd like to have that
> result:
>
> social 1.0
> social media 0.8
> social networking 0.65
> social news 0.6
> facebook 0.5
> ...
>
> (ignore those values, it's not correct, but it should show what I need)
>
> The 20Newsgroup-example shows with the summary(int n) method the most
> likely categorisation of a term (--> the most important feature). I would
> like to have a list with the second, third, and so on important feature. I
> imagine, while computing the features, only the most important ones are added
> to the list and the less important features are rejected.
>
> Thanks and regards,
> David
>
> 2011/11/3 Ted Dunning <[email protected]>
>
> > There are no confidence values per se in the models computed by Mahout at
> > this time.
> >
> > There are several issues here,
> >
> > 1) Naive Bayes doesn't have such a concept. 'Nuff said there.
> >
> > 2) SGD logistic regression could compute confidence intervals, but I am
> > not quite sure how to do that with stochastic gradient descent.
> >
> > 3) in most uses of Mahout's logistic regression, the issues are data size
> > and feature set size. Confidence values are typically used for selecting
> > features which is typically not a viable strategy for problems with very
> > large feature sets. That is what the L1 regularization is all about.
> >
> > 4) with an extremely large number of features, the noise on confidence
> > intervals makes them very hard to understand
> >
> > 5) with hashed features and feature collisions it is hard enough to
> > understand which feature is doing what, much less what the confidence
> > interval means.
> >
> > Can you say more about your problem? Is it small enough to use bayesglm in
> > R?
> >
> > On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <[email protected]> wrote:
> >
> > > Me again,
> > >
> > > can someone point me to right direction? How can I access these features?
> > > I looked into the summary(int n) -method located in
> > > org.apache.mahout.classifier.sgd.ModelDissector.java, but somehow I don't
> > > understand how it works.
> > >
> > > Could someone explain to me how it works? As I understand it, it returns
> > > just the max-value of a feature.
> > >
> > > Thanks and regards,
> > > David
> > >
> > > 2011/10/20 David Rahman <[email protected]>
> > >
> > > > Hi,
> > > >
> > > > how can I access the confidence values of one (or more) feature(s) with
> > > > its possibilities?
> > > >
> > > > In the 20Newsgroup-example, there is the dissect method, within there is
> > > > used summary(int n), which returns the n most important features with
> > > > their weights. I want also the features which are placed second or third
> > > > (or more). How can I access those?
> > > >
> > > > Regards,
> > > > David
> > > >
> > >
> >
>
