Indicator Matrix and Mahout + Solr recommender

2014-05-27 Thread Pat Ferrel
I was talking with Ken Krugler off-list about the Mahout + Solr recommender and he had an interesting request. When calculating the indicator/item similarity matrix using ItemSimilarityJob, there is a --threshold option. Wouldn’t it be better to have an option that specified the fraction of

Re: Indicator Matrix and Mahout + Solr recommender

2014-05-27 Thread Ted Dunning
The threshold should not normally be used in the Mahout+Solr deployment style. This need is better supported by specifying the maximum number of indicators. This is mathematically equivalent to specifying a fraction of values, but is more meaningful to users since good values for this number are
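
As a rough illustration of capping the number of indicators per item (rather than applying a score threshold), the sketch below keeps the k highest-scoring entries in each row of a small similarity table. The function name `top_k_indicators`, the dict-of-dicts layout, and the scores are hypothetical and only for illustration; this is not Mahout's API or implementation.

```python
import heapq

def top_k_indicators(similarities, k):
    """Keep at most k highest-scoring indicators per item (hypothetical helper)."""
    return {
        item: heapq.nlargest(k, row.items(), key=lambda kv: kv[1])
        for item, row in similarities.items()
    }

# Made-up similarity scores: item -> {other item: score}
scores = {
    "a": {"b": 4.2, "c": 0.3, "d": 1.7},
    "b": {"a": 4.2, "c": 2.5},
}

pruned = top_k_indicators(scores, k=2)
for item, indicators in pruned.items():
    print(item, indicators)
```

With a cap like this, every item keeps the same maximum number of indicators no matter how its scores are scaled, which is one reason a count can be easier to choose than a raw threshold.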

Re: Indicator Matrix and Mahout + Solr recommender

2014-05-27 Thread Pat Ferrel
On May 27, 2014, at 8:15 AM, Ted Dunning ted.dunn...@gmail.com wrote:
> The threshold should not normally be used in the Mahout+Solr deployment style.
Understood and that’s why an alternative way of specifying a cutoff may be a good idea.
> This need is better supported by specifying the

Re: Indicator Matrix and Mahout + Solr recommender

2014-05-27 Thread Sebastian Schelter
I have added the threshold merely as a way to increase the performance of RowSimilarityJob. If a threshold is given, some item pairs don't need to be looked at. A simple example is if you use cooccurrence count as the similarity measure, and set a threshold of n cooccurrences, then any pair
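
A minimal sketch of how a cooccurrence threshold can let pairs be skipped. It relies on one bound, which is an assumption here about the kind of pruning meant and is not taken from RowSimilarityJob: a pair can never cooccur more often than its rarer item occurs, so items seen fewer than n times can be dropped before any pairs are counted. The function name and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrences_with_threshold(baskets, n):
    """Count item pairs that cooccur in at least n baskets (hypothetical helper)."""
    # How many baskets each item appears in.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    # Assumed pruning rule: an item seen fewer than n times cannot be part
    # of any pair with n cooccurrences, so it is skipped up front.
    frequent = {item for item, count in item_counts.items() if count >= n}

    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent)
        pair_counts.update(combinations(kept, 2))

    # Apply the threshold to the pairs that were actually counted.
    return {pair: count for pair, count in pair_counts.items() if count >= n}

# Made-up interaction data: each inner list is one user's items.
baskets = [["a", "b", "c"], ["a", "b"], ["a", "b", "d"], ["c", "d"]]
result = cooccurrences_with_threshold(baskets, 2)
print(result)
```

The final filter is still needed, because frequent items can form pairs that cooccur fewer than n times; the up-front pruning only reduces how many pairs are generated.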

Re: Confusion on runtime of mahout.

2014-05-27 Thread Jay Vyas
Have you verified that all the slaves are running tasks? Sometimes only a few slaves on a cluster will pick up a task because of other limitations. Also, some algorithms in Mahout aren't distributed. Obviously, you will also want to make sure that you're running the distributed implementations of

Re: Confusion on runtime of mahout.

2014-05-27 Thread dongdan39
Yes, those nodes are running tasks. For Logistic Regression, it's reasonable, as this algorithm has only a sequential implementation. But for Naive Bayes and Random Forest, it's hard to understand. By the way, how do I know/check if I am running the distributed implementation of these algorithms?

Changing output columns of logistic regression

2014-05-27 Thread Chhaya Vishwakarma
Hi, Logistic regression gives output which has three columns: Target, Model-output, and likelihood. Is it possible to add more columns to the output? I would like to add an ID column so that I can join the logistic regression result with the input data. Regards, Chhaya Vishwakarma The