I was talking with Ken Krugler off list about the Mahout + Solr recommender and
he had an interesting request.
When calculating the indicator/item similarity matrix using ItemSimilarityJob
there is a --threshold option. Wouldn’t it be better to have an option that
specified the fraction of
The threshold should not normally be used in the Mahout+Solr deployment
style.
This need is better supported by specifying the maximum number of
indicators. This is mathematically equivalent to specifying a fraction of
values, but is more meaningful to users since good values for this number
are
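To illustrate the point about capping the number of indicators: if every row has N candidate items, keeping the top k per row retains the fraction k/N of values. The following is a minimal sketch in plain Python, not Mahout code; the function name `top_k_indicators` and the toy scores are hypothetical.

```python
import heapq

def top_k_indicators(similarities, k):
    """Keep only the k highest-scoring indicators for each item (hypothetical helper)."""
    return {
        item: heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])
        for item, scores in similarities.items()
    }

# Toy similarity scores (made-up values, purely illustrative).
similarities = {
    "a": {"b": 0.9, "c": 0.2, "d": 0.5},
    "b": {"a": 0.9, "c": 0.7, "d": 0.1},
}

# With 3 candidates per row, k=2 keeps 2/3 of the values in each row.
pruned = top_k_indicators(similarities, k=2)
print(pruned)
```

A per-row cap adapts to each item's score distribution, whereas a single global score threshold can leave some rows empty and others crowded.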
On May 27, 2014, at 8:15 AM, Ted Dunning ted.dunn...@gmail.com wrote:
The threshold should not normally be used in the Mahout+Solr deployment
style.
Understood, and that’s why an alternative way of specifying a cutoff may be a
good idea.
This need is better supported by specifying the
I added the threshold merely as a way to improve the performance
of RowSimilarityJob. If a threshold is given, some item pairs don't need
to be looked at. A simple example: if you use cooccurrence count as the
similarity measure and set a threshold of n cooccurrences, then any
pair
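One way a threshold can save work with cooccurrence counts, sketched here in plain Python under stated assumptions (this is not RowSimilarityJob's actual code, and `cooccurrences_above` is a hypothetical name): a pair's cooccurrence count can never exceed the number of users who interacted with either item, so items seen by fewer than n users can be dropped before any pairs are formed.

```python
from collections import Counter
from itertools import combinations

def cooccurrences_above(user_items, threshold):
    """Return item pairs whose cooccurrence count is >= threshold (illustrative only)."""
    # Number of users who interacted with each item.
    item_counts = Counter(item for items in user_items for item in set(items))
    # An item seen by fewer than `threshold` users cannot reach the
    # threshold with any partner, so it is pruned up front.
    frequent = {item for item, n in item_counts.items() if n >= threshold}

    pair_counts = Counter()
    for items in user_items:
        kept = sorted(set(items) & frequent)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= threshold}

# Toy interaction data (made up): one list of items per user.
user_items = [["a", "b", "c"], ["a", "b"], ["a", "b", "d"], ["c", "d"]]
result = cooccurrences_above(user_items, threshold=2)
print(result)
```

With a threshold of 3, items "c" and "d" (two users each) are pruned before pairing, so only the pair ("a", "b") is ever counted.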
Have you verified that all the slaves are running tasks? Sometimes only a
few slaves on a cluster will pick up a task because of other limitations.
Also, some algorithms in Mahout aren't distributed.
Also, obviously you will want to make sure that you're running the distributed
implementations of
Yes, those nodes are running tasks. For Logistic Regression, that's reasonable, as
this algorithm
has only a sequential implementation. But for Naive Bayes and Random Forest, it's
hard to understand. By the way, how do I know/check if I am running the
distributed implementation of these algorithms?
Hi,
Logistic regression gives output that has three columns:
Target
Model-output
likelihood
Is it possible to add more columns to the output?
I would like to add an ID column so that I can join the logistic regression results with the input
data.
Regards,
Chhaya Vishwakarma
The