I have added the threshold merely as a way to increase the performance
of RowSimilarityJob. If a threshold is given, some item pairs don't need
to be looked at. A simple example is if you use cooccurrence count as
similarity measure, and set a threshold of n cooccurrences, than any
pair containing an item with less than n interactions can be ignored.
IIRC similar techniques are implemented for cosine and jaccard.
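The pruning described above can be sketched as follows; the data and structure are hypothetical and not Mahout's actual implementation:

```python
# Sketch: with cooccurrence count as the similarity measure,
# cooccurrence(a, b) <= min(count(a), count(b)), so any item with
# fewer than `threshold` interactions can never reach the threshold,
# and every pair containing it can be skipped up front.
from collections import Counter
from itertools import combinations

threshold = 2
interactions = {  # user -> set of items interacted with (hypothetical data)
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"A", "D"},
}

counts = Counter(item for items in interactions.values() for item in items)
# C and D each occur once (< threshold), so pairs containing them are skipped.
candidates = {item for item, c in counts.items() if c >= threshold}

cooc = Counter()
for items in interactions.values():
    for a, b in combinations(sorted(items & candidates), 2):
        cooc[(a, b)] += 1

result = {pair: c for pair, c in cooc.items() if c >= threshold}
# Only ("A", "B"), which cooccurs twice, survives the threshold.
```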
Best,
Sebastian
On 05/27/2014 07:08 PM, Pat Ferrel wrote:
On May 27, 2014, at 8:15 AM, Ted Dunning <[email protected]> wrote:
The threshold should not normally be used in the Mahout+Solr deployment
style.
Understood, and that's why an alternative way of specifying a cutoff may be
a good idea.
This need is better supported by specifying the maximum number of
indicators. This is mathematically equivalent to specifying a fraction of
values, but is more meaningful to users since good values for this number
are pretty consistent across different uses (50-100 are reasonable values
for most needs; larger values are quite plausible).
I assume you mean 50-100 as the average number per item.
The total for the entire indicator matrix is what Ken was asking for. But I was
thinking about its use with itemsimilarity, where the user may not know the
dimensionality, since itemsimilarity assembles the matrix from individual prefs.
The user probably knows the number of items in their catalog but the indicator
matrix dimensionality is arbitrarily smaller.
Currently the help reads:
--maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem try to cap the number
of similar items per item to this number (default: 100)
If this were actually the average # per item it would do what you describe,
but it looks like it's a literal cutoff per vector in the code.
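As a sketch of that literal per-vector behavior (hypothetical data, not the MR code):

```python
# Each item's similarity vector is truncated independently to its
# m highest-scoring entries; there is no sort across the whole matrix.
import heapq

max_per_item = 2  # stand-in for --maxSimilaritiesPerItem
similarities = {  # item -> {similar item: score} (hypothetical scores)
    "A": {"B": 0.9, "C": 0.5, "D": 0.1},
    "B": {"A": 0.9, "C": 0.8, "D": 0.7},
}

capped = {
    item: dict(heapq.nlargest(max_per_item, sims.items(), key=lambda kv: kv[1]))
    for item, sims in similarities.items()
}
# Row A keeps B (0.9) and C (0.5); row B keeps A (0.9) and C (0.8).
# B->D (0.7) is dropped even though it outscores A->C (0.5) globally,
# which is how a per-item cap can toss some of the strongest similarities.
```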
A cutoff based on the highest scores in the entire matrix seems to imply a sort
when the total is larger than the average would allow, and I don't see an
obvious sort being done in the MR.
Anyway, it looks like we could do this by:
1) total number of values in the matrix (what Ken was asking for). To be very
useful, this requires that the user know the dimensionality of the indicator matrix.
2) average number per item (what Ted describes). This seems the most intuitive
and does not require that the dimensionality be known.
3) fraction of the values. This might be useful if you are more interested in
downsampling by score; at least it seems more useful than --threshold as it is
today, but maybe I'm missing some use cases? Is there really a need for a hard
score threshold?
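All three options reduce to keeping the k highest-scoring entries matrix-wide, just with k specified differently; a sketch with hypothetical scores:

```python
# Option 1: k given directly as the total number of values.
# Option 2: k = average per item * number of items.
# Option 3: k = fraction * current number of values.
import heapq

entries = [  # (item, similar item, score) for the whole matrix
    ("A", "B", 0.9), ("A", "C", 0.5), ("A", "D", 0.1),
    ("B", "C", 0.8), ("B", "D", 0.7), ("C", "D", 0.2),
]

def keep_top_k(entries, k):
    # Global top-k by score: the weakest similarities are dropped first,
    # which a per-item cap cannot guarantee.
    return heapq.nlargest(k, entries, key=lambda e: e[2])

num_items = 4
by_average = keep_top_k(entries, 1 * num_items)             # option 2: avg of 1 per item
by_fraction = keep_top_k(entries, int(0.5 * len(entries)))  # option 3: keep half
```

Either way the least interesting similarities go first; the options differ only in how the user expresses k.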
On Tue, May 27, 2014 at 8:08 AM, Pat Ferrel <[email protected]> wrote:
I was talking with Ken Krugler off list about the Mahout + Solr
recommender and he had an interesting request.
When calculating the indicator/item similarity matrix using
ItemSimilarityJob there is a --threshold option. Wouldn’t it be better to
have an option that specified the fraction of values kept in the entire
matrix based on their similarity strength? This is very difficult to do
with --threshold. It would be like expressing the threshold as a fraction
of the total number of values rather than a strength value. Seems like this
would have the effect of tossing the least interesting similarities, whereas
limiting per item (--maxSimilaritiesPerItem) could easily toss some of the
most interesting.
At the very least it seems like a better way of expressing the threshold,
doesn't it?