I have added the threshold merely as a way to increase the performance
of RowSimilarityJob. If a threshold is given, some item pairs don't need
to be looked at. A simple example is if you use cooccurrence count as
similarity measure, and set a threshold of n cooccurrences, than any
pair containing an item with less than n interactions can be ignored.
IIRC similar techniques are implemented for cosine and jaccard.
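The pruning described above can be sketched as follows; the data and structure are hypothetical and not Mahout's actual implementation:

```python
# Sketch: with cooccurrence count as the similarity measure,
# cooccurrence(a, b) <= min(count(a), count(b)), so any item with
# fewer than `threshold` interactions can never reach the threshold,
# and every pair containing it can be skipped up front.
from collections import Counter
from itertools import combinations

threshold = 2
interactions = {  # user -> set of items interacted with (hypothetical data)
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"A", "D"},
}

counts = Counter(item for items in interactions.values() for item in items)
# C and D each occur once (< threshold), so pairs containing them are skipped.
candidates = {item for item, c in counts.items() if c >= threshold}

cooc = Counter()
for items in interactions.values():
    for a, b in combinations(sorted(items & candidates), 2):
        cooc[(a, b)] += 1

result = {pair: c for pair, c in cooc.items() if c >= threshold}
# Only ("A", "B"), which cooccurs twice, survives the threshold.
```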
Best,
Sebastian
On 05/27/2014 07:08 PM, Pat Ferrel wrote:
On May 27, 2014, at 8:15 AM, Ted Dunning <[email protected]> wrote:
The threshold should not normally be used in the Mahout+Solr deployment
style.
Understood, and that's why an alternative way of specifying a cutoff may be
a good idea.
This need is better supported by specifying the maximum number of
indicators. This is mathematically equivalent to specifying a fraction of
values, but is more meaningful to users since good values for this number
are pretty consistent across different uses (50-100 are reasonable values
for most needs; larger values are quite plausible).
I assume you mean 50-100 as the average number per item.
The total for the entire indicator matrix is what Ken was asking for. But I was
thinking about its use with itemsimilarity, where the user may not know the
dimensionality, since itemsimilarity assembles the matrix from individual prefs.
The user probably knows the number of items in their catalog but the indicator
matrix dimensionality is arbitrarily smaller.
Currently the help reads:
--maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem try to cap the number
of similar items per item to this number (default: 100)
If this were actually the average # per item it would do what you describe,
but it looks like it's a literal cutoff per vector in the code.
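As a sketch of that literal per-vector behavior (hypothetical data, not the MR code):

```python
# Each item's similarity vector is truncated independently to its
# m highest-scoring entries; there is no sort across the whole matrix.
import heapq

max_per_item = 2  # stand-in for --maxSimilaritiesPerItem
similarities = {  # item -> {similar item: score} (hypothetical scores)
    "A": {"B": 0.9, "C": 0.5, "D": 0.1},
    "B": {"A": 0.9, "C": 0.8, "D": 0.7},
}

capped = {
    item: dict(heapq.nlargest(max_per_item, sims.items(), key=lambda kv: kv[1]))
    for item, sims in similarities.items()
}
# Row A keeps B (0.9) and C (0.5); row B keeps A (0.9) and C (0.8).
# B->D (0.7) is dropped even though it outscores A->C (0.5) globally,
# which is how a per-item cap can toss some of the strongest similarities.
```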
A cutoff based on the highest scores in the entire matrix seems to imply a sort
when the total is larger than the average would allow, and I don't see an
obvious sort being done in the MR.
Anyway, it looks like we could do this by:
1) total number of values in the matrix (what Ken was asking for). To be very
useful, this requires that the user know the dimensionality of the indicator matrix.
2) average number per item (what Ted describes). This seems the most intuitive
and does not require that the dimensionality be known.
3) fraction of the values. This might be useful if you are more interested in
downsampling by score; at least it seems more useful than --threshold as it is
today, but maybe I'm missing some use cases? Is there really a need for a hard
score threshold?
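All three options reduce to keeping the k highest-scoring entries matrix-wide, just with k specified differently; a sketch with hypothetical scores:

```python
# Option 1: k given directly as the total number of values.
# Option 2: k = average per item * number of items.
# Option 3: k = fraction * current number of values.
import heapq

entries = [  # (item, similar item, score) for the whole matrix
    ("A", "B", 0.9), ("A", "C", 0.5), ("A", "D", 0.1),
    ("B", "C", 0.8), ("B", "D", 0.7), ("C", "D", 0.2),
]

def keep_top_k(entries, k):
    # Global top-k by score: the weakest similarities are dropped first,
    # which a per-item cap cannot guarantee.
    return heapq.nlargest(k, entries, key=lambda e: e[2])

num_items = 4
by_average = keep_top_k(entries, 1 * num_items)             # option 2: avg of 1 per item
by_fraction = keep_top_k(entries, int(0.5 * len(entries)))  # option 3: keep half
```

Either way the least interesting similarities go first; the options differ only in how the user expresses k.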
On Tue, May 27, 2014 at 8:08 AM, Pat Ferrel <[email protected]> wrote:
I was talking with Ken Krugler off list about the Mahout + Solr
recommender and he had an interesting request.
When calculating the indicator/item similarity matrix using
ItemSimilarityJob there is a --threshold option. Wouldn’t it be better to
have an option that specified the fraction of values kept in the entire
matrix based on their similarity strength? This is very difficult to do
with --threshold. It would be like expressing the threshold as a fraction
of the total number of values rather than a strength value. Seems like this
would have the effect of tossing the least interesting similarities, whereas
limiting per item (--maxSimilaritiesPerItem) could easily toss some of the
most interesting.
At the very least it seems like a better way of expressing the threshold,
doesn't it?