I was talking with Ken Krugler off list about the Mahout + Solr recommender and he had an interesting request.
When calculating the indicator/item similarity matrix using ItemSimilarityJob there is a --threshold option. Wouldn’t it be better to have an option that specifies the fraction of values kept in the entire matrix, ranked by similarity strength? This is very difficult to do with --threshold, since you would have to know the distribution of similarity values in advance to pick a cutoff that keeps a given fraction. It would be like expressing the threshold as a fraction of the total number of values rather than as a strength value. It seems like this would have the effect of tossing the least interesting similarities, whereas limiting per item (--maxSimilaritiesPerItem) could easily toss some of the most interesting. At the very least it seems like a better way of expressing the threshold, doesn’t it?
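
To make the difference concrete, here is a rough sketch in plain Python. It is not Mahout code: the function names keep_top_fraction and keep_top_per_item and the toy similarity values are made up for illustration, and the matrix is simplified to a dict of (item, other) pairs.

```python
import math

def keep_top_fraction(similarities, fraction):
    """Keep the strongest `fraction` of all values in the whole matrix."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Rank every similarity in the matrix, strongest first.
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    keep = math.ceil(fraction * len(ranked))
    return dict(ranked[:keep])

def keep_top_per_item(similarities, max_per_item):
    """Keep at most `max_per_item` strongest values in each row (per item)."""
    rows = {}
    for (item, other), score in similarities.items():
        rows.setdefault(item, []).append((other, score))
    kept = {}
    for item, entries in rows.items():
        entries.sort(key=lambda e: e[1], reverse=True)
        for other, score in entries[:max_per_item]:
            kept[(item, other)] = score
    return kept

# Hypothetical toy data: item "a" has several strong similarities,
# items "b" and "c" only have weak ones.
sims = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("a", "d"): 0.7,
    ("b", "c"): 0.2, ("c", "d"): 0.1,
}

by_fraction = keep_top_fraction(sims, 0.5)  # keeps 0.9, 0.8, 0.7
by_item = keep_top_per_item(sims, 1)        # keeps 0.9, 0.2, 0.1
```

With this toy data the per-item limit throws away the 0.8 and 0.7 values for "a" while keeping 0.2 and 0.1 for the other rows, whereas the fraction keeps exactly the strongest half of the matrix.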
