My own recommendation is to reduce both scores to binary form, using whatever sound statistical method you care to adopt, and then use OR.
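As a sketch of that binary-OR rule (the function and threshold names here are just illustrative; how you pick the thresholds is the statistical question):

```python
def binary_or(score_a, score_b, thresh_a, thresh_b):
    # Binarize each score with its own threshold, then combine with OR:
    # the pair is "similar" if either score clears its threshold.
    return score_a >= thresh_a or score_b >= thresh_b
```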
A viable, relatively good alternative is to convert both scores to percentiles with the same polarity (i.e. the 99th percentile means very close or very similar in both). Then transform both percentiles using the logit function to get unbounded real numbers. The logit of p is just log(p / (1 - p)), where p is in the range (0, 1). These transformed percentiles can be added with reasonable impunity and the result can be interpreted pretty easily. The time this doesn't work so well is when one of the values is heavily quantized near the interesting end of the scale, but that problem is inherent in the data, not in the method.

A similar result can be had by using -log(1 - p), where p is the percentile in question. For values of p near 1, this is approximately the same as using the logit function. For values far from 1, we don't care what it means.

On Fri, Mar 23, 2012 at 1:52 PM, Sean Owen <[email protected]> wrote:
> On Fri, Mar 23, 2012 at 8:33 PM, Ahmed Abdeen Hamed
> <[email protected]> wrote:
> > As for merging the scores, I need an OR rule, which translates to
> > addition. If I used AND, that would make the likelihood smaller
> > because the probabilities would be multiplied. This would restrict
> > the clusters to items that appear in the intersection of
> > content-based similarity AND sales correlations. Does this sound
> > right to you?
>
> Not really, because of course you multiply probabilities in all cases.
> Yes, all similarities would be smaller in absolute terms, but that's
> fine -- the absolute value does not matter.
>
> The problem with adding is that again it assumes the two terms are in
> the same "units", and that is not clear here. The product doesn't
> contain that assumption, at least.
>
> > A very important issue I am having now is about evaluation. How do
> > we evaluate these clusters resulting from a TreeClusteringRecommender?
>
> In the context of recommenders, you don't. The clusters are not the
> output, just a possible implementation by-product. You could compute
> metrics like intra-cluster distance vs. inter-cluster distance, but I
> don't know what it says about the quality of the recs.
>
> You should start with the standard rec eval code if you can.
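The logit-of-percentiles combination can be sketched in a few lines of Python (the function names are mine; the rank-based percentile estimate is one reasonable choice, using (rank + 0.5) / n to keep p strictly inside (0, 1) so the logit stays finite):

```python
import math

def percentiles(scores):
    # Rank-based percentiles in (0, 1); (rank + 0.5) / n avoids the
    # endpoints, where the logit would be infinite.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    p = [0.0] * len(scores)
    for rank, i in enumerate(order):
        p[i] = (rank + 0.5) / len(scores)
    return p

def logit(p):
    # log(p / (1 - p)): maps (0, 1) onto the whole real line.
    return math.log(p / (1.0 - p))

def combined_score(scores_a, scores_b):
    # Both score lists must have the same polarity (high = similar).
    # After the logit transform, the sums are comparable even though
    # the raw scores were in different "units".
    pa = percentiles(scores_a)
    pb = percentiles(scores_b)
    return [logit(x) + logit(y) for x, y in zip(pa, pb)]
```

The -log(1 - p) variant mentioned above would just replace `logit`; for p near 1 the two transforms agree closely, which is the end of the scale that matters for similarity.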
