Hi Ted,

Thanks for the help below. I ended up using the binary representation approach. First, I converted the similarity scores to percentages by multiplying by 100, turning the fractional scores into whole numbers. After I computed the binary representation of the two scores, I took the OR and then divided by 100 to convert the result back to a similarity. This seems to be working fine.
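A minimal sketch of the binarize-and-OR idea described above. The cutoff of 0.5 is purely an illustrative assumption standing in for whatever "sound statistical method" is actually used to binarize; the function names are made up for this example.

```python
def binarize(score, threshold=0.5):
    """Reduce a fractional similarity score in [0, 1] to a 0/1 indicator.

    The threshold here is an arbitrary illustrative choice, not a
    statistically derived cutoff."""
    return 1 if score >= threshold else 0

def combine_or(score_a, score_b, threshold=0.5):
    """Combine two scores from different domains with a logical OR:
    the pair counts as similar if either domain says it is."""
    return binarize(score_a, threshold) | binarize(score_b, threshold)

print(combine_or(0.8, 0.2))  # similar in one domain -> 1
print(combine_or(0.1, 0.3))  # similar in neither domain -> 0
```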
I also learned over the weekend that there is a theory for computing such a score from scores computed in different domains. The theory is called "Utility Theory" and it uses a method called "Kappa Statistics". I thought I would share that with everyone here.

Thanks again for your help with this. It is very much appreciated.

-Ahmed

On Fri, Mar 23, 2012 at 6:32 PM, Ted Dunning <[email protected]> wrote:

> My own recommendation is to reduce both scores to binary form using
> whatever sound statistical method you care to adopt and then use OR.
>
> A viable alternative that is relatively good is to convert both scores to
> percentiles with the same polarity (i.e. the 99th %-ile is very close or
> very similar). Then transform both percentiles using the logit function to
> get unbounded real numbers. The logit of p is just log(p / (1-p)) where p
> is in the range (0,1). These transformed percentiles can be added with
> reasonable impunity and the result can be interpreted pretty easily. The
> time that this doesn't work so well is when one of the values is heavily
> quantized near the interesting end of the scale, but that problem is
> inherent in the data, not in the method.
>
> A similar result can be had by using -log(1-p) where p is the percentile
> in question. For values of p near 1, this is approximately the same as
> using the logit function. For values far from 1, we don't care what it
> means.
>
>
> On Fri, Mar 23, 2012 at 1:52 PM, Sean Owen <[email protected]> wrote:
>
>> On Fri, Mar 23, 2012 at 8:33 PM, Ahmed Abdeen Hamed
>> <[email protected]> wrote:
>> > As for merging the scores, I need an OR rule, which translates to
>> > addition. If I used AND, that would make the likelihood smaller
>> > because the probabilities would be multiplied. This would restrict the
>> > clusters to items that appear in the intersection of content-based
>> > similarity AND sales correlations. Does this sound right to you?
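Ted's two transforms above can be sketched directly; the percentile inputs below are made-up values for illustration, and the function names are hypothetical.

```python
import math

def logit(p):
    """logit(p) = log(p / (1 - p)), mapping a percentile in (0, 1)
    to an unbounded real number."""
    return math.log(p / (1.0 - p))

def combine_logit(pct_a, pct_b):
    """Add logit-transformed percentiles; assumes both have the same
    polarity (a high percentile means 'very similar' in both domains)."""
    return logit(pct_a) + logit(pct_b)

def tail_log(p):
    """The -log(1 - p) variant, approximately equal to logit(p)
    for p near 1."""
    return -math.log(1.0 - p)

# For p near 1 the two transforms nearly agree:
print(logit(0.999))     # about 6.91
print(tail_log(0.999))  # about 6.91
```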
>>
>> Not really, because of course you multiply probabilities in all cases.
>> Yes, all similarities would be smaller in absolute terms, but that's
>> fine -- the absolute value does not matter.
>>
>> The problem with adding is that, again, it assumes the two terms are in
>> the same "units", and that is not clear here. The product doesn't
>> contain that assumption, at least.
>>
>> > A very important issue I am having now is about evaluation. How do we
>> > evaluate these clusters resulting from a TreeClusteringRecommender?
>>
>> In the context of recommenders, you don't. The clusters are not the
>> output, just a possible implementation by-product. You could compute
>> metrics like intra-cluster distance vs. inter-cluster distance, but I
>> don't know what that says about the quality of the recs.
>>
>> You should start with the standard rec eval code if you can.
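Sean's point that the product shrinks absolute values but leaves the ranking meaningful can be shown with a toy example; the item pairs and scores below are invented for illustration.

```python
# Each (hypothetical) item pair has a content-based similarity and a
# sales-correlation score. Multiplying them (AND-style combination)
# yields small absolute values, but only the relative ordering of
# pairs matters for recommendation.
pairs = {
    "A-B": (0.9, 0.8),
    "A-C": (0.9, 0.1),
    "B-C": (0.4, 0.5),
}
product_scores = {k: a * b for k, (a, b) in pairs.items()}
ranked = sorted(product_scores, key=product_scores.get, reverse=True)
print(ranked)  # ['A-B', 'B-C', 'A-C']
```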
