The basic reason that it is common to binarize the relationships is that putting weights on these relationships makes it really easy to over-fit, thus giving you very goofy results.
One method for putting weights on these elements is to use

    weight(i, j) = log((N_rows + 1) / (rowSum_i + 1)) * log((N_cols + 1) / (colSum_j + 1))

where the weight is set to zero for any cell of the item-item matrix that does not contain a 1.

Another reasonable weighting is to simply use row or column counts (depending on the application). You get something very similar to this weighting when you use a text retrieval engine to produce recommendations, where documents are the columns of the item-item matrix and you multiply by a user history expressed in items. (A quick numpy sketch of the log-log weighting appears below the quoted message.)

On Fri, Dec 21, 2012 at 3:45 PM, Kai R. Larsen <[email protected]> wrote:

> Hi,
>
> My sincere apologies if this is a naïve question (I'm sure it is).
>
> I've engaged a programmer to take a weblog and focus on 250 pages
> containing items that may be similar (or not). The goal is to create
> item-item relationship tables where every cell contains a score for how
> similar two items are. He now tells me that only two of the (many) Mahout
> algorithms can be used to generate such tables, and those that do generate
> a distance of 1 or some other constant value between all pairs.
>
> This can't be true, can it? There must be a way to tease out such
> information from the algorithms. Any advice? Any ideas why all
> relationships would be one? While it is common for the website users to
> have visited only one page at a time, it is not pervasive.
>
> Best,
>
> Kai Larsen
>
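For concreteness, here is a minimal sketch of that weighting in Python/numpy. The function name log_log_weights and the use of a small dense array are just illustration choices of mine, not Mahout code; it simply applies the formula above cell by cell and keeps zeros where the item-item matrix has no 1.

import numpy as np

def log_log_weights(item_item):
    # item_item: binary item-item matrix (0/1 entries) as a 2-D array-like.
    m = np.asarray(item_item, dtype=float)
    n_rows, n_cols = m.shape
    row_sums = m.sum(axis=1)   # rowSum_i for each row i
    col_sums = m.sum(axis=0)   # colSum_j for each column j
    row_w = np.log((n_rows + 1) / (row_sums + 1))
    col_w = np.log((n_cols + 1) / (col_sums + 1))
    # The outer product gives log(...) * log(...) for every (i, j);
    # the mask keeps the weight at zero wherever the input had no 1.
    return np.outer(row_w, col_w) * (m > 0)

# Tiny example (hypothetical data):
cooccurrence = np.array([[1, 0, 1],
                         [1, 1, 0]])
print(log_log_weights(cooccurrence))

For a real item-item matrix you would of course use a sparse representation rather than a dense array, but the weighting itself is the same.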
