Setting didn't-buy to 0 and getting a valid cosine distance is pretty common in these scenarios.
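
For concreteness, here is a rough sketch of that encoding in plain Python. It is not Mahout's implementation; the function name, the second user's vector, and the choice to return 0.0 for a user with no purchases are all assumptions made for illustration. It computes { A dot B } over { |A| times |B| }, which is strictly the cosine similarity (the distance would be one minus this).

```python
import math


def cosine_similarity(a, b):
    """Return (A dot B) / (|A| * |B|) for two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a user with no purchases is treated as similar to no one.
        return 0.0
    return dot / (norm_a * norm_b)


# The purchase vector quoted below: bought = 1, didn't buy = 0.
user_a = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
# A second, made-up user for comparison (hypothetical data).
user_b = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]

similarity = cosine_similarity(user_a, user_b)
```

With 0/1 entries the dot product is just the number of items both users bought, and each norm is the square root of that user's purchase count, so the zeros only affect the result through the items that are not shared.
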
I still prefer what Sean is recommending in terms of LLR for item to item links, but the cosine version does make sense to support, especially for purchase histories.

Even better would be to remember number of times an item was offered as well as the number of times it was purchased. This allows regression techniques to be applied, often with good results.

On Tue, Apr 26, 2011 at 12:16 AM, Sean Owen <[email protected]> wrote:

> I think my comment mostly addressed his comments. Yes, this is the
> definition of cosine distance, and is implemented. No it doesn't work over
> true binary data. There is no "0", only "1" or non-existent.
>
> What is the remaining question?
>
> On Tue, Apr 26, 2011 at 3:21 AM, Chris Waggoner <[email protected]> wrote:
>
> > I've never used Mahout but what this @allclaws wants sounds like a simple
> > proposition. Given a vector like
> >
> > bought
> > didn't buy
> > didn't buy
> > didn't buy
> > didn't buy
> > didn't buy
> > didn't buy
> > bought
> > didn't buy
> > bought
> > bought
> > bought
> >
> > define "bought" == 1 and "didn't buy" == 0. Define distance between two
> > such vectors to be { A dot B } over { |A| times |B| }. Not that I find this
> > compelling as a definition of similarity but @allclaws called this a first,
> > rough pass.
