I agree with Ted. If you really want to do this, use the Tanimoto similarity implementation in the job I described earlier, and you should get similarities ranked by overlap. It's one of the simplest similarity functions, but it's not a great idea: you will find that most of the 'recommendations' are skewed towards top-selling items.
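For reference, here is a rough sketch of what Tanimoto similarity computes, next to the "N% of users who bought this also bought that" figure. This is plain Python for illustration only, not Mahout's implementation; the function names (`tanimoto`, `also_bought_fraction`, `most_similar_items`) and the tiny purchase data set are made up for the example.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def also_bought_fraction(a, b):
    """Fraction of the buyers of A who also bought B: |A & B| / |A|."""
    return len(a & b) / len(a) if a else 0.0


def most_similar_items(purchases, item, k=3):
    """Rank all other items by Tanimoto similarity to `item`.

    `purchases` maps an item ID to the set of user IDs who bought it.
    Returns up to k (other_item, similarity) pairs, best first.
    """
    target = purchases[item]
    scored = [
        (other, tanimoto(target, users))
        for other, users in purchases.items()
        if other != item
    ]
    # Sort by descending similarity, then by item ID for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical data: item -> set of users who bought it.
purchases = {
    "book": {"u1", "u2", "u3", "u4"},
    "lamp": {"u1", "u2"},
    "pen": {"u4", "u5"},
}

print(most_similar_items(purchases, "book", k=2))
print(also_bought_fraction(purchases["lamp"], purchases["book"]))
```

Note that the "also bought" fraction is not symmetric (it divides by the buyers of the first item only), while Tanimoto divides by the union and so gives the same value in both directions.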
Something based on cooccurrence or a latent factor model should give better results. For example, I don't think Amazon actually uses this for most-similar item calculations. If it ever shows this value, it's probably just because it is something humans can understand as a justification. I would choose a different similarity metric.

These aren't recommendations; they're not personalized. They're just most-similar items. That may be fine if that's what you want, but you could also explore making actual personalized recommendations. That would take more computation, of course.

On Mon, Dec 3, 2012 at 8:03 AM, Ted Dunning <[email protected]> wrote:

> On Mon, Dec 3, 2012 at 3:06 AM, Koobas <[email protected]> wrote:
>
>> Thank you very much.
>> The pointer to Myrrix is a very useful piece of information.
>> Myrrix, however, relies on an iterative sparse matrix factorization to do
>> PCA.
>> I want to produce Amazon-like recommendations.
>> I.e., "70% of users who bought this, also bought that."
>>
>
> You can always quote figures like that no matter how you got the
> recommendation but it is usually very bad to simply use such cooccurrence
> statistics directly to form recommendations since they are seriously
> affected by accidental coincidence.
>
>
>> So, I specifically want the direct kNN algorithm.
>> Any clue what Mahout + Hadoop can deliver on that one?
>>
>
> Yes. Mahout can do this.
