Re: Mahout Amazon EMR usage cost

Sean Owen Mon, 03 Dec 2012 03:35:19 -0800

Agree with Ted. If you really want to do this, use the Tanimoto
similarity implementation in the job I described earlier and you
should have similarity ranked by overlap. It's one of the simplest
similarity functions. But it's not a great idea. You will find that
most of the 'recommendations' are skewed towards top-selling items.
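
To make the idea concrete, here is a minimal sketch of Tanimoto similarity
over sets of buyers. This is plain Python, not the Mahout job itself; the
names tanimoto, most_similar and buyers_by_item, and the toy data, are made
up for illustration.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient of two sets of user IDs:
    |A intersect B| / |A union B|."""
    a, b = set(users_a), set(users_b)
    union = len(a | b)
    if union == 0:
        # Neither item has any buyers; define the similarity as 0.
        return 0.0
    return len(a & b) / union


def most_similar(item, buyers_by_item, top_n=3):
    """Rank every other item by its Tanimoto similarity to `item`."""
    target = buyers_by_item[item]
    scored = [
        (other, tanimoto(target, buyers))
        for other, buyers in buyers_by_item.items()
        if other != item
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy data: item -> set of users who bought it.
buyers_by_item = {
    "book": {"u1", "u2", "u3", "u4"},
    "lamp": {"u1", "u2"},
    "desk": {"u3", "u4", "u5", "u6"},
    "pen": {"u7"},
}

if __name__ == "__main__":
    for other, score in most_similar("book", buyers_by_item):
        print(other, round(score, 3))
```

At real scale you would not compare all pairs in memory like this; that is
what the distributed job is for.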

Something based on cooccurrence or a latent factor model should give
better results. For example, I don't think Amazon actually uses this
for most-similar item calculations. If it ever shows this value, it's
probably just because it is something humans can understand as a
justification. I would choose a different similarity metric.
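
As a rough illustration of the "N% of users who bought this also bought
that" figure, the sketch below computes it as a plain conditional fraction.
The function name and the toy data are hypothetical, and this is only my
reading of how such a number would be derived, not anything Amazon has
documented.

```python
def also_bought_fraction(buyers_a, buyers_b):
    """Fraction of the buyers of item A who also bought item B."""
    a = set(buyers_a)
    if not a:
        # No buyers of A, so the fraction is undefined; report 0.
        return 0.0
    return len(a & set(buyers_b)) / len(a)


# Hypothetical toy data: a niche item and a best-seller.
niche = {"u1", "u2"}
bestseller = {"u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"}

if __name__ == "__main__":
    # Every buyer of the niche item also bought the best-seller ...
    print(also_bought_fraction(niche, bestseller))
    # ... but only a quarter of best-seller buyers bought the niche item.
    print(also_bought_fraction(bestseller, niche))
```

The figure is easy to explain, but a best-seller scores high against almost
any item simply because nearly everyone bought it.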

These aren't recommendations; they're not personalized. They're just
most-similar items. That may be fine if that's what you want but you
could also explore making actual personalized recommendations. That
would take more computation of course.

On Mon, Dec 3, 2012 at 8:03 AM, Ted Dunning <[email protected]> wrote:
> On Mon, Dec 3, 2012 at 3:06 AM, Koobas <[email protected]> wrote:
>
>> Thank you very much.
>> The pointer to Myrrix is a very useful piece of information.
>> Myrrix, however, relies on an iterative sparse matrix factorization to do
>> PCA.
>> I want to produce Amazon-like recommendations.
>> I.e., "70% of users who bought this, also bought that."
>>
>
> You can always quote figures like that no matter how you got the
> recommendation but it is usually very bad to simply use such cooccurrence
> statistics directly to form recommendations since they are seriously
> affected by accidental coincidence.
>
>
>> So, I specifically want the direct kNN algorithm.
>> Any clue what Mahout + Hadoop can deliver on that one?
>>
>
> Yes. Mahout can do this.
