I am very happy to see that I started a lively thread. I am a newcomer to the field, so this is all very useful.
Now yet another naive question. Ted is probably going to go ballistic ;) Assuming that simple overlap methods are a poor choice, is there still a metric that works better than the others (e.g. Tanimoto vs. Jaccard vs. something else)? A few small sketches at the bottom of this message try to make these alternatives concrete.

On Wed, Dec 5, 2012 at 3:24 AM, Paulo Villegas <[email protected]> wrote:

> I don't disagree at all with what you're saying. I never said (or intended
> to say) that explanations would have to be a thorough dump of the engine's
> internal computation; this would not make sense to the user and would just
> overwhelm him. Picking out a couple of representative items would be more
> than enough.
>
> And if the original algorithm is too complicated, yes, it may make sense
> to bring up an additional, simpler and more understandable engine just to
> pick out explanations. But then you need to ensure that the explanations
> fit well with the results you're actually delivering. And in any case, if
> you've got that additional engine and it works sensibly, you could just as
> well aggregate its results into the main system and build an ensemble. It
> may not work in all cases, but may do well in others. YMMV.
>
> I'm also not saying I know exactly what Amazon is doing internally; you
> need a lot more than a casual look at the UI to infer that. They could be
> doing frequent itemset mining, or they might not. But I maintain it can be
> a valid approach. A recommendation coming from association rules will have
> less coverage than a "standard" CF engine, and will probably miss a bigger
> part of the long tail, but for the goal of enlarging the basket of items
> the user is willing to buy in a single transaction it is perfectly well
> suited (i.e. don't find "the next best item", find "the item that goes
> along well with this one").
>
> And if you model transactions adequately (like items watched in a single
> browsing session, when you might assume that the user has a single main
> intent, as opposed to coming back the next day with a different thing in
> mind), then it might help to discard spurious associations (such as you
> sometimes see in Amazon, anyway). Of course, a similar effect can be
> achieved with a "standard" recommender engine if you introduce time
> effects.
>
>> On Wed, Dec 5, 2012 at 6:57 AM, Paulo Villegas <[email protected]> wrote:
>>
>>> On 05/12/12 00:53, Ted Dunning wrote:
>>>
>>>> Also, you have to separate UI considerations from algorithm
>>>> considerations. What algorithm populates the recommendations is the
>>>> recommender algorithm. It has two responsibilities: first, find items
>>>> that the users will like, and second, pick out a variety of less
>>>> certain items to learn about. It is not responsible for justifying
>>>> choices to the user. The UI does that, and it may use analytics of
>>>> some kind to make claims about choices made, but that won't change
>>>> the choices.
>>>
>>> Here I disagree: explaining recommendations to the user is an important
>>> factor in user acceptance (and therefore uptake) of the results, since
>>> if she can understand why some completely unknown item was recommended,
>>> it'll make her more confident that it's a good choice (this has also
>>> been proven experimentally).
>>
>> I have demonstrated that explanations help as well in some cases. Not in
>> all.
>>
>>> And the best one to know why something was recommended is the engine
>>> itself.
>>
>> This is simply not true. The engine may have very complex reasons for a
>> recommendation. This applies in classification as well.
>> It is completely conventional, and often critical to performance, to
>> have one engine for recommendation or classification and a completely
>> independent one for explanation.
>>
>>> That's one good additional reason why item-based neighbourhood is more
>>> advantageous than user-based: you can communicate item neighbours to
>>> the user, who then sees items she knows that are similar to the one
>>> being recommended (it's one of the things Amazon does in its
>>> recommendation lists).
>>
>> Again, this simply isn't that important. The major goal of the
>> recommendation engine is to produce high quality recommendations, and
>> one of the major problems in doing that is avoiding noise effects.
>> Ironically, it is also important for the recommendation engine to inject
>> metered amounts of a different kind of noise as well. Neither of those
>> capabilities makes sense to explain to the user, and they may actually
>> dominate the decisions.
>>
>> Once an explainer is given a clean set of recommendations, the problem
>> of explaining is vastly different from the job of recommending. For
>> instance, Tanimoto or Jaccard are horrible for recommendation but great
>> for explaining. The point is that the explainer doesn't have to explain
>> all of the items that are *not* shown, only those which are shown.
>>
>> Note that Amazon does not actually explain their market basket
>> recommendations. And in their personal recommendations (which they have
>> partially hidden now), you have to ask for the explanation. The
>> explanation that they give is typically one or two of your actions,
>> which is patently not a complete explanation. So they are clearly saying
>> one thing and doing another, just as I am recommending here.
>>
>>> Speaking about Amazon, the "also bought" UI thing is still there on
>>> their website, only *not* in their specific recommendation lists.
>>
>> But note that they don't give percentages any more. Also note that they
>> don't explain all of the things that they *don't* show you.
>>
>>> It's down in the page, in sections like "Continue Shopping: Customers
>>> Who Bought Items in Your Recent History Also Bought". It does not give
>>> % values now, but it's essentially the same (and it also works when you
>>> are not logged in, since it is using your recent viewing history).
>>> That's why I thought it's coming from Market Basket Analysis (i.e.
>>> frequent itemsets).
>>
>> I seriously doubt it. Frequent itemset mining is typically much more
>> expensive than simple recommendation.
>>
>>> Lift is indeed a good metric for the interestingness of a rule, but it
>>> can also produce unreasonably big values for rare itemsets. On the
>>> other hand, maybe this is good for uncovering long tail associations.
>>
>> I have built a number of commercially successful recommendation engines,
>> and simple overlap has always been a complete disaster. I have also
>> counseled a number of companies along the lines given here, and the
>> numbers they achieved when they switched to roughly what I am describing
>> were quite striking.
>>
>> The only time overlap is likely to work is if you have absolutely
>> massive data and can afford very high thresholds. That completely
>> obliterates the long tail.
>>
>> You can claim to understand a system like Amazon's from the UI, but I
>> would seriously doubt that you are seeing 5% of what the recommendation
>> engine is really doing.
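
A postscript on the metric question at the top: for the binary
(user-touched-item) sets a recommender typically compares, Tanimoto and
Jaccard are the same coefficient, so they aren't really competing options.
Here is a minimal sketch in plain Java; the class and the data are made up
purely for illustration:

    import java.util.HashSet;
    import java.util.Set;

    public class SetSimilarity {
        // Jaccard coefficient: |A ∩ B| / |A ∪ B|.
        // On binary (user-liked-item) sets this is exactly the Tanimoto
        // coefficient, so "Tanimoto vs. Jaccard" is not a real choice.
        static double jaccard(Set<Long> a, Set<Long> b) {
            if (a.isEmpty() && b.isEmpty()) {
                return 0.0;
            }
            Set<Long> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            // |A ∪ B| = |A| + |B| - |A ∩ B|, so no need to build the union.
            int union = a.size() + b.size() - intersection.size();
            return (double) intersection.size() / union;
        }

        public static void main(String[] args) {
            Set<Long> itemA = Set.of(1L, 2L, 3L, 4L); // users who touched item A
            Set<Long> itemB = Set.of(3L, 4L, 5L);     // users who touched item B
            System.out.println(jaccard(itemA, itemB)); // 2 / 5 = 0.4
        }
    }

Note Ted's point above, though: a score like this is easy to present to a
user next to a recommended item, which is a separate question from whether
it selects good recommendations in the first place.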
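Paulo's caveat that lift "can produce unreasonably big values for rare
itemsets" is easy to see with a worked example. Using
lift(A -> B) = P(A and B) / (P(A) * P(B)), with all counts invented for
illustration:

    public class LiftExample {
        // lift(A -> B) = P(A and B) / (P(A) * P(B))
        //              = (countAB * total) / (countA * countB)
        static double lift(long countA, long countB, long countAB, long total) {
            return (double) countAB * total / ((double) countA * countB);
        }

        public static void main(String[] args) {
            long total = 1_000_000; // hypothetical number of transactions

            // Common pair: each item in 10,000 baskets, together in 1,000.
            System.out.println(lift(10_000, 10_000, 1_000, total)); // 10.0

            // Rare pair: each item in 2 baskets, always together. Lift
            // explodes even though two co-occurrences prove nothing.
            System.out.println(lift(2, 2, 2, total)); // 500000.0
        }
    }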
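Finally, Ted's remark that raw overlap only works with massive data and
high thresholds can be made concrete too: a blockbuster item co-occurs with
everything simply because everyone has it, so its raw overlap count swamps
a genuinely related niche item. Again, these numbers are invented purely
for illustration:

    public class OverlapPitfall {
        public static void main(String[] args) {
            // Query item X has 1,000 users (hypothetical).
            // Blockbuster P: 100,000 users, 900 shared with X (mostly chance).
            long usersOfP = 100_000, overlapXP = 900;
            // Niche item Q: 1,000 users, 300 shared with X (real affinity).
            long usersOfQ = 1_000, overlapXQ = 300;

            // Raw overlap ranks the blockbuster far above the niche item...
            System.out.println("raw overlap: P=" + overlapXP + " Q=" + overlapXQ);

            // ...even though the fraction of each item's audience shared
            // with X tells the opposite story.
            System.out.println("share of P's users: " + (double) overlapXP / usersOfP); // 0.009
            System.out.println("share of Q's users: " + (double) overlapXQ / usersOfQ); // 0.3
        }
    }

Without a high count threshold or a noise-aware score, the popular item
wins every such comparison; and raising the threshold enough to suppress it
also wipes out the long tail, which is exactly the trade-off described
above.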
