Hello,

I'm trying to solve the problem of recommending/predicting 'my favorite
drink' and I'm hoping to get some support from this energetic community. Of
course, I plan to use Mahout to process a tremendous amount of data.

Problem definition:
There are 20 different drinks, e.g. Pepsi, Coke, Fanta, etc.
There are millions of customers of a supermarket who have been buying those
drinks over a period of, let's say, the last three months.

Data definition:

Drinks (id,name): [100:Pepsi],[101:Coke],....

Transactions (customer_id, list of bought drink ids):
1   100,100,100,101,101,101
2   100,102,106,106,106
...

The definition of 'my favorite drink' is a bit fuzzy. We don't have any
training data we could learn from (e.g. a list of fans for each drink); the
only things we have are the transactions and the customer data (id, age,
postcode). A customer may not have a favorite drink at all, and this should
be predicted as well.

These are the four approaches I have come up with for predicting 'my
favorite drink'.

1) 50% ratio - the drink I buy the most. If one drink accounts for more
than 50% of my transactions, then it is my favorite drink. Otherwise I
don't have a favorite drink.
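The 50% rule above can be sketched in a few lines (a minimal illustration,
not Mahout code; the function name and threshold parameter are my own):

```python
from collections import Counter

def favorite_by_ratio(drink_ids, threshold=0.5):
    """Return the most-bought drink id if it accounts for more than
    `threshold` of all transactions, otherwise None (no favorite)."""
    counts = Counter(drink_ids)
    drink, n = counts.most_common(1)[0]
    return drink if n / len(drink_ids) > threshold else None

print(favorite_by_ratio([100, 100, 100, 101, 101, 101]))  # 3/6 is not >50% -> None
print(favorite_by_ratio([100, 102, 106, 106, 106]))       # 106 at 60% -> 106
```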

2) Gini index, a more clever version of the 50% ratio. If I bought Pepsi 4
times and six other drinks once each, then Pepsi is my favorite drink. Gini
index = 1 minus the sum of the squared drink probabilities; in this case
Gini = 1 - ((4/10)^2 + 6*(1/10)^2). I have a favorite drink if the Gini
index is < 0.7.
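A quick sketch of the Gini computation (toy code, not Mahout). Note that
the example above evaluates to 0.78, which is worth checking against the
0.7 cutoff:

```python
from collections import Counter

def gini(drink_ids):
    """Gini impurity: 1 minus the sum of squared drink shares.
    Lower values mean purchases are concentrated on fewer drinks."""
    counts = Counter(drink_ids)
    total = len(drink_ids)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# 4 Pepsi (id 100) plus six other drinks bought once each:
ids = [100] * 4 + [101, 102, 103, 104, 105, 106]
print(round(gini(ids), 2))  # 1 - ((4/10)^2 + 6*(1/10)^2) = 0.78
```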

3) Rationale - my favorite drink does not necessarily have to be the one I
buy the most. For example, if I bought 49 CopaCopa drinks and 51 Pepsi
drinks, then CopaCopa is more likely my favorite one. This is based on the
observation that customers who buy CopaCopa are more likely to also buy
Pepsi (because Pepsi is a generally popular drink) than the other way
round. If I buy the same number of unpopular CopaCopa and popular Pepsi
drinks, it probably means I'm more likely a fan of CopaCopa.

Method 3a: Naive Bayes text classifier. For this approach I calculate the
priors - the probability of buying a given drink, estimated by maximum
likelihood from all customers' transaction data, e.g. P(Pepsi) = 0.2,
P(CopaCopa) = 0.02. Then I calculate the conditional probabilities of
buying one drink given that I also bought another, e.g.
P(CopaCopa | Pepsi) = 0.03 and P(Pepsi | CopaCopa) = 0.07.

A customer has a favorite drink if the posterior, e.g. P(CopaCopa | Pepsi,
Pepsi, CopaCopa, CopaCopa) (the probability of being a fan of CopaCopa
given that I bought both Pepsi and CopaCopa twice), is > 50%.

Data for Bayes classification (one record per single drink transaction).
These five records represent one customer who bought three drinks (101,
101, 102) and another customer who bought drink 105 twice:
drink_id(prior)
all_drink_ids_bought_by_customer_of_this_drink_transaction(prediction
record)
101 101,101,102
101 101,101,102
102 101,101,102
105 105,105
105 105,105
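The posterior scoring described above can be sketched as follows. All the
probability tables here are illustrative made-up numbers (only P(Pepsi) and
P(CopaCopa) come from my example above; the conditionals are invented for
the sketch), and the naive-independence assumption is that purchases are
conditionally independent given the favorite drink:

```python
from math import prod  # Python 3.8+

# Hypothetical probability tables; in practice these would be estimated
# from all customers' transactions (maximum likelihood).
priors = {"Pepsi": 0.2, "CopaCopa": 0.02}
cond = {  # P(item bought | customer is a fan of drink) -- invented numbers
    "Pepsi":    {"Pepsi": 0.9, "CopaCopa": 0.07},
    "CopaCopa": {"Pepsi": 0.3, "CopaCopa": 0.7},
}

def posterior(purchases):
    """Naive-Bayes posterior over candidate favorite drinks, assuming
    purchases are conditionally independent given the favorite."""
    scores = {d: priors[d] * prod(cond[d][item] for item in purchases)
              for d in priors}
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

post = posterior(["Pepsi", "Pepsi", "CopaCopa", "CopaCopa"])
# With these toy numbers, CopaCopa edges out Pepsi (~0.53 vs ~0.47),
# matching the rationale: equal purchases of a rare and a popular drink
# point toward the rare one.
```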

Method 3b: Logistic regression. I represent the transactions as follows:
the target is the transaction's drink id, and the predictor variables are
the percentages of drinks bought by the customer who placed this
transaction. E.g. for a single customer who bought Pepsi, Pepsi, and
CopaCopa, we have three classification records (one per transaction):

target, %pepsi, %copacopa, %coke, ...
pepsi, 2/3, 1/3, 0, 0, 0, 0, ...
pepsi, 2/3, 1/3, 0, 0, 0, 0, ...
copacopa, 2/3, 1/3, 0, 0, 0, 0, ...

A customer has a favorite drink if the logistic regression predicts a drink
with > 50% confidence. E.g. take a customer represented by the
classification record 0.1 (CopaCopa), 0.7 (Pepsi), 0 (Coke), ...: he is a
fan of Pepsi with a confidence level of 0.64.
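The feature construction for method 3b can be sketched as below (a toy
illustration; the drink list and function name are my own, and the actual
regression would be fitted by Mahout or a similar library on the resulting
rows):

```python
from collections import Counter

DRINKS = ["pepsi", "copacopa", "coke"]  # illustrative subset of the 20 drinks

def regression_rows(purchases):
    """One classification record per transaction: the target is the
    purchased drink, the predictors are the customer's overall drink
    shares (percentages)."""
    counts = Counter(purchases)
    shares = [counts[d] / len(purchases) for d in DRINKS]
    return [(target, shares) for target in purchases]

rows = regression_rows(["pepsi", "pepsi", "copacopa"])
# Every row carries the same predictor vector [2/3, 1/3, 0];
# the targets are pepsi, pepsi, copacopa, one per transaction.
```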

I would appreciate any feedback on the presented approaches. Maybe there is
a better way to address this problem? I would also be glad to hear about
papers describing similar prediction problems in other domains.

Regards.

-- 
Daniel Korzekwa
Software Engineer
priv: http://danmachine.com
blog: http://blog.danmachine.com
