If the purchase data is really all you have... then I don't know if you can do better than assuming that the drink that is bought most is the favorite (or perhaps one of your more subtle suggestions, yes).
This isn't a recommender problem, it seems, since you are not trying to suggest other liked drinks. It doesn't seem to be a supervised learning problem either, since you don't have outside information about what each customer really likes best, which is needed for training. Do we need to define "favorite drink" more specifically here? Are you really trying to predict future purchases?

On Thu, Jan 19, 2012 at 11:15 AM, Daniel Korzekwa <[email protected]> wrote:

> Hello,
>
> I'm trying to solve a problem of recommending/predicting 'my favorite
> drink' and I'm hoping to get some support from this energetic community.
> Of course, I plan to use Mahout to process a tremendous amount of data.
>
> Problem definition:
> There are 20 different drinks, e.g. Pepsi, Coke, Fanta, etc.
> There are millions of customers of a supermarket who have been buying
> those drinks over a period of, let's say, the last three months.
>
> Data definition:
>
> Drinks (id,name): [100:Pepsi],[101:Coke],....
>
> Transactions (customer_id, list of bought drink ids):
> 1 100,100,100,101,101,101
> 2 100,102,106,106,106...
> ....
>
> The definition of 'my favorite drink' is a bit foggy. We don't have any
> training data we can learn from, e.g. a list of fans for a given drink;
> the only things we have are transactions and customer data (id, age,
> postcode). A customer may not have a favorite drink, and this should be
> predicted as well.
>
> These are 4 approaches I came up with for predicting 'my favorite drink':
>
> 1) 50% ratio - the drink I buy the most. If the percentage of my
> transactions for one drink is >50%, then this is my favorite drink.
> Otherwise I don't have a favorite drink.
>
> 2) Gini index - a more clever version of the 50% ratio. If I bought Pepsi
> 4 times, and each of 6 other drinks only once, then Pepsi is my favorite
> drink. Gini index = 1 minus the sum of squares of drink probabilities. In
> this case Gini = 1 - ((4/10)^2 + 6*(1/10)^2). I have a favorite drink if
> Gini is <0.7.
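Approaches 1 and 2 are cheap enough per customer that they can be prototyped in a few lines before reaching for Mahout. Here is a minimal Python sketch; the function names and toy basket are mine, not from your post:

```python
from collections import Counter

def favorite_by_ratio(purchases, threshold=0.5):
    """Approach 1 sketch: the most-bought drink is the favorite,
    but only if it accounts for more than 50% of all purchases."""
    drink, n = Counter(purchases).most_common(1)[0]
    return drink if n / len(purchases) > threshold else None

def favorite_by_gini(purchases, gini_threshold=0.7):
    """Approach 2 sketch: Gini index = 1 minus the sum of squared
    drink shares. A low Gini means the purchases are concentrated
    on one drink, so a favorite likely exists."""
    n = len(purchases)
    gini = 1 - sum((c / n) ** 2 for c in Counter(purchases).values())
    if gini < gini_threshold:
        return Counter(purchases).most_common(1)[0][0]
    return None

# Your 4-Pepsi example: Gini = 1 - ((4/10)**2 + 6*(1/10)**2) = 0.78
buys = ["pepsi"] * 4 + ["coke", "fanta", "sprite", "7up", "tab", "rc"]
print(favorite_by_ratio(buys))  # None: pepsi is only 40% of purchases
print(favorite_by_gini(buys))   # None: 0.78 is not below the 0.7 cutoff
```

One thing the sketch makes visible: with the <0.7 cutoff, your own 4-Pepsi example (Gini = 0.78) comes out as "no favorite", even though you describe Pepsi as the favorite there, so the threshold probably needs tuning against the intuition you want to capture.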
> 3) Rationale - my favorite drink does not necessarily have to be the one
> I drink the most. For example, if I bought 49 CopaCopa drinks and 51
> Pepsi drinks, then CopaCopa is more likely my favorite one. This is based
> on the observation that customers who buy CopaCopa are more likely to
> also buy Pepsi (because Pepsi is a generally popular drink) than the
> other way round. If I buy the same number of unpopular CopaCopa and
> popular Pepsi drinks, then it probably means I'm more likely a fan of
> CopaCopa.
>
> Method 3a: Naive Bayes text classifier. For this approach I calculate
> priors - the probability of buying a given drink, using Maximum
> Likelihood estimates based on all customers' transaction data, e.g.
> P(Pepsi)=0.2, P(CopaCopa)=0.02. Then I calculate conditional
> probabilities of buying a drink given that I also bought something else,
> e.g. P(CopaCopa | Pepsi) = 0.03 and P(Pepsi | CopaCopa) = 0.07.
>
> A customer has a favorite drink if a posterior, e.g. P(CopaCopa | Pepsi,
> Pepsi, CopaCopa, CopaCopa) (the probability of being a fan of CopaCopa
> given that I bought both Pepsi and CopaCopa twice), is >50%.
>
> Data for Bayes classification (one record per drink transaction). These
> five records represent a customer who bought three drinks (101,101,102)
> and a customer who bought two drinks (105,105):
>
> drink_id(prior) all_drink_ids_bought_by_customer_of_this_drink_transaction(prediction record)
> 101 101,101,102
> 101 101,101,102
> 102 101,101,102
> 105 105,105
> 105 105,105
>
> Method 3b: Logistic regression. I represent transactions as:
> target = transaction drink id, predictor variables = percentages of
> drinks for the customer who placed this transaction. E.g. for a single
> customer who bought pepsi, pepsi, and copacopa, we have three
> classification records (one per transaction):
>
> target, %pepsi, %copacopa, %coke, .....
> pepsi, 2/3, 1/3, 0, 0, 0, 0...
> pepsi, 2/3, 1/3, 0, 0, 0, 0...
> copacopa, 2/3, 1/3, 0, 0, 0, 0...
> A customer has a favorite drink if logistic regression predicts a drink
> with >50% confidence. E.g. I take a customer represented by the
> classification record 0.1(CopaCopa), 0.7(Pepsi), 0(Coke)..... and the
> model says I'm a fan of Pepsi with a confidence level of 0.64.
>
> I would appreciate any feedback on the presented approaches. Maybe there
> is a better way to address this problem? I would also be glad to hear
> about papers describing similar prediction problems in various domains.
>
> Regards.
>
> --
> Daniel Korzekwa
> Software Engineer
> priv: http://danmachine.com
> blog: http://blog.danmachine.com
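Method 3a is also small enough to prototype on a sample before scaling up. Below is a sketch of one plausible reading of it: each candidate favorite drink is a class, the customer's purchases are the "words", priors come from overall drink frequencies, and P(bought d2 | fan of d1) is estimated from the pooled baskets of customers who bought d1. The toy data, the pooling choice, and the Laplace smoothing (added so a zero count can't wipe out a posterior) are my assumptions, not from your post:

```python
import math
from collections import Counter

# Hypothetical toy data: customer_id -> list of bought drink ids
transactions = {
    1: ["pepsi", "pepsi", "pepsi", "coke"],
    2: ["pepsi", "copacopa"],
    3: ["copacopa", "copacopa", "pepsi", "pepsi"],
}

def favorite_by_naive_bayes(purchases, transactions, threshold=0.5, alpha=1.0):
    """Method 3a sketch. Returns (favorite_drink, posterior) or
    (None, posterior) if no class clears the threshold. Assumes
    `purchases` only contains drinks seen in `transactions`."""
    all_buys = [d for basket in transactions.values() for d in basket]
    prior = Counter(all_buys)          # P(drink) numerators
    drinks = sorted(prior)
    total = len(all_buys)

    # Conditionals: for each candidate class d, pool the baskets of
    # every customer who bought d at least once, with Laplace smoothing.
    cond = {}
    for d in drinks:
        pooled = Counter()
        for basket in transactions.values():
            if d in basket:
                pooled.update(basket)
        denom = sum(pooled.values()) + alpha * len(drinks)
        cond[d] = {d2: (pooled[d2] + alpha) / denom for d2 in drinks}

    # Log-posterior per class, normalised to probabilities.
    log_post = {}
    for d in drinks:
        lp = math.log(prior[d] / total)
        for p in purchases:
            lp += math.log(cond[d][p])
        log_post[d] = lp
    m = max(log_post.values())
    weights = {d: math.exp(lp - m) for d, lp in log_post.items()}
    z = sum(weights.values())
    posterior = {d: w / z for d, w in weights.items()}

    best = max(posterior, key=posterior.get)
    p_best = posterior[best]
    return (best, p_best) if p_best > threshold else (None, p_best)

drink, p = favorite_by_naive_bayes(["pepsi", "pepsi", "copacopa", "copacopa"],
                                   transactions)
```

Running something like this on a few hundred sampled customers should tell you quickly whether the CopaCopa-vs-Pepsi intuition behind approach 3 actually falls out of the posteriors, or whether the popular drink's large prior swamps it, before you commit to the full Mahout pipeline.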
