I came to the same conclusion you describe below: it's neither pure recommendation nor classification. It sounds similar to clustering with a fixed number of classes, e.g. one class per drink and one additional class for non_fan. However, I haven't achieved any satisfying results with this method yet.
The reasoning behind this 'my favorite drink' question is: if a significant number of customers are fans of a particular drink, then maybe it's better to organize shelves (or web shelves) with drinks in a more fan-driven way to achieve better sales. Or, if we know that there are many fans of an unpopular drink such as CopaCopa, advertising it more strongly may give a big jump in sales. Those fans may be customers who buy more Pepsi than CopaCopa only because Pepsi is really well advertised and easily available. Looking only at an individual customer's transactions, he may seem to be a fan of Pepsi, whereas if we look at all transactions for all customers with, e.g., a Bayesian model, we can classify more customers as fans of CopaCopa, which may push us to advertise CopaCopa more aggressively.

It's quite difficult to find research papers on this kind of machine learning problem, so I was hoping to find someone on this forum to discuss it with.

Regards.
Daniel

2012/1/19 Sean Owen <[email protected]>

> If the purchase data is really all you have... then I don't know if
> you can do better than assuming that the drink that is bought most is
> the favorite (or perhaps one of your more subtle suggestions, yes).
>
> This isn't a recommender problem, it seems, since you are not trying
> to suggest other liked drinks.
>
> It doesn't seem to be a supervised learning problem either since you
> don't have outside information about what each customer really does
> like best, which is needed to train.
>
> Do we need to define "favorite drink" more specifically here? Are you
> really trying to predict future purchases?
>
> On Thu, Jan 19, 2012 at 11:15 AM, Daniel Korzekwa
> <[email protected]> wrote:
> > Hello,
> >
> > I'm trying to solve a problem of recommending/predicting 'my favorite
> > drink' and I'm hoping to get some support from this energetic community.
> > Of course, I plan to use Mahout to process a tremendous amount of data.
> >
> > Problem definition:
> > There are 20 different drinks, e.g. pepsi, coke, fanta, etc.
> > There are millions of customers of a supermarket who were buying those
> > drinks over a period of, let's say, the last three months.
> >
> > Data definition:
> >
> > Drinks (id,name): [100:Pepsi],[101:Coke],....
> >
> > Transactions
> > customer_id, list of bought drink ids
> > 1 100,100,100,101,101,101
> > 2 100, 102,106,106,106...
> > ....
> >
> > The definition of 'my favorite drink' is a bit foggy. We don't have any
> > training data we can learn from, e.g. a list of fans for a given drink;
> > the only things we have are transactions and customer data (id, age,
> > postcode). A customer may not have a favorite drink, and this should be
> > predicted as well.
> >
> > These are the 4 approaches I came up with for predicting 'my favorite
> > drink'.
> >
> > 1) 50% ratio - The drink I buy the most. If the percentage of my
> > transactions for that drink is >50%, then this is my favorite drink.
> > Otherwise I don't have a favorite drink.
> >
> > 2) Gini index, a more clever version of the 50% ratio. If I bought pepsi
> > 4 times, and 6 other drinks only once each, then Pepsi is my favorite
> > drink. Gini index = 1 minus the sum of squares of drink probabilities.
> > In this case Gini = 1 - ((4/10)^2 + 6*(1/10)^2). I have a favorite drink
> > if gini is <0.7.
> >
> > 3) Rationale - My favorite drink does not necessarily have to be the one
> > I drink the most. For example, if I bought 49 CopaCopa drinks and 51
> > Pepsi drinks, then CopaCopa is more likely my favorite one. This is based
> > on the observation that customers who buy CopaCopa are more likely to buy
> > Pepsi (because it is a generally popular drink) than the other way round.
> > If I buy the same number of unpopular CopaCopa and popular Pepsi drinks,
> > then it probably means I'm more likely a fan of CopaCopa.
> >
> > Method 3a: Naive Bayes Text classifier.
> > For this approach I calculate priors - the probability of buying a given
> > drink, using Maximum Likelihood based on all customers' transaction data,
> > e.g. P(Pepsi)=0.2, P(CopaCopa)=0.02. And then I calculate conditional
> > probabilities of buying a drink given I also bought something else, e.g.
> > P(CopaCopa | Pepsi) = 0.03 and P(Pepsi | CopaCopa) = 0.07.
> >
> > A customer has a favorite drink if a posterior, e.g. P(CopaCopa | Pepsi,
> > Pepsi, CopaCopa, CopaCopa) (the probability of being a fan of CopaCopa
> > given I bought both Pepsi and CopaCopa twice), is >50%.
> >
> > Data for Bayes classification (one record for a single drink
> > transaction). These five records represent a customer who bought three
> > drinks 101,101,102 and a customer who bought two drinks 105,105:
> > drink_id(prior)
> > all_drink_ids_bought_by_customer_of_this_drink_transaction(prediction record)
> > 101 101,101,102
> > 101 101,101,102
> > 102 101,101,102
> > 105 105,105
> > 105 105,105
> >
> > Method 3b: Logistic regression. I represent transactions as:
> >
> > Target = transaction drink id, prediction variables = percentages of
> > drinks for a given customer who placed this transaction, e.g. for a
> > single customer who bought pepsi, pepsi, and copacopa, we have three
> > classification records (one per transaction):
> >
> > target, %pepsi, %copacopa, %coke,.....
> > pepsi, 2/3,1/3,0,0,0,0...
> > pepsi, 2/3,1/3,0,0,0,0...
> > copacopa, 2/3,1/3,0,0,0,0...
> >
> > A customer has a favorite drink if logistic regression predicts a drink
> > with >50% confidence, e.g. I take a customer who is represented by the
> > classification record: 0.1(CopaCopa), 0.7(Pepsi),0(Coke)..... I'm a fan
> > of Pepsi with a confidence level of 0.64.
> >
> > I would appreciate any feedback on the presented approaches. Maybe there
> > is a better way to address this problem? I would also be glad to hear
> > about papers describing similar prediction problems in various domains.
> >
> > Regards.
> >
> > --
> > Daniel Korzekwa
> > Software Engineer
> > priv: http://danmachine.com
> > blog: http://blog.danmachine.com
>

--
Daniel Korzekwa
Software Engineer
priv: http://danmachine.com
blog: http://blog.danmachine.com
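
A minimal sketch of approaches 1) and 2) from the quoted message, written in plain Python rather than Mahout. The function names and the `min_share` / `max_gini` parameters are hypothetical and only serve the illustration; each customer is assumed to be a flat list of purchased drink ids, as in the Transactions example. Note that for the "4 x Pepsi plus 6 other drinks once each" example, 1 minus the sum of squared shares comes out at roughly 0.78, so with a 0.7 cutoff that customer would not be assigned a favorite; the threshold probably needs tuning.

```python
from collections import Counter


def favorite_by_share(purchases, min_share=0.5):
    """Approach 1: the most-bought drink, if its share exceeds min_share."""
    if not purchases:
        return None
    drink, count = Counter(purchases).most_common(1)[0]
    return drink if count / len(purchases) > min_share else None


def gini_index(purchases):
    """Approach 2 score: 1 minus the sum of squared purchase shares."""
    total = len(purchases)
    return 1.0 - sum((c / total) ** 2 for c in Counter(purchases).values())


def favorite_by_gini(purchases, max_gini=0.7):
    """Approach 2: the most-bought drink, if purchases are concentrated enough."""
    if not purchases or gini_index(purchases) >= max_gini:
        return None
    return Counter(purchases).most_common(1)[0][0]


# Customer from the Gini example: drink 100 four times, six others once each.
spread_out = [100] * 4 + [101, 102, 103, 104, 105, 106]
# A more concentrated customer (hypothetical data).
concentrated = [100, 100, 100, 101]

print(favorite_by_share(spread_out), favorite_by_gini(spread_out))
print(favorite_by_share(concentrated), favorite_by_gini(concentrated))
```

Both functions return `None` for "no favorite drink", which keeps the non_fan case explicit instead of forcing every customer into a drink class.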

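
The rationale in 3) (a customer who buys an unpopular drink almost as often as a popular one is more likely a fan of the unpopular one) can be illustrated with a much simpler stand-in than Method 3a or 3b: compare each drink's share in the customer's basket with its share across all customers. This ratio is usually called lift. It is not the Naive Bayes or logistic regression setup described in the quoted message, and `favorite_by_lift`, `global_share` and `min_lift` are hypothetical names chosen for this sketch. It assumes every purchased drink has an entry in `global_share`.

```python
from collections import Counter


def favorite_by_lift(purchases, global_share, min_lift=2.0):
    """Pick the drink whose share for this customer most exceeds its global share.

    purchases:    list of drink names bought by one customer
    global_share: dict mapping drink name -> share of all purchases (the priors)
    min_lift:     hypothetical cutoff; below it the customer has no favorite
    """
    if not purchases:
        return None
    total = len(purchases)
    lifts = {
        drink: (count / total) / global_share[drink]
        for drink, count in Counter(purchases).items()
    }
    best = max(lifts, key=lifts.get)
    return best if lifts[best] >= min_lift else None


# Priors taken from the example in the message.
priors = {"Pepsi": 0.2, "CopaCopa": 0.02}
customer = ["CopaCopa"] * 49 + ["Pepsi"] * 51

print(favorite_by_lift(customer, priors))
```

One known weakness of this sketch: a single purchase of a very rare drink produces a huge lift, so on real data the shares would need smoothing or a minimum purchase count before the ratio means much.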