Yes. I would start with the SGD system and possibly use the Naive Bayes models if you have massive amounts of data.
In fact, if you have fewer than 100,000 observations, I would strongly recommend using a more user-friendly system such as R.

Regardless of which system you use, you need to decide what kind of model to build. There are several natural alternatives:

a) Only one of the possible actions matters (or only one can be done) and the actions are not ordered. Use multinomial logistic regression (SGD implements this very nicely).

b) The actions nest in some way. An example might be progression by a web visitor toward economic conversion. Action 1 might be any visit, action 2 might be clicking on product information, action 3 might be putting an item in a shopping cart, and action 4 might be buying an item. These actions have a clear and important ordering, and all users who complete action n have completed all lower actions. Ordinal logistic regression is a natural choice here. Mahout does not really support this. You can do the poor man's version by using the largest action completed as the target for multinomial logistic regression.

c) The actions are relatively independent. Here you can start with n binary logistic regression models, one per action. This will ignore any nesting or implication structure among actions. Mahout can help here with the binary logistic regression.

On Wed, Nov 17, 2010 at 1:12 PM, Radu Spineanu wrote:

> Hi.
>
> We have data about users that perform certain actions:
> user, age, sex, interests has performed actions 1,2,3
> (training data)
>
> Our goal is to ask in real time how likely is it that another user having
> age, sex, interests would perform the same actions.
>
> Can we use mahout for this? If yes, which algorithm do you think would be
> best? Would it work if we had partial data, like only age?
>
> Thank you.
> -r.
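
P.S. To make the difference between options (b) and (c) above concrete, here is a minimal sketch in plain Python of how the training targets would be built in each case. It is not Mahout code. The record layout, the action ids 1 to 4, the function names, and the three sample users are all made up for illustration, and I am assuming that a user with no actions gets target 0.

```python
# Hypothetical action ids, ordered as in the web-funnel example
# (1 = visit, 2 = product info, 3 = add to cart, 4 = buy).
ACTIONS = [1, 2, 3, 4]


def largest_action_label(completed):
    """Option (b), poor man's version: a single multiclass target per user,
    the largest action completed (assumed 0 when no action was completed)."""
    return max(completed, default=0)


def binary_labels(completed, actions=ACTIONS):
    """Option (c): one independent 0/1 target per action,
    to be fed to n separate binary logistic regression models."""
    return {a: int(a in completed) for a in actions}


# Made-up training records: (age, sex, interests, completed actions).
users = [
    (34, "f", ["books"], {1, 2, 3}),
    (27, "m", ["music"], {1}),
    (45, "f", ["travel"], set()),
]

for age, sex, interests, completed in users:
    print(largest_action_label(completed), binary_labels(completed))
```

With the single target you train one multinomial model. With the dictionary of 0/1 targets you train one binary model per action and get a separate probability for each, at the cost of ignoring the nesting.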
