We're going to start with < 1,000 observations, but we have to be able to
scale out very quickly if it works. It could get to 100,000 observations
in 6-8 months.
The model is a combination of c) and b). All actions except the
first one are independent. If we build the model around c), would it be
hard to move to b) later on if that's the case? I want to go the easier
route for now.
Could you point me to books, docs, howtos, articles about getting up and
running with c)?
I'm a Debian Developer and I noticed Mahout is not in Debian. If I'm
able to wrap my head around everything and get it working I would love
to contribute back and package it.
> But your specific case will tell. Your most important priority will be to
> figure out how to test models realistically off-line.
What do you mean by this?
-r.
On 11/18/2010 12:03 AM, Ted Dunning wrote:
Yes.
I would start with the SGD system and possibly use the naive Bayes models if
you have massive amounts of data.
In fact, if you have < 100,000 observations I would strongly recommend using
a more user-friendly system such as R.
Regardless of which system, you need to decide what kind of model you need
to build. There are several natural alternatives:
a) only one of the possible actions matters (or only one can be done) and
the actions are not ordered. Use multinomial logistic regression (SGD
implements this very nicely).
b) the actions nest in some way. An example might be progression by a web
visitor toward economic conversion. Action 1 might be any visitor, action 2
is clicking on product information, action 3 might be putting an item in a
shopping cart and action 4 might be buying an item. These items have a
clear and important ordering and all users who complete action n have
completed all lower actions. Ordinal logistic regression is a natural
choice here. Mahout does not really support this. You can do the poor
man's version by just using the largest action completed as the target of
a multinomial logistic regression.
c) the actions are relatively independent. Here you can start with n binary
logistic regression models, one per action. This will ignore any nesting
or implication structure among actions. Mahout can help here with the
binary logistic regression.
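
To make options b) and c) a bit more concrete, here is a minimal sketch in
plain Python. It is not Mahout's API and not a recommendation for production
use; it only illustrates training one binary logistic regression per action
with SGD, plus the "largest action completed" label for the poor man's
version of b). All function names, the feature encoding, and the toy data are
hypothetical and chosen just for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_binary_sgd(rows, labels, epochs=200, rate=0.1, seed=0):
    """Fit one binary logistic regression with plain SGD.

    Returns the weight vector with the bias term first.
    """
    rng = random.Random(seed)
    weights = [0.0] * (len(rows[0]) + 1)
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = [1.0] + list(rows[i])
            p = sigmoid(sum(w * v for w, v in zip(weights, x)))
            err = labels[i] - p  # gradient of the log-likelihood w.r.t. z
            for j, v in enumerate(x):
                weights[j] += rate * err * v
    return weights

def predict(weights, row):
    """Estimated probability that the action is performed."""
    x = [1.0] + list(row)
    return sigmoid(sum(w * v for w, v in zip(weights, x)))

def train_per_action(rows, actions_done, n_actions):
    """Option c): one independent binary model per action (1..n_actions)."""
    return [
        train_binary_sgd(rows, [1 if a in done else 0 for done in actions_done])
        for a in range(1, n_actions + 1)
    ]

def largest_action(done):
    """Poor man's option b): the class label is the largest action completed."""
    return max(done, default=0)

# Hypothetical toy data: features are (age scaled to 0..1, sex coded 0/1).
rows = [(0.2, 0), (0.3, 1), (0.25, 0), (0.7, 1), (0.8, 0), (0.75, 1)]
actions_done = [{1}, {1}, {1}, {1, 2}, {1, 2}, {1, 2}]

models = train_per_action(rows, actions_done, n_actions=2)
scores = [predict(m, (0.78, 1)) for m in models]  # one probability per action
print(scores)
```

Scoring a new user is then just n cheap dot products, which fits the
real-time requirement. Moving from c) to the poor man's b) would mostly mean
swapping the per-action 0/1 labels for largest_action(...) and training a
single multinomial model on that label instead.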
On Wed, Nov 17, 2010 at 1:12 PM, Radu Spineanu <[email protected]> wrote:
Hi.
We have data about users that perform certain actions:
user, age, sex, interests has performed actions 1,2,3
(training data)
Our goal is to ask in real time how likely it is that another user having
a given age, sex, and interests would perform the same actions.
Can we use Mahout for this? If yes, which algorithm do you think would be
best? Would it work if we had partial data, like only age?
Thank you.
-r.