On Fri, Jul 22, 2011 at 3:33 AM, Svetlomir Kasabov <skasa...@smail.inf.fh-brs.de> wrote:
> thanks for your reply and detailed answer. I will probably use the L_1
> regularization since you recommended it. Can I use Mahout's class L1 for
> this case? Which other classes can be useful?

OnlineLogisticRegression and AdaptiveLogisticRegression are what you should
use. If you can find good and stable values for the annealing coefficients
in OnlineLogisticRegression, then you should be good with that and it will
be blazing fast. AdaptiveLogisticRegression will beat up your machines more
and may not give you quite as good a final answer. Both support L_1
regularization. As you suggest, the L1 class in Mahout is the way to signal
this to the learning algorithms. There is a rough configuration sketch at
the end of this message.

> Actually, I thought it could solve this problem more easily:
>
> Quote from:
> http://webcache.googleusercontent.com/search?q=cache:http://radiographics.rsna.org/content/30/1/13.full.pdf
>
> "Each regression coefficient describes the size of the contribution of the
> corresponding predictor variable to the outcome. The effect of the
> predictor variables on the outcome variable is commonly measured by using
> the odds ratio of the predictor variable, which represents the factor by
> which the odds of an outcome change for a one-unit change in the predictor
> variable. The odds ratio is estimated by taking the exponential of the
> coefficient (eg, exp[β1])."

This allows you to estimate the size of the coefficient, but not the error
bars on the coefficient. One pragmatic way to get those, if you have vats of
compute power and training data, is to bootstrap on your input. With really
large data, you can simply use a mapper to shard your input data and then
look at the variation in the coefficients in the output. With small training
data, you can build a special Hadoop input format that samples with
replacement from your training data and passes the data to a map-side
learning algorithm. The variability of the resulting coefficients gives you
an idea of the error bars. (See the bootstrap sketch at the end of this
message.)

> Can't I then simply evaluate "exp[β1]" and get the parameter significance
> for Y this way? Doesn't Mahout's logistic regression use it implicitly?

That gives you size, but not significance. It would still be nice to know
whether the error bars cross zero.

>> If you must do variable selection, you can run many alternative learning
>> algorithms at the same time with alternative variable selections. There
>> is a pretty easy way to get average log likelihood out of these learning
>> algorithms and the differences in these are (roughly) the log-likelihood
>> ratio that you are talking about.
>
> Which other algorithms could I use for this?

You would be in new ground for Mahout here. I would suggest

a) using normal logistic regression with a random sample of the input
variables. Use sqrt(n) variables where there are n possible variables. See
the Breiman and Cutler paper on random forests for more ideas on this.

b) using combinations of random forest and logistic regression. This is more
involved than just gluing random forest into (a) because you really want to
glue logistic regression as a classifier into the random forest system.

Rough sketches of (a) and of the log-likelihood comparison are also at the
end of this message.

I recommend starting simple. Do the bootstrap stuff I mentioned earlier
before you do the feature sharding stuff.

> Many thanks and best regards,
>
> Svetlomir Kasabov.
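P.S. Here are a few sketches to make the above concrete. They are untested
outlines against the current SGD classes, with placeholder hyperparameter
values that you will need to tune, not polished code. First, wiring up
OnlineLogisticRegression with the L1 prior:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class L1Example {
  public static void main(String[] args) {
    int numCategories = 2;    // binary outcome
    int numFeatures = 1000;   // length of the encoded feature vectors

    // The L1 prior is what drives uninformative coefficients to exactly
    // zero, which gives you implicit variable selection.
    OnlineLogisticRegression lr =
        new OnlineLogisticRegression(numCategories, numFeatures, new L1())
            .lambda(1e-4)        // regularization strength -- tune this
            .learningRate(1)     // annealing coefficients -- tune these too
            .alpha(1)
            .decayExponent(0.9);

    // Training loop: one call per (label, features) example.
    Vector v = new RandomAccessSparseVector(numFeatures);
    v.set(42, 1.0);              // hypothetical feature
    lr.train(1, v);

    // After training, the coefficients live in lr.getBeta(); for a binary
    // model, row 0 holds the weights.
    System.out.println(lr.getBeta().get(0, 42));
  }
}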
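Reading the odds ratio off a trained model is then just the exponential of
the coefficient, e.g. for the hypothetical feature index above:

double beta1 = lr.getBeta().get(0, 42);   // coefficient for feature 42
double oddsRatio = Math.exp(beta1);       // factor by which the odds change
                                          // per one-unit change in feature 42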
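The in-memory version of the bootstrap idea looks roughly like this. The
Example class is a hypothetical stand-in for however you hold labeled
vectors; the Hadoop variant just moves the resampling into an input format:

import java.util.List;
import java.util.Random;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class Bootstrap {
  // Hypothetical holder for one labeled training example.
  public static class Example {
    public int label;
    public Vector features;
  }

  // Train `rounds` models, each on a same-sized sample drawn with
  // replacement, and report the spread of one coefficient.
  public static void errorBars(List<Example> data, int numFeatures,
                               int rounds, int featureIndex) {
    Random rand = new Random();
    double sum = 0;
    double sumSq = 0;
    for (int i = 0; i < rounds; i++) {
      OnlineLogisticRegression lr =
          new OnlineLogisticRegression(2, numFeatures, new L1())
              .lambda(1e-4);
      for (int j = 0; j < data.size(); j++) {
        Example ex = data.get(rand.nextInt(data.size()));
        lr.train(ex.label, ex.features);
      }
      double beta = lr.getBeta().get(0, featureIndex);
      sum += beta;
      sumSq += beta * beta;
    }
    double mean = sum / rounds;
    double sd = Math.sqrt(sumSq / rounds - mean * mean);
    // Rough rule of thumb: if |mean| < 2 * sd, the error bars cross zero
    // and the coefficient probably isn't significant.
    System.out.printf("beta[%d] = %.4f +/- %.4f%n", featureIndex, mean, sd);
  }
}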
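Getting the average log likelihood to compare alternative variable
selections is easy because it is already on the classifier interface.
Reusing the hypothetical Example holder from the bootstrap sketch:

import java.util.List;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

public class ModelComparison {
  // Average held-out log likelihood for one trained model. The difference
  // between this number for two models trained on different variable
  // subsets is (roughly) the log-likelihood ratio discussed above.
  public static double averageLogLikelihood(OnlineLogisticRegression lr,
                                            List<Bootstrap.Example> heldOut) {
    double total = 0;
    for (Bootstrap.Example ex : heldOut) {
      total += lr.logLikelihood(ex.label, ex.features);
    }
    return total / heldOut.size();
  }
}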
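And for (a), one way to pick sqrt(n) of the variables at random for each
ensemble member is a partial Fisher-Yates shuffle plus a projection. This is
just an illustration of the idea, not anything Mahout provides:

import java.util.Random;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class RandomSubspace {
  // Choose round(sqrt(n)) distinct feature indices uniformly at random.
  public static int[] randomSubset(int n, Random rand) {
    int k = (int) Math.round(Math.sqrt(n));
    int[] all = new int[n];
    for (int i = 0; i < n; i++) {
      all[i] = i;
    }
    // Partial Fisher-Yates shuffle: after k steps, the first k entries
    // are a uniform sample without replacement.
    for (int i = 0; i < k; i++) {
      int j = i + rand.nextInt(n - i);
      int t = all[i];
      all[i] = all[j];
      all[j] = t;
    }
    int[] subset = new int[k];
    System.arraycopy(all, 0, subset, 0, k);
    return subset;
  }

  // Project a full feature vector onto the chosen subset before training
  // one member of the ensemble.
  public static Vector project(Vector full, int[] subset) {
    Vector v = new RandomAccessSparseVector(subset.length);
    for (int i = 0; i < subset.length; i++) {
      v.setQuick(i, full.get(subset[i]));
    }
    return v;
  }
}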