On Fri, Jul 22, 2011 at 3:33 AM, Svetlomir Kasabov <skasa...@smail.inf.fh-brs.de> wrote:
> thanks for your reply and detailed answer. I will probably use the L_1
> regularization since you recommended it. Can I use Mahout's class L1 for
> this case? Which other classes can be useful?

OnlineLogisticRegression and AdaptiveLogisticRegression are what you should
use. If you can find good and stable values for the annealing coefficients
in OnlineLogisticRegression, then you should be good with that and it will
be blazing fast. AdaptiveLogisticRegression will beat up your machines more
and may not give you quite as good a final answer. Both support L_1
regularization. As you suggest, the L1 class in Mahout is the way to signal
this to the learning algorithms. There is a rough configuration sketch at
the end of this message.

> Actually, I thought it could solve this problem more easily:
>
> Quote from:
> http://webcache.googleusercontent.com/search?q=cache:http://radiographics.rsna.org/content/30/1/13.full.pdf
>
> "Each regression coefficient describes the size of the contribution of the
> corresponding predictor variable to the outcome. The effect of the
> predictor variables on the outcome variable is commonly measured by using
> the odds ratio of the predictor variable, which represents the factor by
> which the odds of an outcome change for a one-unit change in the predictor
> variable. The odds ratio is estimated by taking the exponential of the
> coefficient (eg, exp[β1])."

This allows you to estimate the size of the coefficient, but not the error
bars on the coefficient. One pragmatic way to get those, if you have vats of
compute power and training data, is to bootstrap on your input. With really
large data, you can simply use a mapper to shard your input data and then
look at the variation in the coefficients in the output. With small training
data, you can build a special Hadoop input format that samples with
replacement from your training data and passes the data to a map-side
learning algorithm. The variability of the resulting coefficients gives you
an idea of the error bars. (See the bootstrap sketch at the end of this
message.)

> Can't I then simply evaluate "exp[β1]" and get the parameter significance
> for Y this way? Doesn't Mahout's logistic regression use it implicitly?

That gives you size, but not significance. It would still be nice to know
whether the error bars cross zero.

>> If you must do variable selection, you can run many alternative learning
>> algorithms at the same time with alternative variable selections. There
>> is a pretty easy way to get average log likelihood out of these learning
>> algorithms and the differences in these are (roughly) the log-likelihood
>> ratio that you are talking about.
>
> Which other algorithms could I use for this?

You would be in new ground for Mahout here. I would suggest

a) using normal logistic regression with a random sample of the input
variables. Use sqrt(n) variables where there are n possible variables. See
the Breiman and Cutler paper on random forests for more ideas on this.

b) using combinations of random forest and logistic regression. This is more
involved than just gluing random forest into (a) because you really want to
glue logistic regression as a classifier into the random forest system.

Rough sketches of (a) and of the log-likelihood comparison are also at the
end of this message.

I recommend starting simple. Do the bootstrap stuff I mentioned earlier
before you do the feature sharding stuff.

> Many thanks and best regards,
>
> Svetlomir Kasabov.
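P.S. Here are a few sketches to make the above concrete. They are untested
outlines against the current SGD classes, with placeholder hyperparameter
values that you will need to tune, not polished code. First, wiring up
OnlineLogisticRegression with the L1 prior:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class L1Example {
  public static void main(String[] args) {
    int numCategories = 2;    // binary outcome
    int numFeatures = 1000;   // length of the encoded feature vectors

    // The L1 prior is what drives uninformative coefficients to exactly
    // zero, which gives you implicit variable selection.
    OnlineLogisticRegression lr =
        new OnlineLogisticRegression(numCategories, numFeatures, new L1())
            .lambda(1e-4)        // regularization strength -- tune this
            .learningRate(1)     // annealing coefficients -- tune these too
            .alpha(1)
            .decayExponent(0.9);

    // Training loop: one call per (label, features) example.
    Vector v = new RandomAccessSparseVector(numFeatures);
    v.set(42, 1.0);              // hypothetical feature
    lr.train(1, v);

    // After training, the coefficients live in lr.getBeta(); for a binary
    // model, row 0 holds the weights.
    System.out.println(lr.getBeta().get(0, 42));
  }
}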
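Reading the odds ratio off a trained model is then just the exponential of
the coefficient, e.g. for the hypothetical feature index above:

double beta1 = lr.getBeta().get(0, 42);   // coefficient for feature 42
double oddsRatio = Math.exp(beta1);       // factor by which the odds change
                                          // per one-unit change in feature 42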
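The in-memory version of the bootstrap idea looks roughly like this. The
Example class is a hypothetical stand-in for however you hold labeled
vectors; the Hadoop variant just moves the resampling into an input format:

import java.util.List;
import java.util.Random;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class Bootstrap {
  // Hypothetical holder for one labeled training example.
  public static class Example {
    public int label;
    public Vector features;
  }

  // Train `rounds` models, each on a same-sized sample drawn with
  // replacement, and report the spread of one coefficient.
  public static void errorBars(List<Example> data, int numFeatures,
                               int rounds, int featureIndex) {
    Random rand = new Random();
    double sum = 0;
    double sumSq = 0;
    for (int i = 0; i < rounds; i++) {
      OnlineLogisticRegression lr =
          new OnlineLogisticRegression(2, numFeatures, new L1())
              .lambda(1e-4);
      for (int j = 0; j < data.size(); j++) {
        Example ex = data.get(rand.nextInt(data.size()));
        lr.train(ex.label, ex.features);
      }
      double beta = lr.getBeta().get(0, featureIndex);
      sum += beta;
      sumSq += beta * beta;
    }
    double mean = sum / rounds;
    double sd = Math.sqrt(sumSq / rounds - mean * mean);
    // Rough rule of thumb: if |mean| < 2 * sd, the error bars cross zero
    // and the coefficient probably isn't significant.
    System.out.printf("beta[%d] = %.4f +/- %.4f%n", featureIndex, mean, sd);
  }
}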
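Getting the average log likelihood to compare alternative variable
selections is easy because it is already on the classifier interface.
Reusing the hypothetical Example holder from the bootstrap sketch:

import java.util.List;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

public class ModelComparison {
  // Average held-out log likelihood for one trained model. The difference
  // between this number for two models trained on different variable
  // subsets is (roughly) the log-likelihood ratio discussed above.
  public static double averageLogLikelihood(OnlineLogisticRegression lr,
                                            List<Bootstrap.Example> heldOut) {
    double total = 0;
    for (Bootstrap.Example ex : heldOut) {
      total += lr.logLikelihood(ex.label, ex.features);
    }
    return total / heldOut.size();
  }
}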
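And for (a), one way to pick sqrt(n) of the variables at random for each
ensemble member is a partial Fisher-Yates shuffle plus a projection. This is
just an illustration of the idea, not anything Mahout provides:

import java.util.Random;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class RandomSubspace {
  // Choose round(sqrt(n)) distinct feature indices uniformly at random.
  public static int[] randomSubset(int n, Random rand) {
    int k = (int) Math.round(Math.sqrt(n));
    int[] all = new int[n];
    for (int i = 0; i < n; i++) {
      all[i] = i;
    }
    // Partial Fisher-Yates shuffle: after k steps, the first k entries
    // are a uniform sample without replacement.
    for (int i = 0; i < k; i++) {
      int j = i + rand.nextInt(n - i);
      int t = all[i];
      all[i] = all[j];
      all[j] = t;
    }
    int[] subset = new int[k];
    System.arraycopy(all, 0, subset, 0, k);
    return subset;
  }

  // Project a full feature vector onto the chosen subset before training
  // one member of the ensemble.
  public static Vector project(Vector full, int[] subset) {
    Vector v = new RandomAccessSparseVector(subset.length);
    for (int i = 0; i < subset.length; i++) {
      v.setQuick(i, full.get(subset[i]));
    }
    return v;
  }
}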