Re: [R] logistic regression with 50 variables

2010-06-14 Thread Marc Schwartz
On Jun 13, 2010, at 10:20 PM, array chip wrote:

 Hi, this is not an R technical question per se. I know there are many excellent 
 statisticians on this list, so here is my question: I have a dataset with ~1800 
 observations and 50 independent variables, so there are about 35 samples per 
 variable. Is it wise to build a stable multiple logistic model with 50 
 independent variables? Any problem with this approach? Thanks
 
 John


The general rule of thumb is to have 10-20 'events' per covariate degree of 
freedom. Frank has suggested that in some cases that number should be as high 
as 25. 

The number of events is the count of the less frequent of the two outcomes of 
your binary dependent variable.

Covariate degrees of freedom refers to the number of columns in the model 
matrix: continuous variables contribute 1, binary factors 1, and K-level 
factors K - 1.

So if out of your 1800 records, you have at least 500 to 1000 events, depending 
upon how many of your 50 variables are K-level factors and whether or not you 
need to consider interactions, you may be OK. Better if towards the high end of 
that range, especially if the model is for prediction versus explanation.
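
For instance, a quick back-of-the-envelope check in R (the toy data frame and 
the assumed event count of 600 below are illustrative only, not John's data):

# Illustrative only: count covariate df via the model matrix, then check
# events per df for a hypothetical event count
toy <- data.frame(age  = c(40, 52, 61, 45),
                  sex  = factor(c("F", "M", "F", "M")),
                  site = factor(c("A", "B", "C", "D")))
ncol(model.matrix(~ ., data = toy)) - 1   # 1 + 1 + (4 - 1) = 5 covariate df

n_events <- 600                  # assumed count of the less frequent outcome
n_events / 50                    # events per df with 50 continuous covariates: 12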

Two excellent references would be Frank's book:

  
http://www.amazon.com/Regression-Modeling-Strategies-Frank-Harrell/dp/0387952322/

and Steyerberg's book:

  
http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X/

to assist in providing guidance for model building/validation techniques.

HTH,

Marc Schwartz

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression with 50 variables

2010-06-14 Thread Joris Meys
Hi,

Marc's explanation is valid to a certain extent, but I don't agree with
his conclusion. I'd like to point out the curse of dimensionality
(Hughes effect), which starts to play a role rather quickly.

The curse of dimensionality is easily demonstrated by looking at the
proximity between your data points. Say we scale the interval in one
dimension to be 1 unit. If you have 20 evenly spaced observations, the
distance between the observations is 0.05 units. To have a proximity
like that in a 2-dimensional space, you need 20^2 = 400 observations; in
a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
distance between your observations is important, as a sparse dataset
will definitely make your model misbehave.
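
A quick illustration of that scaling (the numbers are purely illustrative):

# Illustrative only: observations needed to keep a grid spacing of 0.05
# per axis as the number of dimensions grows
spacing <- 0.05
per_dim <- 1 / spacing                      # 20 points per dimension
dims    <- c(1, 2, 10, 50)
data.frame(dimensions = dims, points_needed = per_dim^dims)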

Even with about 35 samples per variable, using 50 independent
variables will render a highly unstable model, as your data space is
about as sparse as it can get. On top of that, interpreting a model
with 50 variables is close to impossible, and I haven't even started
on interactions. No point in trying, I'd say. If you really need all
that information, you might want to take a look at some dimension
reduction methods first.

Cheers
Joris

On Mon, Jun 14, 2010 at 2:55 PM, Marc Schwartz marc_schwa...@me.com wrote:
 On Jun 13, 2010, at 10:20 PM, array chip wrote:

 Hi, this is not an R technical question per se. I know there are many excellent 
 statisticians on this list, so here is my question: I have a dataset with ~1800 
 observations and 50 independent variables, so there are about 35 samples per 
 variable. Is it wise to build a stable multiple logistic model with 50 
 independent variables? Any problem with this approach? Thanks

 John


 The general rule of thumb is to have 10-20 'events' per covariate degree of 
 freedom. Frank has suggested that in some cases that number should be as high 
 as 25.

 The number of events is the count of the less frequent of the two outcomes of 
 your binary dependent variable.

 Covariate degrees of freedom refers to the number of columns in the model 
 matrix: continuous variables contribute 1, binary factors 1, and K-level 
 factors K - 1.

 So if out of your 1800 records, you have at least 500 to 1000 events, 
 depending upon how many of your 50 variables are K-level factors and whether 
 or not you need to consider interactions, you may be OK. Better if towards 
 the high end of that range, especially if the model is for prediction versus 
 explanation.

 Two excellent references would be Frank's book:

  http://www.amazon.com/Regression-Modeling-Strategies-Frank-Harrell/dp/0387952322/

 and Steyerberg's book:

  http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X/

 to assist in providing guidance for model building/validation techniques.

 HTH,

 Marc Schwartz





-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php



Re: [R] logistic regression with 50 variables

2010-06-14 Thread Claudia Beleites

Dear all,

(I sent this first part of the email to John earlier today, but forgot to copy 
the list as well)

Dear John,

 Hi, this is not an R technical question per se. I know there are many excellent
 statisticians on this list, so here is my question: I have a dataset with ~1800
 observations and 50 independent variables, so there are about 35 samples per
 variable. Is it wise to build a stable multiple logistic model with 50
 independent variables? Any problem with this approach? Thanks

First: I'm not a statistician, but a spectroscopist.
But I do build logistic regression models with far fewer than 1800 samples and 
far more variates (e.g. 75 patients / 256 spectral measurement channels), though 
I have many measurements per sample: typically several hundred spectra per sample.


Question: are the 1800 real, independent samples?

Model stability is something you can measure.
Do an honest validation of your model with really _independent_ test data and 
measure the stability according to what your stability needs are (e.g. stable 
parameters or stable predictions?).
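
A minimal sketch of one way to measure it (bootstrap refits on a simulated 
placeholder dataset; none of this is John's data):

# Minimal sketch: bootstrap the model and inspect coefficient (in)stability.
# 'dat' is a simulated placeholder with 1800 rows and 50 covariates.
set.seed(1)
dat   <- as.data.frame(matrix(rnorm(1800 * 50), ncol = 50))
dat$y <- rbinom(1800, 1, plogis(rowSums(dat) / 7))

boot_coefs <- replicate(50, {
  idx <- sample(nrow(dat), replace = TRUE)
  coef(glm(y ~ ., data = dat[idx, ], family = binomial))
})
apply(boot_coefs, 1, sd)   # spread of each coefficient across bootstrap refits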




(From here on reply to Joris)

 Marc's explanation is valid to a certain extent, but I don't agree with
 his conclusion. I'd like to point out the curse of dimensionality
 (Hughes effect), which starts to play a role rather quickly.
No doubt.

 The curse of dimensionality is easily demonstrated by looking at the
 proximity between your data points. Say we scale the interval in one
 dimension to be 1 unit. If you have 20 evenly spaced observations, the
 distance between the observations is 0.05 units. To have a proximity
 like that in a 2-dimensional space, you need 20^2 = 400 observations; in
 a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
 distance between your observations is important, as a sparse dataset
 will definitely make your model misbehave.

But won't the distance between groups also grow?
No doubt that high-dimensional spaces are _very_ unintuitive.

However, the required sample size may grow substantially more slowly if the 
model has appropriate restrictions. I remember the recommendation of at least 5 
samples per class and variate for linear classification models, i.e. not to get 
a good model, but to have a reasonable chance of getting a stable model.


 Even with about 35 samples per variable, using 50 independent
 variables will render a highly unstable model,
Am I wrong in thinking that there may be a substantial difference between 
stability of predictions and stability of model parameters?


BTW: if the models are unstable, there's also aggregation.

At least for my spectra I can give toy examples with a physical-chemical 
explanation that yield the same prediction with different parameters (of course 
because of correlation).


 as your data space is
 about as sparse as it can get. On top of that, interpreting a model
 with 50 variables is close to impossible,
No, not necessarily. IMHO it depends very much on the meaning of the variables. 
E.g. for the spectra, a set of model parameters may be interpreted like spectra 
or difference spectra. Of course this has to do with the fact that a parallel 
coordinate plot is the more natural view of spectra compared to a point in so 
many dimensions.


 and I haven't even started
 on interactions. No point in trying, I'd say. If you really need all
 that information, you might want to take a look at some dimension
 reduction methods first.

Which brings to mind a question I've had for a long time:
I assume that all variables that I know beforehand to be without information 
are already discarded.
The dimensionality is then further reduced in a data-driven way (e.g. by PCA or 
PLS). The model is built in the reduced space.

How many fewer samples are actually needed, considering the fact that the 
dimension reduction is a model estimated on the data?
...which of course also means that the honest validation embraces the 
data-driven dimensionality reduction as well...


Are there recommendations about that?
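
One hedged sketch of what such an "honest" scheme could look like (the 
dimension reduction is re-estimated inside every resampling fold; the data, the 
10-component choice and the accuracy measure are all placeholders):

# Sketch only: cross-validation in which PCA is refit on each training fold,
# so the data-driven reduction is part of what gets validated.
set.seed(1)
dat   <- as.data.frame(matrix(rnorm(1800 * 50), ncol = 50))
dat$y <- rbinom(1800, 1, plogis(rowSums(dat) / 7))

folds  <- sample(rep(1:10, length.out = nrow(dat)))
cv_acc <- sapply(1:10, function(k) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  pca   <- prcomp(train[, -51], center = TRUE, scale. = TRUE)  # reduction on training data only
  tr    <- data.frame(pca$x[, 1:10], y = train$y)
  te    <- data.frame(predict(pca, test[, -51])[, 1:10])
  fit   <- glm(y ~ ., data = tr, family = binomial)
  p     <- predict(fit, newdata = te, type = "response")
  mean((p > 0.5) == test$y)      # simple accuracy as a stand-in for a real measure
})
mean(cv_acc)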


The other curious question I have is:
I assume that it is impossible for him to obtain the 10^xy samples required for 
comfortable model building.

So what is he to do?


Cheers,

Claudia



--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbelei...@units.it



Re: [R] logistic regression with 50 variables

2010-06-14 Thread Robert A LaBudde
I think the real issue is why the fit is being done. If it is solely to 
interpolate and condense the dataset, the number of variables is not an 
important issue.

If the issue is developing a model that will capture causality, it is hard to 
believe that can be accomplished with 50+ variables. With this many, some kind 
of hunt would have to be done, and the resulting model would not be really 
stable. It would be better perhaps to first reduce the variable set by, say, 
principal components analysis, so that a reasonably sized set results.

If a stable and meaningful model is the goal, each term in the final model 
should be plausibly causal.
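
A hedged sketch of that kind of two-step approach (principal components feeding 
the logistic model; the data are simulated placeholders and the choice of 10 
components is arbitrary):

# Sketch only: reduce 50 covariates to a handful of principal components,
# then fit the logistic model on the component scores.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(1800 * 50), ncol = 50))
y   <- rbinom(1800, 1, plogis(rowSums(dat) / 7))

pca    <- prcomp(dat, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:10])          # keep the first 10 components
fit    <- glm(y ~ ., data = cbind(scores, y = y), family = binomial)
summary(fit)

Note that the components are chosen without reference to y, so their terms 
still need a plausible subject-matter reading to meet the "plausibly causal" 
criterion.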


At 10:36 AM 6/14/2010, Claudia Beleites wrote:

Dear all,

(I sent this first part of the email to John earlier today, but forgot to copy 
the list as well)

Dear John,

 Hi, this is not an R technical question per se. I know there are many excellent
 statisticians on this list, so here is my question: I have a dataset with ~1800
 observations and 50 independent variables, so there are about 35 samples per
 variable. Is it wise to build a stable multiple logistic model with 50
 independent variables? Any problem with this approach? Thanks

First: I'm not a statistician, but a spectroscopist.
But I do build logistic regression models with far fewer than 1800 samples and 
far more variates (e.g. 75 patients / 256 spectral measurement channels), though 
I have many measurements per sample: typically several hundred spectra per 
sample.

Question: are the 1800 real, independent samples?

Model stability is something you can measure.
Do an honest validation of your model with really _independent_ test data and 
measure the stability according to what your stability needs are (e.g. stable 
parameters or stable predictions?).




(From here on reply to Joris)

 Marc's explanation is valid to a certain extent, but I don't agree with
 his conclusion. I'd like to point out the curse of dimensionality
 (Hughes effect), which starts to play a role rather quickly.
No doubt.

 The curse of dimensionality is easily demonstrated by looking at the
 proximity between your data points. Say we scale the interval in one
 dimension to be 1 unit. If you have 20 evenly spaced observations, the
 distance between the observations is 0.05 units. To have a proximity
 like that in a 2-dimensional space, you need 20^2 = 400 observations; in
 a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
 distance between your observations is important, as a sparse dataset
 will definitely make your model misbehave.

But won't the distance between groups also grow?
No doubt that high-dimensional spaces are _very_ unintuitive.

However, the required sample size may grow substantially more slowly if the 
model has appropriate restrictions. I remember the recommendation of at least 5 
samples per class and variate for linear classification models, i.e. not to get 
a good model, but to have a reasonable chance of getting a stable model.


 Even with about 35 samples per variable, using 50 independent
 variables will render a highly unstable model,
Am I wrong in thinking that there may be a substantial difference between 
stability of predictions and stability of model parameters?

BTW: if the models are unstable, there's also aggregation.

At least for my spectra I can give toy examples with a physical-chemical 
explanation that yield the same prediction with different parameters (of course 
because of correlation).

 as your data space is
 about as sparse as it can get. On top of that, interpreting a model
 with 50 variables is close to impossible,
No, not necessarily. IMHO it depends very much on the meaning of the variables. 
E.g. for the spectra, a set of model parameters may be interpreted like spectra 
or difference spectra. Of course this has to do with the fact that a parallel 
coordinate plot is the more natural view of spectra compared to a point in so 
many dimensions.


 and I haven't even started
 on interactions. No point in trying, I'd say. If you really need all
 that information, you might want to take a look at some dimension
 reduction methods first.

Which brings to mind a question I've had for a long time:
I assume that all variables that I know beforehand to be without information 
are already discarded.
The dimensionality is then further reduced in a data-driven way (e.g. by PCA or 
PLS). The model is built in the reduced space.

How many fewer samples are actually needed, considering the fact that the 
dimension reduction is a model estimated on the data?
...which of course also means that the honest validation embraces the 
data-driven dimensionality reduction as well...

Are there recommendations about that?

The other curious question I have is:
I assume that it is impossible for him to obtain the 10^xy samples required for 
comfortable model building.

So what is he to do?


Cheers,

Claudia



--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università 

Re: [R] logistic regression with 50 variables

2010-06-14 Thread Marc Schwartz
Joris,

There are two separate issues here:

1. Can you consider an LR model with 50 covariates?

2. Should you have 50 covariates in your LR model?


The answer to 1 is certainly yes, given what I noted below as a general working 
framework. I have personally been involved with the development and validation 
of LR models with ~35 covariates, albeit with notably larger datasets than 
discussed below, because the models are used for prediction. In fact, the 
current incarnations of those same models, now 15 years later, appear to have 
40 covariates and are quite stable. The interpretation of the models by both 
statisticians and clinicians is relatively straightforward.

The answer to 2 gets into the subject matter that you raise, which is to 
consider other factors beyond the initial rules of thumb for minimum sample 
size. These get into reasonable data reduction methods, the consideration of 
collinearity, subject matter expertise, sparse data, etc.
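
As one small, hedged illustration of a collinearity screen (base R only; X is a 
made-up design matrix, and the condition-number threshold of ~30 is a common 
rule of thumb rather than anything specific to this thread):

# Sketch: quick collinearity screen on a made-up design matrix X
set.seed(1)
X <- matrix(rnorm(1800 * 50), ncol = 50)
X[, 2] <- X[, 1] + rnorm(1800, sd = 0.1)   # inject one nearly collinear column

max(abs(cor(X)[upper.tri(cor(X))]))        # largest pairwise correlation
kappa(scale(X), exact = TRUE)              # condition number; values >> 30 suggest trouble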

The issues raised in number 2 are discussed in the two references that I noted.

Two additional references that might be helpful here on the first point are:

P. Peduzzi, J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein. A 
simulation study of the number of events per variable in logistic regression 
analysis. J Clin Epi, 49:1373–1379, 1996. 

E. Vittinghoff and C. E. McCulloch. Relaxing the rule of ten events per 
variable in logistic and Cox regression. Am J Epi, 165:710–718, 2006.


Regards,

Marc

On Jun 14, 2010, at 8:38 AM, Joris Meys wrote:

 Hi,
 
 Marc's explanation is valid to a certain extent, but I don't agree with
 his conclusion. I'd like to point out the curse of dimensionality
 (Hughes effect), which starts to play a role rather quickly.
 
 The curse of dimensionality is easily demonstrated by looking at the
 proximity between your data points. Say we scale the interval in one
 dimension to be 1 unit. If you have 20 evenly spaced observations, the
 distance between the observations is 0.05 units. To have a proximity
 like that in a 2-dimensional space, you need 20^2 = 400 observations; in
 a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
 distance between your observations is important, as a sparse dataset
 will definitely make your model misbehave.
 
 Even with about 35 samples per variable, using 50 independent
 variables will render a highly unstable model, as your data space is
 about as sparse as it can get. On top of that, interpreting a model
 with 50 variables is close to impossible, and I haven't even started
 on interactions. No point in trying, I'd say. If you really need all
 that information, you might want to take a look at some dimension
 reduction methods first.
 
 Cheers
 Joris
 
 On Mon, Jun 14, 2010 at 2:55 PM, Marc Schwartz marc_schwa...@me.com wrote:
 On Jun 13, 2010, at 10:20 PM, array chip wrote:
 
 Hi, this is not an R technical question per se. I know there are many excellent 
 statisticians on this list, so here is my question: I have a dataset with ~1800 
 observations and 50 independent variables, so there are about 35 samples per 
 variable. Is it wise to build a stable multiple logistic model with 50 
 independent variables? Any problem with this approach? Thanks
 
 John
 
 
 The general rule of thumb is to have 10-20 'events' per covariate degree of 
 freedom. Frank has suggested that in some cases that number should be as 
 high as 25.
 
 The number of events is the count of the less frequent of the two outcomes of 
 your binary dependent variable.
 
 Covariate degrees of freedom refers to the number of columns in the model 
 matrix: continuous variables contribute 1, binary factors 1, and K-level 
 factors K - 1.
 
 So if out of your 1800 records, you have at least 500 to 1000 events, 
 depending upon how many of your 50 variables are K-level factors and whether 
 or not you need to consider interactions, you may be OK. Better if towards 
 the high end of that range, especially if the model is for prediction versus 
 explanation.
 
 Two excellent references would be Frank's book:
 
  
 http://www.amazon.com/Regression-Modeling-Strategies-Frank-Harrell/dp/0387952322/
 
 and Steyerberg's book:
 
  
 http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X/
 
 to assist in providing guidance for model building/validation techniques.
 
 HTH,
 
 Marc Schwartz



Re: [R] logistic regression with 50 variables

2010-06-14 Thread Charles C. Berry

On Mon, 14 Jun 2010, Joris Meys wrote:


Hi,

Marc's explanation is valid to a certain extent, but I don't agree with
his conclusion. I'd like to point out the curse of dimensionality
(Hughes effect), which starts to play a role rather quickly.



Ahem!

... minimal, self-contained, reproducible code ...

The OP's situation may not be an impossible one:


> set.seed(54321)
> dat <- as.data.frame(matrix(rnorm(1800*50),nc=50))
> dat$y <- rbinom(1800,1,plogis(rowSums(dat)/7))
> fit <- glm(y~., dat, family=binomial)
> 1/7 # the true coef
[1] 0.1428571
> sd(coef(fit)) # roughly, the common standard error
[1] 0.05605597
> colMeans(coef(summary(fit))) # what glm() got
  Estimate Std. Error    z value   Pr(>|z|)
0.14590836 0.05380666 2.71067328 0.06354820
> # a trickier case:
> set.seed(54321)
> dat <- as.data.frame(matrix(rnorm(1800*50),nc=50))
> dat$y <- rbinom(1800,1,plogis(rowSums(dat))) # try again with coef==1
> fit <- glm(y~., dat, family=binomial)
> colMeans(coef(summary(fit)))
   Estimate  Std. Error     z value    Pr(>|z|)
0.982944012 0.119063631 8.222138491 0.008458002




Finer examination of the latter fit will show some values that differ too 
far from 1.0 to agree with the asymptotic std err.



> sd(coef(fit)) # rather bigger than 0.119
[1] 0.1827462
> range(coef(fit))
[1] -0.08128586  1.25797057

And near separability may be playing here:


> cbind(
+ table(
+ cut(
+ plogis(abs(predict(fit))),
+ c( 0, 0.9, 0.99, 0.999, 0.9999, 0.99999, 1 ) ) ) )
                 [,1]
(0,0.9]           453
(0.9,0.99]        427
(0.99,0.999]      313
(0.999,0.9999]    251
(0.9999,0.99999]  173
(0.99999,1]       183




Recall that the observed information contains a factor of

plogis( predict(fit) )* plogis( -predict(fit))


> hist(plogis( predict(fit) )* plogis( -predict(fit)))


So the effective sample size here was much reduced.
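
(As an added illustration, not part of the original transcript, that factor can 
be summed and compared with the nominal sample size:)

p <- plogis(predict(fit))
sum(p * (1 - p))   # far below n/4 = 450, the largest value possible for n = 1800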

But to the OP's question, whether what you get is reasonable depends on 
what the setup is.


I wouldn't call the first of the above cases 'highly unstable'.

Which is not to say that one cannot generate difficult cases (esp. with 
correlated covariates and/or one or more highly influential covariates) 
and that the OP's case is not one of them.



HTH,

Chuck


The curse of dimensionality is easily demonstrated by looking at the
proximity between your data points. Say we scale the interval in one
dimension to be 1 unit. If you have 20 evenly spaced observations, the
distance between the observations is 0.05 units. To have a proximity
like that in a 2-dimensional space, you need 20^2 = 400 observations; in
a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
distance between your observations is important, as a sparse dataset
will definitely make your model misbehave.

Even with about 35 samples per variable, using 50 independent
variables will render a highly unstable model, as your data space is
about as sparse as it can get. On top of that, interpreting a model
with 50 variables is close to impossible, and I haven't even started
on interactions. No point in trying, I'd say. If you really need all
that information, you might want to take a look at some dimension
reduction methods first.

Cheers
Joris

On Mon, Jun 14, 2010 at 2:55 PM, Marc Schwartz marc_schwa...@me.com wrote:

On Jun 13, 2010, at 10:20 PM, array chip wrote:


Hi, this is not an R technical question per se. I know there are many excellent 
statisticians on this list, so here is my question: I have a dataset with ~1800 
observations and 50 independent variables, so there are about 35 samples per 
variable. Is it wise to build a stable multiple logistic model with 50 
independent variables? Any problem with this approach? Thanks

John



The general rule of thumb is to have 10-20 'events' per covariate degree of 
freedom. Frank has suggested that in some cases that number should be as high 
as 25.

The number of events is the count of the less frequent of the two outcomes of 
your binary dependent variable.

Covariate degrees of freedom refers to the number of columns in the model 
matrix: continuous variables contribute 1, binary factors 1, and K-level 
factors K - 1.

So if out of your 1800 records, you have at least 500 to 1000 events, depending 
upon how many of your 50 variables are K-level factors and whether or not you 
need to consider interactions, you may be OK. Better if towards the high end of 
that range, especially if the model is for prediction versus explanation.

Two excellent references would be Frank's book:

 http://www.amazon.com/Regression-Modeling-Strategies-Frank-Harrell/dp/0387952322/

and Steyerberg's book:

 http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X/

to assist in providing guidance for model building/validation techniques.

HTH,

Marc Schwartz


Re: [R] logistic regression with 50 variables

2010-06-14 Thread array chip
Thanks Charles for the reproducible code. I started this question because I 
was asked to take a look at such a dataset, but I have doubts about whether it's 
meaningful to do an LR with 50 variables. I haven't got the dataset yet, thus 
have not tried any code. But thanks again for sharing some simulation code.

I have one question about your code. When you calculate common standard errors 
for coefficients using sd(coef(fit)), should you exclude the intercept by doing 
sd(coef(fit)[-1])? Actually, after removing the intercept, the standard error 
calculated this way is very similar to the one reported by 
colMeans(coef(summary(fit))), for both scenarios in your example (coef = 0.14 
or 1.0).

Another question: the 50 simulated variables have very low correlations 
(ranging from -0.1 to 0.08), which may contribute to the stable model. If some 
(not all) of the 50 variables had considerable correlation, like 0.7 or 0.8, 
how would the LR model behave?

Thanks

John



- Original Message 
From: Charles C. Berry cbe...@tajo.ucsd.edu
To: Joris Meys jorism...@gmail.com
Cc: r-help@r-project.org; Marc Schwartz marc_schwa...@me.com
Sent: Mon, June 14, 2010 8:32:02 AM
Subject: Re: [R] logistic regression with 50 variables

On Mon, 14 Jun 2010, Joris Meys wrote:

 Hi,

 Marc's explanation is valid to a certain extent, but I don't agree with
 his conclusion. I'd like to point out the curse of dimensionality
 (Hughes effect), which starts to play a role rather quickly.


Ahem!

... minimal, self-contained, reproducible code ...

The OP's situation may not be an impossible one:

> set.seed(54321)
> dat <- as.data.frame(matrix(rnorm(1800*50),nc=50))
> dat$y <- rbinom(1800,1,plogis(rowSums(dat)/7))
> fit <- glm(y~., dat, family=binomial)
> 1/7 # the true coef
[1] 0.1428571
> sd(coef(fit)) # roughly, the common standard error
[1] 0.05605597
> colMeans(coef(summary(fit))) # what glm() got
  Estimate Std. Error    z value   Pr(>|z|)
0.14590836 0.05380666 2.71067328 0.06354820
> # a trickier case:
> set.seed(54321)
> dat <- as.data.frame(matrix(rnorm(1800*50),nc=50))
> dat$y <- rbinom(1800,1,plogis(rowSums(dat))) # try again with coef==1
> fit <- glm(y~., dat, family=binomial)
> colMeans(coef(summary(fit)))
   Estimate  Std. Error     z value    Pr(>|z|)
0.982944012 0.119063631 8.222138491 0.008458002


Finer examination of the latter fit will show some values that differ too 
far from 1.0 to agree with the asymptotic std err.

> sd(coef(fit)) # rather bigger than 0.119
[1] 0.1827462
> range(coef(fit))
[1] -0.08128586  1.25797057

And near separability may be playing here:

> cbind(
+ table(
+ cut(
+ plogis(abs(predict(fit))),
+ c( 0, 0.9, 0.99, 0.999, 0.9999, 0.99999, 1 ) ) ) )
                 [,1]
(0,0.9]           453
(0.9,0.99]        427
(0.99,0.999]      313
(0.999,0.9999]    251
(0.9999,0.99999]  173
(0.99999,1]       183


Recall that the observed information contains a factor of

plogis( predict(fit) )* plogis( -predict(fit))

> hist(plogis( predict(fit) )* plogis( -predict(fit)))

So the effective sample size here was much reduced.

But to the OP's question, whether what you get is reasonable depends on 
what the setup is.

I wouldn't call the first of the above cases 'highly unstable'.

Which is not to say that one cannot generate difficult cases (esp. with 
correlated covariates and/or one or more highly influential covariates) 
and that the OP's case is not one of them.


HTH,

Chuck

 The curse of dimensionality is easily demonstrated by looking at the
 proximity between your data points. Say we scale the interval in one
 dimension to be 1 unit. If you have 20 evenly spaced observations, the
 distance between the observations is 0.05 units. To have a proximity
 like that in a 2-dimensional space, you need 20^2 = 400 observations; in
 a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
 distance between your observations is important, as a sparse dataset
 will definitely make your model misbehave.

 Even with about 35 samples per variable, using 50 independent
 variables will render a highly unstable model, as your data space is
 about as sparse as it can get. On top of that, interpreting a model
 with 50 variables is close to impossible, and I haven't even started
 on interactions. No point in trying, I'd say. If you really need all
 that information, you might want to take a look at some dimension
 reduction methods first.

 Cheers
 Joris

 On Mon, Jun 14, 2010 at 2:55 PM, Marc Schwartz marc_schwa...@me.com wrote:
 On Jun 13, 2010, at 10:20 PM, array chip wrote:

 Hi, this is not an R technical question per se. I know there are many excellent 
 statisticians on this list, so here is my question: I have a dataset with ~1800 
 observations and 50 independent variables, so there are about 35 samples per 
 variable. Is it wise to build a stable multiple logistic model with 50 
 independent variables? Any problem with this approach? Thanks

 John


 The general rule of thumb is to have 10-20 'events' per covariate degree of 
 freedom. Frank has suggested

Re: [R] logistic regression with 50 variables

2010-06-14 Thread Charles C. Berry

On Mon, 14 Jun 2010, array chip wrote:

Thanks Charles for the reproducible code. I started this question because I 
was asked to take a look at such a dataset, but I have doubts about whether 
it's meaningful to do an LR with 50 variables. I haven't got the dataset yet, 
thus have not tried any code. But thanks again for sharing some simulation 
code.

I have one question about your code. When you calculate common standard errors 
for coefficients using sd(coef(fit)), should you exclude the intercept by doing 
sd(coef(fit)[-1])? Actually, after removing the intercept, the standard error 
calculated this way is very similar to the one reported by 
colMeans(coef(summary(fit))), for both scenarios in your example (coef = 0.14 
or 1.0).


Of course! I slipped up on that one!
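
(For the record, the corrected line would simply be:)

sd(coef(fit)[-1])   # drop the intercept before summarizing the slope estimates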



Another question: the 50 simulated variables have very low correlations 
(ranging from -0.1 to 0.08), which may contribute to the stable model. If some 
(not all) of the 50 variables had considerable correlation, like 0.7 or 0.8, 
how would the LR model behave?




Why not try it out and see?

The mvtnorm package has a function for generating correlated random 
variates.
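
For example, a sketch along those lines (this assumes the mvtnorm package is 
installed; the 0.8 correlation block and the other settings are illustrative 
choices, not from Chuck's original simulation):

# Sketch: rerun the earlier simulation with a block of strongly correlated covariates
library(mvtnorm)

set.seed(54321)
Sigma <- diag(50)
Sigma[1:5, 1:5] <- 0.8              # first five covariates correlated at 0.8 ...
diag(Sigma) <- 1                    # ... with unit variances
dat   <- as.data.frame(rmvnorm(1800, sigma = Sigma))
dat$y <- rbinom(1800, 1, plogis(rowSums(dat) / 7))
fit   <- glm(y ~ ., data = dat, family = binomial)
sd(coef(fit)[-1])                   # compare with the uncorrelated case above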


HTH,

Chuck


Thanks

John



- Original Message 
From: Charles C. Berry cbe...@tajo.ucsd.edu
To: Joris Meys jorism...@gmail.com
Cc: r-help@r-project.org; Marc Schwartz marc_schwa...@me.com
Sent: Mon, June 14, 2010 8:32:02 AM
Subject: Re: [R] logistic regression with 50 variables

On Mon, 14 Jun 2010, Joris Meys wrote:


Hi,

Marc's explanation is valid to a certain extent, but I don't agree with
his conclusion. I'd like to point out the curse of dimensionality
(Hughes effect), which starts to play a role rather quickly.



Ahem!

... minimal, self-contained, reproducible code ...

The OP's situation may not be an impossible one:


> set.seed(54321)
> dat <- as.data.frame(matrix(rnorm(1800*50),nc=50))
> dat$y <- rbinom(1800,1,plogis(rowSums(dat)/7))
> fit <- glm(y~., dat, family=binomial)
> 1/7 # the true coef
[1] 0.1428571
> sd(coef(fit)) # roughly, the common standard error
[1] 0.05605597
> colMeans(coef(summary(fit))) # what glm() got
  Estimate Std. Error    z value   Pr(>|z|)
0.14590836 0.05380666 2.71067328 0.06354820
> # a trickier case:
> set.seed(54321)
> dat <- as.data.frame(matrix(rnorm(1800*50),nc=50))
> dat$y <- rbinom(1800,1,plogis(rowSums(dat))) # try again with coef==1
> fit <- glm(y~., dat, family=binomial)
> colMeans(coef(summary(fit)))
   Estimate  Std. Error     z value    Pr(>|z|)
0.982944012 0.119063631 8.222138491 0.008458002




Finer examination of the latter fit will show some values that differ too
far from 1.0 to agree with the asymptotic std err.


> sd(coef(fit)) # rather bigger than 0.119
[1] 0.1827462
> range(coef(fit))
[1] -0.08128586  1.25797057

And near separability may be playing here:


> cbind(
+ table(
+ cut(
+ plogis(abs(predict(fit))),
+ c( 0, 0.9, 0.99, 0.999, 0.9999, 0.99999, 1 ) ) ) )
                 [,1]
(0,0.9]           453
(0.9,0.99]        427
(0.99,0.999]      313
(0.999,0.9999]    251
(0.9999,0.99999]  173
(0.99999,1]       183




Recall that the observed information contains a factor of

   plogis( predict(fit) )* plogis( -predict(fit))


> hist(plogis( predict(fit) )* plogis( -predict(fit)))


So the effective sample size here was much reduced.

But to the OP's question, whether what you get is reasonable depends on
what the setup is.

I wouldn't call the first of the above cases 'highly unstable'.

Which is not to say that one cannot generate difficult cases (esp. with
correlated covariates and/or one or more highly influential covariates)
and that the OP's case is not one of them.


HTH,

Chuck


The curse of dimensionality is easily demonstrated by looking at the
proximity between your data points. Say we scale the interval in one
dimension to be 1 unit. If you have 20 evenly spaced observations, the
distance between the observations is 0.05 units. To have a proximity
like that in a 2-dimensional space, you need 20^2 = 400 observations; in
a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
distance between your observations is important, as a sparse dataset
will definitely make your model misbehave.

Even with about 35 samples per variable, using 50 independent
variables will render a highly unstable model, as your data space is
about as sparse as it can get. On top of that, interpreting a model
with 50 variables is close to impossible, and I haven't even started
on interactions. No point in trying, I'd say. If you really need all
that information, you might want to take a look at some dimension
reduction methods first.

Cheers
Joris

On Mon, Jun 14, 2010 at 2:55 PM, Marc Schwartz marc_schwa...@me.com wrote:

On Jun 13, 2010, at 10:20 PM, array chip wrote:


Hi, this is not an R technical question per se. I know there are many excellent 
statisticians on this list, so here is my question: I have a dataset with ~1800 
observations and 50 independent variables, so there are about 35 samples per 
variable. Is it wise to build a stable multiple logistic model with 50

[R] logistic regression with 50 variables

2010-06-13 Thread array chip
Hi, this is not an R technical question per se. I know there are many excellent 
statisticians on this list, so here is my question: I have a dataset with ~1800 
observations and 50 independent variables, so there are about 35 samples per 
variable. Is it wise to build a stable multiple logistic model with 50 
independent variables? Any problem with this approach? Thanks

John
