[R] missing values in logistic regression

2004-10-29 Thread Avril Coghlan
Dear R help list,

   I am trying to do a logistic regression
where I have a categorical response variable Y
and two numerical predictors X1 and X2. There
are quite a lot of missing values for predictor X2.
e.g.,

Y      X1    X2
red    0.6   0.2   *
red    0.5   0.2   *
red    0.5   NA
red    0.5   NA
green  0.2   0.1   *
green  0.1   NA
green  0.1   NA
green  0.05  0.05  *


I am wondering whether I can combine X1 and X2 in
a logistic regression to predict Y, using
all the data for X1, even though there are NAs in
the X2 data?

Or do I have to take only the cases for which
there is data for both X1 and X2? (marked
with *s above)

I will be very grateful for any help,

sincerely,
Avril Coghlan
University College Dublin, Ireland

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] missing values in logistic regression

2004-10-29 Thread Peter Dalgaard
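The built-in function for logistic regression, glm(), gives you a
complete-case analysis by default: any row with an NA in a variable
used in the model is dropped (the default na.action is na.omit).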
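
As a minimal sketch of what complete-case analysis means in practice, the
following uses simulated data to show that glm() fits only on the rows where
X2 is observed. The variable names follow the question; the sample size,
coefficients and amount of missingness are made-up assumptions.

```r
# Simulated stand-in for the poster's data (all numbers are assumptions).
set.seed(1)
n  <- 200
X1 <- runif(n)
X2 <- runif(n)
Y  <- factor(ifelse(runif(n) < plogis(-2 + 2 * X1 + 2 * X2), "red", "green"))
X2[sample(n, 80)] <- NA            # knock out 80 of the 200 X2 values
d  <- data.frame(Y, X1, X2)

# Default na.action is na.omit, so rows with a missing X2 are dropped.
fit <- glm(Y ~ X1 + X2, family = binomial, data = d)

nobs(fit)                # number of rows actually used in the fit
sum(complete.cases(d))   # number of rows with no NA: the same number
length(fit$na.action)    # number of rows that were dropped
```

A fit of Y ~ X1 alone on the same data frame would use all the rows, so the
two fits are based on different sets of cases.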

For more advanced handling of missing values, you need to look into
imputation methods. At least two CRAN packages deal with this,
namely mix and mitools. The former is support software for a
book, which you'll probably want to consult.

-- 
   O__   Peter Dalgaard Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics 2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark  Ph: (+45) 35327918
~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907



Re: [R] missing values in logistic regression

2004-10-29 Thread Prof Brian Ripley
You need to either

1) Train separate models for Y | X1 and Y | X1, X2 and use the appropriate 
one.

2) Produce an imputation model for X2 | X1, and use multiple imputation.

Given that the predictors look like [0, 1] scores, mix (as suggested by PD)
is not likely to be appropriate, but e.g. a 2D kernel density estimate (kde)
fit may well be.
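
Option 1 above can be sketched roughly as follows. This is only an
illustration on simulated data: the data-generating numbers and the helper
name predict_either are assumptions, not anything from the original question.

```r
# Simulated data with X2 missing for some rows (all numbers are assumptions).
set.seed(2)
n  <- 300
X1 <- runif(n)
X2 <- pmin(pmax(X1 / 2 + rnorm(n, sd = 0.1), 0), 1)   # keep X2 inside [0, 1]
Y  <- factor(ifelse(runif(n) < plogis(-3 + 4 * X1 + 4 * X2), "red", "green"),
             levels = c("green", "red"))
X2[sample(n, 120)] <- NA
d  <- data.frame(Y, X1, X2)

# Two separate models: Y | X1 uses every row, Y | X1, X2 only complete cases.
fit_x1  <- glm(Y ~ X1,      family = binomial, data = d)
fit_x12 <- glm(Y ~ X1 + X2, family = binomial, data = d)

# Hypothetical helper: use the richer model wherever X2 is available.
predict_either <- function(newdata) {
  p <- predict(fit_x1, newdata, type = "response")
  has_x2 <- !is.na(newdata$X2)
  p[has_x2] <- predict(fit_x12, newdata[has_x2, , drop = FALSE],
                       type = "response")
  p
}

p_red <- predict_either(d)   # estimated P(Y = "red") for every row
```

Every case gets a prediction this way, but the two models are estimated from
different subsets, so their coefficients should not be compared directly.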


-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595



RE: [R] missing values in logistic regression

2004-10-29 Thread Ted Harding
I don't know of any R routine directly aimed at logistic regression
with missing values as you describe.

However, if you are prepared to assume (or try to arrange by a
judiciously chosen transformation) that the distribution of (X1,X2)
is bivariate normal, with mean dependent on the value of Y but
with the same variance-covariance matrix throughout, then you
should be able to make progress along the following lines.
This ties in with Peter Dalgaard's suggestion of mix.
I shall assume for this explanation that your Y categories take
only two values, A and B (as with red and green), though the method can
be directly extended to several categories in Y.

The underlying theoretical point is that a linear logistic
regression is equivalent to a Bayesian discrimination between
two normally-distributed clusters. Let the vector of means for
(X1,X2) be mA for group A, and mB for group B; and let the
covariance matrix be V. Let x denote (X1,X2).

Then P(A|x) = [f(x|A)*p(A)]/[f(x|A)*p(A) + f(x|B)*p(B)]

where p(A) and p(B) are the prior probabilities of a group A
or a group B item.

Now substitute

 f(x|A) = C*exp(-0.5*(x-mA)'%*%W%*%(x-mA))

and similarly for f(x|B); C is the constant 1/sqrt((2*pi)^k * det(V)),
where k is the dimension of x, and W is the inverse of V.

Then, with a bit of algebra,

 P(A|x) = 1/(1 + exp(a + b%*%x))

(a logistic regression) where a is the scalar

 log(p(B)/p(A)) + 0.5*(mA'%*%W%*%mA - mB'%*%W%*%mB)

and b is the vector

 (mB - mA)'%*%W
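
The "bit of algebra" can be written out as below, as a sketch in LaTeX
notation using the same symbols (m_A, m_B, W, p(A), p(B)). The normalising
constant C cancels in the ratio, and the quadratic term x'Wx cancels because
both groups share the same W.

```latex
\begin{align*}
P(A \mid x) &= \frac{1}{1 + \dfrac{f(x \mid B)\,p(B)}{f(x \mid A)\,p(A)}} \\[4pt]
\frac{f(x \mid B)\,p(B)}{f(x \mid A)\,p(A)}
  &= \frac{p(B)}{p(A)}
     \exp\!\Big( -\tfrac12 (x - m_B)' W (x - m_B)
                 + \tfrac12 (x - m_A)' W (x - m_A) \Big) \\
  &= \exp\!\Big( \underbrace{\log\frac{p(B)}{p(A)}
                 + \tfrac12 \big( m_A' W m_A - m_B' W m_B \big)}_{a}
                 + \underbrace{(m_B - m_A)' W}_{b}\, x \Big) \\[4pt]
\text{so}\quad P(A \mid x) &= \frac{1}{1 + \exp(a + b\,x)} .
\end{align*}
```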

Now you can come back to the mix package. This is for multiple
imputation of missing values in a dataset consisting of variables
of two kinds: categorical and continuous.

The joint probability model for all the variables is expressed as a
product of the multinomial distribution for the categorical variables,
with a multivariate normal distribution for the continuous variables
where it is assumed that the covariance matrix is the same for every
combination of the values of the categorical variables, while the
multivariate means may differ at different levels of the categoricals.
Hence the underlying model for the mix package is exactly what is
needed for the above.

The primary output from imputation runs with mix is a set of
completed datasets (with missing values filled in). You can then
run a logistic regression on each completed dataset, obtaining
for each dataset the estimates of the regression parameters and
their standard errors. These can then be combined using the function
mi.inference in the mix library.
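
The impute, fit and combine cycle can be illustrated without the mix package.
The sketch below is a hand-rolled substitute, not the mix workflow itself: it
imputes X2 from a normal linear regression on X1 and Y (the conditional model
implied by the common-covariance assumption above), drawing the regression
parameters afresh for each imputation, and then pools the logistic fits with
Rubin's rules written out by hand. The simulated data and the choice of 20
imputations are assumptions.

```r
# Simulated data following the common-covariance normal model (assumed numbers).
set.seed(3)
n     <- 400
Y     <- factor(sample(c("green", "red"), n, replace = TRUE))
shift <- ifelse(Y == "red", 0.8, 0)
X1    <- rnorm(n, mean = shift)
X2    <- 0.5 * X1 + 0.5 * shift + rnorm(n, sd = 0.8)
X2[sample(n, 150)] <- NA
d     <- data.frame(Y, X1, X2)
miss  <- is.na(d$X2)

# Imputation model X2 | X1, Y, fitted on the rows where X2 is observed.
imp_fit  <- lm(X2 ~ X1 + Y, data = d)
beta_hat <- coef(imp_fit)
XtX_inv  <- summary(imp_fit)$cov.unscaled
rss      <- sum(residuals(imp_fit)^2)
df_res   <- df.residual(imp_fit)
X_mis    <- model.matrix(~ X1 + Y, data = d[miss, ])

m   <- 20                                   # number of imputations
est <- matrix(NA_real_, m, 3)               # coefficients per completed dataset
se2 <- matrix(NA_real_, m, 3)               # squared standard errors
for (i in seq_len(m)) {
  # Draw imputation-model parameters, then draw the missing X2 values.
  sigma2 <- rss / rchisq(1, df_res)
  beta   <- beta_hat +
            drop(t(chol(sigma2 * XtX_inv)) %*% rnorm(length(beta_hat)))
  completed <- d
  completed$X2[miss] <- drop(X_mis %*% beta) +
                        rnorm(sum(miss), sd = sqrt(sigma2))
  fit <- glm(Y ~ X1 + X2, family = binomial, data = completed)
  est[i, ] <- coef(fit)
  se2[i, ] <- diag(vcov(fit))
}

# Rubin's rules: average the estimates, add within and between variance.
pooled_est  <- colMeans(est)
within_var  <- colMeans(se2)
between_var <- apply(est, 2, var)
total_var   <- within_var + (1 + 1 / m) * between_var
pooled_se   <- sqrt(total_var)
cbind(estimate = pooled_est, std.error = pooled_se)
```

The rows of the final table are the intercept, X1 and X2 coefficients of the
logistic regression, with standard errors that include the extra uncertainty
due to the imputed values.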

You can also, however, extract the parameter values (multinomial
probabilities and multivariate means and covariance matrix) used
in a particular imputation using the function getparam.mix in
the mix library. This function needs parameters s (evaluated
by the preliminary processor prelim.mix), and theta, evaluated
for each imputation by a data augmentation function such as da.mix.
Then you can substitute these in the above formulae for a and b to get
a and b directly, without needing to do an explicit logistic regression
on the completed dataset.
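
To illustrate the substitution, here is a minimal sketch on simulated, fully
observed data. It does not call getparam.mix; sample means, a pooled
covariance matrix and group proportions stand in for the parameter values it
would supply, and all numbers in the simulation are assumptions. Note the
sign convention: glm() models P(A|x) = 1/(1 + exp(-(b0 + b1*x1 + b2*x2))), so
a and b from the formulae above correspond to minus the glm() coefficients.

```r
# Two normal groups with a common covariance matrix (assumed values).
set.seed(4)
n     <- 5000
V     <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
means <- rbind(A = c(1, 1), B = c(0, 0))
g     <- sample(c("A", "B"), n, replace = TRUE)
X     <- matrix(rnorm(2 * n), n, 2) %*% chol(V) + means[g, ]
inA   <- g == "A"

# Plug-in parameter estimates: group means, pooled covariance, priors.
mA <- colMeans(X[inA, ])
mB <- colMeans(X[!inA, ])
S  <- (crossprod(sweep(X[inA, ], 2, mA)) +
       crossprod(sweep(X[!inA, ], 2, mB))) / (n - 2)
W  <- solve(S)
pA <- mean(inA)
pB <- 1 - pA

# a and b from the formulae in the text.
a <- log(pB / pA) + 0.5 * drop(t(mA) %*% W %*% mA - t(mB) %*% W %*% mB)
b <- drop(t(mB - mA) %*% W)

# Explicit logistic regression for comparison.
d   <- data.frame(isA = as.integer(inA), x1 = X[, 1], x2 = X[, 2])
fit <- glm(isA ~ x1 + x2, family = binomial, data = d)

cbind(from_formula = c(a, b), from_glm = -unname(coef(fit)))
```

The two columns should be close but not identical, since one comes from
moment estimates under the normal model and the other from the logistic
likelihood.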

Hoping this helps!
Ted.



E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 29-Oct-04   Time: 13:45:46
-- XFMail --



Re: [R] missing values in logistic regression

2004-10-29 Thread Frank E Harrell Jr
(Ted Harding) wrote:
> I don't know of any R routine directly aimed at logistic regression
> with missing values as you describe.

The aregImpute function in the Hmisc package can handle this, using 
predictive mean matching with weighted multinomial sampling of donor 
observations' binary covariate values.

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University