[R] missing values in logistic regression
Dear R help list,

I am trying to do a logistic regression where I have a categorical response variable Y and two numerical predictors X1 and X2. There are quite a lot of missing values for predictor X2, e.g.,

Y      X1    X2
red    0.6   0.2   *
red    0.5   0.2   *
red    0.5   NA
red    0.5   NA
green  0.2   0.1   *
green  0.1   NA
green  0.1   NA
green  0.05  0.05  *

I am wondering can I combine X1 and X2 in a logistic regression to predict Y, using all the data for X1, even though there are NAs in the X2 data? Or do I have to take only the cases for which there is data for both X1 and X2? (marked with *s above)

I will be very grateful for any help,
sincerely,
Avril Coghlan
University College Dublin, Ireland

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] missing values in logistic regression
Avril Coghlan [EMAIL PROTECTED] writes:

> I am wondering can I combine X1 and X2 in a logistic regression to predict Y, using all the data for X1, even though there are NAs in the X2 data? Or do I have to take only the cases for which there is data for both X1 and X2? (marked with *s above)

The built-in function (glm) for logistic regression will give you a complete-case analysis. For more advanced handling of missing values, you need to look into imputation methods. At least two CRAN packages deal with this, namely mix and mitools. The former is support software for a book, which you'll probably want to consult.

--
Peter Dalgaard
Dept. of Biostatistics, University of Copenhagen
Blegdamsvej 3, 2200 Cph. N, Denmark
Ph: (+45) 35327918
FAX: (+45) 35327907
([EMAIL PROTECTED])
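As a minimal sketch of what "complete-case analysis" means here, the code below re-enters the eight rows from the original post (the data frame name d and the other variable names are assumptions for illustration). It assumes R's default setting, options(na.action = "na.omit"), under which glm drops every row that has an NA in any model variable. The four complete rows happen to be perfectly separated by colour, so the fit itself is degenerate and its warnings are suppressed; only the row counts are of interest.

```r
# The eight observations from the original post
d <- data.frame(
  Y  = factor(c("red", "red", "red", "red",
                "green", "green", "green", "green")),
  X1 = c(0.6, 0.5, 0.5, 0.5, 0.2, 0.1, 0.1, 0.05),
  X2 = c(0.2, 0.2, NA, NA, 0.1, NA, NA, 0.05)
)

# Rows with no missing value in any column (the starred rows)
n_complete <- sum(complete.cases(d))

# With the default na.action (na.omit), glm uses only those rows.
# Warnings are suppressed because these few rows are perfectly separated.
fit <- suppressWarnings(glm(Y ~ X1 + X2, family = binomial, data = d))

n_used    <- nobs(fit)               # observations actually used in the fit
n_dropped <- length(fit$na.action)   # rows removed because of NAs

cat("complete rows:", n_complete,
    " used by glm:", n_used,
    " dropped:", n_dropped, "\n")
```

Half of the data never reaches the model, which is the loss the rest of this thread tries to avoid.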
Re: [R] missing values in logistic regression
On 29 Oct 2004, Avril Coghlan wrote:

> I am wondering can I combine X1 and X2 in a logistic regression to predict Y, using all the data for X1, even though there are NAs in the X2 data? Or do I have to take only the cases for which there is data for both X1 and X2? (marked with *s above)

You need to either

1) Train separate models for Y | X1 and Y | X1, X2 and use the appropriate one.

2) Produce an imputation model for X2 | X1, and use multiple imputation. Given that the latter look like [0, 1] scores, mix (as suggested by PD) is not likely to be appropriate, but e.g. a 2D kde fit may well be.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
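Option 1 above can be sketched in base R roughly as follows. The data are simulated purely for illustration (the sample size, means, standard deviations and missingness pattern are all assumptions, not taken from the original post), and the variable names are hypothetical. One model is fitted on X1 alone using every row, a second on X1 and X2 using the complete cases, and each observation is then scored with whichever model matches the predictors it actually has.

```r
set.seed(1)

# Simulated data: binary response, two numeric predictors, X2 partly missing
n  <- 200
y  <- rbinom(n, 1, 0.5)
x1 <- rnorm(n, mean = 0.3 + 0.3 * y, sd = 0.2)
x2 <- rnorm(n, mean = 0.1 + 0.1 * y, sd = 0.1)
x2[sample(n, 80)] <- NA               # 80 values of x2 set to missing
d  <- data.frame(y, x1, x2)

# Model for Y | X1: uses all rows, since x1 is always observed
fit1  <- glm(y ~ x1, family = binomial, data = d)

# Model for Y | X1, X2: glm drops the rows where x2 is NA
fit12 <- glm(y ~ x1 + x2, family = binomial, data = d)

# Predict each case with the model appropriate to its available predictors
has_x2 <- !is.na(d$x2)
p <- numeric(n)
p[has_x2]  <- predict(fit12, newdata = d[has_x2, ],  type = "response")
p[!has_x2] <- predict(fit1,  newdata = d[!has_x2, ], type = "response")

cat("rows used by fit1:", nobs(fit1),
    " rows used by fit12:", nobs(fit12), "\n")
```

This keeps all the X1 information for prediction, but the two coefficient sets are not comparable with each other, since the coefficient of x1 means something different once x2 is in the model.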
RE: [R] missing values in logistic regression
On 29-Oct-04 Avril Coghlan wrote:

> I am wondering can I combine X1 and X2 in a logistic regression to predict Y, using all the data for X1, even though there are NAs in the X2 data? Or do I have to take only the cases for which there is data for both X1 and X2? (marked with *s above)

I don't know of any R routine directly aimed at logistic regression with missing values as you describe. However, if you are prepared to assume (or try to arrange by a judiciously chosen transformation) that the distribution of (X1,X2) is bivariate normal, with mean dependent on the value of Y but with the same variance-covariance matrix throughout, then you should be able to make progress along the following lines. This ties in with Peter Dalgaard's suggestion of mix.

I shall assume for this explanation that your Y categories take only two values A and B (as "red", "green"), though the method can be directly extended to several categories in Y.

The underlying theoretical point is that a linear logistic regression is equivalent to a Bayesian discrimination between two normally-distributed clusters with a common covariance matrix.

Let the vector of means for (X1,X2) be mA for group A, and mB for group B; and let the covariance matrix be V. Let x denote (X1,X2). Then

  P(A|x) = [f(x|A)*p(A)]/[f(x|A)*p(A) + f(x|B)*p(B)]

where p(A) and p(B) are the prior probabilities of a group A or a group B item. Now substitute

  f(x|A) = C*exp(-0.5*(x-mA)'%*%W%*%(x-mA))

and similar for f(x|B); C is the constant 1/sqrt((2*pi)^k * det(V)) where k is the dimension of x, and W is the inverse of V.

Then, with a bit of algebra (C and the terms in x'%*%W%*%x cancel between numerator and denominator),

  P(A|x) = 1/(1 + exp(a + b%*%x))

(a logistic regression) where a is the scalar

  log(p(B)/p(A)) + 0.5*(mA'%*%W%*%mA - mB'%*%W%*%mB)

and b is the vector

  (mB - mA)'%*%W

Now you can come back to the mix package. This is for multiple imputation of missing values in a dataset consisting of variables of two kinds: categorical and continuous. The joint probability model for all the variables is expressed as a product of the multinomial distribution for the categorical variables, with a multivariate normal distribution for the continuous variables where it is assumed that the covariance matrix is the same for every combination of the values of the categorical variables, while the multivariate means may differ at different levels of the categoricals.

Hence the underlying model for the mix package is exactly what is needed for the above.

The primary output from imputation runs with mix is a set of completed datasets (with missing values filled in). You can then run a logistic regression on each completed dataset, obtaining for each dataset the estimates of the regression parameters and their standard errors. These can then be combined using the function mi.inference in the mix package.

You can also, however, extract the parameter values (multinomial probabilities and multivariate means and covariance matrix) used in a particular imputation using the function getparam.mix in the mix package. This function needs the arguments s (evaluated by the preliminary processor prelim.mix) and theta (evaluated for each imputation by a data augmentation function such as da.mix).

Then you can substitute these in the above formulae for a and b to get a and b directly, without needing to do an explicit logistic regression on the completed dataset.

Hoping this helps!
Ted.

E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861 [NB: New number!]
Date: 29-Oct-04 Time: 13:45:46
-- XFMail --
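The algebra in the message above can be checked numerically. The sketch below is only an illustration: the means, covariance matrix, prior probabilities and test point are made-up values, and dmvn is a small helper written here for the purpose, not a function from mix or any other package. It computes P(A|x) once directly from Bayes' rule with bivariate normal densities, and once from the logistic form using the expressions for a and b given in the message; the two should agree to numerical precision.

```r
# Hypothetical parameter values, chosen only for illustration
mA <- c(0.5, 0.2)                      # mean of (X1, X2) in group A
mB <- c(0.1, 0.1)                      # mean of (X1, X2) in group B
V  <- matrix(c(0.04, 0.01,
               0.01, 0.02), 2, 2)      # common covariance matrix
pA <- 0.4                              # prior probability of group A
pB <- 0.6                              # prior probability of group B
W  <- solve(V)                         # inverse of V

# Multivariate normal density (helper defined here, not from a package)
dmvn <- function(x, m, V) {
  k <- length(x)
  d <- x - m
  drop(exp(-0.5 * t(d) %*% solve(V) %*% d)) / sqrt((2 * pi)^k * det(V))
}

# Logistic coefficients from the formulae in the message
a <- log(pB / pA) + 0.5 * drop(t(mA) %*% W %*% mA - t(mB) %*% W %*% mB)
b <- drop(t(mB - mA) %*% W)

# Evaluate P(A|x) both ways at an arbitrary point x
x <- c(0.3, 0.15)
fA <- dmvn(x, mA, V)
fB <- dmvn(x, mB, V)
post_bayes    <- fA * pA / (fA * pA + fB * pB)
post_logistic <- 1 / (1 + exp(a + sum(b * x)))

print(c(bayes = post_bayes, logistic = post_logistic))
print(all.equal(post_bayes, post_logistic))
```

In an actual analysis the values of mA, mB, V, pA and pB would come from the parameters of a fitted model (for example, as the message describes, those extracted with getparam.mix) rather than being typed in by hand.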
Re: [R] missing values in logistic regression
(Ted Harding) wrote:

> I don't know of any R routine directly aimed at logistic regression with missing values as you describe.

The aregImpute function in the Hmisc package can handle this, using predictive mean matching with weighted multinomial sampling of donor observations' binary covariate values.

--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University