[R] impute missing values in correlated variables: transcan?
I would like to impute missing data in a set of correlated variables (columns of a matrix). It looks like transcan() from Hmisc is roughly what I want. It says, "transcan automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables." And, "By default, transcan imputes NAs with best guess expected values of transformed variables, back transformed to the original scale."

But I can't get it to work. I say

  m1 <- matrix(1:20+rnorm(20),5,)  # four correlated variables
  colnames(m1) <- paste("R",1:4,sep="")
  m1[c(2,19)] <- NA  # simulate some missing data
  library(Hmisc)
  transcan(m1,data=m1)

and I get

  Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
    fewer than 6 non-missing observations with knots omitted

I've tried a few other things, but I think it is time to ask for help.

The specific problem is a real one. Our graduate admissions committee (4 members) rates applications, and we average the ratings to get an overall rating for each applicant. Sometimes one of the committee members is absent, or late; hence the missing data. The members differ in the way they use the rating scale, in both slope and intercept (if you regress each on the mean). Many decisions end up depending on the second decimal place of the averages, so we want to do better than just averaging the non-missing ratings.

Maybe I'm just not seeing something really simple. In fact, the problem is simpler than transcan assumes, since we are willing to assume linearity of the regression of each variable on the other variables. Other members proposed solutions that assumed this, but they did not take into account the fact that missing data at the high or low end of each variable (each member's ratings) would change its mean.
Jon
--
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: http://www.sas.upenn.edu/~baron
R search page: http://finzi.psych.upenn.edu/
__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
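The linearity assumption in the post above can be illustrated without Hmisc. The following is only a minimal sketch in base R, not a method proposed by anyone in this thread: the helper name impute_linear is hypothetical, and it assumes each missing rating is predicted from an ordinary least-squares regression on the other raters, fitted to the fully observed rows. The simulated data follow the post, but with 20 rows so the regression has enough observations.

```r
# Hypothetical helper (not part of Hmisc): impute each column by a linear
# regression on the other columns, fitted to the complete rows only.
impute_linear <- function(m) {
  df <- as.data.frame(m)
  complete <- complete.cases(df)
  out <- df
  for (v in names(df)) {
    miss <- is.na(df[[v]])
    if (!any(miss)) next
    # Regress column v on all other columns, using fully observed rows.
    fit <- lm(reformulate(setdiff(names(df), v), response = v),
              data = df[complete, , drop = FALSE])
    # Predict the missing entries; a row that also lacks a predictor stays NA.
    out[miss, v] <- predict(fit, newdata = df[miss, , drop = FALSE])
  }
  as.matrix(out)
}

set.seed(1)                                # reproducible simulated ratings
m1 <- matrix(1:80 + rnorm(80), ncol = 4)   # 20 applicants, 4 raters
colnames(m1) <- paste("R", 1:4, sep = "")
m1[c(2, 19)] <- NA                         # rater R1 missing for rows 2 and 19
m1_imp <- impute_linear(m1)
overall <- rowMeans(m1_imp)                # overall rating per applicant
```

Because each regression has its own slope and intercept, the imputed value reflects how that particular rater uses the scale, which plain averaging of the non-missing ratings does not. This is single imputation and ignores the uncertainty in the fitted coefficients.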
Re: [R] impute missing values in correlated variables: transcan?
At the risk of stirring up a hornet's nest, I'd suggest that means are dangerous in such applications. A nice paper on combining ratings is:

Gilbert Bassett and Joseph Persky, Rating Skating, JASA, 1994, 1075-1079.

url:    www.econ.uiuc.edu/~roger        Roger Koenker
email   [EMAIL PROTECTED]               Department of Economics
vox:    217-333-4558                    University of Illinois
fax:    217-244-6678                    Champaign, IL 61820

On Nov 30, 2004, at 10:52 AM, Jonathan Baron wrote:

> [...] Our graduate admissions committee (4 members) rates applications, and we average the ratings to get an overall rating for each applicant. [...]
Re: [R] impute missing values in correlated variables: transcan?
On 11/30/04 11:23, roger koenker wrote:

> At the risk of stirring up a hornet's nest, I'd suggest that means are dangerous in such applications. A nice paper on combining ratings is:
>
> Gilbert Bassett and Joseph Persky, Rating Skating, JASA, 1994, 1075-1079.

Here is the abstract, which seems to capture what the article says:

  Among judged sports, figure skating uses a unique method of median ranks for determining placement. This system responds positively to increased marks by each judge and follows majority rule when a majority of judges agree on a skater's rank. It is demonstrated that this is the only aggregation system possessing these two properties. Median ranks provide strong safeguards against manipulation by a minority of judges. These positive features do not require the sacrifice of efficiency in controlling measurement error. In a Monte Carlo study, the median rank system consistently outperforms alternatives when judges' marks are significantly skewed toward an upper limit.

I think this is irrelevant. We are using ratings, not rankings.

(And there was a small error in my original post. The disturbing effect of missing data at the high or low end would be on the slope rather than the intercept or mean.)

Jon
--
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Re: [R] impute missing values in correlated variables: transcan?
Jonathan Baron wrote:

> But I can't get it to work. I say
>
>   m1 <- matrix(1:20+rnorm(20),5,)  # four correlated variables
>   colnames(m1) <- paste("R",1:4,sep="")
>   m1[c(2,19)] <- NA  # simulate some missing data
>   library(Hmisc)
>   transcan(m1,data=m1)
>
> and I get
>
>   Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
>     fewer than 6 non-missing observations with knots omitted

Jonathan - you would need many more observations to be able to fit flexible additive models as transcan does. Also note that single imputation has problems and you may want to consider multiple imputation as done by the Hmisc aregImpute function, if you had more data.

Frank
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
Re: [R] impute missing values in correlated variables: transcan?
On 11/30/04 13:21, Frank E Harrell Jr wrote:

> Jonathan - you would need many more observations to be able to fit flexible additive models as transcan does. Also note that single imputation has problems and you may want to consider multiple imputation as done by the Hmisc aregImpute function, if you had more data.

Thanks. But they don't _need_ to be so flexible as what transcan does. Linear would be OK, but I can't find an option for that in transcan.

We _will_ have more data, about 50 applicants rated by the time we start making decisions. So I tried my little simulation with more data, and it didn't give an error message. So that was the problem. Here is the new one:

  m1 <- matrix(1:80+rnorm(80),,4)
  colnames(m1) <- paste("R",1:4,sep="")
  m1[c(2,19)] <- NA
  library(Hmisc)
  t1 <- transcan(m1,data=m1,long=TRUE,imputed=TRUE)

I've used aregImpute, and I notice it has a defaultlinear option, which is good. Thus, it may work better once I figure out how to get a single value out of it for each missing datum (which doesn't look too hard).

This is not about statistical inference, which seems to me to be where the main advantage of multiple imputation lies. But probably it won't do any harm.
Jon
--
Jonathan Baron, Professor of Psychology, University of Pennsylvania
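The idea of getting a single value out of several imputations for each missing datum can be sketched in base R. This is an illustration under stated assumptions, not the aregImpute interface: the imputations here are simple linear predictions plus a resampled residual, and the layout of draws (one row per missing observation, one column per imputation) is chosen for this example only.

```r
set.seed(2)                                # reproducible simulated ratings
m1 <- matrix(1:80 + rnorm(80), ncol = 4)   # 20 applicants, 4 raters
colnames(m1) <- paste("R", 1:4, sep = "")
m1[c(2, 19)] <- NA                         # rater R1 missing for rows 2 and 19
df <- as.data.frame(m1)

miss <- is.na(df$R1)
fit  <- lm(R1 ~ R2 + R3 + R4, data = df[!miss, ])   # linear fit on observed rows
pred <- predict(fit, newdata = df[miss, ])          # point predictions

n_imp <- 5                                 # number of imputations (arbitrary)
# Each imputation adds a residual resampled from the fit to the prediction.
# Result: rows = missing observations, columns = imputations.
draws <- sapply(seq_len(n_imp), function(i)
  pred + sample(residuals(fit), length(pred), replace = TRUE))

single <- rowMeans(draws)                  # one value per missing datum
```

Averaging the draws pulls the result back toward the plain regression prediction, which is consistent with the remark above that multiple imputation matters mainly for inference rather than for a single best guess.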