[R] impute missing values in correlated variables: transcan?

2004-11-30 Thread Jonathan Baron
I would like to impute missing data in a set of correlated
variables (columns of a matrix).  It looks like transcan() from
Hmisc is roughly what I want.  It says, "transcan automatically
transforms continuous and categorical variables to have maximum
correlation with the best linear combination of the other
variables."  And, "By default, transcan imputes NAs with 'best
guess' expected values of transformed variables, back transformed
to the original scale."

But I can't get it to work.  I say

m1 <- matrix(1:20+rnorm(20),5,)  # four correlated variables
colnames(m1) <- paste("R",1:4,sep="")
m1[c(2,19)] <- NA  # simulate some missing data
library(Hmisc)
transcan(m1,data=m1)

and I get

Error in rcspline.eval(y, nk = nk, inclx = TRUE) : 
  fewer than 6 non-missing observations with knots omitted

I've tried a few other things, but I think it is time to ask for
help.

The specific problem is a real one.  Our graduate admissions
committee (4 members) rates applications, and we average the
ratings to get an overall rating for each applicant.  Sometimes
one of the committee members is absent, or late; hence the
missing data.  The members differ in the way they use the rating
scale, in both slope and intercept (if you regress each on the
mean).  Many decisions end up depending on the second decimal
place of the averages, so we want to do better than just averaging
the non-missing ratings.
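
The concern in the paragraph above can be made concrete with a small
simulation.  This is only an illustrative sketch: the intercepts, slopes,
and noise level are made-up values, not the committee's data.

```r
# Hypothetical raters who share a latent applicant quality but differ in
# intercept and slope (all numbers here are assumptions for illustration).
set.seed(1)
n <- 50
quality <- rnorm(n)
intercepts <- c(3.0, 3.5, 4.0, 4.5)
slopes <- c(0.5, 0.8, 1.0, 1.2)
ratings <- sapply(1:4, function(j) {
  intercepts[j] + slopes[j] * quality + rnorm(n, sd = 0.1)
})
colnames(ratings) <- paste("R", 1:4, sep = "")

# Average with all four ratings present
full_mean <- rowMeans(ratings)

# Drop the most lenient rater's score for the first applicant and
# average whatever is left
with_na <- ratings
with_na[1, "R4"] <- NA
naive_mean <- rowMeans(with_na, na.rm = TRUE)

# Change in the first applicant's average caused only by the missing rating
shift <- naive_mean[1] - full_mean[1]
shift
```

The shift equals one quarter of the gap between the mean of the three
remaining ratings and the dropped rating, so the absence of a lenient or
harsh rater can move an average by far more than the second decimal place.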

Maybe I'm just not seeing something really simple.  In fact, the
problem is simpler than transcan assumes, since we are willing to
assume linearity of the regression of each variable on the other
variables.  Other members proposed solutions that assumed this,
but they did not take into account the fact that missing data at
the high or low end of each variable (each member's ratings)
would change its mean.
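
Under the linearity assumption described above, one simple
single-imputation approach is to regress each rater on the others, using
the fully rated applicants, and predict the missing ratings.  The
following is a minimal base-R sketch, not what transcan does;
impute_linear is a hypothetical helper name, and it assumes that every
applicant with a missing rating has all the other ratings observed.

```r
# Sketch: fill each missing rating with its predicted value from a linear
# regression on the other raters, fitted to the complete rows only.
# impute_linear is a hypothetical helper, not part of Hmisc.
impute_linear <- function(m) {
  d <- as.data.frame(m)
  complete <- d[complete.cases(d), , drop = FALSE]
  out <- d
  for (v in names(d)) {
    miss <- which(is.na(d[[v]]))
    if (length(miss) == 0) next
    # Regress rater v on all the other raters
    fit <- lm(reformulate(setdiff(names(d), v), response = v),
              data = complete)
    out[miss, v] <- predict(fit, newdata = d[miss, , drop = FALSE])
  }
  as.matrix(out)
}

set.seed(1)
m1 <- matrix(1:80 + rnorm(80), , 4)   # 20 applicants, four correlated raters
colnames(m1) <- paste("R", 1:4, sep = "")
m1[c(2, 19)] <- NA                    # simulate some missing data
m1_filled <- impute_linear(m1)
m1_filled[c(2, 19), ]                 # rows that had a missing rating
rowMeans(m1_filled)[c(2, 19)]         # their averages after imputation
```

Because the regressions are fitted only to complete rows, this sketch
discards the partial information in incomplete rows, and it gives a single
best-guess value with no indication of its uncertainty.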

Jon
-- 
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: http://www.sas.upenn.edu/~baron
R search page: http://finzi.psych.upenn.edu/

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] impute missing values in correlated variables: transcan?

2004-11-30 Thread roger koenker
At the risk of stirring up a hornet's nest, I'd suggest that
means are dangerous in such applications.  A nice paper
on combining ratings is:  Gilbert Bassett and Joseph Persky,
"Rating Skating," JASA, 1994, 1075-1079.

url:    www.econ.uiuc.edu/~roger        Roger Koenker
email   [EMAIL PROTECTED]               Department of Economics
vox:    217-333-4558                    University of Illinois
fax:    217-244-6678                    Champaign, IL 61820


Re: [R] impute missing values in correlated variables: transcan?

2004-11-30 Thread Jonathan Baron
On 11/30/04 11:23, roger koenker wrote:
> At the risk of stirring up a hornet's nest, I'd suggest that
> means are dangerous in such applications.  A nice paper
> on combining ratings is:  Gilbert Bassett and Joseph Persky,
> "Rating Skating," JASA, 1994, 1075-1079.

Here is the abstract, which seems to capture what the article
says:

Among judged sports, figure skating uses a unique method of
median ranks for determining placement. This system responds
positively to increased marks by each judge and follows majority
rule when a majority of judges agree on a skater's rank. It is
demonstrated that this is the only aggregation system possessing
these two properties. Median ranks provide strong safeguards
against manipulation by a minority of judges. These positive
features do not require the sacrifice of efficiency in
controlling measurement error. In a Monte Carlo study, the median
rank system consistently outperforms alternatives when judges'
marks are significantly skewed toward an upper limit.
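
For readers unfamiliar with the scheme the abstract describes, the basic
idea of median ranks can be sketched in a few lines of R.  This is only an
illustration of the general idea: the marks are invented, and the paper's
full procedure (including tie-breaking) is not reproduced here.

```r
# Hypothetical marks: one row per skater, one column per judge
marks <- matrix(c(5.8, 5.9, 5.7,
                  5.6, 5.7, 5.8,
                  5.5, 5.4, 5.6,
                  5.9, 5.8, 5.9),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste("S", 1:4, sep = ""),
                                paste("J", 1:3, sep = "")))

# Rank skaters within each judge (1 = highest mark)
ranks <- apply(marks, 2, function(x) rank(-x))

# Each skater's placement is based on the median of the judges' ranks
median_rank <- apply(ranks, 1, median)
median_rank
```

Only the ordering within each judge matters here, so differences between
judges in intercept or slope have no effect on the result.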

I think this is irrelevant.  We are using ratings, not rankings.

(And there was a small error in my original post.  The disturbing
effect of missing data at the high or low end would be on the
slope rather than the intercept or mean.)

Jon
-- 
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: http://www.sas.upenn.edu/~baron



Re: [R] impute missing values in correlated variables: transcan?

2004-11-30 Thread Frank E Harrell Jr
Jonathan Baron wrote:
> I would like to impute missing data in a set of correlated
> variables (columns of a matrix).  It looks like transcan() from
> Hmisc is roughly what I want.  It says, "transcan automatically
> transforms continuous and categorical variables to have maximum
> correlation with the best linear combination of the other
> variables."  And, "By default, transcan imputes NAs with 'best
> guess' expected values of transformed variables, back transformed
> to the original scale."
>
> But I can't get it to work.  I say
>
> m1 <- matrix(1:20+rnorm(20),5,)  # four correlated variables
> colnames(m1) <- paste("R",1:4,sep="")
> m1[c(2,19)] <- NA  # simulate some missing data
> library(Hmisc)
> transcan(m1,data=m1)
>
> and I get
>
> Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
>   fewer than 6 non-missing observations with knots omitted

Jonathan - you would need many more observations to be able to fit 
flexible additive models as transcan does.  Also note that single 
imputation has problems and you may want to consider multiple imputation 
as done by the Hmisc aregImpute function, if you had more data.

Frank

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University


Re: [R] impute missing values in correlated variables: transcan?

2004-11-30 Thread Jonathan Baron
On 11/30/04 13:21, Frank E Harrell Jr wrote:
> Jonathan Baron wrote:
>> I would like to impute missing data in a set of correlated
>> variables (columns of a matrix).  It looks like transcan() from
>> Hmisc is roughly what I want.  It says, "transcan automatically
>> transforms continuous and categorical variables to have maximum
>> correlation with the best linear combination of the other
>> variables."  And, "By default, transcan imputes NAs with 'best
>> guess' expected values of transformed variables, back transformed
>> to the original scale."
>>
>> But I can't get it to work.  I say
>>
>> m1 <- matrix(1:20+rnorm(20),5,)  # four correlated variables
>> colnames(m1) <- paste("R",1:4,sep="")
>> m1[c(2,19)] <- NA  # simulate some missing data
>> library(Hmisc)
>> transcan(m1,data=m1)
>>
>> and I get
>>
>> Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
>>   fewer than 6 non-missing observations with knots omitted
>
> Jonathan - you would need many more observations to be able to fit
> flexible additive models as transcan does.  Also note that single
> imputation has problems and you may want to consider multiple imputation
> as done by the Hmisc aregImpute function, if you had more data.

Thanks.  But they don't _need_ to be so flexible as what transcan
does.  Linear would be OK, but I can't find an option for that in
transcan.

We _will_ have more data, about 50 applicants rated by the time
we start making decisions.  So I tried my little simulation with
more data, and it didn't give an error message.  So that was the
problem.  Here is the new one:

m1 <- matrix(1:80+rnorm(80),,4)
colnames(m1) <- paste("R",1:4,sep="")
m1[c(2,19)] <- NA
library(Hmisc)
t1 <- transcan(m1,data=m1,long=TRUE,imputed=TRUE)

I've used aregImpute, and I notice it has a defaultLinear
option, which is good.  Thus, it may work better once I figure
out how to get a single value out of it for each missing datum
(which doesn't look too hard).

This is not about statistical inference, which seems to me to be
where the main advantage of multiple imputation lies.  But
probably it won't do any harm.

Jon
-- 
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: http://www.sas.upenn.edu/~baron
R search page: http://finzi.psych.upenn.edu/
