Re: [R] rpart package: why does predict.rpart require values for unused predictors?

2012-08-02 Thread Jean V Adams
Jason,

In the help file for predict.rpart it says, "The predictors referred to in 
the right side of formula(object) must be present by name in newdata."
?predict.rpart

So, that's just the way it is.  There are a couple ways to work around 
this, if you wish.  You could create a data frame with all NAs for the 
unused predictor(s).  For example,
newdata2 <- data.frame(Disp.=car.test.frame$Disp., 
Weight=car.test.frame$Weight, HP=as.numeric(rep(NA, 
dim(car.test.frame)[1])))
predict(model, newdata=newdata2)
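
The same idea can be generalized so you do not have to type out each unused predictor. This is only a rough sketch: fill_missing_predictors is a hypothetical helper name (not part of rpart), and it assumes every predictor absent from newdata is numeric. A factor predictor would need an NA factor with the training levels instead.

```r
library(rpart)

# Hypothetical helper (not part of rpart): add an all-NA numeric column for
# every predictor named in the model formula that is absent from newdata.
# Assumption: the absent predictors are numeric.
fill_missing_predictors <- function(object, newdata) {
  needed <- all.vars(delete.response(object$terms))
  for (v in setdiff(needed, names(newdata))) {
    newdata[[v]] <- NA_real_
  }
  newdata
}

model <- rpart(Mileage ~ Weight + Disp. + HP, car.test.frame)
newdata <- data.frame(Disp. = car.test.frame$Disp.,
                      Weight = car.test.frame$Weight)

# HP is added as NA, so predict() no longer complains about a missing object.
pred_na <- predict(model, newdata = fill_missing_predictors(model, newdata))
```

Because Disp. and Weight have no missing values in car.test.frame, the tree never has to fall back on HP here.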

Or, you could refit the model using only the important factors.  For 
example,
model2 <- rpart(Mileage ~ Weight + Disp., car.test.frame)
predict(model2, newdata=newdata)

Jean


Jason Roberts jason.robe...@duke.edu wrote on 08/01/2012 05:17:38 PM:
 
> After fitting and pruning an rpart model, it is often the case that one or
> more of the original predictors is not used by any of the splits of the
> final tree. It seems logical, therefore, that values for these unused
> predictors would not be needed for prediction. But when predict() is called
> on such models, all predictors seem to be required. Why is that, and can it
> be easily circumvented?
>
> Consider this example:
>
> > model <- rpart(Mileage ~ Weight + Disp. + HP, car.test.frame)
> > model
> n= 60
>
> node), split, n, deviance, yval
>       * denotes terminal node
>
> 1) root 60 1354.58300 24.58333
>   2) Disp.>=134 35  154.40000 21.40000
>     4) Weight>=3087.5 22   61.31818 20.40909 *
>     5) Weight< 3087.5 13   34.92308 23.07692 *
>   3) Disp.< 134 25  348.96000 29.04000
>     6) Disp.>=97.5 16  101.75000 27.12500 *
>     7) Disp.< 97.5 9   84.20000 32.40000 *
> > newdata <- data.frame(Disp.=car.test.frame$Disp., Weight=car.test.frame$Weight)
> > predict(model, newdata=newdata)
> Error in eval(expr, envir, enclos) : object 'HP' not found
>
> In this model, Disp. and Weight were used in splits, but HP was not. Thus I
> expected to be able to perform predictions by providing values for just
> Disp. and Weight, but predict() failed when I tried that, complaining that
> HP was not also provided.
>
> Thanks for any help you can provide. My apologies if I simply do not
> understand how this works.
>
> Best regards,
>
> Jason


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] rpart package: why does predict.rpart require values for unused predictors?

2012-08-02 Thread Jason Roberts
Jean,

Thanks for your quick reply and suggestions! 

> In the help file for predict.rpart it says, "The predictors referred to in
> the right side of formula(object) must be present by name in newdata."

I was aware of that statement from the help file. I wondered about the
reason for that requirement. It would be convenient for the caller to not
have to provide values for unused predictors. I wondered whether the
requirement to provide them all was related to something I did not
understand, such as surrogate splits, or whether imposing it simply made
rpart itself easier to implement. (No offence intended to the authors for
taking a shortcut, if indeed they did.)

Are you pretty confident that your suggested workarounds will result in a
model that produces identical predictions? I only ask because I'm aware that
rpart has the ability to use surrogate variables in place of predictors that
are missing. But I do not fully understand how that capability works. I do
not know whether it is only used during fitting and not prediction.
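
One way to check this empirically is sketched below. It assumes rpart's default control settings (in particular the default usesurrogate behaviour), and the HP value of 90 in the last step is a made-up number chosen for illustration, not taken from the data.

```r
library(rpart)

model  <- rpart(Mileage ~ Weight + Disp. + HP, car.test.frame)
model2 <- rpart(Mileage ~ Weight + Disp., car.test.frame)

# Workaround 1: keep the split variables, set HP to NA for every row.
nd <- car.test.frame[, c("Disp.", "Weight")]
nd$HP <- NA_real_

p_full  <- predict(model,  newdata = car.test.frame)  # all predictors supplied
p_na    <- predict(model,  newdata = nd)              # HP supplied as NA
p_refit <- predict(model2, newdata = nd)              # model refit without HP

all.equal(p_full, p_na)     # workaround 1 vs. the full data
all.equal(p_full, p_refit)  # workaround 2 vs. the full data

# Surrogates only come into play when a split variable is missing at
# prediction time. Here both Disp. and Weight are NA, and only HP differs.
# HP = 90 is a hypothetical value used for illustration.
row_hp    <- data.frame(Disp. = NA_real_, Weight = NA_real_, HP = 90)
row_no_hp <- data.frame(Disp. = NA_real_, Weight = NA_real_, HP = NA_real_)
p_hp    <- predict(model, newdata = row_hp)
p_no_hp <- predict(model, newdata = row_no_hp)
```

If the two single-row predictions differ, that suggests HP is being consulted as a surrogate during prediction, so setting it to NA is only harmless when the primary split variables are present.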

Continuing my example, I can see that printcp produces some output under
"Variables actually used in tree construction":

> printcp(model)

Regression tree:
rpart(formula = Mileage ~ Weight + Disp. + HP, data = car.test.frame)

Variables actually used in tree construction:
[1] Disp.  Weight

...

I can see in the source for printcp how those variables were obtained. But
when doing predictions, is it really safe to only provide them and not HP,
if I expect that there could be missing values for them? When I call
summary, I can see surrogate splits that reference the HP variable:

> summary(model)
Call:
rpart(formula = Mileage ~ Weight + Disp. + HP, data = car.test.frame)
  n= 60 

          CP nsplit rel error    xerror       xstd
1 0.62840234      0 1.0000000 1.0326274 0.17828576
2 0.12032318      1 0.3715977 0.5271278 0.08627909
3 0.04293478      2 0.2512745 0.4092689 0.07260291
4 0.01000000      3 0.2083397 0.3629544 0.06865150

Node number 1: 60 observations,    complexity param=0.6284023
  mean=24.58333, MSE=22.57639 
  left son=2 (35 obs) right son=3 (25 obs)
  Primary splits:
      Disp.  < 134    to the right, improve=0.6284023, (0 missing)
      Weight < 2567.5 to the right, improve=0.5953491, (0 missing)
      HP     < 104.5  to the right, improve=0.4085043, (0 missing)
  Surrogate splits:
      Weight < 2747.5 to the right, agree=0.900, adj=0.76, (0 split)
      HP     < 104.5  to the right, agree=0.817, adj=0.56, (0 split)

...

Assuming that the answer is:

1. The best predictions will be obtained by providing values for the
variables actually used in tree construction plus those used as
surrogates, and:

2. If a variable is neither actually used in tree construction nor as a
surrogate, it can be safely set to NA for the prediction.

Do you know of a way to easily identify the variables used as surrogates?
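
One possible approach, sketched here rather than confirmed: it assumes the rows of model$splits are laid out node by node as the primary split, then the competitor splits, then the surrogate splits, with the counts stored in model$frame$ncompete and model$frame$nsurrogate. That appears to be the indexing summary() relies on. surrogate_vars is a hypothetical helper name, not an rpart function.

```r
library(rpart)

model <- rpart(Mileage ~ Weight + Disp. + HP, car.test.frame)

# Hypothetical helper (not part of rpart): names of variables that appear in
# at least one surrogate split.
# Assumption: object$splits holds, for each non-leaf node in frame order,
# 1 primary row, then ncompete competitor rows, then nsurrogate surrogate rows.
surrogate_vars <- function(object) {
  ff <- object$frame
  is_leaf <- ff$var == "<leaf>"
  # Row of object$splits where each node's block starts
  start <- cumsum(c(1, ff$ncompete + ff$nsurrogate + !is_leaf))[seq_len(nrow(ff))]
  idx <- unlist(lapply(which(ff$nsurrogate > 0), function(i) {
    start[i] + ff$ncompete[i] + seq_len(ff$nsurrogate[i])
  }))
  unique(rownames(object$splits)[idx])
}

# Variables used in primary splits (what printcp reports)
used_in_splits <- setdiff(unique(as.character(model$frame$var)), "<leaf>")

# Variables worth supplying at prediction time under assumptions 1 and 2 above
needed <- union(used_in_splits, surrogate_vars(model))
```

Any formula predictor not in needed would then be a candidate for filling with NA.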

Thanks again for your help, and sorry to write a book in response,

Jason



[R] rpart package: why does predict.rpart require values for unused predictors?

2012-08-01 Thread Jason Roberts
After fitting and pruning an rpart model, it is often the case that one or
more of the original predictors is not used by any of the splits of the
final tree. It seems logical, therefore, that values for these unused
predictors would not be needed for prediction. But when predict() is called
on such models, all predictors seem to be required. Why is that, and can it
be easily circumvented?

Consider this example:

> model <- rpart(Mileage ~ Weight + Disp. + HP, car.test.frame)
> model
n= 60 

node), split, n, deviance, yval
      * denotes terminal node

1) root 60 1354.58300 24.58333  
  2) Disp.>=134 35  154.40000 21.40000  
    4) Weight>=3087.5 22   61.31818 20.40909 *
    5) Weight< 3087.5 13   34.92308 23.07692 *
  3) Disp.< 134 25  348.96000 29.04000  
    6) Disp.>=97.5 16  101.75000 27.12500 *
    7) Disp.< 97.5 9   84.20000 32.40000 *
> newdata <- data.frame(Disp.=car.test.frame$Disp., Weight=car.test.frame$Weight)
> predict(model, newdata=newdata)
Error in eval(expr, envir, enclos) : object 'HP' not found

In this model, Disp. and Weight were used in splits, but HP was not. Thus I
expected to be able to perform predictions by providing values for just
Disp. and Weight, but predict() failed when I tried that, complaining that
HP was not also provided.

Thanks for any help you can provide. My apologies if I simply do not
understand how this works.

Best regards,

Jason
