Re: [R] randomForest and missing data

2007-01-10 Thread Torsten Hothorn


On Tue, 9 Jan 2007, Bálint Czúcz wrote:


There is an improved version of the original random forest algorithm
available in the party package (you can find some additional
information on the details here:
http://www.stat.uni-muenchen.de/sfb386/papers/dsp/paper490.pdf ).

I do not know whether it solves your problem with missing data,
but it may be worth a look.



Yes, `cforest()' is able to deal with missing values in the predictors.
More specifically, the implementation is based on conditional inference
trees (`ctree()'), which are able to set up surrogate splits.


Torsten
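
As a rough illustration of the reply above, the following sketch fits a
conditional forest to data with missing predictor values. It assumes the
party package is installed; the iris data, the 20% missingness, and the
values chosen for ntree, mtry and maxsurrogate are arbitrary choices for
the example, not recommendations from the thread.

```r
# Sketch only: requires the 'party' package (install.packages("party")).
library(party)

set.seed(1)
dat <- iris
# Blank out 30 values in each of two predictors (illustrative missingness).
dat$Sepal.Length[sample(nrow(dat), 30)] <- NA
dat$Petal.Width[sample(nrow(dat), 30)]  <- NA

# maxsurrogate is passed on to ctree_control(); 3 is an arbitrary choice.
ctrl <- cforest_unbiased(ntree = 50, mtry = 2, maxsurrogate = 3)
fit  <- cforest(Species ~ ., data = dat, controls = ctrl)

# Out-of-bag predictions for the training rows, including those with NAs.
pred <- predict(fit, OOB = TRUE)
table(observed = dat$Species, predicted = pred)
```

No rows are dropped or imputed before fitting here; the surrogate
splits are what route incomplete observations through each tree.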


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] randomForest and missing data

2007-01-09 Thread Bálint Czúcz
There is an improved version of the original random forest algorithm
available in the party package (you can find some additional
information on the details here:
http://www.stat.uni-muenchen.de/sfb386/papers/dsp/paper490.pdf ).

I do not know whether it solves your problem with missing data,
but it may be worth a look.

Best regards:

Bálint



[R] randomForest and missing data

2007-01-04 Thread Darin A. England

Does anyone know a reason why, in principle, a call to randomForest
cannot accept a data frame with missing predictor values? If each
individual tree is built using CART, then it seems like this
should be possible. (I understand that one may impute missing values
using rfImpute or some other method, but I would like to avoid doing
that.) 

If this functionality were available, then when the trees are being
constructed and when subsequent data are put through the forest, one
would also specify an argument for the use of surrogate rules, just
like in rpart. 
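
For readers unfamiliar with the rpart behaviour referred to above, here
is a minimal sketch of surrogate rules in a single tree. The data set,
the number of blanked values, and the control settings are assumptions
made for the example.

```r
# Sketch only: rpart ships with standard R installations.
library(rpart)

set.seed(42)
dat <- iris
# Blank out 30 values of one predictor (illustrative missingness).
dat$Petal.Length[sample(nrow(dat), 30)] <- NA

# maxsurrogate: how many surrogate splits to keep per node.
# usesurrogate = 2: try the surrogates in order; if all are missing,
# send the observation in the majority direction.
fit <- rpart(Species ~ ., data = dat, method = "class",
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))

# Rows with missing predictors are still classified.
pred <- predict(fit, newdata = dat, type = "class")
table(observed = dat$Species, predicted = pred)
```

Calling summary(fit) lists the surrogate splits chosen at each node.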

I realize this question is very specific to randomForest, as opposed
to R in general, but any comments are appreciated. I suppose I am
looking for someone to say "It's not appropriate, and here's why
..." or "Good idea. Please implement and post your code."

Thanks,

Darin England, Senior Scientist
Ingenix



Re: [R] randomForest and missing data

2007-01-04 Thread Sicotte, Hugues Ph.D.
I don't know about this package, but a general answer is that if you
have missing data, it may affect your model. If your data are missing
at random, then you might be lucky in your model building.

If, however, your data are not missing at random (e.g. because of
censoring), you might build a wrong predictor.

Missing at random or not: that is a question you should answer and deal
with before modeling.
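
A small simulation can make the distinction concrete. The sketch below
is an illustration only: it uses made-up normal data, compares values
missing completely at random (the simplest case) with values that are
censored above a threshold, and looks at what happens to the sample
mean. The sample size, the 30% rate and the cut-off of 60 are all
arbitrary assumptions.

```r
set.seed(123)
n <- 10000
x <- rnorm(n, mean = 50, sd = 10)

# Completely at random: every value has the same 30% chance of being lost.
x_random <- ifelse(runif(n) < 0.3, NA, x)

# Not at random: values above 60 are never recorded (censoring-like).
x_censored <- ifelse(x > 60, NA, x)

means <- c(full     = mean(x),
           random   = mean(x_random, na.rm = TRUE),
           censored = mean(x_censored, na.rm = TRUE))
print(round(means, 2))
```

The mean of the randomly thinned sample stays close to the full-data
mean, while the censored sample is biased downwards. Any model fitted
to the censored data inherits that distortion.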

I refer you to a book such as
Analysis of Incomplete Multivariate Data, by Schafer.

If there is a way around that with randomForest, I'd be interested to
know too.

Hugues Sicotte




Re: [R] randomForest and missing data

2007-01-04 Thread Darin A. England
Yes, I completely agree with your statements. As for a way around
it, I would say that CART has some facilities for dealing with
missing data. For example, when an observation is dropped into the
tree and encounters a split on a variable whose value is missing, one
option is simply not to send it further down the tree. One may then
obtain a prediction from that interior node, albeit probably not a very
good one, but it is one way to handle cases with missing values. So my
thought is: why can't we simply have that capability in randomForest
as well?

Darin
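
The "stop at the interior node" option described above appears to
correspond to usesurrogate = 0 in rpart.control(). The sketch below
assumes that setting behaves as documented; the iris data and the
choice to blank out the root split variable are assumptions made for
the example.

```r
# Sketch only: rpart ships with standard R installations.
library(rpart)

# No surrogates: an observation missing the primary split variable
# is not sent further down the tree.
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(maxsurrogate = 0, usesurrogate = 0))

# Take one observation and blank out the variable used at the root split.
root_var <- as.character(fit$frame$var[1])
new_obs  <- iris[1, ]
new_obs[[root_var]] <- NA

# The observation cannot pass the root, so the prediction should be the
# root node's class proportions rather than those of a leaf.
p <- predict(fit, newdata = new_obs, type = "prob")
print(p)
```

As the message says, such a prediction is available but not very
informative, since it ignores every predictor below the blocked split.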



Re: [R] randomForest and missing data

2007-01-04 Thread Weiwei Shi
You can try the Fortran version of random forests, which has a function
that replaces missing values automatically. It offers two ways of
imputing (one is fast and the other is time-consuming).
I did this a long time ago.

The link is below. If you have any questions, just let me know.
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

In principle, each individual tree is NOT a CART tree, since the
candidate predictors for each split are randomly selected. My impression
is that rf is more like a nearest-neighbor algorithm. Surrogate splits
are NOT used in the rf implementation. That is why you have to impute
before using it; to the best of my knowledge, though, the imputation is
not implemented in the R version.
You can check this by reading the original technical report or a
presentation by the original authors. I remember there was a slide
comparing rf and CART somewhere.
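
For reference, the R randomForest package does provide two imputation
helpers: na.roughfix() and rfImpute() (the latter is the function named
in the original question). As far as I can tell they are roughly
analogous to the fast and slow options mentioned above, but that
correspondence is an assumption. The data, the number of blanked values
and the iter/ntree settings below are arbitrary.

```r
# Sketch only: requires the 'randomForest' package.
library(randomForest)

set.seed(7)
dat <- iris
# Blank out 20 values of one predictor (illustrative missingness).
dat$Sepal.Width[sample(nrow(dat), 20)] <- NA

# Fast: na.roughfix fills NAs with column medians (numeric columns)
# or the most frequent level (factor columns) before fitting.
fit_fast <- randomForest(Species ~ ., data = dat,
                         na.action = na.roughfix, ntree = 200)

# Slower: rfImpute starts from the rough fix and then refines the
# filled-in values over several forests using the proximity matrix.
imputed  <- rfImpute(Species ~ ., data = dat, iter = 5, ntree = 200)
fit_prox <- randomForest(Species ~ ., data = imputed, ntree = 200)

print(fit_fast)
print(fit_prox)
```

Both routes fill in the missing values before the trees are grown, so
neither is the surrogate-split approach asked about in the question.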


HTH,

weiwei



-- 
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

Did you always know?
No, I did not. But I believed...
---Matrix III
