Re: [R] randomForest and missing data
On Tue, 9 Jan 2007, Bálint Czúcz wrote:

> There is an improved version of the original random forest algorithm
> available in the party package (you can find some additional
> information on the details here:
> http://www.stat.uni-muenchen.de/sfb386/papers/dsp/paper490.pdf ).
> I do not know whether it yields a solution to your problem about
> missing data, but maybe it's worth a check...

Yes, `cforest()' is able to deal with missing values. More specifically, the implementation is based on conditional inference trees (`ctree()'), which are able to set up surrogate splits.

Torsten

> Best regards:
> Bálint

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
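As a rough illustration of Torsten's point, the sketch below fits a conditional inference forest to data with missing predictor values. It assumes the party package is installed. The copy of `iris` with values knocked out, the seed, and the settings `ntree = 50`, `mtry = 2` and `maxsurrogate = 3` are arbitrary choices made for this example, not recommendations from the thread.

```r
library(party)

set.seed(42)
iris.na <- iris
# Set 15 randomly chosen values to NA in each of the four predictors
for (j in 1:4) iris.na[sample(nrow(iris.na), 15), j] <- NA

# maxsurrogate is handed on to ctree_control(); it sets how many
# surrogate splits are evaluated at each node (assumed setting: 3)
fit <- cforest(Species ~ ., data = iris.na,
               controls = cforest_unbiased(ntree = 50, mtry = 2,
                                           maxsurrogate = 3))

# Out-of-bag predictions for the training rows, incomplete ones included
pred <- predict(fit, OOB = TRUE)
table(predicted = pred, observed = iris.na$Species)
```

No imputation step is involved here: the incomplete data frame goes straight into `cforest()', and the surrogate splits decide where an observation goes when its primary split variable is missing.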
Re: [R] randomForest and missing data
There is an improved version of the original random forest algorithm available in the party package (you can find some additional information on the details here: http://www.stat.uni-muenchen.de/sfb386/papers/dsp/paper490.pdf ). I do not know whether it solves your problem with missing data, but maybe it's worth a check...

Best regards:
Bálint

On 1/4/07, Darin A. England [EMAIL PROTECTED] wrote:
> Does anyone know a reason why, in principle, a call to randomForest
> cannot accept a data frame with missing predictor values? If each
> individual tree is built using CART, then it seems like this should
> be possible.
[R] randomForest and missing data
Does anyone know a reason why, in principle, a call to randomForest cannot accept a data frame with missing predictor values? If each individual tree is built using CART, then it seems like this should be possible. (I understand that one may impute missing values using rfImpute or some other method, but I would like to avoid doing that.)

If this functionality were available, then when the trees are being constructed and when subsequent data are put through the forest, one would also specify an argument for the use of surrogate rules, just like in rpart.

I realize this question is very specific to randomForest, as opposed to R in general, but any comments are appreciated. I suppose I am looking for someone to say "It's not appropriate, and here's why ..." or "Good idea. Please implement and post your code."

Thanks,
Darin England, Senior Scientist
Ingenix
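For readers who have not used the rpart feature the question refers to, here is a minimal sketch of a single tree fitted with surrogate rules on incomplete data. The data set (a copy of `iris` with values removed), the seed and the control settings are assumptions made for the example.

```r
library(rpart)

set.seed(1)
iris.na <- iris
# Set 15 randomly chosen values to NA in each of the four predictors
for (j in 1:4) iris.na[sample(nrow(iris.na), 15), j] <- NA

# maxsurrogate: number of surrogate splits kept per node
# usesurrogate = 2: use surrogates in order; if all are missing,
# send the observation in the majority direction
fit <- rpart(Species ~ ., data = iris.na,
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))

# Rows with missing predictors are routed by the surrogate rules
pred <- predict(fit, newdata = iris.na, type = "class")
table(predicted = pred, observed = iris.na$Species)
```

Calling `summary(fit)` lists the surrogate splits chosen at each node. This is the per-tree behaviour that the question asks to have inside randomForest.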
Re: [R] randomForest and missing data
I don't know this package, but a general answer is that missing data may affect your model. If your data are missing at random, you might be lucky in your model building. If, however, your data are not missing at random (e.g. because of censoring), you might build a wrong predictor.

Missing at random or not: that is a question you should answer and deal with before modeling. I refer you to a book such as Analysis of Incomplete Multivariate Data by Schafer.

If there is a way around that with randomForest, I'd be interested to know too.

Hugues Sicotte

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Darin A. England
Sent: Thursday, January 04, 2007 3:13 PM
To: r-help@stat.math.ethz.ch
Subject: [R] randomForest and missing data

Does anyone know a reason why, in principle, a call to randomForest cannot accept a data frame with missing predictor values? If each individual tree is built using CART, then it seems like this should be possible.
Re: [R] randomForest and missing data
Yes, I completely agree with your statements. As for a way around it, I would say that CART has some facilities for dealing with missing data. For example, when an observation is dropped into the tree and encounters a split at which the variable is missing, one option is simply not to send it further down the tree. One may then obtain a prediction from that interior node, albeit probably not a very good one, but it is one way to handle cases with missing values. So my thought is: why can't we simply have that capability in randomForest as well?

Darin

On Thu, Jan 04, 2007 at 03:44:27PM -0600, Sicotte, Hugues Ph.D. wrote:
> I don't know about this module, but a general answer is that if you
> have missing data, it may affect your model. If your data is missing
> at random, then you might be lucky in your model building. If however
> your data was not missing at random (e.g. censoring), you might build
> a wrong predictor. Missing at random or not, that is a question you
> should answer and deal with before modeling. I refer you to a book
> like Analysis of Incomplete Multivariate data. By Schafer
> If there is a way around that with randomForest, I'd be interested to
> know too.
> Hugues Sicotte
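The option described above, stopping an observation at the interior node where its split variable is missing, can be tried on a single tree with rpart by setting `usesurrogate = 0`. This is only a sketch: the data, the seed and the object names `fit0` and `prob0` are made up for the example, and the last line relies on my understanding that the `where` component records the node each training row ended in.

```r
library(rpart)

set.seed(2)
iris.na <- iris
# Set 15 randomly chosen values to NA in each of the four predictors
for (j in 1:4) iris.na[sample(nrow(iris.na), 15), j] <- NA

# usesurrogate = 0: surrogates are for display only, and an observation
# that is missing the primary split variable is not sent further down
fit0 <- rpart(Species ~ ., data = iris.na,
              control = rpart.control(usesurrogate = 0))

# Every row still gets class probabilities, taken from the node it reached
prob0 <- predict(fit0, newdata = iris.na, type = "prob")

# How many training rows ended in a leaf and how many stopped earlier
table(ended.in.leaf = fit0$frame$var[fit0$where] == "<leaf>")
```

Predictions taken from an interior node are based on a coarser partition of the data, which matches the caveat that they are probably not very good.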
Re: [R] randomForest and missing data
You can try the Fortran version of random forests, which can replace missing values automatically. It offers two imputation methods (one fast, the other time-consuming). I did this a long time ago. The link is below; if you have any questions, just let me know.

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

In principle, each individual tree is NOT a CART tree, since the candidate splitting predictors at each node are randomly selected. My impression is that rf behaves more like a nearest-neighbor algorithm. Surrogate splits are NOT used in the rf implementation, which is why you have to impute missing values before using it; as far as I know, the R version does not do this imputation automatically during fitting.

You can check this in the original technical report or in presentations by the original authors. I remember a slide comparing rf and CART somewhere.

HTH,
weiwei

On 1/4/07, Darin A. England [EMAIL PROTECTED] wrote:
> Does anyone know a reason why, in principle, a call to randomForest
> cannot accept a data frame with missing predictor values? If each
> individual tree is built using CART, then it seems like this should
> be possible.

--
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

"Did you always know?" "No, I did not. But I believed..."
---Matrix III
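For completeness, here is a sketch of the two imputation routes that the R package randomForest provides as separate steps: `na.roughfix()`, a quick median/mode fill, and `rfImpute()`, the slower proximity-based imputation mentioned in the original question. They appear to correspond roughly to the fast and the time-consuming options of the Fortran code, but that mapping is my assumption. The data, the seeds and the object names are made up for the example.

```r
library(randomForest)

set.seed(111)
iris.na <- iris
# Set 15 randomly chosen values to NA in each of the four predictors
for (j in 1:4) iris.na[sample(nrow(iris.na), 15), j] <- NA

# By default randomForest() uses na.action = na.fail and stops on NAs.

# Route 1 (fast): fill NAs with column medians (modes for factors)
rf.rough <- randomForest(Species ~ ., data = iris.na,
                         na.action = na.roughfix)

# Route 2 (slow): start from the rough fix, then refine the filled-in
# values iteratively using proximities from a forest
set.seed(222)
iris.imp <- rfImpute(Species ~ ., data = iris.na, iter = 5, ntree = 300)
rf.imp <- randomForest(Species ~ ., data = iris.imp)

print(rf.rough)
print(rf.imp)
```

Note that `rfImpute()` uses the response to compute proximities, so it needs a complete response and does not apply directly to new data whose response is unknown.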