Re: [R] Random Forest classification
This is explained in the "Details" section of the help page for partialPlot.

Best,
Andy

> -----Original Message-----
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Jesús Para Fernández
> Sent: Tuesday, April 12, 2016 1:17 AM
> To: r-help@r-project.org
> Subject: [R] Random Forest classification
>
> Hi,
>
> To evaluate the partial influence of a factor with a random forest whose
> response is OK/NOK, I'm using partialPlot, with the factor on the x axis;
> the y axis runs between -1 and 1. What do the -1 and 1 mean?
>
> An example:
>
> https://www.dropbox.com/s/4b92lqxi3592r0d/Captura.JPG?dl=0
>
> Thanks for all!

Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (2000 Galloping Hill Road, Kenilworth, New Jersey, USA 07033), and/or its affiliates (direct contact information for affiliates is available at http://www.merck.com/contact/contacts.html) that may be confidential, proprietary copyrighted and/or legally privileged. It is intended solely for the use of the individual or entity named on this message. If you are not the intended recipient, and have received this message in error, please notify us immediately by reply e-mail and then delete it from your system.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
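For archive readers: the "Details" section referred to says that for classification, the value plotted is log(p_k(x)) minus the average of the logs of all class probabilities, i.e. a log-odds-like scale rather than a probability. A minimal sketch (using iris, not the poster's data):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris)
# For classification, the y axis is on the scale
#   log(p_k) - mean over classes of log(p_j)
# so 0 means "no evidence either way" and values near -1/1 are
# log-odds-like, not probabilities (see ?partialPlot, "Details").
pd <- partialPlot(rf, iris, Petal.Length, which.class = "setosa",
                  plot = FALSE)
range(pd$y)   # log-scale values, not probabilities
```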
Re: [R] rpart and randomforest results
Hi Sonja,

How did you build the rpart tree (i.e., what settings did you use in rpart.control)? rpart by default uses cross-validation to prune back the tree, whereas RF doesn't need that. There are other, more subtle differences as well. If you want to compare single-tree results, you really want to make sure the settings in the two are as close as possible. Also, how did you compute the pseudo-R2: on a test set, or some other way?

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Schillo, Sonja
Sent: Thursday, April 03, 2014 3:58 PM
To: Mitchell Maltenfort
Cc: r-help@r-project.org
Subject: Re: [R] rpart and randomforest results

Hi,

the random forest should do that, you're totally right. As far as I know it does so by randomly selecting the variables considered for a split (but here we set the option for how many variables to consider at each split to the number of variables available, so that the random forest does not have the chance to select the variables at random). The next thing that randomForest does is bootstrapping. But here again we set the option to the number of cases in the data set, so that no bootstrapping should be done. We tried to take all the randomness out of randomForest. Is that plausible, and does anyone have another idea?

Thanks
Sonja

From: Mitchell Maltenfort [mailto:mmal...@gmail.com]
Sent: Tuesday, April 01, 2014 13:32
To: Schillo, Sonja
Cc: r-help@r-project.org
Subject: Re: [R] rpart and randomforest results

Is it possible that the random forest is somehow adjusting for optimism or overfitting?

On Apr 1, 2014 7:27 AM, Schillo, Sonja <sonja.schi...@uni-due.de> wrote:

Hi all,

I have a question on rpart and randomForest results:

We calculated a single regression tree using rpart and got a pseudo-R2 of roughly 10% (which is not too bad compared to a linear regression on this data). Encouraged by this, we grew a whole regression forest on the same data set using randomForest, but we got pretty bad pseudo-R2 values for it (even negative values for some option settings). We then thought that if we built only one single tree with the randomForest routine, we should get a result similar to that of rpart. So we set the options for randomForest to grow only one single tree, but the resulting pseudo-R2 value was negative as well.

Does anyone have a clue as to why the randomForest results are so bad whereas the rpart result is quite OK? Is our assumption wrong that a single tree grown by randomForest should give results similar to a tree grown by rpart? What am I missing here?

Thanks a lot for your help!
Sonja
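A sketch of the "single tree, randomness switched off" configuration being discussed (settings are my reading of the thread, not the posters' actual code). One thing worth noting: randomForest's pseudo-R2 is computed from out-of-bag predictions, so if the "bootstrap" sample uses every case, there is little or no OOB data to evaluate on, which by itself can make the pseudo-R2 look bad compared with rpart's resubstitution fit:

```r
library(randomForest)
set.seed(1)
# One tree, every predictor tried at every split, no bootstrap resampling:
rf1 <- randomForest(Sepal.Length ~ ., data = iris,
                    ntree = 1,
                    mtry = 4,               # all 4 predictors
                    replace = FALSE,
                    sampsize = nrow(iris))  # every case in-bag
```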
Re: [R] randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?
If you are using that code, you're not really using randomForest directly. I don't understand the data structure you have (since you did not show anything), so I can't really tell you much.

In any case, that warning comes from randomForest() when it is run in regression mode but the response has five or fewer distinct values. It may be legitimate regression data, and if so you can safely ignore the warning (that's why it's not an error). It's there to catch the cases where people try to do classification with class labels 1, 2, ..., k and forget to make the response a factor.

Best,
Andy Liaw

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Sean Porter
Sent: Thursday, March 20, 2014 3:27 AM
To: r-help@r-project.org
Subject: [R] randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?

Hello everyone,

I'm relatively new to R and new to the randomForest package, and have scoured the archives for help with no luck. I am trying to perform a regression on a set of predictors and response variables to determine the most important predictors. I have 100 response variables collected from 14 sites and 8 predictor variables from the same 14 sites. I run the code to perform the randomForest regression given by Pitcher et al. 2011 (http://gradientforest.r-forge.r-project.org/biodiversity-survey.pdf). However, after running the code I get the warning:

In randomForest.default(m, y, ...) : The response has five or fewer unique values. Are you sure you want to do regression?

And it produces a set of 500 regression trees for only 3 species, when the number of species in the response file is 100. I noticed that in the example by Pitcher they get 500 trees for only 90 species even though they input 110 species in the response data.

Why am I getting the warning / how do I solve it, and why is randomForest producing trees for only 3 species when I am looking at 100 species (response variables)?

Many thanks

Sean
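A minimal sketch (made-up data, not the poster's) that reproduces the warning and shows the factor fix Andy describes:

```r
library(randomForest)
set.seed(1)
x <- data.frame(a = rnorm(100), b = rnorm(100))
y <- sample(1:3, 100, replace = TRUE)   # class labels left numeric
rf_reg <- randomForest(x, y)            # regression => emits the warning
rf_cls <- randomForest(x, factor(y))    # classification, no warning
c(rf_reg$type, rf_cls$type)
```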
Re: [R] Variable importance - ANN
You can try something like this: http://pubs.acs.org/doi/abs/10.1021/ci050022a

It is basically the same idea as what is done in random forests: permute one predictor variable at a time and see how much that degrades prediction performance.

Cheers,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Giulia Di Lauro
Sent: Wednesday, December 04, 2013 6:42 AM
To: r-help@r-project.org
Subject: [R] Variable importance - ANN

Hi everybody,

I created a neural network for a regression analysis with package ANN, but now I need to know the significance of each predictor variable in explaining the dependent variable. I thought of analyzing the weights, but I don't know how to do it.

Thanks in advance,
Giulia Di Lauro
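The permute-and-measure idea works for any model with a predict() method, so it applies to a fitted neural network as well. A sketch with a made-up helper (`perm_importance` is not from any package), using lm as a stand-in since the poster's ANN isn't available:

```r
# Model-agnostic permutation importance: increase in MSE when one
# predictor's values are shuffled (breaking its link to the response).
perm_importance <- function(fit, X, y, nrep = 25) {
  base_mse <- mean((predict(fit, X) - y)^2)
  sapply(names(X), function(v) {
    mse_perm <- replicate(nrep, {
      Xp <- X
      Xp[[v]] <- sample(Xp[[v]])   # permute just this variable
      mean((predict(fit, Xp) - y)^2)
    })
    mean(mse_perm) - base_mse
  })
}

set.seed(1)
fit <- lm(mpg ~ wt + qsec, data = mtcars)   # stand-in for the ANN
imp <- perm_importance(fit, mtcars[c("wt", "qsec")], mtcars$mpg)
imp   # larger value = more important predictor
```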
Re: [R] How do I extract Random Forest Terms and Probabilities?
#2 can be done simply with predict(fmi, type = "prob"). See the help page for predict.randomForest().

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of arun
Sent: Tuesday, November 26, 2013 6:57 PM
To: R help
Subject: Re: [R] How do I extract Random Forest Terms and Probabilities?

Hi,

For the first part, you could do:

fmi2 <- fmi
attributes(fmi2$terms) <- NULL
capture.output(fmi2$terms)
#[1] "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width"

A.k.

On Tuesday, November 26, 2013 3:55 PM, Lopez, Dan <lopez...@llnl.gov> wrote:

Hi R Experts,

I need your help with two questions regarding randomForest.

1. When I run a random forest model, how do I extract the formula I used so that I can store it in a character vector in a dataframe? For example, the dataframe might look like this if I am running models using the iris dataset:

#ModelID, Type, Formula
#001, RF, Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

fmi <- randomForest(Species ~ ., iris, mtry = 3, ntree = 500)

I know the information is in fmi$terms, but I'm not sure how to extract just the formula. Or perhaps there is somewhere else in fmi that I could get this?

2. How do I get the probabilities (probability-like values) from the model that was run? I know for the test set I can use predict. And I know that to extract the classifications from the model I use fmi$predicted. But where are the probabilities?

Dan
Workforce Analyst
HRIM - Workforce Analytics & Metrics
LLNL
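Putting both answers together in one runnable sketch (formula() on the stored terms is one route to #1; the OOB vote fractions answer #2 without any test set):

```r
library(randomForest)
set.seed(1)
fmi <- randomForest(Species ~ ., data = iris, mtry = 3, ntree = 500)

# 1. The formula as a character string, without touching attributes:
form_chr <- paste(deparse(formula(fmi$terms)), collapse = " ")
form_chr

# 2. Class probabilities (OOB vote fractions) for the training data --
#    note: no newdata argument, so these are out-of-bag:
probs <- predict(fmi, type = "prob")
head(probs)
```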
Re: [R] interpretation of MDS plot in random forest
Yes, that's part of the intention anyway. One can also use them to do clustering.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Massimo Bressan
Sent: Monday, December 02, 2013 6:34 AM
To: r-help@r-project.org
Subject: [R] interpretation of MDS plot in random forest

Given this general example:

set.seed(1)
data(iris)
iris.rf <- randomForest(Species ~ ., iris, proximity = TRUE, keep.forest = TRUE)
#varImpPlot(iris.rf)
#varUsed(iris.rf)
MDSplot(iris.rf, iris$Species)

I've been reading the documentation about random forests (to the best of my - poor - knowledge), but I'm in trouble with the correct interpretation of the MDS plot, and I hope someone can give me some clues.

What is meant by "the scaling coordinates of the proximity matrix"? I understand that the objective here is to present the distances among species in a parsimonious and visual way (of lower dimensionality). Is there therefore a parallel to the principal components in a classical PCA? Are the scaling coordinates Dim 1 and Dim 2 the eigenvectors of the proximity matrix? If so, how would you find the eigenvalues for those eigenvectors, and what are the eigenvalues representing? What are these two dimensions in the plot saying about the different iris species - their relative distance in terms of proximity within the space of Dim 1 and Dim 2? How should one choose the k parameter (the number of dimensions for the scaling coordinates)? And finally, how would you explain the plot in simple terms?

Thank you for any feedback.

Best regards
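To make the PCA parallel concrete: MDSplot() is, in essence, classical multidimensional scaling (cmdscale) applied to 1 - proximity, so the eigenvalues the poster asks about are available directly. A sketch:

```r
library(randomForest)
set.seed(1)
iris.rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)

# Classical scaling of the dissimilarity 1 - proximity, keeping eigenvalues:
mds <- cmdscale(1 - iris.rf$proximity, k = 2, eig = TRUE)
dim(mds$points)   # one (Dim 1, Dim 2) pair per observation

# The PCA-style "share of variation" analogue from the eigenvalues:
head(mds$eig) / sum(abs(mds$eig))
```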
Re: [R] Split type in the RandomForest package
Classification trees use the Gini index, whereas the regression trees use the sum of squared errors. These criteria are hard-wired into the C/Fortran code, so they are not easily changeable.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Cheng, Chuan
Sent: Monday, September 30, 2013 6:30 AM
To: R-help@r-project.org
Subject: [R] Split type in the RandomForest package

Hi guys,

I'm new to the random forest package and I'd like to know what type of split is used in the package for classification. Or can I configure the package to use a different split type (like a simple split along a single attribute axis, or a linear split based on several attributes, etc.)?

Thanks a lot!
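While the splitting criterion can't be changed, the splits themselves can at least be inspected, which also shows that every split is axis-aligned (one variable, one split point). A sketch:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 10)
# Each row is a node; "split var" / "split point" show the single
# variable and threshold used -- there are no multi-variable splits:
tree1 <- getTree(rf, k = 1, labelVar = TRUE)
head(tree1)
```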
Re: [R] What is the difference between Mean Decrease Accuracy produced by importance(foo) vs foo$importance in a Random Forest Model?
The difference is importance(..., scale=TRUE). See the help page for details. If you extract the $importance component from a randomForest object, you do not get the scaling.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Lopez, Dan
Sent: Wednesday, November 13, 2013 12:16 PM
To: R help (r-help@r-project.org)
Subject: [R] What is the difference between Mean Decrease Accuracy produced by importance(foo) vs foo$importance in a Random Forest Model?

Hi R Expert Community,

My question: what is the difference between the Mean Decrease Accuracy produced by importance(foo) vs. foo$importance in a random forest model?

I ran a random forest classification model where the classifier is binary. I stored the model in the object FOREST_model. I then ran importance(FOREST_model) and FOREST_model$importance. I usually use the former, but decided to learn more about what is in a randomForest object, so I ran the latter. I expected both to produce identical output. Mean Decrease Gini is the only thing that is identical in both. I looked at ?randomForest and the 'randomForest' package documentation and didn't find any info explaining this difference. I am not including a reproducible example because this is most likely something simple - perhaps one is divided by something (if so, what?) - that I am just not aware of.
importance(FOREST_model):

                        HC          TER  MeanDecreaseAccuracy  MeanDecreaseGini
APPT_TYP_CD_LL  0.16025157 -0.521041660            0.15670297         12.793624
ORG_NAM_LL      0.20886631 -0.952057325            0.20208393        107.137049
NEW_DISCIPLINE  0.20685079 -0.960719435            0.20076762         86.495063

FOREST_model$importance:

                          HC           TER  MeanDecreaseAccuracy  MeanDecreaseGini
APPT_TYP_CD_LL  0.0049473962 -3.727629e-03          0.0045949805         12.793624
ORG_NAM_LL      0.0090715845 -2.401016e-02          0.0077298067        107.137049
NEW_DISCIPLINE  0.0130672572 -2.656671e-02          0.0114583178         86.495063

Dan Lopez
LLNL, HRIM, Workforce Analytics & Metrics
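A runnable sketch of the difference (iris instead of the poster's data): per the help page, scale = TRUE divides the permutation-based columns by their standard errors (stored in $importanceSD), while MeanDecreaseGini is never scaled, which is why only that column matched:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

raw    <- rf$importance[, "MeanDecreaseAccuracy"]   # unscaled, as in $importance
scaled <- importance(rf)[, "MeanDecreaseAccuracy"]  # scale = TRUE is the default

# The Gini column is identical either way:
all.equal(importance(rf)[, "MeanDecreaseGini"],
          rf$importance[, "MeanDecreaseGini"])
```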
Re: [R] FW: Nadaraya-Watson kernel
Use KernSmooth (one of the recommended packages that are included in the R distribution). E.g.,

library(KernSmooth)
# KernSmooth 2.23 loaded
# Copyright M. P. Wand 1997-2009
x <- seq(0, 1, length = 201)
y <- 4 * cos(2 * pi * x) + rnorm(x)
f <- locpoly(x, y, degree = 0, kernel = "epan", bandwidth = .1)
plot(x, y)
lines(f, lwd = 2)

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Ms khulood aljehani
Sent: Tuesday, November 05, 2013 9:49 AM
To: r-h...@stat.math.ethz.ch
Subject: [R] FW: Nadaraya-Watson kernel

From: aljehan...@hotmail.com
To: r-help@r-project.org
Subject: Nadaraya-Watson kernel
Date: Tue, 5 Nov 2013 17:42:13 +0300

Hello,

I want to compute the Nadaraya-Watson kernel estimate with the Epanechnikov kernel. I use the command

ksmooth(x, y, kernel = "normal", bandwidth, ...)

but the kernel argument accepts only the "normal" and "box" kernels. I want to compute the estimate with the Epanechnikov kernel.

Thank you
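Local polynomial regression with degree = 0 is exactly the Nadaraya-Watson estimator, which is why Andy's locpoly call answers the question. For illustration, here is a minimal hand-rolled version with an Epanechnikov kernel (`nw_epan` is a made-up helper, not part of KernSmooth):

```r
# Nadaraya-Watson with the Epanechnikov kernel K(u) = 0.75*(1 - u^2) on |u|<=1:
nw_epan <- function(x, y, xout, h) {
  sapply(xout, function(x0) {
    u <- (x - x0) / h
    w <- ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
    if (sum(w) == 0) NA_real_ else sum(w * y) / sum(w)
  })
}

x <- seq(0, 1, length = 201)
y <- 4 * cos(2 * pi * x)            # noiseless test curve
fhat <- nw_epan(x, y, xout = 0.5, h = 0.05)
fhat   # close to 4 * cos(pi) = -4
```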
Re: [R] Creating 3d partial dependence plots
It needs to be done by hand, in that partialPlot() does not handle more than one variable at a time. You need to modify its code to do that (and be ready to wait even longer, as it can be slow).

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jerrod Parker
Sent: Sunday, March 03, 2013 7:08 PM
To: r-help@r-project.org
Subject: [R] Creating 3d partial dependence plots

Help,

I've been having a difficult time trying to create 3d partial dependence plots using rgl. It looks like this question has been asked a couple of times, but I'm unable to find a clear answer by googling. I've tried creating x, y, and z variables by extracting them from the partialPlot output, to no avail. I've seen these plots used several times in articles, and I think they would help me a great deal in looking at interactions. Could someone provide a coding example using randomForest and rgl? It would be greatly appreciated.

Thank you,
Jerrod Parker
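A sketch of the "by hand" route: evaluate the model on a grid over two variables, averaging over the training data for everything else, then feed the resulting matrix to persp() or rgl. The variable choices (mtcars, wt, hp) are illustrative assumptions, and it is slow for large data, as Andy warns:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars)

# Grids over the two variables of interest:
g1 <- quantile(mtcars$wt, probs = seq(0.1, 0.9, by = 0.2))
g2 <- quantile(mtcars$hp, probs = seq(0.1, 0.9, by = 0.2))

# Two-variable partial dependence: fix (wt, hp), average predictions
# over all training rows for the remaining variables:
pd <- outer(g1, g2, Vectorize(function(a, b) {
  tmp <- mtcars
  tmp$wt <- a
  tmp$hp <- b
  mean(predict(rf, tmp))
}))

# persp(g1, g2, pd) -- or rgl::persp3d(g1, g2, pd) for an interactive plot
```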
Re: [R] How do I make R randomForest model size smaller?
Try the following:

set.seed(100)
rf1 <- randomForest(Species ~ ., data = iris)
set.seed(100)
rf2 <- randomForest(iris[1:4], iris$Species)
object.size(rf1)
object.size(rf2)
str(rf1)
str(rf2)

You can try it on your own data. That should give you some hints about why the formula interface should be avoided with large datasets.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of John Foreman
Sent: Monday, December 03, 2012 3:43 PM
To: r-help@r-project.org
Subject: [R] How do I make R randomForest model size smaller?

I've been training randomForest models on 7 million rows of data (41 features). Here's an example call:

myModel <- randomForest(RESPONSE ~ ., data = mydata, ntree = 50, maxnodes = 30)

I thought surely with only 50 trees and 30 terminal nodes that the memory footprint of myModel would be small. But it's 65 megs in a dump file. The object seems to be holding all sorts of predicted, actual, and vote data from the training process. What if I just want the forest and that's it? I want a tiny dump file that I can load later to make predictions off of quickly. I feel like the forest by itself shouldn't be all that large...

Anyone know how to strip this sucker down to just something I can make predictions off of going forward?
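Beyond switching to the x/y interface, components that predict() does not use can be dropped from the object before saving. This is a sketch, not a supported API; the safe practice shown here is to verify that predictions from the slimmed object match before trusting it:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species, ntree = 50)

# Drop training-set bookkeeping; prediction only needs $forest and metadata:
slim <- rf
for (comp in c("predicted", "votes", "oob.times", "err.rate",
               "confusion", "y"))
  slim[[comp]] <- NULL

# Verify before saving (vote fractions are deterministic given the forest):
stopifnot(identical(predict(rf,   iris[1:4], type = "prob"),
                    predict(slim, iris[1:4], type = "prob")))
c(full = object.size(rf), slim = object.size(slim))
```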
Re: [R] Different results from random.Forest with test option and using predict function
Without data to reproduce what you saw, we can only guess. One possibility is tie-breaking. There are several places where ties can occur and are broken at random, including at the prediction step. One difference between the two ways of doing prediction is that when it's all done within randomForest(), the test set prediction is performed as each tree is grown. If there is any tie that needs to be broken at any prediction step, it will affect the RNG stream used by the subsequent tree-growing step. You can also inspect/compare the forest components of the randomForest objects to see if they are the same. At least the first tree in both should be identical.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of tdbuskirk
Sent: Monday, December 03, 2012 6:31 PM
To: r-help@r-project.org
Subject: [R] Different results from random.Forest with test option and using predict function

Hello R Gurus,

I am perplexed by the different results I obtained when I ran code like this:

set.seed(100)
test1 <- randomForest(BinaryY ~ ., data = Xvars, ntree = 51, mtry = 5)
predict(test1, newdata = cbind(NewBinaryY, NewXs), type = "response")

and this code:

set.seed(100)
test2 <- randomForest(BinaryY ~ ., data = Xvars, ntree = 51, mtry = 5, xtest = NewXs, ytest = NewBinaryY)

The confusion matrices for the two forests I thought would be the same by virtue of the same seed settings, but they differ, as do the predicted values as well as the votes. At first I thought it was just the way ties were broken, so I changed the number of trees to an odd number so that there are no ties anymore.

Can anyone shed light on what I am hoping is a simple oversight? I just can't figure out why the results of the predictions from these two forests applied to the NewBinaryY and NewX data sets would not be the same.

Thanks for any hints and help.
Sincerely,
Trent Buskirk
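A sketch of the two prediction routes Andy contrasts, on iris (not the poster's data). Even with identical seeds, agreement need not be exact, because in-training test prediction interleaves with tree growing and any random tie-breaking shifts the RNG stream:

```r
library(randomForest)
train <- iris[seq(1, 150, by = 2), ]
test  <- iris[seq(2, 150, by = 2), ]

# Route 1: fit, then predict() on new data
set.seed(100)
rf_a   <- randomForest(Species ~ ., data = train, ntree = 51)
pred_a <- predict(rf_a, test)

# Route 2: pass the test set into randomForest() itself
set.seed(100)
rf_b   <- randomForest(Species ~ ., data = train, ntree = 51,
                       xtest = test[1:4], ytest = test$Species)
pred_b <- rf_b$test$predicted

mean(pred_a == pred_b)   # typically close to, but not guaranteed to be, 1
```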
Re: [R] Partial dependence plot in randomForest package (all flat responses)
Not unless we have more information. Please read the posting guide to see how to make it easier for people to answer your question.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Oritteropus
Sent: Thursday, November 22, 2012 2:02 PM
To: r-help@r-project.org
Subject: [R] Partial dependence plot in randomForest package (all flat responses)

Hi,

I'm trying to make a partial plot with the package randomForest in R. After building my random forest object I type

partialPlot(data.rforest, pred.data = act2, x.var = "centroid", "C")

where data.rforest is my randomForest object, act2 is the original dataset, centroid is one of the predictors, and C is one of the classes in my response variable. Whatever predictor or response class I try, I always get a plot with a straight line (a completely flat response). Similarly, if I set a categorical variable as the predictor, I get a barplot with all bars of the same height. I suppose I'm doing something wrong here, because all other analyses on the same randomForest object seem correct (e.g. varImpPlot or MDSplot). Is it possible it is related to some option set in the random forest object? Can somebody see the problem here?

Thanks for your time
Re: [R] Random Forest for multiple categorical variables
How about taking the combination of the two? E.g.,

gamma <- factor(paste(alpha, beta, sep = ":"))

and use gamma as the response.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Gyanendra Pokharel
Sent: Tuesday, October 16, 2012 10:47 PM
To: R-help@r-project.org
Subject: [R] Random Forest for multiple categorical variables

Dear all,

I have the following data set:

    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 alpha  beta
1    1 11  1 11  1 11  1 11  1  11 alpha beta1
2    2 12  2 12  2 12  2 12  2  12 alpha beta1
3    3 13  3 13  3 13  3 13  3  13 alpha beta1
4    4 14  4 14  4 14  4 14  4  14 alpha beta1
5    5 15  5 15  5 15  5 15  5  15 alpha beta1
6    6 16  6 16  6 16  6 16  6  16 alpha beta2
7    7 17  7 17  7 17  7 17  7  17 alpha beta2
8    8 18  8 18  8 18  8 18  8  18 alpha beta2
9    9 19  9 19  9 19  9 19  9  19 alpha beta2
10  10 20 10 20 10 20 10 20 10  20 alpha beta2

I want to use randomForest for classification. If there is one categorical response variable with different classes, we can use randomForest(resp ~ ., data, ...), but here I need to classify the data with two categorical variables. Any idea will be great.

Thanks
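A runnable sketch of the combined-response trick on made-up data (the column names and class rules are hypothetical, echoing the post). Predictions on the joint factor can be split back into the two original labels with strsplit():

```r
library(randomForest)
set.seed(1)
n <- 200
dat <- data.frame(V1 = rnorm(n), V2 = rnorm(n))
alpha <- factor(ifelse(dat$V1 > 0, "alpha1", "alpha2"))
beta  <- factor(ifelse(dat$V2 > 0, "beta1",  "beta2"))

# One joint response covering both categorical variables:
dat$gamma <- factor(paste(alpha, beta, sep = ":"))
rf <- randomForest(gamma ~ ., data = dat)
levels(dat$gamma)
```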
Re: [R] Random Forest - Extract
1. Not sure what you want - what details are you looking for exactly? If you call predict(FOREST_model) without the newdata argument, you will get the (out-of-bag) prediction of the training set, which is exactly the predicted component of the RF object.

2. If you set type = "vote" and norm.votes = FALSE, you will get the counts instead of proportions.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Lopez, Dan
Sent: Wednesday, September 26, 2012 9:05 PM
To: R help (r-help@r-project.org)
Subject: [R] Random Forest - Extract

Hello,

I have two random forest (RF) related questions.

1. How do I view the classifications for the detail data of my training data (aka trainset) that I used to build the model? I know there is an object called predicted, which I believe is a vector. To view the detail for my testset I bind the columns together as below. I was trying to do something similar for my trainset, but without putting it through the predict function - instead taking it directly from the randomForest object, which I stored in FOREST_model. I really need to get to this information to do some comparison of certain cases.

RF_DTL <- cbind(testset, predict(FOREST_model, testset, type = "response"))

2. The predict function for an RF model in R has three possible type arguments: "response", "vote", or "prob". I noticed "vote" and "prob" are identical for all records in my data set. Is this typical? If so, what is the point of having both? Ease of use?

Dan
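Both answers in one runnable sketch (iris in place of the poster's trainset):

```r
library(randomForest)
set.seed(1)
FOREST_model <- randomForest(Species ~ ., data = iris)

# 1. OOB classifications of the training data, two equivalent views:
oob_a <- predict(FOREST_model)   # no newdata => out-of-bag predictions
oob_b <- FOREST_model$predicted

# 2. Raw vote counts rather than proportions:
cnt <- predict(FOREST_model, iris, type = "vote", norm.votes = FALSE)
head(cnt)
```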
Re: [R] interpret the importance output?
The type = 1 importance measure in RF compares the prediction error of each tree on the OOB data with the prediction error of the same tree on the OOB data with the values of one variable randomly shuffled. If the variable has no predictive power, then the two should be very close, and there's a 50% chance that the difference is negative. If the variable is important, then shuffling its values should significantly degrade the prediction in the form of increased MSE. The importance measure takes the mean of these individual-tree differences in MSE and then divides by the SD of the differences. With that, I hope it's clear that only v2 and v4 in your example are potentially important.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Johnathan Mercer
Sent: Monday, August 27, 2012 11:40 AM
To: r-h...@stat.math.ethz.ch
Subject: [R] interpret the importance output?

> importance(rfor.pdp11_t25.comb1, type = 1)
         %IncMSE
v1 -0.28956401263
v2  1.92865561147
v3  -0.63443929130
v4  1.58949137047
v5  0.03190940065

I wasn't entirely confident with interpreting these results based on the documentation. Could you please interpret?
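A sketch that makes Andy's interpretation concrete with made-up data: one genuinely predictive variable and one pure-noise variable. The informative one gets a large positive %IncMSE, while the noise variable hovers near zero (and can dip negative, as in the poster's v1 and v3):

```r
library(randomForest)
set.seed(1)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- 2 * d$x1 + rnorm(300)   # y depends on x1 only

rf  <- randomForest(y ~ ., data = d, importance = TRUE)
imp <- importance(rf, type = 1)
imp   # x1 large and positive; x2 near (possibly below) zero
```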
Re: [R] Stratified Sampling with randomForest Regression
Yes, you need to modify both the R and the underlying C code. It's in the source package on CRAN (the .tar.gz file). Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Josh Browning Sent: Friday, June 01, 2012 10:48 AM To: r-help@r-project.org Subject: [R] Stratified Sampling with randomForest Regression Hi All, I'm using R's randomForest package (and it's quite awesome!) but I'd really like to do some stratified sampling with a regression problem. However, it appears that the package was designed to only accommodate stratified sampling for classification purposes (see https://stat.ethz.ch/pipermail/r-help/2006-November/117477.html). As Andy suggests in the link just mentioned, I'm trying to modify the source code. However, it appears that I may also need to modify the C code that randomForest is calling, is that correct? If so, how do I access that code? Or, has anyone modified the package to allow for stratified sampling in regression problems? Please let me know if I'm not being clear enough with this question, and thanks for helping me out! Josh [[alternative HTML version deleted]]
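For reference, a hedged sketch of fetching the CRAN source package for local modification (the destination directory and the rebuild commands in the comments are illustrative):

```r
## fetch the source package (.tar.gz) from CRAN; it contains both the
## R code (R/) and the underlying C code (src/)
download.packages("randomForest", destdir = tempdir(), type = "source")
## after editing, rebuild and install from the shell, e.g.:
##   R CMD build randomForest
##   R CMD INSTALL randomForest_x.y-z.tar.gz
```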
Re: [R] Question about random Forest function in R
Hi Kelly, The function has a limitation: it cannot handle any column in your x that is a categorical variable with more than 32 categories. One possibility is to see if you can bin some of the categories together to get below 32 categories. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Kelly Cool Sent: Tuesday, May 29, 2012 10:47 AM To: r-help@r-project.org Subject: [R] Question about random Forest function in R Hello, I am trying to run the randomForest function on a data.frame using the following code:
myrf <- randomForest(y=sample_data_metal, x=Train, importance=TRUE, proximity=TRUE)
However, an error occurs saying "Can not handle categorical predictors with more than 32 categories." My x=Train data.frame is quite large and my y=sample_data_metal is one column. I'm not sure how to go about fixing this error or if there is even a way to get around it. Thanks in advance for any help. [[alternative HTML version deleted]]
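A sketch of the binning Andy suggests, using a hypothetical high-cardinality factor f (data made up for illustration):

```r
## collapse rare levels so a factor has fewer than 32 categories
set.seed(1)
f <- factor(sample(paste0("cat", 1:40), 500, replace = TRUE))  # 40 levels
keep <- names(sort(table(f), decreasing = TRUE))[1:31]  # 31 most frequent
f2 <- factor(ifelse(as.character(f) %in% keep, as.character(f), "other"))
nlevels(f2)  # now at most 32 levels, acceptable to randomForest()
```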
Re: [R] Random Forest Classification_ForestCombination
As long as you can remember that the summaries such as variable importance, OOB predictions, and OOB error rates are not applicable, I think that should be fine. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Nikita Desai Sent: Wednesday, May 23, 2012 1:51 PM To: r-help@R-project.org Subject: [R] Random Forest Classification_ForestCombination Hello, I am aware of the fact that the combine() function in the Random Forest package of R is meant to combine forests built from the same training set, but is there any way to combine trees built on different training sets? Both the training datasets used contain the same variables and classes, but their sizes are different. Thanks [[alternative HTML version deleted]]
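For reference, combine() in its documented use looks like the sketch below (same training data; for different training sets with identical variables and classes the call is mechanically the same, but, per Andy's caveat, the OOB-based summaries of the result are not meaningful):

```r
library(randomForest)
set.seed(1)
rf1 <- randomForest(Species ~ ., data = iris, ntree = 50)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 50)
rf.all <- combine(rf1, rf2)  # one forest holding the trees of both
rf.all$ntree                 # total number of trees in the combined forest
```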
Re: [R] Random forests prediction
I don't think this is so hard to explain. If you evaluate AUC using either OOB prediction or a test set (or something like CV or the bootstrap), that would be what I expect for most data. When you add more variables (that are, say, less informative) to a model, the model has to look harder to find the informative ones, and thus you pay a penalty. One exception is if some of the new variables happen to have very strong interactions with some of the old variables; then you may see improved performance. I've said it several times before, but it seems to be worth repeating: Don't use the training set for evaluating models; that almost never makes sense. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of matt Sent: Friday, May 11, 2012 3:43 PM To: r-help@r-project.org Subject: [R] Random forests prediction Hi all, I have a strange problem when applying RF in R. I have a set of variables with which I obtain an AUC of 0.67. I also have a second set of variables that has an AUC of 0.57. When I merge the first and second sets of variables, the AUC becomes 0.64. I would expect the prediction to become better as I add variables that do have some predictive power. This is even stranger because the AUC on the training set increased when I added more variables (while the AUC on the validation set thus decreased). Has anyone experienced the same and/or does anyone know what could be the reason? Thanks, Matthijs -- View this message in context: http://r.789695.n4.nabble.com/Random-forests-prediction-tp4627409.html Sent from the R help mailing list archive at Nabble.com.
Re: [R] No Data in randomForest predict
It doesn't: you just get an error if there are NAs in the data; e.g.,
R> rf1 <- randomForest(iris[1:4], iris[[5]])
R> predict(rf1, newdata=data.frame(Sepal.Length=1, Sepal.Width=2, Petal.Length=3, Petal.Width=NA))
Error in predict.randomForest(rf1, newdata = data.frame(Sepal.Length = 1, : missing values in newdata
Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jennifer Corcoran Sent: Saturday, May 05, 2012 5:17 PM To: r-help@r-project.org Subject: [R] No Data in randomForest predict I would like to ask a general question about the randomForest predict function and how it handles No Data values. I understand that you can omit No Data values while developing the randomForest object, but how does it handle No Data in the prediction phase? I would like the output to be NA if any (not just all) of the input data have an NA value. It is not clear to me whether this is the default or whether I need to add an argument in the predict function. [[alternative HTML version deleted]]
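If NA outputs are wanted instead of an error, one workaround (a sketch; newdat is a made-up data frame, and rf1 is refit as in the iris example) is to predict only on complete rows and leave the rest as NA:

```r
library(randomForest)
rf1 <- randomForest(iris[1:4], iris[[5]])
newdat <- data.frame(Sepal.Length = c(5.1, 1), Sepal.Width = c(3.5, 2),
                     Petal.Length = c(1.4, 3), Petal.Width = c(0.2, NA))
ok <- complete.cases(newdat)        # TRUE only for rows with no NAs
pred <- rep(NA_character_, nrow(newdat))
pred[ok] <- as.character(predict(rf1, newdata = newdat[ok, , drop = FALSE]))
pred  # second element stays NA because one of its inputs was missing
```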
Re: [R] Random forests prediction
That's not how RF works at all. The setting of mtry is irrelevant to this. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of matt Sent: Monday, May 14, 2012 10:22 AM To: r-help@r-project.org Subject: Re: [R] Random forests prediction But shouldn't it be resolved when I set mtry to the maximum number of variables? Then the model explores all the variables for the next step, so it will still be able to find the better ones? And then in the later steps it could use the (less important) variables. Matthijs -- View this message in context: http://r.789695.n4.nabble.com/Random-forests-prediction-tp4627409p4629944.html Sent from the R help mailing list archive at Nabble.com.
Re: [R] Partial Dependence and RandomForest
Note that the partialPlot() function also returns the x-y pairs being plotted, so you can work from there if you wish. As to SD, my guess is you want some sort of confidence interval or band around the curve? I do not know of any theory to produce that, but that may well just be my ignorance. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of jmc Sent: Friday, April 13, 2012 11:20 AM To: r-help@r-project.org Subject: Re: [R] Partial Dependence and RandomForest Thank you Andy. I obviously neglected to read into the help file and, frustratingly, could have known this all along. However, I am still interested in knowing the relative maximum value in the partial plots via query instead of visual interpretation (and possibly getting at other statistical measures like standard deviation). Is it possible to do this? I will keep investigating, but would appreciate a hint in the right direction if you have time. -- View this message in context: http://r.789695.n4.nabble.com/Partial-Dependence-and-RandomForest-tp4549705p4555146.html Sent from the R help mailing list archive at Nabble.com.
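A sketch of querying the curve's maximum from the returned values rather than reading it off the plot (iris used as stand-in data):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species)
## plot = FALSE suppresses the plot; the x-y pairs are still returned
pp <- partialPlot(rf, iris[1:4], Petal.Width,
                  which.class = "versicolor", plot = FALSE)
pp$x[which.max(pp$y)]  # predictor value where partial dependence peaks
```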
Re: [R] loess function take
Alternatively, use only a subset to run loess(): either a random sample or something like every k-th (sorted) data value, or the quantiles. It's hard for me to imagine that that many data points are going to improve your model much at all (unless you use a tiny span). Andy From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Uwe Ligges On 12.04.2012 05:49, arunkumar wrote: Hi, the loess function takes a very long time if the dataset is huge. I have around 100 records and used only one independent variable; it still takes a very long time. Any suggestion to reduce the time? Use another method that is computationally less expensive for that many observations. Uwe Ligges - Thanks in Advance Arun -- View this message in context: http://r.789695.n4.nabble.com/loess-function-take-tp4550896p4550896.html Sent from the R help mailing list archive at Nabble.com.
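A sketch of the subsampling suggestion (data simulated for illustration; the sample size and span are arbitrary choices, not from the thread):

```r
set.seed(1)
## simulate a large dataset with one predictor
n <- 2e5
big <- data.frame(x = runif(n))
big$y <- sin(2 * pi * big$x) + rnorm(n, sd = 0.2)
## fit loess on a random subsample instead of all rows
sub <- big[sample(n, 5000), ]
fit <- loess(y ~ x, data = sub, span = 0.3)
## predict back onto the full data if fitted values are needed everywhere
head(predict(fit, newdata = big))
```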
Re: [R] Partial Dependence and RandomForest
Please read the help page for the partialPlot() function and make sure you learn about all its arguments (in particular, which.class). Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of jmc Sent: Wednesday, April 11, 2012 2:44 PM To: r-help@r-project.org Subject: [R] Partial Dependence and RandomForest Hello all~ I am interested in clarifying something more conceptual, so I won't be providing any data or code here. From what I understand, partial dependence plots can help you understand the relative dependence on a variable, and the subsequent values of that variable, after averaging out the effects of the other input variables. This is great, but what I am interested in knowing is how that relates to each predictor class, not just the overall prediction. Is it possible to plot partial dependence per class? Specifically, I'd like to know the important threshold values of my most important variables. Thank you for your time, -- View this message in context: http://r.789695.n4.nabble.com/Partial-Dependence-and-RandomForest-tp4549705p4549705.html Sent from the R help mailing list archive at Nabble.com.
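A sketch of per-class partial dependence using which.class (iris as stand-in data; one curve per class):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species)
## one partial-dependence curve of Petal.Length per response class
op <- par(mfrow = c(1, 3))
for (cl in levels(iris$Species))
  partialPlot(rf, iris[1:4], Petal.Length, which.class = cl, main = cl)
par(op)
```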
Re: [R] Execution speed in randomForest
Without seeing your code, it's hard to say much more, but do avoid using the formula interface when you have large data. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jason Caroline Shaw Sent: Friday, April 06, 2012 1:20 PM To: jim holtman Cc: r-help@r-project.org Subject: Re: [R] Execution speed in randomForest The CPU time and elapsed time are essentially identical. (That is, the system time is negligible.) Using Rprof, I just ran the code twice. The first time, while randomForest is doing its thing, there are 850 consecutive lines which read: .C randomForest.default randomForest randomForest.formula randomForest. Upon running it a second time, this time taking 285 seconds to complete, there are 14201 such lines, with nothing intervening. There shouldn't be interference from elsewhere on the machine; this is the only memory- and CPU-intensive process. I don't know how to check what kind of paging is going on, but since the machine has 16GB of memory and I am using maybe 3 or 4 at most, I hope paging is not an issue. I'm on a CentOS 5 box running R 2.15.0. On Fri, Apr 6, 2012 at 12:45 PM, jim holtman jholt...@gmail.com wrote: Are you looking at the CPU or the elapsed time? If it is the elapsed time, then also capture the CPU time to see if it is different. Also consider the use of the Rprof function to see where time is being spent. What else is running on the machine? Are you doing any paging? What type of system are you running on? Use some of the system-level profiling tools. If on Windows, then use perfmon. On Fri, Apr 6, 2012 at 11:28 AM, Jason Caroline Shaw los.sh...@gmail.com wrote: I am using the randomForest package. I have found that multiple runs of precisely the same command can generate drastically different run times. Can anyone with knowledge of this package provide some insight as to why this would happen and whether there's anything I can do about it?
Here are some details of what I'm doing: - Data: ~80,000 rows, with 10 columns (one of which is the class label) - I randomly select 90% of the data to use to build 500 trees. And this is what I find: - Execution times of randomForest() using the entire dataset (in seconds): 20.65, 20.93, 20.79, 21.05, 21.00, 21.52, 21.22, 21.22 - Execution times of randomForest() using the 90% selection: 17.78, 17.74, 126.52, 241.87, 17.56, 17.97, 182.05, 17.82 -- Note the 3rd, 4th, and 7th. - When the speed is slow, it often stutters, with one or a few trees being produced very quickly, followed by a slow build taking 10 or 20 seconds - The OOB results are indistinguishable between the fast and slow runs. I select the 90% of my data by using sample() to generate indices and then subsetting, like: selection <- data[sample, ]. I thought perhaps this subsetting was getting repeated, rather than storing in memory a new copy of all that data, so I tried circumventing this with eval(data[sample, ]). Probably barking up the wrong tree -- it had no effect, and doesn't explain the run-to-run variation (really, I'm just not clear on what eval() is for). I have also tried garbage collecting with gc() between each run, and adding a Sys.sleep() for 5 seconds, but neither of these has helped either. Any ideas? -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.
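Regarding Andy's note about the formula interface: a sketch of the x/y call that avoids formula overhead on large data (iris as stand-in data; column positions are illustrative):

```r
library(randomForest)
set.seed(1)
## formula interface (slower on big data):
##   rf <- randomForest(Species ~ ., data = iris)
## direct interface: pass the predictor columns and the label separately
rf <- randomForest(x = iris[, 1:4], y = iris[, 5], ntree = 500)
rf$type  # "classification", since y is a factor
```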
Re: [R] Imputing missing values using LSmeans (i.e., population marginal means) - advice in R?
Don't know how you searched, but perhaps this might help: https://stat.ethz.ch/pipermail/r-help/2007-March/128064.html -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jenn Barrett Sent: Tuesday, April 03, 2012 1:23 AM To: r-help@r-project.org Subject: [R] Imputing missing values using LSmeans (i.e., population marginal means) - advice in R? Hi folks, I have a dataset that consists of counts over a ~30 year period at multiple (200) sites. Only one count is conducted at each site in each year; however, not all sites are surveyed in all years. I need to impute the missing values because I need an estimate of the total population size (i.e., sum of counts across all sites) in each year as input to another model.
head(newdat, 40)
   SITE YEAR COUNT
1     1 1975 12620
2     1 1976 13499
3     1 1977 45575
4     1 1978 21919
5     1 1979 33423
...
37    2 1975     4
38    2 1978 40322
39    2 1979     7
40    2 1980 16244
It was suggested to me by a statistician to use LSmeans to do this; however, I do not have SAS, nor do I know anything much about SAS. I have spent DAYS reading about these LSmeans and while (I think) I understand what they are, I have absolutely no idea how to a) calculate them in R and b) how to use them to impute my missing values in R. Again, I've searched the mail lists, internet and literature and have not found any documentation to advise on how to do this - I'm lost. I've looked at popMeans, but have no clue how to use this with predict() - if this is even the route to go. Any advice would be much appreciated. Note that YEAR will be treated as a factor and not a linear variable (i.e., the relationship between COUNT and YEAR is not linear - rather there are highs and lows about every 10 or so years). One thought I did have was to just set up a loop to calculate the least-squares estimates as: Yij = (I*Yi. + J*Y.j - Y..)/[(I-1)(J-1)], where I = number of treatments and J = number of blocks (so I = sites and J = years).
I found this formula in some stats lecture handouts by UC Davis on unbalanced data and LSMeans...but does it yield the same thing as using the LSmeans estimates? Does it make any sense? Thoughts? Many thanks in advance. Jenn
Re: [R] Question about randomForest
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Saruman I don't see how this answered the original question of the poster. He was quite clear: the values of the predictions coming out of RF do not match what comes out of the predict function using the same RF object and the same data. Therefore, what is predict() doing that is different from RF? Yes, RF is making its predictions using OOB, but nowhere does it say what predict() is doing; indeed, it says if newdata is not given, then the results are just the OOB predictions. But if newdata = olddata, then predict(newdata) != OOB predictions. So what is it then? Let me make this as clear as I possibly can: If predict() is called without newdata, all it can do is assume prediction on the training set is desired. In that case it returns the OOB prediction. If newdata is given in predict(), it assumes it is new data and thus makes predictions using all trees. If you just feed the training data as newdata, then yes, you will get overfitted predictions. It almost never makes sense (to me anyway) to make predictions on the training set. This opens another issue, which is: if newdata is close to but not exactly olddata, do you get overfitted results? Possibly, depending on how close the new data are to the training set. This applies to nearly _ALL_ methods, not just RF. Andy -- View this message in context: http://r.789695.n4.nabble.com/Question-about-randomForest-tp4111311p4529770.html Sent from the R help mailing list archive at Nabble.com.
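A sketch illustrating the distinction Andy describes (iris as stand-in data): OOB predictions versus re-predicting the training set through all trees:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species)
oob   <- predict(rf)                        # no newdata: OOB predictions
resub <- predict(rf, newdata = iris[1:4])   # newdata: all trees vote
mean(oob == iris$Species)    # honest accuracy estimate
mean(resub == iris$Species)  # optimistic; typically at or near 1
```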
Re: [R] Memory limits for MDSplot in randomForest package
Sam, As you've probably seen, all the MDSplot() function does is feed 1 - proximity to the cmdscale() function. Some suggestions and clarifications:
1. If all you want is the proximity matrix, you can run randomForest() with keep.forest=FALSE to save memory. You will likely want to run a somewhat large number of trees if you're interested in proximity, and with the large number of data points, the trees are going to be quite large as well.
2. The proximity is n x n, so if you have about 19000 data points, that's a 19000 by 19000 matrix, which takes approx. 2.8GB of memory to store a copy.
3. I tried making up a 19000^2 cross-product matrix, then tried cmdscale(1-xx, k=5). The memory usage seems to peak at around 16.3GB, but I killed it after more than two hours. Thus I suspect it really is the eigen decomposition in cmdscale() on such a large matrix that's taking up the time.
My suggestion is to see if you can find some efficient ways of doing eigen decomposition on such large matrices. You might be able to make the proximity matrix sparse (e.g., by thresholding), and see if there are packages that can do the decomposition in sparse form. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Sam Albers Sent: Friday, March 23, 2012 3:31 PM To: r-help@r-project.org Subject: [R] Memory limits for MDSplot in randomForest package Hello, I am struggling to produce an MDS plot using the randomForest package with a moderately large data set. My data set has one categorical response variable, 7 predictor variables and just under 19000 observations. That means my proximity matrix is approximately 19000 by 19000, which is quite large. To train a random forest on this large a dataset I have to use my institution's high performance computer. Using this setup I was able to train a randomForest with the proximity argument set to TRUE.
At this point I wanted to construct an MDSplot using the following: MDSplot(nech.rf, nech.d$pd.fl, palette=c(1,2,3), pch=as.numeric(nech.d$pd.fl)) where nech.rf is the randomForest object and nech.d$pd.fl is the classification factor. Now with the architecture listed below, I've been waiting for approximately 2 days for this to run. My issue is that I am not sure if this will ever run. Can anyone recommend a way to tweak the MDSplot function to run a little faster? I tried changing the cmdscale arguments (i.e. eigenvalues) within the MDSplot function a little but that didn't seem to have any effect of the overall running time using a much smaller data set. Or even if someone could comment whether I am dreaming that this will actually ever run? This is probably the best computer that I will have access to so I was hoping that somehow I could get this to run. I was just hoping that someone reading the list might have some experience with randomForests and using large datasets and might be able to comment on my situation. Below the architecture information I have constructed a dummy example to illustrate what I am doing but given the nature of the problem, this doesn't completely reflect my situation. Any help would be much appreciated! Thanks! 
Sam
Computer specs and sessionInfo():
OS: Suse Linux
Memory: 64 GB
Processors: Intel Itanium 2, 64 x 1500 MHz
And: sessionInfo() R version 2.6.2 (2008-02-08) ia64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] randomForest_4.6-6 loaded via a namespace (and not attached): [1] rcompgen_0.1-17
###
# Dummy Example
###
require(randomForest)
set.seed(17)
## Number of points
x <- 10
df <- rbind(
  data.frame(var1=runif(x, 10, 50), var2=runif(x, 2, 7), var3=runif(x, 0.2, 0.35),
             var4=runif(x, 1, 2), var5=runif(x, 5, 8), var6=runif(x, 1, 2),
             var7=runif(x, 5, 8), cls=factor("CLASS-2")),
  data.frame(var1=runif(x, 10, 50), var2=runif(x, -3, 3), var3=runif(x, 0.1, 0.25),
             var4=runif(x, 1, 2), var5=runif(x, 5, 8), var6=runif(x, 1, 2),
             var7=runif(x, 5, 8), cls=factor("CLASS-1")))
df.rf <- randomForest(y=df[,8], x=df[,1:7], proximity=TRUE, importance=TRUE)
MDSplot(df.rf, df$cls, k=2, palette=c(1,2,3,4), pch=as.numeric(df$cls))
Re: [R] fitted values with locfit
I believe you are expecting the software to do something it did not claim to be able to do. predict.locfit() does not have a type argument, nor can it take the value "terms". When you specify two variables in the smooth, a bivariate smooth is done, so you get one bivariate smooth function, not the sum of two univariate smooths. If the latter is what you want, use packages that fit additive models. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Soberon Velez, Alexandra Pilar Sent: Monday, March 19, 2012 5:13 AM To: r-help@r-project.org Subject: [R] fitted values with locfit Dear memberships, I'm trying to estimate the following multivariate local regression model using the locfit package: BMI = m1(RCC) + m2(WCC), where m1 and m2 are unknown smooth functions. My problem is that once the regression is done I cannot get the fitted values of each of these smooth functions m1 and m2. What I write is the following:
library(locfit)
data(ais)
fit2 <- locfit.raw(x=lp(ais$RCC, h=0.5, deg=1) + lp(ais$WCC, deg=1, h=0.75), y=ais$BMI, ev=dat(), kt="prod", kern="gauss")
g21 <- predict(fit2, type="terms")
If I run this on the computer, the result g21 is a vector, when I should have a matrix with 2 columns (one for each fitted smooth function). Please, does somebody know how I can get the estimated fitted values of both smooth functions m1 and m2 using a local linear regression with kernel weights as in this example? Thanks a lot in advance; I'm very desperate. Alexandra [[alternative HTML version deleted]]
[R] job opening at Merck Research Labs, NJ USA
The Biometrics Research department at the Merck Research Laboratories has an open position to be located in Rahway, New Jersey, USA: This position will be responsible for imaging and bio-signal biomarkers projects including analysis of preclinical, early clinical, and experimental medicine imaging and EEG data. Responsibilities include all phases of data analysis from processing of raw imaging and EEG data to derivation of endpoints. Part of the responsibilities is development and implementation of novel statistical methods and software for analysis of imaging and bio-signal data. This position will closely collaborate with Imaging and Clinical Pharmacology departments; Experimental Medicine; Early and Late Stage Development Statistics; and Modeling and Simulation. Publication and presentation of the results is highly encouraged, as is collaboration with external experts. Education Minimum Requirement: PhD in Statistics, Applied Mathematics, Physics, Computer Science, Engineering, or related fields. Required Experience and Skills: Education should include Statistics-related courses, or equivalent working experience involving data analysis and statistical modeling for at least 1 year. Excellent computing skills: R and/or SAS, MATLAB in Linux and Windows environments; working knowledge of parallel computing; C, C++, or Fortran programming.
Dissertation or experience in at least one of these areas: statistical image and signal analysis; data mining and machine learning; mathematical modeling in medicine and biology; general statistical research Desired Experience and Skills - education in and/or experience with EEG and Imaging data analysis; stochastic modeling; functional data analysis; familiarity with wavelet analysis and other spectral analysis methods Please apply electronically at: http://www.merck.com/careers/search-and-apply/search-jobs/home.html Click on Experienced Opportunities, and search by Requisition ID: BIO003546 and email CV to: vladimir_svet...@merck.com
Re: [R] Using categorical variables in package randomForest.
The way to represent categorical variables is with factors; see ?factor. randomForest() will handle factors appropriately, as do most modeling functions in R. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of abhishek Sent: Tuesday, March 13, 2012 8:11 AM To: r-help@r-project.org Subject: [R] Using categorical variables in package randomForest.

Hello, I am sorry if there are already posts that answer this question; I tried to find them before making this post, but did not find relevant ones. I am using the randomForest package to build a two-class classifier. There are categorical variables and numerical variables in my data, and the categorical variables have anywhere from 2 to 10 categories. I am not sure how to represent the categorical data. For example, I am using 0 and 1 for variables that have only two categories, but I suspect the program is treating the values as numerical. Do you have any idea how I can use the categorical variables for building a two-class classifier? I am using a factor consisting of 0 and 1 as the classification target. Thank you for your ideas. - abhishek -- View this message in context: http://r.789695.n4.nabble.com/Using-caegorical-variables-in-package-randomForest-tp4468923p4468923.html Sent from the R help mailing list archive at Nabble.com.
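As a minimal sketch of Andy's advice (the data frame and variable names here are made up for illustration): encode each categorical predictor as a factor, and randomForest() treats its levels as categories rather than numbers.

```r
## Sketch: categorical predictors as factors; randomForest() splits on the
## factor levels instead of treating 0/1/2... as numeric values.
library(randomForest)
set.seed(1)
d <- data.frame(
  color = factor(sample(c("red", "green", "blue"), 100, replace = TRUE)),
  size  = runif(100),
  class = factor(sample(0:1, 100, replace = TRUE))  # two-class target
)
rf <- randomForest(class ~ ., data = d)
```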
Re: [R] Help on reshape function
Just using the reshape() function in base R:

df.long <- reshape(df, varying = list(names(df)[4:7]), direction = "long")

This also gives two extra columns (time and id) that can be dropped. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of R. Michael Weylandt Sent: Tuesday, March 06, 2012 8:45 AM To: mails Cc: r-help@r-project.org Subject: Re: [R] Help on reshape function

library(reshape2)
melt(df, id.vars = c("ID1", "ID2", "ID3"))[, -4]
# To drop an extraneous column (but you should take a look and see what it is
# for future reference)

Michael

On Tue, Mar 6, 2012 at 6:17 AM, mails mails00...@gmail.com wrote: Hello, I am trying to reshape a data.frame in wide format into long format. Although the reshape R documentation lists some examples, I am struggling to bring my data.frame into long format and then transform it back into wide format. The data.frame I am looking at is:

df <- data.frame(ID1 = c(1,1,1,1,1,1,1,1,1),
                 ID2 = c("A","A","A","B","B","B","C","C","C"),
                 ID3 = c("E","E","E","E","E","E","E","E","E"),
                 X1 = c(1,4,3,5,2,4,6,4,2),
                 X2 = c(6,8,9,6,7,8,9,6,7),
                 X3 = c(7,6,7,5,6,5,6,7,5),
                 X4 = c(1,2,1,2,3,1,2,1,2))
df
  ID1 ID2 ID3 X1 X2 X3 X4
1   1   A   E  1  6  7  1
2   1   A   E  4  8  6  2
3   1   A   E  3  9  7  1
4   1   B   E  5  6  5  2
5   1   B   E  2  7  6  3
6   1   B   E  4  8  5  1
7   1   C   E  6  9  6  2
8   1   C   E  4  6  7  1
9   1   C   E  2  7  5  2

I want to use the reshape function to get the following result:

df
   ID1 ID2 ID3 X
1    1   A   E 1
2    1   A   E 4
3    1   A   E 3
4    1   B   E 5
5    1   B   E 2
6    1   B   E 4
7    1   C   E 6
8    1   C   E 4
9    1   C   E 2
10   1   A   E 6
11   1   A   E 8
12   1   A   E 9
13   1   B   E 6
14   1   B   E 7
15   1   B   E 8
16   1   C   E 9
17   1   C   E 6
18   1   C   E 7
19   1   A   E 7
20   1   A   E 6
21   1   A   E 7
22   1   B   E 5
23   1   B   E 6
24   1   B   E 5
25   1   C   E 6
26   1   C   E 7
27   1   C   E 5
28   1   A   E 1
29   1   A   E 2
30   1   A   E 1
31   1   B   E 2
32   1   B   E 3
33   1   B   E 1
34   1   C   E 2
35   1   C   E 1
36   1   C   E 2

Can anyone help? Cheers -- View this message in context: http://r.789695.n4.nabble.com/Help-on-reshape-function-tp4449464p4449464.html Sent from the R help mailing list archive at Nabble.com.
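A self-contained sketch of the thread's base-R reshape() approach, including the return trip to wide format that the question asks about. (Assumption: a bare reshape() call on the long data reverses the operation using the attributes reshape() stores on its result.)

```r
## Wide data frame from the thread, written compactly.
df <- data.frame(ID1 = 1,
                 ID2 = rep(c("A", "B", "C"), each = 3),
                 ID3 = "E",
                 X1 = c(1,4,3,5,2,4,6,4,2), X2 = c(6,8,9,6,7,8,9,6,7),
                 X3 = c(7,6,7,5,6,5,6,7,5), X4 = c(1,2,1,2,3,1,2,1,2))
## Wide -> long: stack X1..X4 into one column named X.
df.long <- reshape(df, varying = list(names(df)[4:7]), v.names = "X",
                   direction = "long")
## Long -> wide: reshape() remembers the operation and reverses it.
df.wide <- reshape(df.long)
```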
Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
That's why I said you need the book; the details are all in the book.

From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 1:49 PM To: Liaw, Andy Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

Thanks Andy. I am reading the locfit document... but not sure how to do the CV and bandwidth selection... Here is a quote about the function regband: it doesn't seem to be usable? Basically I am looking for a locfit that comes with automatic bandwidth selection, so that I am essentially parameter-free for the local-regression step...

- regband: Bandwidth selectors for local regression. Description: Function to compute local regression bandwidths for local linear regression, implemented as a front end to locfit(). This function is included for comparative purposes only. Plug-in selectors are based on flawed logic, make unreasonable and restrictive assumptions and do not use the full power of the estimates available in Locfit. Any relation between the results produced by this function and desirable estimates are entirely coincidental. Usage: regband(formula, what = c("CP", "GCV", "GKK", "RSW"), deg = 1, ...)

2012/2/23 Liaw, Andy andy_l...@merck.com: If that's the kind of framework you'd like to work in, use locfit, which has a predict() method for evaluating new data. There are several different bandwidth selectors in that package for your choosing. Kernel smoothers don't really fit the framework of creating a model object, followed by predicting new data using that fitted model object, because of their local nature. Think of k-nn classification, which has a similar problem: the model needs to be computed for every data point you want to predict. Andy

From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 10:06 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
Thank you Andy! I went thru the KernSmooth package but I don't see a way to use the fitted function to do the predict part...

data <- data.frame(z = z, x = x)
datanew <- data.frame(z = z, x = x)
lmfit <- lm(z ~ x, data = data)
lmforecast <- predict(lmfit, newdata = datanew)

Am I missing anything here? Thanks!

2012/2/23 Liaw, Andy andy_l...@merck.com: In short, pick your poison... Is there any particular reason why the tools that shipped with R itself (e.g., KernSmooth) are inadequate for you? I like using the locfit package because it has many tools, including the ones that the author didn't think were optimal. You may need the book to get the most mileage out of it, though. Andy

From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 12:25 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

I meant it's very slow when I use cv.aic... On Wed, Feb 22, 2012 at 11:24 PM, Michael comtech@gmail.com wrote: Is np an okay package to use? I am worried about the multi-start thing... and also it's very slow... On Wed, Feb 22, 2012 at 8:35 PM, Liaw, Andy andy_l...@merck.com wrote: Bert's question aside (I was going to ask about laundry, but that's much harder than taxes...), my understanding of the situation is that "optimal" is in the eye of the beholder. There were at least two schools of thought on which is the better way of automatically selecting bandwidth: plug-in methods or CV-type. The last I checked, the jury was still out. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Wednesday, February 22, 2012 6:03 PM To: Michael Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
Would you like it to do your taxes for you too? :-) Bert Sent from my iPhone -- please excuse typos. On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote: Hi all, I am looking for a good and modern kernel regression package in R with the following features: 1) it has cross-validation; 2) it can automatically choose the optimal bandwidth; 3) it doesn't have a random element - i.e., if I run the function at different times on the same data set, the results should be exactly the same... I am trying np, but I am seeing: Multistart 1 of 1 | Multistart 1 of 1 | ... It looks like in order to do the optimization, it's doing multiple-random-start optimization... am I right? Could you please
Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
In short, pick your poison... Is there any particular reason why the tools that shipped with R itself (e.g., KernSmooth) are inadequate for you? I like using the locfit package because it has many tools, including the ones that the author didn't think were optimal. You may need the book to get the most mileage out of it, though. Andy From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 12:25 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? I meant it's very slow when I use cv.aic... On Wed, Feb 22, 2012 at 11:24 PM, Michael comtech@gmail.com wrote: Is np an okay package to use? I am worried about the multi-start thing... and also it's very slow... On Wed, Feb 22, 2012 at 8:35 PM, Liaw, Andy andy_l...@merck.com wrote: Bert's question aside (I was going to ask about laundry, but that's much harder than taxes...), my understanding of the situation is that "optimal" is in the eye of the beholder. There were at least two schools of thought on which is the better way of automatically selecting bandwidth: plug-in methods or CV-type. The last I checked, the jury was still out. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Wednesday, February 22, 2012 6:03 PM To: Michael Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? Would you like it to do your taxes for you too? :-) Bert Sent from my iPhone -- please excuse typos. On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote: Hi all, I am looking for a good and modern kernel regression package in R with the following features: 1) it has cross-validation; 2) it can automatically choose the optimal bandwidth; 3) it doesn't have a random element - i.e.,
if I run the function at different times on the same data set, the results should be exactly the same... I am trying np, but I am seeing: Multistart 1 of 1 | Multistart 1 of 1 | ... It looks like in order to do the optimization, it's doing multiple-random-start optimization... am I right? Could you please give me some pointers? I did some Google searching, but there are so many packages that do this... I just wanted to find the best/most modern one to use... Thank you!
Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
If that's the kind of framework you'd like to work in, use locfit, which has a predict() method for evaluating new data. There are several different bandwidth selectors in that package for your choosing. Kernel smoothers don't really fit the framework of creating a model object, followed by predicting new data using that fitted model object, because of their local nature. Think of k-nn classification, which has a similar problem: the model needs to be computed for every data point you want to predict. Andy From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 10:06 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

Thank you Andy! I went thru the KernSmooth package but I don't see a way to use the fitted function to do the predict part...

data <- data.frame(z = z, x = x)
datanew <- data.frame(z = z, x = x)
lmfit <- lm(z ~ x, data = data)
lmforecast <- predict(lmfit, newdata = datanew)

Am I missing anything here? Thanks!

2012/2/23 Liaw, Andy andy_l...@merck.com: In short, pick your poison... Is there any particular reason why the tools that shipped with R itself (e.g., KernSmooth) are inadequate for you? I like using the locfit package because it has many tools, including the ones that the author didn't think were optimal. You may need the book to get the most mileage out of it, though. Andy From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 12:25 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? I meant it's very slow when I use cv.aic... On Wed, Feb 22, 2012 at 11:24 PM, Michael comtech@gmail.com wrote: Is np an okay package to use? I am worried about the multi-start thing... and also it's very slow...
On Wed, Feb 22, 2012 at 8:35 PM, Liaw, Andy andy_l...@merck.com wrote: Bert's question aside (I was going to ask about laundry, but that's much harder than taxes...), my understanding of the situation is that "optimal" is in the eye of the beholder. There were at least two schools of thought on which is the better way of automatically selecting bandwidth: plug-in methods or CV-type. The last I checked, the jury was still out. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Wednesday, February 22, 2012 6:03 PM To: Michael Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? Would you like it to do your taxes for you too? :-) Bert Sent from my iPhone -- please excuse typos. On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote: Hi all, I am looking for a good and modern kernel regression package in R with the following features: 1) it has cross-validation; 2) it can automatically choose the optimal bandwidth; 3) it doesn't have a random element - i.e., if I run the function at different times on the same data set, the results should be exactly the same... I am trying np, but I am seeing: Multistart 1 of 1 | Multistart 1 of 1 | ... It looks like in order to do the optimization, it's doing multiple-random-start optimization... am I right? Could you please give me some pointers? I did some Google searching, but there are so many packages that do this... I just wanted to find the best/most modern one to use... Thank you!
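KernSmooth indeed has no predict() method, as noted in the thread. A hedged sketch of one common workaround (my own, not from the thread): evaluate the local-polynomial fit on its grid, then interpolate at new points with approx().

```r
## Sketch: "predicting" from a KernSmooth fit by interpolating its grid output.
library(KernSmooth)
set.seed(1)
x <- runif(200)
z <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
h   <- dpill(x, z)                   # plug-in bandwidth selector
fit <- locpoly(x, z, bandwidth = h)  # returns $x (grid) and $y (fitted values)
xnew <- c(0.25, 0.5, 0.75)
zhat <- approx(fit$x, fit$y, xout = xnew)$y  # interpolate at the new points
```

This sidesteps the model-object issue Andy describes, at the cost of interpolation error between grid points.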
Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
Bert's question aside (I was going to ask about laundry, but that's much harder than taxes...), my understanding of the situation is that "optimal" is in the eye of the beholder. There were at least two schools of thought on which is the better way of automatically selecting bandwidth: plug-in methods or CV-type. The last I checked, the jury was still out. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Wednesday, February 22, 2012 6:03 PM To: Michael Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? Would you like it to do your taxes for you too? :-) Bert Sent from my iPhone -- please excuse typos. On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote: Hi all, I am looking for a good and modern kernel regression package in R with the following features: 1) it has cross-validation; 2) it can automatically choose the optimal bandwidth; 3) it doesn't have a random element - i.e., if I run the function at different times on the same data set, the results should be exactly the same... I am trying np, but I am seeing: Multistart 1 of 1 | Multistart 1 of 1 | ... It looks like in order to do the optimization, it's doing multiple-random-start optimization... am I right? Could you please give me some pointers? I did some Google searching, but there are so many packages that do this... I just wanted to find the best/most modern one to use... Thank you!
Re: [R] Random Forest Package
You should be able to use the Rgui menu to install packages. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Niratha Sent: Wednesday, February 01, 2012 5:16 AM To: r-help@r-project.org Subject: [R] Random Forest Package Hi, I have installed R version 2.14 on Windows 7 and want to use the randomForest package. I installed Rtools and MiKTeX 2.9, but I am not able to read the DESCRIPTION file, and it is not possible to build the package. When I give this command in Windows: R CMD INSTALL --build randomForest it shows the error: 'R CMD' is not recognized as an internal or external command. Thanks Niratha -- View this message in context: http://r.789695.n4.nabble.com/Random-Forest-Package-tp4347424p4347424.html Sent from the R help mailing list archive at Nabble.com.
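The menu route Andy mentions corresponds to install.packages(), which on Windows fetches the pre-built CRAN binary; Rtools and R CMD INSTALL are only needed when compiling a package from source. A minimal sketch:

```r
## Sketch: install the CRAN binary of randomForest and load it.
## No Rtools, MiKTeX, or command-line build step is required for this.
install.packages("randomForest")
library(randomForest)
```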
Re: [R] randomForest: proximity for new objects using an existing rf
There's an alternative, but it may not be any more efficient in time or memory... You can run predict() on the training set once, setting nodes=TRUE. That will give you an n by ntree matrix recording which node of which tree each data point falls in. For any new data, you would run predict() with nodes=TRUE, then compute the proximity by hand by counting how often any given pair landed in the same terminal node of each tree. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Kilian Sent: Wednesday, February 01, 2012 5:39 AM To: r-help@r-project.org Subject: [R] randomForest: proximity for new objects using an existing rf Dear all, using an existing random forest, I would like to calculate the proximity for a new test object, i.e. the similarity between the new object and the old training objects which were used for building the random forest. I do not want to build a new random forest based on both old and new objects. Currently, my workaround is to calculate the proximities of a combined data set consisting of training and new objects like this:

model <- randomForest(Xtrain, Ytrain)  # build random forest
nnew <- nrow(Xnew)                     # number of new objects
Xcombi <- rbind(Xnew, Xtrain)          # combine new and training objects
predcombi <- predict(model, Xcombi, proximity=TRUE)  # calculate proximities
proxcombi <- predcombi$proximity       # proximities of the combined dataset
proxnew <- proxcombi[(1:nnew), -(1:nnew)]  # proximities of new objects only

But this approach wastes a lot of computation time, as I am not interested in the proximities among the training objects themselves, only between the training objects and the new objects. With 1000 training objects and 5 new objects, I have to calculate a 1005x1005 proximity matrix to get the essential 5x1000 matrix for the new objects. Am I doing something wrong? I read through the documentation but could not find another solution. Any advice would be highly appreciated.
Thanks in advance! Kilian
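A hedged sketch of Andy's nodes=TRUE suggestion (my own code, with made-up data; it assumes, per the help page, that predict() attaches the terminal-node matrix as the "nodes" attribute of its result):

```r
## Sketch: proximity between new and training points computed by hand as the
## fraction of trees in which the pair lands in the same terminal node.
library(randomForest)
set.seed(1)
Xtrain <- matrix(rnorm(200), 100, 2)
Ytrain <- rnorm(100)
Xnew   <- matrix(rnorm(10), 5, 2)
model  <- randomForest(Xtrain, Ytrain)

## n x ntree matrices of terminal-node ids (run once per data set).
nodes.train <- attr(predict(model, Xtrain, nodes = TRUE), "nodes")
nodes.new   <- attr(predict(model, Xnew,   nodes = TRUE), "nodes")

## 5 x 100 proximity matrix: no 105 x 105 detour needed.
prox <- matrix(0, nrow(Xnew), nrow(Xtrain))
for (i in seq_len(nrow(Xnew)))
  prox[i, ] <- colMeans(t(nodes.train) == nodes.new[i, ])
```

The comparison `t(nodes.train) == nodes.new[i, ]` recycles the ntree-long node vector of new point i down each training point's column, so colMeans() gives the per-pair match fraction directly.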
Re: [R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X)
Hi Ista, When you write a package, you have to anticipate what users will throw at the code. I could insist that users only input matrices where none of the column names are empty, but that's not what I wish to impose on users. I could add the name if it's empty, but as a user I wouldn't want a function to do that, either. That's why I need to look for a workaround. Using which() seems rather clumsy for the purpose, as I need to combine those with the non-empty ones, and preserving the ordering would be a mess. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Ista Zahn Sent: Wednesday, February 01, 2012 5:45 AM To: r-help@r-project.org Subject: Re: [R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X) Hi Andy, On Tuesday, January 31, 2012 08:44:13 AM Liaw, Andy wrote: I'm not exactly sure if this is a problem with indexing by name; i.e., is the following behavior by design? The problem is that names or dimnames that are empty seem to be treated differently, and one can't index by them:

R> junk <- 1:3
R> names(junk) <- c("a", "b", "")
R> junk
   a    b
   1    2    3
R> junk[""]
<NA>
  NA
R> junk <- matrix(1:4, 2, 2)
R> colnames(junk) <- c("a", "")
R> junk[, ""]
Error: subscript out of bounds

You can index them by number, e.g., junk[, 2], and you can use which() to find the positions where the colname is empty: junk[, which(colnames(junk) == "")]. I may need to find a workaround... Going back to the original issue with predict, I don't think you need a workaround. I think you need to give your matrix some colnames. Best, Ista -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Czerminski, Ryszard Sent: Wednesday, January 25, 2012 10:39 AM To: r-help@r-project.org Subject: [R] Error in predict.randomForest ...
subscript out of bounds with NULL name in X. RF trains fine with X, but fails on prediction:

library(randomForest)
chirps <- c(20,16.0,19.8,18.4,17.1,15.5,14.7,17.1,15.4,16.2,15,17.2,16,17,14.1)
temp <- c(88.6,71.6,93.3,84.3,80.6,75.2,69.7,82,69.4,83.3,78.6,82.6,80.6,83.5,76.3)
X <- cbind(1, chirps)
rf <- randomForest(X, temp)
yp <- predict(rf, X)
Error in predict.randomForest(rf, X) : subscript out of bounds

BTW: just found out that apparently predict() does not like a NULL name in X, because this works fine:

one <- rep(1, length(chirps))
X <- cbind(one, chirps)
rf <- randomForest(X, temp)
yp <- predict(rf, X)

Ryszard Czerminski AstraZeneca Pharmaceuticals LP 35 Gatehouse Drive Waltham, MA 02451 USA 781-839-4304 ryszard.czermin...@astrazeneca.com
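A caller-side sketch of Ista's "give your matrix some colnames" fix (the replacement names V1, V2, ... are made up here): fill in any empty column names before fitting and predicting, which sidesteps the empty-string indexing problem entirely.

```r
## Sketch: ensure X has no empty column names before predict().
chirps <- c(20,16.0,19.8,18.4,17.1,15.5,14.7,17.1,15.4,16.2,15,17.2,16,17,14.1)
X <- cbind(1, chirps)                       # first column's name is ""
empty <- colnames(X) == ""
colnames(X)[empty] <- paste0("V", which(empty))  # hypothetical filler names
colnames(X)                                 # now "V1" "chirps"
```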
Re: [R] Bivariate Partial Dependence Plots in Random Forests
The reason it's not implemented is computational cost. Some users have done it on their own using the same idea; it simply takes too much memory for even moderately sized data. It can be done much more efficiently in MART because computational shortcuts are used there. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Lucie Bland Sent: Friday, January 27, 2012 5:01 AM To: r-help@r-project.org Subject: [R] Bivariate Partial Dependence Plots in Random Forests Hello, I was wondering if anyone knew of an R function or R code to plot bivariate (3-dimensional) partial dependence plots for random forests (randomForest package). It is apparently possible using the rgl package (http://esapubs.org/archive/ecol/E088/173/appendix-C.htm), or there may be a more direct function such as pairplot() in MART (multiple additive regression trees)? Many thanks, Lucie My computer: HP Z400 Workstation, 16.0 GB, Windows 7 Professional, Intel(R) Xeon(R) CPU W365, 3.20 GHz, 64-bit. My R version: R version 2.14.1 (2011-12-22), 64-bit.
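A hedged sketch of the brute-force computation Andy alludes to (my own code, with made-up data): at each point of a small (x1, x2) grid, substitute the grid values into the training data and average the forest's predictions. This is exactly the memory- and time-hungry approach he warns about, so the grid is kept coarse.

```r
## Sketch: brute-force bivariate partial dependence for a randomForest fit.
library(randomForest)
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d$y <- d$x1 * d$x2 + rnorm(200, sd = 0.1)
rf <- randomForest(y ~ ., data = d)

g1 <- quantile(d$x1, 1:9 / 10)   # coarse grids over the two predictors
g2 <- quantile(d$x2, 1:9 / 10)
pd <- outer(seq_along(g1), seq_along(g2), Vectorize(function(i, j) {
  dd <- d
  dd$x1 <- g1[i]                 # hold (x1, x2) fixed at the grid point,
  dd$x2 <- g2[j]                 # average predictions over the rest
  mean(predict(rf, dd))
}))
persp(g1, g2, pd, xlab = "x1", ylab = "x2", zlab = "partial dependence")
```

The inner loop calls predict() once per grid cell over the whole training set, which is why this scales poorly compared with MART's shortcuts.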
[R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X)
I'm not exactly sure if this is a problem with indexing by name; i.e., is the following behavior by design? The problem is that names or dimnames that are empty seem to be treated differently, and one can't index by them:

R> junk <- 1:3
R> names(junk) <- c("a", "b", "")
R> junk
   a    b
   1    2    3
R> junk[""]
<NA>
  NA
R> junk <- matrix(1:4, 2, 2)
R> colnames(junk) <- c("a", "")
R> junk[, ""]
Error: subscript out of bounds

I may need to find a workaround... -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Czerminski, Ryszard Sent: Wednesday, January 25, 2012 10:39 AM To: r-help@r-project.org Subject: [R] Error in predict.randomForest ... subscript out of bounds with NULL name in X

RF trains fine with X, but fails on prediction:

library(randomForest)
chirps <- c(20,16.0,19.8,18.4,17.1,15.5,14.7,17.1,15.4,16.2,15,17.2,16,17,14.1)
temp <- c(88.6,71.6,93.3,84.3,80.6,75.2,69.7,82,69.4,83.3,78.6,82.6,80.6,83.5,76.3)
X <- cbind(1, chirps)
rf <- randomForest(X, temp)
yp <- predict(rf, X)
Error in predict.randomForest(rf, X) : subscript out of bounds

BTW: just found out that apparently predict() does not like a NULL name in X, because this works fine:

one <- rep(1, length(chirps))
X <- cbind(one, chirps)
rf <- randomForest(X, temp)
yp <- predict(rf, X)

Ryszard Czerminski AstraZeneca Pharmaceuticals LP 35 Gatehouse Drive Waltham, MA 02451 USA 781-839-4304 ryszard.czermin...@astrazeneca.com
Re: [R] Variable selection based on both training and testing data
Variable selection is part of the training process: it chooses the model. By definition, test data is used only for testing (evaluating the chosen model). If you find a package or function that does variable selection on test data, run from it! Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jin Minming Sent: Monday, January 30, 2012 8:14 AM To: r-help@r-project.org Subject: [R] Variable selection based on both training and testing data Dear all, Variable selection in regression is usually determined from the training data using AIC or an F value, for example with stepAIC. Is there an R package that can consider both the training and test datasets? For example, I have separate training and test data. First, a regression model is obtained using the training data; then this model is tested using the test data. This process continues in order to find possible optimal models in terms of RMSE or R2 for both the training and test data. Thanks, Jim
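A sketch of the discipline Andy describes (my own code, with made-up data): run stepAIC on the training data only, then touch the test set exactly once, to score the single chosen model.

```r
## Sketch: selection on training data, one honest evaluation on test data.
library(MASS)  # for stepAIC
set.seed(1)
n <- 200
X <- data.frame(matrix(rnorm(n * 5), n, 5))   # predictors X1..X5
X$y <- X$X1 - 2 * X$X2 + rnorm(n)             # only X1, X2 matter
train <- X[1:150, ]
test  <- X[151:200, ]

full   <- lm(y ~ ., data = train)
chosen <- stepAIC(full, trace = FALSE)        # selection sees training data only
rmse   <- sqrt(mean((test$y - predict(chosen, test))^2))  # single test score
```

Iterating the select-then-test loop until the test RMSE looks good, as the question proposes, silently turns the test set into a second training set.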
Re: [R] What is the function for smoothing splines with the smoothing parameter selected by generalized maximum likelihood?
See the gss package on CRAN. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of ali_protocol Sent: Monday, January 09, 2012 7:13 AM To: r-help@r-project.org Subject: [R] What is the function for smoothing splines with the smoothing parameter selected by generalized maximum likelihood? Dear all, I am new to R, and I am a biotechnologist. I want to fit a smoothing spline with the smoothing parameter selected by generalized maximum likelihood. I was wondering which function implements this and, if possible, how I can find the fitted value at a certain point (or predict from the fitted spline, if that is the correct language). -- View this message in context: http://r.789695.n4.nabble.com/What-is-the-function-for-smoothing-splines-with-the-smoothing-parameter-selected-by-generalized-maxi-tp4278275p4278275.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables
Tree-based models (such as RF) are invariant to monotonic transformations of the predictor (x) variables, because they use only the ranks of the variables, not their actual values. More specifically, they look for splits at the mid-points of unique values. Thus the resulting trees are basically identical regardless of how you transform the x variables. The only, probably minor, difference is that the mid-points can differ between the original and transformed data. While this doesn't affect the training data, it can affect prediction on test data (although the difference should be slight). Transformation of the response variable is quite another thing: RF needs it just as much as other methods do, if the situation calls for it. Cheers, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of gianni lavaredo Sent: Monday, December 05, 2011 1:41 PM To: r-help@r-project.org Subject: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables Dear Researchers, sorry for the easy and common question. I am trying to justify the idea that RandomForest doesn't require transformations (e.g. logarithmic) of variables, comparing this non-parametric method with e.g. linear regression. In the literature on my phenomenon a logarithmic transformation is needed to describe the model, but I found RF doesn't require this approach. Could someone suggest texts or a bibliography to study? Thanks in advance, Gianni [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
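A minimal illustration of the invariance Andy describes, using a single rpart tree (rpart ships with R) rather than a full forest; the same rank-based argument applies to every tree in a random forest:

```r
# Fit the same single tree to x and to log(x): because log() is
# monotone, the ordering of x is unchanged and the fitted partitions
# are identical. rpart stands in for the trees inside a random forest.
library(rpart)
set.seed(42)
x <- runif(200, 1, 100)
y <- sin(x / 15) + rnorm(200, sd = 0.1)
fit_raw <- rpart(y ~ x, data = data.frame(x = x, y = y))
fit_log <- rpart(y ~ x, data = data.frame(x = log(x), y = y))
# Fitted values on the training data agree exactly:
all.equal(predict(fit_raw), predict(fit_log))
```

Only the split *locations* differ (each is the mid-point on its own scale); the resulting partitions of the observations, and hence the fitted values, are the same.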
Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables
You should see no differences beyond what you'd get by running RF a second time with a different random number seed. Best, Andy From: gianni lavaredo [mailto:gianni.lavar...@gmail.com] Sent: Monday, December 05, 2011 2:19 PM To: Liaw, Andy Cc: r-help@r-project.org Subject: Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables About "they only use the ranks of the variables": using leave-one-out, in each iteration the predictor variable ranks change slightly every time RF builds the model, especially for the variables with low importance. Is it correct to attribute this to the random splitting? Thanks in advance, Gianni On Mon, Dec 5, 2011 at 7:59 PM, Liaw, Andy andy_l...@merck.com wrote: Tree-based models (such as RF) are invariant to monotonic transformations of the predictor (x) variables, because they use only the ranks of the variables, not their actual values. More specifically, they look for splits at the mid-points of unique values. Thus the resulting trees are basically identical regardless of how you transform the x variables. The only, probably minor, difference is that the mid-points can differ between the original and transformed data. While this doesn't affect the training data, it can affect prediction on test data (although the difference should be slight). Transformation of the response variable is quite another thing: RF needs it just as much as other methods do, if the situation calls for it. Cheers, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of gianni lavaredo Sent: Monday, December 05, 2011 1:41 PM To: r-help@r-project.org Subject: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables Dear Researchers, sorry for the easy and common question. I am trying to justify the idea that RandomForest doesn't require transformations (e.g. logarithmic) of variables, comparing this non-parametric method with e.g. linear regression. In the literature on my phenomenon a logarithmic transformation is needed to describe the model, but I found RF doesn't require this approach. Could someone suggest texts or a bibliography to study? Thanks in advance, Gianni [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Random Forests in R
The first version of the package was created by re-writing the main program in the original Fortran as C, and calls other Fortran subroutines that were mostly untouched, so dynamic memory allocation can be done. Later versions have most of the Fortran code translated/re-written in C. Currently the only Fortran part is the node splitting in classification trees. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Peter Langfelder Sent: Thursday, December 01, 2011 12:33 AM To: Axel Urbiz Cc: R-help@r-project.org Subject: Re: [R] Random Forests in R On Wed, Nov 30, 2011 at 7:48 PM, Axel Urbiz axel.ur...@gmail.com wrote: I understand the original implementation of Random Forest was done in Fortran code. In the source files of the R implementation there is a note C wrapper for random forests: get input from R and drive the Fortran routines.. I'm far from an expert on this...does that mean that the implementation in R is through calls to C functions only (not Fortran)? So, would knowing C be enough to understand this code, or Fortran is also necessary? I haven't seen the C and Fortran code for Random Forest but I understand the note to say that R code calls some C functions that pre-process (possibly re-format etc) the data, then call the actual Random Forest method that's written in Fortran, then possibly post-process the output and return it to R. It would imply that to understand the actual Random Forest code, you will have to read the Fortran source code. Best, Peter __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
Re: [R] Question about randomForest
Not only that, but on the same help page, in the same Value section, it says: predicted: the predicted values of the input data based on out-of-bag samples -- so people really should read the help pages instead of speculating... If the error rates were not based on OOB samples, they would drop to (near) 0 rather quickly, as each tree is intentionally overfitting its training set. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Weidong Gu Sent: Sunday, November 27, 2011 10:56 AM To: Matthew Francis Cc: r-help@r-project.org Subject: Re: [R] Question about randomForest Matthew, Your interpretation of calculating error rates based on the training data is incorrect. In Andy Liaw's help file: err.rate -- (classification only) vector error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th. My understanding is that the error rate is calculated by throwing the OOB cases (after a few trees, all cases in the original data will have served as OOB for some trees) at all the trees up to the i-th for which they are OOB, and taking the majority vote. The plot of an rf object shows that the OOB error declines quickly once the ensemble becomes sizable -- increasing variation among trees works! (If the rates were based on the training sets, you wouldn't see such a drop, since each tree is overfitting its training set.) Weidong On Sun, Nov 27, 2011 at 3:21 AM, Matthew Francis mattjamesfran...@gmail.com wrote: Thanks for the help. Let me explain in more detail how I think that randomForest works, so that you (or others) can more easily see the error of my ways. The function first takes a random sample of the data, of the size specified by the sampsize argument. With this it fully grows a tree, resulting in a horribly over-fitted classifier for the random sub-set. It then repeats this again with a different sample to generate the next tree, and so on.
Now, my understanding is that after each tree is constructed, a test prediction for the *whole* training data set is made by combining the results of all trees (so e.g. for classification the majority votes of all individual tree predictions). From this an error rate is determined (applicable to the ensemble applied to the training data) and reported in the err.rate member of the returned randomForest object. If you look at the error rate (or plot it using the default plot method) you see that it starts out very high when only 1 or a few over-fitted trees are contributing, but once the forest gets larger the error rate drops since the ensemble is doing its job. It doesn't make sense to me that this error rate is for a sub-set of the data, since the sub-set in question changes at each step (i.e. at each tree construction)? By doing cross-validation test making 'training' and 'test' sets from the data I have, I do find that I get error rates on the test sets comparable to the error rate that is obtained from the prediction member of the returned randomForest object. So that does seem to be the 'correct' error. By my understanding the error reported for the ith tree is that obtained using all trees up to and including the ith tree to make an ensemble prediction. Therefore the final error reported should be the same as that obtained using the predict.randomForest function on the training set, because by my understanding that should return an identical result to that used to generate the error rate for the final tree constructed?? Sorry that is a bit long winded, but I hope someone can point out where I'm going wrong and set me straight. Thanks! On Sun, Nov 27, 2011 at 11:44 AM, Weidong Gu anopheles...@gmail.com wrote: Hi Matthew, The error rate reported by randomForest is the prediction error based on out-of-bag OOB data. 
Therefore, it is different from the prediction error on the original data, since each tree was built from a bootstrap sample (containing roughly 63% of the original cases), and the OOB error rate is likely higher than the prediction error on the original data, as you observed. Weidong On Sat, Nov 26, 2011 at 3:02 PM, Matthew Francis mattjamesfran...@gmail.com wrote: I've been using the R package randomForest, but there is an aspect I cannot work out the meaning of. After calling the randomForest function, the returned object contains an element called predicted, which is the prediction obtained using all the trees (at least that's my understanding). I've checked that this prediction set has the error rate as reported by err.rate. However, if I send the training data back into the predict.randomForest function, I find I get a different result to the stored set of predictions. This is true for both classification and regression. I find the predictions obtained this
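The OOB bookkeeping this thread describes can be sketched by hand with bagged rpart trees (an illustrative stand-in for randomForest, not its actual code): each case is voted on only by the trees for which it was out-of-bag, and the resulting error is the OOB error:

```r
# Hand-rolled bagging with rpart (which ships with R), mirroring how
# randomForest computes err.rate: each case is predicted only by the
# trees for which it was out-of-bag, via a majority vote.
library(rpart)
set.seed(1)
n     <- nrow(iris)
ntree <- 50
votes <- matrix(0, n, 3, dimnames = list(NULL, levels(iris$Species)))
for (b in 1:ntree) {
  inbag <- sample(n, n, replace = TRUE)   # bootstrap sample
  oob   <- setdiff(1:n, inbag)            # cases this tree never saw
  tree  <- rpart(Species ~ ., data = iris[inbag, ], method = "class")
  pred  <- as.character(predict(tree, iris[oob, ], type = "class"))
  for (i in seq_along(oob)) {
    votes[oob[i], pred[i]] <- votes[oob[i], pred[i]] + 1
  }
}
oob_pred <- levels(iris$Species)[max.col(votes)]
oob_err  <- mean(oob_pred != iris$Species)  # analogous to the last OOB entry of err.rate
oob_err
```

This also shows why `predict(rf, trainingdata)` differs from `rf$predicted`: the former lets *every* tree vote on every case, including the trees that trained on it, while the stored predictions use only the OOB votes above.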
Re: [R] tuning random forest. An unexpected result
Gianni, You should not tune ntree in cross-validation or other validation methods, and especially should not be using OOB MSE to do so. 1. At ntree=1, you are using only about 36% of the data to assess the performance of a single random tree. This number can vary wildly. I'd say don't bother looking at an OOB measure of anything with ntree < 30. If you want an exercise in probability, compute the number of trees you need to have the desired probability that all n data points are out-of-bag at least k times, and don't look at any smaller ntree. 2. If you just plot the randomForest object using the generic plot() function, you will see that it gives you the vector of MSEs for ntree=1 up to the maximum. That's why you need not use other methods such as cross-validation. 3. As mentioned in the article you cited, RF is insensitive to ntree, and they settled on ntree=250. Also, as we mentioned in the R News article, too many trees do not degrade prediction performance, only computational cost (which is trivial even for a moderately sized data set). 4. It is not wise to optimize the parameters of a model like that. When all of the MSE estimates are within a few percent of each other, you're likely just chasing noise in the evaluation process. Just my $0.02... Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of gianni lavaredo Sent: Thursday, November 17, 2011 6:29 PM To: r-help@r-project.org Subject: [R] tuning random forest. An unexpected result Dear Researchers, I am using RF (in regression mode) to analyze several metrics extracted from images. I am tuning RF in a loop over different ranges of mtry, ntree and nodesize, choosing the settings with the lowest OOB MSE: mtry from 1 to 5, nodesize from 1 to 10, ntree from 1 to 500, using this paper as a reference: Palmer, D. S., O'Boyle, N. M., Glen, R. C., Mitchell, J. B. O. (2007). Random Forest Models To Predict Aqueous Solubility. Journal of Chemical Information and Modeling, 47, 150-158.
My problem is the following, using data(airquality). The tuning parameters with the lowest value are: print(result.mtry.df[result.mtry.df$RMSE == min(result.mtry.df$RMSE),]) RMSE = 15.44751, MSE = 238.6257, mtry = 3, nodesize = 5, tree = 35. The number of trees is very low, different from what I read in several publications, and the second-lowest set of tuning parameters has tree = 1: print(head(result.mtry.df[with(result.mtry.df, order(MSE)), ]))

          RMSE      MSE mtry nodesize tree
12035 15.44751 238.6257    3        5   35
18001 15.44861 238.6595    4        7    1
 7018 16.02354 256.7539    2        5   18
20031 16.02536 256.8121    5        1   31
11037 16.02862 256.9165    3        3   37
11612 16.05162 257.6544    3        4  112

I am wondering whether I am wrong in the settings, or whether there are some aspects I don't consider. Thanks for your attention and thanks in advance for suggestions and help. Gianni [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
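Andy's 36% figure can be checked directly: the expected out-of-bag fraction of one bootstrap sample is (1 - 1/n)^n, which approaches exp(-1) ≈ 0.368, so at ntree = 1 the "test set" is small and its composition varies wildly. A quick base-R simulation (the sizes below are arbitrary illustrative choices):

```r
# Expected out-of-bag fraction of a single bootstrap sample: about
# exp(-1) ~ 0.368, i.e. "about 36% of the data" at ntree = 1.
set.seed(123)
n <- 1000
oob_frac <- replicate(2000, {
  inbag <- sample(n, n, replace = TRUE)
  1 - length(unique(inbag)) / n
})
mean(oob_frac)  # close to exp(-1)
```
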
Re: [R] gsDesign
Hi Dongli, Questions about usage of specific contributed packages are best directed toward the package maintainer/author first, as they are likely the best sources of information, and they don't necessarily subscribe to or keep up with the daily deluge of R-help messages. (In this particular case, I'm quite sure the package maintainer for gsDesign doesn't keep up with R-help.) Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Dongli Zhou Sent: Monday, November 14, 2011 6:13 PM To: Marc Schwartz Cc: r-help@r-project.org Subject: Re: [R] gsDesign Hi, Marc, Thank you very much for the reply. I'm using the gsDesign function to create an object of type gsDesign, but its inputs do not include the 'ratio' argument. Dongli On Nov 14, 2011, at 5:50 PM, Marc Schwartz marc_schwa...@me.com wrote: On Nov 14, 2011, at 4:11 PM, Dongli Zhou wrote: I'm trying to use gsDesign for a noninferiority trial with a binary endpoint. Does anyone know how to specify the trial with different sample sizes for the two treatment groups? Thanks in advance! Hi, Presuming that you are using the nBinomial() function, see the 'ratio' argument, which defines the desired sample size ratio between the two groups. See ?nBinomial and the examples there, which include one using the 'ratio' argument. HTH, Marc Schwartz __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] randomForest - NaN in %IncMSE
You are not giving anyone much to go on. Please read the posting guide and see how to ask your question in a way that's easier for others to answer. At the _very_ least, show what commands you used, what your data look like, etc. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Katharine Miller Sent: Tuesday, September 20, 2011 1:43 PM To: r-help@r-project.org Subject: [R] randomForest - NaN in %IncMSE Hi, I am having a problem using varImpPlot in randomForest. I get the error message: Error in plot.window(xlim = xlim, ylim = ylim, log = "") : need finite 'xlim' values When I print $importance, several variables have NaN under %IncMSE. There are no NaNs in the original data. Can someone help me figure out what is happening here? Thanks! [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] class weights with Random Forest
The current classwt option in the randomForest package has been there since the beginning, and is different from how the official Fortran code (version 4 and later) implements class weights. It simply accounts for the class weights in the Gini index calculation when splitting nodes, exactly as a single CART tree does when given class weights. Prof. Breiman came up with the newer class weighting scheme, implemented in the newer version of his Fortran code, after we found that simply using the weights in the Gini index didn't seem to help much in extremely unbalanced data (say 1:100 or worse). If using the weighted Gini helps in your situation, by all means do it. I can only say that in the past it didn't give us the results we were expecting. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of James Long Sent: Tuesday, September 13, 2011 2:10 AM To: r-help@r-project.org Subject: [R] class weights with Random Forest Hi All, I am looking for a reference that explains how the randomForest function in the randomForest package uses the classwt parameter. Here: http://tolstoy.newcastle.edu.au/R/e4/help/08/05/12088.html Andy Liaw suggests not using classwt, and according to: http://r.789695.n4.nabble.com/R-help-with-RandomForest-classwt-option-td817149.html it had not been implemented as of 2007. However, it improved classification performance for a problem I am working on, more than adjusting the sampsize parameter. So I'm wondering if it has been implemented recently (since 2007), or if there is a detailed explanation of what this unimplemented version is doing. Thanks! James [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
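The weighted-Gini idea Andy describes can be shown in miniature; `gini()` below is an illustrative helper written for this sketch, not randomForest's internal code:

```r
# Class weights entering the Gini node impurity: the minority class is
# made to "count for more" when splits are scored.
gini <- function(counts, classwt = rep(1, length(counts))) {
  w <- counts * classwt   # weighted class frequencies in the node
  p <- w / sum(w)
  1 - sum(p^2)
}
gini(c(95, 5))                         # unweighted: node looks nearly pure
gini(c(95, 5), classwt = c(1, 19))     # rare class upweighted: far less pure
```

With the 1:19 weights the 95/5 node scores as impure as a balanced one, so the splitter is pushed to separate out the rare class rather than ignore it.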
Re: [R] randomForest memory footprint
It looks like you are building a regression model. With such a large number of rows, you should try to limit the size of the trees by setting nodesize to something larger than the default (5). The issue, I suspect, is the fact that the largest possible tree has about 2*n/nodesize nodes (n being the number of rows), and each node takes a row in a matrix to store. Multiply that by the number of trees you are trying to build, and you see how the memory can be gobbled up quickly. Boosted trees don't usually run into this problem because one usually boosts very small trees (usually no more than 10 terminal nodes per tree). Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of John Foreman Sent: Wednesday, September 07, 2011 2:46 PM To: r-help@r-project.org Subject: [R] randomForest memory footprint Hello, I am attempting to train a random forest model using the randomForest package on 500,000 rows and 8 columns (7 predictors, 1 response). The data set is the first block of data from the UCI Machine Learning Repo dataset "Record Linkage Comparison Patterns", with the slight modification that I dropped two columns with lots of NAs and used knn imputation to fill in other gaps. When I load my dataset, R uses no more than 100 MB of RAM. I'm running 64-bit R with ~4 GB of RAM available. When I execute the randomForest() function, however, I get memory complaints. Example: summary(mydata1.clean[,3:10]) cmp_fname_c1 cmp_lname_c1 cmp_sex cmp_bd cmp_bm cmp_by cmp_plz is_match Min. :0. Min. :0. Min. :0. Min. :0. Min. :0. Min. :0. Min. :0.0 FALSE:572820 1st Qu.:0.2857 1st Qu.:0.1000 1st Qu.:1. 1st Qu.:0. 1st Qu.:0. 1st Qu.:0. 1st Qu.:0.0 TRUE : 2093 Median :1. Median :0.1818 Median :1. Median :0. Median :0. Median :0. Median :0.0 Mean :0.7127 Mean :0.3156 Mean :0.9551 Mean :0.2247 Mean :0.4886 Mean :0.2226 Mean :0.00549 3rd Qu.:1. 3rd Qu.:0.4286 3rd Qu.:1. 3rd Qu.:0. 3rd Qu.:1. 3rd Qu.:0. 3rd Qu.:0.0 Max. :1. Max. :1. Max. :1. Max. :1. Max. :1. Max. :1. Max. :1.0

mydata1.rf.model2 <- randomForest(x = mydata1.clean[, 3:9], y = mydata1.clean[, 10], ntree = 100)
Error: cannot allocate vector of size 877.2 Mb
In addition: Warning messages:
1: In dim(data) <- dim : Reached total allocation of 3992Mb: see help(memory.size)
2: In dim(data) <- dim : Reached total allocation of 3992Mb: see help(memory.size)
3: In dim(data) <- dim : Reached total allocation of 3992Mb: see help(memory.size)
4: In dim(data) <- dim : Reached total allocation of 3992Mb: see help(memory.size)

Other techniques such as boosted trees handle the data size just fine. Are there any parameters I can adjust such that I can use a value of 100 or more for ntree? Thanks, John __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
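A back-of-envelope estimate along the lines of Andy's explanation; the ~6-doubles-per-node figure is an assumption for illustration, not the package's exact internal layout:

```r
# Rough memory arithmetic for the forest in this thread.
n        <- 500000
nodesize <- 5
ntree    <- 100
nodes    <- 2 * n / nodesize        # largest possible tree: ~200,000 nodes
bytes    <- nodes * ntree * 6 * 8   # nodes x trees x (assumed) doubles x 8 bytes
bytes / 2^20                        # on the order of the failed 877.2 Mb allocation
# Setting nodesize = 100 shrinks this estimate by a factor of 20.
```
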
Re: [R] randomForest partial dependence plot variable names
See if the following is close to what you're looking for. If not, please give more detail on what you want to do.

data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., airquality, importance = TRUE)
imp <- importance(ozone.rf)  # get the importance measures
impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]  # get the sorted names
op <- par(mfrow = c(2, 3))
for (i in seq_along(impvar)) {
    partialPlot(ozone.rf, airquality, impvar[i], xlab = impvar[i],
                main = paste("Partial Dependence on", impvar[i]),
                ylim = c(30, 70))
}
par(op)

Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Katharine Miller Sent: Thursday, August 04, 2011 4:38 PM To: r-help@r-project.org Subject: [R] randomForest partial dependence plot variable names Hello, I am running randomForest models on a number of species. I would like to be able to automate the printing of dependence plots for the most important variables in each model, but I am unable to figure out how to enter the variable names into my code. I had originally thought to extract them from the $importance matrix after sorting by metric (e.g. %IncMSE), but the importance matrix is n by 2, containing only the data for each metric (%IncMSE and IncNodePurity). It is clearly linked to the variable names, but I am unsure how to extract those names for use in scripting. Any assistance would be greatly appreciated, as I am currently typing the variable names into each partialPlot call for every model I run... and that is taking a LONG time. Thanks! [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] convert a splus randomforest object to R
You really need to follow the suggestions in the posting guide to get the best help from this list. Which versions of randomForest are you using in S-PLUS and R? Which version of R are you using? When you restore the object into R, what does str(object) say? Have you also tried dump()/source() as the R Data Import/Export manual suggests? Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Zhiming Ni Sent: Tuesday, August 02, 2011 8:11 PM To: r-help@r-project.org Subject: [R] convert a splus randomforest object to R Hi, I have a randomForest object cost.rf that was created in S-PLUS 8.0, and now I need to use this trained RF model in R. So in S-PLUS I dump the RF object as below: data.dump(cost.rf, file = "cost.rf.txt", oldStyle = T) Then in R, I restore the dumped file: library(foreign) data.restore("cost.rf.txt") It works fine and is able to restore the cost.rf object. But when I try to pass new data through this randomForest object using the predict() function, it gives me an error message. In R: library(randomForest) set.seed(2211) pred <- predict(cost.rf, InputData[ , ]) Error in object$forest$cutoff : $ operator is invalid for atomic vectors It looks like, after restoring the dump file, the object is not compatible with R. Has anyone successfully converted an S-PLUS randomForest object to R? What would be the appropriate method to do this? Thanks in advance. Jimmy == This communication contains information that is confidential, and solely for the use of the intended recipient. It may contain information that is privileged and exempt from disclosure under applicable law. If you are not the intended recipient of this communication, please be advised that any disclosure, copying, distribution or use of this communication is strictly prohibited. Please also immediately notify SCAN Health Plan at 1-800-247-5091, x5263 and return the communication to the originating address. Thank You.
== [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] squared pie chart - is there such a thing?
Has anyone suggested mosaic displays? That's the closest I can think of to a square pie chart... -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Naomi Robbins Sent: Sunday, July 24, 2011 7:09 AM To: Thomas Levine Cc: r-help@r-project.org Subject: Re: [R] squared pie chart - is there such a thing? I don't usually use stacked bar charts, since it is difficult to compare lengths that don't have a common baseline. Naomi On 7/23/2011 11:14 PM, Thomas Levine wrote: How about just a stacked bar plot? barplot(matrix(c(3,5,3), 3, 1), horiz = TRUE, beside = FALSE) Tom On Fri, Jul 22, 2011 at 7:14 AM, Naomi Robbins nbrgra...@optonline.net wrote: Hello! It's a shot in the dark, but I'll try. If one has a total of 100 (e.g., %) and three components of the total, e.g., mytotal = data.frame(x = 50, y = 30, z = 20), one could build a pie chart with 3 sectors representing x, y, and z according to their proportions in the total. I am wondering if it's possible to build something very similar, but on a square rather than a circle, such that the total area of the square is the sum of the components and the components (x, y, and z) are represented within the square as shapes with right angles (squares, rectangles, L-shapes, etc.). I realize there are many possible positions and shapes, even for 3 components, but I don't really care where the components are located within the square, as long as they are there. Is there a package that can do something like that? Thanks a lot! - I included waffle charts in Creating More Effective Graphs. The reaction was very negative; many readers let me know that they didn't like them. To create them I just drew a table in Word with 10 rows and 10 columns, then shaded the backgrounds of cells; for your example we would shade 50 cells one color, 30 another, and 20 a third color. Naomi - Naomi B.
Robbins
NBR, 11 Christine Court, Wayne, NJ 07470
Phone: (973) 694-6009, na...@nbr-graphs.com
http://www.nbr-graphs.com -- Follow me at http://www.twitter.com/nbrgraphs
Author of "Creating More Effective Graphs" (http://www.nbr-graphs.com/bookframe.html)
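Naomi's Word-table recipe can also be reproduced directly in base R. A minimal sketch for the 50/30/20 example (the grid layout and colors here are my own choices, not from the thread): fill a 10 x 10 grid, one cell per percentage point.

```r
## Waffle chart: a 10 x 10 grid, one shaded cell per percentage point.
vals <- c(x = 50, y = 30, z = 20)
stopifnot(sum(vals) == 100)
cells <- matrix(rep(seq_along(vals), vals), nrow = 10)  # 10 x 10, filled by column
image(1:10, 1:10, cells, col = c("steelblue", "orange", "grey70"),
      axes = FALSE, xlab = "", ylab = "", asp = 1)
```

The `cells` matrix simply repeats each component's index `vals[i]` times, so the shaded areas are exact, unlike pie-slice angles read by eye.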
Re: [R] *not* using attach() *but* in one case ....
From: Prof Brian Ripley

Hmm, load() does have an 'envir' argument. So you could simply use that and with() (which is pretty much what attach() does internally). If people really wanted a lazy approach, with() could be extended to allow file names (as attach() does).

[Andy Liaw:] I'm not sure laziness like this should be encouraged. If I may bring up another black hole: IMHO the formula interface allows so much flexibility (perhaps to allow some laziness?) that beginners, and even non-beginners, fall into its various traps a bit too often, sometimes without even being aware of it. It would be great if there were a way to (optionally?) limit the scope of where a formula looks for variables. Just my $0.02...

Andy

On Thu, 19 May 2011, Martin Maechler wrote:

[modified 'Subject' on purpose; good mail readers will still thread correctly, using the 'References' and 'In-Reply-To' headers -- however, in my limited experience, good mail readers seem to disappear more and more...]

Peter Ehlers <ehl...@ucalgary.ca> on Tue, 17 May 2011 06:08:30 -0700 writes:

On 2011-05-17 02:22, Timothy Bates wrote:

Dear Bryony: the suggestion was not to change the name of the data object, but to explicitly tell glm.nb what dataset it should look in to find the variables you mention in the formula. So the salient difference is:

m1 <- glm.nb(Cells ~ Cryogel*Day, data = side)

instead of

attach(side)
m1 <- glm.nb(Cells ~ Cryogel*Day)

This works for other functions also, but not uniformly as yet. How I wish it did, so that I could say hist(x, data=side) instead of hist(side$x); this inconsistency encourages the need for attach().

[Peter Ehlers:] Only if the user hasn't yet been introduced to the with() function, which is linked to on the ?attach page. Note also this sentence from the ?attach page: "attach can lead to confusion." I can't remember the last time I needed attach().

Peter Ehlers

[Martin Maechler:] Well, then you don't know *THE ONE* case where modern users of R should use attach() ...
as I have been teaching for a while, but seem not to have got enough students listening ;-) ...

--- Use it instead of load() {for save()d R objects} ---

The advantage of attach() over load() there is that the loaded objects (and there may be a bunch!) are put into a separate place in the search path and will not accidentally overwrite objects in the global workspace. Of course, there are still quite a few situations {e.g. in typical BATCH use of R for simulations, or Sweaving, etc.} where load() is good enough and the extras of using attach() are not worth it. But the unconditional "do not use attach()" is not quite right, at least not when you talk to non-beginners.

Martin Maechler, ETH Zurich

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA); Fax: +44 1865 272595
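Martin's recommendation and Brian Ripley's load()+with() alternative can both be made concrete. A small sketch (file name and values invented here):

```r
## attach() a saved workspace instead of load()ing it: the saved objects go
## onto the search path at position 2 and cannot clobber the global workspace.
f <- file.path(tempdir(), "mydata.RData")
x <- 1:5
save(x, file = f)

x <- "precious global value"   # something load(f) would silently overwrite
attach(f)                      # saved objects appear at search position 2
print(x)                       # global x untouched: "precious global value"
print(get("x", pos = 2))       # the saved x: 1 2 3 4 5
detach(2)

## The load(envir=) + with() equivalent, without touching the search path:
e <- new.env()
load(f, envir = e)
with(e, mean(x))               # 3
```

Either way, the saved `x` never collides with the global `x`, which is the whole point of avoiding a bare load().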
Re: [R] Rotation Forest in R
I don't have access to that article, but just from reading the abstract, it should be quite easy to do by writing a wrapper function that calls randomForest(); I've done so with random projections before. One limitation of methods like these is that they only apply to all-numeric data.

Andy

-----Original Message-----
From: Mario Beolco
Sent: Thursday, April 07, 2011 7:55 PM
To: r-help@r-project.org
Subject: [R] Rotation Forest in R

Dear R users,

I was wondering whether you could tell me if there are any R functions or packages that implement the Rotation Forest (not Random Forests) algorithm: http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2006.211

Thanks in advance,
Mario
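Andy's wrapper idea might look like the sketch below. Note the function name is mine, and this applies a single random rotation to all features (closer to his random-projection variant) rather than the per-group PCA rotations of the actual Rotation Forest paper:

```r
## Hypothetical wrapper: fit a forest on a randomly rotated copy of an
## all-numeric feature matrix (a random projection, not true Rotation Forest).
rotatedRF <- function(x, y, ...) {
  x <- as.matrix(x)
  stopifnot(is.numeric(x))                     # rotation requires numeric data
  p <- ncol(x)
  rot <- qr.Q(qr(matrix(rnorm(p * p), p, p)))  # random orthogonal p x p matrix
  list(fit = randomForest::randomForest(x %*% rot, y, ...), rot = rot)
}
```

New data would have to be multiplied by the stored `rot` before calling predict() on the inner fit.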
Re: [R] Difference in mixture normals and one density
Is something like this what you're looking for?

R> library(nor1mix)
R> nmix2 <- norMix(c(2, 3), sig2 = c(25, 4), w = c(.2, .8))
R> dnorMix(1, nmix2) - dnorm(1, 2, 5)
[1] 0.03422146

Andy

-----Original Message-----
From: Jim Silverton
Sent: Monday, April 04, 2011 10:01 AM
To: r-help@r-project.org
Subject: Re: [R] Difference in mixture normals and one density

Hello,

I am trying to find out if R can do the following: I have a mixture of normals, say f = 0.2*Normal(2, 5) + 0.8*Normal(3, 2). How do I find the difference between the density of f and the density of Normal(2, 5) at any particular point?

--
Thanks,
Jim.
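The same number falls out of the mixture definition directly, without the nor1mix package (taking the second parameter as a standard deviation, as Andy's dnorm() call does):

```r
## f(x) = 0.2 * N(mean 2, sd 5) + 0.8 * N(mean 3, sd 2)
dmix <- function(x) 0.2 * dnorm(x, 2, 5) + 0.8 * dnorm(x, 3, 2)
dmix(1) - dnorm(1, 2, 5)   # [1] 0.03422146, matching the nor1mix answer
```

Since the 0.2*dnorm(x, 2, 5) terms cancel, this difference is just 0.8 * (dnorm(x, 3, 2) - dnorm(x, 2, 5)).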
Re: [R] ok to use glht() when interaction is NOT significant?
Just to add my ever-depreciating $0.02 USD:

Keep in mind that the significance-testing paradigm puts a constraint on the false positive rate and lets the false negative rate float. What you should consider is whether that makes sense in your situation. All too often this is not carefully considered, and sometimes people will do not-very-kosher things to compensate for the conservatism of significance testing.

If you want to stay with the formality of protected tests, you should first check the overall F-test of the entire model and make sure that's significant before you look at the individual terms in the model. It's not sufficient for A1 and A2 to be significantly different at B2 and not at B1 to say that there's a significant interaction; rather, the difference between A1 and A2 at B1 has to be significantly different from that at B2. That's the definition of the interaction in the 2x2 case.

If you have a priori interest in the comparison of A1 vs. A2 at B2, then you can test it as a pre-planned contrast and not worry too much about protection or multiplicity.

HTH,
Andy

-----Original Message-----
From: array chip
Sent: Tuesday, March 08, 2011 1:31 AM
To: Bert Gunter
Cc: r-h...@stat.math.ethz.ch
Subject: Re: [R] ok to use glht() when interaction is NOT significant?

Hi Bert, thank you for your thoughtful and humorous comments :-)

It is scientifically meaningful to do those comparisons, and the results of these comparisons actually make sense for our hypothesis, i.e. one is significant at the B2 level while the other is not at the B1 level. Unfortunately, the overall F test for the interaction is not significant. I understand that formally one should not do these post-hoc comparisons under a non-significant interaction term. But should I really stop comparing in this situation, especially when these comparisons conform to our hypothesis?
I am encouraged to see that you said "For exploratory purposes, such post hoc comparisons might lead to great science." However, my concern is that these results may not get past reviewers when sent out for publication.

BTW, I am a non-US reader, so I did google "never inhaled" :-)

John

From: Bert Gunter <gunter.ber...@gene.com>
Cc: r-h...@stat.math.ethz.ch
Sent: Mon, March 7, 2011 9:20:11 PM
Subject: Re: [R] ok to use glht() when interaction is NOT significant?

Inline below.

[John:] Hi, let's say I have a simple ANOVA model with 2 factors, A (levels A1 and A2) and B (levels B1 and B2), and their interaction:

aov(y ~ A*B, data = dat)

It turns out that the interaction term is not significant (e.g. P value = 0.2), but when I used glht() to compare A1 vs. A2 within each level of B, I found that the comparison is not significant when B=B1 but is very significant (P < 0.01) when B=B2. My question is whether it's legal to do this post-hoc comparison when the interaction is NOT significant. Can I still claim that there is a significant difference between A1 and A2 when B=B2? (I am serious here.)

[Bert:] Don't know what "legal" means. Why do you want to make the claim? When does it **ever** mean anything scientifically meaningful to make it? What is the **scientific** question of interest? Are the data unbalanced? Have you plotted the data to tell you what's going on?

Warning: I come from the school (maybe I'm the only student...) that believes all such formal post hoc comparisons are pointless, silly wastes of effort. Note the word "formal" -- that is, pretending the P values mean anything. For exploratory purposes, which can certainly include producing P values as well as graphs, such post hoc comparisons might lead to great science. It's the formal part that I reject and that you seem to be hung up on.

Note also: if you're a Bayesian and can put priors on everything, you can spit out posteriors and Bayes factors to your heart's content. Really! -- no need to sweat multiplicity, even.
Of course, I speak here only as an observer, having never actually inhaled myself.*

Cheers,
Bert

*Apologies to all non-US and younger readers. This is a smart-aleck reference to an infamous dumb remark by a recent famous, smart former U.S. president. Google "never inhaled" for details.

Thanks,
John

--
Bert Gunter
Genentech Nonclinical Biostatistics
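Andy's point about the 2x2 case can be seen in a small simulation (the data and effect size here are invented): the interaction F-test asks whether the A1-vs-A2 difference at B1 differs from that at B2, not whether one of the two within-level comparisons happens to cross P < 0.05.

```r
set.seed(1)
dat <- expand.grid(A = c("A1", "A2"), B = c("B1", "B2"), rep = 1:10)
dat$y <- rnorm(nrow(dat)) + ifelse(dat$A == "A2" & dat$B == "B2", 0.8, 0)

fit <- aov(y ~ A * B, data = dat)
summary(fit)   # look at the A:B line (the overall interaction F-test) first
```

Following the protected-test logic, only if the A:B line is significant would one go on to the within-level glht()-style comparisons.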
Re: [R] Coefficient of Determination for nonlinear function
As far as I can tell, Uwe is not even fitting a model, but just solving a nonlinear equation, so I don't know why he wants an R^2. I don't see a statistical model here, so I don't see why one would want a statistical measure.

Andy

-----Original Message-----
From: Bert Gunter
Sent: Friday, March 04, 2011 11:21 AM
To: uwe.wolf...@uni-ulm.de; r-help@r-project.org
Subject: Re: [R] Coefficient of Determination for nonlinear function

The coefficient of determination, R^2, is a measure of how well your model fits versus a NULL model, which is that the data are constant. In nonlinear models, as opposed to linear models, such a null model rarely makes sense; therefore the coefficient of determination is generally not meaningful in nonlinear modeling. Yet another way in which linear and nonlinear models fundamentally differ.

-- Bert

On Fri, Mar 4, 2011 at 5:40 AM, Uwe Wolfram <uwe.wolf...@uni-ulm.de> wrote:

Dear Subscribers,

I fit an equation of the form 1 = f(x1, x2, x3) using a minimization scheme. Now I want to compute the coefficient of determination. Normally I would compute it as

r_square = 1 - sserr/sstot

with sserr = sum_i (y_i - f_i)^2 and sstot = sum_i (y_i - mean(y))^2.

sserr is clear to me, but how can I compute sstot when there is no such thing as differing y_i? These are all one, thus mean(y) = 1, and therefore sstot is 0.

Thank you very much for your efforts,
Uwe

--
Uwe Wolfram, Dipl.-Ing. (Ph.D. student)
Institute of Orthopaedic Research and Biomechanics
(Director and Chair: Prof. Dr. Anita Ignatius)
Center of Musculoskeletal Research, Ulm University Hospital, Helmholtzstr.
14, 89081 Ulm, Germany
Phone: +49 731 500-55301, Fax: +49 731 500-55302
http://www.biomechanics.de
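Uwe's formula written as code (function name mine) makes the degenerate case explicit:

```r
## R^2 = 1 - SSerr/SStot. With all y_i identical, SStot = 0 and the ratio
## degenerates (0/0 -> NaN), which is exactly Uwe's problem: there is no
## variation in y for the model to explain.
r_squared <- function(y, f) 1 - sum((y - f)^2) / sum((y - mean(y))^2)

r_squared(c(1, 2, 3), c(1.1, 1.9, 3.2))   # a sensible value near 1
r_squared(rep(1, 3), rep(1, 3))           # NaN: the null model is undefined
```

This is the computational face of Bert's point: R^2 compares against the constant null model, and when the response is constant by construction, that comparison is vacuous.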
Re: [R] lm - log(variable) - skip log(0)
You need to use == instead of = for testing equality. While you're at it, you should check for positive values, not just screen out 0s. This works for me:

R> mydata <- data.frame(x = 0:10, y = runif(11))
R> fm <- lm(y ~ log(x), mydata, subset = x > 0)

Andy

-----Original Message-----
From: agent dunham
Sent: Friday, February 25, 2011 6:24 AM
To: r-help@r-project.org
Subject: [R] lm - log(variable) - skip log(0)

I want to do an lm regression; some of the variables are going to be log-transformed, and I would like to leave out the values which imply taking log(0) for just one variable. I have done the following, but it doesn't work:

lmod1.lm <- lm(log(dat$inaltu) ~ log(dat$indiam), subset = (!(dat$indiam %in% c(0,1))))

and obtain: Error en lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases

lmod1.lm <- lm(log(dat$inaltu) ~ log(dat$indiam), subset = (!(dat$indiam = 0)), na.action = na.exclude)

and obtain: Error en lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf en llamada a una función externa (arg 1)
[the Spanish reads: NA/NaN/Inf in a call to an external function (arg 1)]

Thanks, u...@host.com
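A quick check of Andy's subset approach on invented data (column names borrowed from the question):

```r
set.seed(42)
dat <- data.frame(indiam = 0:10, inaltu = exp(rnorm(11)))

## subset = indiam > 0 drops the row where log(indiam) would be -Inf
fit <- lm(log(inaltu) ~ log(indiam), data = dat, subset = indiam > 0)
length(fitted(fit))   # 10: the indiam == 0 row was excluded
```

Passing `data = dat` and referring to bare column names in both the formula and the subset expression avoids the `dat$` repetition in the original attempt.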
Re: [R] Random Forest Cross Validation
Exactly as Max said. See the rfcv() function in the latest version of randomForest, as well as the reference in the help page for that function.

The OOB estimate is as accurate as a CV estimate _if_ you run straight RF; most other methods do not have this feature. However, if you start adding steps such as feature selection, all bets are off.

Andy

-----Original Message-----
From: mxkuhn
Sent: Tuesday, February 22, 2011 7:17 PM
To: ronzhao
Cc: r-help@r-project.org
Subject: Re: [R] Random Forest Cross Validation

If you want honest estimates of accuracy, you should repeat the feature selection within the resampling (not on the test set). You will get different lists each time, but that's the point: right now you are not capturing that uncertainty, which is why the OOB and test-set results differ so much. The list you get on the original training set is still the "real" list; the resampling results help you understand how much you might be overfitting the *variables*.

Max

On Feb 22, 2011, at 4:39 PM, ronzhao <yzhaoh...@gmail.com> wrote:

Thanks, Max. Yes, I did some feature selection in the training set. Basically, I selected the top 1000 SNPs based on OOB error and grew the forest using the training set, then used the test set to validate the forest grown. But if I do the same thing in the test set, the top SNPs would be different from those in the training set. That may be difficult to interpret.
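The rfcv() call Andy points to might look roughly like this (the toy data are invented; rfcv() by default halves the number of variables at each step and repeats the ranking inside each fold, which is exactly the "selection inside the resampling" Max describes):

```r
library(randomForest)

set.seed(1)
x <- data.frame(matrix(rnorm(100 * 20), 100))        # toy stand-in for SNP data
y <- factor(sample(c("case", "control"), 100, TRUE))

## 5-fold CV with variable selection nested inside each fold
cv <- rfcv(x, y, cv.fold = 5)
cv$error.cv   # CV error rate at each (decreasing) number of variables
```

With pure-noise predictors like these, the error should hover around 0.5 at every step, which is itself a useful sanity check against selection bias.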
Re: [R] tri-cube and gaussian weights in loess
locfit() in the locfit package is a slightly more modern implementation of local regression, and is much more flexible in that it has a lot of options to tweak. One such option is the kernel; there are seven to choose from.

Andy

From: wisdomtooth

From what I understand, loess in R uses the standard tri-cube weight function. SAS/INSIGHT offers loess with Gaussian weights. Is there a function in R that does the same? Also, can anyone offer references comparing the properties of tri-cube and Gaussian weights in LOESS? Thanks.

- André
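In locfit the weight function is selected with the kern argument; "tcub" (tricube) is the default, and "gauss" gives the Gaussian weights the poster wants. A minimal sketch on invented data:

```r
library(locfit)

x <- seq(0, 10, length.out = 101)
y <- sin(x) + rnorm(101, sd = 0.2)

fit.tcub  <- locfit(y ~ lp(x), kern = "tcub")   # loess-style tricube weights
fit.gauss <- locfit(y ~ lp(x), kern = "gauss")  # Gaussian weights
```

The two fits can then be compared with plot() or predict() on a common grid.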
Re: [R] How to measure/rank variable importance when using rpart?
Check out caret::varImp.rpart(). The measure it computes is described in the original CART book.

Andy

From: Tal Galili

Hello all,

When building a CART model (specifically a classification tree) using rpart, it is sometimes interesting to know the importance of the various variables introduced to the model. Thus, my question is: *What common measures exist for ranking/measuring the importance of the variables participating in a CART model, and how can this be computed using R (for example, when using the rpart package)?*

Here is some dummy code, created so you might show your solutions on it. The example is structured so that it is clear that variables x1 and x2 are important, while (in some sense) x1 is more important than x2 (since x1 should apply to more cases, and thus have more influence on the structure of the data, than x2):

set.seed(31431)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rnorm(n)
X  <- data.frame(x1, x2, x3, x4, x5)
y  <- sample(letters[1:4], n, TRUE)
y  <- ifelse(X[,2] < -1, "b", y)
y  <- ifelse(X[,1] > 0, "a", y)

require(rpart)
fit <- rpart(y ~ ., X)
plot(fit); text(fit)
info.gain.rpart(fit)  # your function - telling us for each variable how important it is

(References are always welcome.)

Thanks!
Tal

Contact me: tal.gal...@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
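Beyond the caret method Andy names, recent versions of rpart store the CART importance measure directly on the fitted object as the variable.importance component. A sketch on Tal's dummy data:

```r
library(rpart)

set.seed(31431)
n <- 400
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
y <- sample(letters[1:4], n, TRUE)
y <- ifelse(X$x2 < -1, "b", y)
y <- ifelse(X$x1 > 0, "a", y)

fit <- rpart(y ~ ., X)
sort(fit$variable.importance, decreasing = TRUE)  # x1 should rank above x2
```

The component is a named numeric vector summing each variable's contribution over primary and surrogate splits, so it answers the "ranking" half of the question without any extra packages.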
Re: [R] randomForest: too many elements specified?
I grepped for "matrix(0, n, n)" in all the R code of the package (current version), and the only place that happens is in creating the proximity matrix. Can you run traceback() and see where it happens? You should also seriously consider upgrading R and the packages...

Andy

-----Original Message-----
From: Czerminski, Ryszard
Sent: Thursday, January 20, 2011 1:08 PM
To: r-h...@stat.math.ethz.ch
Subject: [R] randomForest: too many elements specified?

I am getting

Error in matrix(0, n, n) : too many elements specified

while building a randomForest model, which looks like a memory allocation error. Software versions are randomForest 4.5-25, R version 2.7.1. The dataset is big (~90K rows, ~200 columns), but this is on a big machine (~120G RAM), and I call randomForest like this: randomForest(x, y), i.e. in supervised mode and not requesting the proximity matrix. Therefore the answer from Andy Liaw to an email reporting the same problem in 2005 (see below) is probably not directly applicable; still, it looks like this is too big a dataset for this dataset/machine combination. How does memory usage in randomForest scale with dataset size? Is there a way to build a global rf model with a dataset of this size?

Best regards,
Ryszard

Ryszard Czerminski
AstraZeneca Pharmaceuticals LP, 35 Gatehouse Drive, Waltham, MA 02451 USA
781-839-4304, ryszard.czermin...@astrazeneca.com

RE: [R] randomForest: too many elements specified? -- Liaw, Andy, Mon, 17 Jan 2005:

From: luk

When I run randomForest with a 169453 x 5 matrix, I got the following message:

Error in matrix(0, n, n) : matrix: too many elements specified

Can you please advise me how to solve this problem? Thanks, Lu

[Andy:] 1. When asking new questions, please don't reply to other posts. 2. When asking questions like these, please do show the commands you used. My guess is that you asked for the proximity matrix, or are running unsupervised randomForest (by not providing a response vector).
This requires a couple of n-by-n matrices to be created (on top of other things), n being 169453 in this case. To store a 169453 x 169453 matrix in double precision, you need 169453^2 * 8 bytes, or nearly 214 GB of memory. Even if you have that kind of hardware, I doubt you'd be able to make much sense of the result.

Andy
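Andy's back-of-envelope arithmetic, checked in R:

```r
## One n x n proximity matrix of doubles, n = 169453:
n <- 169453
n^2 * 8 / 2^30   # bytes -> GiB: about 213.9, i.e. "nearly 214 GB"
```

The general rule: proximity-related memory grows quadratically in the number of rows, while the rest of the forest grows roughly linearly, which is why skipping the proximity matrix matters so much at this scale.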
Re: [R] Where is a package NEWS.Rd located?
I was communicating with Kevin off-list. The problem seems to be at run time, not install time. news() calls tools:::.build_news_db(), and the second line of that function is:

nfile <- file.path(dir, "inst", "NEWS.Rd")

and that's the problem: an installed package shouldn't have an inst/ subdirectory, right?

Andy

-----Original Message-----
From: Duncan Murdoch
Sent: Thursday, January 06, 2011 2:30 PM
To: Kevin Wright
Cc: R list
Subject: Re: [R] Where is a package NEWS.Rd located?

On 06/01/2011 2:19 PM, Kevin Wright wrote:

Yes, exactly. But the problem is with NEWS.Rd, not NEWS.

[Duncan:] I'm not sure who you are arguing with, but if you do file a bug report, please also put together a simple reproducible example, e.g. a small package containing NEWS.Rd in the inst directory (which is where the docs say it should go) and code that shows why this is bad. Don't just talk about internal functions used for building packages; as far as we can tell so far, tools:::.build_news_db is doing exactly what it should be doing.

Duncan Murdoch

[Kevin:] pkg/inst/NEWS.Rd is moved to pkg/NEWS.Rd at build time, but for installed packages, news() tries to load pkg/inst/NEWS.Rd. I'm going to file a bug report.

Kevin

On Thu, Jan 6, 2011 at 7:29 AM, Kevin Wright <kw.s...@gmail.com> wrote:

If you look at tools:::.build_news_db, the plain-text NEWS file is searched for in pkg/NEWS and pkg/inst/NEWS, but NEWS.Rd is only searched for in pkg/inst/NEWS.Rd. Looks like a bug to me. I *think*.

Thanks,
Kevin

On Thu, Jan 6, 2011 at 7:09 AM, Kevin Wright <kw.s...@gmail.com> wrote:

Hopefully a quick question. My package has a NEWS.Rd file that is not being found by news(). The news() function calls tools:::.build_news_db, which has this line:

nfile <- file.path(dir, "inst", "NEWS.Rd")

So it appears that news() is searching for mypackage/inst/NEWS.Rd.
However, "Writing R Extensions" says "The contents of the inst subdirectory will be copied recursively to the installation directory". During installation, mypackage/inst/NEWS.Rd is copied into the mypackage directory, not mypackage/inst. What am I doing wrong, or is this a bug?

Kevin Wright
Re: [R] randomForest speed improvements
Note that that isn't exactly what I recommended. If you look at the example on the help page for combine(), you'll see that it combines RF objects trained on the *same* data; i.e., instead of having one RF with 500 trees, you can combine five RFs trained on the same data with 100 trees each into one 500-tree RF. The way you are using combine() is basically using sample size to limit tree size, which you can do by playing with the nodesize argument of randomForest(), as I suggested previously. Either way is fine as long as you don't see prediction performance degrading.

Andy

-----Original Message-----
From: apresley
Sent: Tuesday, January 04, 2011 6:30 PM
To: r-help@r-project.org
Subject: Re: [R] randomForest speed improvements

Andy,

Thanks for the reply. I had no idea I could combine them back ... that actually will work pretty well. We can have several worker threads load up the RFs on different machines and/or cores, and then re-assemble them. Rmpi might be an option down the road, but would be a bit of overhead for us now.

Using the combine() method, I was able to drastically reduce the time it takes to build randomForest objects. E.g., using about 25,000 rows (6 columns), it takes maybe 5 minutes on my laptop; using 5 randomForest objects (each with 5K rows) and then combining them takes 1 minute.

-- Anthony
Re: [R] randomForest speed improvements
From: Liaw, Andy

[Following up on my earlier message about combine() vs. nodesize:]

I should also mention that another way to do something similar is to make use of the sampsize argument of randomForest(). For example, if you call randomForest() with sampsize=500, it will randomly draw 500 data points to grow each tree. This way you don't even need to run the RFs separately and combine them.

Andy
Re: [R] randomForest speed improvements
If you have multiple cores, one poor man's solution is to run separate forests in different R sessions, save the RF objects, load them into the same session, and combine() them. You can do this less clumsily with Rmpi or other distributed computing packages. Another option is to increase nodesize, which reduces the sizes of the trees.

The problem with numeric predictors in tree-based algorithms is that the number of computations needed to find the best split point grows with the number of distinct values _at each node_. Some algorithms save on this by considering only certain quantiles as candidate split points; the current RF code doesn't do this.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of apresley
Sent: Monday, January 03, 2011 6:28 PM
To: r-help@r-project.org
Subject: Re: [R] randomForest speed improvements

I haven't tried changing mtry or ntree at all, though I suppose with only 6 variables and tens of thousands of rows, we can probably do with fewer than 500 trees (the default?). Although tossing the forest does speed things up a bit (about 15-20% faster in some cases), I need to keep the forest to do the prediction; otherwise it complains that there is no forest component in the object.

-- Anthony
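The "separate sessions, then combine()" idea can be done in one session with the parallel package (a sketch; seeds and tree counts are illustrative, and mclapply forks only on Unix-alikes, running serially on Windows):

```r
library(parallel)
library(randomForest)

# Grow 4 forests of 125 trees each on separate cores, then pool them
# into a single 500-tree forest.
rfs <- mclapply(1:4, function(i) {
  set.seed(100 + i)  # a different seed per worker
  randomForest(Species ~ ., data = iris, ntree = 125)
}, mc.cores = 4)

rf.all <- do.call(combine, rfs)
```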
Re: [R] randomForest: help with combine() function
combine() is meant to be used on randomForest objects that were built from identical training data.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Dennis Duro
Sent: Friday, December 10, 2010 11:59 PM
To: r-help@r-project.org
Subject: [R] randomForest: help with combine() function

I've built two RF objects (RF1 and RF2) and have tried to combine them, but I get the following error:

  Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) :
    non-conformable arrays
  In addition: Warning message:
  In rf$oob.times + rflist[[i]]$oob.times :
    longer object length is not a multiple of shorter object length

Both RF models use the same variables, although the NAs in the two models likely differ (I use na.roughfix in both). I assume this is part of the reason my arrays are non-conformable. If so, does anyone have suggestions on how to combine in such a situation? How similar do RFs have to be in order to combine?

Cheers
Re: [R] randomForest: How to append ID column along with predictions
The order in the output corresponds to the order of the input. I will patch the code so that it grabs the row names of the input (if they exist). If you specify type="prob", it already labels the rows with the input row names.

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Dennis Duro
Sent: Tuesday, December 07, 2010 11:46 AM
To: r-help@r-project.org
Subject: [R] randomForest: How to append ID column along with predictions

Hi all,

When running a prediction with RF on new data, I get two columns back: row number(?) and predicted class. Is there a way of appending the unique value from an ID column in the data frame to the predictions instead of the row number? I'm assuming the returned results follow the data frame, i.e., the first result returned corresponds to the first row of the data frame.

Instead of a prediction output like this:

  1, ants
  2, ants
  3, bees
  4, ants

I'd like the first column to pull IDs from the data frame associated with each row (row number in parentheses for illustration):

  (1) 1130, ants
  (2) 1130, ants
  (3) 2139, bees
  (4) 1130, ants

This is likely a simple procedure, but I haven't been able to get anything to work. Any help would be appreciated!

Cheers, Dennis
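Since the output order follows the input order, you can also append the ID column yourself (a sketch; the ID column here is invented for illustration):

```r
library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris)

newdata <- iris                                   # stand-in for the data to score
newdata$ID <- sample(1000:9999, nrow(newdata))    # illustrative ID column

# Row i of the predictions corresponds to row i of newdata,
# so binding the ID column alongside is safe.
out <- data.frame(ID = newdata$ID, prediction = predict(fit, newdata))
head(out)
```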
Re: [R] randomForest parameters for image classification
1. Memory issue: You may want to try increasing nodesize (e.g., to 5, 11, or even 21) and see if that degrades performance. If not, you should be able to grow more trees with the larger nodesize. Another option is to use the sampsize argument to have randomForest() do the random subsampling for you (on a per-tree basis, rather than one random subset for the entire forest).

2. predict() giving NA: I have no idea why you are calling predict() that way. The first argument of every predict() method I know of (not just randomForest's) needs to be a model object, followed by the data you want to predict, not the other way around.

Andy

-----Original Message-----
From: Deschamps, Benjamin
Sent: Tuesday, November 16, 2010 11:16 AM
To: r-help@r-project.org
Cc: Liaw, Andy
Subject: RE: [R] randomForest parameters for image classification

I have modified my code since asking my original question. The classifier is now generated correctly (with a good, low error rate, as expected).
However, I am running into two issues: 1) I am getting an error at the prediction stage; I get only NAs when I try to run data down the forest; 2) I run out of memory when generating the forest with more than 200 trees, due to the large block of memory already occupied by the training data.

Here is my code:

  library(raster)
  library(randomForest)

  # Set some user variables
  fn <- "image.pix"
  outraster <- "output.pix"
  training_band <- 2
  validation_band <- 1

  # Get the training data
  myraster <- stack(fn)
  training_class <- subset(myraster, training_band)
  training_class[training_class == 0] <- NA
  training_class <- Which(training_class != 0, cells = TRUE)
  training_data <- extract(myraster, training_class)
  training_response <- as.factor(as.vector(training_data[, training_band]))
  training_predictors <- training_data[, 3:nlayers(myraster)]
  remove(training_data)

  # Create and save the forest
  r_tree <- randomForest(training_predictors, y = training_response,
                         ntree = 200, keep.forest = TRUE)  # runs out of memory with ntree > ~200
  remove(training_predictors, training_response)

  # Classify the whole image
  predictor_data <- subset(myraster, 3:nlayers(myraster))
  layerNames(predictor_data) <- layerNames(myraster)[3:nlayers(myraster)]
  predictions <- predict(predictor_data, r_tree, filename = outraster,
                         format = "PCIDSK", overwrite = TRUE,
                         progress = "text", type = "response")  # all NA!?
  remove(predictor_data)

See also a thread I started at http://stackoverflow.com/questions/4186507/rgdal-efficiently-reading-large-multiband-rasters about improving the efficiency of collecting the training data.

Thanks, Benjamin

-----Original Message-----
From: Liaw, Andy
Sent: November 11, 2010 7:02 AM
To: Deschamps, Benjamin; r-help@r-project.org
Subject: RE: [R] randomForest parameters for image classification

Please show us the code you used to run randomForest, the output, as well as what you get with other algorithms (on the same random subset, for comparison).
I have yet to see a dataset where randomForest does _far_ worse than other methods.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Deschamps, Benjamin
Sent: Tuesday, November 09, 2010 10:52 AM
To: r-help@r-project.org
Subject: [R] randomForest parameters for image classification

I am implementing an image classification algorithm using the randomForest package. The training data consist of 31000+ training cases over 26 variables, plus one factor response variable (the training class). The main issue I am encountering is very low overall classification accuracy (a lot of confusion between classes). However, I know from other classifications (including a regular decision-tree classifier) that the training and validation data are sound and capable of producing good accuracies.

Currently, I am using the default parameters (500 trees, mtry not set (default), nodesize = 1, replace = TRUE). Does anyone have experience using this with large datasets? Currently I need to randomly sample my training data, because giving it the full 31000+ cases returns an out-of-memory error; the same thing happens with large numbers of trees. From what I read in the documentation, perhaps I do not have enough trees to fully capture the training data?

Any suggestions or ideas will be greatly appreciated.

Benjamin
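Andy's two memory-related suggestions from earlier in the thread both amount to extra arguments in the randomForest() call (a sketch on iris; the numbers are illustrative, not recommendations):

```r
library(randomForest)
set.seed(1)

# Larger terminal nodes -> smaller trees -> less memory per tree
rf1 <- randomForest(Species ~ ., data = iris, nodesize = 11)

# Per-tree subsampling: each of the 500 trees is grown on its own
# random draw of 50 rows, instead of one fixed subset for the whole forest
rf2 <- randomForest(Species ~ ., data = iris, ntree = 500, sampsize = 50)
```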
[R] Contract programming position at Merck (NJ, USA)
Job: Scientific programmer at Merck, Biostatistics, Rahway, NJ, USA

[Job Description] This position works closely with statisticians to process and analyze ultrasound, MRI, and radiotelemetry longitudinal studies using a series of programs developed in R and Mathworks/Matlab. The position provides support for the analysis of several pre-clinical and clinical functional MRI studies by preprocessing and processing data using the software FSL.

Qualified candidates must have proficiency and experience with statistical software and technical computing packages, including Matlab, R, SAS, and S-Plus, as well as familiarity with medical imaging concepts (e.g., functional MRI) and an understanding of analysis tools for fMRI (FSL, SPM).

This is a contract position for an ongoing need in Biometrics Research. It is a term contract position (1 year), with the possibility of extension up to 2 years based on continued business need and available budget. If you are interested, please contact: amy_gilles...@merck.com
Re: [R] to determine the variable importance in svm
The caret package has answers to all your questions.

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Neeti
Sent: Tuesday, October 26, 2010 10:42 AM
To: r-help@r-project.org
Subject: [R] to determine the variable importance in svm

Hi everyone!

I have two questions: 1) How can I obtain variable (attribute) importance using e1071::svm (or other svm methods)? 2) How can I validate the results of an svm? Currently I am using the following code to determine the error:

  library(ipred)
  for (i in 1:20)
    error.model1[i] <- errorest(Species ~ ., data = trainset, model = svm)$error
  summary(error.model1)

I am not able to understand the errorest result. If anyone knows a better method to analyse my result, please let me know.

  library(mda)
  cmat <- confusion(pred.1, species_test)
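A minimal sketch of what caret offers on both counts (assumes the caret package plus the kernlab backend for "svmRadial"; the model choice is illustrative):

```r
library(caret)
set.seed(1)

# Question 2: validation via resampling -- 10-fold CV estimates of accuracy
ctrl <- trainControl(method = "cv", number = 10)
m <- train(Species ~ ., data = iris, method = "svmRadial", trControl = ctrl)
m$results          # resampled performance for each tuning value

# Question 1: variable importance -- for SVMs, caret falls back on a
# model-free, filter-based importance measure
varImp(m)
```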
Re: [R] Random Forest AUC
The OOB error estimate in RF is one really nifty feature that alleviates the need for additional cross-validation or resampling. I've done some empirical comparisons between OOB estimates and 10-fold CV estimates, and they are basically the same.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Claudia Beleites
Sent: Saturday, October 23, 2010 3:39 PM
To: r-help@r-project.org
Subject: Re: [R] Random Forest AUC

Dear List,

Just curiosity (disclaimer: until now I have never used random forests for more than a little playing around): Is there no out-of-bag estimate available? I mean, there are already ca. 1/e trees where a (one) given sample is out-of-bag, as Andy explained. If the voting is done only over the oob trees, I should get a classical oob performance measure. Or is the oob estimate internally used up by some kind of optimization (and what would that be, given that the trees are grown till the end)?

Hoping that I do not spoil the pedagogic efforts of the list in teaching Ravishankar to do the reasoning himself...

Claudia

Am 23.10.2010 20:49, schrieb Changbin Du:
I think you should use 10-fold cross validation to judge your performance on the validation parts. What you did will be overfitted for sure; you tested on the same training set used for your model building.

On Sat, Oct 23, 2010 at 6:39 AM, mxkuhn wrote:
I think the issue is that you really can't use the training set to judge this (without resampling). For example, k-nearest neighbors are not known to overfit, but a 1-nn model will always perfectly predict the training data. Max

On Oct 23, 2010, at 9:05 AM, Liaw, Andy wrote:
What Breiman meant is that as the model gets more complex (i.e., as the number of trees tends to infinity) the generalization error (test set error) does not increase.
This does not hold for boosting, for example; i.e., you can't boost forever, which necessitates finding the optimal number of iterations. You don't need that with RF.

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of vioravis
Sent: Saturday, October 23, 2010 12:15 AM
To: r-help@r-project.org
Subject: Re: [R] Random Forest AUC

Thanks Max and Andy. If the Random Forest is always giving an AUC of 1, isn't it overfitting? If not, how do you differentiate this from overfitting? I believe random forests are claimed to never overfit (from the following link):

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#features

Ravishankar R
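On Claudia's question: the package does exactly what she describes. The $predicted component and the printed error rate of a randomForest object are OOB-based, i.e., each sample is voted on only by the trees in which it was out-of-bag. The "ca. 1/e" figure is easy to check by simulation (a sketch; n and B are illustrative):

```r
set.seed(42)
n <- 1000    # number of training cases
B <- 2000    # number of bootstrap samples ("trees")

# Fraction of bootstrap samples in which case 1 is out-of-bag
oob <- mean(replicate(B, !(1 %in% sample(n, n, replace = TRUE))))

c(simulated = oob, exact = (1 - 1/n)^n, limit = exp(-1))
# all three are approximately 0.37
```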
Re: [R] Random Forest AUC
Let me expand on what Max showed. For the most part, performance on the training set is meaningless. (That's the case for most algorithms, but especially so for RF.) In the default (and recommended) setting, the trees are grown to maximum size, which means that quite likely there's only one data point in most terminal nodes, and the prediction at a terminal node is determined by the majority class in the node, or by the lone data point.

Suppose that is the case all the time, i.e., in all trees all terminal nodes have only one data point. A particular data point would be in-bag in about 64% of the trees in the forest, and every one of those trees predicts that data point correctly. Even if all the trees where that data point is out-of-bag gave the wrong prediction, by majority vote over all trees you would still get the right answer in the end. Thus the essentially perfect prediction on the training set is, for RF, by design. Generally, good training-set prediction is just a self-fulfilling prophecy.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of vioravis
Sent: Friday, October 22, 2010 1:20 AM
To: r-help@r-project.org
Subject: [R] Random Forest AUC

Guys, I used Random Forest on a couple of data sets I had, to predict a binary response. In all the cases, the AUC on the training set comes out to be 1. Is this always the case with random forests? Can someone please clarify? I have given a simple example below, first using logistic regression and then random forests, to illustrate the problem. The AUC of the random forest comes out to be 1.
  data(iris)
  iris <- iris[iris$Species != "setosa", ]
  iris$Species <- factor(iris$Species)

  fit <- glm(Species ~ ., iris, family = binomial)
  train.predict <- predict(fit, newdata = iris, type = "response")

  library(ROCR)
  plot(performance(prediction(train.predict, iris$Species), "tpr", "fpr"), col = "red")
  auc1 <- performance(prediction(train.predict, iris$Species), "auc")@y.values[[1]]
  legend("bottomright",
         legend = paste("Logistic Regression (AUC=",
                        formatC(auc1, digits = 4, format = "f"), ")", sep = ""),
         col = "red", lty = 1)

  library(randomForest)
  fit <- randomForest(Species ~ ., data = iris, ntree = 50)
  train.predict <- predict(fit, iris, type = "prob")[, 2]

  plot(performance(prediction(train.predict, iris$Species), "tpr", "fpr"), col = "red")
  auc1 <- performance(prediction(train.predict, iris$Species), "auc")@y.values[[1]]
  legend("bottomright",
         legend = paste("Random Forests (AUC=",
                        formatC(auc1, digits = 4, format = "f"), ")", sep = ""),
         col = "red", lty = 1)

Thank you.

Regards, Ravishankar R
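Andy's counting argument can be checked in two lines (a sketch): a given case is in-bag in about 63% of the trees, so even if every out-of-bag tree voted wrongly, the in-bag trees alone would carry the majority.

```r
n <- 1000
in_bag <- 1 - (1 - 1/n)^n   # fraction of trees in which a given case is in-bag
in_bag                      # about 0.632 -- Andy's "about 64%"
in_bag > 0.5                # TRUE: the in-bag trees alone form a majority
```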
Re: [R] RandomForest Proximity Matrix
From: Michael Lindgren

Greetings R users! I am posting to inquire about the proximity matrix in the randomForest package. I am having difficulty pushing very large data through the algorithm, and it appears to hang on building the proximity matrix. I have read on Dr. Breiman's website that in the original code a choice can be made between using an N x N matrix or, to increase the ability to handle large datasets, an N x T matrix, where N is the number of samples and T is the number of trees in the forest. It is a single sentence in the FORTRAN documentation and nothing else is stated about it. My question is: does the randomForest package in R allow for this choice of proximity matrix? If so, can someone please point me toward how to use it?

[Andy:] The R package is based on version 3.3 of the Fortran code, with some new features grafted on. Unfortunately, the sparse proximity matrix is one of the features that hasn't been added in the R version. The truth is that I find the way it's done in the Fortran code not terribly satisfying, but I do not know of a better way of doing it.

Andy

Many thanks in advance and best wishes from Alaska!

Michael
--
Michael Lindgren
GIS Technician / Programmer
EWHALE Lab - Institute of Arctic Biology
University of Alaska
419 IRVING I, Fairbanks, AK 99775-7000
Email: malindg...@alaska.edu
Phone: 907 474 7959
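Some rough arithmetic shows why the dense N x N proximity matrix hurts at scale (a sketch; the sample size is illustrative):

```r
n <- 30000          # illustrative number of samples
bytes <- n^2 * 8    # dense double-precision N x N proximity matrix
bytes / 2^30        # roughly 6.7 GiB, before any copies are made
```

By contrast, an N x T representation for a typical 500-tree forest would grow linearly in N, which is why the Fortran code offers it as an option for large datasets.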
Re: [R] Force evaluation of variable when calling partialPlot
The plot titles aren't pretty, but the following works for me:

  R> library(randomForest)
  randomForest 4.5-37
  Type rfNews() to see new features/changes/bug fixes.
  R> set.seed(1004)
  R> iris.rf <- randomForest(iris[-5], iris[[5]], ntree = 1001)
  R> par(mfrow = c(2, 2))
  R> for (i in 1:4) partialPlot(iris.rf, iris, names(iris)[i])

Andy

From: Ben Bond-Lamberty

Dear R Users, I'm using the randomForest package and would like to generate partial dependence plots, one after another, for a variety of variables:

  m <- randomForest(s, ...)
  varnames <- c("var1", "var2", "var3", "var4")   # var1..4 are all in data frame s
  for (v in varnames) {
    partialPlot(x = m, pred.data = s, x.var = v)
  }

...but this doesn't work: partialPlot complains that it can't find the variable v. I think I need to force the evaluation of the loop variable v so that partialPlot sees the correct variable names, but I am stumped after trying eval and similar functions. Any suggestions on how to do this? Googling has not turned up anything very useful.

Thanks, Ben
Re: [R] randomForest - PartialPlot - reg
In a partial dependence plot, only the relative scale, not the absolute scale, of the y-axis is meaningful. I.e., you can compare the ranges of the curves between partial dependence plots of two different variables, but not the actual numbers on the axis. The range is compressed compared to the original data because of the averaging. For classification, the function is computed on the logit scale, so it's not necessarily positive. Higher does mean higher probability for the target class.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Vijayan Padmanabhan
Sent: Wednesday, September 22, 2010 11:47 PM
To: r-help
Subject: [R] randomForest - PartialPlot - reg

Dear R Group,

I am not sure if this is the right forum to raise this query, but I would rather give it a try and hope to reach the right person in this group who can help. I have a query on the interpretation of partialPlot in the randomForest package. In my earlier queries on this subject, I probably did not give sufficient detail to elicit the explanations I was after, so I am resending the query with an example and a bit more detail.

In a scenario where a set of continuous variables vs. a class response is being modeled by RF, say the iris example, using the following code, how do I interpret the partial plot that is generated?

  library(randomForest)
  data(iris)
  set.seed(543)
  iris.rf <- randomForest(Species ~ ., iris)
  partialPlot(iris.rf, iris, Sepal.Length, "setosa")

How are the y-axis values to be understood? A straightforward textual interpretation of the output from the experts in this area would help me understand the concept of the marginal effect being plotted for the variable Sepal.Length with which.class = "setosa".

Thanks for your help.

Regards, Vijayan Padmanabhan

"What is expressed without proof can be denied without proof" - Euclide.
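For reference, the logit-scale quantity Andy refers to (as described in the partialPlot help page; here p_k(x) denotes the forest's predicted probability of class k out of the K classes) is the class-k partial dependence function:

```latex
f_k(x) \;=\; \log p_k(x) \;-\; \frac{1}{K}\sum_{j=1}^{K} \log p_j(x)
```

The plotted curve is this quantity averaged over the training data, with the plotted variable held fixed at each x-axis value. That is why the values can be negative (or exceed -1, or any other bound on probabilities) and why the range is compressed relative to raw class frequencies.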
Re: [R] randomForest - partialPlot - Reg
From: Vijayan Padmanabhan Dear R Group, I have observed that in some cases, when I use a randomForest model to create a partialPlot in R using the package randomForest, the y-axis displays values that are more negative than -1! It is a classification problem that I was trying to address. Any insights as to how the y-axis can display such values for some variables? Am I missing something?

Yes: the Details section of the help page for partialPlot, or specifically, what the function is plotting for a classification model. Andy

Thanks, Regards, Vijayan Padmanabhan
Re: [R] Passing a function as a parameter...
One possibility:

R> f <- function(x, f) eval(as.call(list(as.name(f), x)))
R> f(1:10, "mean")
[1] 5.5
R> f(1:10, "max")
[1] 10

Andy

From: Jonathan Greenberg R-helpers: If I want to pass a character name of a function TO a function, and then have that function executed, how would I do this? I want an arbitrary version of the following, where any function can be used (e.g. I don't want the if-then statement here):

apply_some_function <- function(data, function_name) {
  if (function_name == "mean") {
    return(mean(data))
  }
  if (function_name == "min") {
    return(min(data))
  }
}
apply_some_function(1:10, "mean")
apply_some_function(1:10, "min")

Basically, I want the character name of the function used to actually execute that function. Thanks! --j
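A couple of other base-R idioms do the same job more directly. This is a small sketch I have added (not from the thread; the function names are mine) using do.call() and match.fun():

```r
# do.call() accepts a function's character name directly
apply_by_name <- function(x, fname) do.call(fname, list(x))

# match.fun() looks the name up and returns the function itself
apply_by_match <- function(x, fname) match.fun(fname)(x)

apply_by_name(1:10, "mean")   # 5.5
apply_by_match(1:10, "max")   # 10
```

Both avoid building an unevaluated call by hand, and both also accept the function object itself rather than its name.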
Re: [R] OT: Is randomization for targeted cancer therapies ethical?
From: jlu...@ria.buffalo.edu Clearly inferior treatments are unethical. The big question is: what constitutes "clearly"? Who decides, and how, what is "clearly" inferior? I'm sure there are plenty of people who don't understand much statistics and are perfectly willing to say the results on the two cousins show the conventional treatment is clearly inferior. Sure, on these two cousins we can say so, but what about others? Donald Berry at MD Anderson in Houston TX and Jay Kadane at Carnegie Mellon have been working on more ethical designs within the Bayesian framework. In particular, response-adaptive designs reduce the assignment to, and continuation of, patients on inferior treatments. I heard L. J. Wei talk about these kinds of designs (I don't remember if they are Bayesian) more than a dozen years ago. I don't know how commonly they are used. Andy

Bert Gunter gunter.ber...@gene.com Sent by: r-help-boun...@r-project.org 09/20/2010 01:31 PM To r-help@r-project.org Subject [R] OT: Is randomization for targeted cancer therapies ethical?

Hi Folks: **Off Topic** Those interested in clinical trials may find the following of interest: http://www.nytimes.com/2010/09/19/health/research/19trial.html It concerns the ethicality of randomizing those with life-threatening disease to relatively ineffective SOC when new biologically targeted therapies appear to be more effective. While the context may be new, the debate itself is not: Tukey wrote (or maybe it was a talk -- I can't remember for sure) about this about 30 years ago. I'm sure many others have done so as well. Cheers, Bert -- Bert Gunter Genentech Nonclinical Biostatistics
Re: [R] Decision Tree in Python or C++?
For Python, check out the project Orange: http://www.ailab.si/orange/doc/catalog/Classify/ClassificationTree.htm Not sure about C++, but OpenDT is in C: http://opendt.sourceforge.net/ Looks like OpenCV has both Python and C++ interfaces (I didn't see a Python interface to decision trees, though): http://opencv.willowgarage.com/documentation/cpp/decision_trees.html Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Wensui Liu Sent: Saturday, September 04, 2010 5:14 PM To: noclue_ Cc: r-help@r-project.org Subject: Re: [R] Decision Tree in Python or C++?

For Python, please check http://onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html

On Sat, Sep 4, 2010 at 11:21 AM, noclue_ tim@netzero.net wrote: Has anybody used decision trees in Python or C++ (or written their own decision tree implementation in Python or C++)? My goal is to run a decision tree on 8 million observations as the training set and score 7 million in the test set. I am testing the 'rpart' package in a 64-bit Linux + 64-bit R environment, but it seems that rpart is either not stable or runs out of memory very quickly. (Is it because R passes everything as a copy instead of as an object reference?) Any ideas would be greatly appreciated! Have a nice weekend! -- View this message in context: http://r.789695.n4.nabble.com/Decision-Tree-in-Python-or-C-tp2526810p2526810.html Sent from the R help mailing list archive at Nabble.com.

-- == WenSui Liu wens...@paypal.com statcompute.spaces.live.com ==
[R] Open position at Merck (NJ, USA)
Job description: Computational statistician/biometrician. The Biometrics Research Department at Merck Research Laboratories, Merck & Co., Inc. in Rahway, NJ, is seeking a highly motivated statistician/data analyst to work in its basic research, drug discovery, preclinical, and early clinical development areas. The applicant should have broad expertise in statistical computing. Experience and/or education relevant to signal processing, image processing, pattern recognition, machine learning, or bioinformatics is preferred. The position will involve providing statistical, mathematical, and software development support for one or more of the following areas: medical imaging, biological signal analysis including EEG, MS proteomics, and computational chemistry. We are looking for a Ph.D. with a background and/or experience in at least one of the following fields: Statistics, Electrical/Computer or Biomedical Engineering, Computer Science, Applied Mathematics, or Physics. Advanced computer programming skills (including, but not limited to, R, S-PLUS, Matlab, C/C++) and excellent communication skills are essential. An ability to lead statistical analysis efforts within a multidisciplinary team is required. The position may also involve general statistical consulting and training. Our dedication to delivering quality medicines in innovative ways and our commitment to bringing out the best in our people are just some of the reasons why we're ranked among Fortune magazine's 100 Best Companies to Work for in America. We offer a competitive salary, an outstanding benefits package, and a professional work environment with a company known for scientific excellence. To apply, please forward your CV or resume and cover letter to vladimir_svetnik(at)merck.com ATTENTION: Open Position Vladimir Svetnik, Ph.D. Biometrics Research Dept.
Merck Research Laboratories PO Box 2000, RY33-300 Rahway, NJ 07065-0900 USA
Re: [R] RandomForests Limitations? Work Arounds?
You're not giving us much to go on, so the info I can give is correspondingly vague. I take it you are using RF in unsupervised mode. What RF does in this case is simply generate a second copy of the data that has the same marginal distributions as the data you have, but with the variables independent. It then runs classification, treating your data as one class and the generated data as the other class. The output is the proximity matrix, which you can use as the similarity matrix for clustering. Given that, you know that RF basically has to use twice as much memory just to store the data. That's one place where it can take lots of memory. The second place is the storage of the proximity matrix itself: if you have n rows in your data, the proximity matrix is n by n. For moderate n, this is going to be the part that takes up the most memory. Just in case you haven't seen/heard: avoid the formula interface (i.e., randomForest(~ ., data=mydata, ...)), because that can really soak up memory. Yes, a 64-bit OS and 64-bit R can help, but only if you have the RAM to take advantage of the platform. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Michael Lindgren Sent: Tuesday, September 07, 2010 4:28 PM To: r-help@r-project.org Subject: [R] RandomForests Limitations? Work Arounds?

Greetings, I want to inquire about the memory limitations of the randomForest package. I am attempting to perform clustering analysis using RF but I keep getting the message that RF cannot allocate a vector of a given size. I am currently using the 32-bit version of R to run this analysis; are there fewer memory issues when using the 64-bit version of R? Mainly I want to be able to run RF on a very large dataset, but keep having to take very small sample sizes to do so. Any advice is more than appreciated.
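As a concrete illustration of the unsupervised mode Andy describes, here is a minimal sketch I have added (not from the thread) on the small iris data: the proximity matrix from an unsupervised randomForest() fit is turned into a dissimilarity and fed to hierarchical clustering.

```r
library(randomForest)

set.seed(1)
# Unsupervised mode: supply x only, no response, and ask for proximities.
# Internally RF classifies the real rows against synthetic rows that have
# the same marginal distributions but independent variables.
urf <- randomForest(iris[, 1:4], proximity = TRUE, ntree = 500)

# The proximity matrix is n x n -- this is why memory grows quadratically
# with the number of rows.
dim(urf$proximity)   # 150 x 150 for iris

# Use 1 - proximity as a dissimilarity for clustering
hc <- hclust(as.dist(1 - urf$proximity), method = "average")
groups <- cutree(hc, k = 3)
table(groups, iris$Species)
```

On a dataset with, say, 50,000 rows the proximity matrix alone needs 50000^2 doubles (about 20 GB), which matches the "cannot allocate a vector" symptom in the question.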
Best, Michael
Re: [R] predict.loess and NA/NaN values
From: Philipp Pagel In a current project, I am fitting loess models to subsets of data in order to use the loess predictions for normalization (similar to what is done in many microarray analyses). While working on this I ran into a problem when I tried to predict from the loess models and the data contained NAs or NaNs. I tracked down the problem to the fact that predict.loess will not return a value at all when fed with such values. A toy example:

x <- rnorm(15)
y <- x + rnorm(15)
model.lm <- lm(y ~ x)
model.loess <- loess(y ~ x)
predict(model.lm, data.frame(x=c(0.5, Inf, -Inf, NA, NaN)))
predict(model.loess, data.frame(x=c(0.5, Inf, -Inf, NA, NaN)))

The behaviour of predict.lm meets my expectation: I get a vector of length 5 where the unpredictable ones are NA or NaN. predict.loess, on the other hand, returns only 3 values, quietly skipping the last two. I was unable to find anything in the manual page that explains this behaviour or says how to change it. So I'm asking the community: is there a way to fix this or do I have to code around it?

This is not much help, but I did a bit of digging by using debug(stats:::predict.loess) and then stepping through the function line by line. Apparently the problem happens before the actual prediction is done. The code as.matrix(model.frame(delete.response(terms(object)), newdata)) already omitted the NA and NaN. The problem is that that's the default behavior of model.frame(). Consulting ?model.frame, I see that you can override this by setting the "na.action" attribute of the data frame passed to it. Thus I tried setting

na.dat = data.frame(x=c(0.5, Inf, -Inf, NA, NaN))
attr(na.dat, "na.action") = na.pass

This does make the as.matrix(model.frame()) line retain the NA and NaN, but it bombs in the prediction at the subsequent step. I guess it really doesn't like NA as inputs. What you can do is patch the code to add the NAs back after the prediction step (which many predict() methods do). Cheers, Andy

This is in R 2.11.1 (Linux), by the way.
Thanks in advance Philipp -- Dr. Philipp Pagel Lehrstuhl für Genomorientierte Bioinformatik Technische Universität München Wissenschaftszentrum Weihenstephan Maximus-von-Imhof-Forum 3 85354 Freising, Germany http://webclu.bio.wzw.tum.de/~pagel/
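Andy's suggestion (predict on the clean rows, then patch the NAs back in afterwards) can be sketched as a small wrapper. This is a hypothetical helper I have added, not code from the thread; the function name and the single-predictor assumption are mine:

```r
# Predict from a loess fit, returning NA for rows whose predictor is
# NA/NaN/Inf instead of silently dropping them.
predict_loess_na <- function(model, newdata, var = "x") {
  ok  <- is.finite(newdata[[var]])        # FALSE for NA, NaN, and +/-Inf
  out <- rep(NA_real_, nrow(newdata))     # pre-fill with NA
  out[ok] <- predict(model, newdata[ok, , drop = FALSE])
  out
}

set.seed(1)
x <- rnorm(15)
y <- x + rnorm(15)
model.loess <- loess(y ~ x)
predict_loess_na(model.loess, data.frame(x = c(0.5, Inf, -Inf, NA, NaN)))
```

The result has length 5, matching predict.lm's behaviour: a fitted value for 0.5 and NA for the four unpredictable inputs.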
Re: [R] Learning ANOVA
From: Stephen Liu Hi JesperHybel, Thanks for your advice. "If you're trying to follow the youtube video you have a typing mistake here: InsectSprays.aov <- (test01$count ~ test01$spray) I think this should be: InsectSprays.aov <- aov(test01$count ~ test01$spray)" Your advice works for me. Sorry, I missed aov before (test01$count ~ test01$spray).

I just want to offer another point: if you see any tutorial/document/book advising you to use a model formula as above, e.g., anything like

df$var1 ~ df$var2 + df$var3

just run away from it as fast as you can, and try to wipe it from your memory. That's about the worst way to use a model formula, and it is very likely to give you what may seem to be strange problems down the road. Well-written model fitting functions should be called like this:

modelfn(var1 ~ var2 + var3, data=df, ...)

Andy

InsectSprays.aov <- aov(test01$count ~ test01$spray)
summary(InsectSprays.aov)
             Df Sum Sq Mean Sq F value    Pr(>F)
test01$spray  5 2668.8  533.77  34.702 < 2.2e-16 ***
Residuals    66 1015.2   15.38
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(InsectSprays.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = test01$count ~ test01$spray)

$`test01$spray`
       diff        lwr        upr     p adj
B-A   0.833  -3.866075   5.532742 0.9951810
C-A -12.417 -17.116075  -7.717258 0.000
D-A  -9.583 -14.282742  -4.883925 0.014
E-A -11.000 -15.699409  -6.300591 0.000
F-A   2.167  -2.532742   6.866075 0.7542147
C-B -13.250 -17.949409  -8.550591 0.000
D-B -10.417 -15.116075  -5.717258 0.002
E-B -11.833 -16.532742  -7.133925 0.000
F-B   1.333  -3.366075   6.032742 0.9603075
D-C   2.833  -1.866075   7.532742 0.4920707
E-C   1.417  -3.282742   6.116075 0.9488669
F-C  14.583   9.883925  19.282742 0.000
E-D  -1.417  -6.116075   3.282742 0.9488669
F-D  11.750   7.050591  16.449409 0.000
F-E  13.167   8.467258  17.866075 0.000

I compared my result with example(InsectSprays). They look the same. I also compared plot(InsectSprays.aov).
A further question: how can I plot the 4 graphs simultaneously, instead of viewing them individually? I read ?plot but was unable to resolve this. Also, how do I save InsectSprays.aov? I think I can only save it as InsectSprays.csv; I can't find a write.aov command. Thanks TIA B.R. satimis

- Original Message From: JesperHybel jesperhy...@hotmail.com To: r-help@r-project.org Sent: Sat, August 14, 2010 2:09:48 AM Subject: Re: [R] Learning ANOVA

If you're trying to follow the youtube video you have a typing mistake here: InsectSprays.aov <- (test01$count ~ test01$spray) I think this should be: InsectSprays.aov <- aov(test01$count ~ test01$spray) You're missing the function call aov on the right-hand side of the assignment operator '<-'. The result of applying the function aov() is stored in InsectSprays.aov and is accessed through summary(InsectSprays.aov). Since you missed the function call, you cannot apply TukeyHSD() to InsectSprays.aov, I think. -- View this message in context: http://r.789695.n4.nabble.com/Learning-ANOVA-tp2323660p2324590.html Sent from the R help mailing list archive at Nabble.com.
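For the two follow-up questions (showing the four diagnostic plots at once, and saving the fitted model), here is a minimal base-R sketch I have added, not taken from the thread. Note that a model object is saved with saveRDS()/save(), not write.csv():

```r
data(InsectSprays)
# Fit using the formula + data= form, per Andy's advice in the thread
fit <- aov(count ~ spray, data = InsectSprays)

# Show the four diagnostic plots on one page instead of one at a time
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# write.csv only handles tabular data; save the model object itself instead
rds <- tempfile(fileext = ".rds")
saveRDS(fit, rds)
fit2 <- readRDS(rds)   # restores the aov object for later use
```

par(mfrow = c(2, 2)) splits the device into a 2-by-2 grid, so the four plots produced by plot() on a fitted model land on a single page.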
Re: [R] Learning ANOVA
From: Stephen Liu Hi folks, R on Ubuntu 10.04 64 bit. I performed the following steps in R:

### access the object
data(InsectSprays)
### create a .csv file
write.csv(InsectSprays, "InsectSpraysCopy.csv")

On another terminal:
$ sudo updatedb
$ locate InsectSpraysCopy.csv
/home/userA/InsectSpraysCopy.csv

### Read in some data
test01 <- read.csv(file.choose(), header=TRUE)
Enter file name: /home/userA/InsectSpraysCopy.csv
### Look at the data
test01
  X count spray
1 1    10     A
[snipped]

Note the names of the variables here. They don't match what you tried to use in your boxplot() call below. Where did you get the idea that there are DO and Stream in the test01 data frame? Andy

### Create a side-by-side boxplot of the data
boxplot(test01$DO ~ test01$Stream)
Error in model.frame.default(formula = test01$DO ~ test01$Stream) : invalid type (NULL) for variable 'test01$DO'

I was stuck here. Pls help. TIA B.R. Stephen L

- Original Message From: Stephen Liu sati...@yahoo.com To: r-help@r-project.org Sent: Fri, August 13, 2010 11:34:31 AM Subject: [R] Learning ANOVA

Hi folks, The file to be used is in data(InsectSprays). I can't figure out where to insert it in the following command: test01 <- read.csv(file.choose(), header=TRUE) Please help. TIA B.R.
Re: [R] Error on random forest variable importance estimates
From: Pierre Dubath Hello, I am using the R randomForest package to classify variable stars. I have a training set of 1755 stars described by (too) many variables. Some of these variables are highly correlated. I believe that I understand how randomForest works and how the variable importances are evaluated (through variable permutations). Here are my questions.

1) Variable importance error? Is there any way to estimate the error on the MeanDecreaseAccuracy? In other words, I would like to know how significant the MeanDecreaseAccuracy differences are (and display horizontal error bars in the varImpPlot output).

If you really want to do it, one possibility is a permutation test: permute your response, say, 1000 or 2000 times, run RF on each of these permuted responses, and use the importance measures as samples from the null distribution.

I have noticed that even with a relatively large number of trees, there is variation in the importance values from one run to the next. Could this serve as a measure of the errors/uncertainties?

Yes.

2) How to deal with variable correlation? So far, I am iterating: selecting the most important variable first, removing all other variables that have a high correlation (say higher than 80%) with it, taking the second most important variable left, removing variables with high correlation with any of the first two variables, and so on (also using some astronomical insight as to which variables are the most important!). Is there a better way to deal with correlation in randomForest? (I suppose that using many correlated variables should not be a problem for randomForest, but it is for my understanding of the data and for other algorithms.)

That depends a lot on what you're trying to do. RF can tolerate problematic data, but that doesn't mean it will magically give you good answers. Trying to draw conclusions about effects when there are highly correlated (and worse, important) variables is a tricky business.
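The permutation-test idea Andy suggests can be sketched as follows. This is an illustration I have added (not from the thread), shown on the small iris data with deliberately few permutations and trees so it runs quickly; in practice you would use the 1000-2000 permutations Andy mentions:

```r
library(randomForest)

set.seed(42)
x <- iris[, 1:4]
y <- iris$Species

# Observed MeanDecreaseAccuracy for each variable
obs <- randomForest(x, y, ntree = 100,
                    importance = TRUE)$importance[, "MeanDecreaseAccuracy"]

# Null distribution: refit RF with the response permuted, so any apparent
# importance is due to chance alone
null <- replicate(50, {
  yp <- sample(y)
  randomForest(x, yp, ntree = 100,
               importance = TRUE)$importance[, "MeanDecreaseAccuracy"]
})

# One-sided permutation p-value per variable: fraction of null importances
# at least as large as the observed one
pvals <- rowMeans(null >= obs)
pvals
```

The spread of each row of `null` also gives a rough scale for the error bars Pierre asks about.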
3) How many variables should eventually be used? I have made successive runs, adding one variable at a time from the most to the least important of the (not-too-correlated) variables. I then plot the error rate (err.rate) as a function of the number of variables used. As this number increases, the error first decreases sharply, but relatively soon it reaches a plateau. I assume the inflection point can be used to derive the minimum number of variables to be used. Is that a sensible approach? Is there any other suggestion? A measure of the error on err.rate would also really help here. Is there any idea how to estimate this? From the variation between runs, or with the help of importanceSD somehow?

One approach is described in the following paper (in the Proceedings of MCS 2004): http://www.springerlink.com/content/9n61mquugf9tungl/ Best, Andy

Thanks very much in advance for any help. Pierre Dubath
Re: [R] Collinearity in Moderated Multiple Regression
Seems to me it may be worth stating what may be elementary to some on this list:

- If all relevant variables are included in the model and the true model is indeed linear, then all least-squares estimated coefficients are unbiased. [David Ruppert once said about the three kinds of lies: lies, damn lies, and Y ~ N(Xb, s^2).]

- If some variables with non-zero true coefficients are omitted from the fitted model, the estimated coefficients of the variables remaining in the model may be biased, except when the omitted variables are orthogonal to those in the model (i.e., zero correlations).

- If x1 and x2 are correlated, you'd have a tough enough time separating their effects on y, let alone trying to assess their interaction effect on y.

Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Tuesday, August 03, 2010 4:52 PM To: Michael Haenlein Cc: r-help@r-project.org Subject: Re: [R] Collinearity in Moderated Multiple Regression

"Biased regression coefficients" is nonsense. The coefficients are unbiased: their expectation (in the appropriate model) is the true value of the parameters (when estimated by, e.g., least squares). The problem is model selection. I suggest you consult a local statistician, as you seem confused about the basic concepts. Bert Gunter Genentech Nonclinical Biostatistics

On Tue, Aug 3, 2010 at 1:42 PM, Michael Haenlein haenl...@escpeurope.eu wrote: Thanks for all your comments! @Dennis: Are there any thresholds that I can use to evaluate the variance inflation factor? I think I learned at some point that VIF should be less than 10, but probably that is too conservative? You mentioned in your example that a VIF of 13 is not big enough to raise a red flag. So is the cut-off more around 15 or 20? @Bert: The purpose of my regression is inference, that is, to know whether and to what extent x1, x2 and x1*x2 influence y.
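Andy's second bullet, bias from omitting a correlated predictor, is easy to see in a small simulation. This is a sketch I have added (not from the thread):

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x1 and x2 are correlated
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true coefficients: 2 and 3

coef(lm(y ~ x1 + x2))  # both predictors in: estimates near the true values
coef(lm(y ~ x1))       # x2 omitted: x1's slope absorbs part of x2's effect
```

With x2 omitted, the expected slope on x1 is roughly 2 + 3 * 0.8 = 4.4 rather than 2, which is the omitted-variable bias the bullet describes; with both predictors included, the least-squares estimates are unbiased, as Bert points out below.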
It's less about prediction than about understanding the relative impact of different variables. So, if I get your message correctly, correlation among the predictors is likely to be an issue in my case, as it leads to biased regression coefficients (which is what I feared). Thanks, Michael

-Original Message- From: Bert Gunter [mailto:gunter.ber...@gene.com] Sent: Tuesday, August 03, 2010 22:37 To: Dennis Murphy Cc: haenl...@gmail.com; r-help@r-project.org Subject: Re: [R] Collinearity in Moderated Multiple Regression

Absolutely right. But I think it's also worth adding that when the predictors _are_ correlated, the estimates of their coefficients depend on which are included in the model. This means that one should generally not try to interpret the individual coefficients, e.g., as a way to assess their relative importance. Rather, they should just be viewed as the machinery that produces the prediction surface, and it is that surface one needs to consider to understand the model. In my experience, this elementary fact is not understood by many (most?) nonstatistical practitioners using multiple regression -- and this ignorance gets them into a world of trouble. -- Bert Bert Gunter Genentech Nonclinical Biostatistics

On Tue, Aug 3, 2010 at 12:57 PM, Dennis Murphy djmu...@gmail.com wrote: Hi: On Tue, Aug 3, 2010 at 6:51 AM, haenl...@gmail.com wrote: I'm sorry -- I think I chose a bad example. Let me start over again: I want to estimate a moderated regression model of the following form: y = a*x1 + b*x2 + c*x1*x2 + e No intercept? What's your null model, then? Based on my understanding, including an interaction term (x1*x2) in the regression in addition to x1 and x2 leads to issues of multicollinearity, as x1*x2 is likely to covary to some degree with x1 (and x2). Is it possible you're confusing interaction with multicollinearity?
You've stated that x1 and x2 are weakly correlated; the product term is going to be correlated with each of its constituent covariates, but unless that correlation is above 0.9 (some say 0.95) in magnitude, multicollinearity is not really a substantive issue. As others have suggested, if you're concerned about multicollinearity, then fit the interaction model and use the vif() function from package car or elsewhere to check for it. Multicollinearity has to do with ill-conditioning in the model matrix; interaction means that the response y is influenced by the product of the x1 and x2 covariates as well as the individual covariates. They are not the same thing. Perhaps an example will help. Here's your x1 and x2 with a manufactured response:

df <- data.frame(x1 = rep(1:3, each = 3), x2 = rep(1:3, 3))
df$y <- 0.5 + df$x1 + 1.2 * df$x2 + 2.5 * df$x1 * df$x2 + rnorm(9)
# Response is generated to produce a significant interaction
df
  x1 x2         y
1  1  1  5.968255
2  1  2  7.566212
3  1  3 13.420006
4  2  1  9.025791
5  2  2 16.382381
Re: [R] Problems with normality req. for ANOVA
As a matter of fact, I would say both Bert and I encounter designed experiments a lot more than observational studies, yet we speak from experience that the things Bert mentioned happen on a daily basis. When you talk to experimenters, ask your questions carefully and you'll see these things crop up. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of David Winsemius Sent: Monday, August 02, 2010 3:35 PM To: Bert Gunter Cc: r-help@r-project.org; wwreith Subject: Re: [R] Problems with normality req. for ANOVA

In the general situation of observational studies, your point is undoubtedly true, and apparently you believe it to be true even in the setting of designed experiments. Perhaps I should have confined myself to my first sentence. -- David.

On Aug 2, 2010, at 2:05 PM, Bert Gunter wrote: David et al.: I take issue with this. It is the lack of independence that is the major issue. In particular, clustering, split-plotting, and so forth due to convenience-order experimentation, lack of randomization, and exogenous effects like the systematic effects of measurement method/location have the major effect in inducing bias and distorting inference. Normality and unequal variances typically pale to insignificance compared to this. Obviously, IMHO. Note 1: George Box noted this at least 50 years ago in the early '60s when he and Jenkins developed ARIMA modeling. Note 2: If you can, have a look at Jack Youden's classic paper "Enduring Values", which comments to some extent on these issues, here: http://www.jstor.org/pss/1266913 Cheers, Bert Bert Gunter Genentech Nonclinical Biostatistics

On Mon, Aug 2, 2010 at 10:32 AM, David Winsemius dwinsem...@comcast.net wrote: On Aug 2, 2010, at 9:33 AM, wwreith wrote: I am conducting an experiment with four independent variables, each of which has three or more factor levels. The sample size is quite large, i.e. several thousand.
The dependent variable data does not pass a normality test but visually looks close to normal, so is there a way to compute the effect this would have on the p-value for ANOVA, or is there a way to perform a nonparametric test in R that will handle this many independent variables? Simply saying ANOVA is robust to small departures from normality is not going to be good enough for my client.

The statistical assumption of normality for linear models does not apply to the distribution of the dependent variable, but rather to the residuals after a model is estimated. Furthermore, it is the homoskedasticity assumption that is more commonly violated and also a greater threat to validity. (And if you don't already know both of these points, then you desperately need to review your basic modeling practices.)

I need to compute an error amount for ANOVA or find a nonparametric equivalent.

You might get a better answer if you expressed the first part of that question in unambiguous terminology. What is an "error amount"? For the second part, there is an entire Task View on Robust Statistical Methods. -- David Winsemius, MD West Hartford, CT
Re: [R] Collinearity in Moderated Multiple Regression
If the collinearity you're seeing arose from the addition of a product (interaction) term, I do not think penalization is the best answer. What is the goal of your analysis? If it's prediction, then I wouldn't worry about this type of collinearity. If you're interested in inference, I'd try some transformation to reduce (but not necessarily eliminate) the effect of collinearity. Mean centering is the simplest, but not the only thing you can do. Just my $0.02... Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Michael Haenlein Sent: Tuesday, August 03, 2010 10:44 AM To: 'Nikhil Kaza' Cc: r-help@r-project.org Subject: Re: [R] Collinearity in Moderated Multiple Regression Thanks very much -- it seems that Ridge Regression can do what I'm looking for! Best, Michael -Original Message- From: Nikhil Kaza [mailto:nikhil.l...@gmail.com] Sent: Tuesday, August 03, 2010 16:21 To: haenl...@gmail.com Cc: r-help@r-project.org (r-help@R-project.org) Subject: Re: [R] Collinearity in Moderated Multiple Regression My usual strategy for dealing with multicollinearity is to drop the offending variable or transform one of them. I would also check the vif functions in car and Design. I think you are looking for lm.ridge in the MASS package. Nikhil Kaza Asst. Professor, City and Regional Planning University of North Carolina nikhil.l...@gmail.com On Aug 3, 2010, at 9:51 AM, haenl...@gmail.com wrote: I'm sorry -- I think I chose a bad example. Let me start over again: I want to estimate a moderated regression model of the following form: y = a*x1 + b*x2 + c*x1*x2 + e Based on my understanding, including an interaction term (x1*x2) in the regression in addition to x1 and x2 leads to issues of multicollinearity, as x1*x2 is likely to covary to some degree with x1 (and x2). One recommendation I have seen in this context is to use mean centering, but apparently this does not solve the problem (see: Echambadi, Raj and James D. 
Hess (2007), Mean-centering does not alleviate collinearity problems in moderated multiple regression models, Marketing science, 26 (3), 438 - 45). So my question is: Which R function can I use to estimate this type of model. Sorry for the confusion caused due to my previous message, Michael On Aug 3, 2010 3:42pm, David Winsemius dwinsem...@comcast.net wrote: I think you are attributing to collinearity a problem that is due to your small sample size. You are predicting 9 points with 3 predictor terms, and incorrectly concluding that there is some inconsistency because you get an R^2 that is above some number you deem surprising. (I got values between 0.2 and 0.4 on several runs. Try: x1 x2 x3 y model summary(model) # Multiple R-squared: 0.04269 -- David. On Aug 3, 2010, at 9:10 AM, Michael Haenlein wrote: Dear all, I have one dependent variable y and two independent variables x1 and x2 which I would like to use to explain y. x1 and x2 are design factors in an experiment and are not correlated with each other. For example assume that: x1 x2 cor(x1,x2) The problem is that I do not only want to analyze the effect of x1 and x2 on y but also of their interaction x1*x2. Evidently this interaction term has a substantial correlation with both x1 and x2: x3 cor(x1,x3) cor(x2,x3) I therefore expect that a simple regression of y on x1, x2 and x1*x2 will lead to biased results due to multicollinearity. For example, even when y is completely random and unrelated to x1 and x2, I obtain a substantial R2 for a simple linear model which includes all three variables. This evidently does not make sense: y model summary(model) Is there some function within R or in some separate library that allows me to estimate such a regression without obtaining inconsistent results? 
Thanks for your help in advance, Michael Michael Haenlein Associate Professor of Marketing ESCP Europe Paris, France [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. David Winsemius, MD West Hartford, CT [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide
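The centering point discussed in this thread can be sketched with simulated data (all names here are invented): centering shrinks the correlation between a predictor and its product term, but, as Echambadi and Hess argue, it does not change the model itself; the fitted values are identical.

```r
## Simulated illustration: mean-centering reduces the correlation between
## x1 and the product term x1*x2, yet the fitted model is unchanged.
set.seed(42)
x1 <- runif(200)
x2 <- runif(200)
y  <- x1 + x2 + 0.5 * x1 * x2 + rnorm(200, sd = 0.1)

raw <- lm(y ~ x1 * x2)                     # uncentered fit
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
cen <- lm(y ~ x1c * x2c)                   # centered fit

cor(x1,  x1  * x2)                         # sizeable correlation
cor(x1c, x1c * x2c)                        # much smaller after centering
all.equal(fitted(raw), fitted(cen))        # same fitted values
```

So centering helps the numerics and the interpretability of the main-effect coefficients, but it cannot make the collinearity inherent in a product term disappear.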
Re: [R] randomForest outlier return NA
There's a bug in the code. If you add row names to the X matrix before you call randomForest(), you'd get:

R> summary(outlier(mdl.rf))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.0580 -0.5957  0.      0.6406  1.2650  9.5200

I'll fix this in the next release. Thanks for reporting. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Pau Carrio Gaspar Sent: Wednesday, July 14, 2010 6:36 AM To: r-help@r-project.org Subject: [R] randomForest outlier return NA Dear R-users, I have a problem with randomForest{outlier}. After running the following code (that produces a silly data set and builds a model with randomForest):

###
library(randomForest)
set.seed(0)
## build data set
X <- rbind(matrix(runif(n = 400, min = -1, max = 1), ncol = 10),
           rep(1, times = 10))
Y <- matrix(nrow = nrow(X), ncol = 1)
for (i in (1:nrow(X))) {
  Y[i, 1] <- sign(sum(X[i, ]))
}
## build model
mdl.rf <- randomForest(x = X, y = as.factor(Y), proximity = TRUE,
                       mtry = 10, ntree = 500)
summary(outlier(mdl.rf))
###

I get the following output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
                                                     41

Can anyone explain why the output of outlier() only returns NA's? Thanks Pau __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
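The row-name workaround Andy describes can be sketched as follows (simulated data in the spirit of Pau's example; this assumes the randomForest package behaves as reported at the time, and the object names are invented):

```r
## Workaround sketch: give X explicit row names before fitting, so that
## outlier() can match the rows of the proximity matrix.
library(randomForest)
set.seed(0)
X <- matrix(runif(400, min = -1, max = 1), ncol = 10)
rownames(X) <- seq_len(nrow(X))   # the workaround
Y <- factor(sign(rowSums(X)))
mdl.rf <- randomForest(x = X, y = Y, proximity = TRUE,
                       mtry = 10, ntree = 500)
summary(outlier(mdl.rf))          # numeric summary, no NA's
```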
Re: [R] anyone know why package RandomForest na.roughfix is so slow??
I'll incorporate some of these ideas into the next release. Thanks! Best, Andy -Original Message- From: h.wick...@gmail.com [mailto:h.wick...@gmail.com] On Behalf Of Hadley Wickham Sent: Thursday, July 01, 2010 8:08 PM To: Mike Williamson Cc: Liaw, Andy; r-help Subject: Re: [R] anyone know why package RandomForest na.roughfix is so slow?? Here's another version that's a bit easier to read:

na.roughfix2 <- function(object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame",
            row.names = seq_len(nrow(object)))
}

roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)
  if (is.numeric(x)) {
    x[missing] <- median.default(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}

I'm cheating a bit because as.data.frame is so slow. Hadley On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson this.is@gmail.com wrote: Jim, Andy, Thanks for your suggestions! I found some time today to futz around with it, and I found a home-made script to fill in NA values to be much quicker. For those who are interested, instead of using:

dataSet <- na.roughfix(dataSet)

I used:

origCols <- names(dataSet)
## Fix numeric values...
dataSet <- as.data.frame(lapply(dataSet, FUN = function(x) {
  if (!is.numeric(x)) { x } else { ifelse(is.na(x), median(x, na.rm = TRUE), x) }
}), row.names = row.names(dataSet))
## Fix factors...
dataSet <- as.data.frame(lapply(dataSet, FUN = function(x) {
  if (!is.factor(x)) { x } else { levels(x)[ifelse(!is.na(x), x, which.max(table(x)))] }
}), row.names = row.names(dataSet))
names(dataSet) <- origCols

In one case study that I ran, the na.roughfix() algo took 296 seconds whereas the home-made one above took 16 seconds. 
Regards, Mike Telescopes and bathyscaphes and sonar probes of Scottish lakes, Tacoma Narrows bridge collapse explained with abstract phase-space maps, Some x-ray slides, a music score, Minard's Napoleanic war: The most exciting frontier is charting what's already here. -- xkcd -- Help protect Wikipedia. Donate now: http://wikimediafoundation.org/wiki/Support_Wikipedia/en On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy andy_l...@merck.com wrote: You need to isolate the problem further, or give more detail about your data. This is what I get:

R> nr <- 2134
R> nc <- 14037
R> x <- matrix(runif(nr*nc), nr, nc)
R> n.na <- round(nr*nc/10)
R> x[sample(nr*nc, n.na)] <- NA
R> system.time(x.fixed <- na.roughfix(x))
   user  system elapsed
   8.44    0.39    8.85

R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB ram. Andy -- *From:* Mike Williamson [mailto:this.is@gmail.com] *Sent:* Thursday, July 01, 2010 12:48 PM *To:* Liaw, Andy *Cc:* r-help *Subject:* Re: [R] anyone know why package RandomForest na.roughfix is so slow?? Andy, You're right, I didn't supply any code, because my call was very simple and it was the call itself in question. However, here is the associated code I am using:

naFixTime <- system.time({
  if (fltrResponse) {
    ## TRUE: there are no NA's in the response... cleared via earlier steps
    message(paste(iAm, ": Missing values will now be imputed...\n", sep = ""))
    try(dataSet <- rfImpute(dataSet[, !is.element(names(dataSet), response)],
                            dataSet[, response]))
  } else {
    ## In this case, there is no response column in the data set
    message(paste(iAm, ": Missing values will now be filled in with median values or most frequent levels", sep = ""))
    try(dataSet <- na.roughfix(dataSet))
  }
})

As you can see, the na.roughfix call is made as simply as possible: I supply the entire dataSet (only parameters, no responses). I am not doing the prediction here (that is done later, and the prediction itself is not taking very long). 
Here are some calculation times that I experienced:

# rows   # cols   time to run na.roughfix
======   ======   =======================
 2046     2833    ~ 2 minutes
 2066     5626    ~ 6 minutes
 2134    14037    ~ 30 minutes

These numbers are on a Windows server using the 64-bit version of 'R'. Regards
Re: [R] anyone know why package RandomForest na.roughfix is so slow??
You have not shown any code on exactly how you use na.roughfix(), so I can only guess. If you are doing something like:

randomForest(y ~ ., mybigdata, na.action = na.roughfix, ...)

I would not be surprised that it's taking very long on large datasets. Most likely it's caused by the formula interface, not na.roughfix() itself. If that is your case, try doing the imputation beforehand and run randomForest() afterward; e.g.,

myroughfixed <- na.roughfix(mybigdata)
randomForest(myroughfixed[list.of.predictor.columns],
             myroughfixed[[myresponse]], ...)

HTH, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Mike Williamson Sent: Wednesday, June 30, 2010 7:53 PM To: r-help Subject: [R] anyone know why package RandomForest na.roughfix is so slow?? Hi all, I am using the package random forest for random forest predictions. I like the package. However, I have fairly large data sets, and it can often take *hours* just to go through the na.roughfix call, which simply goes through and cleans up any NA values to either the median (numerical data) or the most frequent occurrence (factors). I am going to start doing some comparisons between na.roughfix() and some apply() functions which, it seems, are able to do the same job more quickly. But I hesitate to duplicate a function that is already in the package, since I presume the na.roughfix should be as quick as possible and it should also be well tailored to the requirements of random forest. Has anyone else seen that this is really slow? (I haven't noticed rfImpute to be nearly as slow, but I cannot say for sure: my predict data sets are MUCH larger than my model data sets, so cleaning the prediction data set simply takes much longer.) If so, any ideas how to speed this up? Thanks! 
Mike Telescopes and bathyscaphes and sonar probes of Scottish lakes, Tacoma Narrows bridge collapse explained with abstract phase-space maps, Some x-ray slides, a music score, Minard's Napoleanic war: The most exciting frontier is charting what's already here. -- xkcd -- Help protect Wikipedia. Donate now: http://wikimediafoundation.org/wiki/Support_Wikipedia/en __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
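The median/most-frequent fill that this thread is about can be sketched in a few lines of base R. This is a simplified stand-in for illustration, not the randomForest package's implementation; `rough_fill` and the toy data are invented names.

```r
## Simplified stand-in for na.roughfix: fill NA's with the column median
## for numeric columns and the most frequent level for factors.
rough_fill <- function(df) {
  as.data.frame(lapply(df, function(x) {
    miss <- is.na(x)
    if (!any(miss)) return(x)
    if (is.numeric(x)) {
      x[miss] <- median(x[!miss])                 # median fill
    } else if (is.factor(x)) {
      x[miss] <- names(which.max(table(x)))       # modal level fill
    }
    x
  }))
}

d <- data.frame(a = c(1, NA, 3), b = factor(c("x", "x", NA)))
rough_fill(d)   # NA in a -> 2, NA in b -> "x"
```

Working column-by-column like this (rather than through a formula interface) is exactly why the hand-rolled versions in the thread run so much faster on wide data sets.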
Re: [R] anyone know why package RandomForest na.roughfix is so slow??
You need to isolate the problem further, or give more detail about your data. This is what I get:

R> nr <- 2134
R> nc <- 14037
R> x <- matrix(runif(nr*nc), nr, nc)
R> n.na <- round(nr*nc/10)
R> x[sample(nr*nc, n.na)] <- NA
R> system.time(x.fixed <- na.roughfix(x))
   user  system elapsed
   8.44    0.39    8.85

R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB ram. Andy From: Mike Williamson [mailto:this.is@gmail.com] Sent: Thursday, July 01, 2010 12:48 PM To: Liaw, Andy Cc: r-help Subject: Re: [R] anyone know why package RandomForest na.roughfix is so slow?? Andy, You're right, I didn't supply any code, because my call was very simple and it was the call itself in question. However, here is the associated code I am using:

naFixTime <- system.time({
  if (fltrResponse) {
    ## TRUE: there are no NA's in the response... cleared via earlier steps
    message(paste(iAm, ": Missing values will now be imputed...\n", sep = ""))
    try(dataSet <- rfImpute(dataSet[, !is.element(names(dataSet), response)],
                            dataSet[, response]))
  } else {
    ## In this case, there is no response column in the data set
    message(paste(iAm, ": Missing values will now be filled in with median values or most frequent levels", sep = ""))
    try(dataSet <- na.roughfix(dataSet))
  }
})

As you can see, the na.roughfix call is made as simply as possible: I supply the entire dataSet (only parameters, no responses). I am not doing the prediction here (that is done later, and the prediction itself is not taking very long). Here are some calculation times that I experienced:

# rows   # cols   time to run na.roughfix
======   ======   =======================
 2046     2833    ~ 2 minutes
 2066     5626    ~ 6 minutes
 2134    14037    ~ 30 minutes

These numbers are on a Windows server using the 64-bit version of 'R'. Regards, Mike Telescopes and bathyscaphes and sonar probes of Scottish lakes, Tacoma Narrows bridge collapse explained with abstract phase-space maps, Some x-ray slides, a music score, Minard's Napoleanic war: The most exciting frontier is charting what's already here. 
-- xkcd -- Help protect Wikipedia. Donate now: http://wikimediafoundation.org/wiki/Support_Wikipedia/en On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy andy_l...@merck.com wrote: You have not shown any code on exactly how you use na.roughfix(), so I can only guess. If you are doing something like:

randomForest(y ~ ., mybigdata, na.action = na.roughfix, ...)

I would not be surprised that it's taking very long on large datasets. Most likely it's caused by the formula interface, not na.roughfix() itself. If that is your case, try doing the imputation beforehand and run randomForest() afterward; e.g.,

myroughfixed <- na.roughfix(mybigdata)
randomForest(myroughfixed[list.of.predictor.columns],
             myroughfixed[[myresponse]], ...)

HTH, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Mike Williamson Sent: Wednesday, June 30, 2010 7:53 PM To: r-help Subject: [R] anyone know why package RandomForest na.roughfix is so slow?? Hi all, I am using the package random forest for random forest predictions. I like the package. However, I have fairly large data sets, and it can often take *hours* just to go through the na.roughfix call, which simply goes through and cleans up any NA values to either the median (numerical data) or the most frequent occurrence (factors). I am going to start doing some comparisons between na.roughfix() and some apply() functions which, it seems, are able to do the same job more quickly. But I hesitate to duplicate a function that is already in the package, since I presume the na.roughfix should be as quick as possible and it should also be well tailored to the requirements of random forest. Has anyone else seen that this is really slow? (I haven't noticed rfImpute to be nearly as slow, but I cannot say for sure: my predict data sets are MUCH larger than my model data sets, so cleaning the prediction data set simply takes much longer.) If so, any ideas how to speed this up? Thanks
Re: [R] Linear Discriminant Analysis in R
cobbler_squad needs more basic help than doing lda. The data input just doesn't make sense. If vowel_feature is a data frame, then G <- vowel_feature[15] creates another data frame containing the 15th variable in vowel_feature, so G is the name of a data frame, not a variable in a data frame. The lda() call makes even less sense. I wonder if he has tried to go through the examples in the help file to understand how it is used? Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Joris Meys Sent: Friday, May 28, 2010 8:50 AM To: cobbler_squad Cc: r-help@r-project.org Subject: Re: [R] Linear Discriminant Analysis in R Could you provide us with data to test the code? Use dput (and limit the size!), e.g.:

dput(vowel_features)
dput(mask_features)

Without this information, it's impossible to say what's going wrong. It looks like you're doing something wrong in the selection. What should vowel_features[15] return? Did you check it's actually what you want? Did you use str(G) to check the type? Cheers Joris On Thu, May 27, 2010 at 5:28 PM, cobbler_squad la.f...@gmail.com wrote: Joris, You are a life saver. Based on the two sample files above, I think the lda call should go something like this:

vowel_features <- read.table(file = "mappings_for_vowels.txt")
mask_features <- data.frame(as.matrix(read.table(file = "3dmaskdump_ICA_37_Combined.txt")))
G <- vowel_features[15]
cvc_lda <- lda(G ~ vowel_features[15], data = mask_features,
               na.action = na.omit, CV = TRUE)

ERROR: Error in model.frame.default(formula = G ~ vowel_features[15], data = mask_features, : invalid type (list) for variable 'G'

I am clearly doing something wrong declaring G (how should I declare grouping in R when I need to use one column from the vowel_features file)? Sorry for stupid questions and thank you for being so helpful! 
- again, the sample files that I am working with: mappings_for_vowels.txt:

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26
1  E  0  0  0  0  0  0  0  0   0   0   0   0   1   1   0   0   0   1   0   0   0   0   0   0   0
2  o  0  0  0  0  0  0  0  0   0   0   0   0   1   0   0   1   0   1   0   1   0   1   0   0   0
3  I  0  0  0  0  0  0  0  0   0   0   0   0   1   1   0   0   1   0   0   0   0   0   0   0   0
4  ^  0  0  0  0  0  0  0  0   0   0   0   0   1   0   1   0   0   1   0   0   0   0   0   0   0
5  @  0  0  0  0  0  0  0  0   0   0   0   0   1   0   0   1   0   0   1   0   0   0   0   0   0

and the mask_features file is:

             V42          V43          V44           V45           V46          V47          V48          V49
[1,]  2.890891625  2.881188521  2.88778      -2.882606612  -2.77341      2.879834384  2.886483229  2.883815864
[2,]  2.763404707  2.756198683  2.761863881  -2.756827983  -2.762268531  2.754305072  2.760017050  2.758399799
[3,]  0.556614506  0.556377530  0.556247414  -0.556300910  -0.556098321  0.557495060  0.557383073  0.556867424
[4,]  0.367065248  0.366962036  0.366870087  -0.366794442  -0.366644148  0.366613343  0.366537320  0.366953464
[5,]  0.423692393  0.421835623  0.421741829  -0.421897460  -0.421659824  0.421567705  0.421465738  0.422407838

-- View this message in context: http://r.789695.n4.nabble.com/Linear-Discriminant-Analysis-in-R-tp2231922p223.html Sent from the R help mailing list archive at Nabble.com. -- Joris Meys Statistical Consultant Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control Coupure Links 653 B-9000 Gent tel : +32 9 264 59 87 joris.m...@ugent.be --- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
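Andy's indexing point above can be made concrete with toy data (the data frame `d` is invented): single brackets on a data frame return a one-column data frame, while double brackets (or `$`) return the underlying vector, which is what lda()'s grouping argument needs.

```r
## Toy illustration: [15] keeps the data-frame wrapper, [[15]] extracts
## the column as a plain vector.
d <- data.frame(a = 1:5, b = letters[1:5], stringsAsFactors = FALSE)
class(d[1])    # "data.frame"
class(d[[1]])  # "integer"
## So the grouping variable should come from something like
## vowel_features[[15]], not vowel_features[15].
```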