Re: [R] randomForest speed improvements
Note that that isn't exactly what I recommended. If you look at the example in the help page for combine(), you'll see that it is combining RF objects trained on the same data; i.e., instead of having one RF with 500 trees, you can combine five RFs trained on the same data with 100 trees each into one 500-tree RF.

The way you are using combine() is basically using sample size to limit tree size, which you can do by playing with the nodesize argument in randomForest() as I suggested previously. Either way is fine as long as you don't see prediction performance degrading.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of apresley
Sent: Tuesday, January 04, 2011 6:30 PM
To: r-help@r-project.org
Subject: Re: [R] randomForest speed improvements

Andy,

Thanks for the reply. I had no idea I could combine them back ... that actually will work pretty well. We can have several worker threads load up the RFs on different machines and/or cores, and then re-assemble them. Rmpi might be an option down the road, but would be a bit of overhead for us now.

Using the method of combine() ... I was able to drastically reduce the amount of time to build randomForest objects. I.e., using about 25,000 rows (6 columns), it takes maybe 5 minutes on my laptop. Using 5 randomForest objects (each with 5k rows), and then combining them, takes 1 minute.

--
Anthony

--
View this message in context: http://r.789695.n4.nabble.com/randomForest-speed-improvements-tp3172523p3174621.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
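The use of combine() that Andy describes, several forests grown on the same data and merged into one, might look like the sketch below. It is not code from the thread: the synthetic data and the names x, y, forests and rf_all are assumptions made for illustration, standing in for the poster's six numeric predictors and numeric target.

```r
library(randomForest)

set.seed(1)
# Synthetic stand-in (assumption): 6 numeric predictors, numeric target.
n <- 1000
x <- as.data.frame(matrix(runif(n * 6), ncol = 6))  # columns V1..V6
y <- 10 * x$V1 + 5 * x$V2^2 + rnorm(n)

# Five forests of 100 trees each, all grown on the same x and y.
forests <- lapply(1:5, function(i) randomForest(x, y, ntree = 100))

# combine() merges them into a single forest holding all 500 trees.
rf_all <- do.call(combine, forests)
rf_all$ntree

# The combined object predicts like any other forest.
p <- predict(rf_all, x)
```

Because every piece sees the full data, the individual trees are as large as in a single 500-tree run; the benefit is only that the pieces can be grown separately.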
Re: [R] randomForest speed improvements
From: Liaw, Andy
> Note that that isn't exactly what I recommended. If you look at the example in the help page for combine(), you'll see that it is combining RF objects trained on the same data; i.e., instead of having one RF with 500 trees, you can combine five RFs trained on the same data with 100 trees each into one 500-tree RF.
>
> The way you are using combine() is basically using sample size to limit tree size, which you can do by playing with the nodesize argument in randomForest() as I suggested previously. Either way is fine as long as you don't see prediction performance degrading.

I should also mention that another way you can do something similar is by making use of the sampsize argument in randomForest(). For example, if you call randomForest() with sampsize=500, it will randomly draw 500 data points to grow each tree. This way you don't even need to run the RFs separately and combine them.

Andy

> Andy
>
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of apresley
> Sent: Tuesday, January 04, 2011 6:30 PM
> To: r-help@r-project.org
> Subject: Re: [R] randomForest speed improvements
>
> Andy,
>
> Thanks for the reply. I had no idea I could combine them back ... that actually will work pretty well. We can have several worker threads load up the RFs on different machines and/or cores, and then re-assemble them. Rmpi might be an option down the road, but would be a bit of overhead for us now.
>
> Using the method of combine() ... I was able to drastically reduce the amount of time to build randomForest objects. I.e., using about 25,000 rows (6 columns), it takes maybe 5 minutes on my laptop. Using 5 randomForest objects (each with 5k rows), and then combining them, takes 1 minute.
>
> --
> Anthony
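The sampsize and nodesize arguments Andy mentions can be tried side by side. The following is a rough sketch on synthetic data, not a benchmark from the thread; the data, the values 500 and 50, and all variable names are assumptions chosen for illustration.

```r
library(randomForest)

set.seed(2)
# Synthetic stand-in (assumption): 6 numeric predictors, numeric target.
n <- 3000
x <- as.data.frame(matrix(runif(n * 6), ncol = 6))
y <- 10 * x$V1 + 5 * x$V2^2 + rnorm(n)

# Default settings: each tree is grown on a bootstrap sample of all n rows.
t_full <- system.time(rf_full <- randomForest(x, y, ntree = 50))

# sampsize = 500: each tree is grown on 500 randomly drawn rows.
t_samp <- system.time(rf_samp <- randomForest(x, y, ntree = 50, sampsize = 500))

# nodesize = 50: larger terminal nodes, hence smaller trees.
t_node <- system.time(rf_node <- randomForest(x, y, ntree = 50, nodesize = 50))

# Elapsed seconds and final out-of-bag MSE for each variant.
results <- data.frame(
  variant = c("default", "sampsize=500", "nodesize=50"),
  seconds = c(t_full[["elapsed"]], t_samp[["elapsed"]], t_node[["elapsed"]]),
  oob_mse = c(tail(rf_full$mse, 1), tail(rf_samp$mse, 1), tail(rf_node$mse, 1))
)
results
```

Comparing the oob_mse column against the seconds column is one way to check Andy's caveat that the speed-up is only worthwhile if prediction performance does not degrade.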
Re: [R] randomForest speed improvements
If you have multiple cores, one poor man's solution is to run separate forests in different R sessions, save the RF objects, load them into the same session and combine() them. You can do this less clumsily if you use things like Rmpi or other distributed computing packages.

Another consideration is to increase nodesize (which reduces the sizes of trees). The problem with numeric predictors for tree-based algorithms is that the number of computations to find the best splitting point increases by that much (about tenfold, going from 5,000 to 50,000 rows) _at each node_. Some algorithms try to save on this by using only certain quantiles. The current RF code doesn't do this.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of apresley
Sent: Monday, January 03, 2011 6:28 PM
To: r-help@r-project.org
Subject: Re: [R] randomForest speed improvements

I haven't tried changing the mtry or ntree at all ... though I suppose with only 6 variables, and tens of thousands of rows, we can probably do fewer than 500 trees (the default?).

Although tossing the forest does speed things up a bit (it seems to be about 15 - 20% faster in some cases), I need to keep the forest to do the prediction; otherwise, it complains that there is no forest component in the object.

--
Anthony
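A less manual variant of the "separate sessions, then combine()" idea is sketched below. It swaps in mclapply() from R's parallel package, which is not what the thread used (Andy mentions separate R sessions and Rmpi), and it relies on forking, so it assumes a Unix-like system such as the poster's CentOS machine; on Windows mc.cores would have to be 1. The data and the names n_workers, forests and rf_par are illustrative assumptions.

```r
library(randomForest)
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # RNG type suited to parallel random streams
set.seed(3)

# Synthetic stand-in (assumption): 6 numeric predictors, numeric target.
n <- 2000
x <- as.data.frame(matrix(runif(n * 6), ncol = 6))
y <- 10 * x$V1 + 5 * x$V2^2 + rnorm(n)

n_workers <- 2  # assumption: set to the number of cores to use

# Grow four 125-tree forests on the same data in forked worker processes.
forests <- mclapply(1:4,
                    function(i) randomForest(x, y, ntree = 125),
                    mc.cores = n_workers)

# Merge the pieces into one 500-tree forest in the parent session.
rf_par <- do.call(combine, forests)
rf_par$ntree
```

Each worker grows complete trees on the full data, so this addresses wall-clock time only; it does not reduce the per-node split search cost Andy describes.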
Re: [R] randomForest speed improvements
Andy,

Thanks for the reply. I had no idea I could combine them back ... that actually will work pretty well. We can have several worker threads load up the RFs on different machines and/or cores, and then re-assemble them. Rmpi might be an option down the road, but would be a bit of overhead for us now.

Using the method of combine() ... I was able to drastically reduce the amount of time to build randomForest objects. I.e., using about 25,000 rows (6 columns), it takes maybe 5 minutes on my laptop. Using 5 randomForest objects (each with 5k rows), and then combining them, takes 1 minute.

--
Anthony

--
View this message in context: http://r.789695.n4.nabble.com/randomForest-speed-improvements-tp3172523p3174621.html
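The approach described in this message, growing each forest on a different slice of the rows and then combining, can be compared against a single forest on held-out data. This is a hedged sketch on synthetic data; the chunk size, tree counts and all names are assumptions, and the chunks are kept equal in size on purpose.

```r
library(randomForest)

set.seed(4)
# Synthetic stand-in (assumption): 6 numeric predictors, numeric target.
n <- 6000
x <- as.data.frame(matrix(runif(n * 6), ncol = 6))
y <- 10 * x$V1 + 5 * x$V2^2 + rnorm(n)
train <- 1:5000
test  <- 5001:6000

# One forest grown on all 5000 training rows.
rf_whole <- randomForest(x[train, ], y[train], ntree = 100)

# Five forests, each grown on its own 1000-row chunk, then combined.
chunks <- split(train, rep(1:5, each = 1000))
rf_chunks <- do.call(combine, unname(lapply(chunks, function(idx) {
  randomForest(x[idx, ], y[idx], ntree = 20)
})))

# Compare mean squared error on rows neither model has seen.
mse_whole  <- mean((predict(rf_whole,  x[test, ]) - y[test])^2)
mse_chunks <- mean((predict(rf_chunks, x[test, ]) - y[test])^2)
c(whole = mse_whole, chunks = mse_chunks)
```

Judging the combined forest with predict() on separate test rows, rather than with components stored in the combined object, seems the safer check here, since each piece was fitted to different rows.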
[R] randomForest speed improvements
Hi there,

We're trying to use randomForest to do some predictions. The test harness for our code is pretty straightforward:

    library('randomForest');
    data202 <- read.csv('random.csv', header=TRUE);
    # Row ranges are reconstructed (assumption): the archive shows "1:5" and
    # "50001:6", apparently with runs of zeros stripped.
    x <- data202[1:50000, 1:6];        # training predictors
    y <- data202[1:50000, 8];          # training target (column 8)
    y <- y[, drop=TRUE];
    x2 <- data202[50001:60000, 1:6];   # held-out predictors
    y2 <- data202[50001:60000, 8];     # held-out target
    y2 <- y2[, drop=TRUE];
    RFobject <- randomForest(x, y, na.action=na.roughfix);
    p <- predict(RFobject, x2);

In this case, the CSV contains 10 columns, of which 1-6 are numeric in nature (day of week, week of month, etc.) and column 8 is the target (sales, a numeric value).

randomForest does fine with the data; our issue is how long it takes. In this case, about 5,000 rows of data seems to take just a few seconds, but going to 50,000 rows doesn't take 5x the time, it takes perhaps 30 or 40 minutes.

We've downloaded and tried RT-Rank, which is a multi-threaded version of RandomForest, and this seems to produce the same (or slightly better) predictions, but also gets done fairly quickly.

What can we do to improve the speed of this data computation? The system we're on is a dual quad-core Intel CPU @ 2.33 GHz with 16GB of RAM ... we're using the stock R RPM for CentOS 5.5.

Thanks!

--
Anthony

--
View this message in context: http://r.789695.n4.nabble.com/randomForest-speed-improvements-tp3172523p3172523.html
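The scaling behaviour described in this post can be measured directly before changing any settings. The sketch below times randomForest() on growing subsets of a synthetic data set; the data, the subset sizes and the names sizes, elapsed and timing are assumptions for illustration, not the poster's file or figures.

```r
library(randomForest)

set.seed(5)
# Synthetic stand-in (assumption): 6 numeric predictors, numeric target.
n <- 4000
x <- as.data.frame(matrix(runif(n * 6), ncol = 6))
y <- 10 * x$V1 + 5 * x$V2^2 + rnorm(n)

# Time a 50-tree forest on the first 1000, 2000 and 4000 rows.
sizes <- c(1000, 2000, 4000)
elapsed <- sapply(sizes, function(m) {
  idx <- seq_len(m)
  system.time(randomForest(x[idx, ], y[idx], ntree = 50))[["elapsed"]]
})

timing <- data.frame(rows = sizes, seconds = elapsed)
timing
```

If the seconds column grows faster than the rows column, training time is scaling worse than linearly on that machine, which is the pattern reported above.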
Re: [R] randomForest speed improvements
Have you tried adjusting:

mtry - the number of variables tried at each split
ntree - the number of trees grown
keep.forest - logical on whether to keep the forest in the returned object

Specifically, I found huge improvements in speed by switching keep.forest to FALSE in the past when I didn't actually need the forest post analysis.

--
Jonathan P. Daily
Technician - USGS Leetown Science Center
11649 Leetown Road
Kearneysville WV, 25430
(304) 724-4480

"Is the room still a room when it's empty? Does the room, the thing itself have purpose? Or do we, what's the word... imbue it." - Jubal Early, Firefly

r-help-boun...@r-project.org wrote on 01/03/2011 02:59:29 PM:

> [R] randomForest speed improvements
> apresley to: r-help
> 01/03/2011 03:03 PM
> Sent by: r-help-boun...@r-project.org
>
> Hi there,
>
> We're trying to use randomForest to do some predictions. The test harness for our code is pretty straightforward:
>
>     library('randomForest');
>     data202 <- read.csv('random.csv', header=TRUE);
>     # Row ranges are reconstructed (assumption): the archive shows "1:5"
>     # and "50001:6", apparently with runs of zeros stripped.
>     x <- data202[1:50000, 1:6];
>     y <- data202[1:50000, 8];
>     y <- y[, drop=TRUE];
>     x2 <- data202[50001:60000, 1:6];
>     y2 <- data202[50001:60000, 8];
>     y2 <- y2[, drop=TRUE];
>     RFobject <- randomForest(x, y, na.action=na.roughfix);
>     p <- predict(RFobject, x2);
>
> In this case, the CSV contains 10 columns, of which 1-6 are numeric in nature (day of week, week of month, etc.) and column 8 is the target (sales, a numeric value).
>
> randomForest does fine with the data; our issue is how long it takes. In this case, about 5,000 rows of data seems to take just a few seconds, but going to 50,000 rows doesn't take 5x the time, it takes perhaps 30 or 40 minutes.
>
> We've downloaded and tried RT-Rank, which is a multi-threaded version of RandomForest, and this seems to produce the same (or slightly better) predictions, but also gets done fairly quickly.
>
> What can we do to improve the speed of this data computation? The system we're on is a dual quad-core Intel CPU @ 2.33 GHz with 16GB of RAM ... we're using the stock R RPM for CentOS 5.5.
>
> Thanks!
>
> --
> Anthony
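One way to decide how far ntree and mtry can be changed is to look at the out-of-bag error the forest already records. The sketch below does this on synthetic data; the data, the checkpoints 50/100/200/500, the mtry values 1/2/4 and all names are assumptions made for illustration.

```r
library(randomForest)

set.seed(6)
# Synthetic stand-in (assumption): 6 numeric predictors, numeric target.
n <- 1500
x <- as.data.frame(matrix(runif(n * 6), ncol = 6))
y <- 10 * x$V1 + 5 * x$V2^2 + rnorm(n)

# For regression, rf$mse holds the OOB mean squared error after each tree,
# so one 500-tree run shows how much the later trees still help.
rf <- randomForest(x, y, ntree = 500)
oob_curve <- rf$mse[c(50, 100, 200, 500)]
names(oob_curve) <- c("50", "100", "200", "500")
oob_curve

# Compare a few mtry values; the regression default with 6 predictors
# is max(floor(6 / 3), 1) = 2.
mtry_values <- c(1, 2, 4)
oob_by_mtry <- sapply(mtry_values, function(m) {
  tail(randomForest(x, y, ntree = 100, mtry = m)$mse, 1)
})
names(oob_by_mtry) <- mtry_values
oob_by_mtry
```

If the OOB error has flattened well before 500 trees, a smaller ntree cuts training time roughly in proportion with little loss in accuracy.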
Re: [R] randomForest speed improvements
I haven't tried changing the mtry or ntree at all ... though I suppose with only 6 variables, and tens of thousands of rows, we can probably do fewer than 500 trees (the default?).

Although tossing the forest does speed things up a bit (it seems to be about 15 - 20% faster in some cases), I need to keep the forest to do the prediction; otherwise, it complains that there is no forest component in the object.

--
Anthony

--
View this message in context: http://r.789695.n4.nabble.com/randomForest-speed-improvements-tp3172523p3172834.html
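When the rows to be predicted are already known at training time, randomForest() can produce the predictions during the fit through its xtest and ytest arguments, so the forest itself need not be kept. This is a minimal sketch on synthetic data, offered as one possible way around the "no forest component" complaint; the data split and the names rf_nf, pred and test_mse are assumptions.

```r
library(randomForest)

set.seed(7)
# Synthetic stand-in (assumption): 6 numeric predictors, numeric target.
n <- 2000
x <- as.data.frame(matrix(runif(n * 6), ncol = 6))
y <- 10 * x$V1 + 5 * x$V2^2 + rnorm(n)
train <- 1:1500
test  <- 1501:2000

# Supply the test rows up front and discard the forest after fitting.
rf_nf <- randomForest(x[train, ], y[train],
                      xtest = x[test, ], ytest = y[test],
                      ntree = 100, keep.forest = FALSE)

pred     <- rf_nf$test$predicted     # predictions for the test rows
test_mse <- tail(rf_nf$test$mse, 1)  # test-set MSE after the last tree
is.null(rf_nf$forest)                # the forest was not stored
```

The trade-off is that predict() cannot be called on this object later, so this only fits workflows where new data does not arrive after training.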