Re: [R] repeating an analysis
Hi All,

For those still interested, the code submitted by Phil (see below) worked a
treat and produced a vector with the optimal 'nsplit' collated from 50 runs
of the rpart model. I then produced a histogram of that vector (called
answer) and chose the modal number of splits, which gave me my appropriate
tree size.

Many thanks to all who answered.

Andy

On Wed, Oct 13, 2010 at 10:30 AM, Phil Spector wrote:

> Andrew -
>    I think
>
> answer = replicate(50,{fit1 <- rpart(CHAB~.,data=chabun, method="anova",
>                        control=rpart.control(minsplit=10, cp=0.01, xval=10));
>                        x = printcp(fit1);
>                        x[which.min(x[,'xerror']),'nsplit']})
>
> will put the numbers you want into answer, but there was no reproducible
> example to test it on. Unfortunately, I don't know of any way to suppress
> the printing from printcp().
>
>                                        - Phil Spector
>                                          Statistical Computing Facility
>                                          Department of Statistics
>                                          UC Berkeley
>                                          spec...@stat.berkeley.edu
>
> On Wed, 13 Oct 2010, Andrew Halford wrote:
>
>> Hi All,
>>
>> I have to say upfront that I am a complete neophyte when it comes to
>> programming. Nevertheless I enjoy the challenge of using R because of its
>> incredible statistical resources.
>>
>> My problem is this. I am running a regression tree analysis using
>> "rpart" and I need to run the calculation repeatedly (say n=50 times) to
>> obtain a distribution of results from which I will pick the median one to
>> represent the most parsimonious tree size. Unfortunately rpart does not
>> contain this ability so it will have to be coded for.
>>
>> Could anyone help me with this? I have provided the code (and relevant
>> output) for the analysis I am running. I need to run it n=50 times and
>> from each output pick the appropriate tree size and post it to a datafile
>> where I can then look at the frequency distribution of tree sizes.
>>
>> Here is the code and output from a single run
>>
>> > fit1 <- rpart(CHAB~.,data=chabun, method="anova",
>>     control=rpart.control(minsplit=10, cp=0.01, xval=10))
>> > printcp(fit1)
>>
>> Regression tree:
>> rpart(formula = CHAB ~ ., data = chabun, method = "anova", control =
>> rpart.control(minsplit = 10,
>>     cp = 0.01, xval = 10))
>>
>> Variables actually used in tree construction:
>> [1] EXP LAT POC RUG
>>
>> Root node error: 35904/33 = 1088
>>
>> n= 33
>>
>>         CP nsplit rel error xerror    xstd
>> 1 0.539806      0   1.0     1.0337 0.41238
>> 2 0.050516      1   0.46019 1.2149 0.38787
>> 3 0.016788      2   0.40968 1.2719 0.41280
>> 4 0.010221      3   0.39289 1.1852 0.38300
>> 5 0.01          4   0.38267 1.1740 0.38333
>>
>> Each time I re-run the model I will get a slightly different output. I
>> want to extract the nsplit number corresponding to the lowest xerror for
>> each run of the model (in this case it is for nsplit = 0) over 50 runs and
>> then look at the distribution of nsplits after 50 runs.
>>
>> Any help appreciated.
>>
>> Andy
>>
>> --
>> Andrew Halford
>> Associate Researcher
>> Marine Laboratory
>> University of Guam
>> Ph: +1 671 734 2948

--
Andrew Halford Ph.D
Associate Research Scientist
Marine Laboratory
University of Guam
Ph: +1 671 734 2948

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
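The last step Andy describes (looking at the distribution of the 50 values and taking the most frequent one) can be sketched roughly as follows. This is only an illustration: the built-in mtcars data and the formula mpg ~ . are stand-ins for chabun and CHAB ~ ., which were never posted, set.seed() is there just so the sketch is repeatable, and the names counts and modal_nsplit are made up for the example.

```r
library(rpart)

set.seed(1)  # assumption: only here so the sketch gives repeatable output

# Stand-in data: mtcars replaces the poster's 'chabun', which is not available
answer <- replicate(50, {
  fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
               control = rpart.control(minsplit = 10, cp = 0.01, xval = 10))
  tab <- fit$cptable
  tab[which.min(tab[, "xerror"]), "nsplit"]  # nsplit at the lowest xerror
})

# Frequency of each nsplit value across the 50 runs
counts <- table(answer)
print(counts)

# Most frequent value; ties go to the smaller nsplit because table() sorts
modal_nsplit <- as.integer(names(counts)[which.max(counts)])
modal_nsplit
```

Calling barplot(counts) or hist(answer) afterwards gives the picture of the distribution; since nsplit only takes a few integer values, the table is often enough on its own.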
Re: [R] repeating an analysis
Thanks Phil,

I have your solution and another, which I will attempt in the next day or
so; I will post the results to the list then.

Cheers,
Andy

--
Andrew Halford Ph.D
Associate Research Scientist
Marine Laboratory
University of Guam
Ph: +1 671 734 2948
Re: [R] repeating an analysis
I think you want something like this:

optimal.nSplit = rep(NA, 50)  # This will hold the result
for (run in 1:50)
{
  fit1 = rpart(...)
  cpTable = fit1$cptable  # the CP table is stored on the fitted object
  bestRow = which.min(cpTable[, "xerror"])  # row with the lowest xerror
  optimal.nSplit[run] = cpTable[bestRow, "nsplit"]
}

In any case, look at

?rpart
?printcp
?rpart.object

Peter

On Tue, Oct 12, 2010 at 4:50 PM, Andrew Halford wrote:
> I want to extract the nsplit number corresponding to the lowest xerror
> for each run of the model (in this case it is for nsplit = 0) over 50
> runs and then look at the distribution of nsplits after 50 runs.
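For anyone who wants to run the loop as it stands, here is a minimal self-contained version. It is a sketch under assumptions: mtcars and mpg ~ . stand in for the poster's data and formula (the rpart(...) call above was left elided on purpose), the control settings are copied from the original question, and set.seed() is only there to make the sketch repeatable.

```r
library(rpart)

set.seed(42)                        # assumption: only for a repeatable sketch
n_runs <- 50
optimal_nsplit <- numeric(n_runs)   # preallocate the result vector

for (run in seq_len(n_runs)) {
  # mtcars is a stand-in data set; substitute your own formula and data
  fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
               control = rpart.control(minsplit = 10, cp = 0.01, xval = 10))
  cp_table <- fit$cptable                       # no printing involved
  best_row <- which.min(cp_table[, "xerror"])   # lowest cross-validated error
  optimal_nsplit[run] <- cp_table[best_row, "nsplit"]
}

table(optimal_nsplit)
```

The runs differ from one another because rpart assigns observations to the xval cross-validation groups at random on each call, so xerror changes from fit to fit while the tree itself stays the same.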
Re: [R] repeating an analysis
Andrew -
   I think

answer = replicate(50,{fit1 <- rpart(CHAB~.,data=chabun, method="anova",
                       control=rpart.control(minsplit=10, cp=0.01, xval=10));
                       x = printcp(fit1);
                       x[which.min(x[,'xerror']),'nsplit']})

will put the numbers you want into answer, but there was no reproducible
example to test it on. Unfortunately, I don't know of any way to suppress
the printing from printcp().

                                       - Phil Spector
                                         Statistical Computing Facility
                                         Department of Statistics
                                         UC Berkeley
                                         spec...@stat.berkeley.edu

On Wed, 13 Oct 2010, Andrew Halford wrote:
> I want to extract the nsplit number corresponding to the lowest xerror
> for each run of the model (in this case it is for nsplit = 0) over 50
> runs and then look at the distribution of nsplits after 50 runs.
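On the printing from printcp(): two ways around it that seem to work are sketched below. The first reads the table from the cptable component of the fitted object, as in Peter's reply elsewhere in this thread. The second keeps printcp() and wraps it in capture.output(), which relies on printcp() returning the table invisibly (the replicate() code above depends on the same behaviour). As before, mtcars and mpg ~ . are stand-ins for the unposted data, and the variable names are made up for the example.

```r
library(rpart)

set.seed(7)  # assumption: only here so the sketch is repeatable

# mtcars stands in for the poster's 'chabun' data, which was not posted
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10, cp = 0.01, xval = 10))

# Option 1: read the table straight off the fitted object (prints nothing)
tab_direct <- fit$cptable

# Option 2: keep printcp() but swallow what it writes to the console;
# the assignment inside capture.output() still happens in this environment
printed <- capture.output(tab_quiet <- printcp(fit))

# Either table can then be indexed exactly as in the replicate() call
best_nsplit <- tab_direct[which.min(tab_direct[, "xerror"]), "nsplit"]
best_nsplit
```

Option 1 is probably the simpler of the two inside replicate(), since it avoids formatting 50 tables that are then thrown away.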