[R] rpart weight prior
Hi! Could you please explain the difference between 'prior' and 'weights' in rpart? They seem to be the same, and if so, why include a weights option in the latest versions? For unbalanced sampling, which is best to use: weights, prior, or both together? Thanks a lot.

Aurélie Davranche

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] rpart weight prior
On Sun, 8 Jul 2007, Aurélie Davranche wrote:

> Could you please explain the difference between prior and weight in
> rpart? It seems to be the same. But in this case why include a weight
> option in the latest versions? For an unbalanced sampling what is the
> best to use: weight, prior or both together?

The 'weights' argument has been there for a decade, and is not the same as the 'prior' parameter. The help file (which you seem unfamiliar with) says:

  weights: optional case weights.

  parms: optional parameters for the splitting function. Anova splitting
     has no parameters. Poisson splitting has a single parameter, the
     coefficient of variation of the prior distribution on the rates.
     The default value is 1. Exponential splitting has the same
     parameter as Poisson. For classification splitting, the list can
     contain any of: the vector of prior probabilities (component
     'prior'), the loss matrix (component 'loss') or the splitting
     index (component 'split'). The priors must be positive and sum to
     1. The loss matrix must have zeros on the diagonal and positive
     off-diagonal elements. The splitting index can be 'gini' or
     'information'. The default priors are proportional to the data
     counts, the losses default to 1, and the split defaults to 'gini'.

The rpart technical report at
http://mayoresearch.mayo.edu/mayo/research/biostat/upload/61.pdf
may help you understand this.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,            Tel: +44 1865 272861 (self)
1 South Parks Road,                   +44 1865 272866 (PA)
Oxford OX1 3TG, UK               Fax: +44 1865 272595
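[A minimal sketch contrasting the two arguments, using the kyphosis data shipped with rpart; the weight value of 4 below is an arbitrary illustration, not a recommendation:]

```r
library(rpart)

## Case weights: each observation counts w_i times when splits are
## evaluated -- up-weighting the rarer class here.
w <- ifelse(kyphosis$Kyphosis == "present", 4, 1)
fit_w <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class", weights = w)

## Priors: rescale the class probabilities directly, without
## pretending to have duplicated cases.
fit_p <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class", parms = list(prior = c(0.5, 0.5)))
```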
[R] rpart-question regarding relation between cp and rel error
Dear useRs,

I may be temporarily (I hope :-)) confused, and I hope that someone can answer the question that is bugging me at the moment. In the CP table of rpart, I thought the following relation should hold:

  rel error = rel error(before) - (nsplit - nsplit(before)) * CP(before),

where (before) always denotes the entry in the row above. While this relation holds for many rows of the CP tables I have looked at, it does not hold for all. For example, in the table below, 0.67182 != 0.68405 - (47 - 38) * 0.0010616, with a difference of 0.002676, which appears larger than mere numerical inaccuracy.

        CP     nsplit rel error  xerror     xstd
  1  0.1820909     0   1.00000  1.00000  0.012890
  2  0.0526194     1   0.81791  0.81768  0.012062
  3  0.0070390     2   0.76529  0.76529  0.011780
  4  0.0043850     4   0.75121  0.77660  0.011842
  5  0.0036157     5   0.74683  0.77106  0.011812
  6  0.0032310     8   0.73598  0.77083  0.011810
  7  0.0026541     9   0.73275  0.77083  0.011810
  8  0.0025387    14   0.71936  0.76829  0.011796
  9  0.0016155    16   0.71429  0.76644  0.011786
  10 0.0013847    20   0.70759  0.76206  0.011761
  11 0.0011539    28   0.69605  0.76621  0.011785
  12 0.0010616    38   0.68405  0.76875  0.011799
  13 0.0010001    47   0.67182  0.76991  0.011805
  14 0.0010000    57   0.66144  0.77060  0.011809

Can someone explain why/when this happens?

Regards,
Ulrike
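[The claimed recurrence can be checked mechanically against any fitted tree's cptable. A sketch: the helper name check_cp and the example data set are chosen for illustration only; car.test.frame ships with rpart.]

```r
library(rpart)

## Compare the observed drop in relative error between consecutive
## cptable rows with the drop the recurrence predicts, namely
## (extra splits) * CP of the row above.
check_cp <- function(fit) {
  tab <- fit$cptable
  observed  <- -diff(tab[, "rel error"])
  predicted <- diff(tab[, "nsplit"]) * head(tab[, "CP"], -1)
  cbind(observed, predicted)
}

fit <- rpart(Mileage ~ ., data = car.test.frame)
check_cp(fit)
```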
[R] rpart minimum sample size
Look at rpart.control. rpart has two advisory parameters that control the tree size at the smallest nodes:

  minsplit (default 20): a node with fewer than this many subjects will
     not be worth splitting.
  minbucket (default 7): don't create any terminal nodes with fewer
     than 7 observations.

As I said, these are advisory, and reflect the fact that these final splits are usually not worthwhile. They lead to a slightly faster run time, but mostly to a less complex plotted model.

I am not nearly as pessimistic as Frank Harrell (need 20,000 observations). rpart often gives a good model -- one that predicts the outcome -- and I find the intermediate steps that it takes informative. However, there are often many trees with similar predictive ability but a very different look in terms of split points and variables. Saying that any given rpart model is THE best is perilous.

Terry T.
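[Both parameters are set through rpart.control(). A minimal sketch; the data set and the loosened values are arbitrary, chosen only to show the mechanics:]

```r
library(rpart)

## Allow smaller nodes to be split than the advisory defaults permit.
## minsplit = 10 and minbucket = 3 are illustrative values only.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(minsplit = 10, minbucket = 3))
```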
[R] rpart minimum sample size
Is there an optimal / minimum sample size for attempting to construct a classification tree using rpart?

I have 27 seagrass disturbance sites (boat groundings) that have been monitored for a number of years. The monitoring protocol for each site is identical. From the monitoring data, I am able to determine the level of recovery that each site has experienced. Recovery is our categorical dependent variable, with values of none, low, medium and high, based upon percent seagrass regrowth into the injury over time. I wish to predict the level of recovery of future vessel grounding sites from a number of categorical / continuous predictor variables, including (but not limited to) such parameters as sediment grain size, wave exposure, original size (volume) of the injury, injury age, and injury location.

When I run rpart, the data are split into only two terminal nodes, based solely upon values of the original volume of each injury. No other predictor variables are considered, even though I have included about six of them in the model. When I remove volume from the model the same thing happens, but with injury area: two terminal nodes are formed based upon area values and no other variables appear. I was hoping that this was a programming issue, me being a newbie and all, but I really think I've got the code right. Now I am beginning to wonder if my N is too small for this method?

--
Amy V. Uhrin, Research Ecologist
NOAA, National Ocean Service
Center for Coastal Fisheries and Habitat Research
101 Pivers Island Road
Beaufort, NC 28516
(252) 728-8778
(252) 728-8784 (fax)
[EMAIL PROTECTED]
Re: [R] rpart minimum sample size
Amy, without looking at your actual code, I would suggest you take a look at rpart.control().

On 2/27/07, Amy Uhrin [EMAIL PROTECTED] wrote:

> Is there an optimal / minimum sample size for attempting to construct
> a classification tree using rpart? [...]
> Now I am beginning to wonder if my N is too small for this method?

--
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)
Re: [R] rpart minimum sample size
Amy Uhrin wrote:

> Is there an optimal / minimum sample size for attempting to construct
> a classification tree using rpart? [...]
> Now I am beginning to wonder if my N is too small for this method?

In my experience N needs to be around 20,000 to get both good accuracy and replicability of patterns if the number of potential predictors is not tiny. In general, the R^2 from rpart is not competitive with that from an intelligently fitted regression model. It's just a difficult problem when relying on a single tree (hence the popularity of random forests, bagging and boosting).

Frank

--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University
[R] rpart with overdispersed count data?
I would like to do recursive partitioning when the response is a count variable subject to overdispersion, using, say, the negative binomial likelihood or something like quasipoisson in glm. I would appreciate any thoughts on how to go about this (theory/computation). If I understand the rpart documentation correctly, I would need to write a method argument, but the details are not there. Therefore, a second question is whether/where one can get material on developing new rpart implementations.

regards,
Farrar
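[rpart does accept a user-written method as a list of init/eval/split functions; this interface is described in the package's document on user-written split functions. A skeletal sketch follows, with a plain Poisson deviance standing in for the negative-binomial deviance the poster wants; all function names are made up, and the split search is a naive O(n^2) illustration rather than production code:]

```r
library(rpart)

## init: describe the response; numresp/numy say one fitted value per node.
nb_init <- function(y, offset, parms, wt) {
  list(y = y, parms = parms, numresp = 1, numy = 1,
       summary = function(yval, dev, wt, ylevel, digits)
         paste("mean =", format(signif(yval, digits))))
}

## eval: node label (mean count) and deviance.  The Poisson deviance
## here is a placeholder for a negative-binomial version.
nb_eval <- function(y, wt, parms) {
  mu  <- weighted.mean(y, wt)
  dev <- 2 * sum(wt * (ifelse(y > 0, y * log(y / mu), 0) - (y - mu)))
  list(label = mu, deviance = dev)
}

## split: goodness of each cut point = deviance reduction.  Continuous
## predictors only; x arrives sorted.  Written as a loop for clarity.
nb_split <- function(y, wt, x, parms, continuous) {
  if (!continuous) stop("this sketch handles continuous predictors only")
  n <- length(y)
  total <- nb_eval(y, wt, parms)$deviance
  goodness <- double(n - 1)
  for (i in 1:(n - 1)) {
    left  <- nb_eval(y[1:i], wt[1:i], parms)$deviance
    right <- nb_eval(y[(i + 1):n], wt[(i + 1):n], parms)$deviance
    goodness[i] <- total - (left + right)
  }
  list(goodness = goodness, direction = rep(-1, n - 1))
}

## usage (data set and response names hypothetical):
## fit <- rpart(counts ~ ., data = mydata,
##              method = list(init = nb_init, eval = nb_eval,
##                            split = nb_split))
```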
[R] rpart tree node label
I generated a tree using rpart. In a node of the tree, the split is based on some factor. I want to label these nodes with the levels of this factor. Does anyone know how to do this?

Thanks,
Aimin
Re: [R] rpart tree node label
Not sure how you want to label it. Could you be more specific? Thanks.

On 2/14/07, Aimin Yan [EMAIL PROTECTED] wrote:

> I generated a tree using rpart. In a node of the tree, the split is
> based on some factor. I want to label these nodes with the levels of
> this factor. Does anyone know how to do this?

--
WenSui Liu
Re: [R] rpart tree node label
  > levels(training$aa_one)
   [1] A C D E F H I K L M N P Q R S T V W Y

These are the 19 levels of aa_one. When I look at the tree, one node is labelled

  aa_one=bcdfgknop

so it is obviously labelled with alphabet letters, not with the levels of aa_one. I want to get something like aa_one=CDE... instead. Do you know how to do this?

Aimin

At 04:23 PM 2/14/2007, Wensui Liu wrote:

> Not sure how you want to label it. Could you be more specific? [...]
Re: [R] rpart tree node label [Broadcast]
Try the following to see:

  library(rpart)
  iris.rp <- rpart(Sepal.Length ~ Species, iris)
  plot(iris.rp)
  text(iris.rp)

Two possible solutions:

1. Use text(..., pretty=0). See ?text.rpart.
2. Use post(..., filename=).

Andy

From: Wensui Liu

> Not sure how you want to label it. Could you be more specific? [...]
[R] rpart
Hello, I have a question about rpart. I am trying to use it to predict a continuous variable, but I get different prediction accuracy for the same training set. Does anyone know why?

Aimin
Re: [R] rpart
Yes, I use the same settings, and I calculate MSE and CC as prediction accuracy measures. Someone told me I should not trust one tree and should do bagging. Is this correct?

Aimin

At 03:11 PM 2/5/2007, Wensui Liu wrote:

> Are you sure you are using the same settings, tree size, and so on?
Re: [R] rpart
Man, oh, man. Surely you can use bagging, or probably boosting. But that doesn't answer your question, does it? Believe me, even if you use bagging, the result will vary, depending on set.seed().

On 2/5/07, Aimin Yan [EMAIL PROTECTED] wrote:

> Yes, I use the same settings, and I calculate MSE and CC as prediction
> accuracy measures. Someone told me I should not trust one tree and
> should do bagging. Is this correct?

--
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)
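[One common source of the run-to-run variation being discussed is the cross-validation built into rpart: the fitted splits are deterministic for a fixed training set, but the xerror column of the CP table comes from a random CV partition. A minimal sketch; car.test.frame ships with rpart and the seed value is arbitrary:]

```r
library(rpart)

## The tree itself is deterministic, but xerror is computed from a
## random cross-validation split, so it changes between runs unless
## the random seed is fixed first.
set.seed(42)
fit1 <- rpart(Mileage ~ ., data = car.test.frame)
set.seed(42)
fit2 <- rpart(Mileage ~ ., data = car.test.frame)
identical(fit1$cptable, fit2$cptable)   # same seed, same xerror column
```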
Re: [R] rpart question
On Thu, 25 Jan 2007, Aimin Yan wrote:

> I make a classification tree like this:
>
>   p.t2.90 <- rpart(y ~ aa_three + bas + bcu + aa_ss, data = training,
>                    method = "class",
>                    control = rpart.control(cp = 0.0001))
>
> Here I want to set weights for the 4 predictors (aa_three, bas, bcu,
> aa_ss). I know that there is a weights set-up in rpart. Can this
> set-up satisfy my need?

It depends on what _you_ mean by 'set weight'. You will need to tell us in detail what exactly you want the weights to do. Using the 'weights' argument specifies case weights (as the help says). There are also 'cost' and 'parms' for other aspects of weighting.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,            Tel: +44 1865 272861 (self)
1 South Parks Road,                   +44 1865 272866 (PA)
Oxford OX1 3TG, UK               Fax: +44 1865 272595
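[A sketch of the three "weight" notions mentioned above, on the kyphosis data so it is self-contained; the numeric values are arbitrary illustrations. 'weights' are per-case, 'cost' is a per-predictor scaling (a split's improvement is divided by the variable's cost, so higher cost makes a variable less likely to be chosen), and parms$loss sets misclassification losses:]

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             ## per-predictor scaling: Number is penalised here
             cost = c(1, 2, 1),
             ## misclassification losses (zeros on the diagonal)
             parms = list(loss = matrix(c(0, 1, 4, 0), nrow = 2)))
```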
[R] rpart question
I make a classification tree like this:

  p.t2.90 <- rpart(y ~ aa_three + bas + bcu + aa_ss, data = training,
                   method = "class",
                   control = rpart.control(cp = 0.0001))

Here I want to set weights for the 4 predictors (aa_three, bas, bcu, aa_ss). I know that there is a weights set-up in rpart. Can this set-up satisfy my need? If so, could someone give me an example?

Thanks,
Aimin Yan
[R] rpart - I'm confused by the loss matrix
Hello,

As I couldn't find anywhere in the rpart help which element of the loss matrix means which loss, I played with this parameter and became a bit confused. What I did was this: I used the kyphosis data (classification absent/present; the number of 'absent' cases is 64, of 'present' cases 17) and tried the following:

  lmat <- matrix(c(0, 17, 64, 0), ncol = 2)
  lmat
       [,1] [,2]
  [1,]    0   64
  [2,]   17    0

  set.seed(1003)
  fit1 <- rpart(Kyphosis ~ ., data = kyphosis, parms = list(loss = lmat))
  set.seed(1003)
  fit2 <- rpart(Kyphosis ~ ., data = kyphosis,
                parms = list(prior = c(0.5, 0.5)))

The results I obtained were identical, so I concluded that the losses were [L(true, predicted)]: L(absent, present) = 17 and L(present, absent) = 64. Thus the arrangement of the elements in the loss matrix seemed clear, as absent is considered class 1 and present class 2, and my problem seemed to be solved. However, I also tried

  residuals(fit1)

and became confused, because for each misclassified 'absent' the residual (which should be the loss in this case) was 64, while for a misclassified 'present' it was 17 -- in contradiction to the above. So am I wrong somewhere? Is the arrangement of the elements in the loss matrix the one I deduced from fitting fit1 and fit2?

Thanks for any comments.
Barbora
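[One hedged way to probe how rpart stored the loss matrix is to fit and then inspect the parameters it returns on the fitted object; the row/column ordering follows the factor levels of the response. A sketch, assuming the parms component is retained on the fit as in current rpart versions:]

```r
library(rpart)

## Fit with an explicit loss matrix, then look at what rpart kept.
## levels(kyphosis$Kyphosis) gives the class order the matrix follows.
fit <- rpart(Kyphosis ~ ., data = kyphosis,
             parms = list(loss = matrix(c(0, 17, 64, 0), ncol = 2)))
levels(kyphosis$Kyphosis)
fit$parms
```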
[R] rpart
Dear r-help-list:

If I use rpart like

  cfit <- rpart(y ~ ., data = data, ...)

what kind of tree is stored in cfit? Is it right that this tree is not pruned at all, i.e. that it is the full tree? If so, it's up to me to choose a subtree using the printcp method.

In the technical report by Atkinson and Therneau, "An Introduction to Recursive Partitioning Using the RPART Routines" (2000), one can see the following table on page 15:

     CP    nsplit rel error  xerror  xstd
  1  0.105     0   1.00000  1.0000  0.108
  2  0.056     3   0.68519  1.1852  0.111
  3  0.028     4   0.62963  1.0556  0.109
  4  0.574     6   0.57407  1.0556  0.109
  5  0.100     7   0.60000  1.0556  0.109

Some lines below it says: "We see that the best tree has 5 terminal nodes (4 splits)." Why is that, if the xerror is lowest for the tree consisting only of the root?

Thank you very much for your help,
Henri
Re: [R] rpart
On Mon, 25 Sep 2006, [EMAIL PROTECTED] wrote:

> If I use rpart like cfit <- rpart(y ~ ., data = data, ...), what kind
> of tree is stored in cfit? Is it right that this tree is not pruned
> at all, that it is the full tree?

It is an rpart object. This contains both the tree and the instructions for pruning it at all values of cp: note that cp is also used in deciding how large a tree to grow.

> If so, it's up to me to choose a subtree by using the printcp method.

Or the plotcp method.

> In the technical report from Atkinson and Therneau, "An Introduction
> to Recursive Partitioning Using the RPART Routines" (2000), one can
> see a CP table on page 15. [...] Some lines below it says "We see
> that the best tree has 5 terminal nodes (4 splits)." Why that, if the
> xerror is lowest for the tree consisting only of the root?

There are *two* reports with that name: this seems to be from minitech.ps. The choice is explained in the rest of that paragraph (the 1-SE rule was used). My guess is that the authors excluded the root as not being a tree, but only they can answer that.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,            Tel: +44 1865 272861 (self)
1 South Parks Road,                   +44 1865 272866 (PA)
Oxford OX1 3TG, UK               Fax: +44 1865 272595
Re: [R] rpart
On Tue, 26 Sep 2006, [EMAIL PROTECTED] wrote:

> Ok, I have to explain my problem a little bit more in detail; I'm
> sorry for being so vague. I used the method in the following way:
>
>   cfit <- rpart(y ~ ., method = "class", minsplit = 1, cp = 0)
>
> I got a tree with a lot of terminal nodes that contained more than
> 100 observations. This made me believe that the tree was already
> pruned. On the other hand, the printcp method showed subtrees that
> were better. This made me believe that the tree hadn't been pruned
> before. So, are the trees a little bit pruned?

Yes, as you asked for cp=0. Look up what that does in ?rpart.control.

> Are both reports from 2000? But you're right, I'm talking about the
> one from minitech.ps. The 1-SE rule only explains why they didn't
> choose the tree with 6 or 7 splits, but not why they didn't choose
> the tree without a split. The exclusion of the root as not being a
> tree was my first explanation, too. But if the tree consisting only
> of the root is still better than any other tree, why would I choose a
> tree with 4 splits then?
>
> Henri

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,            Tel: +44 1865 272861 (self)
1 South Parks Road,                   +44 1865 272866 (PA)
Oxford OX1 3TG, UK               Fax: +44 1865 272595
Re: [R] rpart
Original message of Tue, 26 Sep 2006 09:56:53 +0100 (BST), from Prof Brian Ripley:

> It is an rpart object. This contains both the tree and the
> instructions for pruning it at all values of cp: note that cp is also
> used in deciding how large a tree to grow.

Ok, I have to explain my problem a little bit more in detail; I'm sorry for being so vague. I used the method in the following way:

  cfit <- rpart(y ~ ., method = "class", minsplit = 1, cp = 0)

I got a tree with a lot of terminal nodes that contained more than 100 observations. This made me believe that the tree was already pruned. On the other hand, the printcp method showed subtrees that were better. This made me believe that the tree hadn't been pruned before. So, are the trees a little bit pruned?

> There are *two* reports with that name: this seems to be from
> minitech.ps. The choice is explained in the rest of that paragraph
> (the 1-SE rule was used). My guess is that the authors excluded the
> root as not being a tree, but only they can answer that.

Are both reports from 2000? But you're right, I'm talking about the one from minitech.ps. The 1-SE rule only explains why they didn't choose the tree with 6 or 7 splits, but not why they didn't choose the tree without a split. The exclusion of the root as not being a tree was my first explanation, too. But if the tree consisting only of the root is still better than any other tree, why would I choose a tree with 4 splits then?

Henri
Re: [R] rpart
Original message. Date: Tue, 26 Sep 2006 12:54:22 +0100 (BST). From: Prof Brian Ripley [EMAIL PROTECTED]. To: [EMAIL PROTECTED]. Subject: Re: [R] rpart

On Tue, 26 Sep 2006, [EMAIL PROTECTED] wrote:

Original message. Date: Tue, 26 Sep 2006 09:56:53 +0100 (BST). From: Prof Brian Ripley [EMAIL PROTECTED]. To: [EMAIL PROTECTED]. Subject: Re: [R] rpart

On Mon, 25 Sep 2006, [EMAIL PROTECTED] wrote:

Dear r-help-list: If I use the rpart method like cfit <- rpart(y ~ ., data = data, ...), what kind of tree is stored in cfit? Is it right that this tree is not pruned at all, that it is the full tree?

It is an rpart object. This contains both the tree and the instructions for pruning it at all values of cp: note that cp is also used in deciding how large a tree to grow.

OK, I have to explain my problem in a little more detail, I'm sorry for being so vague. I used the method in the following way:

cfit <- rpart(y ~ ., method = "class", minsplit = 1, cp = 0)

I got a tree with a lot of terminal nodes that contained more than 100 observations. This made me believe that the tree was already pruned. On the other hand, the printcp method showed subtrees that were better. This made me believe that the tree hadn't been pruned before. So, are the trees a little bit pruned?

Yes, as you asked for cp=0. Look up what that does in ?rpart.control.

I thought I would get a full tree by choosing cp=0 - and it was one. The nodes with more than 100 observations were not split further because there was no sequence of splits which made the class label change for any subset. (A bad explanation, but you probably know what I mean.) I realized that when I chose cp=-1. Thank you very much for your help!

If so, it's up to me to choose a subtree by using the printcp method.

Or the plotcp method.

In the technical report by Atkinson and Therneau, "An Introduction to Recursive Partitioning Using the rpart Routines" from 2000, one can see the following table on page 15:

     CP nsplit rel error xerror  xstd
1 0.105      0   1.00000 1.0000 0.108
2 0.056      3   0.68519 1.1852 0.111
3 0.028      4   0.62963 1.0556 0.109
4 0.019      6   0.57407 1.0556 0.109
5 0.010      7   0.55556 1.0556 0.109

Some lines below it says "We see that the best tree has 5 terminal nodes (4 splits)". Why that, if the xerror is the lowest for the tree consisting only of the root?

There are *two* reports with that name: this seems to be from minitech.ps. The choice is explained in the rest of that paragraph (the 1-SE rule was used). My guess is that the authors excluded the root as not being a tree, but only they can answer that.

Are both reports from 2000? But you're right, I'm talking about the one from minitech.ps. The 1-SE rule only explains why they didn't choose the tree with 6 or 7 splits, but not why they didn't choose the tree without a split. The exclusion of the root as not being a tree was my first explanation, too. But if the tree consisting only of the root is still better than any other tree, why would I choose a tree with 4 splits then?

Henri

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
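For reference, the 1-SE rule discussed in this thread can be applied by hand from a fitted tree's cp table. A minimal sketch using the kyphosis data shipped with rpart (not the stage C data from the report; variable names are from ?rpart):

```r
library(rpart)

# Fit a classification tree; xerror comes from cross-validation,
# so fix the seed for a reproducible cp table.
set.seed(10)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

ct <- fit$cptable  # columns: CP, nsplit, "rel error", xerror, xstd

# 1-SE rule: pick the smallest tree whose xerror lies within one xstd
# of the minimum cross-validated error.
thresh  <- min(ct[, "xerror"]) + ct[which.min(ct[, "xerror"]), "xstd"]
best.cp <- ct[which(ct[, "xerror"] <= thresh)[1], "CP"]
pruned  <- prune(fit, cp = best.cp)
```

Note that, as the thread says, the root row (nsplit = 0) is part of the table, so a strict reading of the rule can select the root; the report's authors evidently did not count it as a tree.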
Re: [R] Rpart, custom penalty for an error
On Sun, 2006-09-10 at 20:36 +0100, Prof Brian Ripley wrote:

I am however interested in areas where the probability of success is noticeably higher than 5%, for example 20%. I've tried rpart and the weights option, increasing the weights of the success-observations.

You are 'misleading' rpart by using 'weights', claiming to have case weights for cases you do not have. You need to use 'cost' instead.

As for the rpart() function, the 'cost' argument is for scaling the variables, not for the cost of misclassifications. To specify misclassification costs, the 'parms' argument needs to be used, as a list with a 'loss' element in the form of a matrix. In other words, the cost argument is not for misclassification cost; use the loss element of the parms argument. Example usage:

tr <- rpart(y ~ x, data = some.data, method = "class",
            parms = list(loss = matrix(c(0, 1, 20, 0), nrow = 2)))

This is a standard issue, discussed in all good books on classification (including mine).

Yes, in MASS, section 12.2, "Classification Theory", page 338 (fourth edition). I was looking for it in section 9.2, where rpart() is discussed. Thanks!

Regards, Maciej -- http://automatthias.wordpress.com

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
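The loss-matrix usage above can be made self-contained. A hedged sketch with simulated data standing in for the poster's (the data frame and names here are made up; per ?rpart, rows of the loss matrix index the true class and columns the predicted class):

```r
library(rpart)

set.seed(42)
# Simulated unbalanced two-class data: roughly 5% successes.
dat <- data.frame(
  x = rnorm(400),
  y = factor(rbinom(400, 1, 0.05), labels = c("fail", "success"))
)

# L[i, j]: cost of predicting class j when the true class is i.
# Here a missed "success" costs 20, a false alarm costs 1.
L <- matrix(c(0,  1,
              20, 0), nrow = 2, byrow = TRUE)

fit <- rpart(y ~ x, data = dat, method = "class",
             parms = list(loss = L))
```

With a heavy penalty on missed successes, splits that isolate success-rich regions become worthwhile even though the overall success rate is far below 50%.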
[R] Rpart, custom penalty for an error
Hello all R-help list subscribers, I'd like to create a regression tree of a data set with binary response variable. Only 5% of observations are a success, so the regression tree will not find really any variable value combinations that will yield more than 50% of probability of success. I am however interested in areas where the probability of success is noticeably higher than 5%, for example 20%. I've tried rpart and the weights option, increasing the weights of the success-observations. It works as expected in terms of the tree creation: instead of a single root, a tree is being built. But the tree plot() and text() are somewhat misleading. I'm interested in the observation counts inside each leaf. I use the use.n = TRUE parameter. The counts displayed are misleading, the numbers of successes are not the original numbers from the sample, they seem to be cloned success-observations. I'd like to split the tree just as weights parameter allows me to, keeping the original number of observations in the tree plot. Is it possible? If yes, how? Kind regards, Maciej -- Maciej Bliziński [EMAIL PROTECTED] http://automatthias.wordpress.com __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Rpart, custom penalty for an error
On Sun, 10 Sep 2006, Maciej Bliziński wrote:

Hello all R-help list subscribers, I'd like to create a regression tree of a data set with binary response variable. Only 5% of observations are a success, so the regression tree will not find really any variable value combinations that will yield more than 50% probability of success.

This would be a misuse of a regression tree, for the exact problem for which classification trees were designed.

I am however interested in areas where the probability of success is noticeably higher than 5%, for example 20%. I've tried rpart and the weights option, increasing the weights of the success-observations.

You are 'misleading' rpart by using 'weights', claiming to have case weights for cases you do not have. You need to use 'cost' instead. This is a standard issue, discussed in all good books on classification (including mine).

It works as expected in terms of the tree creation: instead of a single root, a tree is being built. But the tree plot() and text() are somewhat misleading. I'm interested in the observation counts inside each leaf. I use the use.n = TRUE parameter. The counts displayed are misleading, the numbers of successes are not the original numbers from the sample, they seem to be cloned success-observations.

They _are_ the original numbers, for that is what 'case weights' means.

I'd like to split the tree just as the weights parameter allows me to, keeping the original number of observations in the tree plot. Is it possible? If yes, how? Kind regards, Maciej

-- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA), 1 South Parks Road, Oxford OX1 3TG, UK; Fax: +44 1865 272595

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] rpart output: rule extraction beyond path.rpart()
Greetings - Is there a way to automatically perform what I believe is called rule extraction (by Quinlan and the machine learning community at least) for the leaves of trees generated by rpart? I can use path.rpart() to automatically extract the paths to the leaves, but these can be needlessly cumbersome. For example, one path returned by path.rpart() might be:

[1] root       y>=-0.1905  y< 0.1495  z>=-0.19   z< 0.1785
[6] y>=-0.1385 z>=-0.153   x< 0.37    x>=-0.363

But the y>=-0.1905 and z>=-0.19 are both redundant, given restrictions placed further down the tree. Simplifying the paths by hand is feasible for small trees but quite cumbersome when dimensionality increases. I can think of ways to write code to do this automatically, but would prefer not to if it's already implemented. I have done extensive searching and turned up nothing, but I fear I might just be lacking the right terminology. Any thoughts? Much appreciated, -Ben

Ben Bryant, Doctoral Fellow, Pardee RAND Graduate School [EMAIL PROTECTED] This email message is for the sole use of the intended recip...{{dropped}}

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
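For reference, the path-extraction step itself can be done programmatically; collapsing redundant conditions (keeping only the tightest bound per variable) still has to be coded by hand. A sketch using rpart's bundled car.test.frame data (the data choice is illustrative, not the poster's):

```r
library(rpart)

# Any regression tree will do for the demonstration.
fit <- rpart(Mileage ~ Weight + Price, data = car.test.frame)

# Leaf node numbers sit in the row names of fit$frame.
leaves <- as.numeric(rownames(fit$frame)[fit$frame$var == "<leaf>"])

# path.rpart() returns, per leaf, the chain of conditions from the root.
paths <- path.rpart(fit, nodes = leaves, print.it = FALSE)
```

Each element of `paths` is a character vector starting with "root"; simplification would then parse each condition into (variable, operator, cutpoint) triples and keep the extreme cutpoint per variable and direction.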
[R] rpart unbalanced data
Hello all, I am currently working with rpart to classify vegetation types by spectral characteristics, and am coming up with poor classifications based on the fact that I have some vegetation types that have only 15 observations, while others have over 100. I have attempted to supply prior weights to the dataset, though this does not improve the classification greatly. Could anyone supply some hints about how to improve a classification for a badly unbalanced dataset? Thank you, Helen Mills Poulos

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] rpart unbalanced data
Dear Helen, You may want to have a look at http://www.togaware.com/datamining/survivor/Predicting_Fraud.html

Greets, Diego Kuonen

[EMAIL PROTECTED] wrote: Hello all, I am currently working with rpart to classify vegetation types by spectral characteristics, and am coming up with poor classifications based on the fact that I have some vegetation types that have only 15 observations, while others have over 100. I have attempted to supply prior weights to the dataset, though this does not improve the classification greatly. Could anyone supply some hints about how to improve a classification for a badly unbalanced dataset? Thank you, Helen Mills Poulos

-- Dr. ès sc. Diego Kuonen, CEO, Statoo Consulting, PO Box 107, CH-1015 Lausanne 15; phone +41 (0)21 693 5508; fax +41 (0)21 693 8765; mobile +41 (0)78 709 5384; email [EMAIL PROTECTED]; web http://www.statoo.info; skype Kuonen.Statoo.Consulting | Statistical Consulting + Data Analysis + Data Mining Services | + Are you drowning in information and starving for knowledge? + Have you ever been Statooed? http://www.statoo.biz +

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
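Besides the link above, class priors can be supplied directly through rpart's parms argument. A hedged sketch with made-up data standing in for the vegetation classes (15 vs 100 observations; the variable names are illustrative):

```r
library(rpart)

set.seed(1)
veg <- data.frame(
  type  = factor(rep(c("rare", "common"), times = c(15, 100))),
  band1 = c(rnorm(15, mean = 2),  rnorm(100)),
  band2 = c(rnorm(15, mean = -2), rnorm(100))
)

# Default priors follow the data counts (15/115 vs 100/115); equal priors
# make the rare class count as much as the common one when splitting.
# The prior vector follows the order of the factor levels.
fit <- rpart(type ~ band1 + band2, data = veg, method = "class",
             parms = list(prior = c(0.5, 0.5)))
```

A loss matrix (parms = list(loss = ...)) is the other standard lever for unbalanced classes, as discussed in the "custom penalty" thread above.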
[R] Rpart -- using predict() when missing data is present?
I am doing

library(rpart)
m <- rpart(y ~ x, D[insample, ])

D[outsample, ]
            y           x
8  0.78391922 0.579025591
9  0.06629211          NA
10         NA 0.001593063

p <- predict(m, newdata = D[9, ])
Error in model.frame(formula, rownames, variables, varnames, extras, extranames, : invalid result from na.action

How do I persuade him to give me NA since x is NA? I looked at ?predict.rpart but didn't find any mention of NAs. (In this problem, I can easily do it manually, but this is part of something bigger where I want him to be able to gracefully handle prediction requests involving NA.)

-- Ajay Shah, Consultant, [EMAIL PROTECTED] Department of Economic Affairs http://www.mayin.org/ajayshah Ministry of Finance, New Delhi

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Rpart -- using predict() when missing data is present?
On Sat, 8 Oct 2005, Ajay Narottam Shah wrote:

I am doing

library(rpart)
m <- rpart(y ~ x, D[insample, ])

D[outsample, ]
            y           x
8  0.78391922 0.579025591
9  0.06629211          NA
10         NA 0.001593063

p <- predict(m, newdata = D[9, ])
Error in model.frame(formula, rownames, variables, varnames, extras, extranames, : invalid result from na.action

How do I persuade him to give me NA since x is NA?

I think the point is to do something sensible! One-x prediction problems are not what rpart is designed to do, and the default na.action (na.rpart) fails in that case. (The author forgot drop=F.)

I looked at ?predict.rpart but didn't find any mention of NAs.

How about ?rpart? That does.

(In this problem, I can easily do it manually, but this is part of something bigger where I want him to be able to gracefully handle prediction requests involving NA.)

-- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA), 1 South Parks Road, Oxford OX1 3TG, UK; Fax: +44 1865 272595

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
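One manual workaround for the single-predictor case is to predict only the complete rows and fill the rest with NA. A sketch with simulated data (this sidesteps rpart's na.action rather than fixing it):

```r
library(rpart)

set.seed(1)
D <- data.frame(x = rnorm(50))
D$y <- 2 * D$x + rnorm(50)
m <- rpart(y ~ x, data = D)

new <- data.frame(x = c(0.58, NA, 0.0016))

# Predict the complete rows only; leave NA where the predictor is missing.
p  <- rep(NA_real_, nrow(new))
ok <- complete.cases(new)
p[ok] <- predict(m, newdata = new[ok, , drop = FALSE])
```

Note the `drop = FALSE`, which keeps the subset a data frame even when a single row survives - the same detail Ripley points out is missing inside na.rpart.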
[R] rpart Error in yval[, 1] : incorrect number of dimensions
I tried using rpart, as below, and got this error message: "Error in yval[, 1] : incorrect number of dimensions". Thinking it might somehow be related to the large number of missing values, I tried using complete data, but with the same result. Does anyone know what may be going on, and how to fix it? I have traced two similar error messages in the Archive, but following the threads did not make it clear how to fix the problem.

currwh.rpart <- rpart(formula = CURRWHEE ~ EA17_6_1 + EA17_9_1 + X087 + X148 + X260 + MOTHERSA + GESTATIO, method = "class")

currwh.rpart
n=6783 (2283 observations deleted due to missing)
node), split, n, loss, yval, (yprob)
      * denotes terminal node
1) root 6783 720 3 (0.1060002949 0.8938522778 0.0001474274) *

summary(currwh.rpart)
Call: rpart(formula = CURRWHEE ~ EA17_6_1 + EA17_9_1 + X087 + X148 + X260 + MOTHERSA + GESTATIO, method = "class")
n=6783 (2283 observations deleted due to missing)
  CP nsplit rel error
1  0      0         1
Error in yval[, 1] : incorrect number of dimensions

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] rpart plot question
Petr Pikal wrote:

Dear all, I am quite confused by rpart plotting. Here is an example.

set.seed(1)
y <- c(rnorm(10), rnorm(10) + 2, rnorm(10) + 5)
x <- rep(c(1, 2, 5), c(10, 10, 10))
fit <- rpart(x ~ y)   ## NB: should be y ~ x
plot(fit)
text(fit)

Text on the first split says x< 3.5 and on the second split x< 1.5, what I understand: if x< 3.5 then y is lower and those y values go to the left split. OK. But sometimes there is "whatever>=nnn" and it seems to me that if this condition is true the response variable follows the right split. Try:

y1 <- c(rnorm(10) + 5, rnorm(10) + 2, rnorm(10))
fit <- rpart(y1 ~ x)
plot(fit)
text(fit)

Well, I am not sure I express myself clearly. Am I correct that when there is a '<' sign I shall follow the left node, but when there is a '>=' sign I shall follow the right one? Best regards, Petr Pikal, petr.pikal at precheza.cz

If instead of rpart you use mvpart, i.e.

library(mvpart)
fit <- mvpart(y ~ x, data = data.frame(cbind(x, y)))
plot(fit)
text.rpart(fit, which = 4)

then the plot will be much clearer about the condition for splits. summary(fit) will also help.

Regards, John = John Field Consulting Pty Ltd, 10 High St, Burnside SA 5066, Australia; ph: +61 8 8332 5294 or +61 409 097 586; fax: +61 8 8332 1229; email: [EMAIL PROTECTED]

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
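To see the labelling convention concretely, here is a runnable version of the example; in rpart's plots the left branch is taken when the printed condition is true, so a '<' label sends low values left and the complementary '>=' cases go right:

```r
library(rpart)

set.seed(1)
x <- rep(c(1, 2, 5), c(10, 10, 10))
y <- c(rnorm(10), rnorm(10) + 2, rnorm(10) + 5)

fit <- rpart(y ~ x)
plot(fit)
text(fit, use.n = TRUE)  # a label such as "x< 3.5": TRUE cases go left
```

summary(fit) prints each primary split in full (cutpoint, direction, and improvement), which removes any ambiguity left by the plot labels.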
[R] rpart plot question
Dear all, I am quite confused by rpart plotting. Here is an example.

set.seed(1)
y <- c(rnorm(10), rnorm(10) + 2, rnorm(10) + 5)
x <- rep(c(1, 2, 5), c(10, 10, 10))
fit <- rpart(x ~ y)
plot(fit)
text(fit)

Text on the first split says x< 3.5 and on the second split x< 1.5, what I understand: if x< 3.5 then y is lower and those y values go to the left split. OK. But sometimes there is "whatever>=nnn" and it seems to me that if this condition is true the response variable follows the right split. Try:

y1 <- c(rnorm(10) + 5, rnorm(10) + 2, rnorm(10))
fit <- rpart(y1 ~ x)
plot(fit)
text(fit)

Well, I am not sure I express myself clearly. Am I correct that when there is a '<' sign I shall follow the left node, but when there is a '>=' sign I shall follow the right one? Best regards, Petr Pikal [EMAIL PROTECTED]

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] rpart memory problem
Hi everyone, I have a problem using rpart (R 2.0.1 under Unix). Indeed, I have a large matrix (9271x7); my response variable is numeric and all my predictor variables are categorical (from 3 to 8 levels). Here is an example:

mydata[1:5, ]
                  distance group3 group4 group5 group6 group7 group8
pos_1    0.141836040224967      a      c      e      a      g      g
pos_501  0.153605961621317      a      a      a      a      g      g
pos_1001 0.152246705384699      a      c      e      a      g      g
pos_1501 0.145563737522463      a      c      e      a      g      g
pos_2001 0.143940027378837      a      c      e      e      g      g

When using rpart() as follows, the program runs for ages, and after a few hours R is abruptly killed:

library(rpart)
fit <- rpart(distance ~ ., data = mydata)

When I change the categorical variables into numeric values (e.g. a = 1, b = 2, c = 3, etc...), the program runs normally in a few seconds. But this is not what I want, because it separates my variables according to group7< 4.5 (continuous) and not group7 = a,b,d,f or c,e,g (discrete). Here is the result:

fit
n= 9271
node), split, n, deviance, yval
      * denotes terminal node
 1) root 9271 28.43239000 0.1768883
   2) group7>=4.5 5830  4.87272700 0.1534626
     4) group5< 5.5 5783  3.29538700 0.1520110
       8) group5>=4.5 3068  0.68517040 0.1412967 *
       9) group5< 4.5 2715  1.86003600 0.1641184 *
     5) group5>=5.5 47  0.06597044 0.3320614 *
   3) group7< 4.5 3441 14.93984000 0.2165781
     6) group5< 1.5 1461  1.00414700 0.1906630 *
     7) group5>=1.5 1980 12.2305 0.2357002
      14) group6>=2.5 1659  2.95395700 0.2090232
        28) group3>=2.5 1315  1.65184200 0.1957505 *
        29) group3< 2.5 344  0.18490260 0.2597607 *
      15) group6< 2.5 321  1.99404400 0.3735729 *

When I create a small dataframe such as the example above, e.g.:

distance <- rnorm(5, 0.15, 0.01)
group3 <- c("a", "a", "a", "a", "a")
group4 <- c("c", "a", "c", "c", "c")
group5 <- c("e", "a", "e", "e", "e")
group6 <- c("a", "a", "a", "a", "e")
smalldata <- data.frame(cbind(distance, group3, group4, group5, group6))

the program runs normally in a few seconds. Why does it work using the large dataset with only numeric values but not with categorical predictor variables?
I have the impression that it considers my response variable also as a categorical variable and therefore it can't handle 9271 levels, which is quite normal. Is there a way to solve this problem ? I thank you all for your time and help, Jennifer Becq __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
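A side note on the small example above: wrapping cbind() in data.frame() is itself a trap, because cbind() on mixed numeric and character input coerces everything to a character matrix, so the numeric response comes back as character (or a many-levelled factor in older R) and rpart then sees a classification problem with thousands of classes. A base-R sketch of the difference:

```r
set.seed(1)
distance <- rnorm(5, 0.15, 0.01)
group3 <- c("a", "a", "a", "a", "a")
group4 <- c("c", "a", "c", "c", "c")

# Via cbind(): the matrix is character, so distance loses its numeric type.
bad  <- data.frame(cbind(distance, group3, group4))

# Direct construction keeps distance numeric.
good <- data.frame(distance, group3, group4)
```

Checking str(mydata) before fitting would reveal whether the real 9271-row response has suffered the same coercion.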
Re: [R] rpart memory problem
[EMAIL PROTECTED] wrote: Hi everyone, I have a problem using rpart (R 2.0.1 under Unix). Indeed, I have a large matrix (9271x7); my response variable is numeric and all my predictor variables are categorical (from 3 to 8 levels).

Your problem is the number of levels. You get a similar number of dummy variables, and your problem becomes really huge.

Uwe Ligges

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] rpart
Hi, there: I am working on a classification problem using rpart. When my response variable y is binary, the tree grows very fast, but if I add one more case to y, that is, making y have 3 cases, the tree growing cannot be finished. The command looks like:

x <- rpart(r0$V142 ~ ., data = r0[, 1:141], parms = list(split = 'gini'), cp = 0.01)

Changing cp or removing parms does not help. summary(r0$V142) gives:

  0   1   2
370  14  16

I am not sure if rpart can do this or there is something wrong with my approach. Please be advised. Ed

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] rpart
On Mon, 17 Jan 2005, Weiwei Shi wrote:

I am working on a classification problem using rpart. When my response variable y is binary, the tree grows very fast, but if I add one more case to y, that is, making y have 3 cases,

Do you mean 3 classes? You have many more than 3 cases below.

the tree growing cannot be finished.

Whatever does that mean? Please see the posting guide and supply the information it asks for: a reproducible example, what happens when you run it, and why you think it is wrong.

The command looks like:

x <- rpart(r0$V142 ~ ., data = r0[, 1:141], parms = list(split = 'gini'), cp = 0.01)

Changing cp or removing parms does not help. summary(r0$V142) gives:

  0   1   2
370  14  16

I am not sure if rpart can do this or there is something wrong with my approach.

What is 'this' you want to do? Rpart works well with multiple classes: see for example MASS4.

-- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA), 1 South Parks Road, Oxford OX1 3TG, UK; Fax: +44 1865 272595

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
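As the reply says, rpart handles three or more classes directly. A quick sketch on the built-in iris data (three species), using the poster's gini setting:

```r
library(rpart)

# Species has three levels; method = "class" handles them directly.
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "gini"), cp = 0.01)
```

If growing really hangs on the poster's data, the cause lies elsewhere (e.g. the response's type or the predictors), not in the number of classes.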
[R] rpart problem
Dear all, I am having some trouble getting the rpart function to work as expected. I am trying to use rpart to combine levels of a factor, to reduce the number of levels of that factor. In exploring the code I have noticed that it is possible for chisq.test to return a statistically significant result whilst the rpart method returns only the root node (i.e. no split is made). The following code recreates the issue using simulated data:

# Create a 2-level factor with group 1 probability of success 90% and group 2 60%
tmp1 <- as.factor(runif(1000) <= 0.9)
tmp2 <- as.factor(runif(1000) <= 0.5)
mysuccess <- as.factor(c(tmp1, tmp2))
mygroup <- as.factor(c(rep(1, 1000), rep(2, 1000)))
table(mysuccess, mygroup)
chisq.test(mysuccess, mygroup)  # p-value < 2.2e-16
myrpart <- rpart(mysuccess ~ mygroup)
myrpart  # rpart does not provide splits!!

If I change the parameter in the setting of group 2 to 0.3 from 0.6, rpart does return splits, i.e. change the line tmp2 <- as.factor(runif(1000) <= 0.6) to tmp2 <- as.factor(runif(1000) <= 0.3). rpart does split the nodes, but as the split with 0.6 is highly significant I would still have expected a split in this case too. I would appreciate any advice as to whether this is a known feature of rpart, whether I need to change the way my data are stored, or whether I need to set some of the control options. I have tested a few of these options with no success. Thanks, Paul.

__ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] rpart problem
I think you are confusing the purpose of rpart, which is prediction. You want to predict 'mysuccess'. One group has 90% success, so the best prediction is 'success'. The other group has 60% success, so the best prediction is 'success'. So there is no point in splitting into groups. Replace 60% by 30% and the best prediction for group 2 changes. If this is not now obvious, please read up on tree-based methods.

On Mon, 6 Sep 2004 [EMAIL PROTECTED] wrote:

Dear all, I am having some trouble getting the rpart function to work as expected. I am trying to use rpart to combine levels of a factor, to reduce the number of levels of that factor. In exploring the code I have noticed that it is possible for chisq.test to return a statistically significant result whilst the rpart method returns only the root node (i.e. no split is made). The following code recreates the issue using simulated data:

# Create a 2-level factor with group 1 probability of success 90% and group 2 60%
tmp1 <- as.factor(runif(1000) <= 0.9)
tmp2 <- as.factor(runif(1000) <= 0.5)

Is 0.5 a typo?

mysuccess <- as.factor(c(tmp1, tmp2))
mygroup <- as.factor(c(rep(1, 1000), rep(2, 1000)))
table(mysuccess, mygroup)
chisq.test(mysuccess, mygroup)  # p-value < 2.2e-16
myrpart <- rpart(mysuccess ~ mygroup)
myrpart  # rpart does not provide splits!!

If I change the parameter in the setting of group 2 to 0.3 from 0.6, rpart does return splits. rpart does split the nodes, but as the split with 0.6 is highly significant I would still have expected a split in this case too. I would appreciate any advice as to whether this is a known feature of rpart, whether I need to change the way my data are stored, or set some of the control options. I have tested a few of these options with no success.

Testing cp < 0 will have an effect.

-- Brian D.
Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
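Ripley's point can be checked directly: when both groups share the same majority class, a split cannot change any predicted label, so at the default cp the root is never split, while a negative cp forces the split through. A sketch simulating the poster's setup:

```r
library(rpart)

set.seed(1)
# Both groups have "success" (TRUE) as the majority class: 90% vs 60%.
mysuccess <- factor(c(runif(1000) <= 0.9, runif(1000) <= 0.6))
mygroup   <- factor(rep(1:2, each = 1000))

fit.default <- rpart(mysuccess ~ mygroup)           # root only: zero risk gain
fit.forced  <- rpart(mysuccess ~ mygroup, cp = -1)  # split retained anyway
```

The chi-squared test measures association, which is strong here; rpart's default pruning measures reduction in misclassification risk, which is exactly zero when the predicted class is 'success' on both sides.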
RE: [R] rpart and TREE, can be the same?
Hi, Andy, Thank you again for your help. tree() does have an option split = 'gini' in my version, which I recently downloaded from CRAN. The question is that tree.control only controls mindev; there is no option for gini. Or maybe there is a way to specify a 'cp'-like parameter when using the gini index in tree()? Thanks, Auston

Liaw, Andy [EMAIL PROTECTED] 07/16/2004 02:04 PM To: '[EMAIL PROTECTED]' Subject: RE: [R] rpart and TREE, can be the same?

Auston, tree() does not use Gini as splitting criterion, AFAIK. It uses deviance. The various splitting criteria available in rpart are described in Terry's tech report (available on the Mayo Clinic web site). Andy

-Original Message- From: [EMAIL PROTECTED] Sent: Friday, July 16, 2004 2:15 PM To: Liaw, Andy Subject: RE: [R] rpart and TREE, can be the same?

Thank you, Andy. Well, I tried 'gini' for both of them and my data has no NAs, but they still don't match. BTW, what exactly is the splitting criterion 'information' used in rpart? Thanks. Auston

Liaw, Andy [EMAIL PROTECTED] Sent by: [EMAIL PROTECTED] 07/16/2004 01:01 PM Subject: RE: [R] rpart and TREE, can be the same?

I guess if you define the splitting criterion in rpart so that it matches the one used in tree(), that's possible. However, I believe the two also differ in how they handle NAs. Andy

From: [EMAIL PROTECTED] Hi, all, I am wondering if it is possible to set parameters of 'rpart' and 'tree' such that they will produce the exact same tree? Thanks. Auston Wei, Statistical Analyst, Department of Biostatistics and Applied Mathematics, The University of Texas MD Anderson Cancer Center, Tel: 713-563-4281, Email: [EMAIL PROTECTED]

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html -- Notice: This e-mail message, together with any attachments, ...{{dropped}} __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
RE: [R] rpart and TREE, can be the same?
They are substantially different, even if I use 'gini' for both of them and set the relevant parameters to 0. It seems to me there is something more than the splitting rule that governs the growth of the tree. What could that be, other than sizes? Thank you, Auston

Liaw, Andy [EMAIL PROTECTED] 07/19/2004 09:38 AM To: '[EMAIL PROTECTED]' Subject: RE: [R] rpart and TREE, can be the same?

Auston, I see that now. Have you tried setting mindev=0 in tree() and cp=0 in rpart(), to see if the unpruned trees are identical? If so, you can probably try pruning the trees back using other tools in those packages. Cheers, Andy

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
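Following Andy's suggestion, both trees can be grown essentially unpruned and then compared. A hedged sketch on the built-in iris data (the tree package is not part of base R, so it is guarded here; even so the trees may still differ, since the two implementations handle surrogate splits and ties differently):

```r
library(rpart)

# rpart: disable cost-complexity pruning during growth.
fit.rp <- rpart(Species ~ ., data = iris, method = "class",
                control = rpart.control(cp = 0))

# tree: disable the deviance-based stopping rule (mindev = 0).
if (requireNamespace("tree", quietly = TRUE)) {
  fit.tr <- tree::tree(Species ~ ., data = iris,
                       control = tree::tree.control(nobs = nrow(iris),
                                                    mindev = 0))
}
```

Comparing the printed splits of the two unpruned fits then isolates whether any remaining differences come from the splitting criterion or from the stopping and pruning rules.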
[R] rpart and TREE, can be the same?
Hi, all, I am wondering if it is possible to set parameters of 'rpart' and 'tree' such that they will produce the exact same tree? Thanks. Auston Wei Statistical Analyst Department of Biostatistics and Applied Mathematics The University of Texas MD Anderson Cancer Center Tel: 713-563-4281 Email: [EMAIL PROTECTED] [[alternative HTML version deleted]] __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
RE: [R] rpart and TREE, can be the same?
I guess if you define the splitting criterion in rpart so that it matches the one used in tree(), that's possible. However, I believe the two also differ in how they handle NAs. Andy From: [EMAIL PROTECTED] Hi, all, I am wondering if it is possible to set parameters of 'rpart' and 'tree' such that they will produce the exact same tree? Thanks. Auston Wei Statistical Analyst Department of Biostatistics and Applied Mathematics The University of Texas MD Anderson Cancer Center Tel: 713-563-4281 Email: [EMAIL PROTECTED] [[alternative HTML version deleted]] __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] rpart
Hello everyone, I'm a newbie to R and to CART, so I hope my questions don't seem too stupid. 1.) My first question concerns the rpart() method. Which criterion does rpart use to find the best split - entropy impurity, Bayes error (minimum error), or the Gini index? Is there a way to make it use entropy impurity? The second and third questions concern the output of the printcp() function. 2.) What exactly are the cp values here? I assumed them to be the threshold complexity parameters as in Breiman et al., 1998, Section 3.3. Are they the same as the threshold levels of alpha? I have read somewhere that the cp values here are the threshold alphas divided by the root node error. Is that true? 3.) How is rel error computed? I am supposed to evaluate the goodness of classification of the CART method. Do you think rel error is a good measure for that? I'd be very thankful if anyone could give me a hand with this. This is a project for uni and I desperately need a good mark. Thank you very much in advance, Mareike
Re: [R] rpart
Hi, I think most, if not all, of your questions can be answered by: 1) ?rpart 2) A search through the r-help mailing list archives 3) The chapter on tree-based models in MASS 4 (Modern Applied Statistics with S, 4th edition) by Venables and Ripley. Kevin
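To make those pointers concrete, here is a minimal sketch using the kyphosis data shipped with rpart. Classification splitting defaults to the Gini index; entropy impurity is requested via split = "information". In printcp() output, the CP column and rel error are both scaled by the root node error.

```r
library(rpart)
data(kyphosis)

# entropy ("information") splitting instead of the default Gini index
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             parms = list(split = "information"))

# CP and rel error columns are both scaled by the root node error
printcp(fit)
```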
[R] rpart for CART with weights/priors
Hi, I have a technical question about rpart: according to Breiman et al. (1984), different costs of misclassification in CART can be modelled either by modifying the loss matrix or by using different prior probabilities for the classes, which in turn should have the same effect as using different weights for the response classes. What I tried was this:

library(rpart)
data(kyphosis)
# fit1 from the original unweighted data set
fit1 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)

# modify the loss matrix
loss <- matrix(c(0,1,2,0), nrow=2, ncol=2)
# true class?
#      [,1] [,2]
# [1,]    0    2
# [2,]    1    0
# predicted class?

# modify the priors
prior <- c(1/3, 2/3)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, parms=list(loss=loss))
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, parms=list(prior=prior))
fit2
fit3
par(mfrow=c(2,1))
plot(fit2); text(fit2, use.n=TRUE)
plot(fit3); text(fit3, use.n=TRUE)
# These lead to similar but not identical trees (similar topology but different
# cutoff points), while all other combinations (even complete reversal, i.e.
# preference for the other class) lead to totally different trees...

# third approach, using weights:
# sort the data to design the weight vector
ind <- order(kyphosis[,1])
kyphosis1 <- kyphosis[ind,]
summary(kyphosis1[,1])
weight <- c(rep(1,64), rep(2,17))
summary(as.factor(weight))
fit4 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis1, weights=weight)
# leads to a result very similar to fit2 with loss <- matrix(c(0,1,2,0), nrow=2, ncol=2)
# (same tree and cutoff points, but slightly different probabilities;
# maybe a numerical artefact?)
fit4
plot(fit4); text(fit4, use.n=TRUE)

# double check with the inverse loss matrix
loss <- matrix(c(0,1,2,0), nrow=2, ncol=2, byrow=TRUE)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, parms=list(loss=loss))
weight <- c(rep(2,64), rep(1,17))
fit4 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis1, weights=weight)
fit2
fit4
# also the same, except for the predicted probabilities (yprob)

I don't see 1. why the approach using prior probabilities doesn't work, and 2. what causes the differences in predicted probabilities in the weights approach. Any ideas? Thank you! C.
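One possible explanation, offered as a hedged sketch rather than a definitive account: integer case weights act like duplicated observations, so a weight of 2 on the 17 "present" cases corresponds to a prior proportional to the reweighted class counts, c(64, 34)/98, not to c(1/3, 2/3). Under that reading, the prior below should reproduce the weighted tree's splits, while node counts (and hence some printed probabilities) can still differ.

```r
library(rpart)
data(kyphosis)

tab <- table(kyphosis$Kyphosis)                      # absent: 64, present: 17
w   <- ifelse(kyphosis$Kyphosis == "present", 2, 1)

# prior implied by the weights: proportional to weight * class count
prior.equiv <- as.numeric(tab * c(1, 2) / sum(tab * c(1, 2)))

fit.w <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, weights = w)
fit.p <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               parms = list(prior = prior.equiv))
fit.w   # compare the split variables and cutpoints of the two fits
fit.p
```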
[R] rpart question
Wondered about the best way to control for input variables that have a large number of levels in 'rpart' models. I understand the algorithm searches through all possible splits (2^(k-1) - 1 for k levels), so variables with more levels are more prone to look like good splitters... so I'm looking for ways to compensate and adjust for this complexity. For example, if two variables produce comparable splits in the data but one contains 2 levels and the other 13 levels, then I would like to have the algorithm choose the 'simpler' split. Is this best done with the 'cost' argument in the rpart options? This defaults to one for all variables... so would it make sense to scale this by nlevels in each variable, or sqrt(nlevels), or something similar? Thanks, Landon
RE: [R] rpart question
AFAIK rpart does not have a built-in facility for adjusting bias in split selection. One possibility is to define your own splitting criterion that does the adjustment in some fashion. I believe the current version of rpart allows you to define a custom splitting criterion, but I have not tried it myself. Prof. Wei-Yin Loh at UW-Madison (and his current and former students) has worked on algorithms that compensate for bias in split selection. There is software on his web page that you might want to check out. HTH, Andy
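A small synthetic sketch of Landon's idea; the sqrt(nlevels) scaling is ad hoc, not something rpart prescribes. The 'cost' vector has one entry per predictor in the formula, and splits on a costlier variable are penalized, so a many-level factor must do proportionally better to be chosen.

```r
library(rpart)
set.seed(1)

# synthetic data: a 2-level and a 13-level factor competing to split y
n <- 200
d <- data.frame(y   = factor(sample(c("a", "b"),    n, replace = TRUE)),
                f2  = factor(sample(letters[1:2],   n, replace = TRUE)),
                f13 = factor(sample(letters[1:13],  n, replace = TRUE)))

# one cost per predictor, in formula order; scale by sqrt(number of levels)
costs <- sqrt(c(nlevels(d$f2), nlevels(d$f13)))

fit <- rpart(y ~ f2 + f13, data = d, cost = costs,
             control = rpart.control(minsplit = 10, cp = 0.001))
fit
```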
[R] RPART drawing the tree
Hello, I am using the rpart library to find patterns in HIV mutations related to drug resistance. My data consist of the amino acid at certain locations and two classes, resistant and susceptible. The classification and pruning work fine with rpart; however, there is a problem with displaying the data as a tree in the display window. The splits contain only single-letter level codes, for example: (abcde) left, (fg) right, but I would like to have the amino acids displayed. How can this be achieved? Rob Kamstra
Re: [R] RPART drawing the tree
On Thu, 29 Apr 2004, Rob Kamstra wrote: but i would like to have the aminoacids displayed. how can this be achieved ? By reading the documentation, as suggested in PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html - specifically by ?text.rpart, since that has an argument 'pretty' to control this. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
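A minimal sketch of the `pretty` fix, using the cu.summary data shipped with rpart (Country is a multi-level factor): pretty = 0 prints full factor-level names at the splits instead of the single-letter codes.

```r
library(rpart)
data(cu.summary)

fit <- rpart(Price ~ Mileage + Type + Country, data = cu.summary)

plot(fit, margin = 0.1)               # extra white space so labels fit
text(fit, pretty = 0, use.n = TRUE)   # full level names, not "a", "b", ...
```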
[R] rpart or mvpart
[R] rpart question on loss matrix
Hello again. I've looked through ?rpart, Atkinson & Therneau (1997), Chapter 10 of Venables and Ripley, Breiman et al., and the r-help archives, but haven't seen the answer to these two questions: 1) How does rpart deal with asymmetric loss matrices? Breiman et al. suggest some possibilities but, of course, do not say how rpart does it. 2) In the loss matrix, which direction (column or row) is 'truth' and which is 'output of the program'? E.g., if you have a 3-level DV (say the levels are A, B, C) and you want a higher cost for misclassifying as later in the alphabet, would it be

0 3 5
1 0 2
2 1 0

or

0 1 2
3 0 1
5 2 0

Thanks in advance, Peter
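As a hedged empirical check rather than an authoritative answer: my reading is that rows index the true class and columns the predicted class (L[i,j] = loss for classifying a true class-i case as j). That reading can be tested by making one direction of error very expensive and watching the predictions shift:

```r
library(rpart)
data(kyphosis)   # factor levels: "absent", "present"

# rows = true class (assumed), columns = predicted class;
# make "true present, predicted absent" ten times as costly
L <- matrix(c(0,  1,
              10, 0), nrow = 2, byrow = TRUE)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             parms = list(loss = L))

# under the rows-are-truth reading, predictions should shift toward "present"
table(predict(fit, type = "class"))
```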
Re: [R] rpart postscript graphics, Mac OS
On Tue, 18 Nov 2003, Paul Murrell wrote: Does it help if you precede the plot command with par(xpd=NA)? Well, the problem is known (calculating the required space for labelling etc. is hard), hence the argument 'margin' in plot.rpart(). ?plot.rpart tells you: margin: an extra percentage of white space to leave around the borders of the tree. (Long labels sometimes get cut off by the default computation). margin=0.1 is sufficient in most cases. Uwe Ligges
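Putting both remedies together as a minimal sketch (car.test.frame ships with rpart; `davb.tree` in the original post is replaced by a stand-in fit):

```r
library(rpart)
data(car.test.frame)

fit <- rpart(Mileage ~ Weight, data = car.test.frame)   # stand-in tree

postscript("tree.eps")
par(xpd = NA)                            # do not clip text at the plot region
plot(fit, uniform = TRUE, margin = 0.1)  # extra white space around the tree
text(fit, use.n = TRUE)
dev.off()
```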
[R] rpart postscript graphics, Mac OS
I am running R on Mac OS X 10.2.x. When I create postscript graphics of rpart tree objects, a tiny part of the tree gets trimmed off, even when it has only a few terminal nodes. This happens even without fancy, but is worse with fancy=T. (This doesn't happen with boxplots, scatter plots, etc.) How do I fix this?

postscript("tree.eps")
plot(davb.tree, u=T)
text(davb.tree, use.n=T, fancy=F)
dev.off()

Thanks, Kais
Re: [R] rpart postscript graphics, Mac OS
Hi Kaiser, It's hard to see your problem without the actual data to reproduce it. Does it help if you precede the plot command with par(xpd=NA)? Paul -- Dr Paul Murrell Department of Statistics The University of Auckland Private Bag 92019 Auckland New Zealand 64 9 3737599 x85392 [EMAIL PROTECTED] http://www.stat.auckland.ac.nz/~paul/
Re: [R] Rpart question - labeling nodes with something not in x$frame
On Thu, 17 Jul 2003, Peter Flom wrote: I have a tree created with tr.hh.logcas <- rpart(log(YCASSX + 1) ~ AGE + DRUGUSEY + SEX + OBSXNUM + WINDLE, xval = 10). I would like to label the nodes with YCASSX rather than log(YCASSX + 1). But the help file for text in library rpart says that you can only use labels that are part of x$frame, which YCASSX is not.

This may not be the best solution, but what I have done once is to add another column to the data frame with the labels I want. For example:

data(iris)
library(rpart)
# Recoding the response:
# s: setosa
# c: versicolor
# v: virginica
ir <- iris[, -5]
Species <- rep(c("s", "c", "v"), rep(50, 3))
ir <- as.data.frame(cbind(ir, Species))
ir.rp <- rpart(Species ~ ., data = ir)
plot(ir.rp)
text(ir.rp)

This is probably the long/silly way, but it works ;-D -- Cheers, Kevin -- On two occasions, I have been asked [by members of Parliament], 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question. -- Charles Babbage (1791-1871) From Computer Stupidities: http://rinkworks.com/stupid/ -- Ko-Kang Kevin Wang Master of Science (MSc) Student SLC Tutor and Lab Demonstrator Department of Statistics University of Auckland New Zealand Homepage: http://www.stat.auckland.ac.nz/~kwan022 Ph: 373-7599 x88475 (City) x88480 (Tamaki)
Re: [R] rpart vs. randomForest
Anonymous ([EMAIL PROTECTED]), on Sat, 12 Apr 2003 14:41:00 -0700, writes: Greetings. I'm trying to determine whether to use rpart or randomForest for a classification tree. Has anybody tested efficacy formally? I've run both, and the confusion matrix for rf beats rpart. I've been looking at the rf help page and am unable to figure out how to extract the tree. But more than that, I'm looking for a more comprehensive user's guide for randomForest, including the benefits of using it with MDS. Can anybody suggest a general guide? I've been finding a lot of broken links and CS-type web pages rather than an end-user's guide. Also, people's experience adjusting the mtry parameter would be useful. Breiman says that it isn't too sensitive, but I'm curious if anybody has had a different experience with it. Thanks in advance, and apologies if this is too general.

If you really read Breiman or, alternatively, remember English, you'll know that a forest has many trees... Regards, Martin Maechler [EMAIL PROTECTED] http://stat.ethz.ch/~maechler/
Re: [R] rpart v. lda classification.
On Tue, 11 Feb 2003, Rolf Turner wrote: I would have thought that rpart(), being unconstrained by a parametric model, would have a tendency to over-fit and therefore to appear to do better than lda() when the test data and training data are the same. Am I being silly, or is there something weird going on?

The first. rpart is seriously constrained by having so few observations, and its model is much more restricted than lda's: axis-parallel splits only. There is a similar example, with pictures, in MASS (on Cushings). -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
[R] rpart v. lda classification.
I've been groping my way through a classification/discrimination problem, from a consulting client. There are 26 observations, with 4 possible categories and 24 (!!!) potential predictor variables. I tried using lda() on the first 7 predictor variables and got 24 of the 26 observations correctly classified. (Training and testing both on the complete data set --- just to get started.) I then tried rpart() for comparison and was somewhat surprised when rpart() only managed to classify 14 of the 26 observations correctly. (I got the same classification using just the first 7 predictors as I did using all of the predictors.) I would have thought that rpart(), being unconstrained by a parametric model, would have a tendency to over-fit and therefore to appear to do better than lda() when the test data and training data are the same. Am I being silly, or is there something weird going on? I can give more detail on what I actually did, if anyone is interested. The data are pretty obviously nothing like Gaussian, so my gut feeling is that rpart() should be much more appropriate than lda(). And it does not seem surprising that with so few observations to train with, the success rate should be low, even when testing and training on the same data set. What does surprise me is that lda() gets such a high success rate. Should I just put this down as a random occurrence of a low-probability event? cheers, Rolf Turner [EMAIL PROTECTED] P.S. Using CV=TRUE in lda() I got only 16 of the 26 observations correctly classified.