[R] rpart weight prior

2007-07-09 Thread Aurélie Davranche

Hi!

Could you please explain the difference between prior and weights in 
rpart? They seem to be the same thing. But if so, why include a weights 
option in the latest versions? For an unbalanced sample, which is best 
to use: weights, priors, or both together?


Thanks a lot.

Aurélie Davranche.
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] rpart weight prior

2007-07-09 Thread Prof Brian Ripley

On Sun, 8 Jul 2007, Aurélie Davranche wrote:


Hi!

Could you please explain the difference between prior and weights in 
rpart? They seem to be the same thing. But if so, why include a weights 
option in the latest versions? For an unbalanced sample, which is best to 
use: weights, priors, or both together?


The 'weight' argument (sic) has been there for a decade, and is not the 
same as the 'prior' param.


The help file (which you seem unfamiliar with) says

 weights: optional case weights.

   parms: optional parameters for the splitting function. Anova
  splitting has no parameters. Poisson splitting has a single
  parameter, the coefficient of variation of the prior
  distribution on the rates.  The default value is 1.
  Exponential splitting has the same parameter as Poisson. For
  classification splitting, the list can contain any of: the
  vector of prior probabilities (component 'prior'), the loss
  matrix (component 'loss') or the splitting index (component
  'split').  The priors must be positive and sum to 1.  The
  loss matrix must have zeros on the diagonal and positive
  off-diagonal elements.  The splitting index can be 'gini' or
  'information'.  The default priors are proportional to the
  data counts, the losses default to 1, and the split defaults
  to 'gini'.

The rpart technical report at

http://mayoresearch.mayo.edu/mayo/research/biostat/upload/61.pdf

may help you understand this.
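To make the distinction concrete, here is a hedged sketch (the data frame and all variable names are invented for illustration): case weights multiply each observation's contribution everywhere in the fit, while a prior only rescales the class frequencies inside the splitting criterion.

```r
library(rpart)

## Hypothetical unbalanced two-class data; every name here is made up.
set.seed(1)
d <- data.frame(y  = factor(rep(c("a", "b"), c(180, 20))),
                x1 = rnorm(200),
                x2 = rnorm(200))

## Case weights: row i counts w[i] times in every node count,
## deviance and cross-validation fold.
w <- ifelse(d$y == "b", 9, 1)
fit.w <- rpart(y ~ x1 + x2, data = d, method = "class", weights = w)

## Priors: only the class frequencies inside the splitting criterion
## (and the fitted class probabilities) are rescaled; the rows
## themselves are untouched.
fit.p <- rpart(y ~ x1 + x2, data = d, method = "class",
               parms = list(prior = c(0.5, 0.5)))
```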

--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


[R] rpart-question regarding relation between cp and rel error

2007-03-06 Thread Ulrike Grömping

Dear useRs,

I may be temporarily (I hope :-)) confused, and I hope that someone can
answer this question that bugs me at the moment:

In the CP table of rpart, I thought the following equation should hold: 
 rel error = rel error(before) - (nsplit - nsplit(before)) * CP(before),
where (before) always denotes the entry in the row above.
While this equation holds for many rows of the CP tables I've looked at, it
doesn't hold for all. 

For example, in the table below, 0.67182 != 0.68405 - (47-38)*0.0010616,
with a difference of 0.002676, which appears larger than mere numerical
inaccuracy.

  CP nsplit rel error  xerror xstd
1  0.1820909  0   1.0 1.0 0.012890
2  0.0526194  1   0.81791 0.81768 0.012062
3  0.0070390  2   0.76529 0.76529 0.011780
4  0.0043850  4   0.75121 0.77660 0.011842
5  0.0036157  5   0.74683 0.77106 0.011812
6  0.0032310  8   0.73598 0.77083 0.011810
7  0.0026541  9   0.73275 0.77083 0.011810
8  0.0025387 14   0.71936 0.76829 0.011796
9  0.0016155 16   0.71429 0.76644 0.011786
10 0.0013847 20   0.70759 0.76206 0.011761
11 0.0011539 28   0.69605 0.76621 0.011785
12 0.0010616 38   0.68405 0.76875 0.011799
13 0.0010001 47   0.67182 0.76991 0.011805
14 0.0010000 57   0.66144 0.77060 0.011809

Can someone explain why/when this happens?
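For what it's worth, the identity can be probed on any fit; here is a sketch on the kyphosis data shipped with rpart (not the poster's data), comparing the actual per-row drop in rel error with what the printed CP would predict. Where several splits are added between two rows, their individual improvements need not all equal the single printed CP value, so the equality can fail.

```r
library(rpart)

fit  <- rpart(Kyphosis ~ ., data = kyphosis, cp = 1e-4, minsplit = 4)
tab  <- fit$cptable
drop <- -diff(tab[, "rel error"])   # actual decrease between rows
dn   <- diff(tab[, "nsplit"])       # number of splits added per row

## If the identity held exactly, 'implied' would equal the CP column.
cbind(implied = drop / dn, CP = tab[-1, "CP"])
```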

Regards, Ulrike



[R] rpart minimum sample size

2007-02-28 Thread Terry Therneau
  Look at rpart.control.  Rpart has two advisory parameters that control
the tree size at the smallest nodes:
minsplit (default 20): a node with less than this many subjects will
not be worth splitting

minbucket (default 7) : don't create any terminal nodes with fewer than 7 
observations

As I said, these are advisory, and reflect that these final splits are usually
not worthwhile.  They lead to a little faster run time, but mostly to a less
complex plotted model.
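In code, both advisory limits are set through rpart.control(); the values below are simply the defaults quoted above, made explicit:

```r
library(rpart)

## Defaults spelled out: nodes with fewer than 20 cases are not split,
## and no terminal node may contain fewer than 7 cases.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(minsplit = 20, minbucket = 7))
```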

  I am not nearly as pessimistic as Frank Harrell (need 20,000 observations).
Rpart often gives a good model -- one that predicts the outcome, and I find
the intermediate steps that it takes informative.  However, there are often many
trees with similar predictive ability, but a very different look in terms
of splitpoints and variables.  Saying that any given rpart model is THE best
is perilous.
Terry T.



[R] rpart minimum sample size

2007-02-27 Thread Amy Uhrin
Is there an optimal / minimum sample size for attempting to construct a 
classification tree using /rpart/?

I have 27 seagrass disturbance sites (boat groundings) that have been 
monitored for a number of years.  The monitoring protocol for each site 
is identical.  From the monitoring data, I am able to determine the 
level of recovery that each site has experienced.  Recovery is our 
categorical dependent variable with values of none, low, medium, high 
which are based upon percent seagrass regrowth into the injury over 
time.  I wish to be able to predict the level of recovery of future 
vessel grounding sites based upon a number of categorical / continuous 
predictor variables used here including (but not limited to) such 
parameters as:  sediment grain size, wave exposure, original size 
(volume) of the injury, injury age, injury location.

When I run /rpart/, the data is split into only two terminal nodes based 
solely upon values of the original volume of each injury.  No other 
predictor variables are considered, even though I have included about 
six of them in the model.  When I remove volume from the model the same 
thing happens but with injury area - two terminal nodes are formed based 
upon area values and no other variables appear.  I was hoping that this 
was a programming issue, me being a newbie and all, but I really think 
I've got the code right.  Now I am beginning to wonder if my N is too 
small for this method?

-- 
Amy V. Uhrin, Research Ecologist

NOAA, National Ocean Service
Center for Coastal Fisheries and Habitat Research
101 Pivers Island Road
Beaufort, NC 28516
(252) 728-8778
(252) 728-8784 (fax)
[EMAIL PROTECTED]


 \!/ \!/   :}   \!/ \!/  ^**^  \!/ \!/ 





Re: [R] rpart minimum sample size

2007-02-27 Thread Wensui Liu
amy,
without looking at your actual code, I would suggest you take a
look at rpart.control()
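A hedged sketch of what that might look like for a 27-row data set; 'recovery' and 'seagrass' are stand-ins for the poster's own names, and loosening the limits this far obviously invites overfitting at n = 27:

```r
library(rpart)

## Placeholder names; substitute the real data frame and formula.
fit <- rpart(recovery ~ ., data = seagrass, method = "class",
             control = rpart.control(minsplit = 5,   # default 20
                                     minbucket = 2,  # default 7
                                     cp = 0.001))
```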

On 2/27/07, Amy Uhrin [EMAIL PROTECTED] wrote:
 Is there an optimal / minimum sample size for attempting to construct a
 classification tree using /rpart/?

 I have 27 seagrass disturbance sites (boat groundings) that have been
 monitored for a number of years.  The monitoring protocol for each site
 is identical.  From the monitoring data, I am able to determine the
 level of recovery that each site has experienced.  Recovery is our
 categorical dependent variable with values of none, low, medium, high
 which are based upon percent seagrass regrowth into the injury over
 time.  I wish to be able to predict the level of recovery of future
 vessel grounding sites based upon a number of categorical / continuous
 predictor variables used here including (but not limited to) such
 parameters as:  sediment grain size, wave exposure, original size
 (volume) of the injury, injury age, injury location.

 When I run /rpart/, the data is split into only two terminal nodes based
 solely upon values of the original volume of each injury.  No other
 predictor variables are considered, even though I have included about
 six of them in the model.  When I remove volume from the model the same
 thing happens but with injury area - two terminal nodes are formed based
 upon area values and no other variables appear.  I was hoping that this
 was a programming issue, me being a newbie and all, but I really think
 I've got the code right.  Now I am beginning to wonder if my N is too
 small for this method?

 --
 Amy V. Uhrin, Research Ecologist

 NOAA, National Ocean Service
 Center for Coastal Fisheries and Habitat Research
 101 Pivers Island Road
 Beaufort, NC 28516
 (252) 728-8778
 (252) 728-8784 (fax)
 [EMAIL PROTECTED]

 
  \!/ \!/   :}   \!/ \!/  ^**^  \!/ \!/






-- 
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)



Re: [R] rpart minimum sample size

2007-02-27 Thread Frank E Harrell Jr
Amy Uhrin wrote:
 Is there an optimal / minimum sample size for attempting to construct a 
 classification tree using /rpart/?
 
 I have 27 seagrass disturbance sites (boat groundings) that have been 
 monitored for a number of years.  The monitoring protocol for each site 
 is identical.  From the monitoring data, I am able to determine the 
 level of recovery that each site has experienced.  Recovery is our 
 categorical dependent variable with values of none, low, medium, high 
 which are based upon percent seagrass regrowth into the injury over 
 time.  I wish to be able to predict the level of recovery of future 
 vessel grounding sites based upon a number of categorical / continuous 
 predictor variables used here including (but not limited to) such 
 parameters as:  sediment grain size, wave exposure, original size 
 (volume) of the injury, injury age, injury location.
 
 When I run /rpart/, the data is split into only two terminal nodes based 
 solely upon values of the original volume of each injury.  No other 
 predictor variables are considered, even though I have included about 
 six of them in the model.  When I remove volume from the model the same 
 thing happens but with injury area - two terminal nodes are formed based 
 upon area values and no other variables appear.  I was hoping that this 
 was a programming issue, me being a newbie and all, but I really think 
 I've got the code right.  Now I am beginning to wonder if my N is too 
 small for this method?
 

In my experience N needs to be around 20,000 to get both good accuracy 
and replicability of patterns if the number of potential predictors is 
not tiny.  In general, the R^2 from rpart is not competitive with that 
from an intelligently fitted regression model.  It's just a difficult 
problem, when relying on a single tree (hence the popularity of random 
forests, bagging, boosting).

Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



[R] rpart with overdispersed count data?

2007-02-25 Thread David Farrar
I would like to do recursive partitioning when the response is a count
variable subject to overdispersion, using say a negative binomial
likelihood or something like quasipoisson in glm.  I would appreciate
any thoughts on how to go about this (theory/computation).  If I
understand the rpart documentation, I would need to write a method
argument, but the details are not there.  Therefore, a second question
is whether/where one can get material on developing new rpart
implementations.

regards,
Farrar
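For reference, the user-written 'method' interface is a list of init/eval/split functions. Below is a rough, untested sketch modelled on the least-squares example in the rpart documentation on user-written split functions; a negative binomial or quasi-Poisson criterion would replace the deviance arithmetic, only continuous predictors are handled, and 'count' and 'mydata' are placeholders:

```r
library(rpart)

itemp <- function(y, offset, parms, wt) {
  ## describe the response: one fitted value, one column of y
  list(y = y, parms = parms, numresp = 1, numy = 1,
       summary = function(yval, dev, wt, ylevel, digits)
         paste("mean =", format(signif(yval, digits))))
}

etemp <- function(y, wt, parms) {
  ## node label and deviance (weighted least squares in this sketch)
  wmean <- sum(y * wt) / sum(wt)
  list(label = wmean, deviance = sum(wt * (y - wmean)^2))
}

stemp <- function(y, wt, x, parms, continuous) {
  ## goodness of each candidate split of the (sorted) x values
  if (!continuous) stop("categorical predictors not handled in this sketch")
  n     <- length(y)
  lsum  <- cumsum(y * wt)[-n]
  lwt   <- cumsum(wt)[-n]
  rsum  <- sum(y * wt) - lsum
  rwt   <- sum(wt) - lwt
  lmean <- lsum / lwt
  rmean <- rsum / rwt
  grand <- sum(y * wt) / sum(wt)
  ## decrease in deviance (between-groups sum of squares)
  goodness <- lwt * (lmean - grand)^2 + rwt * (rmean - grand)^2
  list(goodness = goodness, direction = sign(lmean - rmean))
}

fit <- rpart(count ~ ., data = mydata,
             method = list(init = itemp, eval = etemp, split = stemp))
```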




[R] rpart tree node label

2007-02-14 Thread Aimin Yan
I generated a tree using rpart.
In the nodes of the tree, splits are based on a factor.
I want to label these nodes with the levels of this factor.

Does anyone know how to do this?

Thanks,

Aimin



Re: [R] rpart tree node label

2007-02-14 Thread Wensui Liu
not sure how you want to label it.
could you be more specific?
thanks.

On 2/14/07, Aimin Yan [EMAIL PROTECTED] wrote:
 I generate a tree use rpart.
 In the node of tree, split is based on the some factor.
 I want to label these node based on the levels of this factor.

 Does anyone know how to do this?

 Thanks,

 Aimin




-- 
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)



Re: [R] rpart tree node label

2007-02-14 Thread Aimin Yan
> levels(training$aa_one)
 [1] "A" "C" "D" "E" "F" "H" "I" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "V" "W" "Y"
These are the 19 levels of aa_one.

When I look at the tree, one node is labeled

aa_one=bcdfgknop

so it is obviously labeled with single alphabet letters, not with the
levels of aa_one.

I want to get something like aa_one=CDE.. instead.

Do you know how to do this?

Aimin



At 04:23 PM 2/14/2007, Wensui Liu wrote:
not sure how you want to label it.
could you be more specific?
thanks.

On 2/14/07, Aimin Yan [EMAIL PROTECTED] wrote:
I generate a tree use rpart.
In the node of tree, split is based on the some factor.
I want to label these node based on the levels of this factor.

Does anyone know how to do this?

Thanks,

Aimin



--
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)



Re: [R] rpart tree node label [Broadcast]

2007-02-14 Thread Liaw, Andy
Try the following to see:

library(rpart)
iris.rp <- rpart(Sepal.Length ~ Species, iris)
plot(iris.rp)
text(iris.rp)

Two possible solutions:

1. Use text(..., pretty=0).  See ?text.rpart.
2. Use post(..., filename=).
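Sketched out on the same toy fit (Species being the factor predictor), the two fixes look like this; the filename is just an example:

```r
library(rpart)

iris.rp <- rpart(Sepal.Length ~ Species, data = iris)
plot(iris.rp)
text(iris.rp, pretty = 0)   # spell out factor levels, not a/b/c codes

## or a PostScript plot with full split labels:
post(iris.rp, filename = "iris-rp.ps")
```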

Andy 

From: Wensui Liu
 
 not sure how you want to label it.
 could you be more specific?
 thanks.
 
 On 2/14/07, Aimin Yan [EMAIL PROTECTED] wrote:
  I generate a tree use rpart.
  In the node of tree, split is based on the some factor.
  I want to label these node based on the levels of this factor.
 
  Does anyone know how to do this?
 
  Thanks,
 
  Aimin
 
 
 
 
 --
 WenSui Liu
 A lousy statistician who happens to know a little programming
 (http://spaces.msn.com/statcompute/blog)
 
 
 
 





[R] rpart

2007-02-05 Thread Aimin Yan
Hello,
I have a question about rpart.
I am using it to predict a continuous variable,
but I get different prediction accuracy for the same training set.
Does anyone know why?

Aimin



Re: [R] rpart

2007-02-05 Thread Aimin Yan
Yes, I use the same settings, and I calculate MSE and CC as
prediction accuracy measures.
Someone told me  I should not trust one tree and should do bagging.
Is this correct?
Aimin

At 03:11 PM 2/5/2007, Wensui Liu wrote:
are you sure you are using the same setting,  tree size, and so on?

On 2/5/07, Aimin Yan [EMAIL PROTECTED] wrote:
Hello,
I have a question for rpart,
I try to use it to do prediction for a continuous variable.
But I get the different prediction accuracy for same training set,
anyone know why?

Aimin



--
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)



Re: [R] rpart

2007-02-05 Thread Wensui Liu
man, oh, man
Surely you can use bagging, or probably boosting. But that doesn't
answer your question, does it?
Believe me, even if you use bagging, the result will vary depending on set.seed().
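On the narrower question of why the numbers change between runs: the tree itself is deterministic, but the xerror column comes from a random 10-fold cross-validation, so it moves with the seed. A small sketch on a data set shipped with rpart:

```r
library(rpart)

set.seed(42)
fit1 <- rpart(Mileage ~ Weight + Type, data = car.test.frame)
set.seed(42)
fit2 <- rpart(Mileage ~ Weight + Type, data = car.test.frame)

## Same seed, same cp table (including the cross-validated xerror).
stopifnot(identical(fit1$cptable, fit2$cptable))
```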

On 2/5/07, Aimin Yan [EMAIL PROTECTED] wrote:
 Yes, I use the same setting, and I calculate MSE and CC as
 prediction accuracy measure.
 Someone told me  I should not trust one tree and should do bagging.
 Is this correct?
 Aimin

 At 03:11 PM 2/5/2007, Wensui Liu wrote:
 are you sure you are using the same setting,  tree size, and so on?
 
 On 2/5/07, Aimin Yan [EMAIL PROTECTED] wrote:
 Hello,
 I have a question for rpart,
 I try to use it to do prediction for a continuous variable.
 But I get the different prediction accuracy for same training set,
 anyone know why?
 
 Aimin
 
 
 
 --
 WenSui Liu
 A lousy statistician who happens to know a little programming
 (http://spaces.msn.com/statcompute/blog)





-- 
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)



Re: [R] rpart question

2007-01-25 Thread Prof Brian Ripley
On Thu, 25 Jan 2007, Aimin Yan wrote:

 I make classification tree like this code
 p.t2.90 <- rpart(y ~ aa_three + bas + bcu + aa_ss,
 data = training, method = "class", control = rpart.control(cp = 0.0001))

 Here I want to set weight for 4 predictors(aa_three,bas,bcu,aa_ss).

 I know that there is a weight set-up in rpart.
 Can this set-up satisfy my need?

It depends on what _you_ mean by 'set weight'.  You will need to tell us 
in detail what exactly you want the weights to do.

Using the 'weights' argument is specifying case weights (as the help 
says).  There are also 'cost' and 'parms' for other aspects of weighting.

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595



[R] rpart question

2007-01-24 Thread Aimin Yan
I make a classification tree with this code:
p.t2.90 <- rpart(y ~ aa_three + bas + bcu + aa_ss,
                 data = training, method = "class", control = rpart.control(cp = 0.0001))

Here I want to set weights for the 4 predictors (aa_three, bas, bcu, aa_ss).

I know that there is a weights set-up in rpart.
Can this set-up satisfy my need?

If so, could someone give me an example?

Thanks,

Aimin Yan



[R] rpart - I'm confused by the loss matrix

2006-11-09 Thread Barbora Arendacká
Hello,

As I couldn't find anywhere in the rpart help which element of the
loss matrix means which loss, I played with this parameter and became
a bit confused.
What I did was this:
I used kyphosis data(classification absent/present, number of 'absent'
cases is 64, of 'present' cases 17)
and I tried the following

> lmat <- matrix(c(0, 17, 64, 0), ncol = 2)
> lmat
     [,1] [,2]
[1,]    0   64
[2,]   17    0

> set.seed(1003)
> fit1 <- rpart(Kyphosis ~ ., data = kyphosis, parms = list(loss = lmat))

> set.seed(1003)
> fit2 <- rpart(Kyphosis ~ ., data = kyphosis, parms = list(prior = c(0.5, 0.5)))

The results I obtained were identical, so I concluded that the losses were
[L(true, predicted)]:

L(absent,present)=17
L(present,absent)=64.

And thus the arrangement of the elements in the loss matrix seemed
clear as absent is considered as class 1 and present as class 2 and my
problem seemed to be solved. However, I tried also this:

residuals(fit1)

and became confused: for each misclassified 'absent' the
residual (which should be the loss in this case) was 64, while for a
misclassified 'present' it was 17, in contradiction to the above.

So am I wrong somewhere? Is the arrangement of elements in the loss
matrix such as I deduced it from fitting fit1 and fit2?
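One arithmetic check that may help (a sketch, not a statement of rpart internals): a loss matrix can be folded into "altered priors" proportional to prior_i times the total loss attached to true class i. Trying both orientations of the matrix shows which one reproduces the observed equivalence with prior = c(0.5, 0.5):

```r
lmat <- matrix(c(0, 17, 64, 0), ncol = 2)
p <- c(64, 17) / 81                 # default priors: the data counts

a.rows <- p * rowSums(lmat)         # rows taken as the true class
a.rows / sum(a.rows)

a.cols <- p * colSums(lmat)         # columns taken as the true class
a.cols / sum(a.cols)                # this orientation gives 0.5, 0.5
```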

Thanks for any comments.

Barbora



[R] rpart

2006-09-26 Thread henrigel
Dear r-help-list:

If I use the rpart method like

cfit <- rpart(y ~ ., data = data, ...),

what kind of tree is stored in cfit?
Is it right that this tree is not pruned at all, that it is the full tree?

If so, it's up to me to choose a subtree by using the printcp method.
In the technical report by Atkinson and Therneau, "An Introduction to
Recursive Partitioning Using the RPART Routines" from 2000, one can see the
following table on page 15:

  CP  nsplit  relerror  xerror   xstd
1   0.105   0 1.0   1.   0.108
2   0.056   3 0.68519   1.1852   0.111
3   0.028   4 0.62963   1.0556   0.109
4   0.574   6 0.57407   1.0556   0.109
5   0.100   7 0.6   1.0556   0.109

Some lines below it says "We see that the best tree has 5 terminal nodes (4
splits)". Why that, if the xerror is lowest for the tree consisting only of
the root?

Thank you very much for your help

Henri 
--



Re: [R] rpart

2006-09-26 Thread Prof Brian Ripley
On Mon, 25 Sep 2006, [EMAIL PROTECTED] wrote:

 Dear r-help-list:

 If I use the rpart method like

 cfit <- rpart(y ~ ., data = data, ...),

 what kind of tree is stored in cfit?
 Is it right that this tree is not pruned at all, that it is the full tree?

It is an rpart object.  This contains both the tree and the instructions 
for pruning it at all values of cp: note that cp is also used in deciding 
how large a tree to grow.

 If so, it's up to me to choose a subtree by using the printcp method.

Or the plotcp method.

 In the technical report from Atkinson and Therneau An Introduction to 
 recursive partitioning using the rpart routines from 2000, one can see 
 the following table on page 15:

  CP  nsplit  relerror  xerror   xstd
 1   0.105   0 1.0   1.   0.108
 2   0.056   3 0.68519   1.1852   0.111
 3   0.028   4 0.62963   1.0556   0.109
 4   0.574   6 0.57407   1.0556   0.109
 5   0.100   7 0.6   1.0556   0.109

 Some lines below it says We see that the best tree has 5 terminal nodes 
 (4 splits). Why that if the xerror is the lowest for the tree only 
 consisting of the root?

There are *two* reports with that name: this seems to be from minitech.ps.
The choice is explained in the rest of that para (the 1-SE rule was used).
My guess is that the authors excluded the root as not being a tree, but 
only they can answer that.
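In code, growing a deliberately large tree and then cutting it back from the cp table, using the 1-SE rule just mentioned, might look like this sketch on the kyphosis data:

```r
library(rpart)

## Grow large (small cp), then choose the cp afterwards from the table.
fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
             cp = 0.001, minsplit = 5)
tab <- fit$cptable

best   <- which.min(tab[, "xerror"])
cutoff <- tab[best, "xerror"] + tab[best, "xstd"]  # 1-SE threshold
pick   <- which(tab[, "xerror"] <= cutoff)[1]      # smallest tree under it
pruned <- prune(fit, cp = tab[pick, "CP"])
```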

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] rpart

2006-09-26 Thread Prof Brian Ripley
On Tue, 26 Sep 2006, [EMAIL PROTECTED] wrote:


  Original-Nachricht 
 Datum: Tue, 26 Sep 2006 09:56:53 +0100 (BST)
 Von: Prof Brian Ripley [EMAIL PROTECTED]
 An: [EMAIL PROTECTED]
 Betreff: Re: [R] rpart

 On Mon, 25 Sep 2006, [EMAIL PROTECTED] wrote:

 Dear r-help-list:

 If I use the rpart method like

 cfit <- rpart(y ~ ., data = data, ...),

 what kind of tree is stored in cfit?
 Is it right that this tree is not pruned at all, that it is the full
 tree?

 It is an rpart object.  This contains both the tree and the instructions
 for pruning it at all values of cp: note that cp is also used in deciding
 how large a tree to grow.


 Ok, I have to explain my problem a little bit more in detail, I'm sorry for 
 being so vague:
 I used the method in the following way:
 cfit <- rpart(y ~ ., method = "class", minsplit = 1, cp = 0)
 I got a tree with a lot of terminal nodes that contained more than 100
 observations. This made me believe that the tree was already pruned.
 On the other hand, the printcp method showed subtrees that were better.
 This made me believe that the tree hadn't been pruned before.
 So, are the trees a little bit pruned?

Yes, as you asked for cp=0.  Look up what that does in ?rpart.control.

 If so, it's up to me to choose a subtree by using the printcp method.

 Or the plotcp method.

 In the technical report from Atkinson and Therneau An Introduction to
 recursive partitioning using the rpart routines from 2000, one can see
 the following table on page 15:

  CP  nsplit  relerror  xerror   xstd
 1   0.105   0 1.0   1.   0.108
 2   0.056   3 0.68519   1.1852   0.111
 3   0.028   4 0.62963   1.0556   0.109
 4   0.574   6 0.57407   1.0556   0.109
 5   0.100   7 0.6   1.0556   0.109

 Some lines below it says We see that the best tree has 5 terminal nodes
 (4 splits). Why that if the xerror is the lowest for the tree only
 consisting of the root?

 There are *two* reports with that name: this seems to be from minitech.ps.
 The choice is explained in the rest of that para (the 1-SE rule was used).
 My guess is that the authors excluded the root as not being a tree, but
 only they can answer that.


 Are both reports from 2000? But you're right, I'm talking about the one from
 minitech.ps.
 The 1-SE-rule only explains why they didn't choose the tree with 6 or 7 
 splits, but not why they didn't choose the tree without a split.
 The exclusion of the root as not being a tree was my first explanation, too. 
 But if the tree only consisting of the root is still better than any other 
 tree, why would I choose a tree with 4 splits then?

 Henri



-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] rpart

2006-09-26 Thread henrigel

 Original-Nachricht 
Datum: Tue, 26 Sep 2006 09:56:53 +0100 (BST)
Von: Prof Brian Ripley [EMAIL PROTECTED]
An: [EMAIL PROTECTED]
Betreff: Re: [R] rpart

 On Mon, 25 Sep 2006, [EMAIL PROTECTED] wrote:
 
  Dear r-help-list:
 
  If I use the rpart method like
 
  cfit <- rpart(y ~ ., data = data, ...),
 
  what kind of tree is stored in cfit?
  Is it right that this tree is not pruned at all, that it is the full
 tree?
 
 It is an rpart object.  This contains both the tree and the instructions 
 for pruning it at all values of cp: note that cp is also used in deciding 
 how large a tree to grow.
 

Ok, I have to explain my problem a little bit more in detail, I'm sorry for 
being so vague:
I used the method in the following way:
cfit <- rpart(y ~ ., method = "class", minsplit = 1, cp = 0)
I got a tree with a lot of terminal nodes that contained more than 100 
observations. This made me believe that the tree was already pruned.
On the other hand, the printcp method showed subtrees that were better.
This made me believe that the tree hadn't been pruned before.
So, are the trees a little bit pruned? 

  If so, it's up to me to choose a subtree by using the printcp method.
 
 Or the plotcp method.
 
  In the technical report from Atkinson and Therneau An Introduction to 
  recursive partitioning using the rpart routines from 2000, one can see 
  the following table on page 15:
 
   CP  nsplit  relerror  xerror   xstd
  1   0.105   0 1.0   1.   0.108
  2   0.056   3 0.68519   1.1852   0.111
  3   0.028   4 0.62963   1.0556   0.109
  4   0.574   6 0.57407   1.0556   0.109
  5   0.100   7 0.6   1.0556   0.109
 
  Some lines below it says "We see that the best tree has 5 terminal nodes
  (4 splits)". Why is that, if the xerror is the lowest for the tree only 
  consisting of the root?
 
 There are *two* reports with that name: this seems to be from minitech.ps.
 The choice is explained in the rest of that para (the 1-SE rule was used).
 My guess is that the authors excluded the root as not being a tree, but 
 only they can answer that.
 

Are both reports from 2000? But you're right, I'm talking about the one from 
minitech.ps.
The 1-SE-rule only explains why they didn't choose the tree with 6 or 7 splits, 
but not why they didn't choose the tree without a split.
The exclusion of the root as not being a tree was my first explanation, too. 
But if the tree only consisting of the root is still better than any other 
tree, why would I choose a tree with 4 splits then?  

Henri

--

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] rpart

2006-09-26 Thread henrigel

 Original-Nachricht 
Datum: Tue, 26 Sep 2006 12:54:22 +0100 (BST)
Von: Prof Brian Ripley [EMAIL PROTECTED]
An: [EMAIL PROTECTED]
Betreff: Re: [R] rpart

 On Tue, 26 Sep 2006, [EMAIL PROTECTED] wrote:
 
 
   Original-Nachricht 
  Datum: Tue, 26 Sep 2006 09:56:53 +0100 (BST)
  Von: Prof Brian Ripley [EMAIL PROTECTED]
  An: [EMAIL PROTECTED]
  Betreff: Re: [R] rpart
 
  On Mon, 25 Sep 2006, [EMAIL PROTECTED] wrote:
 
  Dear r-help-list:
 
  If I use the rpart method like
 
  cfit <- rpart(y ~ ., data = data, ...),
 
  what kind of tree is stored in cfit?
  Is it right that this tree is not pruned at all, that it is the full
  tree?
 
  It is an rpart object.  This contains both the tree and the
 instructions
  for pruning it at all values of cp: note that cp is also used in
 deciding
  how large a tree to grow.
 
 
  Ok, I have to explain my problem a little bit more in detail, I'm sorry
 for being so vague:
  I used the method in the following way:
  cfit <- rpart(y ~ ., method = "class", minsplit = 1, cp = 0)
  I got a tree with a lot of terminal nodes that contained more than 100
 observations. This made me believe that the tree was already pruned.
  On the other hand, the printcp method showed subtrees that were
 better.
  This made me believe that the tree hadn't been pruned before.
  So, are the trees a little bit pruned?
 
 Yes, as you asked for cp=0.  Look up what that does in ?rpart.control.
 

I thought I would get a full tree by choosing cp=0 - and it was one.
The nodes with more than 100 observations were not split further because there 
was no sequence of splits which made the class label change for any subset. (A 
bad explanation, but you probably know what I mean.) I realized that when I 
chose cp=-1. Thank you very much for your help!  

  If so, it's up to me to choose a subtree by using the printcp method.
 
  Or the plotcp method.
 
  In the technical report from Atkinson and Therneau An Introduction to
  recursive partitioning using the rpart routines from 2000, one can
 see
  the following table on page 15:
 
      CP  nsplit  rel error  xerror    xstd
  1  0.105      0    1.0      1.       0.108
  2  0.056      3    0.68519  1.1852   0.111
  3  0.028      4    0.62963  1.0556   0.109
  4  0.574      6    0.57407  1.0556   0.109
  5  0.100      7    0.6      1.0556   0.109
 
  Some lines below it says "We see that the best tree has 5 terminal
 nodes
  (4 splits)". Why is that, if the xerror is the lowest for the tree only
  consisting of the root?
 
  There are *two* reports with that name: this seems to be from
 minitech.ps.
  The choice is explained in the rest of that para (the 1-SE rule was
 used).
  My guess is that the authors excluded the root as not being a tree, but
  only they can answer that.
 
 
  Are both reports from 2000? But you're right, I'm talking about the one
 from minitech.ps.
  The 1-SE-rule only explains why they didn't choose the tree with 6 or 7
 splits, but not why they didn't choose the tree without a split.
  The exclusion of the root as not being a tree was my first explanation,
 too. But if the tree only consisting of the root is still better than any
 other tree, why would I choose a tree with 4 splits then?
 
  

Henri


--

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Rpart, custom penalty for an error

2006-09-15 Thread Maciej Bliziński
On Sun, 2006-09-10 at 20:36 +0100, Prof Brian Ripley wrote: 
  I am however interested in areas where the probability of success is 
  noticeably higher than 5%, for example 20%. I've tried rpart and the 
  weights option, increasing the weights of the success-observations.
 
 You are 'misleading' rpart by using 'weights', claiming to have case
 weights for cases you do not have.  You need to use 'cost' instead.

As for the rpart() function, the `cost' parameter is for scaling the
variables, not for the cost of misclassifications. To specify it, the
parameter `parms' needs to be used, as a list with a `loss' element, in
form of a matrix. In other words, cost parm is not for cost, use loss
parm of the parms parm. Example usage:

tr <- rpart(y ~ x, data = some.data, method = 'class',
            parms = list(loss = matrix(c(0, 1, 20, 0), nrow = 2)))
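For completeness, here is a self-contained sketch on simulated data (the variable names and the 20:1 cost ratio are my own illustration, not from the thread). rpart orders factor levels alphabetically, and the loss matrix is indexed [true class, predicted class], so with levels "no" < "yes" the entry in row 2, column 1 is the cost of predicting "no" when the truth is "yes":

```r
library(rpart)

set.seed(1)
x <- rnorm(2000)
# Rare "yes" class, roughly 10% of cases
y <- factor(ifelse(runif(2000) < plogis(2 * x - 3), "yes", "no"))

fit_plain <- rpart(y ~ x, method = "class")
# Loss matrix: rows = true class, columns = predicted class.
# loss[2, 1] = 20 penalises missing a true "yes" twenty-fold.
fit_loss <- rpart(y ~ x, method = "class",
                  parms = list(loss = matrix(c(0, 20, 1, 0), nrow = 2)))

# The cost-sensitive tree should call "yes" at least as often:
table(predict(fit_plain, type = "class"))
table(predict(fit_loss, type = "class"))
```

The diagonal must be zero and the off-diagonal entries positive, as the help page quoted earlier in this thread requires.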

 This is a standard issue, discussed in all good books on classification
 (including mine).

Yes, in MASS, section 12.2, Classification Theory, page 338 (fourth edition).
I was looking for it in section 9.2, where rpart() is discussed.

Thanks!

Regards,
Maciej

-- 
http://automatthias.wordpress.com

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Rpart, custom penalty for an error

2006-09-10 Thread Maciej Bliziński
Hello all R-help list subscribers,

I'd like to create a regression tree of a data set with binary response
variable. Only 5% of observations are a success, so the regression tree
will not find really any variable value combinations that will yield
more than 50% of probability of success. I am however interested in
areas where the probability of success is noticeably higher than 5%, for
example 20%. I've tried rpart and the weights option, increasing the
weights of the success-observations.

It works as expected in terms of the tree creation: instead of a single
root, a tree is being built. But the tree plot() and text() are somewhat
misleading. I'm interested in the observation counts inside each leaf.
I use the use.n = TRUE parameter. The counts displayed are misleading,
the numbers of successes are not the original numbers from the sample,
they seem to be cloned success-observations.

I'd like to split the tree just as weights parameter allows me to,
keeping the original number of observations in the tree plot. Is it
possible? If yes, how?

Kind regards,
Maciej

-- 
Maciej Bliziński [EMAIL PROTECTED]
http://automatthias.wordpress.com

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Rpart, custom penalty for an error

2006-09-10 Thread Prof Brian Ripley
On Sun, 10 Sep 2006, Maciej Bliziński wrote:

 Hello all R-help list subscribers,
 
 I'd like to create a regression tree of a data set with binary response
 variable. Only 5% of observations are a success, so the regression tree
 will not find really any variable value combinations that will yield
 more than 50% of probability of success. 

This would be a misuse of a regression tree, for the exact problem for 
which classification trees were designed.

 I am however interested in areas where the probability of success is 
 noticeably higher than 5%, for example 20%. I've tried rpart and the 
 weights option, increasing the weights of the success-observations.

You are 'misleading' rpart by using 'weights', claiming to have case
weights for cases you do not have.  You need to use 'cost' instead.

This is a standard issue, discussed in all good books on classification
(including mine).

 It works as expected in terms of the tree creation: instead of a single
 root, a tree is being built. But the tree plot() and text() are somewhat
 misleading. I'm interested in the observation counts inside each leaf.
 I use the use.n = TRUE parameter. The counts displayed are misleading,
 the numbers of successes are not the original numbers from the sample,
 they seem to be cloned success-observations.

They _are_ the original numbers, for that is what 'case weights' means.

 I'd like to split the tree just as weights parameter allows me to,
 keeping the original number of observations in the tree plot. Is it
 possible? If yes, how?
 
 Kind regards,
 Maciej

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] rpart output: rule extraction beyond path.rpart()

2006-08-22 Thread Bryant, Benjamin
 

Greetings - 

 

Is there a way to automatically perform what I believe is called rule
extraction (by Quinlan and the machine learning community at least) for
the leaves of trees generated by rpart?  I can use path.rpart() to
automatically extract the paths to the leaves, but these can be
needlessly cumbersome.  For example, one path returned by path.rpart()
might be:

 

[1] root       y>=-0.1905 y< 0.1495  z>=-0.19   z< 0.1785 

[6] y>=-0.1385 z>=-0.153  x< 0.37    x>=-0.363

 

But the y>=-0.1905 and z>=-0.19 are both redundant, given restrictions
placed further down the tree.  Simplifying the paths by hand is feasible
for small trees but quite cumbersome when dimensionality increases.  I
can think of ways to write code to do this automatically, but would
prefer not to if it's already implemented.  I have done extensive
searching and turned up nothing, but I fear I might just be lacking the
right terminology.  Any thoughts?
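In case it helps later readers, here is a rough base-R sketch (entirely my own, not part of rpart or any package) that keeps only the tightest bound per variable from a path.rpart()-style rule. It assumes numeric splits printed as "var>=value" or "var< value", the format path.rpart uses; categorical splits are ignored:

```r
# Collapse a vector of split conditions to the tightest bound per
# variable: for ">=" keep the largest lower bound, for "< " keep the
# smallest upper bound.
simplify_path <- function(path) {
  path <- setdiff(path, "root")
  bounds <- list()
  for (p in path) {
    m <- regmatches(p, regexec("^([^<>=]+)(>=|< )(-?[0-9.]+)$", p))[[1]]
    if (length(m) == 0) next            # skip non-numeric split labels
    op <- m[3]; val <- as.numeric(m[4])
    key <- paste0(m[2], op)
    old <- bounds[[key]]
    if (is.null(old) || (op == ">=" && val > old) || (op == "< " && val < old))
      bounds[[key]] <- val
  }
  paste0(names(bounds), unlist(bounds))
}

# The two lower bounds on y collapse into the tighter y>=-0.1385:
simplify_path(c("root", "y>=-0.1905", "y< 0.1495", "y>=-0.1385"))
```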

 

Much appreciated,

-Ben

 

Ben Bryant

Doctoral Fellow

Pardee RAND Graduate School   

[EMAIL PROTECTED]

 

 





This email message is for the sole use of the intended recip...{{dropped}}

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] rpart unbalanced data

2006-07-21 Thread helen . mills
Hello all,
I am currently working with rpart to classify vegetation types by spectral
characteristics, and am coming up with poor classifications based on the fact
that I have some vegetation types that have only 15 observations, while others
have over 100. I have attempted to supply prior weights to the dataset, though
this does not improve the classification greatly. Could anyone supply some
hints about how to improve a classification for a badly unbalanced dataset?

Thank you,
Helen Mills Poulos

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] rpart unbalanced data

2006-07-21 Thread Dr. Diego Kuonen
Dear Helen,

You may want to have a look at

  http://www.togaware.com/datamining/survivor/Predicting_Fraud.html

Greets,

  Diego Kuonen


[EMAIL PROTECTED] wrote:
 Hello all,
 I am currently working with rpart to classify vegetation types by spectral
 characteristics, and am coming up with poor classifications based on the fact
 that I have some vegetation types that have only 15 observations, while others
 have over 100. I have attempted to supply prior weights to the dataset, though
 this does not improve the classification greatly. Could anyone supply some
 hints about how to improve a classification for a badly unbalanced dataset?
 
 Thank you,
 Helen Mills Poulos

-- 
Dr. ès sc. Diego Kuonen, CEOphone  +41 (0)21 693 5508
Statoo Consulting   fax+41 (0)21 693 8765
PO Box 107  mobile +41 (0)78 709 5384
CH-1015 Lausanne 15 email   [EMAIL PROTECTED]
web   http://www.statoo.info   skype Kuonen.Statoo.Consulting
-
| Statistical Consulting + Data Analysis + Data Mining Services |
-
+  Are you drowning in information and starving for knowledge?  +
+  Have you ever been Statooed?  http://www.statoo.biz  +

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Rpart -- using predict() when missing data is present?

2005-10-08 Thread Ajay Narottam Shah
I am doing

 library(rpart)
 m <- rpart(y ~ x, D[insample, ])
 D[outsample,]
y   x
8  0.78391922 0.579025591
9  0.06629211  NA
10 NA 0.001593063
   p <- predict(m, newdata = D[9, ])
Error in model.frame(formula, rownames, variables, varnames, extras, 
extranames,  : 
invalid result from na.action

How do I persuade him to give me NA since x is NA?

I looked at ?predict.rpart but didn't find any mention about NAs.

(In this problem, I can easily do it manually, but this is a part of
something bigger where I want him to be able to gracefully handle
prediction requests involving NA).
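A defensive workaround (entirely my own sketch, not an rpart feature) is to predict only the complete rows and fill the rest with NA. It works for any model whose predict() returns a numeric vector; the lm() below just stands in for the tree fit:

```r
# Hypothetical helper: `xvars` names the predictor columns the model needs.
safe_predict <- function(model, newdata, xvars) {
  ok  <- complete.cases(newdata[, xvars, drop = FALSE])
  out <- rep(NA_real_, nrow(newdata))
  if (any(ok))
    out[ok] <- predict(model, newdata = newdata[ok, , drop = FALSE])
  out
}

d <- data.frame(y = c(1, 2, 4, NA), x = c(1, 2, NA, 4))
m <- lm(y ~ x, data = d)
safe_predict(m, d, "x")  # NA in position 3, numbers elsewhere
```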

-- 
Ajay Shah   Consultant
[EMAIL PROTECTED]  Department of Economic Affairs
http://www.mayin.org/ajayshah   Ministry of Finance, New Delhi

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Rpart -- using predict() when missing data is present?

2005-10-08 Thread Prof Brian Ripley
On Sat, 8 Oct 2005, Ajay Narottam Shah wrote:

 I am doing

 library(rpart)
  m <- rpart(y ~ x, D[insample, ])
 D[outsample,]
y   x
 8  0.78391922 0.579025591
 9  0.06629211  NA
 10 NA 0.001593063
    p <- predict(m, newdata = D[9, ])
 Error in model.frame(formula, rownames, variables, varnames, extras, 
 extranames,  :
   invalid result from na.action

 How do I persuade him to give me NA since x is NA?

I think the point is to do something sensible!  One-x prediction problems 
are not what rpart is designed to do, and the default na.action (na.rpart) 
fails in that case.  (The author forgot drop=F.)

 I looked at ?predict.rpart but didn't find any mention about NAs.

How about ?rpart ?  That does.

 (In this problem, I can easily do it manually, but this is a part of
 something bigger where I want him to be able to gracefully handle
 prediction requests involving NA).

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] rpart Error in yval[, 1] : incorrect number of dimensions

2005-09-24 Thread Little, Mark P
I tried using rpart, as below, and got this error message rpart Error in 
yval[, 1] : incorrect number of dimensions.  Thinking it might somehow be 
related to the large number of missing values, I tried using complete data, but 
with the same result. Does anyone know what may be going on, and how to fix it? 
I have traced two similar error messages in the Archive, but following the 
threads did not make it clear how to fix the problem.  

 currwh.rpart <- rpart(formula = CURRWHEE ~ EA17_6_1 + EA17_9_1 + X087 + X148 + 
 X260 + MOTHERSA + GESTATIO, method = "class")

 

 currwh.rpart

n=6783 (2283 observations deleted due to missing)

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 6783 720 3 (0.1060002949 0.8938522778 0.0001474274) *

 

 summary(currwh.rpart)

Call:

rpart(formula = CURRWHEE ~ EA17_6_1 + EA17_9_1 + X087 + X148 + 

X260 + MOTHERSA + GESTATIO, method = class)

n=6783 (2283 observations deleted due to missing)

CP nsplit rel error

1 0 0 1

Error in yval[, 1] : incorrect number of dimensions




[[alternative HTML version deleted]]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] rpart plot question

2005-08-11 Thread John Field

Petr Pikal wrote:

Dear all


I am quite confused by rpart plotting. Here is example.


set.seed(1)

y <- c(rnorm(10), rnorm(10)+2, rnorm(10)+5)

x <- c(rep(c(1,2,5), c(10,10,10)))

fit <- rpart(x~y)  ##  NB should be y~x

plot(fit)

text(fit)


Text on first split says x < 3.5 and on the second split x < 1.5, which

I understand:


If x < 3.5 then y is lower and the y values go to the left split. OK. But,

sometimes there is


whatever >= nnn and it seems to me that if this condition is true

response variable follow to right split.


try:


y1 <- c(rnorm(10)+5, rnorm(10)+2, rnorm(10))

fit <- rpart(y1~x)

plot(fit)

text(fit)


Well, I am not sure I express myself clearly. Am I correct that

when there is a < sign I shall follow the left node but when there is a >=

sign I shall follow the right one?


Best regards

Petr Pikal

Petr Pikal

petr.pikal at precheza.cz
If instead of rpart you use mvpart, ie

library(mvpart)
fit <- mvpart(y~x, data=data.frame(cbind(x,y)))
plot(fit)
text.rpart(fit,which=4)

then the plot will be much clearer about the condition for splits.

summary(fit) will also help.
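For readers without mvpart, the direction convention can also be checked in plain rpart (a small sketch reusing Petr's simulated data; labels() dispatches to the labels.rpart method):

```r
library(rpart)

set.seed(1)
y <- c(rnorm(10), rnorm(10) + 2, rnorm(10) + 5)
x <- rep(c(1, 2, 5), c(10, 10, 10))
fit <- rpart(y ~ x)

labels(fit)  # split label per node: the condition satisfied by the
             # cases that reached that node
print(fit)   # node k's children are numbered 2k (left) and 2k+1 (right),
             # so the condition shown by text(fit) is the LEFT branch
```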

Regards,
John

=
John Field Consulting Pty Ltd
10 High St, Burnside SA 5066, Australia
ph: +61 8 8332 5294 or +61 409 097 586
fax: +61 8 8332 1229
email:  [EMAIL PROTECTED]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

[R] rpart plot question

2005-08-09 Thread Petr Pikal
Dear all

I am quite confused by rpart plotting. Here is example.

set.seed(1)
y <- c(rnorm(10), rnorm(10)+2, rnorm(10)+5)
x <- c(rep(c(1,2,5), c(10,10,10)))
fit <- rpart(x~y)
plot(fit)
text(fit)

Text on first split says x < 3.5 and on the second split x < 1.5, which 
I understand:

If x < 3.5 then y is lower and the y values go to the left split. OK. But, 
sometimes there is

whatever >= nnn and it seems to me that if this condition is true 
response variable follow to right split.

try:

y1 <- c(rnorm(10)+5, rnorm(10)+2, rnorm(10))
fit <- rpart(y1~x)
plot(fit)
text(fit)

Well, I am not sure I express myself clearly. Am I correct that 
when there is a < sign I shall follow the left node but when there is a >= 
sign I shall follow the right one?

Best regards
Petr Pikal
Petr Pikal
[EMAIL PROTECTED]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] rpart memory problem

2005-03-21 Thread jenniferbecq

Hi everyone,

I have a problem using rpart (R 2.0.1 under Unix)

Indeed, I have a large matrix (9271x7), my response variable is numeric and all
my predictor variables are categorical (from 3 to 8 levels).

Here is an example :

 mydata[1:5,]
                  distance group3 group4 group5 group6 group7 group8
pos_1    0.141836040224967      a      c      e      a      g      g
pos_501  0.153605961621317      a      a      a      a      g      g
pos_1001 0.152246705384699      a      c      e      a      g      g
pos_1501 0.145563737522463      a      c      e      a      g      g
pos_2001 0.143940027378837      a      c      e      e      g      g

When using rpart() as follow, the program runs for ages, and after a few hours,
R is abruptly killed :

library(rpart)
fit <- rpart(distance ~ ., data = mydata)

When I change the categorical variables into numeric values (e.g. a = 1, b = 2,
c = 3, etc...), the program runs normally in a few seconds. But this is not
what I want because it separates my variables according to group7 < 4.5
(continuous) and not group7 = a,b,d,f or c,e,g (discrete).

here is the result :
fit
n= 9271

node), split, n, deviance, yval
  * denotes terminal node

 1) root 9271 28.43239000 0.1768883
   2) group7>=4.5 5830  4.87272700 0.1534626
     4) group5< 5.5 5783  3.29538700 0.1520110
       8) group5>=4.5 3068  0.68517040 0.1412967 *
       9) group5< 4.5 2715  1.86003600 0.1641184 *
     5) group5>=5.5 47  0.06597044 0.3320614 *
   3) group7< 4.5 3441 14.93984000 0.2165781
     6) group5< 1.5 1461  1.00414700 0.1906630 *
     7) group5>=1.5 1980 12.2305 0.2357002
      14) group6>=2.5 1659  2.95395700 0.2090232
        28) group3>=2.5 1315  1.65184200 0.1957505 *
        29) group3< 2.5 344  0.18490260 0.2597607 *
      15) group6< 2.5 321  1.99404400 0.3735729 *


When I create a small dataframe such as the example above, e.g. :

distance = rnorm(5,0.15,0.01)
group3 = c("a","a","a","a","a")
group4 = c("c","a","c","c","c")
group5 = c("e","a","e","e","e")
group6 = c("a","a","a","a","e")
smalldata = data.frame(cbind(distance,group3,group4,group5,group6))

The program runs normally in a few seconds.

Why does it work using the large dataset with only numeric values but not with 
categorical predictor variables ?

I have the impression that it considers my response variable also as a
categorical variable and therefore it can't handle 9271 levels, which is quite
normal. Is there a way to solve this problem ?
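One thing worth checking (my reading of the example code, not a confirmed diagnosis): `data.frame(cbind(...))` runs everything through `cbind()` first, which coerces mixed types to a character matrix, so the numeric response stops being numeric and rpart will treat it as categorical. A minimal sketch:

```r
distance <- rnorm(5, 0.15, 0.01)
group3   <- c("a", "b", "a", "b", "a")

bad  <- data.frame(cbind(distance, group3))  # cbind() => character matrix
good <- data.frame(distance, group3)         # keeps distance numeric

is.numeric(bad$distance)   # FALSE (factor or character, depending on R version)
is.numeric(good$distance)  # TRUE
```

Passing the columns to data.frame() directly, without cbind(), preserves each column's type.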

I thank you all for your time and help,

Jennifer Becq

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] rpart memory problem

2005-03-21 Thread Uwe Ligges
[EMAIL PROTECTED] wrote:
Hi everyone,
I have a problem using rpart (R 2.0.1 under Unix)
Indeed, I have a large matrix (9271x7), my response variable is numeric and all
my predictor variables are categorical (from 3 to 8 levels).

Your problem is the number of levels. You get a similar number of dummy 
variables and your problem becomes really huge.

Uwe Ligges

Here is an example :

mydata[1:5,]
                  distance group3 group4 group5 group6 group7 group8
pos_1    0.141836040224967      a      c      e      a      g      g
pos_501  0.153605961621317      a      a      a      a      g      g
pos_1001 0.152246705384699      a      c      e      a      g      g
pos_1501 0.145563737522463      a      c      e      a      g      g
pos_2001 0.143940027378837      a      c      e      e      g      g
When using rpart() as follow, the program runs for ages, and after a few hours,
R is abruptly killed :
library(rpart)
fit <- rpart(distance ~ ., data = mydata)
When I change the categorical variables into numeric values (e.g. a = 1, b = 2,
c = 3, etc...), the program runs normally in a few seconds. But this is not
what I want because it separates my variables according to group7 < 4.5
(continuous) and not group7 = a,b,d,f or c,e,g (discrete).
here is the result :
fit
n= 9271
node), split, n, deviance, yval
  * denotes terminal node
 1) root 9271 28.43239000 0.1768883
   2) group7>=4.5 5830  4.87272700 0.1534626
     4) group5< 5.5 5783  3.29538700 0.1520110
       8) group5>=4.5 3068  0.68517040 0.1412967 *
       9) group5< 4.5 2715  1.86003600 0.1641184 *
     5) group5>=5.5 47  0.06597044 0.3320614 *
   3) group7< 4.5 3441 14.93984000 0.2165781
     6) group5< 1.5 1461  1.00414700 0.1906630 *
     7) group5>=1.5 1980 12.2305 0.2357002
      14) group6>=2.5 1659  2.95395700 0.2090232
        28) group3>=2.5 1315  1.65184200 0.1957505 *
        29) group3< 2.5 344  0.18490260 0.2597607 *
      15) group6< 2.5 321  1.99404400 0.3735729 *
When I create a small dataframe such as the example above, e.g. :
distance = rnorm(5,0.15,0.01)
group3 = c("a","a","a","a","a")
group4 = c("c","a","c","c","c")
group5 = c("e","a","e","e","e")
group6 = c("a","a","a","a","e")
smalldata = data.frame(cbind(distance,group3,group4,group5,group6))
The program runs normally in a few seconds.
Why does it work using the large dataset with only numeric values but not with 
categorical predictor variables ?

I have the impression that it considers my response variable also as a
categorical variable and therefore it can't handle 9271 levels, which is quite
normal. Is there a way to solve this problem ?
I thank you all for your time and help,
Jennifer Becq
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] rpart

2005-01-17 Thread Weiwei Shi
Hi, there:
I am working on a classification problem by using
rpart. when my response variable y is binary, the
trees grow very fast, but if I add one more case to y,
that is making y has 3 cases, the tree growing cannot
be finished.
the command looks like:
x <- rpart(r0$V142 ~ ., data = r0[, 1:141],
parms = list(split = 'gini'), cp = 0.01)

changing cp or removing parms does not help. 

summary(r0$V142) gives:
 summary(r0$V142)
  0   1   2 
370  14  16 

I am not sure if rpart can do this or there is
something wrong with my approach.

Please be advised.

Ed

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] rpart

2005-01-17 Thread Prof Brian Ripley
On Mon, 17 Jan 2005, Weiwei Shi wrote:
I am working on a classification problem by using
rpart. when my response variable y is binary, the
trees grow very fast, but if I add one more case to y,
that is making y has 3 cases,
Do you mean 3 classes?: you have many more than 3 cases below.
the tree growing cannot be finished.
Whatever does that mean?  Please see the posting guide and supply the 
information it asks for, a reproducible example and what happens when you 
run it and why you think it is wrong.

the command looks like:
x <- rpart(r0$V142 ~ ., data = r0[, 1:141],
parms=list(split='gini'), cp=0.01)
changing cp or removing parms does not help.
summary($V142) gives like:
summary(r0$V142)
 0   1   2
370  14  16
I am not sure if rpart can do this or there is something wrong with my 
approach.
What is `this' you want to do?  Rpart works well with multiple classes: 
see for example MASS4.
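For instance, a three-class fit on a built-in dataset runs without trouble (a quick sketch; iris ships with R):

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)
# Cross-tabulate true vs. fitted class over the three species:
table(iris$Species, predict(fit, type = "class"))
```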

--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] rpart problem

2004-09-06 Thread pfm401
Dear all,

I am having some trouble with getting the rpart function to work as expected.
I am trying to use rpart to combine levels of a factor to reduce the number
of levels of that factor. In exploring the code I have noticed that it is
possible for chisq.test to return a statistically significant result whilst
the rpart method returns only the root node (i.e. no split is made). The
following code recreates the issue using simulated data :


# Create a 2 level factor with group 1 probability of success 90% and group
2 60%
tmp1  <- as.factor((runif (1000) <= 0.9))
tmp2  <- as.factor((runif (1000) <= 0.5))
mysuccess <- as.factor(c(tmp1, tmp2)) 
mygroup   <- as.factor(c(rep (1,1000), rep (2,1000)))

table (mysuccess, mygroup)
chisq.test (mysuccess, mygroup)
# p-value <  2.2e-16

myrpart <- rpart (mysuccess ~ mygroup)
myrpart
# rpart does not provide splits !!



If I change the parameter in the setting of group 2 to 0.3 from 0.6 rpart
does return splits, i.e. change the line 

tmp2  <- as.factor((runif (1000) <= 0.6))

to 

tmp2  <- as.factor((runif (1000) <= 0.3))

rpart does split the nodes, but as the split with 0.6 is highly significant
I would still have expected a split in this case too.

 
I would appreciate any advice as to whether this is a known feature of rpart,
whether I need to change the way my data are stored, or set some of the
control options. I have tested a few of these options with no success.


Thanks,
Paul.



__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] rpart problem

2004-09-06 Thread Prof Brian Ripley
I think you are confusing the purpose of rpart, which is prediction.
You want to predict `mysuccess'.

One group has 90% success, so the best prediction is `success'.
The other group has 60% success, so the best prediction is `success'.

So there is no point in splitting into groups.  Replace 60% by 30% and the 
best prediction for group 2 changes.

If this is not now obvious, please read up on tree-based methods.
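The arithmetic can be checked directly (a simulated sketch; the group sizes and seed are mine). With success rates 90% and 60%, both groups have majority class 1, so no split can change a single prediction or reduce the misclassification risk:

```r
library(rpart)

set.seed(42)
g   <- factor(rep(1:2, each = 1000))
y60 <- factor(rbinom(2000, 1, ifelse(g == "1", 0.9, 0.6)))
y30 <- factor(rbinom(2000, 1, ifelse(g == "1", 0.9, 0.3)))

# Both groups predict class 1, so only the root survives the default cp:
nrow(rpart(y60 ~ g, method = "class")$frame)
# Group 2's majority flips to class 0, so one split is made (3 frame rows):
nrow(rpart(y30 ~ g, method = "class")$frame)
```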

On Mon, 6 Sep 2004 [EMAIL PROTECTED] wrote:

 Dear all,
 
 I am having some trouble with getting the rpart function to work as expected.
 I am trying to use rpart to combine levels of a factor to reduce the number
 of levels of that factor. In exploring the code I have noticed that it is
 possible for chisq.test to return a statistically significant result whilst
 the rpart method returns only the root node (i.e. no split is made). The
 following code recreates the issue using simulated data :
 
 
 # Create a 2 level factor with group 1 probability of success 90% and group
 2 60%
 tmp1  <- as.factor((runif (1000) <= 0.9))
 tmp2  <- as.factor((runif (1000) <= 0.5))

Is 0.5 a typo?

 mysuccess <- as.factor(c(tmp1, tmp2)) 
 mygroup   <- as.factor(c(rep (1,1000), rep (2,1000)))
 
 table (mysuccess, mygroup)
 chisq.test (mysuccess, mygroup)
 # p-value <  2.2e-16
 
 myrpart <- rpart (mysuccess ~ mygroup)
 myrpart
 # rpart does not provide splits !!
 
 
 
 If I change the parameter in the setting of group 2 to 0.3 from 0.6 rpart
 does return splits, i.e. change the line 
 
 tmp2  <- as.factor((runif (1000) <= 0.6))
 
 to 
 
 tmp2  <- as.factor((runif (1000) <= 0.3))
 
 rpart does split the nodes, but as the split with 0.6 is highly significant
 I would still have expected a split in this case too.
 
  
 I would appreciate any advice as to whether this is a known feature of rpart,
 whether I need to change the way my data are stored, or set some of the
 control options. I have tested a few of these options with no success.

Testing cp < 0 will have an effect.
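
An editor's sketch of the point above, reconstructing the posted set-up: with the default cp = 0.01 the split does not reduce the (mis)classification risk, so only the root is reported; a negative cp retains even zero-improvement splits.

```r
library(rpart)

set.seed(42)
# ~90% success in group 1, ~60% in group 2, as in the posted code
mysuccess <- factor(c(runif(1000) <= 0.9, runif(1000) <= 0.6))
mygroup   <- factor(rep(1:2, each = 1000))

rpart(mysuccess ~ mygroup)   # root node only under the default cp = 0.01
rpart(mysuccess ~ mygroup,
      control = rpart.control(cp = -1))  # cp < 0 forces the split
```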

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


RE: [R] rpart and TREE, can be the same?

2004-07-19 Thread WWei
Hi, Andy,

Thank you again for your help. tree() does have an option split='gini' in
my version, which I recently downloaded from CRAN. The question is that
tree.control only controls mindev, with no option for gini. Or maybe there
is a way to specify a 'cp'-like parameter when using the Gini index in
tree()?

Thanks,
Auston






Liaw, Andy [EMAIL PROTECTED]
07/16/2004 02:04 PM
To: '[EMAIL PROTECTED]' [EMAIL PROTECTED]
Subject: RE: [R] rpart and TREE, can be the same?

Auston,
 
tree() does not use Gini as its splitting criterion, AFAIK; it uses
deviance. You can check: the various splitting criteria available in rpart
are described in Terry's tech report (available on the Mayo Clinic web
site).
 
Andy
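
As a point of reference (editor's sketch, not from the thread): rpart selects its classification criterion via parms$split; 'information' is the entropy-based criterion, the closest analogue of tree()'s deviance. Whether any setting makes the two packages agree exactly is the open question here.

```r
library(rpart)

# rpart's two classification splitting criteria, chosen via parms$split
fit.gini <- rpart(Species ~ ., data = iris, parms = list(split = "gini"))
fit.info <- rpart(Species ~ ., data = iris,
                  parms = list(split = "information"))  # entropy/deviance-based
```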
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 16, 2004 2:15 PM
To: Liaw, Andy
Subject: RE: [R] rpart and TREE, can be the same?


Thank you, Andy. Well, I tried 'gini' for both of them and my data has no
NAs, but they still don't match. BTW, what exactly is the splitting
criterion 'information' used in rpart? Thanks.

Auston 





Liaw, Andy [EMAIL PROTECTED]
Sent by: [EMAIL PROTECTED]
07/16/2004 01:01 PM
To: '[EMAIL PROTECTED]' [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: RE: [R] rpart and TREE, can be the same?




I guess if you define the splitting criterion in rpart so that it matches
the one used in tree(), that's possible.  However, I believe the two also
differ in how they handle NAs.

Andy

 From:  [EMAIL PROTECTED]
 
 Hi, all,
 
 I am wondering if it is possible to set parameters of 'rpart' 
 and 'tree' 
 such that they will produce the exact same tree? Thanks.
 
 Auston Wei
 Statistical Analyst
 Department of Biostatistics and Applied Mathematics
 The University of Texas MD Anderson Cancer Center
 Tel: 713-563-4281
 Email: [EMAIL PROTECTED]
  [[alternative HTML version deleted]]








RE: [R] rpart and TREE, can be the same?

2004-07-19 Thread WWei
They are substantially different, even when I use 'gini' for both of them
and set the relevant parameters to 0. It seems to me that something more
than the splitting rule governs the growth of the tree. What could that be,
other than size settings?
Thank you,
Auston






Liaw, Andy [EMAIL PROTECTED]
07/19/2004 09:38 AM
To: '[EMAIL PROTECTED]' [EMAIL PROTECTED]
Subject: RE: [R] rpart and TREE, can be the same?



Auston,
 
I see that now.  Have you tried setting mindev=0 in tree() and cp=0 in 
rpart(), to see if the unpruned trees are identical?  If so, you can 
probably try pruning the trees back using other tools in those packages. 
 
Cheers,
Andy
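
A sketch of Andy's suggestion (editor's example on the iris data, not Auston's): grow both trees essentially unpruned, then compare before pruning back.

```r
library(rpart)
library(tree)   # assumed installed from CRAN

# Grow both trees essentially unpruned
fit.rp <- rpart(Species ~ ., data = iris,
                control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))
fit.tr <- tree(Species ~ ., data = iris,
               control = tree.control(nobs = nrow(iris),
                                      mindev = 0, mincut = 2, minsize = 4))
print(fit.rp)
print(fit.tr)   # compare split variables and cutpoints by eye
```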


[[alternative HTML version deleted]]



[R] rpart and TREE, can be the same?

2004-07-16 Thread WWei
Hi, all,

I am wondering if it is possible to set parameters of 'rpart' and 'tree' 
such that they will produce the exact same tree? Thanks.

Auston Wei
Statistical Analyst
Department of Biostatistics and Applied Mathematics
The University of Texas MD Anderson Cancer Center
Tel: 713-563-4281
Email: [EMAIL PROTECTED]
[[alternative HTML version deleted]]



RE: [R] rpart and TREE, can be the same?

2004-07-16 Thread Liaw, Andy
I guess if you define the splitting criterion in rpart so that it matches
the one used in tree(), that's possible.  However, I believe the two also
differ in how they handle NAs.

Andy

 From:  [EMAIL PROTECTED]
 
 Hi, all,
 
 I am wondering if it is possible to set parameters of 'rpart' 
 and 'tree' 
 such that they will produce the exact same tree? Thanks.
 
 Auston Wei
 Statistical Analyst
 Department of Biostatistics and Applied Mathematics
 The University of Texas MD Anderson Cancer Center
 Tel: 713-563-4281
 Email: [EMAIL PROTECTED]
   [[alternative HTML version deleted]]
 
 




[R] rpart

2004-06-04 Thread h0444k87
Hello everyone,

I'm a newbie to R and to CART so I hope my questions don't seem too stupid.

1.)
My first question concerns the rpart() method. Which method does rpart use in
order to get the best split - entropy impurity, Bayes error (min. error) or
Gini index? Is there a way to make it use the entropy impurity?

The second and third questions concern the output of the printcp() function.
2.)
What exactly are the cps in that sense here? I assumed them to be the threshold
complexity parameters as in Breiman et al., 1998, Section 3.3. Are they the same
as the threshold levels of alpha? I have read somewhere that the cps here are
the threshold alphas divided by the root node error. Is that true?

3.)
How is rel error computed?
I am supposed to evaluate the goodness of classification of the CART method.
Do you think rel error is a good measure for that?

I'd be very thankful if anyone could give me hand on that. This is a project for
uni and I desperately need a good mark.

Thank you very much in advance,

Mareike
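
An editor's sketch of the printcp() output in question, using the kyphosis data shipped with rpart. My understanding (hedged, not from the thread) is that both the CP column and rel error are scaled by the root-node error, which matches what Mareike read.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
printcp(fit)
# Columns: CP, nsplit, rel error, xerror, xstd.
# 'rel error' is the resubstitution error of each subtree divided by the
# root-node error, so the root row is always 1; the CP values are likewise
# scaled by the root-node error.
```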



Re: [R] rpart

2004-06-04 Thread Ko-Kang Kevin Wang
Hi,

I think most, if not all, your questions can be answered by:

1) ?rpart

2) Some search through the r-help mailing list

3) Read the chapter on tree-based models in MASS 4 (Modern Applied
Statistics with S) by Venables and Ripley

Kevin

- Original Message - 
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, June 04, 2004 9:59 PM
Subject: [R] rpart


 Hello everyone,

 I'm a newbie to R and to CART so I hope my questions don't seem too stupid.

 1.)
 My first question concerns the rpart() method. Which method does rpart use in
 order to get the best split - entropy impurity, Bayes error (min. error) or
 Gini index? Is there a way to make it use the entropy impurity?

 The second and third questions concern the output of the printcp() function.
 2.)
 What exactly are the cps in that sense here? I assumed them to be the
 threshold complexity parameters as in Breiman et al., 1998, Section 3.3.
 Are they the same as the threshold levels of alpha? I have read somewhere
 that the cps here are the threshold alphas divided by the root node error.
 Is that true?

 3.)
 How is rel error computed?
 I am supposed to evaluate the goodness of classification of the CART method.
 Do you think rel error is a good measure for that?

 I'd be very thankful if anyone could give me hand on that. This is a
project for
 uni and I desperately need a good mark.

 Thank you very much in advance,

 Mareike





[R] rpart for CART with weights/priors

2004-05-07 Thread Carolin Strobl
Hi,
I have a technical question about rpart:
according to Breiman et al. 1984, different costs for misclassification in
CART can be modelled 
either by means of modifying the loss matrix or by means of using different
prior probabilities for the classes, 
which again should have the same effect as using different weights for the
response classes.

What I tried was this:

library(rpart)
data(kyphosis)

#fit1 from original unweighted data set
fit1 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)

#modify loss matrix
loss <- matrix(c(0,1,2,0), nrow=2, ncol=2)

#          true class?
#      [,1] [,2]
# [1,]    0    2
# [2,]    1    0    (rows: predicted class?)


#modify priors
prior=c(1/3,2/3)

fit2 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
              parms=list(loss=loss))
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
              parms=list(prior=prior))

fit2
fit3

par(mfrow=c(2,1))
plot(fit2)
text(fit2,use.n=T)
plot(fit3)
text(fit3,use.n=T)

#lead to similar but not identical trees (similar topology but different
cutoff points), 
#while all other combinations (even complete reversion, i.e. preference for
the other class) 
#lead to totally different trees...

#third approach using weights:
#sorting of data to design weight vector
ind <- order(kyphosis[,1])
kyphosis1 <- kyphosis[ind,]

summary(kyphosis1[,1])
weight <- c(rep(1,64), rep(2,17))
summary(as.factor(weight))

fit4 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis1,
              weights=weight)

#leads to a result very similar to fit2 with
#loss <- matrix(c(0,1,2,0), nrow=2, ncol=2)
#(same tree and cutoff points, but slightly different probabilities; maybe a
#numerical artefact?)

fit4
plot(fit4)
text(fit4,use.n=T)

#double check with the transposed loss matrix

loss <- matrix(c(0,1,2,0), nrow=2, ncol=2, byrow=TRUE)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
              parms=list(loss=loss))

weight <- c(rep(2,64), rep(1,17))
fit4 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis1,
              weights=weight)

fit2
fit4
#also the same except for the predicted probabilities (yprob)

I don't see 
1. why the approach using prior probabilities doesn't work
2. what causes the differences in predicted probabilities in the weights
approach

Any idea? Thank You! C.

--



[R] rpart question

2004-05-04 Thread lsjensen
Wondered about the best way to control for input variables that have a
large number of levels in 'rpart' models.  I understand the algorithm
searches through all possible splits (2^(k-1) for k levels) and so
variables with more levels are more prone to be good splitters... so I'm
looking for ways to compensate and adjust for this complexity.

For example, if two variables produce comparable splits in the data but
one contains 2 levels and the other 13 levels, then I would like to have
the algorithm choose the 'simpler' split.

Is this best done with the 'cost' argument in the rpart options?  This
defaults to one for all variables... so would it make sense to scale
this by nlevels in each variable or sqrt(nlevels) or something similar?

Thanks,
Landon
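
The mechanics of Landon's idea, as an editor's sketch on synthetic data (all names here are made up): in rpart, a variable's improvement is divided by its cost when choosing a split, so larger costs handicap a variable. Whether nlevels or sqrt(nlevels) is the right scaling is exactly what is being asked.

```r
library(rpart)

set.seed(1)
d <- data.frame(
  y   = factor(sample(c("yes", "no"), 200, replace = TRUE)),
  f2  = factor(sample(letters[1:2],  200, replace = TRUE)),   # 2 levels
  f13 = factor(sample(letters[1:13], 200, replace = TRUE)))   # 13 levels

# Handicap each predictor by its number of levels (one proposed scaling)
fit <- rpart(y ~ f2 + f13, data = d,
             cost = c(nlevels(d$f2), nlevels(d$f13)))
```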


[[alternative HTML version deleted]]



RE: [R] rpart question

2004-05-04 Thread Liaw, Andy
AFAIK rpart does not have a built-in facility for adjusting bias in split
selection.  One possibility is to define your own splitting criterion that
does the adjustment in some fashion.  I believe the current version of rpart
allows you to define a custom splitting criterion, but I have not tried it
myself.

Prof. Wei-Yin Loh at UW-Madison (and his current and former students) has
worked on algorithms that compensate for bias in split selection.  There is
software on his web page that you might want to check out.

HTH,
Andy

 From: [EMAIL PROTECTED]
 
 Wondered about the best way to control for input variables that have a
 large number of levels in 'rpart' models.  I understand the algorithm
 searches through all possible splits (2^(k-1) for k levels) and so
 variables with more levels are more prone to be good splitters... so I'm
 looking for ways to compensate and adjust for this complexity.
 
 For example, if two variables produce comparable splits in the data but
 one contains 2 levels and the other 13 levels, then I would like to have
 the algorithm choose the 'simpler' split.
 
 Is this best done with the 'cost' argument in the rpart options?  This
 defaults to one for all variables... so would it make sense to scale
 this by nlevels in each variable or sqrt(nlevels) or something similar?
 
 Thanks,
 Landon
 
 
   [[alternative HTML version deleted]]
 
 
 





[R] RPART drawing the tree

2004-04-29 Thread Rob Kamstra
Hello,
I am using the rpart library to find patterns in HIV mutations regarding
drug resistance. My data consist of the amino acid at certain locations and
two classes, resistant and susceptible.

The classification and pruning work fine with rpart. However, there is a
problem with displaying the data as a tree in the display window.

In the display window the splits show only levels, for example: (abcde)
left, (fg) right. But I would like to have the amino acids displayed. How
can this be achieved?

Rob Kamstra
_
MSN Search, for accurate results! http://search.msn.nl
__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] RPART drawing the tree

2004-04-29 Thread Prof Brian Ripley
On Thu, 29 Apr 2004, Rob Kamstra wrote:

 I am using the RPART library to find patterns in HIV mutations regarding 
 drug-resistancy.
 My data consists of aminoacid at certain locations and two classes resistant 
 and susceptible.
 
 The classification and pruning work fine with Rpart. however there is a 
 problem with displaying the data as a tree in the display window.
 
 in the display window the data contain only levels at the splits example: 
 (abcde) left (fg) right. but i would like to have the aminoacids displayed. 
 how can this be achieved ?

By reading the documentation, as suggested in 

 PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

specifically by ?text.rpart, since that has an argument `pretty' to 
control this.
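
For concreteness, an editor's sketch (car.test.frame ships with rpart and has a factor predictor, standing in for Rob's amino-acid data): pretty = 0 prints full factor-level names at the splits instead of the a/b/c abbreviations.

```r
library(rpart)

fit <- rpart(Mileage ~ Weight + Type, data = car.test.frame)
plot(fit, margin = 0.1)
text(fit)              # default: abbreviated factor levels at any Type split
plot(fit, margin = 0.1)
text(fit, pretty = 0)  # pretty = 0: full level names instead
```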

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595



[R] rpart or mvpart

2004-03-29 Thread Ben Stewart-Koster



[R] rpart question on loss matrix

2004-01-07 Thread Peter Flom
Hello again

I've looked through ?rpart, Atkinson & Therneau (1997), Chap. 10 of
Venables and Ripley, Breiman et al., and the r-help archives, but
haven't seen the answer to these two questions.

1) How does rpart deal with asymmetric loss matrices?  Breiman et al.
suggest some possibilities, but, of course, do not say how rpart does
it.

2) In the loss matrix, which direction (column or row) is 'truth' and
which 'output of program'?  e.g., if you have a 3 level DV (say the
levels are A, B, C) and you want a higher cost for misclassifying as
later in the alphabet, would it be

0  3  5  
1  0  2
2  1  0

or

0  1  2
3  0  1  
5  2  0


Thanks in advance

Peter
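
Absent a documented answer in the thread, here is an empirical probe (editor's sketch on the kyphosis data, not an authoritative statement of the convention): fit with a strongly asymmetric loss matrix and with its transpose, and see which orientation shifts the predictions the way you intend.

```r
library(rpart)

L <- matrix(c(0, 10, 1, 0), nrow = 2)   # strongly asymmetric; compare t(L)

fit.a <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               parms = list(loss = L))
fit.b <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               parms = list(loss = t(L)))

table(predict(fit.a, type = "class"))   # which class gets favoured?
table(predict(fit.b, type = "class"))
```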



Re: [R] rpart postscript graphics, Mac OS

2003-11-18 Thread Uwe Ligges


On Tue, 18 Nov 2003, Paul Murrell wrote:

 Hi
 
 
 Kaiser Fung wrote:
  I am running R on Mac OS X 10.2x.  When I create
  postscript graphics of rpart tree objects, a tiny part
  of the tree gets trimmed off, even when it has only a
  few terminal nodes.  This happens even without fancy
  but worse if fancy=T.  (This doesn't happen with
  boxplot, scatter plots, etc.)  How do I fix this?
  
  postscript("tree.eps")
  plot(davb.tree, u=T)
  text(davb.tree, use.n=T, fancy=F)
  dev.off()
 
 
 It's hard to see your problem without the actual data to reproduce it. 
 Does it help if you precede the plot command with par(xpd=NA)?


Well, the problem is known (calculating the required space for labeling 
etc. is hard), hence the argument margin in plot.rpart().
?plot.rpart tells you:

margin: an extra percentage of white space to leave around the borders of
the tree. (Long labels sometimes get cut off by the default computation).  

margin=0.1 is sufficient in most cases.
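
Applied to the original snippet, the fix would read (editor's sketch; a kyphosis fit stands in for Kaiser's davb.tree, which isn't available here):

```r
library(rpart)

davb.tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)  # stand-in

postscript("tree.eps")
plot(davb.tree, uniform = TRUE, margin = 0.1)  # 10% extra white space at borders
text(davb.tree, use.n = TRUE)
dev.off()
```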

Uwe Ligges



[R] rpart postscript graphics, Mac OS

2003-11-17 Thread Kaiser Fung

I am running R on Mac OS X 10.2x.  When I create
postscript graphics of rpart tree objects, a tiny part
of the tree gets trimmed off, even when it has only a
few terminal nodes.  This happens even without fancy
but worse if fancy=T.  (This doesn't happen with
boxplot, scatter plots, etc.)  How do I fix this?

postscript("tree.eps")
plot(davb.tree, u=T)
text(davb.tree, use.n=T, fancy=F)
dev.off()

Thanks
Kais



Re: [R] rpart postscript graphics, Mac OS

2003-11-17 Thread Paul Murrell
Hi

Kaiser Fung wrote:
I am running R on Mac OS X 10.2x.  When I create
postscript graphics of rpart tree objects, a tiny part
of the tree gets trimmed off, even when it has only a
few terminal nodes.  This happens even without fancy
but worse if fancy=T.  (This doesn't happen with
boxplot, scatter plots, etc.)  How do I fix this?
postscript("tree.eps")
plot(davb.tree, u=T)
text(davb.tree, use.n=T, fancy=F)
dev.off()


It's hard to see your problem without the actual data to reproduce it. 
Does it help if you precede the plot command with par(xpd=NA)?

Paul
--
Dr Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
New Zealand
+64 9 3737599 x85392
[EMAIL PROTECTED]
http://www.stat.auckland.ac.nz/~paul/


Re: [R] Rpart question - labeling nodes with something not in x$frame

2003-07-17 Thread Ko-Kang Kevin Wang
On Thu, 17 Jul 2003, Peter Flom wrote:

 I have a tree created with
 
 tr.hh.logcas - rpart(log(YCASSX + 1)~AGE+DRUGUSEY+SEX+OBSXNUM +WINDLE,
 xval = 10)
 
 I would like to label the nodes with YCASSX rather than log(YCASSX +
 1).  But the help file for text in library rpart says that you can only
 use labels that are part of x$frame, which YCASSX is not.

This may not be the best solution, but what I have done once is to add 
another column into the data frame with the labels I want.

For example:
  data(iris)
  library(rpart)
  # Recoding the response:
  #   s: setosa
  #   c: versicolor
  #   v: virginica
  ir <- iris[, -5]
  Species <- rep(c("s", "c", "v"), rep(50, 3))
  ir <- as.data.frame(cbind(ir, Species))
  ir.rp <- rpart(Species ~ ., data = ir)
  plot(ir.rp)
  text(ir.rp)

This is probably the long/silly way, but it works ;-D

-- 
Cheers,

Kevin

--
On two occasions, I have been asked [by members of Parliament],
'Pray, Mr. Babbage, if you put into the machine wrong figures, will
the right answers come out?' I am not able to rightly apprehend the
kind of confusion of ideas that could provoke such a question.

-- Charles Babbage (1791-1871) 
 From Computer Stupidities: http://rinkworks.com/stupid/

--
Ko-Kang Kevin Wang
Master of Science (MSc) Student
SLC Tutor and Lab Demonstrator
Department of Statistics
University of Auckland
New Zealand
Homepage: http://www.stat.auckland.ac.nz/~kwan022
Ph: 373-7599
x88475 (City)
x88480 (Tamaki)



Re: [R] rpart vs. randomForest

2003-04-14 Thread Martin Maechler
>>>>> Anonymous == [EMAIL PROTECTED]
>>>>>     on Sat, 12 Apr 2003 14:41:00 -0700 writes:

Anonymous Greetings. I'm trying to determine whether to use
Anonymous rpart or randomForest for a classification
Anonymous tree. Has anybody tested efficacy formally? I've
Anonymous run both and the confusion matrix for rf beats
Anonymous rpart. I've looking at the rf help page and am
Anonymous unable to figure out how to extract the tree.
Anonymous But more than that I'm looking for a more
Anonymous comprehensive user's guide for randomForest
Anonymous including the benefits on using it with MDS. Can
Anonymous anybody suggest a general guide? I've been
Anonymous finding a lot of broken links and cs-type of web
Anonymous pages rather than an end-user's guide. Also
Anonymous people's experience on adjusting the mtry param
Anonymous would be useful. Breiman says that it isn't too
Anonymous sensitive but I'm curious if anybody has had a
Anonymous different experience with it. Thanks in advance
Anonymous and apologies if this is too general.


If you really read Breiman, or alternatively, remember English,
you'll know that a forest has many trees...

Regards,
Martin Maechler [EMAIL PROTECTED] http://stat.ethz.ch/~maechler/



Re: [R] rpart v. lda classification.

2003-02-12 Thread ripley
On Tue, 11 Feb 2003, Rolf Turner wrote:

 
 I've been groping my way through a classification/discrimination
 problem, from a consulting client.  There are 26 observations, with 4
 possible categories and 24 (!!!) potential predictor variables.
 
 I tried using lda() on the first 7 predictor variables and got 24 of
 the 26 observations correctly classified.  (Training and testing both
 on the complete data set --- just to get started.)
 
 I then tried rpart() for comparison and was somewhat surprised when
 rpart() only managed to classify 14 of the 26 observations correctly.
 (I got the same classification using just the first 7 predictors as I
 did using all of the predictors.)
 
 I would have thought that rpart(), being unconstrained by a parametric
 model, would have a tendency to over-fit and therefore to appear to
 do better than lda() when the test data and training data are the
 same.
 
 Am I being silly, or is there something weird going on?  I can
 give more detail on what I actually did, if anyone is interested.

The first.  rpart is seriously constrained by having so few observations,
and its model is much more restricted than lda: axis-parallel splits only.
There is a similar example, with pictures, in MASS (on Cushings).
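
An editor's illustration of the comparison being discussed, on iris rather than the poster's 26-observation data (which isn't available): resubstitution accuracy of lda() versus rpart(), each with its own constraints.

```r
library(MASS)
library(rpart)

fit.lda <- lda(Species ~ ., data = iris)                # oblique boundaries
mean(predict(fit.lda)$class == iris$Species)            # resubstitution accuracy

fit.rp <- rpart(Species ~ ., data = iris)               # axis-parallel splits only
mean(predict(fit.rp, type = "class") == iris$Species)
```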

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595




[R] rpart v. lda classification.

2003-02-11 Thread Rolf Turner

I've been groping my way through a classification/discrimination
problem, from a consulting client.  There are 26 observations, with 4
possible categories and 24 (!!!) potential predictor variables.

I tried using lda() on the first 7 predictor variables and got 24 of
the 26 observations correctly classified.  (Training and testing both
on the complete data set --- just to get started.)

I then tried rpart() for comparison and was somewhat surprised when
rpart() only managed to classify 14 of the 26 observations correctly.
(I got the same classification using just the first 7 predictors as I
did using all of the predictors.)

I would have thought that rpart(), being unconstrained by a parametric
model, would have a tendency to over-fit and therefore to appear to
do better than lda() when the test data and training data are the
same.

Am I being silly, or is there something weird going on?  I can
give more detail on what I actually did, if anyone is interested.

The data are pretty obviously nothing like Gaussian, so my
gut feeling is that rpart() should be much more appropriate than
lda().  And it does not seem surprising that with so few
observations to train with, the success rate should be low, even
when testing and training on the same data set.  What does
surprise me is that lda() gets such a high success rate.

Should I just put this down as a random occurrence of a low
prob. event?

cheers,

Rolf Turner
[EMAIL PROTECTED]

P.S.  Using CV=TRUE in lda() I got only 16 of the 26 observations
correctly classified.
