Re: [R] pseudo-R2 or GOF for regression trees?

2007-05-05 Thread Frank E Harrell Jr
Prof. Jeffrey Cardille wrote:
> Hello,
>
> Is there an accepted way to convey, for regression trees, something
> akin to R-squared?
>
> I'm developing regression trees for a continuous y variable and I'd
> like to say how well they are doing. In particular, I'm analyzing the
> results of a simulation model having highly non-linear behavior, and
> asking what characteristics of the inputs are related to a particular
> output measure.  I've got a very large number of points: n=4000.  I'm
> not able to do a model sensitivity analysis because of the large
> number of inputs and the model run time.
>
> I've been googling around both on the archives and on the rest of the
> web for several hours, but I'm still having trouble getting a firm
> sense of the state of the art.  Could someone help me to quickly
> understand what strategy, if any, is acceptable to say something like
> "The regression tree in Figure 3 captures 42% of the variance"?  The
> target audience is readers who will be interested in the subsequent
> verbal explanation of the relationship, but only once they are
> comfortable that the tree really does capture something.  I've run
> across methods to say how well a tree does relative to a set of trees
> on the same data, but that doesn't help much unless I'm sure the
> trees in question are really capturing the essence of the system.
>
> I'm happy to be pointed to a web site or to a thread I may have
> missed that answers this exact question.
>
> Thanks very much,
>
> Jeff
>
> --
> Prof. Jeffrey Cardille
> [EMAIL PROTECTED]

Ye (below) has a method to get a nearly unbiased estimate of R^2 from 
recursive partitioning.  In his examples the result was similar to using 
the formula for adjusted R^2 with regression degrees of freedom equal to 
about 3n/4.  You can also use something like 10-fold cross-validation 
repeated 20 times to get a fairly precise and unbiased estimate of R^2.
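
[Editorial sketch, not part of the original message.] The repeated cross-validation suggestion above can be illustrated roughly as follows. This is a minimal sketch in plain Python rather than R, under several assumptions: `fit_stump` is a hypothetical stand-in learner (a one-split tree on a single predictor) that you would replace with your actual tree fitter, the toy data are simulated, and R^2 is computed by pooling the out-of-fold predictions within each repeat and comparing them with the overall mean of y, which is only one of several conventions in use.

```python
import random


def fit_stump(x, y):
    """Hypothetical stand-in learner: a one-split regression tree on one predictor.

    Returns a function mapping a new x value to a predicted y.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    n = len(ys)
    total = sum(ys)
    total_sq = sum(v * v for v in ys)
    best = None  # (sse, threshold, left_mean, right_mean)
    left_sum = 0.0
    left_sq = 0.0
    for i in range(1, n):
        left_sum += ys[i - 1]
        left_sq += ys[i - 1] ** 2
        if xs[i] == xs[i - 1]:
            continue  # cannot split between tied x values
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Within-leaf sum of squared errors for this candidate split.
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, left_sum / i, right_sum / (n - i))
    if best is None:  # all x equal: fall back to the intercept-only model
        mean_y = total / n
        return lambda v: mean_y
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean


def repeated_cv_r2(x, y, fit, k=10, repeats=20, seed=0):
    """Average of `repeats` k-fold cross-validated R^2 estimates."""
    rng = random.Random(seed)
    n = len(y)
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    estimates = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random partition for each repeat
        ss_res = 0.0
        for f in range(k):
            test = idx[f::k]
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            predict = fit([x[i] for i in train], [y[i] for i in train])
            ss_res += sum((y[i] - predict(x[i])) ** 2 for i in test)
        estimates.append(1.0 - ss_res / ss_tot)
    return sum(estimates) / repeats


# Simulated toy data (an assumption for illustration): a step function plus noise.
data_rng = random.Random(1)
x = [data_rng.uniform(0, 1) for _ in range(200)]
y = [(2.0 if xi > 0.5 else 0.0) + data_rng.gauss(0, 0.5) for xi in x]

cv_r2 = repeated_cv_r2(x, y, fit_stump, k=10, repeats=20, seed=0)
print("repeated 10-fold CV R^2:", round(cv_r2, 3))
```

Note that a cross-validated R^2 computed this way can be negative when the model predicts held-out points worse than the overall mean does. Any tuning or pruning of the tree would need to be repeated inside each training fold for the estimate to stay honest.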

Frank


@ARTICLE{ye98mea,
   author = {Ye, Jianming},
   year = 1998,
   title = {On measuring and correcting the effects of data mining and model
   selection},
   journal = JASA,
   volume = 93,
   pages = {120-131},
   annote = {generalized degrees of freedom;GDF;effective degrees of
freedom;data mining;model selection;model
uncertainty;overfitting;nonparametric regression;CART;simulation
setup}
}
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] pseudo-R2 or GOF for regression trees?

2007-05-05 Thread Prof Brian Ripley

On Sat, 5 May 2007, Prof. Jeffrey Cardille wrote:


> Hello,
>
> Is there an accepted way to convey, for regression trees, something
> akin to R-squared?


Why not use R-squared itself for your purposes?

Just get the fitted values from however you do the fit, and compute 
R-squared from the basic formula (the one which compares with an intercept 
only: all regression trees extend that model).
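
[Editorial sketch, not part of the original message.] The "basic formula" referred to above is R^2 = 1 - SS_res / SS_tot, where SS_tot is the residual sum of squares of the intercept-only model. A minimal Python illustration, with made-up toy numbers and a hypothetical helper name `r_squared`:

```python
def r_squared(y, fitted):
    """R^2 relative to the intercept-only model: 1 - SS_res / SS_tot."""
    if len(y) != len(fitted):
        raise ValueError("y and fitted must have the same length")
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)                 # intercept-only model
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # fitted model
    return 1.0 - ss_res / ss_tot


# Toy example (made-up numbers): a two-leaf tree predicts each leaf's mean.
y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
fitted = [2.0, 2.0, 2.0, 11.0, 11.0, 11.0]
print(r_squared(y, fitted))
```

Computed on the same data used to grow the tree, this is the apparent (resubstitution) R^2, so it is a factual summary of the fit rather than an estimate of how well the tree would predict new data.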


Now, R-squared has lots of problems of its own (to the extent that it is 
only mentioned as something to avoid in some statistical texts) and these 
are worse here as the number of parameters fitted is unquantifiable. But 
as a factual summary it does mean what you quote.  Whether any model of 
comparable complexity would also explain 42% of the variance is a much 
harder question.


(Small anecdote: one of my first experiences of this was a psychologist 
who had funded a research project to relate personality/intelligence tests 
to 20-odd measurements on facial profiles by (stepwise) linear regression. 
My contribution was to point out that the R^2 produced was less for every 
one of the responses than one would expect on average for the same number 
of random unrelated regressors.  To be systematically worse than such a 
straw man takes some achieving, and I have always suspected a bug in the 
fitting software.)





> --
> Prof. Jeffrey Cardille
> [EMAIL PROTECTED]
>
> Département de Géographie, Université de Montréal
> Assistant professor (professeur adjoint)
> C.P. 6128, Succursale Centre-ville, Montréal, QC, H3C 3J7
> Office: Salle 440, Pavillon Strathcona,
>         520, chemin de la Côte-Ste-Catherine, Montreal, QC H2V 2B8
> Tel.: (514) 343-8003
> Web: http://www.geog.umontreal.ca/geog/cardille.htm
> Availability calendar: http://jeffcardille.googlepages.com/udem




--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595