[R] Bad points in regression [Broadcast]

2007-03-17 Thread John Maindonald
None of Andy's comments are inconsistent with the point that
rlm() and lqs(), if they disagree with lm(), likely offer better
places to start than lm() does in identifying points that should
be examined as outliers in some sense. All such methods are
to be used, not as crutches, but as sources of insight.

Incidentally, the rlm class inherits from lm, and plot.lm()
(or, preferably, the plot generic) can be used with rlm objects.
This is not the case for lqs objects.
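
A minimal sketch of that workflow (the simulated data and the
2.5-sigma cutoff are illustrative choices, not a recommendation):

  library(MASS)   # provides rlm() and lqs()

  set.seed(42)
  x <- 1:100
  y <- 0.3 + 0.4 * x + 0.5 * rnorm(100)
  y[c(15, 25, 50)] <- y[c(15, 25, 50)] + c(5, -4, 10)  # plant three bad points

  fit.rlm <- rlm(y ~ x)   # M-estimation; the fit inherits from class "lm"
  fit.lqs <- lqs(y ~ x)   # resistant fit; does NOT inherit from "lm"

  ## Large residuals from the resistant fit mark the points worth a
  ## closer look; 2.5 times the robust scale is a common, arbitrary cutoff.
  which(abs(residuals(fit.rlm)) > 2.5 * fit.rlm$s)

  plot(fit.rlm)    # dispatches to plot.lm(), as noted above
  ## plot(fit.lqs) # no plot method exists for "lqs" objects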

John Maindonald.


Re: [R] Bad points in regression [Broadcast]

2007-03-16 Thread Liaw, Andy
(My turn on the soapbox ...)

I'd like to add a bit of a caveat to Bert's view.  I'd argue (perhaps
even plead) that robust/resistant procedures be used with care.  They
should not be used as a shortcut to avoid careful analysis of data.  I
recall that in my first course on regression, the professor made it
clear that we're fitting models to data, not the other way around.
When the model fits (some of) the data badly, do examine and think
carefully about what happened.  Verify that bad data are indeed bad,
instead of using statistical criteria alone to make that judgment.  A
scientific colleague reminded me of this point when I tried to sell
him the idea of robust/resistant methods: don't use these methods as
a crutch to prop up badly run experiments (or poorly fitting models).

Cheers,
Andy

From: Bert Gunter
 
 (mount soapbox...)
 
 While I know the prior discussion represents common practice, 
 I would argue -- perhaps even plead -- that the modern 
 (?? 30 years old now) alternative of robust/resistant 
 estimation be used, especially in the readily available 
 situation of least-squares regression. RSiteSearch("Robust") 
 will bring up numerous possibilities. rrcov and robustbase 
 are at least two packages devoted to this, but the 
 functionality is available in many others (e.g. rlm() in MASS).
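 
 As a concrete sketch (simulated data; lmrob() from robustbase is 
 just one of the available fitters):
 
   library(MASS)        # rlm()
   library(robustbase)  # lmrob()
 
   set.seed(1)
   x <- 1:100
   y <- 2 + 0.5 * x + rnorm(100)
   y[c(10, 60)] <- y[c(10, 60)] + 25   # two gross errors
 
   coef(lm(y ~ x))     # least squares: dragged by the bad points
   coef(rlm(y ~ x))    # Huber M-estimate (MASS)
   coef(lmrob(y ~ x))  # MM-estimate (robustbase)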
 
 Bert Gunter
 Genentech Nonclinical Statistics
 South San Francisco, CA 94404
 650-467-7374
 
 
 
 
 
 -----Original Message-----
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Ted Harding
 Sent: Friday, March 16, 2007 6:44 AM
 To: r-help@stat.math.ethz.ch
 Subject: Re: [R] Bad points in regression
 
 On 16-Mar-07 12:41:50, Alberto Monteiro wrote:
  Ted Harding wrote:
  
   alpha <- 0.3
   beta <- 0.4
   sigma <- 0.5
   err <- rnorm(100)
   err[15] <- 5; err[25] <- -4; err[50] <- 10
   x <- 1:100
   y <- alpha + beta * x + sigma * err
   ll <- lm(y ~ x)
   plot(ll)
  
  ll is the output of a linear model fitted by lm(), and so 
 has several components (see ?lm in the section "Value"), one 
 of which is "residuals" (which can be abbreviated to "res" 
 when extracted with $).
  
  So, in the case of your example,
  
   which(abs(ll$res) > 2)
15 25 50
  
  extracts the information you want (and the 2 was inspired by 
  looking at the residuals plot from your plot(ll)).
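  
  A variant sketch, if you would rather threshold standardized 
  residuals than raw ones (the cutoff 2 is conventional, not canonical):
  
    which(abs(rstandard(ll)) > 2)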
 
  Ok, but how can I grab those points _in general_? What is the 
  criterion that plot() used to mark those points as "bad" points?
 
 Ahh ... ! I see what you're after. OK, look at the plot 
 method for lm():
 
  ?plot.lm
    ## S3 method for class 'lm':
    plot(x, which = 1:4,
         caption = c("Residuals vs Fitted", "Normal Q-Q plot",
                     "Scale-Location plot", "Cook's distance plot"),
         panel = points,
         sub.caption = deparse(x$call), main = "",
         ask = prod(par("mfcol")) < length(which) && dev.interactive(),
         ...,
         id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75)
 
 
 where (see further down):
 
   id.n: number of points to be labelled in each plot, starting with
 the most extreme.
 
 and note, in the default parameter-values listing above:
 
   id.n = 3
 
 Hence, the 3 most extreme points (according to the criterion 
 being plotted in each plot) are marked in each plot.
 
 So, for instance, try
 
   plot(ll,id.n=5)
 
 and you will get points 10,15,25,28,50. And so on. But that 
 pre-supposes that you know how many points are exceptional.
 
 
 What is meant by "extreme" is not stated in the help page 
 ?plot.lm, but can be identified by inspecting the code for 
 plot.lm(), which you can see by entering
 
   plot.lm
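 
 In outline (a rough sketch of the ranking it uses, not a verbatim 
 extract from the source): the first three plots label points by the 
 size of the (standardized) residuals, the fourth by cooks.distance().
 
   id.n <- 3
   head(order(abs(rstandard(ll)), decreasing = TRUE), id.n)  # residual plots
   head(order(cooks.distance(ll), decreasing = TRUE), id.n)  # Cook's distance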
 
 In your example, if you omit the line which assigns anomalous 
 values to err[15], err[25] and err[50], then you are likely 
 to observe that different points get identified on different 
 plots. For instance, I just got the following results for the 
 default id.n=3:
 
  [1] Residuals vs Fitted:        41, 53, 59
  [2] Standardised Residuals:     41, 53, 59
  [3] sqrt(Stand Res) vs Fitted:  41, 53, 59
  [4] Cook's Distance:            59, 96, 97
 
 
 There are several approaches (with somewhat different 
 outcomes) to identifying outliers. If you apply one of 
 these, you will probably get the identities of the points anyway.
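 
 For instance (one sketch among several; influence.measures() in base 
 R bundles the standard diagnostics):
 
   im <- influence.measures(ll)     # DFBETAS, DFFITS, covratio, Cook's D, hat
   summary(im)                      # prints the potentially influential rows
   which(apply(im$is.inf, 1, any))  # indices of the flagged points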
 
 Again in the context of your example (where in fact you 
 deliberately set 3 points to have exceptional errors, 
 coincidentally matching the default value 3 of id.n), you 
 could try different values for id.n and inspect the graphs to 
 see whether a given value of id.n marks some points that do 
 not look exceptional relative to the mass of the other points.
 
 So, the above plot(ll, id.n=5) gave me one point, 10, on the 
 residuals plot, which apparently belonged to the general 
 distribution of residuals.
 
 Hoping this helps,
 Ted.
 
 
 E-Mail: (Ted Harding) [EMAIL PROTECTED]
 Fax-to-email: +44 (0)870 094 0861
 Date: 16-Mar-07   Time: 13:43:54
 -- XFMail