Re: [R] Testing for normality of residuals in a regression model
On Fri, 15 Oct 2004, Kjetil Brinchmann Halvorsen wrote:

> Liaw, Andy wrote:
>> Also, I was told by someone very smart that fitting OLS to data with heteroscedastic errors can make the residuals look `more normal' than they really are... Don't know how true that is, though.
>
> Certainly true, since the residuals will be a kind of average, so the CLT works.

[Inserting some R content into the discussion] An example of this can be seen by running qqnorm() on the residuals from the Anscombe quartet of data sets (data(anscombe)).

-thomas

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
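Thomas's suggestion can be tried directly; a minimal sketch (object names are mine):

```r
# Fit OLS to each data set in the Anscombe quartet; the four fits are
# numerically almost identical, but qqnorm() of the residuals shows
# how different the residual distributions really are.
data(anscombe)
fits <- list(
  lm(y1 ~ x1, data = anscombe),
  lm(y2 ~ x2, data = anscombe),
  lm(y3 ~ x3, data = anscombe),
  lm(y4 ~ x4, data = anscombe)
)
op <- par(mfrow = c(2, 2))
for (f in fits) qqnorm(resid(f))
par(op)
```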
Re: [R] Testing for normality of residuals in a regression model
Prof Brian Ripley wrote:

> However, stats 901 or some such tells you that if the distributions have even slightly longer tails than the normal you can get much better estimates than OLS, and this happens even before a test of normality rejects on a sample size of thousands. Robustness of efficiency is much more important than robustness of distribution, and I believe robustness concepts should be in stats 101. (I was teaching them yesterday in the third lecture of a basic course, albeit a graduate course.)

This is a very interesting discussion. So you are basically saying that it's better to use robust regression methods, without having to worry too much about the distribution of residuals, instead of using standard methods and doing a lot of tests to check for normality? Did I get your point?

Cheers,
Federico
RE: [R] Testing for normality of residuals in a regression model
Prof Brian Ripley wrote:

> However, stats 901 or some such tells you that if the distributions have even slightly longer tails than the normal you can get much better estimates than OLS, and this happens even before a test of normality rejects on a sample size of thousands. Robustness of efficiency is much more important than robustness of distribution, and I believe robustness concepts should be in stats 101. (I was teaching them yesterday in the third lecture of a basic course, albeit a graduate course.)

Federico Gherardini answered:

> This is a very interesting discussion. So you are basically saying that it's better to use robust regression methods, without having to worry too much about the distribution of residuals, instead of using standard methods and doing a lot of tests to check for normality? Did I get your point?

My feeling is that symmetry is more important than, let's say, zero kurtosis in the error. Is this correct? Now the problem is: the lower the number of observations, the more severe an effect of non-normality (at least, asymmetry?) could be on the regression AND, at the same time, the power of tests to detect non-normality drops. So I can easily imagine situations where non-normality is not detected, yet asymmetry is such that the regression is significantly biased... It is mainly a question of sample size from this point of view... But not only:

Andy Liaw wrote:

> Also, I was told by someone very smart that fitting OLS to data with heteroscedastic errors can make the residuals look `more normal' than they really are... Don't know how true that is, though.

That very smart person is not me, but it happens that I also experimented a little with this a while ago! Just experiment with artificial data, and you will see what happens: the residuals often look more normal than the error distribution you introduced in your artificial data... Another consequence is a biased estimate of the parameters.
Indeed, both come together: the parameters are biased in a direction that lowers the residual sum of squares, obviously, but also, in some circumstances, in a direction that makes the residuals look more normal... And that is not (how could it be?) taken into account in the test of normality. That is, I believe, a second reason why non-normality of the error may not be detected, yet have a major impact on the OLS regression. And I am pretty sure there are other reasons, like distribution of error in both the dependent and the independent variables, another violation of the assumptions made for OLS...

Best regards,
Philippe

Prof. Philippe Grosjean
Numerical Ecology of Aquatic Systems
Mons-Hainaut University, Pentagone
Academie Universitaire Wallonie-Bruxelles
6, av du Champ de Mars, 7000 Mons, Belgium
phone: + 32.65.37.34.97, fax: + 32.65.37.33.12
email: [EMAIL PROTECTED]
web: http://www.umh.ac.be/~econum
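A quick way to reproduce the observation Philippe describes (a sketch with arbitrary parameter choices, not his actual experiment):

```r
# Skewed, heteroscedastic errors: a chi-square variate shifted to mean
# zero, scaled by x so the error variance grows with the predictor.
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
err <- (rchisq(n, df = 2) - 2) * x
y <- 2 + 3 * x + err
fit <- lm(y ~ x)

# The Shapiro-Wilk W statistic is closer to 1 for more normal-looking
# samples; compare the true errors with the OLS residuals.
shapiro.test(err)$statistic
shapiro.test(resid(fit))$statistic
```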
RE: [R] Testing for normality of residuals in a regression model
I am assuming everyone is on R-help and doesn't want two copies, so I have trimmed the Cc: list to R-help.

On Sat, 16 Oct 2004, Philippe Grosjean wrote:

> Prof Brian Ripley wrote:
> [ Other contributions previously excised here without comment. ]
>> However, stats 901 or some such tells you that if the distributions have even slightly longer tails than the normal you can get much better estimates than OLS, and this happens even before a test of normality rejects on a sample size of thousands. Robustness of efficiency is much more important than robustness of distribution, and I believe robustness concepts should be in stats 101. (I was teaching them yesterday in the third lecture of a basic course, albeit a graduate course.)
>
> Federico Gherardini answered:
>> This is a very interesting discussion. So you are basically saying that it's better to use robust regression methods, without having to worry too much about the distribution of residuals, instead of using standard methods and doing a lot of tests to check for normality? Did I get your point?
>
> My feeling is that symmetry is more important than, let's say, zero kurtosis in the error. Is this correct? Now the problem is: the lower the number of observations, the more severe an effect of non-normality (at least, asymmetry?) could be on the regression AND, at the same time, the power of tests to detect non-normality drops. So I can easily imagine situations where non-normality is not detected, yet asymmetry is such that the regression is significantly biased...

Before you can even talk about bias you have to agree what it is you are trying to estimate. For asymmetric error distributions it is unlikely to be the population mean, but if it is, then least-squares linear regression is unbiased provided only that the error distribution has a finite first moment. (Part of the so-called Gauss-Markov Theorem. This seems to suggest that Philippe's `easy imagination' is of impossible things.)
For contaminated normal distributions it is possibly the mean of the uncontaminated normal component, and the latter seems the commonest aim of mainstream robust methods, which do often assume symmetry. (This may not affect interpretation of coefficients other than the intercept.) The (non-linear) robust regression estimators may be biased for the population mean but have a (much) smaller variability for long-tailed distributions. There is a lot of careful discussion about this in the statistical literature, and I don't believe that it is profitable for people to be discussing this without knowing the literature, and probably not _here_ even then.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
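The efficiency point above is easy to see in a small simulation (a sketch, not from the thread; MASS::rlm stands in for "a robust regression estimator"):

```r
library(MASS)  # for rlm(), a robust M-estimator of regression

# Long-tailed t(3) errors: both OLS and the M-estimator target the
# same slope, but the robust fit is typically much less variable.
set.seed(42)
slopes <- replicate(500, {
  x <- rnorm(50)
  y <- 1 + 2 * x + rt(50, df = 3)
  c(ols = coef(lm(y ~ x))["x"],
    rob = coef(rlm(y ~ x, maxit = 50))["x"])
})
apply(slopes, 1, var)  # sampling variance of each estimator over 500 fits
```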
Re: [R] Testing for normality of residuals in a regression model
What about

    shapiro.test(resid(fit.object))

Stefano

On Fri, Oct 15, 2004 at 02:44:18PM +0200, Federico Gherardini wrote:

> Hi all, Is it possible to have a test value for assessing the normality of residuals from a linear regression model, instead of simply relying on qqplots? I've tried to use fitdistr to try and fit the residuals with a normal distribution, but fitdistr only returns the parameters of the distribution and the standard errors, not the p-value. Am I missing something? Cheers, Federico
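Stefano's one-liner in self-contained form (toy data and names are mine):

```r
# Simulate a small regression with normal errors and apply the
# Shapiro-Wilk test to the residuals of the fitted lm object.
set.seed(123)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit.object <- lm(y ~ x)
shapiro.test(resid(fit.object))  # reports the W statistic and a p-value
```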
Re: [R] Testing for normality of residuals in a regression model
Hi Federico, take also a look at the package nortest:

    help(package = nortest)

Best,
Dimitris

Dimitris Rizopoulos
Ph.D. Student, Biostatistical Centre
School of Public Health
Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/396887
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat/
http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm

----- Original Message -----
From: Federico Gherardini [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, October 15, 2004 2:44 PM
Subject: [R] Testing for normality of residuals in a regression model

> Hi all, Is it possible to have a test value for assessing the normality of residuals from a linear regression model, instead of simply relying on qqplots? I've tried to use fitdistr to try and fit the residuals with a normal distribution, but fitdistr only returns the parameters of the distribution and the standard errors, not the p-value. Am I missing something? Cheers, Federico
RE: [R] Testing for normality of residuals in a regression model
Dear Federico,

A problem with applying a standard test of normality to LS residuals is that the residuals are correlated and heteroskedastic even if the standard assumptions of the model hold. In a large sample, this is unlikely to be problematic (unless there's an unusual data configuration), but in a small sample the effect could be nontrivial. One approach is to use BLUS residuals, which transform the LS residuals to a smaller set of uncorrelated, homoskedastic residuals (assuming the correctness of the model). A search of R resources didn't turn up anything for BLUS, but they shouldn't be hard to compute. This is a standard topic covered in many econometrics texts.

You might consider the alternative of generating a bootstrapped confidence envelope for the QQ plot; the qq.plot() function in the car package will do this for a linear model.

I hope this helps,
John

John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140 x23604
http://socserv.mcmaster.ca/jfox

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Federico Gherardini
Sent: Friday, October 15, 2004 7:44 AM
To: [EMAIL PROTECTED]
Subject: [R] Testing for normality of residuals in a regression model

> Hi all, Is it possible to have a test value for assessing the normality of residuals from a linear regression model, instead of simply relying on qqplots? I've tried to use fitdistr to try and fit the residuals with a normal distribution, but fitdistr only returns the parameters of the distribution and the standard errors, not the p-value. Am I missing something? Cheers, Federico
Re: [R] Testing for normality of residuals in a regression model
Thank you very much for your suggestions! The residuals come from a gls model, because I had to correct for heteroscedasticity using a weighted regression... Can I simply apply one of these tests (like shapiro.test) to the standardized residuals from my gls model?

Cheers,
Federico
Re: [R] Testing for normality of residuals in a regression model
John Fox wrote:

> Dear Federico, A problem with applying a standard test of normality to LS residuals is that the residuals are correlated and heteroskedastic even if the standard assumptions of the model hold. In a large sample, this is unlikely to be problematic (unless there's an unusual data configuration), but in a small sample the effect could be nontrivial. One approach is to use BLUS residuals, which transform the LS residuals to a smaller set of uncorrelated, homoskedastic residuals (assuming the correctness of the model).

I'm not sure if these are BLUS residuals, but the following function transforms to a smaller set of independent, homoscedastic residuals and then calls shapiro.test. I've proposed making this a method of shapiro.test for lm objects, but it was not accepted.

shapiro.test.lm <- function (obj) {
    eff <- effects(obj)
    rank <- obj$rank
    df.r <- obj$df.residual
    if (df.r < 3)
        stop("Too few degrees of freedom for residuals for the test.")
    data.name <- deparse(substitute(obj))
    x <- eff[-(1:rank)]
    res <- shapiro.test(x)
    res$data.name <- data.name
    res$method <- paste(res$method, "for residuals of linear model")
    res
}

Kjetil

> A search of R resources didn't turn up anything for BLUS, but they shouldn't be hard to compute. This is a standard topic covered in many econometrics texts. You might consider the alternative of generating a bootstrapped confidence envelope for the QQ plot; the qq.plot() function in the car package will do this for a linear model.
> I hope this helps,
> John
>
> John Fox
> Department of Sociology
> McMaster University
> Hamilton, Ontario
> Canada L8S 4M4
> 905-525-9140 x23604
> http://socserv.mcmaster.ca/jfox
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Federico Gherardini
> Sent: Friday, October 15, 2004 7:44 AM
> To: [EMAIL PROTECTED]
> Subject: [R] Testing for normality of residuals in a regression model
>
>> Hi all, Is it possible to have a test value for assessing the normality of residuals from a linear regression model, instead of simply relying on qqplots? I've tried to use fitdistr to try and fit the residuals with a normal distribution, but fitdistr only returns the parameters of the distribution and the standard errors, not the p-value. Am I missing something? Cheers, Federico

--
Kjetil Halvorsen.
Peace is the most effective weapon of mass construction.
-- Mahdi Elmandjra
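Kjetil's function can be tried on a built-in data set (the assignment arrows and quoting below are my reconstruction of the flattened listing):

```r
# Uses the fact that the last n - rank entries of effects(obj) are an
# uncorrelated, equal-variance transformation of the response, so they
# can be fed to shapiro.test() directly.
shapiro.test.lm <- function(obj) {
  eff <- effects(obj)
  rank <- obj$rank
  if (obj$df.residual < 3)
    stop("Too few degrees of freedom for residuals for the test.")
  res <- shapiro.test(eff[-(1:rank)])
  res$data.name <- deparse(substitute(obj))
  res$method <- paste(res$method, "for residuals of linear model")
  res
}

fit <- lm(dist ~ speed, data = cars)
shapiro.test.lm(fit)
```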
Re: [R] Testing for normality of residuals in a regression model
Berton Gunter wrote:

> Quite right, John! I have 2 additional questions:
>
> 1) Why test for normality of residuals? Suppose you reject -- then what? (Residual plots may give information on skewness, multi-modality, and data anomalies that can affect the data analysis.)

Because I want to know if my model satisfies the basic assumptions of regression theory... in other words, I want to know if I can trust my model.

Cheers,
Federico

> 2) Why test for normality? Is it EVER useful? Suppose you reject -- then what? (I am tempted to add a 3rd question -- why test at all? -- but that is perhaps too iconoclastic and certainly off topic. Let the hounds remain leashed for now.)
>
> Cheers,
>
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
Re: [R] Testing for normality of residuals in a regression model
Berton Gunter wrote:

> Exactly! My point is that normality tests are useless for this purpose for reasons that are beyond what I can take up here.

Thanks for your suggestions, I understand that! Could you possibly give me some (not too complicated!) links so that I can investigate this matter further?

Cheers,
Federico

> Hints: Balanced designs are robust to non-normality; independence (especially clustering of subjects due to systematic effects), not normality, is usually the biggest real statistical problem; hypothesis tests will always reject when samples are large -- so what!; trust refers to prediction validity, which has to do with study design and the validity/representativeness of the current data for the future. I know that all the stats 101 texts say to test for normality, but they're full of baloney! Of course, this is free advice -- so caveat emptor!
>
> Cheers,
> Bert
RE: [R] Testing for normality of residuals in a regression model
> Berton Gunter wrote:
>> Exactly! My point is that normality tests are useless for this purpose for reasons that are beyond what I can take up here.
>
> Thanks for your suggestions, I understand that! Could you possibly give me some (not too complicated!) links so that I can investigate this matter further? Cheers, Federico

1. This was meant as a private reply so I would not roil the list. In future, when a reply takes a discussion off list, you should keep it off list, please.

2. The writings of (and personal conversations with) John Tukey and George Box are certainly primary influences, as are numerous other commentaries over the years from folks like Leo Breiman, Jerry Friedman, David Freedman, Persi Diaconis and many others. Box's original paper about robustness to non-normality was around 1952, I think, but much of what I allude to is statistical folklore, I think. Perhaps other list contributors might give you some better specific references.

Cheers,
Bert
RE: [R] Testing for normality of residuals in a regression model
Let's see if I can get my stat 101 straight: We learned that linear regression has a set of assumptions:

1. Linearity of the relationship between X and y.
2. Independence of errors.
3. Homoscedasticity (equal error variance).
4. Normality of errors.

Now, we should ask: Why are they needed? Can we get away with less? What if some of them are not met?

It should be clear why we need #1. Without #2, I believe the least squares estimator is still unbiased, but the usual estimates of SEs for the coefficients are wrong, so the t-tests are wrong. Without #3, the coefficients are, again, still unbiased, but not as efficient as can be. Interval estimates for the prediction will surely be wrong. Without #4, well, it depends. If the residual DF is sufficiently large, the t-tests are still valid because of the CLT. You do need normality if you have small residual DF.

The problem with normality tests, I believe, is that they usually have fairly low power at small sample sizes, so that doesn't quite help. There's no free lunch: A normality test with good power will usually have good power against a fairly narrow class of alternatives, and almost no power against others (a directional test). How do you decide what to use? Has anyone seen a data set where the normality test on the residuals is crucial in coming up with the appropriate analysis?

Cheers,
Andy

From: Federico Gherardini

> Berton Gunter wrote:
>> Exactly! My point is that normality tests are useless for this purpose for reasons that are beyond what I can take up here.
>
> Thanks for your suggestions, I understand that! Could you possibly give me some (not too complicated!) links so that I can investigate this matter further?
> Cheers, Federico
>
>> Hints: Balanced designs are robust to non-normality; independence (especially clustering of subjects due to systematic effects), not normality, is usually the biggest real statistical problem; hypothesis tests will always reject when samples are large -- so what!; trust refers to prediction validity, which has to do with study design and the validity/representativeness of the current data for the future. I know that all the stats 101 texts say to test for normality, but they're full of baloney! Of course, this is free advice -- so caveat emptor!
>>
>> Cheers,
>> Bert
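Andy's low-power point is easy to check by simulation (a sketch; exponential errors stand in for "clearly non-normal", and the function name is mine):

```r
# Estimated power of the Shapiro-Wilk test at the 5% level against
# exponential (strongly skewed) data, at a small and a moderate n.
set.seed(7)
power_at <- function(n, reps = 1000)
  mean(replicate(reps, shapiro.test(rexp(n))$p.value < 0.05))
power_at(10)   # small n: the skewness often goes undetected
power_at(100)  # moderate n: rejection is nearly certain
```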
RE: [R] Testing for normality of residuals in a regression model
Dear Kjetil,

I don't believe that these are BLUS residuals, but since the last n - r effects are projections onto an orthogonal basis for the residual subspace, they should do just fine (as long as the basis vectors have the same length, which I think is the case, but perhaps someone can confirm). The general idea is to transform the LS residuals into an uncorrelated, equal-variance set.

Regards,
John

John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140 x23604
http://socserv.mcmaster.ca/jfox

-----Original Message-----
From: Kjetil Brinchmann Halvorsen [mailto:[EMAIL PROTECTED]
Sent: Friday, October 15, 2004 9:12 AM
To: John Fox
Cc: 'Federico Gherardini'; [EMAIL PROTECTED]
Subject: Re: [R] Testing for normality of residuals in a regression model

> John Fox wrote:
>> Dear Federico, A problem with applying a standard test of normality to LS residuals is that the residuals are correlated and heteroskedastic even if the standard assumptions of the model hold. In a large sample, this is unlikely to be problematic (unless there's an unusual data configuration), but in a small sample the effect could be nontrivial. One approach is to use BLUS residuals, which transform the LS residuals to a smaller set of uncorrelated, homoskedastic residuals (assuming the correctness of the model).
>
> I'm not sure if these are BLUS residuals, but the following function transforms to a smaller set of independent, homoscedastic residuals and then calls shapiro.test. I've proposed making this a method of shapiro.test for lm objects, but it was not accepted.
>
> shapiro.test.lm <- function (obj) {
>     eff <- effects(obj)
>     rank <- obj$rank
>     df.r <- obj$df.residual
>     if (df.r < 3)
>         stop("Too few degrees of freedom for residuals for the test.")
>     data.name <- deparse(substitute(obj))
>     x <- eff[-(1:rank)]
>     res <- shapiro.test(x)
>     res$data.name <- data.name
>     res$method <- paste(res$method, "for residuals of linear model")
>     res
> }
>
> Kjetil
>
>> A search of R resources didn't turn up anything for BLUS, but they shouldn't be hard to compute. This is a standard topic covered in many econometrics texts. You might consider the alternative of generating a bootstrapped confidence envelope for the QQ plot; the qq.plot() function in the car package will do this for a linear model.
>>
>> I hope this helps, John
>
> --
> Kjetil Halvorsen.
> Peace is the most effective weapon of mass construction.
> -- Mahdi Elmandjra
RE: [R] Testing for normality of residuals in a regression model
Dear Federico,

The problem is the same with GLS residuals -- even if the GLS transformation produces homoskedastic errors, the residuals will be correlated and heteroskedastic (with this problem tending to disappear in most instances as n grows). The central point is that residuals don't behave quite the same as errors.

Regards,
John

John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140 x23604
http://socserv.mcmaster.ca/jfox

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Federico Gherardini
Sent: Friday, October 15, 2004 11:22 AM
To: [EMAIL PROTECTED]
Subject: Re: [R] Testing for normality of residuals in a regression model

> Thank you very much for your suggestions! The residuals come from a gls model, because I had to correct for heteroscedasticity using a weighted regression... Can I simply apply one of these tests (like shapiro.test) to the standardized residuals from my gls model? Cheers, Federico
RE: [R] Testing for normality of residuals in a regression model
Dear Andy,

At the risk of muddying the waters (and certainly without wanting to advocate the use of normality tests for residuals), I believe that your point #4 is subject to misinterpretation: That is, while it is true that t- and F-tests for regression coefficients in large samples retain their validity well when the errors are non-normal, the efficiency of the LS estimates can (depending upon the nature of the non-normality) be seriously compromised, not only absolutely but in relation to alternatives (e.g., robust regression).

Regards,
John

John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140 x23604
http://socserv.mcmaster.ca/jfox

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy
Sent: Friday, October 15, 2004 11:55 AM
To: 'Federico Gherardini'; Berton Gunter
Cc: R-help mailing list
Subject: RE: [R] Testing for normality of residuals in a regression model

> Let's see if I can get my stat 101 straight: We learned that linear regression has a set of assumptions:
>
> 1. Linearity of the relationship between X and y.
> 2. Independence of errors.
> 3. Homoscedasticity (equal error variance).
> 4. Normality of errors.
>
> Now, we should ask: Why are they needed? Can we get away with less? What if some of them are not met? It should be clear why we need #1. Without #2, I believe the least squares estimator is still unbiased, but the usual estimates of SEs for the coefficients are wrong, so the t-tests are wrong. Without #3, the coefficients are, again, still unbiased, but not as efficient as can be. Interval estimates for the prediction will surely be wrong. Without #4, well, it depends. If the residual DF is sufficiently large, the t-tests are still valid because of the CLT. You do need normality if you have small residual DF. The problem with normality tests, I believe, is that they usually have fairly low power at small sample sizes, so that doesn't quite help.
> There's no free lunch: A normality test with good power will usually have good power against a fairly narrow class of alternatives, and almost no power against others (a directional test). How do you decide what to use? Has anyone seen a data set where the normality test on the residuals is crucial in coming up with the appropriate analysis?
>
> Cheers,
> Andy
>
> From: Federico Gherardini
>> Berton Gunter wrote: Exactly! My point is that normality tests are useless for this purpose for reasons that are beyond what I can take up here.
>>
>> Thanks for your suggestions, I understand that! Could you possibly give me some (not too complicated!) links so that I can investigate this matter further? Cheers, Federico
>>
>> Hints: Balanced designs are robust to non-normality; independence (especially clustering of subjects due to systematic effects), not normality, is usually the biggest real statistical problem; hypothesis tests will always reject when samples are large -- so what!; trust refers to prediction validity, which has to do with study design and the validity/representativeness of the current data for the future. I know that all the stats 101 texts say to test for normality, but they're full of baloney! Of course, this is free advice -- so caveat emptor!
>>
>> Cheers,
>> Bert
RE: [R] Testing for normality of residuals in a regression model
Hi John,

Your point is well taken. I was only thinking about the shape of the distribution, and neglected the case of, say, symmetric long-tailed distributions. However, I think I'd still argue that other tools are probably more useful than normality tests (e.g., robust methods, as you mentioned).

To take the point a bit further, let's say we test for normality and it's rejected. What do we do then? Well, if the non-normality is caused by outliers, we can try robust methods. If not, what do we do? We can try to see if some sort of transformation would bring the residuals closer to normally distributed, but if the interest is in inference on the coefficients, those inferences on the `final' model are potentially invalid. What's one to do then?

Also, I was told by someone very smart that fitting OLS to data with heteroscedastic errors can make the residuals look `more normal' than they really are... Don't know how true that is, though.

Best, Andy

From: John Fox

Dear Andy, At the risk of muddying the waters (and certainly without wanting to advocate the use of normality tests for residuals), I believe that your point #4 is subject to misinterpretation: while it is true that t- and F-tests for regression coefficients in large samples retain their validity well when the errors are non-normal, the efficiency of the LS estimates can (depending upon the nature of the non-normality) be seriously compromised, not only absolutely but in relation to alternatives (e.g., robust regression).
Regards, John

John Fox, Department of Sociology, McMaster University, Hamilton, Ontario, Canada L8S 4M4; 905-525-9140x23604; http://socserv.mcmaster.ca/jfox
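[Editor's illustration, not part of the original thread] To make Andy's "try a transformation" option concrete, here is a minimal sketch using Box-Cox profiling (MASS::boxcox); the data are made up, with a log-normal response by construction. His caveat stands: inference on a model chosen after a data-driven transformation is potentially invalid.

```r
## Sketch: picking a normalizing power transform by Box-Cox.
library(MASS)
set.seed(3)
x <- runif(100, 1, 10)
y <- exp(1 + 0.3 * x + rnorm(100, sd = 0.3))  # positive, skewed response
bc <- boxcox(lm(y ~ x), plotit = FALSE)       # profile log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]               # maximizing power
lambda                                        # near 0, i.e. use log(y)
```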
Re: [R] Testing for normality of residuals in a regression model
OK, I'll expose myself: I tend to do normal probability plots of residuals (usually deletion/studentized residuals, as described by Venables and Ripley in Modern Applied Statistics with S, 4th ed., MASS4). If the plots look strange, I do something. I'll check apparent outliers for coding and data entry errors, and I often delete those points from the analysis even if I can't find a reason why. Robust regression will usually handle this type of problem, and I am gradually migrating to increasing use of robust regression, especially the procedures recommended by MASS4.

However, I recently encountered a situation that would be masked by standard use of robust regression without examining residual plots: a normal probability plot looked like three parallel straight lines with gaps, suggesting a mixture of 3 normal distributions with different means and a common standard deviation. Further investigation revealed that an important 3-level explanatory variable had been miscoded. When this was corrected, that variable entered the model and the gaps in the normal plot disappeared.

I tend NOT to use tests of normality, for the reasons Andy mentioned. Instead, I do various kinds of diagnostic plots and modify my model or investigate the data in response to what I see. Comments?

Hope this helps. Spencer Graves

-- Spencer Graves, PhD, Senior Development Engineer; O: (408)938-4420; mobile: (408)655-4567
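[Editor's illustration, not part of the original thread] A minimal sketch of the workflow Spencer describes, on made-up data: a QQ plot of studentized residuals to spot trouble, then a MASS robust fit (here an MM-estimator via rlm) that resists the outliers the least-squares fit is dragged by.

```r
## Sketch: gross outliers show up in a QQ plot of studentized
## residuals, and rlm (MASS) resists them where lm does not.
library(MASS)
set.seed(7)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)          # true slope is 0.5
y[28:30] <- y[28:30] + 10             # three gross outliers at high x

fit.ls <- lm(y ~ x)
qqnorm(rstudent(fit.ls))              # the three outliers stand out
qqline(rstudent(fit.ls))

fit.rob <- rlm(y ~ x, method = "MM")  # MM-estimation, as recommended in MASS4
coef(fit.ls)["x"]                     # pulled up by the outliers
coef(fit.rob)["x"]                    # much closer to the true 0.5
```

Note Spencer's caveat still applies: the robust fit quietly downweights the bad points, so it is the residual plot, not the fit, that tells you something is wrong with the data.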
Re: [R] Testing for normality of residuals in a regression model
Liaw, Andy wrote: [...] Also, I was told by someone very smart that fitting OLS to data with heteroscedastic errors can make the residuals look `more normal' than they really are... Don't know how true that is, though. Best, Andy

Certainly true, since the residuals will be a kind of average, so the CLT works. (I think that is in Seber, Linear Regression Analysis, 1977.)

Kjetil

-- Kjetil Halvorsen. Peace is the most effective weapon of mass construction. -- Mahdi Elmandjra
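[Editor's illustration, not part of the original thread] Kjetil's CLT point is easy to demonstrate by simulation: each OLS residual is a linear combination of all the errors, r = (I - H)e, so strongly skewed errors produce noticeably less skewed residuals.

```r
## Sketch: centered-exponential (right-skewed) errors vs. the OLS
## residuals they produce; averaged over many replications, the
## residuals' skewness is clearly attenuated.
set.seed(42)
skew <- function(z) mean((z - mean(z))^3) / sd(z)^3

sim <- function(n = 50, p = 20) {
  X <- matrix(rnorm(n * p), n, p)
  e <- rexp(n) - 1              # strongly right-skewed errors
  r <- residuals(lm(e ~ X))     # true coefficients all zero, so r = (I - H) e
  c(error = skew(e), residual = skew(r))
}

rowMeans(replicate(200, sim())) # residual skewness well below error skewness
```

So a residual QQ plot can look reassuringly straight even when the underlying error distribution is not normal, which is exactly why Andy's smart acquaintance was right to be suspicious.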