Re: [R] Assumptions for ANOVA: the right way to check the normality
Frodo Jedi wrote:

>> What is the question you are really trying to find the answer for? Knowing that may help us give more meaningful answers.
>
> Concerning your question I thought to have been clear. I want to understand which analysis I have to use in order to understand if the differences I am having are statistically significant or not.

Dear Frodo,

I would like to suggest that what is required is a concrete specification of a hypothesis. Something like: "I hypothesize that the responses to condition-A would be different in magnitude from the responses to condition-AH, across all stimuli."

Perhaps after having a detailed formulation of your hypothesis, the required analysis will be clearer to you, or at least it will be easier for experts to guide you.

Best,
dror

--
View this message in context: http://r.789695.n4.nabble.com/Assumptions-for-ANOVA-the-right-way-to-check-the-normality-tp3176073p3208596.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Assumptions for ANOVA: the right way to check the normality
From: Frodo Jedi [mailto:frodo.j...@yahoo.com]
Sent: Monday, January 10, 2011 5:44 PM
To: Greg Snow
Cc: r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

> Dear Greg,
>
> first of all thanks for your reply. And I add also many thanks to all of you guys who are helping me, sorry for the amount of questions I recently posted ;-)
>
> I don't have a solid statistics background (I am not a statistician) and I am basically learning everything by myself. So my first goal is TO UNDERSTAND. I need to have general guidelines because for my PhD I am doing and I will do several psychophysics experiments. I am totally alone in this challenge, so I am asking you guys for some help, as I think that here is the best place to pick up the thing that I miss and that I will never find in any book: the experience.

Isn't there a single statistician anywhere in the University? Does your committee have any experience with any of this?

>> What is the question you are really trying to find the answer for? Knowing that may help us give more meaningful answers.
>
> Concerning your question I thought to have been clear. I want to understand which analysis I have to use in order to understand if the differences I am having are statistically significant or not. Now, as in all the books I read it is written that to apply ANOVA I must respect the assumption of normality, I am trying to find a way to understand this.

A general run of anova procedures will produce multiple p-values addressing multiple null hypotheses addressing many different questions (many of which are often uninteresting). Which terms are you really trying to test, and which are included because you already know that they have an effect? Are you including interactions because you find them actually interesting? Or just because that is what everyone else does?

[snip]

>> Also remember that the normality of the data/residuals/etc. is not as important as the CLT for your sample size.
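The remark about the CLT can be checked by simulation. The following is a minimal sketch only, under assumptions that are not from the thread: 15 hypothetical subjects, responses drawn uniformly from 1-7, the two conditions simulated independently, and no true difference between them. It estimates how often a paired t-test rejects at the 5% level when the null is true.

```r
# Hypothetical setup (not the poster's data): 15 subjects rate on a 1-7
# scale under two conditions, and there is NO true condition effect.
set.seed(1)
n_subjects <- 15
n_sims <- 2000

one_p_value <- function() {
  a  <- sample(1:7, n_subjects, replace = TRUE)
  ah <- sample(1:7, n_subjects, replace = TRUE)
  d  <- a - ah
  # t.test() stops with an error on constant data, so guard that rare case:
  # p = 1 if every difference is zero, p = 0 if all equal a nonzero value
  if (sd(d) == 0) return(as.numeric(all(d == 0)))
  t.test(d)$p.value
}

p_values <- replicate(n_sims, one_p_value())

# Share of simulated experiments that reject at the 5% level
type1_rate <- mean(p_values < 0.05)
type1_rate
```

If the rate sits close to the nominal 0.05, normal theory is doing an adequate job for this particular (assumed) design; changing the response distribution or the sample size shows how sensitive that conclusion is.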
>> The main things that make the CLT not work (for samples that are not large enough) are outliers and strong skewness. Since your outcome is limited to the numbers 1-7, I don't see outliers or skewness being a real problem. So you are probably fine for fixed effects style models (though checking with experts in your area or doing simulations can support/counter this).
>
> As far as I have seen everyone in my field does ANOVA.

[imagine best Mom voice] And if everyone in your field jumped off a cliff . . .

Do you want to do what everyone else is doing, or something new and different? What does your committee chair say about this?

>> But when you add in random effects then there is a lot of uncertainty about whether the normal theory still holds; the latest lme code uses MCMC sampling rather than depending on normal theory and is still being developed.
>
> For random effects do you mean the repeated measures, right? So why did statisticians develop the ANOVA with repeated measures if there is so much uncertainty?

Repeated measures are one type of random effect analysis, but random and mixed effects are more general than just repeated measures. Statisticians developed those methods because they worked for simple cases, made some sense for more complicated cases, and they did not have anything that was both better and practical. Now with modern computers we can see when those do work (unfortunately not as often as had been hoped), and what was once impractical is now much simpler (but inertia is to do it the old way, even though the people who developed the old way would have preferred to do it our way).

The article:

Why Permutation Tests Are Superior to t and F Tests in Biomedical Research
John Ludbrook and Hugh Dudley
The American Statistician, Vol. 52, No. 2 (May, 1998), pp. 127-132

may be enlightening here (and give possible alternatives).

Also see: https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q1/001819.html for some simulations involving mixed models.
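The permutation idea behind the Ludbrook and Dudley article can be sketched for a paired comparison of conditions A and AH. This is only an illustration under assumptions that are not from the thread: the data are simulated, each hypothetical subject contributes one rating per condition, and the sign-flip scheme is one common choice for paired data rather than anything the article or the posters prescribe.

```r
set.seed(42)
# Hypothetical data: one rating per subject under each condition
n_subjects <- 20
a  <- sample(1:7, n_subjects, replace = TRUE)
ah <- pmin(7, a + sample(0:2, n_subjects, replace = TRUE))  # AH tends higher

d <- ah - a              # within-subject differences
observed <- mean(d)

# Under the null of no condition effect, the sign of each subject's
# difference is arbitrary, so flip signs at random and recompute the mean.
n_perm <- 5000
perm_means <- replicate(
  n_perm,
  mean(d * sample(c(-1, 1), n_subjects, replace = TRUE))
)

# Two-sided p-value; the +1 counts the observed arrangement itself
p_value <- (sum(abs(perm_means) >= abs(observed)) + 1) / (n_perm + 1)
p_value
```

No normality assumption enters here; the reference distribution comes from the data themselves. With several stimuli per subject, one would first have to decide what the exchangeable units are, which is again a question about the hypothesis rather than about the software.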
One of those simulations shows that the normal theory works fine for that particular case; the next one shows a case where the normal theory does not work, then shows how to use simulation (parametric bootstrap) to get a more appropriate p-value. You can adapt those examples for your own situation.

>> This now comes back to my first question: what are you trying to find out?
>
> My ultimate goal is to find the p-values in order to understand if my results are significant or not. So I can write them on the paper ;-)

There is a function in the TeachingDemos package that will produce p-values if that is all you want; these are independent of any normality assumptions, independent of any data in fact. However they don't really help with understanding.

Graphing the data (I think you have done this already) is the best route to understanding. If you need more than that, then consider the following article: Buja, A., Cook, D.
Re: [R] Assumptions for ANOVA: the right way to check the normality
Many many thanks for your feedback Greg. You have been very enlightening for me. Now it is time for me to study the material you kindly provided me.

Thanks.

From: Greg Snow <greg.s...@imail.org>
Cc: r-help@r-project.org <r-help@r-project.org>
Sent: Tue, January 11, 2011 10:13:34 PM
Subject: RE: [R] Assumptions for ANOVA: the right way to check the normality

> Sent: Monday, January 10, 2011 5:44 PM
> To: Greg Snow
> Cc: r-help@r-project.org
> Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality
>
>> Dear Greg,
>>
>> first of all thanks for your reply. And I add also many thanks to all of you guys who are helping me, sorry for the amount of questions I recently posted ;-)
>>
>> I don't have a solid statistics background (I am not a statistician) and I am basically learning everything by myself. So my first goal is TO UNDERSTAND. I need to have general guidelines because for my PhD I am doing and I will do several psychophysics experiments. I am totally alone in this challenge, so I am asking you guys for some help, as I think that here is the best place to pick up the thing that I miss and that I will never find in any book: the experience.
>
> Isn't there a single statistician anywhere in the University? Does your committee have any experience with any of this?
>
>>> What is the question you are really trying to find the answer for? Knowing that may help us give more meaningful answers.
>>
>> Concerning your question I thought to have been clear. I want to understand which analysis I have to use in order to understand if the differences I am having are statistically significant or not. Now, as in all the books I read it is written that to apply ANOVA I must respect the assumption of normality, I am trying to find a way to understand this.
>
> A general run of anova procedures will produce multiple p-values addressing multiple null hypotheses addressing many different questions (many of which are often uninteresting).
Re: [R] Assumptions for ANOVA: the right way to check the normality
What is the question you are really trying to find the answer for? Knowing that may help us give more meaningful answers.

You keep wanting to test the residuals for normality, but it looks like you are doing it because some outdated recipe suggests it rather than because you understand why. It is fairly easy to create a distribution that is definitely not normal, that gives the wrong answer most of the time if normality is assumed, yet will pass most normality tests most of the time (well, except for SnowsPenultimateNormalityTest, but that one has an unfair advantage in this situation). So just because the residuals look normal (or close enough) does not mean that the theory holds.

R. A. Fisher is said to have said that the quality of a statistician can be judged by the amount of rat droppings under his fingernails. Now if we take that literally, then I must not be very good. But what he really meant is that a statistician must understand the source of the data, not just get a file and put it through some canned routines. So these questions are really for you or the source of your data.

Also remember that the normality of the data/residuals/etc. is not as important as the CLT for your sample size. The main things that make the CLT not work (for samples that are not large enough) are outliers and strong skewness. Since your outcome is limited to the numbers 1-7, I don't see outliers or skewness being a real problem. So you are probably fine for fixed effects style models (though checking with experts in your area or doing simulations can support/counter this). But when you add in random effects then there is a lot of uncertainty about whether the normal theory still holds; the latest lme code uses MCMC sampling rather than depending on normal theory and is still being developed.

This now comes back to my first question: what are you trying to find out?

You may not need to do anova or that type of model. Some simple hypotheses may be answered using McNemar's test on your data.
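As a rough illustration of the McNemar suggestion, here is a minimal sketch. Everything specific in it is an assumption made for the example: the data are simulated, each hypothetical subject gives one rating per condition, and the cut-off of 5 for calling a rating "realistic" is arbitrary.

```r
set.seed(7)
n_subjects <- 30

# Hypothetical paired ratings on a 1-7 scale (not the poster's data)
a  <- sample(1:7, n_subjects, replace = TRUE)
ah <- sample(1:7, n_subjects, replace = TRUE)

# Dichotomise each rating: "realistic" if 5 or above (arbitrary cut-off).
# Fixing the factor levels guarantees a 2 x 2 table even if a cell is empty.
high_a  <- factor(a  >= 5, levels = c(FALSE, TRUE))
high_ah <- factor(ah >= 5, levels = c(FALSE, TRUE))

# Paired 2 x 2 table; McNemar's test compares the two discordant cells
tab <- table(A = high_a, AH = high_ah)
tab

result <- mcnemar.test(tab)
result
```

The hypothesis being tested is then a concrete one: that a subject is equally likely to switch from "not realistic" to "realistic" as the other way round when going from A to AH. The price is the information discarded by collapsing seven categories into two.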
If you want to do predictions then linear models will be meaningless (what would a prediction of -3.2, 4.493, or 8.1 mean on a 7 point Likert scale?) and something like proportional odds logistic regression will be much more meaningful. Between those are bootstrap and permutation methods that may answer your question without any normality assumptions.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

From: Frodo Jedi [mailto:frodo.j...@yahoo.com]
Sent: Saturday, January 08, 2011 3:20 AM
To: Greg Snow
Cc: r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

> Dear Greg,
>
> many thanks for your answer. Now I have a problem in understanding how to check normality in the case of ANOVA with repeated measures. I would need help with a numeric example, as I haven't fully understood how it works with the proj() command, as was suggested by another R user on this mailing list.
>
> For example, in attachment you find a .csv table resulting from an experiment; you can access it by means of this command:
>
> scrd <- read.csv(file='/Users/../tables_for_R/table_quality_wood.csv', sep=',', header=T)
>
> The data are from an experiment where participants had to evaluate, on a seven point Likert scale, the realism of some stimuli, which are presented both in condition A and in condition AH.
>
> I need to perform the ANOVA by means of this command:
>
> aov1 = aov(response ~ stimulus*condition + Error(subject/(stimulus*condition)), data=scrd)
>
> but the problem is that I cannot plot the qqnorm on the residuals of the fit as I usually do, because lm does not support the Error term present in aov. I normally check normality through a plot (or the shapiro.test function).
>
> Now could you please illustrate how you would be able to understand from my data whether they are normally distributed?
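For the mechanics of the proj() route mentioned in the question, here is a minimal sketch. It does not use the poster's file: the data frame below is simulated, with hypothetical sizes (12 subjects, 4 stimuli) and one response per subject, stimulus and condition. Only the model formula is taken from the message.

```r
set.seed(3)
# Hypothetical balanced design standing in for the poster's CSV:
# every subject rates every stimulus under both conditions exactly once
scrd <- expand.grid(
  subject   = factor(1:12),
  stimulus  = factor(paste0("s", 1:4)),
  condition = factor(c("A", "AH"))
)
scrd$response <- sample(1:7, nrow(scrd), replace = TRUE)

aov1 <- aov(response ~ stimulus * condition +
              Error(subject / (stimulus * condition)), data = scrd)

# proj() on a multistratum fit returns one matrix per error stratum;
# strata with residual degrees of freedom have a "Residuals" column
pr <- proj(aov1)
names(pr)

# Residuals of the last stratum, where stimulus:condition is tested
last <- pr[[length(pr)]]
res  <- last[, "Residuals"]

# The usual graphical check, now applied to that stratum's residuals
qqnorm(res)
qqline(res)
```

Each stratum has its own residuals, so the check can be repeated for the other elements of `pr`. The projected residuals are not independent, so the plot is a rough guide at best, and, as the reply above stresses, a reassuring plot does not by itself show that the normal theory holds.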
> Please enlighten me
>
> Best regards
>
> From: Greg Snow <greg.s...@imail.org>
> To: Ben Ward <benjamin.w...@bathspa.org>; r-help@r-project.org <r-help@r-project.org>
> Sent: Fri, January 7, 2011 7:34:05 PM
> Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality
>
>> A lot of this depends on what question you are really trying to answer. For one-way anova, replacing y-values with their ranks essentially transforms the distribution to uniform (under the null), and the Central Limit Theorem kicks in for the uniform with samples larger than about 5, so the normal approximations are pretty good and the theory works. But what are you actually testing? The most meaningful null that is being tested is that all data come from the exact same distribution. So what does it mean when you reject that null? It means that the groups do not all represent the same distribution, but is that because the means differ? Or the variances? Or the shapes? It can be any of those. Some point out that if you make
Re: [R] Assumptions for ANOVA: the right way to check the normality
I can't get hotmail to indicate the original text so I'm going to top post. There seems to be a lot of back and forth here; let me see if these comments help guide the discussion a bit.

I tried to run some histograms of your experiment (prior to a bunch of other things) and IIRC in many cases you have counts under 10. At minimum, for anything you do or any test you run, you want to do some sensitivity analyses and perturb your data a bit.

Your objective of course is important. Say you want to calibrate your response data and try to validate your assumption that your survey questions reflect some continuous variable (but a respondent can only round his response to an integer, as in the case of taking a temperature, for example); otherwise all you can really say is that these things are like ranks, 7 > 6 > 5 etc.

Personally I always avoid non-parametrics (just personal bias), but with small samples and a response that is closer to a rank than a continuous variable with some meaning, it may make sense. If you plot histograms of responses for A versus AH, visually they look different; you could try fitting the histograms to various pdfs and see what you get, etc. This is all retro/post-hoc so you may as well explore away.

From: greg.s...@imail.org
To: frodo.j...@yahoo.com
Date: Mon, 10 Jan 2011 11:26:05 -0700
CC: r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

What is the question you are really trying to find the answer for? Knowing that may help us give more meaningful answers.

You keep wanting to test the residuals for normality, but it looks like you are doing it because some outdated recipe suggests it rather than because you understand why. It is fairly easy to create a distribution that is definitely not normal, that gives the wrong answer most of the time if normality is assumed, yet will pass most normality tests most of the time (well, except for SnowsPenultimateNormalityTest, but that one has an unfair advantage in this situation).
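One half of that claim, that a non-normal distribution can pass a normality test most of the time, is easy to look at by simulation. The sketch below is only an illustration under assumptions of its own: a t distribution with 5 degrees of freedom as the non-normal source and a hypothetical sample size of 15. It does not construct the kind of distribution that also makes normal-theory answers wrong most of the time.

```r
set.seed(11)
n <- 15          # hypothetical sample size
n_sims <- 2000   # number of simulated samples

# Draw from a t distribution with 5 df (heavier tails than the normal)
# and record whether shapiro.test() rejects normality at the 5% level
rejects <- replicate(n_sims, shapiro.test(rt(n, df = 5))$p.value < 0.05)

# Share of these non-normal samples that the test flags as non-normal
reject_rate <- mean(rejects)
reject_rate
```

A low rejection rate here says more about the power of the test at small sample sizes than about the data, which is why a non-significant shapiro.test result is weak evidence that the theory holds.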
Re: [R] Assumptions for ANOVA: the right way to check the normality
Dear Greg,

first of all thanks for your reply. And I add also many thanks to all of you guys who are helping me, sorry for the amount of questions I recently posted ;-)

I don't have a solid statistics background (I am not a statistician) and I am basically learning everything by myself. So my first goal is TO UNDERSTAND. I need to have general guidelines because for my PhD I am doing and I will do several psychophysics experiments. I am totally alone in this challenge, so I am asking you guys for some help, as I think that here is the best place to pick up the thing that I miss and that I will never find in any book: the experience.

> What is the question you are really trying to find the answer for? Knowing that may help us give more meaningful answers.

Concerning your question I thought to have been clear. I want to understand which analysis I have to use in order to understand if the differences I am having are statistically significant or not. Now, as in all the books I read it is written that to apply ANOVA I must respect the assumption of normality, I am trying to find a way to understand this.

> You keep wanting to test the residuals for normality, but it looks like you are doing it because some outdated recipe suggests it rather than because you understand why.

Sorry Greg, if I look like this. It is not true, I am understanding everything, more than I show.

> It is fairly easy to create a distribution that is definitely not normal, that gives the wrong answer most of the time if normality is assumed, yet will pass most normality tests most of the time (well, except for SnowsPenultimateNormalityTest, but that one has an unfair advantage in this situation). So just because the residuals look normal (or close enough) does not mean that the theory holds.

This is the thing that I cannot find in any book, do you understand? If I kept to a book I would never have understood this.

> R. A. Fisher is said to have said that the quality of a statistician can be judged by the amount of rat droppings under his fingernails. Now if we take that literally, then I must not be very good. But what he really meant is that a statistician must understand the source of the data, not just get a file and put it through some canned routines. So these questions are really for you or the source of your data.

I agree. The source of my data is basically subjects (15 < n < 45) who have to test some devices that I develop, and they have to fill in a questionnaire (sometimes based on a seven point Likert scale). Now, I don't find any particular problem for this case. What could be the problem with my data here?

> Also remember that the normality of the data/residuals/etc. is not as important as the CLT for your sample size. The main things that make the CLT not work (for samples that are not large enough) are outliers and strong skewness. Since your outcome is limited to the numbers 1-7, I don't see outliers or skewness being a real problem. So you are probably fine for fixed effects style models (though checking with experts in your area or doing simulations can support/counter this).

As far as I have seen everyone in my field does ANOVA.

> But when you add in random effects then there is a lot of uncertainty about whether the normal theory still holds; the latest lme code uses MCMC sampling rather than depending on normal theory and is still being developed.

For random effects do you mean the repeated measures, right? So why did statisticians develop the ANOVA with repeated measures if there is so much uncertainty?

> This now comes back to my first question: what are you trying to find out?

My ultimate goal is to find the p-values in order to understand if my results are significant or not. So I can write them on the paper ;-)

> You may not need to do anova or that type of model. Some simple hypotheses may be answered using McNemar's test on your data. If you want to do predictions then linear models will be meaningless (what would a prediction of -3.2, 4.493, or 8.1 mean on a 7 point Likert scale?) and something like proportional odds logistic regression will be much more meaningful. Between those are bootstrap and permutation methods that may answer your question without any normality assumptions.

Ok. But is the ANOVA analysis I did so far wrong or not? I think it is valid, since the results seem coherent with what one can see looking at the means.

Thanks for sharing your precious experience with me. I think the world becomes better when people help each other.

All the best

> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111
>
> From: Frodo Jedi [mailto:frodo.j...@yahoo.com]
> Sent: Saturday, January 08, 2011 3:20 AM
> To: Greg Snow
> Cc: r-help@r-project.org
> Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality
>
>> Dear Greg, many thanks for your answer. Now
Re: [R] Assumptions for ANOVA: the right way to check the normality
(again top posting since hotmail isn't adding quote marks, and these comments apply to the whole thread anyway)

I'm not a statistician either but rather an engineer who has had a chance to use my intro stats/math background to look at some real life situations. I'm just making comments for conversation, hoping to elicit more specifics from experts willing to talk (but even with these caveats I have noted my posts are getting quite sloppy).

There is nothing wrong with wanting to be rigorous, to replicate the approach others have taken, and to select the best tests. However, if this is a thesis and you want to do meaningful research, you really will have to be more concerned about the results, not just parroting stuff that may or may not help you understand your data. Personally I would suggest a book, rather than just buying a consultant, and using dead time to play with formulas and think about them. Paper and pencil are still worthwhile, and it would help your later work if you had some idea what these numbers may or may not mean. None of this math should be beyond you.

If you run every test that could be of interest and is easy to code in R, plow through the formulas and reconcile apparent agreements and disagreements using algebra, run sensitivity tests, add noise to the data, remove points, etc., and do it all again, then you could start to see what would happen if your data, or residuals from some fit, or some other thing were not normally distributed. It should pop out of the analysis. You ought to be able to do simple things like get SSEs without using any stats packages and see if you can step through the stuff.

To give you perspective on what your peers are doing, this is a search on a recent controversy in which you may be interested before looking for cookbook approaches.
I have not looked at this in a few years now, so maybe it has been settled, but these are respected folk using terms like voodoo for stat analysis,

http://www.google.com/#sclient=psyhl=enq=vul+fmri+voodoo

maybe this in particular,

http://afni.nimh.nih.gov/pub/dist/doc/misc/voodoo.pdf

and I'm sure you can easily find others. No one is suggesting you ignore experts and do your own thing (it is easy to mislead yourself if you don't look for sanity checks), just that the field is a bit open and you are unlikely to get a cookbook result for a single test that gives you THE P VALUE.

Date: Mon, 10 Jan 2011 16:43:56 -0800
From: frodo.j...@yahoo.com
To: greg.s...@imail.org
CC: r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

Dear Greg,

first of all thanks for your reply. And I add also many thanks to all of you guys who are helping me, sorry for the amount of questions I recently posted ;-)

I don't have a solid statistics background (I am not a statistician) and I am basically learning everything by myself. So my first goal is TO UNDERSTAND. I need to have general guidelines because for my PhD I am doing and I will do several psychophysics experiments. I am totally alone in this challenge, so I am asking you guys for some help, as I think that here is the best place to pick up the thing that I miss and that I will never find in any book: the experience.

What is the question you are really trying to find the answer for? Knowing that may help us give more meaningful answers.

Concerning your question I thought to have been clear. I want to understand which analysis I have to use in order to understand if the differences I am having are statistically significant or not. Now, as in all the books I read it is written that to apply ANOVA I must respect the assumption of normality, I am trying to find a way to understand this.
Re: [R] Assumptions for ANOVA: the right way to check the normality
Dear Greg,

many thanks for your answer. Now I have a problem in understanding how to check normality in the case of ANOVA with repeated measures. I would need help with a numeric example, as I haven't fully understood how it works with the proj() command, as was suggested by another R user on this mailing list.

For example, in attachment you find a .csv table resulting from an experiment; you can access it by means of this command:

scrd <- read.csv(file='/Users/../tables_for_R/table_quality_wood.csv', sep=',', header=T)

The data are from an experiment where participants had to evaluate, on a seven point Likert scale, the realism of some stimuli, which are presented both in condition A and in condition AH.

I need to perform the ANOVA by means of this command:

aov1 = aov(response ~ stimulus*condition + Error(subject/(stimulus*condition)), data=scrd)

but the problem is that I cannot plot the qqnorm on the residuals of the fit as I usually do, because lm does not support the Error term present in aov. I normally check normality through a plot (or the shapiro.test function).

Now could you please illustrate how you would be able to understand from my data whether they are normally distributed?

Please enlighten me

Best regards

From: Greg Snow <greg.s...@imail.org>
To: Ben Ward <benjamin.w...@bathspa.org>; r-help@r-project.org <r-help@r-project.org>
Sent: Fri, January 7, 2011 7:34:05 PM
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

> A lot of this depends on what question you are really trying to answer. For one-way anova, replacing y-values with their ranks essentially transforms the distribution to uniform (under the null), and the Central Limit Theorem kicks in for the uniform with samples larger than about 5, so the normal approximations are pretty good and the theory works. But what are you actually testing? The most meaningful null that is being tested is that all data come from the exact same distribution. So what does it mean when you reject that null?
It means that all the groups are not representing the same distribution, but is that because the means differ? Or the variances? Or the shapes? It can be any of those. Some point out that if you make certain assumptions such as symmetry or shifts of the same distributions, then you can talk about differences in means or medians, but usually if I am using non-parametrics it is because I don't believe that things are symmetric and the shift idea doesn't fit in my mind. Some alternatives include bootstrapping or permutation tests, or just transforming the data to get something closer to normal. Now what does replacing by ranks do in 2-way anova where we want to test the difference in one factor without making assumptions about whether the other factor has an effect or not? I'm not sure on this one. I have seen regression on ranks, it basically tests for some level of relationship, but regression is usually used for some type of prediction and predicting from a rank-rank regression does not seem meaningful to me. Fitting the regression model does not require normality, it is the tests on the coefficients and confidence and prediction intervals that assume normality (again the CLT helps for large samples (but not for prediction intervals)). Bootstrapping is an option for regression without assuming normality, transformations can also help. -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.s...@imail.org 801.408.8111 -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r- project.org] On Behalf Of Ben Ward Sent: Thursday, January 06, 2011 2:00 PM To: r-help@r-project.org Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality On 06/01/2011 20:29, Greg Snow wrote: Some would argue to always use the kruskal wallis test since we never know for sure if we have normality. Personally I am not sure that I understand what exactly that test is really testing. 
Plus in your case you are doing a two-way anova and kruskal.test does one-way, so it will not work for your case. There are other non-parametric options. Just read this and had queries of my own and comments on this subject: Would one of these options be to rank the data before doing whatever model or test you want to do? As I understand it makes the place of the data the same, but pulls extreme cases closer to the rest. Not an expert though. I've been doing lm() for my work, and I don't know if that makes an assumption of normality (may data is not normal). And I'm unsure of any other assumptions as my texts don't really discuss them. Although I can comfortably evaluate a model say using residual vs fitted, and F values turned to P, resampling and confidence intervals, and looking at sums of squares terms add to explanation of the model. I've tried the plot() function to help
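For the repeated-measures question above, here is a minimal sketch of how proj() can be used to pull residuals out of an aov fit that has an Error() term. The scrd data are not available here, so the built-in npk data set and its model formula stand in for them; those names are assumptions for illustration only, and the strata in your own fit will be named after your Error() term.

```r
# Illustration only: the built-in npk data stand in for scrd.
fit <- aov(yield ~ N * P * K + Error(block), data = npk)

# proj() on a multi-stratum fit returns one matrix per error stratum.
# A stratum with residual degrees of freedom has a "Residuals" column.
pr <- proj(fit)
print(names(pr))  # stratum names

# Residuals from the within-block stratum
within_res <- pr[["Within"]][, "Residuals"]

# The usual visual check, applied to one stratum at a time
qqnorm(within_res)
qqline(within_res)
```

With a model like aov1 you would repeat the Q-Q plot for each stratum that has a Residuals column, since each stratum has its own error term.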
Re: [R] Assumptions for ANOVA: the right way to check the normality
A lot of this depends on what question you are really trying to answer. For one-way anova, replacing y-values with their ranks essentially transforms the distribution to uniform (under the null), and the Central Limit Theorem kicks in for the uniform with samples larger than about 5, so the normal approximations are pretty good and the theory works. But what are you actually testing? The most meaningful null that is being tested is that all data come from the exact same distribution. So what does it mean when you reject that null? It means that the groups do not all represent the same distribution, but is that because the means differ? Or the variances? Or the shapes? It can be any of those. Some point out that if you make certain assumptions, such as symmetry or shifts of the same distribution, then you can talk about differences in means or medians, but usually if I am using non-parametrics it is because I don't believe that things are symmetric, and the shift idea doesn't fit in my mind.

Some alternatives include bootstrapping or permutation tests, or just transforming the data to get something closer to normal.

Now what does replacing by ranks do in 2-way anova, where we want to test the difference in one factor without making assumptions about whether the other factor has an effect or not? I'm not sure on this one.

I have seen regression on ranks. It basically tests for some level of relationship, but regression is usually used for some type of prediction, and predicting from a rank-rank regression does not seem meaningful to me.

Fitting the regression model does not require normality; it is the tests on the coefficients and the confidence and prediction intervals that assume normality (again the CLT helps for large samples, but not for prediction intervals). Bootstrapping is an option for regression without assuming normality; transformations can also help.

-- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.s...@imail.org 801.408.8111

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Ben Ward Sent: Thursday, January 06, 2011 2:00 PM To: r-help@r-project.org Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

[snip]
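The permutation-test alternative mentioned in this message can be sketched in a few lines of base R. The data below are simulated (three hypothetical groups of 10 ratings on a 1-7 scale) and the helper f_stat is made up for the illustration. Note that the null being tested is still "all groups come from the same distribution", which is exactly the interpretive point raised above.

```r
set.seed(1)

# Hypothetical data: three groups of 10 ratings on a 1-7 scale
g <- factor(rep(c("a", "b", "c"), each = 10))
y <- sample(1:7, 30, replace = TRUE)

# F statistic from a one-way layout
f_stat <- function(y, g) anova(lm(y ~ g))[1, "F value"]

obs <- f_stat(y, g)

# Shuffle the responses across groups to build the null distribution of F
perm <- replicate(999, f_stat(sample(y), g))

# Permutation p-value (the observed statistic counts as one permutation)
p_perm <- (1 + sum(perm >= obs)) / (1 + length(perm))
print(p_perm)

# Rank-based one-way alternative for comparison
print(kruskal.test(y ~ g))
```

This only covers the one-way case. For a two-way or repeated-measures design the shuffling scheme has to respect the design, which is where the uncertainty expressed above comes in.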
Re: [R] Assumptions for ANOVA: the right way to check the normality
I believe what I'm doing is an ancova, because I have two categorical explanatory variables and one numerical one, and a numerical response variable (this is the same experiment as before, the bacteria). At the minute (because I'm only halfway through) I'm just doing some modelling and seeing what I get with what I currently have. I'm paying attention to the 95% CI for the different terms of a model, as well as the coefficient, the explanatory power of the term, and the likelihood that the same result could be obtained at random, through the P values derived from F. To be honest I haven't checked much what my data distributions are like, because I'm not finished collecting the data yet. I mainly mentioned the ranking because it was given considerable mention in one of my texts' sections on hypothesis testing on models.

On 07/01/2011 18:34, Greg Snow wrote:

[snip]
Re: [R] Assumptions for ANOVA: the right way to check the normality
Dear Robert,

thanks so much!!! Now I understand! So you also think that I have to check only the residuals and not the data directly. Just out of curiosity I ran the Shapiro test on the residuals. The problem is that for fit3 the test does not say that the data are normally distributed. Why? Here are the results:

shapiro.test(residuals(fit1))
Shapiro-Wilk normality test
data: residuals(fit1)
W = 0.9848, p-value = 0.05693
# Here the test is ok: the test says that the data are distributed normally (p-value greater than 0.05)

shapiro.test(residuals(fit2))
Shapiro-Wilk normality test
data: residuals(fit2)
W = 0.9853, p-value = 0.06525
# Here the test is ok: the test says that the data are distributed normally (p-value greater than 0.05)

shapiro.test(residuals(fit3))
Shapiro-Wilk normality test
data: residuals(fit3)
W = 0.9621, p-value = 0.0001206

Now the test gives a p-value lower than 0.05, so the residuals for fit3 are not normally distributed. Why do I get this behaviour? In the histogram and Q-Q plot for the fit3 residuals I do get a normal distribution.

From: Robert Baer rb...@atsu.edu Sent: Wed, January 5, 2011 8:56:50 PM Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

Someone suggested to me that I don't have to check the normality of the data, but the normality of the residuals I get after fitting the linear model. I really ask you to help me understand this point, as I can't find enough material online to resolve it.

Try the following:

# using your scrd data and your proposed models
fit1 <- lm(response ~ stimulus + condition + stimulus:condition, data=scrd)
fit2 <- lm(response ~ stimulus + condition, data=scrd)
fit3 <- lm(response ~ condition, data=scrd)
# Set up for 6 plots on 1 panel
op = par(mfrow=c(2,3))
# residuals function extracts residuals
# Visual inspection is a good start for checking normality
# You get a much better feel than from some magic number statistic
hist(residuals(fit1))
hist(residuals(fit2))
hist(residuals(fit3))
# especially qqnorm() plots which are linear for normal data
qqnorm(residuals(fit1))
qqnorm(residuals(fit2))
qqnorm(residuals(fit3))
# Restore plot parameters
par(op)

If the data are not normally distributed I have to use the Kruskal-Wallis test and not the ANOVA... so please help me to understand.

Indeed - Kruskal-Wallis is a good test to use for one-factor data that is ordinal, so it is a good alternative to your fit3. Your response seems to be a discrete variable rather than a continuous variable. You must decide if it is reasonable to approximate it with a normal distribution, which is by definition continuous.

I give a numerical example; could you please tell me if the data in this table are normally distributed or not? Help!
number  stimulus               condition  response
1       flat_550_W_realism     A          3
2       flat_550_W_realism     A          3
3       flat_550_W_realism     A          5
4       flat_550_W_realism     A          3
5       flat_550_W_realism     A          3
6       flat_550_W_realism     A          3
7       flat_550_W_realism     A          3
8       flat_550_W_realism     A          5
9       flat_550_W_realism     A          3
10      flat_550_W_realism     A          3
11      flat_550_W_realism     A          5
12      flat_550_W_realism     A          7
13      flat_550_W_realism     A          5
14      flat_550_W_realism     A          2
15      flat_550_W_realism     A          3
16      flat_550_W_realism     AH         7
17      flat_550_W_realism     AH         4
18      flat_550_W_realism     AH         5
19      flat_550_W_realism     AH         3
20      flat_550_W_realism     AH         6
21      flat_550_W_realism     AH         5
22      flat_550_W_realism     AH         3
23      flat_550_W_realism     AH         5
24      flat_550_W_realism     AH         5
25      flat_550_W_realism     AH         7
26      flat_550_W_realism     AH         2
27      flat_550_W_realism     AH         7
28      flat_550_W_realism     AH         5
29      flat_550_W_realism     AH         5
30      bump_2_step_W_realism  A          1
31      bump_2_step_W_realism  A          3
32      bump_2_step_W_realism  A          5
33      bump_2_step_W_realism  A          1
34      bump_2_step_W_realism  A          3
35      bump_2_step_W_realism  A          2
36      bump_2_step_W_realism  A          5
37      bump_2_step_W_realism  A          4
38      bump_2_step_W_realism  A          4
39      bump_2_step_W_realism  A          4
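As a small worked example of the one-factor Kruskal-Wallis suggestion, the sketch below uses only rows 1-29 of the table above (stimulus flat_550_W_realism, conditions A and AH), typed in by hand. It treats the two conditions as independent groups, which is an assumption made for the illustration: the experiment is described as repeated measures, and this test ignores any pairing by subject.

```r
# Rows 1-29 of the table above (stimulus flat_550_W_realism)
a  <- c(3, 3, 5, 3, 3, 3, 3, 5, 3, 3, 5, 7, 5, 2, 3)  # condition A
ah <- c(7, 4, 5, 3, 6, 5, 3, 5, 5, 7, 2, 7, 5, 5)     # condition AH

d <- data.frame(
  response  = c(a, ah),
  condition = factor(rep(c("A", "AH"), times = c(length(a), length(ah))))
)

# The responses take only a handful of integer values, so a table is
# often more informative than a histogram
print(table(d$condition, d$response))

# One-factor rank-based test; it treats the groups as independent
kt <- kruskal.test(response ~ condition, data = d)
print(kt)
```

The cross-tabulation makes the discreteness of the response obvious, which is the point made above about approximating it with a continuous distribution.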
Re: [R] Assumptions for ANOVA: the right way to check the normality
Remember that a non-significant result (especially one that is still near alpha like yours) does not give evidence that the null is true. The reason that the first 2 tests below don't show significance is more due to lack of power than to some of the residuals being normal. The only test that I would trust for this is SnowsPenultimateNormalityTest (TeachingDemos package; the help page is more useful than the function itself).

But I think that you are mixing up 2 different concepts (a very common misunderstanding). What is important if we want to do normal theory inference is that the coefficients/effects/estimates are normally distributed. Now since these coefficients can be shown to be linear combinations of the error terms, if the errors are iid normal then the coefficients are also normally distributed. So many people want to show that the residuals come from a perfectly normal distribution. But it is the theoretical errors, not the observed residuals, that are important (the observed residuals are not iid). You need to think about the source of your data to see if this is a reasonable assumption. Now I cannot fathom any universe (theoretical or real) in which normally distributed errors, added to means that they are independent of, will result in a finite set of integers, so an assumption of exact normality is not reasonable (some may want to argue this, but convincing me will be very difficult). But looking for exact normality is a bit of a red herring, because we also have the Central Limit Theorem, which says that if the errors are not normal (but still iid) then the distribution of the coefficients will approach normality as the sample size increases. This is what makes statistics doable (because no real dataset entered into the computer is exactly normal). The more important question is "are the residuals normal enough?", for which there is no definitive test (experience and plots help).

But this all depends on another assumption that I don't think you have even considered. Yes, we can use normal theory even when the random part of the data is not normally distributed, but this still assumes that the data is at least interval data, i.e. that we firmly believe that the difference between a response of 1 and a response of 2 is exactly the same as the difference between a 6 and a 7, and that the difference from 4 to 6 is exactly twice that of 1 vs. 2. From your data and other descriptions, I don't think that is a reasonable assumption. If you are not willing to make that assumption (like me), then means and normal theory tests are meaningless and you should use other approaches. One possibility is to use non-parametric methods (which I believe Frank has already suggested you use); another is to use proportional odds logistic regression.

-- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.s...@imail.org 801.408.8111

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Frodo Jedi Sent: Wednesday, January 05, 2011 3:22 PM To: Robert Baer; r-help@r-project.org Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

[snip]
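The Central Limit Theorem argument in this message can be seen in a quick simulation. The sketch below draws integer responses on a 1-7 scale from a made-up, clearly non-normal distribution (the probabilities and the group size of 15 are assumptions chosen for illustration) and looks at how the sample mean behaves over many repetitions.

```r
set.seed(42)

# Hypothetical, lumpy distribution over the seven response categories
probs <- c(0.05, 0.10, 0.35, 0.10, 0.25, 0.05, 0.10)
n <- 15  # observations per sample, an assumed group size

# Sampling distribution of the mean over 5000 simulated samples
means <- replicate(5000, mean(sample(1:7, n, replace = TRUE, prob = probs)))

op <- par(mfrow = c(1, 2))
hist(means, breaks = 30, main = "Means of n = 15 discrete responses")
qqnorm(means)
qqline(means)
par(op)
```

The individual responses can never be normal, but the estimate built from them can be close to it. This says nothing about whether treating the scale as interval data is defensible, which is the separate assumption raised above.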
Re: [R] Assumptions for ANOVA: the right way to check the normality
Ok, I see ;-)

Let's put it this way then. When do I have to use the Kruskal-Wallis test? I mean, when can I be sure that I have to use it instead of ANOVA?

Thanks

Best regards

P.S. In addition, which non-parametric method corresponds to a two-way anova? Or do I have to repeat the Kruskal-Wallis test many times?

From: Greg Snow greg.s...@imail.org r-help@r-project.org r-help@r-project.org Sent: Thu, January 6, 2011 7:07:17 PM Subject: RE: [R] Assumptions for ANOVA: the right way to check the normality

[snip]
Re: [R] Assumptions for ANOVA: the right way to check the normality
Some would argue to always use the Kruskal-Wallis test since we never know for sure if we have normality. Personally I am not sure that I understand what exactly that test is really testing. Plus, in your case you are doing a two-way anova and kruskal.test does one-way, so it will not work for your case. There are other non-parametric options.

Whether to use anova and other normality-based tests is really a matter of what assumptions you are willing to live with and what level of "close enough" you are comfortable with. Consulting with a local consultant with experience in these areas is useful if you don't have enough experience to decide what you are comfortable with.

For your description, I would try the proportional odds logistic regression, but again, you should probably consult with someone who has experience rather than trying that on your own until you have more training and experience.

-- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.s...@imail.org 801.408.8111

From: Frodo Jedi [mailto:frodo.j...@yahoo.com] Sent: Thursday, January 06, 2011 12:57 PM To: Greg Snow; r-help@r-project.org Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

[snip]
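For readers who have not met proportional odds logistic regression, here is a minimal sketch using polr() from the MASS package. The counts below are invented for illustration, the model has a single condition factor, and it ignores the repeated-measures structure of the real experiment, so treat it as a syntax example rather than as the analysis recommended for these data.

```r
library(MASS)  # provides polr()

# Hypothetical counts of ratings 1..7 in each condition (30 per condition)
resp_a  <- rep(1:7, times = c(2, 4, 9, 6, 5, 2, 2))
resp_ah <- rep(1:7, times = c(1, 2, 4, 6, 8, 5, 4))

d <- data.frame(
  condition = factor(rep(c("A", "AH"), each = 30)),
  response  = factor(c(resp_a, resp_ah), levels = 1:7, ordered = TRUE)
)

# The response must be an ordered factor; Hess = TRUE keeps the Hessian
# so that summary() can report standard errors
fit <- polr(response ~ condition, data = d, Hess = TRUE)
print(summary(fit))
```

The model only uses the ordering of the response categories, so it does not need the assumption that the steps between ratings are equally spaced.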
Re: [R] Assumptions for ANOVA: the right way to check the normality
On 06/01/2011 20:29, Greg Snow wrote: Some would argue to always use the kruskal wallis test since we never know for sure if we have normality. Personally I am not sure that I understand what exactly that test is really testing. Plus in your case you are doing a two-way anova and kruskal.test does one-way, so it will not work for your case. There are other non-parametric options. Just read this and had queries of my own and comments on this subject: Would one of these options be to rank the data before doing whatever model or test you want to do? As I understand it makes the place of the data the same, but pulls extreme cases closer to the rest. Not an expert though. I've been doing lm() for my work, and I don't know if that makes an assumption of normality (may data is not normal). And I'm unsure of any other assumptions as my texts don't really discuss them. Although I can comfortably evaluate a model say using residual vs fitted, and F values turned to P, resampling and confidence intervals, and looking at sums of squares terms add to explanation of the model. I've tried the plot() function to help graphically evaluate a model, and I want to make sure I understand what it's showing me. I think the first, is showing me the models fitted values vs the residuals, and ideally, I think the closer the points are to the red line the better. The next plot is a Q-Q plot, the closer the points to the line, the more normal the model coefficients (or perhaps the data). I'm not sure what the next two plots are, but it is titled Scale-Location. And it looks to have the square root of standardized residuals on y, and fitted model values on x. Might this be similar to the first plot? The final one is titled Residuals vs Leverage, which has standardized residuals on y and leverage on x, and something called Cooks Distance is plotted as well. Thanks, Ben. 
Whether to use anova and other normality based tests is really a matter of what assumptions you are willing to live with and what level of "close enough" you are comfortable with. Consulting with a local consultant with experience in these areas is useful if you don't have enough experience to decide what you are comfortable with. For your description, I would try the proportional odds logistic regression, but again, you should probably consult with someone who has experience rather than trying that on your own until you have more training and experience.

-- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.s...@imail.org 801.408.8111

From: Frodo Jedi [mailto:frodo.j...@yahoo.com]
Sent: Thursday, January 06, 2011 12:57 PM
To: Greg Snow; r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

Ok, I see ;-) Let's put it this way then. When do I have to use the Kruskal-Wallis test? I mean, when can I be very sure that I have to use it instead of ANOVA?

Thanks
Best regards

P.S. In addition, which non-parametric method corresponds to a 2-way anova? Or do I have to repeat the Kruskal-Wallis test many times?

From: Greg Snow greg.s...@imail.org
To: Frodo Jedi frodo.j...@yahoo.com; Robert Baer rb...@atsu.edu; r-help@r-project.org
Sent: Thu, January 6, 2011 7:07:17 PM
Subject: RE: [R] Assumptions for ANOVA: the right way to check the normality

Remember that a non-significant result (especially one that is still near alpha like yours) does not give evidence that the null is true. The reason that the first 2 tests below don't show significance is more due to lack of power than to the residuals being normal. The only test that I would trust for this is SnowsPenultimateNormalityTest (TeachingDemos package, the help page is more useful than the function itself). But I think that you are mixing up 2 different concepts (a very common misunderstanding).
What is important if we want to do normal theory inference is that the coefficients/effects/estimates are normally distributed. Now since these coefficients can be shown to be linear combinations of the error terms, if the errors are iid normal then the coefficients are also normally distributed. So many people want to show that the residuals come from a perfectly normal distribution. But it is the theoretical errors, not the observed residuals that are important (the observed residuals are not iid). You need to think about the source of your data to see if this is a reasonable assumption. Now I cannot fathom any universe (theoretical or real) in which normally distributed errors added to means that they are independent of will result in a finite set of integers, so an assumption of exact normality is not reasonable (some may want to argue this, but convincing me will be very difficult). But looking for exact normality is a bit of a red herring because, we also have the
Re: [R] Assumptions for ANOVA: the right way to check the normality
Thanks a lot Greg, you have been very helpful.

All the best

From: Greg Snow greg.s...@imail.org r-help@r-project.org
Sent: Thu, January 6, 2011 9:29:36 PM
Subject: RE: [R] Assumptions for ANOVA: the right way to check the normality

Some would argue to always use the Kruskal-Wallis test since we never know for sure if we have normality. Personally I am not sure that I understand what exactly that test is really testing. Plus in your case you are doing a two-way anova and kruskal.test does one-way, so it will not work for your case. There are other non-parametric options.

Whether to use anova and other normality based tests is really a matter of what assumptions you are willing to live with and what level of "close enough" you are comfortable with. Consulting with a local consultant with experience in these areas is useful if you don't have enough experience to decide what you are comfortable with. For your description, I would try the proportional odds logistic regression, but again, you should probably consult with someone who has experience rather than trying that on your own until you have more training and experience.

-- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.s...@imail.org 801.408.8111

Sent: Thursday, January 06, 2011 12:57 PM
To: Greg Snow; r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

Ok, I see ;-) Let's put it this way then. When do I have to use the Kruskal-Wallis test? I mean, when can I be very sure that I have to use it instead of ANOVA?

Thanks
Best regards

P.S. In addition, which non-parametric method corresponds to a 2-way anova? Or do I have to repeat the Kruskal-Wallis test many times?
From: Greg Snow greg.s...@imail.org r-help@r-project.org r-help@r-project.org
Sent: Thu, January 6, 2011 7:07:17 PM
Subject: RE: [R] Assumptions for ANOVA: the right way to check the normality

Remember that a non-significant result (especially one that is still near alpha like yours) does not give evidence that the null is true. The reason that the first 2 tests below don't show significance is more due to lack of power than to the residuals being normal. The only test that I would trust for this is SnowsPenultimateNormalityTest (TeachingDemos package, the help page is more useful than the function itself).

But I think that you are mixing up 2 different concepts (a very common misunderstanding). What is important if we want to do normal theory inference is that the coefficients/effects/estimates are normally distributed. Now since these coefficients can be shown to be linear combinations of the error terms, if the errors are iid normal then the coefficients are also normally distributed. So many people want to show that the residuals come from a perfectly normal distribution. But it is the theoretical errors, not the observed residuals, that are important (the observed residuals are not iid). You need to think about the source of your data to see if this is a reasonable assumption. Now I cannot fathom any universe (theoretical or real) in which normally distributed errors added to means that they are independent of will result in a finite set of integers, so an assumption of exact normality is not reasonable (some may want to argue this, but convincing me will be very difficult).

But looking for exact normality is a bit of a red herring because we also have the Central Limit Theorem, which says that if the errors are not normal (but still iid) then the distribution of the coefficients will approach normality as the sample size increases. This is what makes statistics doable (because no real dataset entered into the computer is exactly normal).
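The Central Limit Theorem point can be checked with a small simulation. This is only a sketch under stated assumptions: the two-group design, the group labels, the effect size of 0.5 and the exponential error distribution are all hypothetical choices made for illustration and are not taken from the thread's data. It draws strongly skewed errors, refits lm() many times, and compares the sampling distribution of one coefficient (the intercept, i.e. the estimated mean of group A) at a small and a larger sample size.

```r
set.seed(42)

# Hypothetical setup: two groups of n observations each, with skewed
# (mean-centred exponential) errors instead of normal ones.
sim_intercept <- function(n) {
  g <- factor(rep(c("A", "AH"), each = n))
  y <- ifelse(g == "AH", 0.5, 0) + (rexp(2 * n) - 1)
  unname(coef(lm(y ~ g))[1])   # estimated mean of group A
}

est_small <- replicate(2000, sim_intercept(5))
est_large <- replicate(2000, sim_intercept(50))

# Simple moment-based sample skewness (0 for a symmetric distribution)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Q-Q plots of the simulated estimates: the larger n should look straighter
op <- par(mfrow = c(1, 2))
qqnorm(est_small, main = "n = 5 per group");  qqline(est_small)
qqnorm(est_large, main = "n = 50 per group"); qqline(est_large)
par(op)

print(c(small = skew(est_small), large = skew(est_large)))
```

The errors themselves are equally non-normal in both runs; only the distribution of the estimate changes with n, which is the distinction the message draws between the errors and the coefficients.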
The more important question is: are the residuals "normal enough"? For that there is no definitive test (experience and plots help).

But this all depends on another assumption that I don't think that you have even considered. Yes, we can use normal theory even when the random part of the data is not normally distributed, but this still assumes that the data is at least interval data, i.e. that we firmly believe that the difference between a response of 1 and a response of 2 is exactly the same as the difference between a 6 and a 7, and that the difference from 4 to 6 is exactly twice that of 1 vs. 2. From your data and other descriptions, I don't think that that is a reasonable assumption. If you are not willing to make that assumption (like me) then means and normal theory tests are meaningless and you should use other approaches. One possibility is to use non-parametric methods (which I believe Frank has already suggested you use), another is to use proportional odds logistic regression.

-- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.s...@imail.org
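The two alternatives named above could look roughly like this in R. It is a hedged sketch, not a recommended analysis: it uses only the 29 ratings for the stimulus flat_550_W_realism from the table posted in this thread, it looks at the condition factor alone, and the choice of polr() from the MASS package and of wilcox.test() is an assumption about which functions one might reach for, since the thread names the methods but not the functions.

```r
library(MASS)  # polr() comes with the recommended MASS package

# Ratings for stimulus flat_550_W_realism, copied from the table posted
# in this thread (15 responses under condition A, 14 under AH).
d <- data.frame(
  condition = factor(rep(c("A", "AH"), times = c(15, 14))),
  response  = c(3, 3, 5, 3, 3, 3, 3, 5, 3, 3, 5, 7, 5, 2, 3,
                7, 4, 5, 3, 6, 5, 3, 5, 5, 7, 2, 7, 5, 5)
)

# Treat the rating as ordered categories, not as interval-scale numbers
d$rating <- factor(d$response, ordered = TRUE)

# Proportional odds (cumulative logit) model for the ordered rating
fit_po <- polr(rating ~ condition, data = d, Hess = TRUE)
print(summary(fit_po))

# Rank-based alternative for one two-level factor; exact = FALSE because
# the ratings contain many ties
wt <- wilcox.test(response ~ condition, data = d, exact = FALSE)
print(wt)
```

With the full data, a formula such as rating ~ stimulus + condition would be the proportional odds counterpart of the two-way model discussed in the thread; whether that model suits the design is a question for someone who knows how the ratings were collected.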
Re: [R] Assumptions for ANOVA: the right way to check the normality
Someone suggested to me that I don't have to check the normality of the data, but the normality of the residuals I get after fitting the linear model. I really ask you to help me understand this point, as I can't find enough material online to resolve it.

Try the following:

# using your scrd data and your proposed models
fit1 <- lm(response ~ stimulus + condition + stimulus:condition, data=scrd)
fit2 <- lm(response ~ stimulus + condition, data=scrd)
fit3 <- lm(response ~ condition, data=scrd)

# Set up for 6 plots on 1 panel
op <- par(mfrow=c(2,3))

# residuals function extracts residuals
# Visual inspection is a good start for checking normality
# You get a much better feel than from some magic number statistic
hist(residuals(fit1))
hist(residuals(fit2))
hist(residuals(fit3))

# especially qqnorm() plots which are linear for normal data
qqnorm(residuals(fit1))
qqnorm(residuals(fit2))
qqnorm(residuals(fit3))

# Restore plot parameters
par(op)

If the data are not normally distributed I have to use the Kruskal-Wallis test and not the ANOVA... so please help me to understand.

Indeed, Kruskal-Wallis is a good test to use for one-factor data that is ordinal, so it is a good alternative to your fit3. Your response seems to be a discrete variable rather than a continuous variable. You must decide if it is reasonable to approximate it with a normal distribution, which is by definition continuous.

Here is a numerical example; could you please tell me if the data in this table are normally distributed or not? Help!
number stimulus              condition response
1      flat_550_W_realism    A         3
2      flat_550_W_realism    A         3
3      flat_550_W_realism    A         5
4      flat_550_W_realism    A         3
5      flat_550_W_realism    A         3
6      flat_550_W_realism    A         3
7      flat_550_W_realism    A         3
8      flat_550_W_realism    A         5
9      flat_550_W_realism    A         3
10     flat_550_W_realism    A         3
11     flat_550_W_realism    A         5
12     flat_550_W_realism    A         7
13     flat_550_W_realism    A         5
14     flat_550_W_realism    A         2
15     flat_550_W_realism    A         3
16     flat_550_W_realism    AH        7
17     flat_550_W_realism    AH        4
18     flat_550_W_realism    AH        5
19     flat_550_W_realism    AH        3
20     flat_550_W_realism    AH        6
21     flat_550_W_realism    AH        5
22     flat_550_W_realism    AH        3
23     flat_550_W_realism    AH        5
24     flat_550_W_realism    AH        5
25     flat_550_W_realism    AH        7
26     flat_550_W_realism    AH        2
27     flat_550_W_realism    AH        7
28     flat_550_W_realism    AH        5
29     flat_550_W_realism    AH        5
30     bump_2_step_W_realism A         1
31     bump_2_step_W_realism A         3
32     bump_2_step_W_realism A         5
33     bump_2_step_W_realism A         1
34     bump_2_step_W_realism A         3
35     bump_2_step_W_realism A         2
36     bump_2_step_W_realism A         5
37     bump_2_step_W_realism A         4
38     bump_2_step_W_realism A         4
39     bump_2_step_W_realism A         4
40     bump_2_step_W_realism A         4
41     bump_2_step_W_realism AH        3
42     bump_2_step_W_realism AH        5
43     bump_2_step_W_realism AH        1
44     bump_2_step_W_realism AH        5
45     bump_2_step_W_realism AH        4
46     bump_2_step_W_realism AH        4
47     bump_2_step_W_realism AH        5
48     bump_2_step_W_realism AH        4
49     bump_2_step_W_realism AH        3
50     bump_2_step_W_realism AH        4
51     bump_2_step_W_realism AH        5
52     bump_2_step_W_realism AH        4
53     hole_2_step_W_realism A         3
54     hole_2_step_W_realism A         3
55     hole_2_step_W_realism A         4
56     hole_2_step_W_realism A         1
57     hole_2_step_W_realism A         4
58     hole_2_step_W_realism A         3
59     hole_2_step_W_realism A         5
60     hole_2_step_W_realism A         4
61     hole_2_step_W_realism A         3
62     hole_2_step_W_realism A         4
63     hole_2_step_W_realism A         7
64     hole_2_step_W_realism A         5
65     hole_2_step_W_realism A         1
66