On Thu, 08 Jan 2009 20:04:23 -0800, Karl Wuensch wrote:

> I'm even less conservative than Stephen. I would not apply the
> Bonferroni adjustment. After all, these are PLANNED comparisons, eh?
This is a curious point: why should the state of knowledge (i.e., being able to predict the size of a difference, the direction of a difference, etc.) affect the probability of making an error of inference? Consider t-tests: if we "know" that a difference lies in a particular direction, then using a one-tailed test gives us a lower critical value and possibly makes it easier to reject the null hypothesis that the two means estimate the same population mean. But if our "knowledge" is faulty or incomplete and the difference is in the opposite direction, we either (a) ignore a statistically significant result because it was not in the "right" direction, or (b) say "Oops!" and claim we "planned" on doing a two-tailed test all along.

Situation (a) is the more honest action, though one might say it is also stupid -- we arrogantly asserted our overconfidence in our knowledge of the situation, and to remain consistent we should ignore the statistically significant result in the wrong direction (we might redo the study and use a two-tailed test the second time). In situation (b) the effective alpha level is no longer .05 but .10, because recognizing the statistically significant result in the wrong direction implies that we are not really doing a one-tailed test at .05 but a two-tailed test at .10. Readjusting the alpha after discovering the error may be practical, but it is somewhat questionable from an ethical perspective.

It was for reasons like this that Jack Cohen used to say that one-tailed tests should not be done (how could one distinguish delusional self-confidence from solid knowledge?). To guard against these types of behavior, Jack used to say that a researcher should write out the plan of analysis before the research was done and mail it to someone who would open it after the analyses were done and compare the two. Were any one-tailed tests planned but two-tailed tests used instead? Why?
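The inflation in situation (b) is easy to demonstrate with a small simulation (a minimal sketch of my own, not from the original discussion; the sample size and simulation count are arbitrary). Under a true null, "planning" a one-tailed test but choosing the tail after seeing which way the data fell is equivalent to a two-tailed test at .10:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sims, alpha = 30, 20000, 0.05
rejections = 0
for _ in range(sims):
    a = rng.normal(0, 1, n)   # both groups come from the same population,
    b = rng.normal(0, 1, n)   # so every rejection is a Type I error
    t, p_two = stats.ttest_ind(a, b)
    # "one-tailed" test whose tail is picked after seeing the data:
    # the one-tailed p in the observed direction is p_two / 2
    if p_two / 2 < alpha:
        rejections += 1
print(rejections / sims)  # close to 0.10, not the nominal 0.05
```

Rejecting whenever the directional p-value is below .05 in whichever direction the data happen to point is exactly the .10 two-tailed test described above.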
With respect to post hoc tests, I don't really remember the rationale for Fisher's "protected t-tests" or "Least Significant Difference" (LSD) test or, more generally, the use of alpha = .05 for planned comparisons, other than that as long as there are few such tests, the overall Type I error rate will not be too high. Remember:

alpha(overall) = 1 - [1 - alpha(per comparison)]**C

where ** means exponentiation, C = number of tests, alpha(per comparison) = the alpha level used for a specific test, and alpha(overall) = the probability that one has committed at least one Type I error after doing C tests at that per-comparison alpha. If we fix alpha(per comparison) = .05, then:

alpha(overall) = 0.0975 for C=2
alpha(overall) = 0.1426 for C=3
alpha(overall) = 0.1855 for C=4

and so on. It is easy to see that even after a few tests, the overall Type I error rate is fairly high. One should remember that if post hoc tests are done in the context of an ANOVA, with omnibus F-tests serving as the justification for post hoc testing, the overall Type I error rate is likely to be quite high (Kirk, in his "Experimental Design" text, refers to this as "experimentwise" or "familywise" Type I error). Rand Wilcox, I believe, is one person who recommends dispensing with the two-stage procedure (i.e., a significant ANOVA followed by multiple comparisons) and simply doing the multiple comparisons (e.g., all differences among means, etc.) in a way that keeps the overall Type I error rate down (I believe he provides additional justification for this, but I no longer have his text).

So, should we adjust the alpha level for the number of comparisons/tests we do? Probably, but unless one has fixed alpha(overall) = .05, the alpha(overall) after an analysis is likely to be much higher than .05 (unless we have done only a single test, or a few tests with an adjusted alpha(per comparison)).
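The formula above is a one-liner to check (this assumes the C tests are independent, which is the usual textbook simplification):

```python
def alpha_overall(alpha_pc, c):
    """Familywise Type I error rate for c independent tests,
    each run at per-comparison level alpha_pc."""
    return 1 - (1 - alpha_pc) ** c

for c in (2, 3, 4):
    print(c, round(alpha_overall(0.05, c), 4))
# 2 0.0975
# 3 0.1426
# 4 0.1855
```

Inverting the same formula gives the Bonferroni-style fix: to hold alpha(overall) near .05 across C tests, run each test at roughly .05/C.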
> Not that I really think that "planned" means much -- but I do think
> that downwards adjustments of per comparison alpha have done
> more harm than good.

I don't know about this. Consider the following: the APA's Online Psychological Laboratory (opl.apa.org) has an experiment that replicates the classic Donders reaction time (RT) task: a simple RT condition, a go/no-go RT condition, and a choice RT condition. When I teach the experimental lab, the class participates in this experiment and we analyze the data. Depending upon the number of students in the class (sometimes N < 20, often N around 20) and the variability in performance, the repeated measures ANOVA may be statistically significant, but the three mean RTs are not always different (the three RT means should all differ according to Donders). I ask the class, "Why did we get this pattern of results? Is Donders wrong, or do we lack statistical power to detect real differences because our sample size is too small?" Some students say Donders is wrong, and some (possibly because they remember something about statistical power from their stat course) choose the latter.

I then suggest that we do the following post hoc comparisons: first Fisher's LSD procedure, because it is the most powerful post hoc test we can do [the drawback being an elevated alpha(overall)], and also the Bonferroni-corrected minimum difference, because it sets alpha(overall) = .05 and alpha(per comparison) = .05/3 = .01667 (but it is less powerful). SPSS's GLM procedure allows one to do the repeated measures ANOVA and both of these tests. So we do the analyses and compare the results. Sometimes means differ by the LSD but not by the Bonferroni. This helps to illustrate that statistical power/sample size is an important consideration in detecting real differences. Often, for N = 20 or thereabouts, the three means are not all different, which leaves the question of whether Donders is wrong or we have failed to replicate his results.
However, because the OPL website allows one to download data from other classes that also participated in the experiment, we can increase the sample size to whatever we want. I usually download 100-200 subjects' worth of data and then we repeat the ANOVA and multiple comparisons to see whether things have changed. Of course, with the larger sample size, everything is statistically significant, which supports Donders' theory (though theoretical problems with Donders' method still remain) and shows that the earlier nonsignificant results were due to lack of statistical power/small sample size. Of course, I tell the students to report the ANOVA and just the Bonferroni results -- the LSD is looked down upon because it has the higher alpha(overall). One lesson to be drawn from this: if we have nonsignificant results, can we obtain more data? If not, then when we use a more powerful test, are some results significant that were not significant by the less powerful test? If so, maybe one should collect more data (or use more sophisticated tests).

> The Type I Boogie Man under your bed is really a myth. The Type
> II Boogie Man in your closet is for real. :-)

When I cover this in stats class, with alpha(per comparison) and alpha(overall), I suggest that students think of alpha(overall) in more general terms, namely alpha(lifetime). I tell them to think of alpha(lifetime) as being like a taxi meter: every time one does a statistical test, the meter goes up. If one does a lot of statistical tests in one's life, then the probability approaches 1.00 that one has committed *at least* one Type I error, but one won't know which test(s) it was. Consequently, one should think long and hard about which statistical test(s) one wants to do, because the meter is running.

---
To make changes to your subscription contact:
Bill Southerly ([email protected])
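P.S. The sample-size lesson in the classroom exercise above is easy to quantify with a short simulation (my own sketch; the true difference of 20 ms and difference-score SD of 60 ms are hypothetical numbers, not the OPL data). It estimates the power of a paired t-test at a class-sized N versus a pooled, downloaded N:

```python
import numpy as np
from scipy import stats

def power_paired(n, delta=20, sd_diff=60, alpha=0.05, sims=5000, seed=0):
    """Monte Carlo power of a two-sided paired t-test when the true
    mean difference is `delta` and the difference-score SD is `sd_diff`."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        d = rng.normal(delta, sd_diff, n)        # simulated difference scores
        if stats.ttest_1samp(d, 0).pvalue < alpha:
            hits += 1
    return hits / sims

print(power_paired(20))    # modest power at a class-sized N
print(power_paired(150))   # near-certain detection with pooled data
```

The same real effect that a class of 20 misses most of the time is detected almost every time once 100-200 subjects' worth of data are pooled, which is exactly the pattern the exercise produces.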
