On Thu, 08 Jan 2009 20:04:23 -0800, Karl Wuensch wrote:
>        I'm even less conservative than Stephen.  I would not apply the
>Bonferroni adjustment.  After all, these are PLANNED comparisons, eh?

This is a curious point:

Why should the state of knowledge (i.e., able to predict the size
of difference, the direction of a difference, etc.) affect the probability
of making an error of inference?

Consider t-tests: if we "know" that a difference is in a particular
direction, then using a one-tailed test will give us a lower critical
value and possibly make it easier to reject the null hypothesis that
the two means estimate the same population mean.  But if our
"knowledge" is faulty or incomplete and the difference is in the
opposite direction, we either (a) ignore a statistically significant
result because it was not in the "right" direction, or (b) say "Ooops!"
and claim we "planned" on doing a two-tailed test all along.
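For what it is worth, here is a small Python sketch of the trap
(hypothetical, made-up data; scipy.stats).  The one-tailed test is blind
to a real difference that runs opposite to the prediction:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=10, size=30)  # we "know" A > B ...
group_b = rng.normal(loc=57, scale=10, size=30)  # ... but in truth B > A

# one-tailed test in the predicted (wrong) direction vs. a two-tailed test
t, p_one = stats.ttest_ind(group_a, group_b, alternative='greater')
t, p_two = stats.ttest_ind(group_a, group_b, alternative='two-sided')
print("one-tailed (predicted direction): p =", round(p_one, 3))
print("two-tailed:                       p =", round(p_two, 3))
# The one-tailed p is near 1.0 -- the test cannot "see" the reversed
# difference -- while the two-tailed p may well fall below .05.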

Situation (a) is the more honest action, though one might
say it is also stupid -- we arrogantly asserted our overconfidence
in our knowledge of the situation, and in order to maintain consistency
we should ignore the statistically significant result in the wrong
direction (we might do the study over and use a two-tailed test
the second time).

In situation (b) the effective alpha level is no longer .05 but
.10, because recognizing the statistically significant result
in the wrong direction implies that we are not really doing a one-tailed
test at .05 but a two-tailed test at .10.  Readjusting the alpha
after discovery of the error may be practical but is somewhat
questionable from an ethical perspective.  It was for reasons
like this that Jack Cohen used to say that one-tailed tests should
not be done (how could one distinguish delusional self-confidence
from solid knowledge?).  To guard against these types of behavior
Jack used to say that a researcher should write out the plan of
analysis before the research was done and mail it to someone who
would open it after the analyses were done and compare the two.
Were any one-tailed tests planned but two-tailed used instead?
Why?

With respect to post hoc tests, I don't really remember what
the rationale is for Fisher's "protected t-tests" or "Least
Significant Difference" (LSD) test or, more generally, the
use of alpha= .05 for planned comparisons, beyond the
fact that as long as there are few such tests, the overall Type I
error rate will not be too high.  Remember,

alpha(overall) = 1 - [1-alpha(per comparison)]**C
where ** means exponentiation
C= # of tests
alpha(per comparison)= alpha level used for a specific test
alpha(overall)=probability that one has committed at least one Type I
error after doing C tests, each at the per comparison alpha.

If we fix alpha(per comparison)= .05, then
alpha(overall)= 0.0975 for C=2
alpha(overall)= 0.1426 for C=3
alpha(overall)= 0.1855 for C=4
and so on.
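If one wants to check these numbers, a couple of lines of Python will do
it (nothing beyond the formula above is assumed):

alpha_pc = 0.05
for C in (2, 3, 4, 10, 20):
    alpha_overall = 1 - (1 - alpha_pc) ** C
    print(C, round(alpha_overall, 4))
# prints 0.0975, 0.1426, 0.1855, then 0.4013 for C=10 and 0.6415 for C=20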

I think that it is easy to see that even after a few tests, the overall
Type I error rate is fairly high.  One should remember that if post
hoc tests are done in the context of an ANOVA, with omnibus
F-tests serving as the justification for post hoc testing, then the
overall Type I error rate is likely to be quite high (Kirk in his
"Experimental Design" text refers to this as "experimentwise
Type I error" or "familywise Type I error").  I believe it was
Rand Wilcox who is one person that recommends dispensing 
with the two-stage procedure (i.e., significant ANOVA followed 
by multiple comparisons) and simply doing multiple comparisons
(e.g., all difference among means, etc.) which would keep the
overall Type I error rate down (I believe that he provides
additional justification for this but I don't have his text anymore).

So, should we adjust the alpha level for the number of
comparisons/tests we do?  Probably, but unless one has fixed
alpha(overall) = .05, the alpha(overall) after
an analysis is likely to be much higher than .05 (unless we
have done only a single test, or a few tests with adjusted
alpha(per comparison)).
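Going the other way -- fixing alpha(overall) at .05 and solving for the
per comparison alpha -- is also a two-liner.  A sketch, using the
Bonferroni division and the exact (Sidak) inversion of the formula above:

alpha_overall = 0.05
for C in (3, 6, 10):
    bonferroni = alpha_overall / C
    sidak = 1 - (1 - alpha_overall) ** (1 / C)
    print(C, round(bonferroni, 5), round(sidak, 5))
# e.g., C=3 gives .01667 (Bonferroni) vs. .01695 (Sidak)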

>Not that I really think that "planned" means much -- but I do think 
>that downwards adjustments of per comparison alpha have done 
>more harm than good.  

I don't know about this.  Consider the following:  the APA's
Online Psychological Laboratory (opl.apa.org) has an experiment
that replicates the classic Donders reaction time (RT) task:
a simple RT condition, a go/no-go RT condition, and a choice RT
condition.  When I teach experimental lab I have the class participate
in this experiment and we analyze the data.  Depending upon
the number of students in the class (sometimes N<20, often N
around 20) and the variability in performance, the repeated
measures ANOVA may be statistically significant but the three
mean RTs are not always different (the three RT means should
be different according to Donders).  I ask the class "Why did
we get this pattern of results?  Is Donders wrong or do we
lack statistical power in detecting real differences because our
sample size is too small?"

Some students say Donders is wrong and some (possibly because
they remember something about statistical power from their
stat course) choose the latter.  I suggest that we do the following:
post hoc comparisons, first using Fisher's LSD
procedure, because it is the most powerful post hoc test we
can do [drawback is elevated alpha(overall)], as well as the
Bonferroni corrected minimum difference, because it sets
alpha(overall) = .05 and alpha(per comparison)= .05/3=.01667
(but it is less powerful).  SPSS's GLM procedure allows one to
do the repeated measures ANOVA and both of these tests.
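For those without SPSS, here is a rough Python stand-in (hypothetical RT
data in milliseconds; scipy.stats).  "LSD" here is simply the uncorrected
paired t-tests that follow a significant omnibus test, and "Bonferroni"
compares the same p-values against .05/3:

import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(7)
n = 20
subject = rng.normal(0, 30, size=n)           # stable individual differences
rt = {
    "simple":   300 + subject + rng.normal(0, 40, size=n),
    "go/no-go": 350 + subject + rng.normal(0, 40, size=n),
    "choice":   400 + subject + rng.normal(0, 40, size=n),
}

n_pairs = 3                                   # comparisons among 3 conditions
for (name1, x1), (name2, x2) in combinations(rt.items(), 2):
    t, p = stats.ttest_rel(x1, x2)            # paired t-test
    lsd = "sig" if p < 0.05 else "ns"
    bonf = "sig" if p < 0.05 / n_pairs else "ns"
    print(f"{name1} vs {name2}: p={p:.4f}  LSD: {lsd}  Bonferroni: {bonf}")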

So we do the analyses and compare the results.  Sometimes
means are different for the LSD but not the Bonferroni.
This helps to illustrate that statistical power/sample size is
an important consideration in detecting real differences.
Often, for N=20 or thereabouts, not all three means are different,
which leaves the question of whether Donders is wrong or whether
we have failed to replicate his results.  However, because 
the OPL website allows one to download data from other
classes that also participated in the experiment, we can increase
the sample size to whatever we want.  I usually download
N= 100-200 subjects' worth of data and then we repeat
the ANOVA and multiple comparisons to see whether things
have changed.  Of course, with the larger sample size,
everything is statistically significant, which supports Donders'
theory (though theoretical problems with Donders' method
still remain) and suggests that the earlier nonsignificant results
were due to lack of statistical power/small sample size.  Of course,
I tell the students to report the ANOVA and just the
Bonferroni results -- the LSD is looked down upon because
it has the higher alpha(overall).

One lesson to be drawn from this: if we have nonsignificant
results, can we obtain more data?  If not, then when we use a
more powerful test, are some results significant that are not
significant by the less powerful test?  If so, maybe one should
collect more data (or use more sophisticated tests).

>The Type I Boogie Man under your bed is really a myth.  The Type
>II Boogie Man in your closet is for real.  :-)

When I cover this in stats class, with alpha(per comparison) and
alpha(overall), I suggest that they think of alpha(overall) in more
general terms, namely alpha(lifetime).  I tell them to think of
alpha(lifetime) as being like a taxi meter and every time one does
a statistical test, the meter goes up.  If one does a lot of statistical
tests in one's life, then the probability approaches 1.00 that one
has committed *at least* one Type I error but one won't know
which test(s) it is.  Consequently, one should think long and hard
about which statistical test(s) one wants to do because the
meter is running.
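The taxi meter is the same formula as before, under the (admittedly
idealized) assumptions that every null hypothesis is true and the tests
are independent:

alpha = 0.05
for n_tests in (1, 10, 50, 100, 500):
    p_at_least_one_type1 = 1 - (1 - alpha) ** n_tests
    print(n_tests, round(p_at_least_one_type1, 3))
# 1 test: .05; 10 tests: .40; 50 tests: .92; 100 tests: .99; 500 tests: ~1.00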





