Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-11 Thread DrorD


Frodo Jedi wrote:
 
 
What is the question you are really trying to find the answer for? Knowing that may help us give more meaningful answers.
 
 Concerning your question, I thought I had been clear.
 
 I want to understand which analysis I should use to determine whether the differences I am observing are statistically significant or not.
 
 

Dear Frodo, 

I would like to suggest that what is required is a concrete specification of a hypothesis.

Something like: 
I hypothesize that the responses to condition A will differ in magnitude from the responses to condition AH, across all stimuli.

Perhaps once you have a detailed formulation of your hypothesis, the required analysis will be clearer to you, or it will at least be easier for experts to guide you.

Best, 
dror


Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-11 Thread Greg Snow
 From: Frodo Jedi [mailto:frodo.j...@yahoo.com] 
 Sent: Monday, January 10, 2011 5:44 PM
 To: Greg Snow
 Cc: r-help@r-project.org
 Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality
 
 Dear Greg,
 first of all thanks for your reply. And many thanks also to all of you guys who are helping me; sorry for the number of questions I have recently posted ;-)
 
 I don't have a solid statistics background (I am not a statistician) and I am basically learning everything by myself.
 So my first goal is TO UNDERSTAND. I need general guidelines because for my PhD I am doing, and will do, several psychophysics experiments.
 I am totally alone in this challenge, so I am asking you guys for help, as I think this is the best place to find the thing I am missing and will never find in any book: experience.

Isn't there a single statistician anywhere in the University?  Does your 
committee have any experience with any of this?

 What is the question you are really trying to find the answer for?  Knowing 
 that may help us give more meaningful answers.
 
 Concerning your question, I thought I had been clear. I want to understand which analysis I should use to determine whether the differences I am observing are statistically significant or not. Now, since all the books I have read say that to apply ANOVA I must respect the assumption of normality, I am trying to find a way to check this.

A general run of anova procedures will produce multiple p-values addressing multiple null hypotheses and many different questions (often many of which are uninteresting). Which terms are you really trying to test, and which are included because you already know that they have an effect?

Are you including interactions because you find them actually interesting? Or 
just because that is what everyone else does?
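(For illustration, here is the kind of multi-p-value output being described; a sketch assuming the scrd data frame and column names used elsewhere in this thread:)

fit <- aov(response ~ stimulus * condition, data = scrd)
summary(fit)   # one F test (and p-value) per term: stimulus, condition,
               # and the stimulus:condition interaction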

[snip]
 
 Also remember that the normality of the data/residuals/etc. is not as important as the CLT for your sample size. The main things that make the CLT not work (for samples that are not large enough) are outliers and strong skewness; since your outcome is limited to the numbers 1-7, I don't see outliers or skewness being a real problem. So you are probably fine for fixed-effects style models (though checking with experts in your area or doing simulations can support/counter this).
 
 As far as I have seen, everyone in my field does ANOVA.

[imagine best Mom voice] and if everyone in your field jumped off a cliff . . .

Do you want to do what everyone else is doing, or something new and different?

What does your committee chair say about this?

 But when you add in random effects there is a lot of uncertainty about whether the normal theory still holds; the latest lme code uses MCMC sampling rather than depending on normal theory and is still being developed.
 
 By random effects, do you mean repeated measures? So why did statisticians develop repeated-measures ANOVA if there is so much uncertainty?

Repeated measures are one type of random-effects analysis, but random and mixed effects models are more general than just repeated measures.

Statisticians developed those methods because they worked for simple cases, made some sense for more complicated cases, and they did not have anything that was both better and practical. Now with modern computers we can see when those methods do work (unfortunately not as often as had been hoped), and what was once impractical is now much simpler (but the inertia is to do it the old way, even though the people who developed the old way would have preferred to do it our way). The article:

Why Permutation Tests Are Superior to t and F Tests in Biomedical Research
John Ludbrook and Hugh Dudley
The American Statistician, Vol. 52, No. 2 (May, 1998), pp. 127-132

may be enlightening here (and give possible alternatives).

Also see: 
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q1/001819.html

for some simulations involving mixed models. One shows that the normal theory works fine for that particular case; the next shows a case where the normal theory does not work, and then demonstrates how to use simulation (a parametric bootstrap) to get a more appropriate p-value. You can adapt those examples to your own situation.
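(For concreteness, a parametric bootstrap along those lines might look like the sketch below. This assumes the current lme4 package, while the linked post used the older lme code, and it assumes the scrd data and model terms discussed in this thread; m0/m1 are illustrative names.)

library(lme4)

# null and alternative models, fitted by ML so the LR statistic is comparable
m0 <- lmer(response ~ condition + (1 | subject), data = scrd, REML = FALSE)
m1 <- lmer(response ~ stimulus * condition + (1 | subject), data = scrd, REML = FALSE)
obs <- anova(m0, m1)$Chisq[2]      # observed likelihood-ratio statistic

nsim <- 1000
sim <- numeric(nsim)
for (i in seq_len(nsim)) {
  y <- simulate(m0)[[1]]           # simulate a response under the null model
  sim[i] <- anova(refit(m0, y), refit(m1, y))$Chisq[2]
}
mean(sim >= obs)                   # parametric-bootstrap p-value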

  
 This now comes back to my first question: what are you trying to find out?
 
 My ultimate goal is to find the p-values in order to understand whether my results are significant or not. So I can write them in the paper ;-)

There is a function in the TeachingDemos package that will produce p-values if that is all you want; these are independent of any normality assumptions, independent of any data in fact. However, they don't really help with understanding.

Graphing the data (I think you have done this already) is the best route to 
understanding.  If you need more than that, then consider the following article:

 Buja, A., Cook, D. 

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-11 Thread Frodo Jedi
Many many thanks for your feedback, Greg. You have been very enlightening for me.

Now it is time for me to study the material you kindly provided. Thanks.

From: Greg Snow greg.s...@imail.org
Sent: Tue, January 11, 2011 10:13:34 PM
Subject: RE: [R] Assumptions for ANOVA: the right way to check the normality

[snip: Greg's message of January 11, quoted in full above]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-10 Thread Greg Snow
What is the question you are really trying to find the answer for?  Knowing 
that may help us give more meaningful answers.

You keep wanting to test the residuals for normality, but it looks like you are doing it because some outdated recipe suggests it, rather than because you understand why.

It is fairly easy to create a distribution that is definitely not normal, that gives the wrong answer most of the time if normality is assumed, yet will pass most normality tests most of the time (well, except for SnowsPenultimateNormalityTest, but that one has an unfair advantage in this situation). So just because the residuals look normal (or close enough) does not mean that the theory holds.
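(For illustration only; this is not Greg's actual construction, just a quick demonstration that a clearly non-normal distribution usually passes a normality test at modest sample sizes:)

# data from a t distribution with 5 df are definitely not normal, yet
# Shapiro-Wilk rejects only a small fraction of the time when n = 25
set.seed(1)
rejected <- replicate(1000, shapiro.test(rt(25, df = 5))$p.value < 0.05)
mean(rejected)   # typically well below 0.5, i.e. the test usually "passes"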

R. A. Fisher is said to have said that the quality of a statistician can be judged by the amount of rat droppings under his fingernails. Now if we take that literally, then I must not be very good. But what he meant is that a statistician must understand the source of the data, not just take a file and put it through some canned routines. So these questions are really for you or the source of your data.

Also remember that the normality of the data/residuals/etc. is not as important as the CLT for your sample size. The main things that make the CLT not work (for samples that are not large enough) are outliers and strong skewness; since your outcome is limited to the numbers 1-7, I don't see outliers or skewness being a real problem. So you are probably fine for fixed-effects style models (though checking with experts in your area or doing simulations can support/counter this). But when you add in random effects there is a lot of uncertainty about whether the normal theory still holds; the latest lme code uses MCMC sampling rather than depending on normal theory and is still being developed.
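(A quick simulation in the spirit of "is the CLT good enough here"; the response probabilities below are made up purely for illustration:)

# sample means of a bounded 1-7 response are already close to normal
# for modest n, even though the responses themselves are integers
set.seed(7)
xbar <- replicate(5000, mean(sample(1:7, size = 15, replace = TRUE,
                                    prob = c(.05, .1, .2, .25, .2, .1, .1))))
qqnorm(xbar); qqline(xbar)   # a nearly straight line -> approximate normality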

This now comes back to my first question: what are you trying to find out?

You may not need to do anova or that type of model. Some simple hypotheses may be answered using McNemar's test on your data. If you want to do predictions then linear models will be meaningless (what would a prediction of -3.2, 4.493, or 8.1 mean on a 7-point Likert scale?) and something like proportional odds logistic regression will be much more meaningful. Between those are bootstrap and permutation methods that may answer your question without any normality assumptions.
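(A hedged sketch of the proportional odds suggestion, using MASS::polr; the key step is treating the 7-point response as an ordered factor. Column names follow the scrd data described elsewhere in the thread.)

library(MASS)
scrd$response.ord <- ordered(scrd$response)   # 1 < 2 < ... < 7
pom <- polr(response.ord ~ stimulus * condition, data = scrd, Hess = TRUE)
summary(pom)   # effects on the log-odds scale; compare nested fits for tests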

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

From: Frodo Jedi [mailto:frodo.j...@yahoo.com]
Sent: Saturday, January 08, 2011 3:20 AM
To: Greg Snow
Cc: r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

[snip: Frodo's message of January 8, reproduced in full below]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-10 Thread Mike Marchywka

I can't get hotmail to indicate the original text so I'm going to top post. There seems to be a lot of back and forth here; let me see if these comments help guide the discussion a bit.

I tried to run some histograms of your experiment (prior to a bunch of other things), and IIRC in many cases you have counts under 10. At minimum, for anything you do or any test you run, you want to do some sensitivity analyses and perturb your data a bit. Your objective of course is important: say you want to calibrate your response data and try to validate your assumption that your survey questions reflect some continuous variable (but a respondent can only round his response to an int, as in the case of taking a temperature, for example; otherwise all you can really say is that these things are like ranks, 7 > 6 > 5 etc.). Personally I always avoid non-parametrics (just personal bias), but with small samples and a response that is closer to a rank than to a continuous variable with some meaning, it may make sense.

If you plot histograms of responses versus A and AH, visually they look different; you could try fitting the histograms to various pdfs and see what you get, etc. This is all retro/post-hoc so you may as well explore away.
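(Mike's histogram suggestion in code form; a base-R sketch using the scrd column names discussed in this thread:)

# side-by-side response histograms for the two conditions
op <- par(mfrow = c(1, 2))
hist(scrd$response[scrd$condition == "A"],  breaks = 0:7, main = "A",  xlab = "response")
hist(scrd$response[scrd$condition == "AH"], breaks = 0:7, main = "AH", xlab = "response")
par(op)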


From: greg.s...@imail.org
To: frodo.j...@yahoo.com
Date: Mon, 10 Jan 2011 11:26:05 -0700
CC: r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality


[snip: Greg's message of January 10, reproduced in full above]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-10 Thread Frodo Jedi
Dear Greg,
first of all thanks for your reply. And many thanks also to all of you guys who are helping me; sorry for the number of questions I have recently posted ;-)

I don't have a solid statistics background (I am not a statistician) and I am basically learning everything by myself.

So my first goal is TO UNDERSTAND. I need general guidelines because for my PhD I am doing, and will do, several psychophysics experiments.
I am totally alone in this challenge, so I am asking you guys for help, as I think this is the best place to find the thing I am missing and will never find in any book: experience.

What is the question you are really trying to find the answer for? Knowing that may help us give more meaningful answers.

Concerning your question, I thought I had been clear. I want to understand which analysis I should use to determine whether the differences I am observing are statistically significant or not. Now, since all the books I have read say that to apply ANOVA I must respect the assumption of normality, I am trying to find a way to check this.

 

You keep wanting to test the residuals for normality, but it looks like you are doing it because some outdated recipe suggests it, rather than because you understand why.

Sorry Greg, if I look like this. It is not true; I am understanding everything, more than I show.

 
It is fairly easy to create a distribution that is definitely not normal, that gives the wrong answer most of the time if normality is assumed, yet will pass most normality tests most of the time (well, except for SnowsPenultimateNormalityTest, but that one has an unfair advantage in this situation). So just because the residuals look normal (or close enough) does not mean that the theory holds.

This is the thing that I cannot find in any book, do you understand? If I had kept to a book I would never have understood this.



 
R. A. Fisher is said to have said that the quality of a statistician can be judged by the amount of rat droppings under his fingernails. Now if we take that literally, then I must not be very good. But what he meant is that a statistician must understand the source of the data, not just take a file and put it through some canned routines. So these questions are really for you or the source of your data.

I agree. The source of my data is basically subjects (15 < n < 45) who have to test some devices that I develop, and who fill in a questionnaire (sometimes based on a seven-point Likert scale). Now, I don't see any particular problem in this case. What can the problem with my data be here?


 
Also remember that the normality of the data/residuals/etc. is not as important as the CLT for your sample size. The main things that make the CLT not work (for samples that are not large enough) are outliers and strong skewness; since your outcome is limited to the numbers 1-7, I don't see outliers or skewness being a real problem. So you are probably fine for fixed-effects style models (though checking with experts in your area or doing simulations can support/counter this).

As far as I have seen, everyone in my field does ANOVA.


But when you add in random effects there is a lot of uncertainty about whether the normal theory still holds; the latest lme code uses MCMC sampling rather than depending on normal theory and is still being developed.

By random effects, do you mean repeated measures? So why did statisticians develop repeated-measures ANOVA if there is so much uncertainty?


 
This now comes back to my first question: what are you trying to find out?

My ultimate goal is to find the p-values in order to understand whether my results are significant or not. So I can write them in the paper ;-)


 
You may not need to do anova or that type of model. Some simple hypotheses may be answered using McNemar's test on your data. If you want to do predictions then linear models will be meaningless (what would a prediction of -3.2, 4.493, or 8.1 mean on a 7-point Likert scale?) and something like proportional odds logistic regression will be much more meaningful. Between those are bootstrap and permutation methods that may answer your question without any normality assumptions.

Ok. But is the ANOVA analysis I did so far wrong or not? I think it is quite valid, since the results seem coherent with what one can see looking at the means.

Thanks for sharing your precious experience with me. I think the world becomes better when people help each other.


All the best

 
-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111
 
From: Frodo Jedi [mailto:frodo.j...@yahoo.com]
Sent: Saturday, January 08, 2011 3:20 AM
To: Greg Snow
Cc: r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

[snip: Frodo's message of January 8, reproduced in full below]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-10 Thread Mike Marchywka

(Again top posting since hotmail isn't adding quote markers, and these comments apply to the whole thread anyway.)

I'm not a statistician either, but rather an engineer who has had a chance to use my intro stats/math background to look at some real-life situations. I'm just making comments for conversation, hoping to elicit more specifics from experts willing to talk (but even with these caveats I have noted my posts are getting quite sloppy). There is nothing wrong with wanting to be rigorous, replicate the approach others have taken, and select the best tests. However, if this is a thesis and you want to do meaningful research, you really will have to be more concerned about the results, not just parroting stuff that may or may not help you understand your data. Personally I would suggest a book, not just buying a consultant, and use dead time to play with formulas and think about them; paper and pencil are still worthwhile, and it would help your later work if you had some idea what these numbers may or may not mean. None of this math should be beyond you.
If you run every test that could be of interest and is easy to code in R, plow through the formulas and reconcile apparent agreements and disagreements using algebra, run sensitivity tests, add noise to the data, remove points, etc., and do it all again, then you could start to see what would happen if your data, or the residuals from some fit or some other thing, were not normally distributed. It should pop out of the analysis. You ought to be able to do simple things like getting SSEs without using any stats packages and see if you can step through the stuff.

To give you perspective on what your peers are doing, here is a search on a recent controversy in which you may be interested, before looking for cookbook approaches. I last looked at this a few years ago; maybe it has been settled by now, but these are respected folk using terms like voodoo for stat analysis:

http://www.google.com/#sclient=psy&hl=en&q=vul+fmri+voodoo

maybe this in particular:

http://afni.nimh.nih.gov/pub/dist/doc/misc/voodoo.pdf

and I'm sure you can easily find others.

No one is suggesting you ignore experts and do your own thing (it is easy to mislead yourself if you don't look for sanity checks), just that the field is a bit open and you are unlikely to get a cookbook result for a single test that gives you THE P VALUE.




Date: Mon, 10 Jan 2011 16:43:56 -0800
From: frodo.j...@yahoo.com
To: greg.s...@imail.org
CC: r-help@r-project.org
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality


[snip: Frodo's message of January 10, reproduced in full above]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-08 Thread Frodo Jedi
Dear Greg,
many thanks for your answer. Now I have a problem in understanding how to check normality in the case of ANOVA with repeated measures.
I would need help with a numeric example, as I haven't fully understood how it works with the proj() command, as suggested by another R user on this mailing list.


For example, in attachment you find a .csv table resulting from an experiment; you can access it by means of this command:

scrd <- read.csv(file='/Users/../tables_for_R/table_quality_wood.csv', sep=',', header=T)


The data are from an experiment where participants had to evaluate, on a seven-point Likert scale, the realism of some stimuli, which are presented both in condition A and in condition AH.

I need to perform the ANOVA by means of this command:

aov1 = aov(response ~ stimulus*condition + Error(subject/(stimulus*condition)), data=scrd)


but the problem is that I cannot, as I usually do, plot the qqnorm of the residuals of the fit, because lm does not support the Error term present in aov.
I normally check normality through a plot (or the shapiro.test function). Now could you please show me how you would determine from my data whether they are normally distributed?

Please enlighten me
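(For reference, a sketch of the proj() approach mentioned above; the stratum name used below is an assumption and should be checked against names(pr) for your design:)

pr <- proj(aov1)      # a list of projection matrices, one per error stratum
names(pr)             # see which strata exist for your design
res <- pr[["Within"]][, "Residuals"]   # residuals from the lowest stratum
qqnorm(res); qqline(res)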

Best regards





From: Greg Snow greg.s...@imail.org
To: Ben Ward benjamin.w...@bathspa.org; r-help@r-project.org 
r-help@r-project.org
Sent: Fri, January 7, 2011 7:34:05 PM
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

A lot of this depends on what question you are really trying to answer.  For 
one 
way anova replacing y-values with their ranks essentially transforms the 
distribution to uniform (under the null) and the Central Limit Theorem kicks in 
for the uniform with samples larger than about 5, so the normal approximations 
are pretty good and the theory works, but what are you actually testing?  The 
most meaningful null that is being tested is that all data come from the exact 
same distribution.  So what does it mean when you reject that null?  It means 
that all the groups are not representing the same distribution, but is that 
because the means differ? Or the variances? Or the shapes? It can be any of 
those.  Some point out that if you make certain assumptions such as symmetry or 
shifts of the same distributions, then you can talk about differences in means 
or medians, but usually if I am using non-parametrics it is because I don't 
believe that things are symmetric and the shift idea doesn't fit in my mind.

Some alternatives include bootstrapping or permutation tests, or just 
transforming the data to get something closer to normal.

Now what does replacing by ranks do in 2-way anova where we want to test the 
difference in one factor without making assumptions about whether the other 
factor has an effect or not?  I'm not sure on this one.

I have seen regression on ranks, it basically tests for some level of 
relationship, but regression is usually used for some type of prediction and 
predicting from a rank-rank regression does not seem meaningful to me.

Fitting the regression model does not require normality, it is the tests on the 
coefficients and confidence and prediction intervals that assume normality 
(again the CLT helps for large samples (but not for prediction intervals)).  
Bootstrapping is an option for regression without assuming normality, 
transformations can also help.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111


 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-
 project.org] On Behalf Of Ben Ward
 Sent: Thursday, January 06, 2011 2:00 PM
 To: r-help@r-project.org
 Subject: Re: [R] Assumptions for ANOVA: the right way to check the
 normality

 On 06/01/2011 20:29, Greg Snow wrote:
  Some would argue to always use the kruskal wallis test since we never
 know for sure if we have normality.  Personally I am not sure that I
 understand what exactly that test is really testing.  Plus in your case
 you are doing a two-way anova and kruskal.test does one-way, so it will
 not work for your case.  There are other non-parametric options.
 Just read this and had queries of my own and comments on this subject:
 Would one of these options be to rank the data before doing whatever
 model or test you want to do? As I understand it makes the place of the
 data the same, but pulls extreme cases closer to the rest. Not an
 expert
 though.
 I've been doing lm() for my work, and I don't know if that makes an
 assumption of normality (may data is not normal). And I'm unsure of any
 other assumptions as my texts don't really discuss them. Although I can
 comfortably evaluate a model say using residual vs fitted, and F values
 turned to P, resampling and confidence intervals, and looking at sums
 of
 squares terms add to explanation of the model. I've tried the plot()
 function to help 

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-07 Thread Greg Snow
A lot of this depends on what question you are really trying to answer.  For 
one way anova replacing y-values with their ranks essentially transforms the 
distribution to uniform (under the null) and the Central Limit Theorem kicks in 
for the uniform with samples larger than about 5, so the normal approximations 
are pretty good and the theory works, but what are you actually testing?  The 
most meaningful null that is being tested is that all data come from the exact 
same distribution.  So what does it mean when you reject that null?  It means 
that all the groups are not representing the same distribution, but is that 
because the means differ? Or the variances? Or the shapes? It can be any of 
those.  Some point out that if you make certain assumptions such as symmetry or 
shifts of the same distributions, then you can talk about differences in means 
or medians, but usually if I am using non-parametrics it is because I don't 
believe that things are symmetric and the shift idea doesn't fit in my mind.

Some alternatives include bootstrapping or permutation tests, or just 
transforming the data to get something closer to normal.

Now what does replacing by ranks do in 2-way anova where we want to test the 
difference in one factor without making assumptions about whether the other 
factor has an effect or not?  I'm not sure on this one.

I have seen regression on ranks; it basically tests for some level of relationship, but regression is usually used for some type of prediction, and predicting from a rank-rank regression does not seem meaningful to me.

Fitting the regression model does not require normality; it is the tests on the coefficients and the confidence and prediction intervals that assume normality (again, the CLT helps for large samples, but not for prediction intervals). Bootstrapping is an option for regression without assuming normality; transformations can also help.
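(A minimal case-resampling bootstrap for a regression coefficient, sketched with the scrd data from this thread; the coefficient name "conditionAH" is an assumption based on the A/AH coding:)

set.seed(42)
B <- 2000
coefs <- replicate(B, {
  idx <- sample(nrow(scrd), replace = TRUE)        # resample whole rows
  coef(lm(response ~ condition, data = scrd[idx, ]))["conditionAH"]
})
quantile(coefs, c(0.025, 0.975))   # percentile CI with no normality assumption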

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111


 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-
 project.org] On Behalf Of Ben Ward
 Sent: Thursday, January 06, 2011 2:00 PM
 To: r-help@r-project.org
 Subject: Re: [R] Assumptions for ANOVA: the right way to check the
 normality

 On 06/01/2011 20:29, Greg Snow wrote:
  Some would argue to always use the Kruskal-Wallis test since we never know for sure if we have normality. Personally I am not sure that I understand what exactly that test is really testing. Plus in your case you are doing a two-way anova and kruskal.test does one-way, so it will not work for your case. There are other non-parametric options.
 Just read this and had queries of my own and comments on this subject: would one of these options be to rank the data before doing whatever model or test you want to do? As I understand it, it keeps the place of the data the same, but pulls extreme cases closer to the rest. Not an expert though.
 I've been doing lm() for my work, and I don't know if that makes an assumption of normality (my data is not normal). And I'm unsure of any other assumptions, as my texts don't really discuss them. Although I can comfortably evaluate a model, say using residuals vs fitted, F values turned to P, resampling and confidence intervals, and looking at how sums of squares terms add to the explanation of the model. I've tried the plot() function to help graphically evaluate a model, and I want to make sure I understand what it's showing me. I think the first plot is showing me the model's fitted values vs the residuals, and ideally, I think, the closer the points are to the red line the better. The next plot is a Q-Q plot; the closer the points to the line, the more normal the model coefficients (or perhaps the data). I'm not sure what the next plot is; it is titled Scale-Location, and it looks to have the square root of standardized residuals on y and fitted model values on x. Might this be similar to the first plot? The final one is titled Residuals vs Leverage, which has standardized residuals on y and leverage on x, and something called Cook's Distance is plotted as well.

 Thanks,
 Ben. W
  Whether to use anova and other normality based tests is really a
 matter of what assumptions you are willing to live with and what level
 of close enough you are comfortable with.  Consulting with a local
 consultant with experience in these areas is useful if you don't have
 enough experience to decide what you are comfortable with.
 
  For your description, I would try the proportional odds logistic
 regression, but again, you should probably consult with someone who has
 experience rather than trying that on your own until you have more
 training and experience.
  --
  Gregory (Greg) L. Snow Ph.D.
  Statistical Data Center
  Intermountain Healthcare
  greg.s...@imail.org
  801.408.8111
 
  From: Frodo Jedi [mailto:frodo.j...@yahoo.com]
  Sent: Thursday, 

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-07 Thread Ben Ward
I believe what I'm doing is an ancova, because I have two categorical and one numerical explanatory variables, and a numerical response variable (this is the same experiment as before, the bacteria), and I'm just, at the minute (because I'm only half way through), doing some modelling and seeing what I get with what I currently have. And I'm paying attention to the 95% CI for the different terms of a model, as well as the coefficient, the explanatory power of the term, and the likelihood that the same result could be obtained at random, through the P values derived from F. To be honest I haven't checked much what my data distributions are like and such, because I'm not finished collecting it yet. I mainly mentioned the ranking because it was given considerable mention in one of my texts' sections on hypothesis testing on models.



On 07/01/2011 18:34, Greg Snow wrote:

[snip: Greg's message of January 7, reproduced in full above]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-06 Thread Frodo Jedi
Dear Robert,
thanks so much!!! Now I understand!
So you also think that I have to check only the residuals and not the data directly.
Now, just for curiosity, I did the Shapiro test on the residuals. The problem is that for fit3 the test does not tell me that the data are normally distributed. Why? Here are the results:

> shapiro.test(residuals(fit1))

        Shapiro-Wilk normality test

data:  residuals(fit1)
W = 0.9848, p-value = 0.05693

# Here the test is ok: it says the residuals are normally distributed (p-value greater than 0.05)

> shapiro.test(residuals(fit2))

        Shapiro-Wilk normality test

data:  residuals(fit2)
W = 0.9853, p-value = 0.06525

# Here the test is ok again: p-value greater than 0.05

> shapiro.test(residuals(fit3))

        Shapiro-Wilk normality test

data:  residuals(fit3)
W = 0.9621, p-value = 0.0001206

Now the test gives a p-value lower than 0.05, so the residuals for fit3 are not normally distributed.
Why do I get this behaviour? Indeed, in the histogram and Q-Q plot the fit3 residuals look normally distributed.


From: Robert Baer rb...@atsu.edu

Sent: Wed, January 5, 2011 8:56:50 PM
Subject: Re: [R] Assumptions for ANOVA: the right way to check the normality

 Someone suggested to me that I don't have to check the normality of the data, but the normality of the residuals I get after fitting the linear model. I really ask you to help me understand this point, as I can't find enough material online to resolve it.

Try the following:

# using your scrd data and your proposed models
fit1 <- lm(response ~ stimulus + condition + stimulus:condition, data=scrd)
fit2 <- lm(response ~ stimulus + condition, data=scrd)
fit3 <- lm(response ~ condition, data=scrd)

# Set up for 6 plots on 1 panel
op = par(mfrow=c(2,3))

# the residuals() function extracts the residuals from a fitted model
# Visual inspection is a good start for checking normality
# You get a much better feel than from some magic-number statistic
hist(residuals(fit1))
hist(residuals(fit2))
hist(residuals(fit3))

# especially qqnorm() plots, which are linear for normal data
qqnorm(residuals(fit1))
qqnorm(residuals(fit2))
qqnorm(residuals(fit3))

# Restore plot parameters
par(op)

 
 If the data are not normally distributed I have to use the Kruskal-Wallis test and not ANOVA... so please help me understand.

Indeed, Kruskal-Wallis is a good test to use for one-factor data that is ordinal, so it is a good alternative to your fit3.
Your response seems to be a discrete variable rather than a continuous variable.
You must decide if it is reasonable to approximate it with a normal distribution, which is by definition continuous.

 
 I give a numerical example; could you please tell me if the data in this table are normally distributed or not?
 
 Help!
 
 number   stimulus                condition   response
 1        flat_550_W_realism      A           3
 2        flat_550_W_realism      A           3
 3        flat_550_W_realism      A           5
 4        flat_550_W_realism      A           3
 5        flat_550_W_realism      A           3
 6        flat_550_W_realism      A           3
 7        flat_550_W_realism      A           3
 8        flat_550_W_realism      A           5
 9        flat_550_W_realism      A           3
 10       flat_550_W_realism      A           3
 11       flat_550_W_realism      A           5
 12       flat_550_W_realism      A           7
 13       flat_550_W_realism      A           5
 14       flat_550_W_realism      A           2
 15       flat_550_W_realism      A           3
 16       flat_550_W_realism      AH          7
 17       flat_550_W_realism      AH          4
 18       flat_550_W_realism      AH          5
 19       flat_550_W_realism      AH          3
 20       flat_550_W_realism      AH          6
 21       flat_550_W_realism      AH          5
 22       flat_550_W_realism      AH          3
 23       flat_550_W_realism      AH          5
 24       flat_550_W_realism      AH          5
 25       flat_550_W_realism      AH          7
 26       flat_550_W_realism      AH          2
 27       flat_550_W_realism      AH          7
 28       flat_550_W_realism      AH          5
 29       flat_550_W_realism      AH          5
 30       bump_2_step_W_realism   A           1
 31       bump_2_step_W_realism   A           3
 32       bump_2_step_W_realism   A           5
 33       bump_2_step_W_realism   A           1
 34       bump_2_step_W_realism   A           3
 35       bump_2_step_W_realism   A           2
 36       bump_2_step_W_realism   A           5
 37       bump_2_step_W_realism   A           4
 38       bump_2_step_W_realism   A           4
 39       bump_2_step_W_realism   A           4

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-06 Thread Greg Snow
Remember that a non-significant result (especially one that is still near alpha, like yours) does not give evidence that the null is true. The reason that the first two tests below don't show significance is more due to lack of power than to the residuals being normal. The only test that I would trust for this is SnowsPenultimateNormalityTest (TeachingDemos package; the help page is more useful than the function itself).

But I think that you are mixing up two different concepts (a very common misunderstanding). What is important if we want to do normal theory inference is that the coefficients/effects/estimates are normally distributed. Now since these coefficients can be shown to be linear combinations of the error terms, if the errors are iid normal then the coefficients are also normally distributed. So many people want to show that the residuals come from a perfectly normal distribution. But it is the theoretical errors, not the observed residuals, that are important (the observed residuals are not iid). You need to think about the source of your data to see if this is a reasonable assumption. Now I cannot fathom any universe (theoretical or real) in which normally distributed errors added to means that they are independent of will result in a finite set of integers, so an assumption of exact normality is not reasonable (some may want to argue this, but convincing me will be very difficult). But looking for exact normality is a bit of a red herring, because we also have the Central Limit Theorem, which says that if the errors are not normal (but still iid) then the distribution of the coefficients will approach normality as the sample size increases. This is what makes statistics doable (because no real dataset entered into the computer is exactly normal). The more important question is "are the residuals normal enough?", for which there is not a definitive test (experience and plots help).

But this all depends on another assumption that I don't think you have even considered. Yes, we can use normal theory even when the random part of the data is not normally distributed, but this still assumes that the data is at least interval data, i.e. that we firmly believe that the difference between a response of 1 and a response of 2 is exactly the same as a difference between a 6 and a 7, and that the difference from 4 to 6 is exactly twice that of 1 vs. 2. From your data and other descriptions, I don't think that that is a reasonable assumption. If you are not willing to make that assumption (like me) then means and normal theory tests are meaningless and you should use other approaches. One possibility is to use non-parametric methods (which I believe Frank has already suggested you use); another is to use proportional odds logistic regression.



--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111


 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-
 project.org] On Behalf Of Frodo Jedi
 Sent: Wednesday, January 05, 2011 3:22 PM
 To: Robert Baer; r-help@r-project.org
 Subject: Re: [R] Assumptions for ANOVA: the right way to check the
 normality

 [snip: Frodo's message of January 5, reproduced in full above]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-06 Thread Frodo Jedi


Ok, I see ;-)

Let's put it this way then: when do I have to use the Kruskal-Wallis test? I mean, when can I be very sure that I have to use it instead of ANOVA?

Thanks

Best regards

P.S. In addition, what is the non-parametric method corresponding to a two-way ANOVA? Or do I have to repeat the Kruskal-Wallis test many times?
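(For reference, the one-way Kruskal-Wallis test is a one-liner in R; a sketch using the scrd columns discussed earlier in the thread:)

kruskal.test(response ~ condition, data = scrd)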




From: Greg Snow greg.s...@imail.org

r-help@r-project.org r-help@r-project.org
Sent: Thu, January 6, 2011 7:07:17 PM
Subject: RE: [R] Assumptions for ANOVA: the right way to check the normality

Remember that a non-significant result (especially one that is still near alpha
like yours) does not give evidence that the null is true.  The reason that the
first two tests below don't show significance is more due to lack of power than
to the residuals actually being normal.  The only test that I would trust for
this is SnowsPenultimateNormalityTest (TeachingDemos package, the help page is
more useful than the function itself).
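For example (a sketch, assuming the TeachingDemos package is installed and the
fit3 model from earlier in the thread):

library(TeachingDemos)              # provides SnowsPenultimateNormalityTest
# read the help page first; the test is partly a teaching device
SnowsPenultimateNormalityTest(residuals(fit3))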

But I think that you are mixing up 2 different concepts (a very common 
misunderstanding).  What is important if we want to do normal theory inference 
is that the coefficients/effects/estimates are normally distributed.  Now since 
these coefficients can be shown to be linear combinations of the error terms, if
the errors are iid normal then the coefficients are also normally distributed.
So many people want to show that the residuals come from a perfectly normal
distribution.  But it is the theoretical errors, not the observed residuals that
are important (the observed residuals are not iid).  You need to think about the
source of your data to see if this is a reasonable assumption.  Now I cannot 
fathom any universe (theoretical or real) in which normally distributed errors 
added to means that they are independent of will result in a finite set of
integers, so an assumption of exact normality is not reasonable (some may want
to argue this, but convincing me will be very difficult).  But looking for exact
normality is a bit of a red herring because we also have the Central Limit
Theorem that says that if the errors are not normal (but still iid) then the 
distribution of the coefficients will approach normality as the sample size
increases.  This is what makes statistics doable (because no real dataset
entered into the computer is exactly normal).  The more important question is
"are the residuals normal enough?", for which there is not a definitive test
(experience and plots help).
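That CLT point can be seen in a small simulation (a sketch with made-up uniform
errors, nothing to do with the poster's data):

set.seed(1)
x <- runif(50)
slopes <- replicate(2000, {
  y <- 1 + 2 * x + (runif(50) - 0.5)   # iid errors, clearly non-normal
  coef(lm(y ~ x))[2]                   # slope estimate for this sample
})
qqnorm(slopes); qqline(slopes)         # the estimates are already close to normal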

But this all depends on another assumption that I don't think that you have even
considered.  Yes we can use normal theory even when the random part of the data
is not normally distributed, but this still assumes that the data is at least
interval data, i.e. that we firmly believe that the difference between a
response of 1 and a response of 2 is exactly the same as a difference between a
6 and a 7 and that the difference from 4 to 6 is exactly twice that of 1 vs. 2.
From your data and other descriptions, I don't think that that is a reasonable
assumption.  If you are not willing to make that assumption (like me) then means
and normal theory tests are meaningless and you should use other approaches.  
One possibility is to use non-parametric methods (which I believe Frank has
already suggested you use), another is to use proportional odds logistic 
regression.



--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111


[snip]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-06 Thread Greg Snow
Some would argue to always use the Kruskal-Wallis test since we never know for
sure if we have normality.  Personally I am not sure that I understand what
exactly that test is really testing.  Plus in your case you are doing a two-way
anova and kruskal.test does one-way, so it will not work for your case.  There
are other non-parametric options.

Whether to use anova and other normality based tests is really a matter of what
assumptions you are willing to live with and what level of "close enough" you
are comfortable with.  Consulting with a local consultant with experience in
these areas is useful if you don't have enough experience to decide what you
are comfortable with.

For your description, I would try the proportional odds logistic regression, 
but again, you should probably consult with someone who has experience rather 
than trying that on your own until you have more training and experience.
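For illustration, a sketch of what does and does not work here (same scrd data
assumption as before; whether a blocked design applies depends on the study):

kruskal.test(response ~ condition, data = scrd)   # one-way only: one grouping factor
# kruskal.test() cannot take a second factor.  friedman.test(response ~
# condition | stimulus, data = scrd) handles a two-way layout, but only an
# unreplicated complete block design; proportional odds regression is more general.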

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

[snip]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-06 Thread Ben Ward
On 06/01/2011 20:29, Greg Snow wrote:
 Some would argue to always use the Kruskal-Wallis test since we never know
 for sure if we have normality.  Personally I am not sure that I understand
 what exactly that test is really testing.  Plus in your case you are doing a
 two-way anova and kruskal.test does one-way, so it will not work for your
 case.  There are other non-parametric options.
Just read this and had queries of my own and comments on this subject:
Would one of these options be to rank the data before doing whatever
model or test you want to do? As I understand it, ranking keeps the
relative position of the data the same but pulls extreme cases closer to
the rest. Not an expert though.
I've been doing lm() for my work, and I don't know if that makes an
assumption of normality (my data is not normal). And I'm unsure of any
other assumptions, as my texts don't really discuss them. I can
comfortably evaluate a model, say, using residuals vs fitted values, F
values turned to P, resampling and confidence intervals, and looking at
how the sums-of-squares terms add to the explanation of the model. I've
tried the plot() function to help graphically evaluate a model, and I
want to make sure I understand what it's showing me. I think the first
plot shows the model's fitted values vs the residuals, and ideally, I
think, the closer the points are to the red line the better. The next
plot is a Q-Q plot: the closer the points to the line, the more normal
the model coefficients (or perhaps the data). I'm not sure about the
next plot, titled Scale-Location; it looks to have the square root of
the standardized residuals on y and the fitted model values on x. Might
this be similar to the first plot? The final one is titled Residuals vs
Leverage, which has standardized residuals on y and leverage on x, and
something called Cook's Distance is plotted as well.

Thanks,
Ben. W
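Those four panels are the default diagnostics from the plot() method for lm
objects; a minimal way to reproduce them (a sketch, with a hypothetical fit):

fit <- lm(response ~ condition, data = scrd)   # any fitted lm object
op <- par(mfrow = c(2, 2))                     # four panels on one page
plot(fit)   # 1: Residuals vs Fitted, 2: Normal Q-Q,
            # 3: Scale-Location, 4: Residuals vs Leverage (Cook's distance contours)
par(op)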
[snip]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-06 Thread Frodo Jedi
Thanks a lot Greg, 
you have been very helpful.

All the best





[snip]

Re: [R] Assumptions for ANOVA: the right way to check the normality

2011-01-05 Thread Robert Baer
Someone suggested me that I don´t have to check the normality of the data, but
the normality of the residuals I get after the fitting of the linear model.
I really ask you to help me to understand this point as I don´t find enough
material online where to solve it.


Try the following:
# using your scrd data and your proposed models
fit1 <- lm(response ~ stimulus + condition + stimulus:condition, data=scrd)
fit2 <- lm(response ~ stimulus + condition, data=scrd)
fit3 <- lm(response ~ condition, data=scrd)

# Set up for 6 plots on 1 panel
op = par(mfrow=c(2,3))

# residuals function extracts residuals
# Visual inspection is a good start for checking normality
# You get a much better feel than from some magic number statistic
hist(residuals(fit1))
hist(residuals(fit2))
hist(residuals(fit3))

# especially qqnorm() plots which are linear for normal data
qqnorm(residuals(fit1))
qqnorm(residuals(fit2))
qqnorm(residuals(fit3))

# Restore plot parameters
par(op)
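One small addition that may help with the "linear for normal data" check:
qqline() overlays a reference line through the quartiles (same fits assumed):

qqnorm(residuals(fit3))
qqline(residuals(fit3))   # points should hug this line if roughly normal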



If the data are not normally distributed I have to use the Kruskal-Wallis test
and not the ANOVA... so please help me to understand.


Indeed - Kruskal-Wallis is a good test to use for one-factor data that is
ordinal, so it is a good alternative to your fit3.
Your response seems to be a discrete variable rather than a continuous variable.
You must decide if it is reasonable to approximate it with a normal
distribution, which is by definition continuous.
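A quick way to see that discreteness (a sketch, same scrd data assumption):

table(scrd$response)   # the ratings take only a handful of integer values (1-7)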




I make a numerical example, could you please tell me if the data in this table
are normally distributed or not?

Help!


number  stimulus               condition  response
1       flat_550_W_realism     A          3
2       flat_550_W_realism     A          3
3       flat_550_W_realism     A          5
4       flat_550_W_realism     A          3
5       flat_550_W_realism     A          3
6       flat_550_W_realism     A          3
7       flat_550_W_realism     A          3
8       flat_550_W_realism     A          5
9       flat_550_W_realism     A          3
10      flat_550_W_realism     A          3
11      flat_550_W_realism     A          5
12      flat_550_W_realism     A          7
13      flat_550_W_realism     A          5
14      flat_550_W_realism     A          2
15      flat_550_W_realism     A          3
16      flat_550_W_realism     AH         7
17      flat_550_W_realism     AH         4
18      flat_550_W_realism     AH         5
19      flat_550_W_realism     AH         3
20      flat_550_W_realism     AH         6
21      flat_550_W_realism     AH         5
22      flat_550_W_realism     AH         3
23      flat_550_W_realism     AH         5
24      flat_550_W_realism     AH         5
25      flat_550_W_realism     AH         7
26      flat_550_W_realism     AH         2
27      flat_550_W_realism     AH         7
28      flat_550_W_realism     AH         5
29      flat_550_W_realism     AH         5
30      bump_2_step_W_realism  A          1
31      bump_2_step_W_realism  A          3
32      bump_2_step_W_realism  A          5
33      bump_2_step_W_realism  A          1
34      bump_2_step_W_realism  A          3
35      bump_2_step_W_realism  A          2
36      bump_2_step_W_realism  A          5
37      bump_2_step_W_realism  A          4
38      bump_2_step_W_realism  A          4
39      bump_2_step_W_realism  A          4
40      bump_2_step_W_realism  A          4
41      bump_2_step_W_realism  AH         3
42      bump_2_step_W_realism  AH         5
43      bump_2_step_W_realism  AH         1
44      bump_2_step_W_realism  AH         5
45      bump_2_step_W_realism  AH         4
46      bump_2_step_W_realism  AH         4
47      bump_2_step_W_realism  AH         5
48      bump_2_step_W_realism  AH         4
49      bump_2_step_W_realism  AH         3
50      bump_2_step_W_realism  AH         4
51      bump_2_step_W_realism  AH         5
52      bump_2_step_W_realism  AH         4
53      hole_2_step_W_realism  A          3
54      hole_2_step_W_realism  A          3
55      hole_2_step_W_realism  A          4
56      hole_2_step_W_realism  A          1
57      hole_2_step_W_realism  A          4
58      hole_2_step_W_realism  A          3
59      hole_2_step_W_realism  A          5
60      hole_2_step_W_realism  A          4
61      hole_2_step_W_realism  A          3
62      hole_2_step_W_realism  A          4
63      hole_2_step_W_realism  A          7
64      hole_2_step_W_realism  A          5
65      hole_2_step_W_realism  A          1
66