Re: [R] correlation between categorical data

2015-01-28 Thread Heinz Tuechler

comment inline

David Winsemius wrote on 24.01.2015 21:08:


On Jan 23, 2015, at 5:54 PM, JohnDee wrote:


Heinz Tuechler wrote

At 07:40 21.06.2009, J Dougherty wrote:

[...]

There are other ways of regarding the FET.  Since it is precisely what it says
- an exact test - you can argue that you should avoid carrying over any
conclusions drawn about the small population the test was applied to and
employing them in a broader context.  In so far as the test is concerned, the
sample data and the contingency table it is arrayed in are the entire
universe.  In that sense, the FET can't be conservative or liberal.  It
isn't actually a hypothesis test and should not be thought of as one or used
in the place of one.



JDougherty


Could you give some reference supporting this view, which is surprising
to me? I don't see a necessary connection between an exact test and
the idea that it does not test a hypothesis.

Thanks,
Heinz







Fisher's Exact Test is a nonparametric test.  It tests the distribution in
the contingency table against the total possible arrangements and gives you
the precise likelihood of that many items being arranged in that manner.


That's not the way I understand the construction of the result. The statistic
is rather the number of arrangements as extreme as or more extreme than the
observed one (as measured by the odds ratio), holding the marginals constant,
divided by the total number of possible arrangements of the data.
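
A minimal sketch of that conditional construction, with made-up counts (it
reproduces the two-sided p-value that fisher.test() reports for a 2x2 table,
which orders tables by their conditional probability under the fixed margins):

x <- matrix(c(3, 1, 1, 3), nrow = 2)      # hypothetical 2x2 counts
m <- sum(x[1, ]); n2 <- sum(x[2, ]); k <- sum(x[, 1])  # the fixed margins
a <- max(0, k - n2):min(k, m)             # every feasible value of cell [1,1]
p.all <- dhyper(a, m, n2, k)              # probability of each table, margins held fixed
p.obs <- dhyper(x[1, 1], m, n2, k)        # probability of the observed table
sum(p.all[p.all <= p.obs * (1 + 1e-7)])   # sum over tables at least as extreme (tolerance for ties)
fisher.test(x)$p.value                    # should agree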



No more and no less.  You could argue about the greater population from which
your sample is drawn, but FET makes no assumptions at all about any greater
sample universe.


It is conditional on the margins, so that is the description of the universe.


  Also, since the population being used in FET is strictly
limited to the members of the contingency table, the results are a subset of
a finite group of possible results that are relevant to that specific
arrangement of data.  You are not estimating parameters of a parent
population or making any assumptions about the parent distribution.  You can
designate a p value such as 0.05 as a level of significance, but there is
no error term in the FET result.  Fisher stated that the test DOES assume
a null hypothesis of independence, which leads to a hypergeometric distribution
of the cell members.  But that creates other issues if you are attempting to use
the results in conjunction with assumptions about a broader sample universe
than that in the test.  For instance, you have to carry the assumption of a
hypergeometric distribution over into the land of reality your sample is
drawn from, and you then have to justify that.
In this respect I agree. A real world situation with a universe of fixed 
margins seems unusual to me.




And this is off-topic on R-help.
Sorry for asking a question off-topic more than five years ago. A nice 
surprise to get an answer.

Thanks,
Heinz






David Winsemius
Alameda, CA, USA






Re: [R] correlation between categorical data

2015-01-24 Thread David Winsemius

On Jan 23, 2015, at 5:54 PM, JohnDee wrote:

 Heinz Tuechler wrote
 At 07:40 21.06.2009, J Dougherty wrote:
 
 [...]
 There are other ways of regarding the FET.  Since it is precisely what it says
 - an exact test - you can argue that you should avoid carrying over any
 conclusions drawn about the small population the test was applied to and
 employing them in a broader context.  In so far as the test is concerned, the
 sample data and the contingency table it is arrayed in are the entire
 universe.  In that sense, the FET can't be conservative or liberal.  It
 isn't actually a hypothesis test and should not be thought of as one or used
 in the place of one.
 
 JDougherty
 
 Could you give some reference supporting this view, which is surprising
 to me? I don't see a necessary connection between an exact test and
 the idea that it does not test a hypothesis.
 
 Thanks,
 Heinz
 
 


 Fisher's Exact Test is a nonparametric test.  It tests the distribution in
 the contingency table against the total possible arrangements and gives you
 the precise likelihood of that many items being arranged in that manner.

That's not the way I understand the construction of the result. The statistic
is rather the number of arrangements as extreme as or more extreme than the
observed one (as measured by the odds ratio), holding the marginals constant,
divided by the total number of possible arrangements of the data.


 No more and no less.  You could argue about the greater population from which
 your sample is drawn, but FET makes no assumptions at all about any greater
 sample universe.

It is conditional on the margins, so that is the description of the universe.

  Also, since the population being used in FET is strictly
 limited to the members of the contingency table, the results are a subset of
 a finite group of possible results that are relevant to that specific
 arrangement of data.  You are not estimating parameters of a parent
 population or making any assumptions about the parent distribution.  You can
 designate a p value such as 0.05 as a level of significance, but there is
 no error term in the FET result.  Fisher stated that the test DOES assume
 a null hypothesis of independence, which leads to a hypergeometric distribution
 of the cell members.  But that creates other issues if you are attempting to use
 the results in conjunction with assumptions about a broader sample universe
 than that in the test.  For instance, you have to carry the assumption of a
 hypergeometric distribution over into the land of reality your sample is
 drawn from, and you then have to justify that.
 

And this is off-topic on R-help.
 
 

David Winsemius
Alameda, CA, USA



Re: [R] correlation between categorical data

2015-01-23 Thread JohnDee
Heinz Tuechler wrote
 At 07:40 21.06.2009, J Dougherty wrote:
 
 [...]
There are other ways of regarding the FET.  Since it is precisely what it says
- an exact test - you can argue that you should avoid carrying over any
conclusions drawn about the small population the test was applied to and
employing them in a broader context.  In so far as the test is concerned, the
sample data and the contingency table it is arrayed in are the entire
universe.  In that sense, the FET can't be conservative or liberal.  It
isn't actually a hypothesis test and should not be thought of as one or used
in the place of one.
 
JDougherty
 
 Could you give some reference supporting this view, which is surprising
 to me? I don't see a necessary connection between an exact test and
 the idea that it does not test a hypothesis.
 
 Thanks,
 Heinz
 

Fisher's Exact Test is a nonparametric test.  It tests the distribution in
the contingency table against the total possible arrangements and gives you
the precise likelihood of that many items being arranged in that manner.  No
more and no less.  You could argue about the greater population from which
your sample is drawn, but FET makes no assumptions at all about any greater
sample universe.  Also, since the population being used in FET is strictly
limited to the members of the contingency table, the results are a subset of
a finite group of possible results that are relevant to that specific
arrangement of data.  You are not estimating parameters of a parent
population or making any assumptions about the parent distribution.  You can
designate a p value such as 0.05 as a level of significance, but there is
no error term in the FET result.  Fisher stated that the test DOES assume
a null hypothesis of independence, which leads to a hypergeometric distribution
of the cell members.  But that creates other issues if you are attempting to use
the results in conjunction with assumptions about a broader sample universe
than that in the test.  For instance, you have to carry the assumption of a
hypergeometric distribution over into the land of reality your sample is
drawn from, and you then have to justify that.





Re: [R] correlation between categorical data

2009-06-21 Thread Heinz Tuechler

At 07:40 21.06.2009, J Dougherty wrote:

[...]
There are other ways of regarding the FET.  Since it is precisely what it says
- an exact test - you can argue that you should avoid carrying over any
conclusions drawn about the small population the test was applied to and
employing them in a broader context.  In so far as the test is concerned, the
sample data and the contingency table it is arrayed in are the entire
universe.  In that sense, the FET can't be conservative or liberal.  It
isn't actually a hypothesis test and should not be thought of as one or used
in the place of one.

JDougherty


Could you give some reference supporting this view, which is surprising
to me? I don't see a necessary connection between an exact test and
the idea that it does not test a hypothesis.


Thanks,
Heinz



Re: [R] correlation between categorical data

2009-06-21 Thread Daniel Malter
 For measures of association between two variables with two values each,
Cramer's V and Yule's Q are useful statistics. Look into this thread, for
example: http://markmail.org/message/sjd53z2dv2pb5nd6

To get a visual impression from plotting (sometimes helpful), you may use the
jitter function in the plot, for example:

n=1000                          # sample size for a simulated example
x=rnorm(n,0,1)                  # latent continuous variable
e=rnorm(n,0,1)                  # noise
y=x+e
xprob=exp(x)/(1+exp(x))
yprob=exp(y)/(1+exp(y))
xcat=rbinom(n,1,xprob)          # dichotomized (categorical) versions
ycat=rbinom(n,1,yprob)
plot(ycat~xcat)                 # totally useless
plot(jitter(ycat)~jitter(xcat)) # can be somewhat useful
table(ycat,xcat)                # interesting

# A measure of correlation between nominal variables (Yule's Q for a 2 x 2 table)
yule.Q=function(x,y){
  tab=table(x,y)
  (tab[1,1]*tab[2,2]-tab[1,2]*tab[2,1])/(tab[1,1]*tab[2,2]+tab[1,2]*tab[2,1])
}
yule.Q(ycat,xcat)
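
Cramer's V, mentioned above, could be computed along similar lines; a sketch
based on the Pearson chi-squared statistic (the function name is just
illustrative, and it works for general r x c tables):

cramers.V=function(x,y){
  tab=table(x,y)
  chi2=suppressWarnings(chisq.test(tab,correct=FALSE)$statistic) # Pearson X^2
  sqrt(as.numeric(chi2)/(sum(tab)*(min(dim(tab))-1)))  # V = sqrt(X^2/(N*(min(r,c)-1)))
}
cramers.V(ycat,xcat)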

Best,
Daniel




-
cuncta stricte discussurus
-

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Marc Schwartz
Sent: Saturday, June 20, 2009 7:37 PM
To: Jason Morgan
Cc: r-help
Subject: Re: [R] correlation between categorical data


On Jun 20, 2009, at 2:05 PM, Jason Morgan wrote:

 On 2009.06.19 14:04:59, Michael wrote:
 Hi all,

 In a data-frame, I have two columns of data that are categorical.

 How do I form some sort of measure of correlation between these two 
 columns?

 For numerical data, I just need to regress one to the other, or do 
 some pairs plot.

 But for categorical data, how do I find and/or visualize correlation 
 between the two columns of data?

 As Dylan mentioned, using crosstabs may be the easiest way. Also, a 
 simple correlation between the two variables may be informative. If 
 each variable is ordinal, you can use Kendall's tau-b (square table) 
 or tau-c (rectangular table). The former you can calculate with ?cor 
 (set method=kendall), the latter you may have to hack something 
 together yourself, there is code on the Internet to do this. If the 
 data are nominal, then a simple chi-squared test (large-n) or Fisher's 
 exact test (small-n) may be more appropriate. There are rules about 
 which to use when one variable is ordinal and one is nominal, but I 
 don't have my notes in front of me. Maybe someone else can provide 
 more assistance (and correct me if I'm wrong :).



I would be cautious in recommending the Fisher Exact Test based upon small
sample sizes, as the FET has been shown to be overly conservative. This
also applies to the use of the continuity correction for the chi-square test
(which replicates the behavior of the FET).

For more information see:
Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample
recommendations
Ian Campbell
Stat in Med 26:3661-3675; 2007
http://www3.interscience.wiley.com/journal/114125487/abstract
and:
How conservative is Fisher's exact test?
A quantitative evaluation of the two-sample comparative binomial trial
Gerald G. Crans, Jonathan J. Shuster
Stat Med. 2008 Aug 15;27(18):3598-611.
http://www3.interscience.wiley.com/journal/117929459/abstract


Frank also has some comments here (bottom of the page):

http://biostat.mc.vanderbilt.edu/wiki/Main/DataAnalysisDisc#Some_Important_Points_about_Cont


More generally, Agresti's Categorical Data Analysis is typically the first
reference in this domain to reach for. There is also a document written by
Laura Thompson which provides a nice R companion to Agresti. It is
available from:

https://home.comcast.net/~lthompson221/Splusdiscrete2.pdf


HTH,

Marc Schwartz




Re: [R] correlation between categorical data

2009-06-20 Thread Marc Schwartz


On Jun 20, 2009, at 2:05 PM, Jason Morgan wrote:


On 2009.06.19 14:04:59, Michael wrote:

Hi all,

In a data-frame, I have two columns of data that are categorical.

How do I form some sort of measure of correlation between these two  
columns?


For numerical data, I just need to regress one to the other, or do
some pairs plot.

But for categorical data, how do I find and/or visualize correlation
between the two columns of data?


As Dylan mentioned, using crosstabs may be the easiest way. Also, a
simple correlation between the two variables may be informative. If
each variable is ordinal, you can use Kendall's tau-b (square table)
or tau-c (rectangular table). The former you can calculate with ?cor
(set method=kendall), the latter you may have to hack something
together yourself, there is code on the Internet to do this. If the
data are nominal, then a simple chi-squared test (large-n) or Fisher's
exact test (small-n) may be more appropriate. There are rules about
which to use when one variable is ordinal and one is nominal, but I
don't have my notes in front of me. Maybe someone else can provide
more assistance (and correct me if I'm wrong :).
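
A small sketch of the ?cor suggestion above, using simulated ordinal data
(purely illustrative; the values 1..3 stand for ordered categories):

set.seed(1)
x <- sample(1:3, 50, replace = TRUE)   # first ordinal variable
y <- sample(1:3, 50, replace = TRUE)   # second ordinal variable
cor(x, y, method = "kendall")          # Kendall's tau for the two variables
table(x, y)                            # the corresponding crosstab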




I would be cautious in recommending the Fisher Exact Test based upon
small sample sizes, as the FET has been shown to be overly
conservative. This also applies to the use of the continuity  
correction for the chi-square test (which replicates the behavior of  
the FET).
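
A quick illustration with hypothetical small-sample counts (not from any real
data):

tab <- matrix(c(8, 2, 3, 7), nrow = 2)
chisq.test(tab, correct = FALSE)$p.value  # uncorrected chi-square (~0.02); warns about small expected counts
chisq.test(tab, correct = TRUE)$p.value   # with Yates continuity correction (~0.07)
fisher.test(tab)$p.value                  # Fisher's exact test (~0.07)

Here the continuity-corrected chi-square and the FET give noticeably larger
p-values than the uncorrected chi-square, which is the conservatism being
discussed.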


For more information see:
Chi-squared and Fisher-Irwin tests of two-by-two tables with small  
sample recommendations

Ian Campbell
Stat in Med 26:3661-3675; 2007
http://www3.interscience.wiley.com/journal/114125487/abstract
and:
How conservative is Fisher's exact test?
A quantitative evaluation of the two-sample comparative binomial trial
Gerald G. Crans, Jonathan J. Shuster
Stat Med. 2008 Aug 15;27(18):3598-611.
http://www3.interscience.wiley.com/journal/117929459/abstract


Frank also has some comments here (bottom of the page):

http://biostat.mc.vanderbilt.edu/wiki/Main/DataAnalysisDisc#Some_Important_Points_about_Cont


More generally, Agresti's Categorical Data Analysis is typically the  
first reference in this domain to reach for. There is also a document  
written by Laura Thompson which provides a nice R companion to
Agresti. It is available from:


https://home.comcast.net/~lthompson221/Splusdiscrete2.pdf


HTH,

Marc Schwartz



Re: [R] correlation between categorical data

2009-06-20 Thread J Dougherty
On Saturday 20 June 2009 04:36:55 pm Marc Schwartz wrote:
 On Jun 20, 2009, at 2:05 PM, Jason Morgan wrote:
  On 2009.06.19 14:04:59, Michael wrote:
  Hi all,
 
  In a data-frame, I have two columns of data that are categorical.
 
  How do I form some sort of measure of correlation between these two
  columns?
 
  For numerical data, I just need to regress one to the other, or do
  some pairs plot.
 
  But for categorical data, how do I find and/or visualize correlation
  between the two columns of data?
 
  As Dylan mentioned, using crosstabs may be the easiest way. Also, a
  simple correlation between the two variables may be informative. If
  each variable is ordinal, you can use Kendall's tau-b (square table)
  or tau-c (rectangular table). The former you can calculate with ?cor
  (set method=kendall), the latter you may have to hack something
  together yourself, there is code on the Internet to do this. If the
  data are nominal, then a simple chi-squared test (large-n) or Fisher's
  exact test (small-n) may be more appropriate. There are rules about
  which to use when one variable is ordinal and one is nominal, but I
  don't have my notes in front of me. Maybe someone else can provide
  more assistance (and correct me if I'm wrong :).

 I would be cautious in recommending the Fisher Exact Test based upon
 small sample sizes, as the FET has been shown to be overly
 conservative. 
 
 . . .
There are other ways of regarding the FET.  Since it is precisely what it says 
- an exact test - you can argue that you should avoid carrying over any 
conclusions drawn about the small population the test was applied to and 
employing them in a broader context.  In so far as the test is concerned, the 
sample data and the contingency table it is arrayed in are the entire 
universe.  In that sense, the FET can't be conservative or liberal.  It 
isn't actually a hypothesis test and should not be thought of as one or used 
in the place of one.  

JDougherty




[R] correlation between categorical data

2009-06-19 Thread Michael
Hi all,

In a data-frame, I have two columns of data that are categorical.

How do I form some sort of measure of correlation between these two columns?

For numerical data, I just need to regress one to the other, or do
some pairs plot.

But for categorical data, how do I find and/or visualize correlation
between the two columns of data?

Thanks!



Re: [R] correlation between categorical data

2009-06-19 Thread Dylan Beaudette
Not an expert, but I would try some of the following:

# tabulate joint frequencies
?table
?xtabs

# plotting
mosaicplot(Titanic, main = "Survival on the Titanic", color = TRUE, shade = TRUE)

# log-linear models
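# e.g., a minimal sketch with loglm from the MASS package (assuming the
# Titanic table used above; this fits the mutual-independence model):
library(MASS)
loglm(~ Class + Sex + Age + Survived, data = Titanic)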

check the library for more ideas.

Cheers,
Dylan

On Fri, Jun 19, 2009 at 2:04 PM, Michael <comtech@gmail.com> wrote:
 Hi all,

 In a data-frame, I have two columns of data that are categorical.

 How do I form some sort of measure of correlation between these two columns?

 For numerical data, I just need to regress one to the other, or do
 some pairs plot.

 But for categorical data, how do I find and/or visualize correlation
 between the two columns of data?

 Thanks!


