Re: [R] Basis of fisher.test

2006-01-13 Thread Ted Harding
On 13-Jan-06 Prof Brian Ripley wrote:
 On Thu, 12 Jan 2006 [EMAIL PROTECTED] wrote:
[...]
 ?fisher.test says only:
 
 [The following is not a quote from a current version of R.]
 
 In the one-sided 2 by 2 cases, p-values are obtained
 directly using the hypergeometric distribution.
 Otherwise, computations are based on a C version of
 the FORTRAN subroutine FEXACT which implements the
 network developed by Mehta and Patel (1986) and
 improved by Clarkson, Fan & Joe (1993). The FORTRAN
 code can be obtained from
 URL: http://www.netlib.org/toms/643.
 
 No, it *also* says
 
   Two-sided tests are based on the probabilities of the tables, and
   take as 'more extreme' all tables with probabilities less than or
   equal to that of the observed table, the p-value being the sum of
   such probabilities.
 
 which answers the question (there are only two-sided tests for such 
 tables).

Thanks for the above information, which is indeed the definitive
straightforward answer to my question!

(Not sure that I quite agree with the two-sided terminology, though,
since the ranking is unidirectional, based on decreasing probability,
and the P-value is that of the least-probability tail -- i.e. analogous
to the large (-2*loglik) tail of a likelihood-ratio test -- which
I've always visualised as a 1-tailed test, despite the fact that
the other tail can on occasion be indicative of a fit too good to
be true.)

 Now, what does the posting guide say about stating the R version and 
 updating before posting?

Well, I plead that in practice there is necessarily a grey area
here! My quotation was from ?fisher.test in R-2.1.0beta of
2004/04/08, the most recent version installed on any of my machines.
Admittedly a bit behind the times, but not grossly; and that help
page has not changed in this respect since the earliest version I
have installed, which is R-1.2.3 of 2001/04/26.

Contents of help pages can change overnight as R evolves.
While it is better to be up-to-date than behind the times (even
slightly), there is a compromise to be struck between upgrading
to the latest R every time one has a question which might be
answered thereby, or going on-line to read the latest PDF
documentation from CRAN, on the one hand, and on the other asking
a straightforward question to the list.

Thanks again, and best wishes,
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jan-06   Time: 08:55:11
-- XFMail --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Basis of fisher.test

2006-01-13 Thread Prof Brian Ripley
On Fri, 13 Jan 2006 [EMAIL PROTECTED] wrote:

 On 13-Jan-06 Prof Brian Ripley wrote:
 On Thu, 12 Jan 2006 [EMAIL PROTECTED] wrote:
 [...]
 ?fisher.test says only:

 [The following is not a quote from a current version of R.]

 In the one-sided 2 by 2 cases, p-values are obtained
 directly using the hypergeometric distribution.
 Otherwise, computations are based on a C version of
 the FORTRAN subroutine FEXACT which implements the
 network developed by Mehta and Patel (1986) and
 improved by Clarkson, Fan & Joe (1993). The FORTRAN
 code can be obtained from
 URL: http://www.netlib.org/toms/643.

 No, it *also* says

   Two-sided tests are based on the probabilities of the tables, and
   take as 'more extreme' all tables with probabilities less than or
   equal to that of the observed table, the p-value being the sum of
   such probabilities.

 which answers the question (there are only two-sided tests for such
 tables).

 Thanks for the above information, which is indeed the definitive
 straightforward answer to my question!

 (Not sure that I quite agree with the two-sided terminology, though,
 since the ranking is unidirectional, based on decreasing probability,
 and the P-value is that of the least-probability tail -- i.e. analogous
 to the large (-2*loglik) tail of a likelihood-ratio test -- which
 I've always visualised as a 1-tailed test, despite the fact that
 the other tail can on occasion be indicative of a fit too good to
 be true.)

As statistics is usually taught, significance tests are always one-tailed. 
The two-sided t-test is one-tailed, the test statistic being |T|.
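
A minimal sketch of that point, assuming simulated data (the seed,
sample size and mean are arbitrary choices for illustration): the
two-sided p-value reported by t.test() is the upper-tail probability
of |T|.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
x  <- rnorm(20, mean = 0.4)  # simulated data, purely illustrative
tt <- t.test(x)              # one-sample test, two-sided by default

t_obs <- unname(tt$statistic)
df    <- unname(tt$parameter)

# P(|T| >= |t_obs|) under the null: a single (upper) tail of |T|
p_abs <- 2 * pt(abs(t_obs), df, lower.tail = FALSE)

p_abs
tt$p.value                   # should agree with p_abs
```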

In any case, the `two-sided' is part of the arguments given to the 
function, so this para is just using the already-established terminology.

 Now, what does the posting guide say about stating the R version and
 updating before posting?

 Well, I plead that in practice there is necessarily a grey area
 here! My quotation was from ?fisher.test in R-2.1.0beta of
 2004/04/08, the most recent version installed on any of my machines.
 Admittedly a bit behind the times, but not grossly; and that help
 page has not changed in this respect since the earliest version I
 have installed, which is R-1.2.3 of 2001/04/26.

 Contents of help pages can change overnight as R evolves.
 While it is better to be up-to-date than behind the times (even
 slightly), there is a compromise to be struck between upgrading
 to the latest R every time one has a question which might be
 answered thereby, or going on-line to read the latest PDF
 documentation from CRAN, on the one hand, and on the other asking
 a straightforward question to the list.

Well, if you had given the R version number the problem would have been 
much more obvious.

 Thanks again, and best wishes,
 Ted.

 
 E-Mail: (Ted Harding) [EMAIL PROTECTED]
 Fax-to-email: +44 (0)870 094 0861
 Date: 13-Jan-06   Time: 08:55:11
 -- XFMail --



-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595



[R] Basis of fisher.test

2006-01-12 Thread Ted Harding
I want to ascertain the basis of the table ranking,
i.e. the meaning of 'extreme', in Fisher's Exact Test
as implemented in 'fisher.test', when applied to RxC
tables which are larger than 2x2.

One can summarise a strategy for the test as

1) For each table compatible with the margins
   of the observed table, compute the probability
   of this table conditional on the marginal totals.

2) Rank the possible tables in order of a measure
   of discrepancy between the table and the null
   hypothesis of no association.

3) Locate the observed table, and compute the sum
   of the probabilities, computed in (1), for this
   table and more extreme tables in the sense of
   the ranking in (2).

The question is: what measure of discrepancy is
used in 'fisher.test' corresponding to stage (2)?

(There are in principle several possibilities, e.g.
the value of a Pearson chi-squared statistic, large values
being discrepant; or the probability calculated in (1),
small values being discrepant; ... )

?fisher.test says only:

 In the one-sided 2 by 2 cases, p-values are obtained
 directly using the hypergeometric distribution.
 Otherwise, computations are based on a C version of
 the FORTRAN subroutine FEXACT which implements the
 network developed by Mehta and Patel (1986) and
 improved by Clarkson, Fan & Joe (1993). The FORTRAN
 code can be obtained from
 URL: http://www.netlib.org/toms/643.

I have had a look at this FORTRAN code, and cannot ascertain
it from the code itself. However, there is a Comment to the
effect:

c PRE- Table p-value.  (Output)
c  PRE is the probability of a more extreme table, where
c  'extreme' is in a probabilistic sense.

which suggests that the tables are ranked in order of their
probabilities as computed in (1).

Can anyone confirm definitively what goes on?

With thanks,
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 12-Jan-06   Time: 20:19:02
-- XFMail --



Re: [R] Basis of fisher.test

2006-01-12 Thread Peter Dalgaard
(Ted Harding) [EMAIL PROTECTED] writes:

 I want to ascertain the basis of the table ranking,
 i.e. the meaning of extreme, in Fisher's Exact Test
 as implemented in 'fisher.test', when applied to RxC
 tables which are larger than 2x2.
 
 One can summarise a strategy for the test as
 
 1) For each table compatible with the margins
    of the observed table, compute the probability
    of this table conditional on the marginal totals.
 
 2) Rank the possible tables in order of a measure
    of discrepancy between the table and the null
    hypothesis of no association.
 
 3) Locate the observed table, and compute the sum
    of the probabilities, computed in (1), for this
    table and more extreme tables in the sense of
    the ranking in (2).
 
 The question is: what measure of discrepancy is
 used in 'fisher.test' corresponding to stage (2)?
 
 (There are in principle several possibilities, e.g.
 the value of a Pearson chi-squared statistic, large values
 being discrepant; or the probability calculated in (1),
 small values being discrepant; ... )
 
 ?fisher.test says only:
 
  In the one-sided 2 by 2 cases, p-values are obtained
  directly using the hypergeometric distribution.
  Otherwise, computations are based on a C version of
  the FORTRAN subroutine FEXACT which implements the
  network developed by Mehta and Patel (1986) and
  improved by Clarkson, Fan & Joe (1993). The FORTRAN
  code can be obtained from
  URL: http://www.netlib.org/toms/643.
 
 I have had a look at this FORTRAN code, and cannot ascertain
 it from the code itself. However, there is a Comment to the
 effect:
 
 c PRE- Table p-value.  (Output)
 c  PRE is the probability of a more extreme table, where
 c  'extreme' is in a probabilistic sense.
 
 which suggests that the tables are ranked in order of their
 probabilities as computed in (1).
 
 Can anyone confirm definitively what goes on?

To my knowledge, it is the table probability, according to the
hypergeometric distribution, i.e. the probability of the table given
the marginals, which can be translated to sampling a+b balls without
replacement from a box with a+c white and b+d black balls. 

Playing around with dhyper should be instructive.

(You're right that the two-sided p values are obtained by summing
all smaller or equal table probabilities. This is the traditional way,
but there are alternatives, e.g. tail balancing.)
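
A minimal sketch of that kind of experiment for a 2x2 table, assuming
made-up counts (the table is hypothetical): the top-left cell is
treated as the number of white balls drawn, and the two-sided p-value
is rebuilt by summing the dhyper probabilities that do not exceed that
of the observed table. The small relative tolerance is an assumption
added here to keep tied tables from being lost to rounding.

```r
# Hypothetical 2x2 table; the counts are invented for illustration.
tab <- matrix(c(3, 1, 1, 3), nrow = 2)

a <- tab[1, 1]
m <- sum(tab[, 1])   # a + c: white balls in the box
n <- sum(tab[, 2])   # b + d: black balls in the box
k <- sum(tab[1, ])   # a + b: balls drawn without replacement

# Every value of the top-left cell compatible with the margins
support <- max(0, k - n):min(k, m)
probs   <- dhyper(support, m, n, k)
p_obs   <- dhyper(a, m, n, k)

# Sum the probabilities of all tables no more probable than the
# observed one.
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])

p_two_sided
fisher.test(tab)$p.value   # should agree with p_two_sided
```

For this table the cell probabilities are 1/70, 16/70, 36/70, 16/70
and 1/70, so the sum is 34/70.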

-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907



Re: [R] Basis of fisher.test

2006-01-12 Thread Prof Brian Ripley
On Thu, 12 Jan 2006 [EMAIL PROTECTED] wrote:

 I want to ascertain the basis of the table ranking,
 i.e. the meaning of extreme, in Fisher's Exact Test
 as implemented in 'fisher.test', when applied to RxC
 tables which are larger than 2x2.

 One can summarise a strategy for the test as

 1) For each table compatible with the margins
   of the observed table, compute the probability
   of this table conditional on the marginal totals.

 2) Rank the possible tables in order of a measure
   of discrepancy between the table and the null
   hypothesis of no association.

 3) Locate the observed table, and compute the sum
   of the probabilities, computed in (1), for this
   table and more extreme tables in the sense of
   the ranking in (2).

 The question is: what measure of discrepancy is
 used in 'fisher.test' corresponding to stage (2)?

 (There are in principle several possibilities, e.g.
 the value of a Pearson chi-squared statistic, large values
 being discrepant; or the probability calculated in (1),
 small values being discrepant; ... )

 ?fisher.test says only:

[The following is not a quote from a current version of R.]

 In the one-sided 2 by 2 cases, p-values are obtained
 directly using the hypergeometric distribution.
 Otherwise, computations are based on a C version of
 the FORTRAN subroutine FEXACT which implements the
 network developed by Mehta and Patel (1986) and
 improved by Clarkson, Fan & Joe (1993). The FORTRAN
 code can be obtained from
 URL: http://www.netlib.org/toms/643.

No, it *also* says

  Two-sided tests are based on the probabilities of the tables, and
  take as 'more extreme' all tables with probabilities less than or
  equal to that of the observed table, the p-value being the sum of
  such probabilities.

which answers the question (there are only two-sided tests for such 
tables).
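
The rule in that paragraph can be checked by brute force on a small
table. The following is only a sketch under stated assumptions: the
2x3 counts are invented, table_prob() is a hypothetical helper written
for this illustration (not part of R), and the relative tolerance used
for ties is an assumption rather than a documented constant.

```r
# Hypothetical 2x3 table; the counts are invented for illustration.
obs <- matrix(c(3, 1, 2, 2, 0, 4), nrow = 2)
rs  <- rowSums(obs)
cs  <- colSums(obs)

# Probability of a table conditional on its margins, computed on the
# log scale to avoid overflow in the factorials.
table_prob <- function(tab) {
  exp(sum(lfactorial(rowSums(tab))) + sum(lfactorial(colSums(tab))) -
        lfactorial(sum(tab)) - sum(lfactorial(tab)))
}

# Enumerate every 2x3 table with the observed margins. The first two
# cells of row 1 fix the rest of the table.
probs <- numeric(0)
for (x11 in 0:cs[1]) {
  for (x12 in 0:cs[2]) {
    x13 <- rs[1] - x11 - x12
    if (x13 < 0 || x13 > cs[3]) next
    row1  <- c(x11, x12, x13)
    probs <- c(probs, table_prob(rbind(row1, cs - row1)))
  }
}

p_obs   <- table_prob(obs)
# "More extreme" = probability less than or equal to the observed one.
p_value <- sum(probs[probs <= p_obs * (1 + 1e-7)])

sum(probs)                 # should be 1, up to rounding
p_value
fisher.test(obs)$p.value   # should agree with p_value
```

Full enumeration is only feasible for small tables, which is why
fisher.test relies on the network algorithm instead.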

Now, what does the posting guide say about stating the R version and 
updating before posting?

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595
