Thanks for these questions - all of the details are quite helpful. And
yes, I think your method for computing n12 and n22 is just fine.
As a historical note, it's worth pointing out the Fishing for
Exactness paper pre-dates Text-NSP by a number of years. This paper
was published in 1996, and Text-NSP began in about 2002 and was actively
developed for several years thereafter. That said, when implementing
Text-NSP we were certainly basing it on this earlier work, and so
I'd hope the results from Text-NSP would be consistent with the paper.
To that end I ran the example you gave on Text-NSP and show the
results below. What you see is consistent with what you ran in Python,
and so it seems pretty clear that the results from the paper are
indeed from the two-tailed test (contrary to what the paper says).
cat x.cnt
1382828
and<>industry<>22 30707 952
statistic.pl leftFisher x.left x.cnt
cat x.left
1382828
and<>industry<>1 0.6297 22 30707 952
statistic.pl rightFisher x.right x.cnt
cat x.right
1382828
and<>industry<>1 0.4546 22 30707 952
statistic.pl twotailed x.two x.cnt
cat x.two
1382828
and<>industry<>1 0.8253 22 30707 952
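For reference, the three Text-NSP runs above can be reproduced directly in
Python with SciPy (a sketch; it assumes scipy.stats.fisher_exact, whose
'less', 'greater', and 'two-sided' alternatives correspond to leftFisher,
rightFisher, and twotailed):

```python
from scipy.stats import fisher_exact

# 2x2 contingency table for the bigram "and industry":
# n11 = 22 (joint count), n12 = 30685, n21 = 930, n22 = 1351191
table = [[22, 30685], [930, 1351191]]

# 'less' / 'greater' / 'two-sided' correspond to Text-NSP's
# leftFisher / rightFisher / twotailed statistics.
for alt, label in [("less", "left"), ("greater", "right"), ("two-sided", "two-tailed")]:
    odds, p = fisher_exact(table, alternative=alt)
    print(f"{label:10s} p = {p:.4f}")
```

The printed p-values should agree with the Text-NSP output above
(0.6297, 0.4546, 0.8253).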
As to your more general question of what should be done, I will need
to refresh my recollection of this, although in general the
interpretation of left-, right- and two-sided tests depends on your null
hypothesis. In our case, and for finding "dependent" bigrams in
general, the null hypothesis is that the two words are independent,
and so we are seeking evidence to either confirm or deny that
hypothesis. The left-sided test (for Fisher's exact) is giving us the
p-value of observing n11 <= 22. How to interpret that is where I need to refresh
my recollection, but that is the general direction things are heading.
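To make that left-sided p-value concrete: under the independence null
hypothesis with the margins fixed, n11 follows a hypergeometric
distribution, and the left- and right-sided p-values are just its tails.
A small sketch (variable names are mine; margins taken from the example
in the thread):

```python
from scipy.stats import hypergeom

# Margins for "and industry": n++ = 1382828, n1+ = 30707, n+1 = 952.
# Under independence, n11 ~ Hypergeom(M=n++, n=n1+, N=n+1).
dist = hypergeom(1382828, 30707, 952)

print("expected n11 =", dist.mean())          # about 21.14, as in the paper
print("left  p = P(n11 <= 22):", dist.cdf(22))   # about 0.6297
print("right p = P(n11 >= 22):", dist.sf(21))    # about 0.4546
```

Note the two tails overlap at n11 = 22 itself, which is why the left and
right p-values sum to more than 1.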
I think a one sided test makes more sense for identifying dependent
bigrams, since in general if you have more occurrences than you expect
by chance, at some point beyond that expected value you are going to
decide it's not a chance occurrence. There is no value above the
expected value where you are going to say (I don't think) "oh no, these
two words are no longer dependent on each other", i.e. that they are
occurring too frequently to be dependent. I think a two-tailed test
makes the most sense if there is a point both above and below the
expected value where your null hypothesis is potentially rejected.
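As an illustration of that last point, the two-tailed p-value sums the
probabilities of all tables no more likely than the observed one, so it
collects mass from both sides of the expected value. A sketch of that
definition (this is the same rule SciPy uses for its two-sided value;
names are mine):

```python
import numpy as np
from scipy.stats import hypergeom

# Fixed margins for "and industry": n++ = 1382828, n1+ = 30707, n+1 = 952.
dist = hypergeom(1382828, 30707, 952)

k = np.arange(0, 953)      # all possible values of n11 given the margins
probs = dist.pmf(k)
p_obs = dist.pmf(22)       # probability of the observed table

# Two-tailed p-value: total probability of tables no more likely
# than the observed one, on either side of the expected value.
p_two = probs[probs <= p_obs].sum()
print(p_two)               # about 0.8253
```

This reproduces the 0.8253 value, pulling in counts both well below and
at-or-above the expectation of 21.14.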
In the case of "and industry" where the expected value is 21.14, it
seems very hard to argue that 22 occurrences is enough to say that
they are dependent. But, this is where I'm just a little foggy right
now. I'll look at this a little more and reply a bit more precisely.
I'm not sure about the keyword extraction case, but if you have an
example I'd be happy to think a little further about that as well!
More soon,
Ted
---
Ted Pedersen
http://www.d.umn.edu/~tpederse

On Sun, Nov 25, 2018 at 11:32 AM BLK Serene wrote:
>
> Thanks for the clarification!
>
> And I have some other question about your paper "Fishing for Exactness"
>
> 1. The paper says that "In the test for association to determine bigram
> dependence Fisher's exact test is interpreted as a left-sided test."
> And in last part "Experiment: Test for Association", it also says that "In
> this experiment, we compare the significance values computed using the
> t-test, the x2 approximation to the distribution of both G2 and X2 and
> Fisher's exact test (left sided)".
> But as for the examples given in "Figure 8: test for association:
> industry":
> E.g. for word "and", the given data is:
> n++ (total number of tokens in the corpus): 1382828 (taken from "Figure
> 3")
> n+1 (total frequency of "industry"): 952 (taken from "Figure 3")
>
> n11 = 22
> n21 = 952 - 22 = 930
>
> Since n12 is not given in the table, I have to compute it by
> m11 = n1+ * n+1 / n++
> so n1+ is 21.14 * 1382828 / 952 = 30706.915882352943 (approximately 30707)
>
> And then:
> n12 = 30707 - 22 = 30685
> n22 = 1382828 - 952 - 30707 + 22 = 1351191
>
> I'm not sure if my calculation is correct, but when using n11 = 22, n12 =
> 30685, n21 = 930, n22 = 1351191 as the input, the left-sided Fisher's exact
> test gives the result 0.6296644386744733, which does not match the 0.8255
> given in the example. I use Python's SciPy module to calculate this:
>
> >>> scipy.stats.fisher_exact([[22, 30685], [930, 1351191]], alternative='less')
> # the parameter "alternative" specifies the left-sided test be used
> (1.041670459980972, 0.6296644386744733)
> # The first value is the Odds Ratio (irrelevant), the second is the
> # p-value given by Fisher's exact test
>
> Then I tried the two-tailed test, which gave the expected value
> (approximately):
>
> >>> scipy.stats.fisher_exact([[22, 30685], [930, 1351191]], alternative='two-sided')
> # Two-sided test
> (1.041670459980972, 0.8253462481347)
>
> So I suppose that the results given in the figure are actually calculated
> using the two-sided Fisher's exact test (is it a mistake or