As you may have heard, Yahoo Groups is going away in a few weeks. This is
what we have been using (for more than 15 years now) for the NSP (Ngram
Statistics Package) mailing list (ngram).
https://help.yahoo.com/kb/SLN31010.html
Over the years I've been archiving the ngram mailing list to
My apologies for being a bit slow in following up on this. But I
think that for identifying significant or interesting bigrams with
Fisher's exact test, a left-sided test makes the most sense. The
left-sided test gives us the probability that the pair of words would
occur together this infrequently (or less) if the two words were in
fact independent of one another.
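For anyone who wants to try this, the left-sided test is the measure NSP
calls leftFisher (Text::NSP::Measures::2D::Fisher::left). A minimal run
over an existing bigram count file, with hypothetical filenames:

statistic.pl leftFisher corpus.left corpus.cnt

where corpus.cnt would be the output of an earlier count.pl run.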
Thanks for these questions - all of the details are quite helpful. And
yes, I think your method for computing n12 and n22 is just fine.
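Just to spell out the relationships (using the n11/n1p/np1/npp notation
from the Text::NSP::Measures documentation; the numbers here are made up
for illustration), as a quick bash sketch:

n11=10; n1p=30; np1=25; npp=1000    # joint, row, column, and total bigram counts
n12=$((n1p - n11))                  # word1 followed by something other than word2 -> 20
n21=$((np1 - n11))                  # word2 preceded by something other than word1 -> 15
n22=$((npp - n11 - n12 - n21))      # bigrams containing neither word -> 955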
As a historical note, it's worth pointing out that the Fishing for
Exactness paper pre-dates Text-NSP by a number of years. It was
published in 1996, and
Hi Blk,
Thanks for pointing these out. On the Poisson Stirling measure, I
think the reason we haven't included log n is that log n would simply
be a constant (log of the total number of bigrams) and so would not
change the rankings that we get from these scores. That said, if you
were comparing
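For reference, ranking bigrams by the Poisson-Stirling measure looks
something like this (the measure name is ps, implemented in
Text::NSP::Measures::2D::MI::ps; filenames are hypothetical):

statistic.pl ps corpus.ps corpus.cnt

Since log(npp) is identical for every bigram drawn from the same corpus,
adding it to each score would shift all the scores equally and leave this
ranking unchanged.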
There is not a way to make huge-count.pl (or count.pl) case insensitive. It
will take the input pretty much "as is" and use that. So, I think you'd
need to lower-case your files before they make it to huge-count.pl. You can
use --token to specify how you tokenize words (for example, whether you treat don't as
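As a concrete sketch of that lower-casing step (Linux, with hypothetical
filenames):

tr '[:upper:]' '[:lower:]' < input.txt > input.lc.txt

and then point huge-count.pl at the lower-cased copies instead of the
originals.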
Hi Catherine,
Here are a few answers to your questions, hopefully.
I don't think we'll be able to update this code anytime soon - we just
don't have anyone available to work on it right now, unfortunately. That
said, we are very open to others making contributions, fixes, etc.
The number of
Let me go back and revisit this - I seem to have confused myself!
More soon,
Ted
On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:
>
>
> Did I misread the documentation then?
>
> "huge-count.pl doesn't consider bigrams at file
Hi Catherine,
Just to make sure I'm understanding what you'd like to do, could you send
the command you are trying to run, and some idea of the number of files
you'd like to process?
Thanks!
Ted
On Sun, Apr 15, 2018 at 6:01 PM, catherine.dejage...@gmail.com [ngram] <
ngram@yahoogroups.com>
I guess my first thought would be to see if there is a simple way to
combine the input you are providing to huge-count.pl into fewer files. If you
have a lot of files that start with the letter 'a', for example, you could
concatenate them all together via a (Linux) command like
cat a* >
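A slightly fuller sketch of that idea, looping over leading letters in
bash (the output names are hypothetical, and any letter with no matching
files will just produce a harmless cat error):

for letter in {a..z}; do
  cat "$letter"* > "merged_$letter.txt"
done

You'd then hand the merged files to huge-count.pl instead of the
thousands of originals.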
Hi Julio,
Thanks for your question. In NSP we are always counting ngrams, so the
order of the words making up the ngram is considered. When we are counting
bigrams (the default case for NSP) word1 is always the first word in a
bigram, and word2 is always the second word. I think in other
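To make that concrete: count.pl's default bigram output lists
word1<>word2<>n11 n1p np1, so the two orders appear as separate entries.
With invented counts:

new<>york<>42 58 45
york<>new<>1 6 60

That is, 'new york' and 'york new' are counted as different bigrams.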
I think this mail was somehow delayed, but I hope this response is still
useful.
NSP has a command line interface. In general you specify the output file
first, and the input file second. So if you want to write the output of
count.pl to a file called myoutput.txt, and if your input text is
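For instance, if the input text were in a file called mytext.txt (both
filenames here are hypothetical), the command would be:

count.pl myoutput.txt mytext.txt

with the output file named first and the input text second.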
Text::NSP has a command line interface that allows you to provide a file or
a folder/directory for input. There are some simple examples shown below
that take a single file as input. That might be a good place to start, just
to make sure everything is working as expected.
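In that spirit, a single-file run and a directory run might look like
this (all names hypothetical; as I recall, --recurse makes count.pl
descend into subdirectories):

count.pl out.cnt input.txt
count.pl --recurse out.cnt textdir/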
The regex in token should look like this:
/\S+/
I think not having the / / is causing the delimiter errors...
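As a sketch with hypothetical filenames - the token file holds one Perl
regular expression per line, slashes included:

echo '/\S+/' > mytoken.txt
count.pl --token mytoken.txt myoutput.txt input.txt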
On Thu, May 12, 2016 at 2:11 AM, amir.jad...@yahoo.com [ngram] <
ngram@yahoogroups.com> wrote:
>
>
> I'm running count.pl on a set of Unicode documents. I created a new
> file ('token')
Tokenization and the --token option are described here :
http://search.cpan.org/~tpederse/Text-NSP/doc/README.pod#2._Tokens
On Tue, May 10, 2016 at 8:14 AM, amir.jad...@yahoo.com [ngram] <
ngram@yahoogroups.com> wrote:
>
The Ngram Statistics Package is mostly intended to help you find the most
frequent ngrams in a corpus, or the most strongly associated ngrams. It
doesn't directly give you informativeness, although you can certainly
come up with ways to use frequency and measures of
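For example (hypothetical filenames), count.pl's --frequency cutoff keeps
only bigrams seen at least N times, and statistic.pl then ranks them by an
association measure such as log-likelihood:

count.pl --frequency 5 corpus.cnt corpus.txt
statistic.pl ll corpus.ll corpus.cnt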
For many years now, http://search.cpan.org has been my go-to link for
finding CPAN distributions, and has been the URL we've listed on our web
sites directing users to Perl software downloads.
Sadly the site has become very unreliable in the last few months, and there
does not appear to be a