[ngram] yahoo groups going away - ngram - Ngram Statistics Package

2019-10-21 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
As you may have heard, Yahoo Groups is going away in a few weeks. This is what we have been using (for more than 15 years now) for the NSP (Ngram Statistics Package) mailing list (ngram). https://help.yahoo.com/kb/SLN31010.html Over the years I've been archiving the ngram mailing list to

[ngram] Re: Some questions about Text-NSP

2018-12-06 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
My apologies for being a bit slow in following up on this. But, I think for identifying significant or interesting bigrams with Fisher's exact test, a left-sided test makes the most sense. The left-sided test gives us the probability that the pair of words would occur together less frequently if
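For anyone who wants to try this, here is a minimal sketch of running a left-sided Fisher's exact test over count.pl output (the file names are made up; leftFisher is assumed to be the name your Text-NSP install uses for the left-sided measure, so check the statistic.pl documentation if it differs):

   count.pl mycorpus.cnt mycorpus.txt
   statistic.pl leftFisher mycorpus.left mycorpus.cnt

statistic.pl then writes each bigram to the destination file along with its rank and score under the chosen measure.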

[ngram] Re: Some questions about Text-NSP

2018-11-25 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Thanks for these questions - all of the details are quite helpful. And yes, I think your method for computing n12 and n22 is just fine. As a historical note, it's worth pointing out that the Fishing for Exactness paper pre-dates Text-NSP by a number of years. This paper was published in 1996, and
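As a quick illustration of how those cells relate to the counts count.pl reports (the numbers here are made up; n11 is the joint frequency, n1p and np1 are the marginals, and npp is the total number of bigrams):

   n11 = 10     (word1 followed by word2)
   n1p = 30     (bigrams with word1 in the first position)
   np1 = 25     (bigrams with word2 in the second position)
   npp = 1000   (total bigrams in the corpus)

   n12 = n1p - n11 = 30 - 10 = 20
   n21 = np1 - n11 = 25 - 10 = 15
   n22 = npp - n1p - np1 + n11 = 1000 - 30 - 25 + 10 = 955

The four cells sum back to npp (10 + 20 + 15 + 955 = 1000), which is a handy sanity check.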

[ngram] Re: Some questions about Text-NSP

2018-11-25 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Blk, Thanks for pointing these out. On the Poisson Stirling measure, I think the reason we haven't included log n is that log n would simply be a constant (log of the total number of bigrams) and so would not change the rankings that we get from these scores. That said, if you were comparing
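A tiny made-up illustration of the point that a constant term leaves the rankings alone, assuming the missing log n enters additively:

   score(bigram A) = 5.2,  score(bigram B) = 3.7   -> A ranks above B
   add log(npp) = 6.9 to both:
   score(bigram A) = 12.1, score(bigram B) = 10.6  -> A still ranks above B

Within a single corpus npp is fixed, so the shift is the same for every bigram; it would only matter if raw scores were compared across corpora with different bigram totals, since then the constant would differ.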

Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-17 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
There is not a way to make huge-count.pl (or count.pl) case insensitive. It will take the input pretty much "as is" and use that. So, I think you'd need to lower case your files before they made it to huge-count.pl. You can use --token to specify how you tokenize words (like do you treat don't as
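A sketch of one way to do the lowercasing ahead of time, for plain ASCII text (the file names are hypothetical):

   tr 'A-Z' 'a-z' < input.txt > input_lc.txt
   count.pl --token mytoken.txt output.cnt input_lc.txt

Here mytoken.txt is a token definition file in the format count.pl expects, one Perl regular expression per line wrapped in forward slashes, for example a single line containing /\w+/ to treat runs of word characters as tokens.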

Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-17 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Catherine, Here are a few answers to your questions, hopefully. I don't think we'll be able to update this code anytime soon - we just don't have anyone available to work on that right now, unfortunately. That said, we are very open to others making contributions, fixes, etc. The number of

Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-16 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Let me go back and revisit this again; I seem to have confused myself! More soon, Ted On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram] < ngram@yahoogroups.com> wrote: > > > Did I misread the documentation then? > > "huge-count.pl doesn't consider bigrams at file

Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-15 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Catherine, Just to make sure I'm understanding what you'd like to do, could you send the command you are trying to run, and some idea of the number of files you'd like to process? Thanks! Ted On Sun, Apr 15, 2018 at 6:01 PM, catherine.dejage...@gmail.com [ngram] < ngram@yahoogroups.com>

Re: [ngram] Using huge-count.pl with lots of files

2018-04-15 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
I guess my first thought would be to see if there is a simple way to combine the input you are providing to huge-count.pl into fewer files. If you have a lot of files that start with the letter 'a', for example, you could concatenate them all together via a (Linux) command like cat a* >
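A sketch of that kind of pre-concatenation step (the merged file names are made up):

   cat a* > merged_a.txt
   cat b* > merged_b.txt

and then hand the much smaller set of merged files to huge-count.pl as usual. One caveat worth keeping in mind: concatenation creates new bigrams across the old file boundaries (the last word of one file is now followed by the first word of the next), which may or may not matter for a given application.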

[ngram] Re: PMI Query

2017-05-14 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Julio, Thanks for your question. In NSP we are always counting ngrams, so the order of the words making up the ngram is considered. When we are counting bigrams (the default case for NSP) word1 is always the first word in a bigram, and word2 is always the second word. I think in other
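As a concrete, made-up illustration of the ordering point: count.pl treats "new york" and "york new" as two different bigrams, so they get separate lines in the output, roughly of the form

   new<>york<>210 450 300
   york<>new<>2 310 520

where the three numbers are n11 (the joint count), n1p (how often the first word appears in the first position of any bigram), and np1 (how often the second word appears in the second position).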

Re: [ngram] Upload files

2017-04-01 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
I think this mail was somehow delayed, but I hope this response is still useful. NSP has a command line interface. In general you specify the output file first, and the input file second. So if you want to write the output of count.pl to a file called myoutput.txt, and if your input text is
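A minimal sketch of that calling convention (the file names are just examples):

   count.pl myoutput.txt myinput.txt

count.pl also accepts options before the file arguments, for instance --ngram 3 to count trigrams instead of the default bigrams.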

Re: [ngram] Upload files

2017-01-31 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Text::NSP has a command line interface that allows you to provide a file or a folder/directory for input. There are some simple examples shown below that take a single file as input. That might be a good place to start, just to make sure everything is working as expected.
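For example (hypothetical names), either of these forms should work, the first counting a single file and the second counting the files under a directory:

   count.pl out_single.cnt mytext.txt
   count.pl out_dir.cnt mycorpus_dir/

Starting with one small file, as suggested, makes it easy to eyeball the output and confirm the tokenization looks right before scaling up.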

Re: [ngram] Ignoring regex with no delimiters

2016-05-12 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
The regex in token should look like this: /\S+/ I think not having the / / is causing the delimiter errors... On Thu, May 12, 2016 at 2:11 AM, amir.jad...@yahoo.com [ngram] < ngram@yahoogroups.com> wrote: > > > I'm running count.pl on a set of unicode documents. Create a new > file('token')
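A minimal example (hypothetical file names): create a file called token whose only line is

   /\S+/

and then run

   count.pl --token token output.cnt input.txt

Each line of the token file is a Perl regular expression wrapped in forward slashes; leaving the slashes off is what typically triggers the delimiter errors.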

Re: [ngram] count.pl for unicode documents

2016-05-10 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Tokenization and the --token option are described here: http://search.cpan.org/~tpederse/Text-NSP/doc/README.pod#2._Tokens On Tue, May 10, 2016 at 8:14 AM, amir.jad...@yahoo.com [ngram] < ngram@yahoogroups.com> wrote: > > [Attachment(s) from >

Re: [ngram] How to recognize informative n-grams in a corpus?

2016-05-10 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
The Ngram Statistics Package is mostly intended to help you find the most frequent ngrams in a corpus, or the most strongly associated ngrams in a corpus. It doesn't necessarily directly give you informativeness, although you can certainly come up with ways to use frequency and measures of
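A sketch of the usual two-step pipeline (the file names and the choice of measure are just examples):

   count.pl corpus.cnt corpus.txt
   statistic.pl ll corpus.ll corpus.cnt

count.pl produces the raw ngram frequencies, and statistic.pl ranks them with a measure of association (ll is the log-likelihood ratio; several other measures ship with Text-NSP). Frequency or association rank can then feed into whatever notion of informativeness fits the task.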

[ngram] the (apparent) demise of search.cpan.org

2014-07-18 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
For many years now, http://search.cpan.org has been my go-to link for finding CPAN distributions, and has been the URL we've listed on our web sites directing users to Perl software downloads. Sadly the site has become very unreliable in the last few months, and there does not appear to be a