[ngram] yahoo groups going away - ngram - Ngram Statistics Package

2019-10-21 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
As you may have heard, Yahoo Groups is going away in a few weeks. This is
what we have been using (for more than 15 years now) for the NSP (Ngram
Statistics Package) mailing list (ngram).

https://help.yahoo.com/kb/SLN31010.html

Over the years I've been archiving the ngram mailing list to mail-archive,
so previous content is available there (going back many years now).

https://www.mail-archive.com/ngram@yahoogroups.com/

The email list is not too active these days, so I am planning to use the
more general DuluthNLP email list as the place to post updates about NSP,
and where you can post if you have questions. Folks continue to use NSP so we
will continue to answer questions as they arise. Please feel free to join
up if you would like to stay in touch.

https://groups.google.com/forum/#!forum/duluthnlp

The NSP project page remains at:

http://ngram.sourceforge.net/

Thanks for your interest in NSP over the years, and please do stay in
touch.

Cordially,
Ted
---
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] Re: Some questions about Text-NSP

2018-12-06 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
My apologies for being a bit slow in following up on this. But, I
think for identifying significant or interesting bigrams with Fisher's
exact test, a left sided test makes the most sense. The left sided
test gives us the probability that the pair of words would occur
together as often as or less often than we observed if we repeated our
experiment on another sample of text. If the left sided probability is
high it means our current observation is much more frequent than we'd
expect based on pure chance alone, and so the pair of words we have
observed is likely to be significant or interesting.
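
As a quick illustration, you can also get the left sided value directly
from the Text::NSP measure modules. Here is a small sketch using the
"and industry" counts discussed later in this thread (n11=22, n1p=30707,
np1=952, npp=1382828), following the calculateStatistic interface shown
in the Text::NSP documentation:

use Text::NSP::Measures::2D::Fisher::left;

# n11 is the joint count of the bigram, n1p and np1 are the marginal
# counts, and npp is the total number of bigrams in the corpus
my $left = calculateStatistic( n11 => 22, n1p => 30707,
                               np1 => 952, npp => 1382828 );
if ( my $errorCode = getErrorCode() ) {
    print STDERR $errorCode, " - ", getErrorMessage(), "\n";
} else {
    printf "%s = %.4f\n", getStatisticName(), $left;   # 0.6297, as below
}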

I hope this makes some sense, but please feel free to follow up if it
doesn't or if you think I may be misinterpreting something here.

Cordially,
Ted

---
Ted Pedersen
http://www.d.umn.edu/~tpederse

On Sun, Nov 25, 2018 at 6:28 PM Ted Pedersen  wrote:
>
> Thanks for these questions - all of the details are quite helpful. And
> yes, I think your method for computing n12 and n22 is just fine.
>
> As a historical note, it's worth pointing out the Fishing for
> Exactness paper pre-dates Text-NSP by a number of years. This paper
> was published in 1996, and Text-NSP began in about 2002 and was actively
> developed for several years thereafter. That said, when implementing
> Text-NSP we were certainly basing it off of this earlier work and so
> I'd hope the results from Text-NSP would be consistent with the paper.
> To that end I ran the example you gave on Text-NSP and show the
> results below. What you see is consistent with what you ran in Python,
> and so it seems pretty clear that the results from the paper are
> indeed the two tailed test (contrary to what the paper says).
>
> cat x.cnt
> 1382828
> and<>industry<>22 30707 952
>
> statistic.pl leftFisher x.left x.cnt
>
> cat x.left
> 1382828
> and<>industry<>1 0.6297 22 30707 952
>
> statistic.pl rightFisher x.right x.cnt
>
> cat x.right
> 1382828
> and<>industry<>1 0.4546 22 30707 952
>
> statistic.pl twotailed x.two x.cnt
>
> cat x.two
> 1382828
> and<>industry<>1 0.8253 22 30707 952
>
> As to your more general question of what should be done, I will need
> to refresh my recollection of this, although in general the
> interpretation of left, right and two sided tests depends on your null
> hypothesis. In our case, and for finding "dependent" bigrams in
> general, the null hypothesis is that the two words are independent,
> and so we are seeking evidence to either confirm or deny that
> hypothesis. The left sided test (for Fisher's exact) is giving us the
> p-value of n11 <= 22. How to interpret that is where I need to refresh
> my recollection, but that is the general direction things are heading.
>
> I think a one sided test makes more sense for identifying dependent
> bigrams, since in general if you have more occurrences than you expect
> by chance, at some point beyond that expected value you are going to
> decide it's not a chance occurrence. There is no value above the
> expected value where you are going to say (I don't think) that these
> two words are no longer dependent on each other (i.e., that they are
> occurring too frequently to be dependent). I think a two tailed test makes the
> most sense if there is a point both above and below the expected value
> where your null hypothesis is potentially rejected.
>
> In the case of "and industry" where the expected value is 21.14, it
> seems very hard to argue that 22 occurrences is enough to say that
> they are dependent. But, this is where I'm just a little foggy right
> now. I'll look at this a little more and reply a bit more precisely.
>
> I'm not sure about the keyword extraction case, but if you have an
> example I'd be happy to think a little further about that as well!
>
> More soon,
> Ted
> ---
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
> On Sun, Nov 25, 2018 at 11:32 AM BLK Serene  wrote:
> >
> > Thanks for the clarification!
> >
> > And I have some other question about your paper "Fishing for Exactness"
> >
> > 1. The paper says that "In the test for association to determine bigram 
> > dependence Fisher's exact test is interpreted as a left-sided test."
> And in the last part, "Experiment: Test for Association", it also says that "In
> > this experiment, we compare the significance values computed using the 
> > t-test, the x2 approximation to the distribution of both G2 and X2 and 
> > Fisher's exact test (left sided)".
> > But as for the examples given in "Figure 8: test for association:  
> > industry":
> > E.g. for word "and", the given data is:
> > n++ (total number of tokens in the 

[ngram] Re: Some questions about Text-NSP

2018-11-25 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Thanks for these questions - all of the details are quite helpful. And
yes, I think your method for computing n12 and n22 is just fine.

As a historical note, it's worth pointing out the Fishing for
Exactness paper pre-dates Text-NSP by a number of years. This paper
was published in 1996, and Text-NSP began in about 2002 and was actively
developed for several years thereafter. That said, when implementing
Text-NSP we were certainly basing it off of this earlier work and so
I'd hope the results from Text-NSP would be consistent with the paper.
To that end I ran the example you gave on Text-NSP and show the
results below. What you see is consistent with what you ran in Python,
and so it seems pretty clear that the results from the paper are
indeed the two tailed test (contrary to what the paper says).

cat x.cnt
1382828
and<>industry<>22 30707 952

statistic.pl leftFisher x.left x.cnt

cat x.left
1382828
and<>industry<>1 0.6297 22 30707 952

statistic.pl rightFisher x.right x.cnt

cat x.right
1382828
and<>industry<>1 0.4546 22 30707 952

statistic.pl twotailed x.two x.cnt

cat x.two
1382828
and<>industry<>1 0.8253 22 30707 952

As to your more general question of what should be done, I will need
to refresh my recollection of this, although in general the
interpretation of left, right and two sided tests depends on your null
hypothesis. In our case, and for finding "dependent" bigrams in
general, the null hypothesis is that the two words are independent,
and so we are seeking evidence to either confirm or deny that
hypothesis. The left sided test (for Fisher's exact) is giving us the
p-value of n11 <= 22. How to interpret that is where I need to refresh
my recollection, but that is the general direction things are heading.

I think a one sided test makes more sense for identifying dependent
bigrams, since in general if you have more occurrences than you expect
by chance, at some point beyond that expected value you are going to
decide it's not a chance occurrence. There is no value above the
expected value where you are going to say (I don't think) that these
two words are no longer dependent on each other (i.e., that they are
occurring too frequently to be dependent). I think a two tailed test makes the
most sense if there is a point both above and below the expected value
where your null hypothesis is potentially rejected.

In the case of "and industry" where the expected value is 21.14, it
seems very hard to argue that 22 occurrences is enough to say that
they are dependent. But, this is where I'm just a little foggy right
now. I'll look at this a little more and reply a bit more precisely.
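
For reference, that expected value is just m11 = n1p * np1 / npp, which
is easy to check with a quick one-liner:

perl -le 'print 30707 * 952 / 1382828'    # about 21.14, the value above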

I'm not sure about the keyword extraction case, but if you have an
example I'd be happy to think a little further about that as well!

More soon,
Ted
---
Ted Pedersen
http://www.d.umn.edu/~tpederse
On Sun, Nov 25, 2018 at 11:32 AM BLK Serene  wrote:
>
> Thanks for the clarification!
>
> And I have some other question about your paper "Fishing for Exactness"
>
> 1. The paper says that "In the test for association to determine bigram 
> dependence Fisher's exact test is interpreted as a left-sided test."
> And in the last part, "Experiment: Test for Association", it also says that "In
> this experiment, we compare the significance values computed using the 
> t-test, the x2 approximation to the distribution of both G2 and X2 and 
> Fisher's exact test (left sided)".
> But as for the examples given in "Figure 8: test for association:  
> industry":
> E.g. for word "and", the given data is:
> n++ (total number of tokens in the corpus): 1382828 (taken from "Figure 
> 3")
> n+1 (total frequency of "industry"): 952 (taken from "Figure 3")
>
> n11 = 22
> n21 = 952 - 22 = 930
>
> Since n12 is not given in the table, I have to compute it by
> m11 = n1+ * n+1 / n++
> so n1+ is 21.14 * 1382828 / 952 = 30706.915882352943 (approximately 30707)
>
> And then:
> n12 = 30707 - 22 = 30685
> n22 = 1382828 - 952 - 30707 + 22 = 1351191
>
> I'm not sure if my calculation is correct, but when using n11 = 22, n12 = 
> 30685, n21 = 930, n22 = 1351191 as the input, the left-sided fisher's exact 
> test gives the result 0.6296644386744733 which is not matched with 0.8255 
> given in the example. I use Python's Scipy module to calculate this:
>
> >>> scipy.stats.fisher_exact([[22, 30685], [930, 1351191]], alternative='less')
> (1.041670459980972, 0.6296644386744733)
> # The parameter "alternative" specifies that the left-sided test be used.
> # The first value is the odds ratio (irrelevant), the second is the
> # p-value given by Fisher's exact test.
>
> Then I tried the two-tailed test, which gave the expe

[ngram] Re: Some questions about Text-NSP

2018-11-25 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Blk,

Thanks for pointing these out. On the Poisson Stirling measure, I
think the reason we haven't included log n is that log n would simply
be a constant (log of the total number of bigrams) and so would not
change the rankings that we get from these scores. That said, if you
were comparing scores across different sized corpora then the
denominator would likely be important to include.
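
To see why the rankings are unaffected, here is a small sketch (with
made-up n11 and m11 values) comparing the Text-NSP form to Quasthoff's
form, which just divides every score by the same constant log(n):

use strict; use warnings;
my $n = 1382828;                            # total bigrams; a constant for a given corpus
for my $pair ( [22, 21.14], [96, 53.5] ) {  # hypothetical (n11, m11) pairs
    my ($n11, $m11) = @$pair;
    my $ps  = $n11 * ( log($n11) - log($m11) - 1 );  # Text-NSP form
    my $sig = $ps / log($n);                         # Quasthoff's form
    printf "n11=%2d  ps=%9.4f  sig=%9.4f\n", $n11, $ps, $sig;
}

Since log($n) is the same positive constant for every bigram, dividing by
it never changes which of two scores is larger.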

Thanks for pointing out the typos. Text-NSP is right now in a fairly
dormant state, but I do have a list of small changes to make and will
add yours to these.

Thanks for your interest, and please let us know if you have any other
questions.

Cordially,
Ted
---
Ted Pedersen
http://www.d.umn.edu/~tpederse

On Sun, Nov 25, 2018 at 4:13 AM BLK Serene  wrote:
>
> Hi, I have some questions about the association measures implemented in 
> Text-NSP:
>
> The Poisson-Stirling measure given in the documentation is:
> Poisson-Stirling = n11 * ( log(n11) - log(m11) - 1)
>
> But in Quasthoff's paper the formula given by the author is:
> sig(A, B) = (k * (log k - log λ - 1)) / log n
>
> I'm a little confused since I know little about math or statistics. Why is 
> the denominator omitted here?
>
> And some typos in the doc:
> square of phi coefficient:
> PHI^2 = ((n11 * n22) - (n21 * n21))^2/(n1p * np1 * np2 * n2p)
> where n21 *n21 should be n12 * n21
>
> chi-squared test:
> Pearson's Chi-squred test measures the devitation (should be deviation) 
> between
>
> Pearson's Chi-Squared = 2 * [((n11 - m11)/m11)^2 + ((n12 - m12)/m12)^2 +
>  ((n21 - m21)/m21)^2 + ((n22 -m22)/m22)^2]
> should be: ((n11 - m11)/m11)^2 + ((n12 - m12)/m12)^2 +
>((n21 - m21)/m21)^2 + ((n22 -m22)/m22)^2
>
> And chi2: same as above.
>
> Thanks in advance.


Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-17 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
There is not a way to make huge-count.pl (or count.pl) case insensitive. It
will take the input pretty much "as is" and use that. So, I think you'd
need to lower case your files before they made it to huge-count.pl. You can
use --token to specify how you tokenize words (like do you treat don't as
three tokens (don ' t) or one (don't)). --stop lets you exclude words from
being counted, but there isn't anything that lets you ignore case.
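
One simple workaround (a sketch, not a feature of NSP itself) is to lower
case each file before it reaches huge-count.pl, for example with a Perl
one-liner:

perl -pe '$_ = lc' myinput.txt > myinput-lower.txt

and then run huge-count.pl on the lower cased copies.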

On Tue, Apr 17, 2018 at 8:51 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:

> Hi Catherine,
>
> Here are a few answers to your questions, hopefully.
>
> I don't think we'll be able to update this code anytime soon - we just
> don't have anyone available to work on that right now, unfortunately. That
> said we are very open to others making contributions, fixes, etc.
>
> The number of files that your system allows is pretty dependent on your
> system and operating system. On Linux you can actually adjust that (if you
> have sudo access) by running
>
> ulimit -s 10
>
> or
>
> ulimit -s unlimited
>
> This raises the stack size limit, and on Linux the total space allowed
> for command line arguments is tied to that limit (it is capped at a
> quarter of the stack size), which is why raising it can let a command
> accept a longer argument list. But if you don't have sudo access this is
> not something you can do.
>
> As far as taking multiple outputs from huge-count.pl and merging them
> with huge-merge, I think the answer is that's almost possible, but not
> quite. huge-merge is not expecting the bigram count that appears on the
> first line of huge-count.pl output to be there, and seems to fail as a
> result. So you would need to remove that first line from your
> huge-count.pl output before merging.
>
> The commands below kind of break down what is happening within
> huge-count.pl. If you run this you can get an idea of the input output
> expected by each stage...
>
> count.pl --tokenlist input1.out input1
>
> count.pl --tokenlist input2.out input2
>
> huge-sort.pl --keep input1.out
>
> huge-sort.pl --keep input2.out
>
> mkdir output-directory
>
> mv input1.out-sorted output-directory
>
> mv input2.out-sorted output-directory
>
> huge-merge.pl --keep output-directory
>
> I hope this helps. I realize it's not exactly a solution, but I hope it's
> helpful all the same. I'll go through your notes again and see if there are
> other issues to address...and of course if you try something and it does or
> doesn't work I'm very interested in hearing about that...
>
> Cordially,
> Ted
>
>
> On Tue, Apr 17, 2018 at 7:33 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>
>> The good news is that our documentation is more reliable than my memory.
>> :) huge-count treats each file separately and so bigrams do not cross file
>> boundaries. Having verified that, I'll get back to your original question.
>> Sorry about the diversion and the confusion that might have caused.
>>
>> More soon,
>> Ted
>>
>> On Mon, Apr 16, 2018 at 4:11 PM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>>
>>> Let me go back and revisit this again, I seem to have confused myself!
>>>
>>> More soon,
>>> Ted
>>>
>>> On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram]
>>> <ngram@yahoogroups.com> wrote:
>>>
>>>>
>>>>
>>>> Did I misread the documentation then?
>>>>
>>>> "huge-count.pl doesn't consider bigrams at file boundaries. In other
>>>> words,
>>>> the result of count.pl and huge-count.pl on the same data file will
>>>> differ if --newLine is not used, in that, huge-count.pl runs count.pl
>>>> on multiple files separately and thus loses track of the bigrams
>>>> on file boundaries. With --window not specified, there will be a loss
>>>> of one bigram at each file boundary, while it is W bigrams with --window
>>>> W."
>>>>
>>>> I thought that means bigrams won't cross from one file to the next?
>>>>
>>>> If bigrams don't cross from one file to the next, then I just need to
>>>> run huge-count.pl on smaller inputs, then combine. So if I break
>>>> @filenames into smaller subsets, then call huge-count.pl on the
>>>> subsets, then call huge-merge.pl to combine the counts, I think that
>>>> should work.
>>>>
>>>> I have a few more questions related to usage:
>>>>
>>>>- Do you know how many arguments are allowed for huge-count.pl? It
>>>>would be good to know what size chunks I need to split my data into. Or if
>>>>not, then how would I do a try catch block to catch the error "Argument
>>>>list too long" from the IPC::System::Simple::system call?
>>>>- Is there a case-insensitive way to count bigrams, or would I need
>>>>to convert all the text to lowercase before calling huge-count.pl?
>>>>- Would you consider modifying huge-count.pl so that the user can
>>>>specify the final output filename, instead of just automatically calling
>>>>the output file complete-huge-count.output?
>>>>
>>>> Thank you,
>>>> Catherine
>>>>
>>>> 
>>>>
>>>
>>>
>>
>


Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-17 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Catherine,

Here are a few answers to your questions, hopefully.

I don't think we'll be able to update this code anytime soon - we just
don't have anyone available to work on that right now, unfortunately. That
said we are very open to others making contributions, fixes, etc.

The number of files that your system allows is pretty dependent on your
system and operating system. On Linux you can actually adjust that (if you
have sudo access) by running

ulimit -s 10

or

ulimit -s unlimited

This raises the stack size limit, and on Linux the total space allowed
for command line arguments is tied to that limit (it is capped at a
quarter of the stack size), which is why raising it can let a command
accept a longer argument list. But if you don't have sudo access this is
not something you can do.

As far as taking multiple outputs from huge-count.pl and merging them with
huge-merge, I think the answer is that's almost possible, but not quite.
huge-merge is not expecting the bigram count that appears on the first line
of huge-count.pl output to be there, and seems to fail as a result. So you
would need to remove that first line from your huge-count.pl output before
merging.

The commands below kind of break down what is happening within huge-count.pl.
If you run this you can get an idea of the input output expected by each
stage...

count.pl --tokenlist input1.out input1

count.pl --tokenlist input2.out input2

huge-sort.pl --keep input1.out

huge-sort.pl --keep input2.out

mkdir output-directory

mv input1.out-sorted output-directory

mv input2.out-sorted output-directory

huge-merge.pl --keep output-directory

I hope this helps. I realize it's not exactly a solution, but I hope it's
helpful all the same. I'll go through your notes again and see if there are
other issues to address...and of course if you try something and it does or
doesn't work I'm very interested in hearing about that...

Cordially,
Ted


On Tue, Apr 17, 2018 at 7:33 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:

> The good news is that our documentation is more reliable than my memory.
> :) huge-count treats each file separately and so bigrams do not cross file
> boundaries. Having verified that, I'll get back to your original question.
> Sorry about the diversion and the confusion that might have caused.
>
> More soon,
> Ted
>
> On Mon, Apr 16, 2018 at 4:11 PM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>
>> Let me go back and revisit this again, I seem to have confused myself!
>>
>> More soon,
>> Ted
>>
>> On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram] <
>> ngram@yahoogroups.com> wrote:
>>
>>>
>>>
>>> Did I misread the documentation then?
>>>
>>> "huge-count.pl doesn't consider bigrams at file boundaries. In other
>>> words,
>>> the result of count.pl and huge-count.pl on the same data file will
>>> differ if --newLine is not used, in that, huge-count.pl runs count.pl
>>> on multiple files separately and thus loses track of the bigrams
>>> on file boundaries. With --window not specified, there will be a loss
>>> of one bigram at each file boundary, while it is W bigrams with --window W."
>>>
>>> I thought that means bigrams won't cross from one file to the next?
>>>
>>> If bigrams don't cross from one file to the next, then I just need to
>>> run huge-count.pl on smaller inputs, then combine. So if I break
>>> @filenames into smaller subsets, then call huge-count.pl on the
>>> subsets, then call huge-merge.pl to combine the counts, I think that
>>> should work.
>>>
>>> I have a few more questions related to usage:
>>>
>>>- Do you know how many arguments are allowed for huge-count.pl? It
>>>would be good to know what size chunks I need to split my data into. Or if
>>>not, then how would I do a try catch block to catch the error "Argument
>>>list too long" from the IPC::System::Simple::system call?
>>>- Is there a case-insensitive way to count bigrams, or would I need to
>>>convert all the text to lowercase before calling huge-count.pl?
>>>- Would you consider modifying huge-count.pl so that the user can
>>>specify the final output filename, instead of just automatically calling
>>>the output file complete-huge-count.output?
>>>
>>> Thank you,
>>> Catherine
>>>
>>> 
>>>
>>
>>
>


Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-16 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Let me go back and revisit this again, I seem to have confused myself!

More soon,
Ted

On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> Did I misread the documentation then?
>
> "huge-count.pl doesn't consider bigrams at file boundaries. In other
> words,
> the result of count.pl and huge-count.pl on the same data file will
> differ if --newLine is not used, in that, huge-count.pl runs count.pl
> on multiple files separately and thus loses track of the bigrams
> on file boundaries. With --window not specified, there will be a loss
> of one bigram at each file boundary, while it is W bigrams with --window W."
>
> I thought that means bigrams won't cross from one file to the next?
>
> If bigrams don't cross from one file to the next, then I just need to run
> huge-count.pl on smaller inputs, then combine. So if I break @filenames
> into smaller subsets, then call huge-count.pl on the subsets, then call
> huge-merge.pl to combine the counts, I think that should work.
>
> I have a few more questions related to usage:
>
>- Do you know how many arguments are allowed for huge-count.pl? It
>would be good to know what size chunks I need to split my data into. Or if
>not, then how would I do a try catch block to catch the error "Argument
>list too long" from the IPC::System::Simple::system call?
>- Is there a case-insensitive way to count bigrams, or would I need to
>convert all the text to lowercase before calling huge-count.pl?
>- Would you consider modifying huge-count.pl so that the user can
>specify the final output filename, instead of just automatically calling
>the output file complete-huge-count.output?
>
> Thank you,
> Catherine
>
> 
>


Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-15 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Catherine,

Just to make sure I'm understanding what you'd like to do, could you send
the command you are trying to run, and some idea of the number of files
you'd like to process?

Thanks!
Ted

On Sun, Apr 15, 2018 at 6:01 PM, catherine.dejage...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> That makes sense, but I'm not sure it will give me the behavior I want. I
> don't want bigrams to span from one file to the next, but I do want them to
> span across newlines. If I concatenate the files, then as I understand it
> my first condition is no longer met. Could I run huge-count.pl on
> subgroups of files, then combine the results? And how would I do that?
> 
>


Re: [ngram] Using huge-count.pl with lots of files

2018-04-15 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
I guess my first thought would be to see if there is a simple way to
combine the input you are providing to huge-count.pl into fewer files. If
you have a lot of files that start with the letter 'a', for example, you
could concatenate them all together via a (Linux) command like

cat a* > myafiles.txt

and then use myafiles.txt as an input to huge-count.pl.

This is just one idea, but it's a start perhaps. If this isn't helpful
please let us know and we can try again!

On Sun, Apr 15, 2018 at 1:19 PM, catherine.dejage...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> I am trying to get the bigram counts aggregated across a lot of files.
> However, when I ran huge-count.pl using the list of files as an input, I
> got the error "Argument list too long". What would you recommend for
> combining many files, when there are too many files to just run
> huge-count.pl as is?
>
>
> Thank you,
>
> Catherine
>
>
> 
>


[ngram] Re: PMI Query

2017-05-14 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Julio,

Thanks for your question. In NSP we are always counting ngrams, so the
order of the words making up the ngram is considered. When we are counting
bigrams (the default case for NSP)  word1 is always the first word in a
bigram, and word2 is always the second word. I think in other presentations
of PMI word1 and word2 are simply co-occurrences, so the order does not
matter. However, for NSP order does matter and so n1p is the number of
times word1 occurs as the first word in a ngram.

Here's a very simple example where cat occurs as the first word in a bigram
3 times and as the second word in a bigram 1 time. Note that I've used the
--newline option so that ngrams do not extend across lines.

ukko(14): cat test
cat mouse
cat mouse
cat mouse
house cat
ukko(15): count.pl --newline test.cnt test
ukko(16): cat test.cnt
4
cat<>mouse<>3 3 3
house<>cat<>1 1 1
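
If you want the PMI score itself, the measure modules can also be called
directly from your own Perl code. Here is a small sketch using the
cat<>mouse counts above (the calculateStatistic interface shown in the
Text::NSP documentation is the same across the 2D measures):

use Text::NSP::Measures::2D::MI::pmi;

# from the output above: n11 = 3 (cat mouse together), n1p = 3 (cat as
# first word), np1 = 3 (mouse as second word), npp = 4 (total bigrams)
my $pmi = calculateStatistic( n11 => 3, n1p => 3, np1 => 3, npp => 4 );
if ( my $errorCode = getErrorCode() ) {
    print STDERR $errorCode, " - ", getErrorMessage(), "\n";
} else {
    printf "%s for cat<>mouse = %.4f\n", getStatisticName(), $pmi;
}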

This is described in more detail in the NSP paper (see below), which would
be a reasonable reference, I think. I hope this helps, and please let us
know if other questions arise.

Cordially,
Ted

The Design, Implementation, and Use of the Ngram Statistics Package
(Banerjee and
Pedersen) - Appears in the Proceedings of the Fourth International
Conference on Intelligent Text Processing and Computational Linguistics,
pp. 370-381, February 17-21, 2003, Mexico City.



On Sun, May 14, 2017 at 12:10 AM, Julio Santisteban wrote:

> Hi Ted & Satanjeev ,
>
> I am Julio from Peru and I have a small query. In your Perl implementation
> of PMI you mention, about the contingency table, that "n1p is the number of
> times in total that word1 occurs as the first word in a bigram". But this
> is not the usual case; PMI is usually worked out with n1p as the marginal
> (total frequency of word1) from the contingency table.
>
> I am sure you are correct, I just want to ask you some reference about it.
>
> http://search.cpan.org/~tpederse/Text-NSP-1.31/lib/Text/NSP/Measures/2D/MI/pmi.pm
>
>              word2    ~word2
>   word1        n11      n12   | n1p
>   ~word1       n21      n22   | n2p
>              ----------------------
>                np1      np2     npp
>
>
> Regards,
> Julio Santisteban
>


Re: [ngram] Upload files

2017-04-01 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
I think this mail was somehow delayed, but I hope this response is still
useful.

NSP has a command line interface. In general you specify the output file
first, and the input file second. So if you want to write the output of
count.pl to a file called myoutput.txt, and if your input text is
myinput.txt, you could submit the following command.

count.pl myoutput.txt myinput.txt

Here's an example:

ted@ted-HP-Z210-CMT-Workstation ~ $ cat myinput.txt
hi this is ted speaking how are you today!
I am well.
Today is April 1.

ted@ted-HP-Z210-CMT-Workstation ~ $ count.pl myoutput.txt myinput.txt

ted@ted-HP-Z210-CMT-Workstation ~ $ cat myoutput.txt
18
you<>today<>1 1 1
well<>.<>1 1 2
how<>are<>1 1 1
today<>!<>1 1 1
am<>well<>1 1 1
is<>ted<>1 2 1
I<>am<>1 1 1
is<>April<>1 2 1
ted<>speaking<>1 1 1
.<>Today<>1 1 1
speaking<>how<>1 1 1
hi<>this<>1 1 1
this<>is<>1 1 2
April<>1<>1 1 1
1<>.<>1 1 2
Today<>is<>1 1 2
are<>you<>1 1 1
!<>I<>1 1 1

I hope this helps!
Ted

On Tue, Jan 31, 2017 at 9:54 AM, rocioc...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> Hello Ted,
>
> Thank you very much for your message, but I still don't know how I can
> take a file as input :( this is being a huge challenge for me, I hope you
> can still give some help with that.
>
> Thanks again, and sorry to disturb you.
> Rocío
> 
>


Re: [ngram] Upload files

2017-01-31 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Text::NSP has a command line interface that allows you to provide a file or
a folder/directory for input. There are some simple examples shown below
that take a single file as input. That might be a good place to start, just
to make sure everything is working as expected.

http://search.cpan.org/dist/Text-NSP/doc/USAGE.pod

Please let us know as questions arise!

Good luck,
Ted

On Tue, Jan 31, 2017 at 7:52 AM, rocioc...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> Dear colleages,
>
> I am new in this. I've just installed Perl and Text-NSP, but I have no
> idea how I can supply files to work with. Should I put them in a specific
> folder?
>
> Thanks in advance,
> Rocío
>
> 
>


Re: [ngram] Ignoring regex with no delimiters

2016-05-12 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
The regex in token should look like this :

/\S+/

I think not having the / / is causing the delimiter errors...
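
For example (reusing the same command as below), you could create the
token file and re-run the count like this:

echo '/\S+/' > token
count.pl --ngram=1 --token=token ocount.txt Documents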

On Thu, May 12, 2016 at 2:11 AM, amir.jad...@yahoo.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> I'm running count.pl on a set of unicode documents. I created a new
> file ('token') which contains '\S+' in order to match any characters
> except space.
>
> Here is the output:
>
>
> ⇒  count.pl --ngram=1 --token=token ocount.txt Documents
>
> Ignoring regex with no delimiters: \S+
>
> No token definitions to work with.
>
> Type count.pl --help for help.
>
>
> What's the problem?!
>
> 
>


Re: [ngram] count.pl for unicode documents

2016-05-10 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Tokenization and the --token option are described here :

http://search.cpan.org/~tpederse/Text-NSP/doc/README.pod#2._Tokens

On Tue, May 10, 2016 at 8:14 AM, amir.jad...@yahoo.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> I'm trying to run count.pl for a directory of unicode documents (a sample
> document has been attached) using Perl 5 (v5.18.2). The output is a list
> of digits and punctuation marks without any unicode words:
>
> 2732
>
> .<>1589
>
> :<>626
>
> 2<>19
>
> !<>17
>
> 10<>16
>
> 4<>14
>
> 13<>13
>
> 12<>13
>
> 20<>12
>
> 9<>11
>
> 15<>11
>
> 3<>10
>
> 5<>10
>
> Is it possible to ask count.pl to tokenize the input file just by space?
>
> There is a --token option which may be useful, but I don't know how to use it.
>
> 
>


Re: [ngram] How to recognize informative n-grams in a corpus?

2016-05-10 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
The Ngram Statistics Package is mostly intended to help you find the most
frequent ngrams in a corpus, or the most strongly associated ngrams in a
corpus. It doesn't necessarily directly give you informativeness, although
you can certainly come up with ways to use frequency and measures of
association to find that. It sounds like you should look at our paper on
NSP to get some ideas about how to use it, and what it offers.

http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf

Also, the code itself has some documentation that should be helpful...

http://search.cpan.org/~tpederse/Text-NSP/doc/README.pod

http://search.cpan.org/~tpederse/Text-NSP/doc/USAGE.pod

I hope this helps!
Ted

On Tue, May 10, 2016 at 5:22 AM, 'Amir H. Jadidinejad' amir.jad...@yahoo.com
[ngram]  wrote:

>
>
> Hi,
>
> I have a corpus of 3K short text documents. I'm going to recognize the
> most informative n-grams in the corpus.
> Unfortunately, I can't find a straightforward way to do this. Would you
> please help me?
>
> Kind regards,
> Amir H. Jadidinejad
>
> 
>


Re: [ngram] simple test using chi-squared

2015-11-23 Thread Ted Pedersen duluth...@gmail.com [ngram]
CHI is a parent class, and not intended to be used as a measure. Rather,
the measures x2, phi, and tscore are the end user measures which you can
run (and they all access that CHI class). So, if your goal is to run the
chi squared test, you can do that with the x2 measure, as in:

statistic.pl x2 outputfromstatistic.txt inputfromcount.txt

You can see all the measures you could use here :

https://metacpan.org/release/TPEDERSE/Text-NSP-1.31
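
If you would rather call a measure from your own Perl program instead of
via statistic.pl, a minimal sketch (with made-up counts, following the
usual Text::NSP synopsis) looks like this:

use Text::NSP::Measures::2D::CHI::x2;

my $x2 = calculateStatistic( n11 => 10, n1p => 20, np1 => 20, npp => 60 );
if ( my $errorCode = getErrorCode() ) {
    print STDERR $errorCode, " - ", getErrorMessage(), "\n";
} else {
    printf "%s = %.4f\n", getStatisticName(), $x2;
}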

I hope this helps, and please let us know if additional questions arise!

Good luck,
Ted

On Mon, Nov 23, 2015 at 7:09 PM, Patrice Seyed apse...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> after executing:
>
> /opt/local/libexec/perl5.16/sitebin/count.pl test-corpus-count.txt
> count.txt test-corpus.txt
>
> the output looks fine, as shown/described in 4.1 of
> http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf.
>
> and then:
>
> ~ /opt/local/libexec/perl5.16/sitebin/statistic.pl CHI output.txt
> test-corpus-count.txt
>
> "Error from statistic library!
>   Error code: 101
>   Error message: Error calculateStatistic() - Mandatory function
> calculateStatistic() not defined.
> Your implementation should override this method. Aborting"
>
> So I'm unsure: in order to use chisq, do I need to implement it?
>
> Or the full path? :
> /opt/local/libexec/perl5.16/sitebin/statistic.pl
> /opt/local/lib/perl5/site_perl/5.16.1/Text/NSP/Measures/2D/CHI/phi.pm
>  out.txt count.txt
>
> Can't locate Text/NSP/Measures/2D// 
>
>
> Thanks in advance.
>
> Best,
> Patrice
>
>
> 
>


[ngram] Ngram Statistics Package version 1.29 released (minor bug fix release)

2015-10-17 Thread Ted Pedersen duluth...@gmail.com [ngram]
We are pleased to announce a new release of Text::NSP, the Ngram
Statistics Package. This is a very minor bug fix release, but might be
something you want to adopt since it will eliminate some annoying
warning messages that appear as of Perl v 5.15. In that version the
use of defined (@array) has been deprecated, and so we have a few
spots where we were using this and where it now causes warnings. That
has been resolved, so as of version 1.29 of NSP you should not see
these warnings. More about defined (@array) can be found here, if you
are interested :


http://www.perlmonks.org/index.pl?node_id=1077762
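
In code terms the change is tiny; here is a representative sketch of the
kind of spot that was fixed (the array itself is made up):

use strict; use warnings;
my @array = (1, 2, 3);

# old style, deprecated as of Perl v5.15 (warns, and is an error in
# still newer Perls):
#   print "non-empty\n" if defined @array;
# replacement, which behaves the same for this test:
print "non-empty\n" if @array;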


You can download the most current version of NSP from CPAN or
Sourceforge by following the links here :


http://ngram.sourceforge.net


Please let us know if any questions arise.


Cordially,
Ted
-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] Re: Ngram Statistics Package version 1.29 released (minor bug fix release)

2015-10-04 Thread Ted Pedersen duluth...@gmail.com [ngram]
After releasing 1.29 we noticed some testing problems with rank.pl - these
errors were limited to how we were testing and have been corrected. So, the
most current version of NSP is now 1.31. Please consider upgrading if you
are on a lower version!

You can find 1.31 on CPAN or at sourceforge via links on this page.

http://ngram.sourceforge.net

Enjoy,
Ted

On Sat, Oct 3, 2015 at 5:30 PM, Ted Pedersen <duluth...@gmail.com> wrote:

> We are pleased to announce a new release of Text::NSP, the Ngram
> Statistics Package. This is a very minor bug fix release, but might be
> something you want to adopt since it will eliminate some annoying
> warning messages that appear as of Perl v 5.15. In that version the
> use of defined (@array) has been deprecated, and so we have a few
> spots where we were using this and where it now causes warnings. That
> has been resolved, so as of version 1.29 of NSP you should not see
> these warnings. More about defined (@array) can be found here, if you
> are interested :
>
> http://www.perlmonks.org/index.pl?node_id=1077762
>
> You can download the most current version of NSP from CPAN or
> Sourceforge by following the links here :
>
> http://ngram.sourceforge.net
>
> Please let us know if any questions arise.
>
> Cordially,
> Ted
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>


Re: [ngram] accented character

2015-01-19 Thread Ted Pedersen duluth...@gmail.com [ngram]
Hi Arnaud,

There is nothing new for more recent versions - the same solutions proposed
for earlier versions are still relevant (and still the best available
options). You can find some discussion of those here (via the NSP mailing
list):

https://groups.yahoo.com/neo/groups/ngram/conversations/messages/206

And of course we are open to accepting and distributing changes in how NSP
handles encoding - it's just not something we've been able to do here. So,
if anyone is interested in pursuing that please do let me know. It's a
common question and it would be nice to handle things more smoothly.

Cordially,
Ted

On Sun, Jan 18, 2015 at 9:09 AM, Jean-Claude Van Donghen
jcvandong...@yahoo.fr [ngram] ngram@yahoogroups.com wrote:



 Hi everybody,

 Any suggestion to make Text-NSP accept foreign languages (most noticeably
 those with accented characters such as French) ?
 There are answers to this question for previous releases but not for 1.27.
 Thanks in advance.
 Arnaud

  



[ngram] Fwd: Ngrams and Text Similarity deployed as SOAP web services

2014-09-22 Thread Ted Pedersen duluth...@gmail.com [ngram]
Very nice news for users of NSP and Text::Similarity! Please support these
resources by giving them a try and letting others know about them too.

Cordially,
Ted

-- Forwarded message --
From: Marta Villegas marta.ville...@upf.edu
Date: Mon, Sep 22, 2014 at 4:13 AM
Subject: Ngrams and Text Similarity deployed as SOAP web services
To: tpede...@umn.edu


Dear Ted,


Because of our participation in CLARIN  http://clarin.eu/and PANACEA
http://www.panacea-lr.eu/EU projects, in the last few years we deployed
some NLP tools as web services. Among these you will find yours Ngrams and
Text Similarity services. They are deployed as SOAP web services and they
are open and accessible .


You can find a description in our LOD-browser catalogue
http://lod.iula.upf.edu/index-en.html (please let us know if you want us
to change something)


1) TedPedersen's Ngrams Counter Web Service
http://lod.iula.upf.edu/resources/184

2) TedPedersen's Ngram Statistics Package
http://lod.iula.upf.edu/resources/108

3) TedPedersen's Text Similarity Web Service
http://lod.iula.upf.edu/resources/429

​The corresponding demo invocations are also available here:

http://ws04.iula.upf.edu/soaplab2-axis/#statistics_analysis.countngrams_row

http://ws04.iula.upf.edu/soaplab2-axis/#statistics_analysis.ngrams_row

http://ws04.iula.upf.edu/soaplab2-axis/#statistics_analysis.text_similarity_row


Best regards



-- 
Marta Villegas
marta.ville...@gmail.com


[ngram] the (apparent) demise of search.cpan.org

2014-07-18 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
For many years now, http://search.cpan.org has been my go-to link for
finding CPAN distributions, and has been the URL we've listed on our web
sites directing users to Perl software downloads.

Sadly the site has become very unreliable in the last few months, and there
does not appear to be a solution in the works. So, I've decided to
gradually migrate to using https://metacpan.org as our default web site
for finding and pointing at CPAN distributions.

This will involve making changes on web pages and in documentation, and it
will take a while to do  But, it seems important since the impression can
be created by the search site that CPAN is down. It's not. CPAN is alive
and well, it's just that one particular navigator is not working too well.

I hope to make these changes on the main package pages fairly soon, but in
the event you run into a 503 or 504 error when accessing the search site,
please realize there are other ways, and that CPAN is just fine.

Here's some additional commentary and info about this issue

https://github.com/perlorg/perlweb/issues/115
http://perlhacks.com/2013/01/give-me-metacpan/
http://www.perlmonks.org/index.pl?node_id=1093542
http://grokbase.com/t/perl/beginners/145nsxqz2w/cpan-unavailable

When we started using the search site in about 2002 it was pretty great.
The good news is that https://metacpan.org is even better, so this is a
positive change.

Thanks,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] Re: Fwd: ll4 giving me trouble with 4-grams

2013-03-27 Thread Ted Pedersen
I think there is a slight typo in your command :

statistic.pl --ngram 4 ll4.pm output.txt input.txt

(the module name should be ll4.pm, not ll.3pm)

I hope this helps! Let me know if you continue to have any trouble...

Good luck,
Ted

On Wed, Mar 27, 2013 at 9:06 AM, mercevg merc...@yahoo.es wrote:
 Ted,

 I've received your answer without problem. I'll try to follow up with another 
 email address.

 A sample of my 4-grams file:
 procesamiento<>del<>lenguaje<>natural<>9 19 55 22 19 10 9 9 20 16 18 9 9 9 16
 recuperación<>de<>información<>textual<>4 15 287 30 5 15 14 4 25 4 5 14 4 4 4
 estadístico<>del<>lenguaje<>natural<>3 5 55 22 19 3 3 3 20 16 18 3 3 3 16
 aparición<>en<>el<>documento<>2 4 93 95 22 3 2 3 18 6 4 2 3 2 3

 Command line:
 statistic.pl --ngram 4 ll.3pm 4-grams-ll.txt 4-grams.txt

 Program answer:
 Measure not defined for 4-grams

 I've got Text-NSP v.1.25.

 Thank you.
 Mercè

 --- In ngram@yahoogroups.com, Ted Pedersen tpederse@... wrote:

 Merce, I got an email error when responding directly to your yahoo.es
 account. Could you follow up with another email address or use the
 group...?

 Thanks,
 Ted


 -- Forwarded message --
 From: Ted Pedersen tpederse@...
 Date: Wed, Mar 27, 2013 at 8:29 AM
 Subject: Re: ll4 giving me trouble with 4-grams
 To: mercevg mercevg@...


 Hi Merce,

 Could you send me whatever error output you are getting, plus a small
 sample of your ngram file?

 Thanks!
 Ted

 On Wed, Mar 27, 2013 at 8:12 AM, mercevg mercevg@... wrote:
  Hi,
 
  I would like to know how to calculate 4-grams with statistic.pl using
  log-likelihood ratio.
 
  To calculate 3-grams I've run the program as follows:
  statistic.pl --ngram 3 tmi3.pm three.ngram.tmi3 three.ngram
 
  But using log-likelihood ratio it doesn't work.
 
  Thanks
 
  Mercè
 
 





[ngram] bug in rank.pl v (0.03) in Text::NSP 1.25

2013-02-14 Thread Ted Pedersen
A user reports a bug in rank.pl. This seems to occur when dealing with
smaller files, for example...

marimba(49): more x
first<>bigram<>1 4.000 1 1
second<>bigram<>2 3.000 2 2
extra<>bigram1<>3 2.000 3 3
third<>bigram<>4 1.000 4 4

marimba(50): more y
second<>bigram<>1 4.000 2 2
extra<>bigram2<>2 3.000 4 4
first<>bigram<>3 2.000 1 1
third<>bigram<>4 1.000 3 3


New version (0.03)
marimba(51): rank.pl x y
Illegal division by zero at /usr/local/bin/rank.pl line 397.

Old version (0.01)
marimba(52): perl ./rank.pl x y
Rank correlation coefficient = 0.5000

There are also cases where rank.pl will report, falsely, that there
are no ngrams in common between the input files. Again, this seems to
occur with smaller files.
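
For context, the rank correlation coefficient that rank.pl reports is
Spearman's; in its usual form

rho = 1 - ( 6 * sum( d_i^2 ) ) / ( n * (n^2 - 1) )

where d_i is the difference between the two ranks of the i-th shared
ngram and n is the number of shared ngrams. The denominator vanishes when
n <= 1, which is one plausible way for a division by zero to arise once
small or dissimilar files have had most of their ngrams eliminated.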

We are checking into this, and if you've observed anything similar
please do let us know!

Cordially,
Ted


Re: [ngram] Fwd: -1.1000(sic!) as result from rank.pl

2013-02-06 Thread Ted Pedersen
Hi Karin,

This is very interesting, and I will certainly look into this further and
report back! Thank you for the additional information on this, it does seem
like an interesting case.

More soon!
Ted


On Wed, Feb 6, 2013 at 2:34 AM, Karin Cavallin karin.caval...@ling.gu.se wrote:

  Hi Ted

 since I compare 5 different corpora (size-wise and occurrence-wise),
 basically all the sets have a different number of pairs. I have run
 rank.pl on more than 100,000 lexical sets; most of them get no ranking
 coefficient since there are no co-occurrences between the sets, and some of
 them do get a coefficient ranging from -1.000 to 1.000, as expected. One
 lexical set gets this -1.1000, the one I sent you.

  So, I don't think it is due to the sets being too different, but
 something that is beyond me. That's why I thought it was important to report
 it to you.

  /karin

 Karin Cavallin
 PhD Student in Computational Linguistics
 University of Gothenburg, Sweden

   --
 From: duluth...@gmail.com [duluth...@gmail.com] on behalf of Ted Pedersen
 [tpede...@d.umn.edu]
 Sent: 6 February 2013 03:35
 To: ngram@yahoogroups.com
 Cc: Karin Cavallin
 Subject: Re: [ngram] Fwd: -1.1000(sic!) as result from rank.pl

   Hi Karin,

  I think the problem you are having is due to the fact that you have
 different number of word pairs in each list, and the fact that most of the
 word pairs are unique to each list. In general rank.pl expects that the
 two input files be made up of the same pairs of words (just ranked
 differently by a different measure of association, for example). When that
 isn't the case, the program will eliminate any word pairs that aren't in
 both files and then run. So, I think this combination of issues is causing
 rank.pl to return this very unexpected value.

  My guess is that it's the fact that the number of input pairs is
 different in each file, but I will do a little more checking in the next
 day or two to really see for sure. Here's a link to the rank.pl
 documentation that describes how this particular case is intended to be
 handled:


 http://search.cpan.org/dist/Text-NSP/bin/utils/rank.pl#1.4._Dealing_with_Dissimilar_Lists_of_N-grams

  More soon,
 Ted


 On Tue, Feb 5, 2013 at 10:06 AM, Ted Pedersen tpede...@d.umn.edu wrote:


 -- Forwarded message --
 From: Karin Cavallin karin.caval...@ling.gu.se
 Date: Tue, Feb 5, 2013 at 8:53 AM
 Subject: -1.1000(sic!) as result from rank.pl
 To: tpede...@umn.edu tpede...@umn.edu

 Dear professor Ted

 I didn't know whom to report this error to, so I hope you can forward
 this to the appropriate receiver.

 I have been using the NSP for a while, especially the bigram packages.
 I'm working with lexical sets of verbal predicates and nominal objects,
 and do collocational analysis on them.
 I wanted to compare the ranking between sets coming from different
 corpora. (I know it is quite uninteresting to do ranking on such
 different data, but I am trying different things for my thesis.)

 Today I noticed one lexical set to be -1.1000, which should not be
 possible! (I have only noticed this one time)

 karin$ rank.pl 65_anstr.txt 95_anstr.txt
 Rank correlation coefficient = -1.1000

 I attached the files which I get this weird outcome from.

 Best regards
 /karin

 Karin Cavallin
 PhD Student in Computational Linguistics
 University of Gothenburg, Sweden

 sky<>ansträngning<>505 25.1952 2 5 15
 fördubbla<>ansträngning<>1582 10.8890 1 5 15
 koncentrera<>ansträngning<>1912 9.1951 1 11 15
 krävas<>ansträngning<>2172 8.2948 1 17 15
 underlätta<>ansträngning<>2172 8.2948 1 17 15
 märka<>ansträngning<>2471 7.4301 1 26 15
 göra<>ansträngning<>2915 6.3704 3 1323 15
 fortsätta<>ansträngning<>3097 6.0043 1 53 15
 kosta<>ansträngning<>3723 4.8170 1 97 15
 och<>ansträngning<>4162 4.0424 1 145 15
 sätta<>ansträngning<>4482 3.4540 1 198 15
 lägga<>ansträngning<>4745 3.0005 1 253 15

 intensifiera<>ansträngning<>3951 40.5247 3 22 33
 göra<>ansträngning<>4665 35.6553 12 20089 33
 fortsätta<>ansträngning<>8254 21.8238 3 468 33
 kräva<>ansträngning<>10206 17.4829 3 973 33
 trotsa<>ansträngning<>17176 9.9897 1 39 33
 välkomna<>ansträngning<>18254 9.3712 1 53 33
 underlätta<>ansträngning<>20704 8.1388 1 98 33
 döma<>ansträngning<>22762 7.1873 1 158 33
 skada<>ansträngning<>23084 7.0537 1 169 33
 rikta<>ansträngning<>23084 7.0537 1 169 33
 ha<>ansträngning<>23176 7.0134 1 89009 33
 stödja<>ansträngning<>25349 6.1642 1 265 33
 krävas<>ansträngning<>25718 6.0348 1 283 33
 vara<>ansträngning<>29926 4.5609 1 603 33
 leda<>ansträngning<>30789 4.2612 1 705 33
 öka<>ansträngning<>33145 3.4625 1 1076 33
  





Re: [ngram] formatting + punctuation removal

2011-08-17 Thread Ted Pedersen
Hi Patrick,

One additional idea might be to use the --token option, and say that
you only want to consider alphanumerics as your tokens (which is what
you will count).

For example...

marengo(129): cat test
my friends, i have news
i like  ngrams

Now without any token list, stop list, etc...

marengo(130): count.pl outa test

marengo(131): cat outa
24
!<>!<>10 11 12
.<>.<>3 4 4
news<>!<>1 1 12
have<>news<>1 1 1
.<>ngrams<>1 4 1
!<>i<>1 11 2
,<>i<>1 1 2
i<>have<>1 2 1
ngrams<>!<>1 1 12
like<>.<>1 1 4
friends<>,<>1 1 1
i<>like<>1 2 1
my<>friends<>1 1 1

Now I define a token file...

marengo(132): cat token.txt
/\w+/

marengo(133): count.pl out test --token token.txt

marengo(134): cat out
7
i<>have<>1 2 1
news<>i<>1 1 2
have<>news<>1 1 1
like<>ngrams<>1 1 1
i<>like<>1 2 1
friends<>i<>1 1 2
my<>friends<>1 1 1

Note that we only have alphanumerics...that might be the simplest
thing to try first...

Hope this helps...
Ted

On Wed, Aug 17, 2011 at 4:05 PM, Ying Liu liux0...@umn.edu wrote:
 Hi Patrick,

 You need to pre-process the text (data cleaning) to remove
 punctuation before running count.pl. In the same way, you
 need to post-process to get the format you want for the bigrams
 or trigrams.

 Thanks,
 Ying

 semiotica24 wrote:

 Sorry for the basic questions:
 1. I need 2 versions of output for each list of bigrams and trigrams
 that I create using the various measures in count.pl and statistic.pl:
 one with the default statistics and one without. How do I format to
 exclude the statistics?
 e.g.:
 mobile<>phones<>100 280 384
 cellular<>phones<>96 214 384

 mobile phones
 cellular phones

 2. I need to remove the punctuation . and , (period and comma). I've
 tried within my stopword list, but I don't have the tags quite right. How
 should I enter them into my stop file?

 Thanks!

 Patrick





 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] Re: formatting + punctuation removal

2011-08-17 Thread Ted Pedersen
Hi Patrick,

NSP makes no real distinction between punctuation and words, so if you do
not do anything with tokenization via --token or --nontoken or
preprocessing, the punctuation marks will be treated just like words and
will affect your results. --token and --nontoken essentially remove them
from the data, so the bigrams you find are affected as is the total sample
size.

Hope this helps!
Ted

On Wed, Aug 17, 2011 at 4:30 PM, semiotica24 semiotic...@yahoo.com wrote:

 **


 So in other words punctuation such as . and , is not used at all by the
 algorithms/measures, and I should get the same results if I remove it
 before I run count.pl and statistic.pl, correct?


 --- In ngram@yahoogroups.com, Ying Liu liux0395@... wrote:
 
  Hi Patrick,
 
  You need to pre-process the text (data cleaning) to remove
  punctuations before run by count.pl. The same idea, you
  need to post-process to get the format you want of the bigrams
  or trigrams.
 
  Thanks,
  Ying
 
  semiotica24 wrote:
  
   Sorry for the basic questions:
   1. I need 2 versions of output for each list of bigrams and trigrams
   that I create using the various measures in count.pl and statistic.pl:

   one with the default statistics and one without. How do I format to
   exclude the statistics?
   e.g.:
   mobile<>phones<>100 280 384
   cellular<>phones<>96 214 384
  
   mobile phones
   cellular phones
  
   2. I need to remove the punctuation . and , (period and comma). I've
   tried within my stopword list, but I don't have the tags quite right. How
   should I enter them into my stop file?
  
   Thanks!
  
   Patrick
  
  
 

  



[ngram] NSP home page in Romanian!

2011-07-26 Thread Ted Pedersen
Greetings all,

I just wanted to let you know that the home page for the Ngram
Statistics Package has been translated into Romanian, thanks to
Alexandra Seremina!

Here's a link to the page, and I will be updating the NSP home page to
include a link to this as well.

http://www.azoft.com/people/seremina/edu/nsp-rom.html

Enjoy!
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] demo at acl mwe workshop, talk today at disco

2011-06-24 Thread Ted Pedersen
Our demo of the Ngram Statistics Package at the ACL MWE workshop seemed to
go pretty well.

http://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_20_MWE_2011___lb__ACL__rb__

The best moment I thought was when Saiyam demoed NSP for Ken Church. :)

Today is the last day at ACL - will attend the DisCo workshop and give a
talk showing how I used NSP to participate in the shared task on identifying
semantic compositionality.

http://disco2011.fzi.de/

Cordially,
Ted

---
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] MI for a 4-gram

2011-06-07 Thread Ted Pedersen
Hi Cyrus,

There's nothing wrong with your formulation, although I would refer to
what you describe as Pointwise Mutual Information (PMI), since it
seems like it would only compute the probability of observing A B C D
all together and separately, and not include probabilities of A (not
B) C D, and so forth. If you were doing that, then you'd be more in
the realm of Mutual Information (or tmi as we call it).

Note that NSP does include a 3d version of PMI that essentially
follows your definition.

http://search.cpan.org/dist/Text-NSP/lib/Text/NSP/Measures/3D/MI/pmi.pm

Extending to 4-d would not be difficult.
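
As a rough illustration, here is a minimal sketch that computes 4-gram
PMI from raw counts, directly following the formula in your message (all
of the counts are made up, and log base 2 is assumed):

use strict; use warnings;
my $npp   = 1_000_000;   # total 4-grams in the corpus (hypothetical)
my $n1111 = 50;          # joint count of "A B C D" (hypothetical)
my %count = ( A => 2000, B => 1500, C => 800, D => 3000 );  # single word counts
my $p_joint = $n1111 / $npp;
my $p_indep = 1;
$p_indep *= $_ / $npp for values %count;    # P(A) * P(B) * P(C) * P(D)
my $pmi = log( $p_joint / $p_indep ) / log(2);
printf "PMI(A B C D) = %.4f\n", $pmi;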

If on the other hand you would like to do Mutual Information, remember
that only differs from the Log Likelihood Ratio by a constant term, so
you could use our 4-d ll measure for that...

http://search.cpan.org/dist/Text-NSP/lib/Text/NSP/Measures/4D/MI/ll.pm

Also, some of the background for these trigram and 4gram measures is
described in Bridget McInnes' MS thesis...

Extending the Log-Likelihood Ratio to Improve Collocation
Identification (McInnes) - Master of Science Thesis, Department of
Computer Science, University of Minnesota, Duluth, December, 2004.
http://www.d.umn.edu/~tpederse/Pubs/bridget-thesis.pdf

There are some additional subtleties when you move beyond bigrams, and
that's because rather than simply comparing the occurrence of an ngram
to the model of independence (i.e., P(A,B)/P(A)P(B)) you have the option
of comparing to other models (i.e., P(A,B,C)/P(A,B)P(C)). This becomes
its own big complicated issue which I won't go into much here, but it
does open up a lot of interesting possibilities for longer ngrams that
you don't have with bigrams. Some of this is discussed in more detail
in Bridget's thesis.

I hope this helps, and please do let us know if you have any
additional questions, observations or ideas.

Good luck,
Ted

On Tue, Jun 7, 2011 at 5:26 PM, Cyrus Shaoul cyrus.sha...@ualberta.ca wrote:



 Hi everyone,

 My apologies if this has been asked many times before, but

 would this be an appropriate way to calculate the Mutual Information for a
 4-gram made up of words A B C and D?

 MI(ABCD) = log(P(ABCD) / (P(A) x P(B) x P(C) x P(D)))

 If not, what is a better way? Why is this bad?

 Thanks for your help,

 Cyrus

 


--
Ted Pedersen
http://www.d.umn.edu/~tpederse







[ngram] Ngram Statistics Package at ACL 2011 in Portland

2011-05-31 Thread Ted Pedersen
Greetings all,

There will be two NSP related papers at ACL 2011 during the workshop program.

On June 23 NSP will be a part of the demo program for the MWE workshop:

http://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_20_MWE_2011___lb__ACL__rb__

The following is a very short overview paper written for the workshop:

The Ngram Statistics Package (Text::NSP) - A Flexible Tool for
Identifying Ngrams, Collocations, and Word Associations (Pedersen,
Banerjee, McInnes, Kohli, Joshi, and Liu) - To Appear in the
Proceedings of Multiword Expressions : from Parsing and generation to
the Real World (MWE 2011), an ACL HLT 2011 Workshop, June 23, 2011,
Portland, Oregon. (Demonstration System)
http://www.d.umn.edu/~tpederse/Pubs/pedersen-mwe-2011.pdf

Then on June 24 NSP will be featured in a shared talk task at the
DiSCo workshop.

http://disco2011.fzi.de/

NSP was used to measure semantic compositionality, and in the end did
reasonably well.

Identifying Collocations to Measure Compositionality : Shared Task
System Description (Pedersen) - To Appear in the Proceedings of
Distributional Semantics and Compositionality (DiSCo 2011), an ACL HLT
2011 Workshop, June 24, 2011, Portland, Oregon.
http://www.d.umn.edu/~tpederse/Pubs/pedersen-disco2011.pdf

So, if you are at ACL 2011 please consider attending these events, or
catch up with us some other time.

Hoping to see you in Portland,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] Re: ngrams with hyphen

2011-04-23 Thread Ted Pedersen
Hi Merce,

Ah, yes, I see what you mean. The problem with using \s in the stoplist is
that the tokenization prior to checking for stop words does not include a
trailing \s, and so /\s[Ii]n\s/ is never matched.

The trick here is to redefine the \b word boundary so that it no longer
treats - as a boundary. This involves a bit of regular expression tampering
which looks kind of awful but in fact works pretty nicely. What I have below
is a regex (in a stoplist) that redefines \b as including - and /.

@stop.mode=OR
/\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))/

So we have a word boundary \b
followed by In or in
followed by a word boundary that includes - or /

ted@linux-zxku:~ count.pl out test.txt --stop stop.txt --token token.txt

ted@linux-zxku:~ more out
4
late<>june<>1 1 1
in-line<>skating<>1 1 1
i<>like<>1 1 1
like<>in-line<>1 1 1

ted@linux-zxku:~ cat test.txt
i like in-line skating in late june.

ted@linux-zxku:~ cat stop.txt
@stop.mode=OR
/\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))/

It's important to say this regex came from Perl Monks,
http://www.perlmonks.org/?node_id=308744

I hope this makes some sense, at least in a general way. I wouldn't worry
too much about the regex itself, although if you need it modified in some
way do let me know and we can work that out.
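
If you want to convince yourself that the redefined boundary behaves as
intended, a quick test along these lines (just a sketch, not part of
NSP) should show "in" matching as a standalone word but not inside
in-line:

#!/usr/bin/perl
# Quick check of the custom word boundary that treats - and / as word chars.
my $re = qr/\b[iI]n(?:(?<![\w\/-])(?=[\w\/-])|(?<=[\w\/-])(?![\w\/-]))/;

for my $w ('in', 'in-line', 'inside') {
    printf "%-8s %s\n", $w, ($w =~ $re ? 'matches' : 'no match');
}
# Expected: 'in' matches; 'in-line' and 'inside' do not.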

Enjoy,
Ted

On Sat, Apr 23, 2011 at 4:51 PM, mercevg merc...@yahoo.es wrote:



 Hi Ted,

 I've modified the stopwords list using \s instead of \b, but the problem
 is not completely solved, because now in my bigram list I get interesting
 bigrams such as

 in-band<>signalling
 in-station<>modem

 But also new bigrams without interest, such as

 in Recommendation
 defined in
 shown in
 described in
 given in

 Is it possible to get just bigrams like

 in-band<>signalling
 in-station<>modem

 and not the other new bigrams without interest?

 Thanks for your help,


 Mercè

 --- In ngram@yahoogroups.com, Ted Pedersen tpederse@... wrote:
 
  Hi Merce,
 
  Yes, indeed, you can do as you describe. This gets into some important
  details about regular expressions that I'm happy to have a chance to
  mention. In the default stoplist the stop words are delimited by \b, as
 in
 
  /\bin\b/
 
   This means match "in" as a stop word when surrounded by a word boundary.
   A word boundary occurs at spaces as well as various punctuation
   characters, including the -.
 
  So, if you want to find bigrams like in-line but then exclude ones like
  in the, then you need to adjust the stoplist so that the stop words are
  perhaps just surrounded by spaces. I say perhaps since there are various
  ways to do this, but the simplest one is shown below...
 
  ted@linux-zxku:~ more stop.txt
  @stop.mode=OR
  /\b[iI]n\s/
 
  ted@linux-zxku:~ more token.txt
  /\w+-\w+/
  /\w+/
 
  ted@linux-zxku:~ more test.txt
  i like in-line skating in late june.
 
  ted@linux-zxku:~ count.pl output.txt test.txt --token token.txt --stop
  stop.txt
 
  ted@linux-zxku:~ more output.txt
  6
   in<>late<>1 1 1
   late<>june<>1 1 1
   skating<>in<>1 1 1
   in-line<>skating<>1 1 1
   i<>like<>1 1 1
   like<>in-line<>1 1 1
 
  I hope this helps.
 
  Enjoy,
  Ted
 
  On Fri, Apr 22, 2011 at 11:41 AM, mercevg mercevg@... wrote:
 
  
  
   Ted,
  
   Thanks, I've add this regular expression in my tokens file and it works
   well.
  
   One more comment about that:
  
   In my corpus I have some interesting bigrams as
   in-band signalling
   in-call rearrangement
   in-slot signalling
  
    If I filter "in" as a stopword, I can't get these kinds of bigrams from
    my corpus. On the contrary, if "in" is not on my stopwords list, I
    retrieve these bigrams but I also get more bigrams without interest,
    such as
  
   in Recommendation
   in Figure
   in order
  
    My question is: Can I filter and retrieve these two groups of bigrams
    at the same time?
  
   Thank you for your help,
  
   Mercè
  
  
   --- In ngram@yahoogroups.com, Ted Pedersen tpederse@ wrote:
   
Greetings Merce,
   
This is fairly easy to handle via the --token option. You simply specify
a regular expression that says a token is a string followed by a -
followed by a string. You can customize a --token file many ways, but
the following example will handle hyphenated words. Please do let us
know if additional questions arise!
   
linux@linux:~ count.pl test.out test.txt --token token.txt
   
linux@linux:~ more test.out
13
cell-phone<>It<>1 1 1
the<>village-shop<>1 1 1
s<>extra-nice<>1 1 1
village-shop<>today<>1 1 1
bought<>a<>1 1 1
went<>to<>1 1 1
a<>cell-phone<>1 1 1
i<>went<>1 1 1
today<>and<>1 1 1
It<>s<>1 1 1
and<>I<>1 1 1
I<>bought<>1 1 1
to<>the<>1 1 1
   
linux@linux:~ cat test.txt
i went to the village-shop today, and I bought a cell-phone. It's
extra-nice.
   
linux@linux:~ cat token.txt
/\w+\-\w+/
/\w+/
   
Enjoy,
Ted
   
On Wed, Apr 20, 2011 at 2:20 PM, mercevg mercevg@ wrote:
   


 Dear all,

 I would like to know if it's possible to get a list of ngrams with
 a
   hyphen
 inside, maybe

Re: [ngram] ngrams with hyphen

2011-04-20 Thread Ted Pedersen
Greetings Merce,

This is fairly easy to handle via the --token option. You simply specify a
regular expression that says a token is a string followed by a - followed by
a string. You can customize a --token file many ways, but the following
example will handle hyphenated words. Please do let us know if additional
questions arise!

linux@linux:~ count.pl test.out test.txt --token token.txt

linux@linux:~ more test.out
13
cell-phone<>It<>1 1 1
the<>village-shop<>1 1 1
s<>extra-nice<>1 1 1
village-shop<>today<>1 1 1
bought<>a<>1 1 1
went<>to<>1 1 1
a<>cell-phone<>1 1 1
i<>went<>1 1 1
today<>and<>1 1 1
It<>s<>1 1 1
and<>I<>1 1 1
I<>bought<>1 1 1
to<>the<>1 1 1

linux@linux:~ cat test.txt
i went to the village-shop today, and I bought a cell-phone. It's
extra-nice.

linux@linux:~ cat token.txt
/\w+\-\w+/
/\w+/

Enjoy,
Ted

On Wed, Apr 20, 2011 at 2:20 PM, mercevg merc...@yahoo.es wrote:



 Dear all,

 I would like to know if it's possible to get a list of ngrams with a hyphen
 inside, maybe during the tokenization process.

 For exemple, I want to get these bigrams:
 - call-connected signal
 - clear-back signal
 - clear-forward signal

 Instead of two bigrams for each one:
 - call<>connected<>179 2608 527
 connected<>signal<>189 320 9176

 - clear<>back<>283 1115 733
 back<>signal<>157 380 9176

 - clear<>forward<>632 1115 877
 forward<>signal<>493 1547 9176

 Thanks a lot,

 Mercè

  




-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] Re: Extending huge-count to 3 grams.

2011-01-13 Thread Ted Pedersen
Hi Hien,

I'm happy to report these were included in Text::NSP in the following
directory:

http://cpansearch.perl.org/src/TPEDERSE/Text-NSP-1.21/bin/utils/contributed/

Please feel free to share any comments or observations you might have.

Enjoy!
Ted

On Wed, Jan 12, 2011 at 10:47 PM, phamhieniol phamhien...@gmail.com wrote:



 Hi Cyrus,

 Can you post the script? I want to give it a try.

 Best,

 Hien


 --- In ngram@yahoogroups.com, Cyrus Shaoul
 cyrus.shaoul@... wrote:
 
  Well, since nobody replied, I just made a trigram counter based on
  huge-count.pl.
  It is running now, and seems to work well.
 
  If anyone can help me out by testing it, that would be greatly
 appreciated.
  Just e-mail me and I will send you the files.
 
  Once it is fully tested, I will contribute it to the NSP package.
 
  Thanks,
 
  Cyrus
 
 
   CyrusShaoul wrote:
  
  
   Has anybody tried extending huge-count.pl to 3 grams? I may give it a

   shot, but I don't want to re-invent the wheel.
  
   Thanks,
  
   Cyrus
  
  
 
  --
  =[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
  Cyrus Shaoul
  http://www.psych.ualberta.ca/~westburylab/
  University of Alberta
  780-492-5843
  =[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
 

 --
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] possible trouble with hugecount.pl in Text-NSP-1.17?

2010-05-02 Thread Ted Pedersen
 xie199909.txt
 8854   1733381  10273283 xie199910.txt
 8679   1658967   9789022 xie199911.txt
 8788   1716177  10116139 xie199912.txt
 8516   1606427   9434389 xie21.txt
 8051   1571315   9239155 xie22.txt
 9717   1895946  11166496 xie23.txt
 9196   1830029  10819900 xie24.txt
 9392   1805885  10689714 xie25.txt
 9434   1826577  10834233 xie26.txt
 9100   1790377  10553950 xie27.txt
 9267   1818165  10695151 xie28.txt
 9571   1779519  10427577 xie29.txt
 8864   1796484  10646671 xie200010.txt
 8841   1731864  10225318 xie200011.txt
 8007   1623146   9549503 xie200012.txt
 7880   1480773   8681644 xie200101.txt
 8235   1581014   9272288 xie200102.txt
 9643   1937289  11393670 xie200103.txt
 8859   1748990  10261162 xie200104.txt
 8924   1758391  10328510 xie200105.txt
 8620   1716646  10131853 xie200106.txt
 8581   1709051  10080829 xie200107.txt
 9882   1983160  11583688 xie200108.txt
 8867   1728337  10179257 xie200109.txt
 9437   1793786  10614699 xie200110.txt
 1740   339343   1995533 xie200111.txt
   679007 132786005 783905863 total

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] Fwd: nsp/trigram signficance

2009-12-05 Thread Ted Pedersen
-- Forwarded message --
From: r...@imsc.res.in r...@imsc.res.in
Date: Dec 5, 2009 11:59 AM
Subject: Re: nsp/trigram signficance
To: Ted Pedersen duluth...@gmail.com


May I request you to forward this to the list? I don't use Yahoo and it
seems posting to the mailing list is disallowed without a Yahoo
account. Please could you cc me the reply? Thanks.

--

To outline what I did.

The following steps *work* :

(1) Corpus of 417 distinct tokens, corpus size is about 10,000 tokens,
in corpus.txt

(2) count.pl --ngram 3 --set_freq_combo combofile.txt corpus-3.cnt corpus.txt

(3) statistic.pl --ngram 3 --set_freq_combo combofile.txt ll.pm
corpus-3sig.txt corpus-3.cnt

where combofile.txt is of form

0 1 2
0
1
2
0 1
0 2
1 2

The following *does not work* :

Steps 1 - 3, but with combofile.txt as

0 1 2
0
1
0 1

or any variant of the above. I want to test the hypothesis P(w1w2w3) =
P(w1)*P(w2w3) and it appears from the doc that playing around with the
frequency combinations is the way to go about doing this. I couldn't
get it to work, however.




Quoting Ted Pedersen duluth...@gmail.com:

 Thanks for your query. Could you send me the exact command you are
 running? That would be helpful in understanding what is happening.
 Also, if you could send this to the ngram mailing list rather than me
 directly, that would be helpful as I'm sure other users would be
 interested.

 http://tech.groups.yahoo.com/group/ngram/

 Thanks!
 Ted

 On Sat, Dec 5, 2009 at 5:13 AM, Ronojoy Adhikari r...@imsc.res.in wrote:

 
  Dear Prof. Pedersen,
 
  I am a user of your NSP software and I am taking the liberty of bothering
  you with a query.
 
  I have been trying to test for alternative hypotheses for independence of
  trigrams using the --set_freq_combo flag in count.pl and statistic.pl. While
  this works fine for count.pl, statistic.pl invariably leads to an error
  message of the form :
 
  Frequency combination x missing!
 
  where x could be any of 0 1 2, 0, 1, 2, or 0 1 and permutations.
  I have tried every possible combination of these and the only combination
  which works is when the full set
 
  0 1 2
  0
  1
  2
  0 1
  0 2
  1 2
 
  is specified in the --set_freq_combo file. Am I doing something wrong or does
  NSP only do the default trigram hypothesis test of
  P(w1w2w3)=P(w1)*P(w2)*P(w3) ? Your help would be much appreciated.
 
  Thanks in advance,
 
  Ronojoy Adhikari.
 
  
  Dr. Ronojoy Adhikari
  The Institute of Mathematical Sciences  Tel: +91(44)2254 3253
  Chennai 600113 IndiaFax: +91(44)2254 1586
  email:r...@imsc.res.in  URL: http://www.imsc.res.in/~rjoy
  
 
 
 



 --
 Ted Pedersen
 http://www.d.umn.edu/~tpederse






This message was sent using IMP, the Internet Messaging Program.



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] NSP stress testing, windowing memory usage

2009-10-22 Thread Ted Pedersen
Greetings all,

I've been doing some stress testing of NSP lately, focusing in
particular on figuring out the cost of windowing, which allows us to
count bigrams that allow some number of intervening words to appear
between them (rather than just requiring them to be adjacent). This is
provided by the --window option in count.pl

Before I get into the results, here is some info about the hardware
(so you can compare to your own setup):

Dell Precision 670 with 12 GB of RAM and 2 dual processor Xeons, each
with a 3.00 GhZ clock rate. Note that NSP only uses one core.
uname -a output is : Linux marimba 2.6.24-23-server #1 SMP Wed Apr 1
22:14:30 UTC 2009 x86_64 GNU/Linux

I used part of one directory of the English Gigaword data and created
a single file of input where all the markup was removed and the text
was converted to upper case. So, this is newspaper text essentially.

I ran a series of commands of this form...

perl -d:DProf count.pl --ngram 2 --window X windowX.out nsp-stress-test.txt

...meaning that we count all bigrams using a window size of X. Recall
that the window size indicates how many words can occur between two
words that are forming a bigram. By default the window size is 2,
meaning that the bigrams must be adjacent. If the window size is 3,
then there may be up to 1 intervening word between two words in a
bigram, and so forth.

Below is the script that I used to run these - note that the main
variable is  the window size - the point of this exercise really was
to get a sense of how much windowing costs.

===

#!/bin/bash -l

for value in 2 3 5 10 25 50

do
echo "window $value remove $value..."
perl -d:DProf /usr/local/bin/count.pl --window $value --ngram 2 \
    --remove $value window$value.out nsp-stress-test.txt
dprofpp > window$value.prof.out
done

==

In this case, nsp-stress-test.txt (11 MB) consists of the following
(btw, this is a small file but I wanted to get started somewhere...):

1,894,410 words of text (tokens)
44,919 unique words (types)

Here's what we see in terms of time and space as window size increases...



window size = 2
Total Elapsed Time = 72.63075 Seconds
Memory Utilization = 240 MB
Total Bigrams = 1,894,409
Unique Bigrams = 523,053

window size = 3
Total Elapsed Time = 121.7461 Seconds
Memory Utilization = 490 MB
Total Bigrams = 3,788,817
Unique Bigrams = 1,129,153

window size = 5
Total Elapsed Time = 217.5907 Seconds
Memory Utilization = 960 MB
Total Bigrams = 7,577,630
Unique Bigrams = 2,264,646

window size = 10
Total Elapsed Time = 450.1853 Seconds
Memory Utilization = 1900 MB
Total Bigrams = 17,049,645
Unique Bigrams = 4,582,673

window size = 25
Total Elapsed Time = 1092.680 Seconds
Memory Utilization = 3800 MB
Total Bigrams = 45,465,540
Unique Bigrams = 9,649,220

window size = 50
Total Elapsed Time = 2067.010 Seconds
Memory Utilization = 6000 MB
Total Bigrams = 92,824,865
Unique Bigrams = 15,673,225

A few summarizing stats...

window size 50 takes 28.7 times as long to run as window size 2
window size 50 takes 25 times as much memory as window size 2.

window size 50 has 30 times as many unique bigrams as window size 2
window size 50 has 48.9 times as many total bigrams as window size 2
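
As a side note, the total bigram counts above follow a simple closed
form: with n tokens and window size w, each token pairs with up to w-1
following tokens, minus the pairs lost at the end of the file, giving

total bigrams = (w-1)*n - w*(w-1)/2

For n = 1,894,410 and w = 50 that works out to 92,824,865, which matches
the total reported above exactly. A quick one-liner to check (this is
plain arithmetic, not an NSP feature):

perl -le '$n=1894410; $w=50; $t=($w-1)*$n - $w*($w-1)/2; print $t'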

The good news is that the increase in time and memory use is linear with
the window size. The bad news is that the memory footprint starts off
fairly large so even a linear increase can lead to fairly high
utilization.

A rough rule of thumb then is that as window size increases, time and
space requirements will expand by the amount of the window size
increase. So going from window size of 2 to 25 will result in a 25x
increase in time and memory.

Now, this is all based on just a single file and a relatively small set of
results, so my next step is to do this with a larger data file and see
if the above rules of thumb continue to hold...

More soon,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] Re: significant collocations

2009-07-15 Thread Ted Pedersen
--- In ngram@yahoogroups.com, Amada Eliseo amadaeli...@... wrote:

 Hello to all,
 
 I appreciate your work.
 
 Please. Can someone help me to identify significant collocations in some 
 text. I would like to use text-nsp, but I don't know how to do it. 
 
 Thank you.


If you are just starting out, you have some reading to do. :)

This will give you simple examples of how to use NSP:

http://search.cpan.org/dist/Text-NSP/doc/USAGE.pod

These will give you much more detail: 

http://search.cpan.org/dist/Text-NSP/doc/README.pod

The Design, Implementation, and Use of the Ngram Statistics Package (Banerjee 
and Pedersen) - Appears in the Proceedings of the Fourth International 
Conference on Intelligent Text Processing and Computational Linguistics, pp. 
370-381, February 17-21, 2003, Mexico City.
http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf

These are all written with the new user in mind, and I'm sure you'll find them 
very helpful. Once you've had a chance to read these and try out NSP a little, 
please don't hesitate to follow up with more questions!
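
In the meantime, to give you a concrete flavor, a minimal first run looks 
like this (assuming your text is in a file called corpus.txt):

count.pl mycounts.txt corpus.txt
statistic.pl ll myscores.txt mycounts.txt

count.pl tallies the bigrams in corpus.txt, and statistic.pl then ranks 
them by the log-likelihood ratio; the top ranked bigrams are your 
candidate collocations.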

Cordially,
Ted






Re: [ngram] Ngrams without line break

2009-07-01 Thread Ted Pedersen
Greetings Merce,

To make sure I understand correctly, it sounds like you *only* want to
see those ngrams that contain a line break. For example, if you run
count.pl as follows on your test file

first line of text
second line
And a third line of text

count.pl test.out test

talisker(8): more test.out
11
line<>of<>2 3 2
of<>text<>2 2 2
line<>And<>1 3 1
And<>a<>1 1 1
a<>third<>1 1 1
second<>line<>1 1 3
third<>line<>1 1 3
first<>line<>1 1 3
text<>second<>1 1 1

You will get the bigrams that cross over the end of line (text second,
line And), but you also get all the other ngrams too...and so
it sounds to me like you only want the ones that cross over the new
line markers, and nothing else. Is that accurate?

By default count.pl simply ignores end of line markers (the behavior
you see above). So, it's not so much that the ngram includes the new
line, it simply ignores it. So with a file like

the cat is
my friend the
cat is my friend

the 2 occurrences of the cat would be considered identical, even
though the second could be thought of as having a new line in the
middle of it (but we essentially ignore that).

So...at the moment at least I'm not sure how to limit the output to
only those ngrams that are made by crossing over a new line
marker... But, let me make sure I am understanding things correctly
(so do let me know if I'm wrong) and I'll give this a little more
thought too.

Cordially,
Ted


On Wed, Jul 1, 2009 at 12:15 PM, mercevg merc...@yahoo.es wrote:


 Dear all,

 I would like to know if it's possible to get ngrams that do not contain
 line breaks from the corpus. I'll try to explain more clearly: if the
 input text file is

 first line of text
 second line
 And a third line of text

 Then, with count.pl we'll get two bigrams containing line breaks:

 text second
 line And

 Or trigrams:
 of text second
 text second line
 second line And

 And so on.

 Taking into account these outputs, and after reading the help text, I
 don't know if I can change the default count.pl options to get all
 ngrams from the corpus except those containing a word at the end of one
 sentence and a word at the beginning of the next sentence. That is,
 ngrams without line breaks.

 Best wishes,
 Mercè

 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] Re: Ngrams without line break

2009-07-01 Thread Ted Pedersen
Hi Merce,

Ah, now I understand. Fortunately there is a simple answer, I think.

count.pl cattest.out cattest --newLine

will cause the end of line markers to be respected, so ngrams will NOT
cross over them.

talisker(56): more cattest.out
7
cat<>is<>2 2 2
my<>friend<>2 2 2
friend<>the<>1 1 1
is<>my<>1 1 1
the<>cat<>1 1 1

So, I believe the --newLine option will do exactly as you require!

Please let me know if there are any other questions or concerns.

Thanks!
Ted

On Wed, Jul 1, 2009 at 1:04 PM, mercevg merc...@yahoo.es wrote:


 Dear Ted,

 In my case, I would like to get all the ngrams except those that cross over
 the end of line. In your example:

 the cat is
 my friend the
 cat is my friend

 I don't want to get as ngrams "is my" and "the cat", those having a new
 line in the middle of them.

 As you said, by default count.pl simply ignores end of line markers. But
 is it possible to not ignore end-of-line markers?

 Thanks a lot!
 Mercè

 --- In ngram@yahoogroups.com, Ted Pedersen duluth...@... wrote:

 Greetings Merce,

 To make sure I understand correctly, it sounds like you *only* want to
 see those ngrams that contain a line break. For example, if you run
 count.pl as follows on your test file

 first line of text
 second line
 And a third line of text

 count.pl test.out test

 talisker(8): more test.out
 11
 line<>of<>2 3 2
 of<>text<>2 2 2
 line<>And<>1 3 1
 And<>a<>1 1 1
 a<>third<>1 1 1
 second<>line<>1 1 3
 third<>line<>1 1 3
 first<>line<>1 1 3
 text<>second<>1 1 1

 You will get the bigrams that cross over the end of line (text second,
 line And), but you also get all the other ngrams too...and so
 it sounds to me like you only want the ones that cross over the new
 line markers, and nothing else. Is that accurate?

 By default count.pl simply ignores end of line markers (the behavior
 you see above). So, it's not so much that the ngram includes the new
 line, it simply ignores it. So with a file like

 the cat is
 my friend the
 cat is my friend

 the 2 occurrences of the cat would be considered identical, even
 though the second could be thought of as having a new line in the
 middle of it (but we essentially ignore that).

 So...at the moment at least I'm not sure how to limit the output to
 only those ngrams that are made by crossing over a new line
 marker... But, let me make sure I am understanding things correctly
 (so do let me know if I'm wrong) and I'll give this a little more
 thought too.

 Cordially,
 Ted


 On Wed, Jul 1, 2009 at 12:15 PM, mercevg merc...@... wrote:
 
 
  Dear all,
 
  I would like to know if it's possible to get ngrams that do not
  contain line breaks from the corpus. I'll try to explain more clearly:
  if the input text file is
 
  first line of text
  second line
  And a third line of text
 
  Then, with count.pl we'll get two bigrams containing line breaks:
 
  text second
  line And
 
  Or trigrams:
  of text second
  text second line
  second line And
 
  And so on.
 
  Taking into account these outputs, and after reading the help text, I
  don't know if I can change the default count.pl options to get all
  ngrams from the corpus except those containing a word at the end of one
  sentence and a word at the beginning of the next sentence. That is,
  ngrams without line breaks.
 
  Best wishes,
  Mercè
 
 



 --
 Ted Pedersen
 http://www.d.umn.edu/~tpederse


 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] Re: the NSP trigram calculations don't match mine??

2009-06-14 Thread Ted Pedersen
Greetings all,

Thanks to Stefan for the very complete and lucid explanation on
computing PMI scores. We'll be updating the documentation to reflect
our actual calculation (which we are relieved to find seems to be
correct), and also fixing that issue with the long form of the
measure not working properly from the command line.

As to the issue of the log we are using, we are using the Perl log(x)
function, which returns the natural log (base e). So, values between
different systems may well differ depending on the log used, but the
relative ranking should be the same. This is a further concern when
thinking about the issue of cutoffs (what value of PMI indicates that
I've found a collocation, for example...) since someone might report
one value using log 2 while someone else reports a value using log e
or log 10. So, just something to be careful of perhaps...

Cordially,
Ted

On Sun, Jun 14, 2009 at 10:19 AM, Stefan Evert
ev...@ims.uni-stuttgart.de wrote:



 Hi everyone!

 The NSP values still do not match mine, and I see that it concerns ll,
 pmi, ps as well as tmi for trigrams. Evidently, there must be some error
 which probably lies in the observed or estimated frequencies (since all four
 measures produce different results than mine)

 I need to ask for two clarifications:
 (1) estimated frequency: The webpage/pmi file says:

 m111 = (n1pp * np1p * npp1) / nppp

 but the file 3D.pm says

 $m111 = $n1pp * $np1p * $npp1 / ($nppp**2);

 which I take to mean that we use, not nppp, but nppp squared:

 m111 = (n1pp * np1p * npp1) / (nppp * nppp)

 If so, which one should I really use?

 The correct expected co-occurrence frequency under an independence
 hypothesis is the second one, with the denominator squared. It's easy
 to make this clear to yourself if you keep its mathematical derivation
 in mind:

 - The occurrence probability of the first word is (n1pp/nppp); of the
 second word (np1p/nppp); etc.

 - The probability of all three words occurring next to each other by
 chance, i.e. the co-occurrence probability under an independence null
 hypothesis, is the product of the three probabilities:
 (n1pp/nppp)*(np1p/nppp)*(npp1/nppp) = n1pp * np1p * npp1 / (nppp**3)

 - Multiply this probability by sample size nppp to obtain the expected
 frequency under the independence null
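 
 To make the arithmetic concrete, here is a small Perl check using the
 marginal counts of the example trigram from this thread (the base-2
 log is our choice, following the standard definition of PMI):
 
 #!/usr/bin/perl
 # Verify the expected frequency with the squared denominator, as in 3D.pm.
 my ($n111, $n1pp, $np1p, $npp1, $nppp) =
     (262744, 7073841, 9391062, 5872364, 355663266);
 
 my $m111 = $n1pp * $np1p * $npp1 / ($nppp ** 2);   # roughly 3084
 my $pmi2 = log($n111 / $m111) / log(2);            # base-2 logarithm
 my $pmie = log($n111 / $m111);                     # natural logarithm
 
 printf "m111 = %.2f  pmi(base 2) = %.4f  pmi(base e) = %.4f\n",
     $m111, $pmi2, $pmie;
 # The base-2 value comes out near 6.41, matching the 6.4127 NSP reports.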

 (2) Furthermore, let us return to the example trigram. When I
 compute the example trigram's pmi in the way I understand the code, I
 get the value -15.24452, instead of the NSP package's 6.4127.

 Not surprising: your expected frequency is way too high (by a factor
 of nppp), so you have a lower co-occurrence frequency than expected
 and hence a negative association.

 The standard definition of PMI uses base-2 logarithms (because of its
 roots in information theory), so the resulting value can be
 interpreted as bits of mutual information. Other implementations
 diverge from this; e.g., for my own code in the UCS toolkit I made the
 regrettable decision to use base-10 logarithms. Note that all
 versions should still give the same ranking of candidates, so that's a
 robust test case.

 Cheers,
 Stefan (Evert)
 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] Re: the NSP trigram calculations don't match mine??

2009-06-11 Thread Ted Pedersen
 1
i<>will<>test<>18 0.1960 1 3 1 5 1 1 1

On Thu, Jun 11, 2009 at 6:27 AM, gunnlyse gunnl...@yahoo.no wrote:


 Hello,
 thank you for this clarification!

 The NSP values still do not match mine, and I see that it concerns ll, pmi,
 ps as well as tmi for trigrams. Evidently, there must be some error which
 probably lies in the observed or estimated frequencies (since all four
 measures produce different results than mine)

 I need to ask for two clarifications:

 (1) estimated frequency: The webpage/pmi file says:

 m111 = (n1pp * np1p * npp1) / nppp

 but the file 3D.pm says

 $m111 = $n1pp * $np1p * $npp1 / ($nppp**2);

 which I take to mean that we use, not nppp, but nppp squared:

 m111 = (n1pp * np1p * npp1) / (nppp * nppp)

 If so, which one should I really use?

 (2) Furthermore, let us return to the example trigram. When I compute the
 example trigram's pmi in the way I understand the code, I get the value
 -15.24452, instead of the NSP package's 6.4127.
 All the observed frequencies needed for pmi are directly available in the
 example trigram line, so the only thing that can explain diverging results
 is HOW we compute the value.
 May I therefore ask if you agree with the way I understand the code?

 For the trigram
 355663266
 atdeter262744 7073841 9391062 5872364 1234064 647295 1064083

 I compute m111 as:
  m111 = (7073841 * 9391062 * 5872364) / 355663266

       = 1.0968417e+12

  and PMI = log(262744 / 1.0968417e+12) = -15.24452

  NSP's pmi (using the command line:
  statistic.pl --ngram 3 pmi outputfile inputfile)
  produces the following line:
 atdeter1 6.4127 262744 7073841 9391062 5872364 1234064 647295 1064083

 Best,
 Gunn

 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] Re: the NSP trigram calculations don't match mine??

2009-06-10 Thread Ted Pedersen
Hi Gunn,

You might be hitting a peculiar bug we noticed late last year (which still 
hasn't been fixed).

http://tech.groups.yahoo.com/group/ngram/message/240

If you run using just pmi in the command line, do your results agree with your 
Lisp code?

If there is still disagreement, let's run some tests on some common input and 
see if we can isolate why those differences exist...

To be honest I haven't looked at the PMI code in a while so I don't recall all 
the details, but I'll do that and respond in more detail. Just wanted to see if 
the above resolves anything for you.

Cordially,
Ted

--- In ngram@yahoogroups.com, gunnlyse gunnl...@... wrote:

 Hi,
 
 Due to a memory problem in using the NSP package on my trigrams, I decided 
 that I would rather program the calculations myself. Not being a perl 
 programmer, I can only assume that I have understood the code correctly.
 I started from a .cnt file, example lines (the first line of total n-gram 
 count, followed by 1 example line):
 
 355663266
 atdeter262744 7073841 9391062 5872364 1234064 647295 1064083
 
  I based myself on what I found in the file 3D.pm (concerning estimated 
 frequencies and observed frequencies), and translated this into Lisp code. 
 Then I used the specific code for each association measure. Having done the 
 programming, I tested my code for computing scores on a small data sample of 
 20 lines, and ran the NSP package on the same sample (NSP does not crash on 
 this small sample). 
 It turns out that my values are far from similar to the ones produced by the 
 NSP package, and I see no reason why. Could anyone have a look at this?
 
 Specifically, say that I wish to compute the pmi for the example trigram 
 above. According to the file pmi.pm:
 
 The expected values for the internal cells are calculated by taking the 
 product of their associated marginals and dividing by the sample size, for 
 example:
 
 m111 = (n1pp * np1p * npp1) / nppp
 
 Pointwise Mutual Information (pmi) is defined as the log of the deviation
 between the observed frequency of a trigram (n111) and the probability of
 that trigram if it were independent (m111).
 
  PMI =   log (n111/m111)
 
 For the trigram above, this should give:
 m111 = (7073841 * 9391062 * 5872364) / 355663266

      = 1.0968417e+12

 and PMI = log(262744 / 1.0968417e+12) = -15.24452
 
 whereas NSP's pmi 
 (using the command line:
 statistic.pl --ngram 3 Text::NSP::Measures::3D::MI::pmi outputfile inputfile)
 produces the following line for the trigram above:
 
 atdeter18 -11.5906 262744 7073841 9391062 5872364 1234064 647295 1064083
 
 Not only do the figures differ, but the rankings of the trigrams also
 diverge. Am I doing something wrong?!? I program in Lisp, where the default log base is e (if 
 it matters).
 
 I am puzzled, among other things, by the fact that the pmi file states that 
 m111 is computed the way I rendered it above. But in the 3D.pm file it says 
 that
 
 sub computeExpectedValues
 {
   my ($values) = @_;
 
   $m111 = $n1pp * $np1p * $npp1 / ($nppp**2); 
 
 Does this mean that really we compute, not with nppp, but with
 nppp*nppp?
 (Since I do not really know perl, maybe I misunderstand)?
 
 
 
 Thank you in advance!
 Gunn





Re: [ngram] output files

2009-04-01 Thread Ted Pedersen
On Fri, Mar 27, 2009 at 10:11 AM, walaa khaled walaa_...@yahoo.com wrote:
 Hi,

 I would like to ask if it is possible to use --recurse option and write the
 ngram of each file separately.

 I mean if I've 3 files as an input (input1, input2, input3) , can I write
 the ngrams in 3 ouptutfiles (out1, out2, out3) instead of only one output
 file??

 thanks in advance,
 Walaa

Hi Walaa,

The short answer is no, you can't do this. Each run of count.pl treats
whatever it's input is as a single source of data, and produces an
overall count. If you want separate counts for particular files, you
should process those with separate runs of count.pl.
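
If it helps, a small shell loop will produce per-file counts (a sketch;
substitute your own file names):

for f in input1 input2 input3
do
    count.pl out.$f $f
done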

I hope this helps!
Ted


-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] search in file generated by statistic.pl

2009-03-25 Thread Ted Pedersen
On Wed, Mar 25, 2009 at 6:50 AM, arezki20002002 arezki20002...@yahoo.fr wrote:
 Hello,
 once the file generated by statistic.pl
 how can I know a bigram appears in this file?
 thank you
 Arezki

Hi Arezki,

I tend to use the grep command to search through my statistics.pl
output...(when I'm looking for a specific ngram).
For example, I processed the biography The Fabulous Life of Diego
Rivera as follows

count.pl fab.out fabulous-life-of-diego-rivera.txt

statistic.pl ll.pm fab-ll.out fab.out

Then I decided I wanted to find out if Tina Modotti occurred in that book...

marimba(22): grep 'Tina<>Modotti' fab-ll.out
Tina<>Modotti<>146 231.4471 13 26 14

This tells me that she did (13 times) and that this was the 146th
ranked bigram (according to log-likelihood). Tina occurred 26 times
(as the first word of a bigram) and Modotti occurs 14 times (as the
second word of a bigram).

I also just searched for Modotti

marimba(23): grep Modotti fab-ll.out
Tina<>Modotti<>146 231.4471 13 26 14
Modotti<>.<>1624 39.2108 9 14 7804
Modotti<>rejected<>6575 11.4513 1 14 17
Modotti<>served<>6641 11.3337 1 14 18
than<>Modotti<>11839 5.9592 1 262 14
Modotti<>was<>16621 2.5137 1 14 1611
Modotti<>and<>19152 0.8857 1 14 4451
Modotti<>,<>21349 0.0072 1 14 14352

Among other things, here I can see that Modotti is the second word of
two different bigrams (Tina Modotti, 13 times as we saw above, and
then as than Modotti 1 time, allowing us to confirm the total of 14
bigrams where Modotti is the second word...).

Fishing around like this can be quite fun. You could also use egrep to
specify regular expression patterns to search for (rather than just
strings), but I find grep to be a nice starting point.
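
For example, to pull out only the bigrams with Modotti as their first
word, something like this does the trick (note the quotes, since < and >
are special to the shell):

egrep '^Modotti<>' fab-ll.out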

I hope this is helpful!

Cordially,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] providing commandline options

2009-02-06 Thread Ted Pedersen
If you run

count.pl --help

you will see a brief listing of all the command line arguments. More
detailed explanations are available here:

http://search.cpan.org/dist/Text-NSP/

You can also find a paper that discusses many of the features of NSP here...

 @inproceedings{BanerjeeP03,
author = {Banerjee, S. and Pedersen, T.},
title = {The Design, Implementation, and Use of the {N}gram
{S}tatistic {P}ackage},
booktitle = {Proceedings of the Fourth International
Conference on Intelligent Text Processing and Computational
Linguistics},
pages = {370-381},
year = {2003},
month ={February},
address = {Mexico City}}
http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf

And yes, they are also defined in the source code via perldoc, so you can run

perldoc count.pl

to find a very detailed explanation of the options...

Good luck!
Ted

On Fri, Feb 6, 2009 at 1:31 AM, reshmijose reshmij...@yahoo.com wrote:
 can you tell me how command line arguments like 'count.pl output.txt
 input.txt' are defined? Are they defined within the source code?

 What should i do if i want to run the program count.pl separately?

 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] No ngram over sentence

2009-02-06 Thread Ted Pedersen
Hi Jayaram,

Yes, in order to restrict ngrams to individual sentences you would
need to use the -newLine option, and make sure that you had one
sentence per line, one line per sentence. Identifying sentences
boundaries is a non-trivial problem, but we have some simple code
available as a part of our WordNet::SenseRelate::AllWords package that
could be a useful starting point for a sentence boundary detector.

http://cpansearch.perl.org/src/TPEDERSE/WordNet-SenseRelate-AllWords-0.13/utils/sentence_split.pl

This is not intended to solve the problem, but it will do a
reasonable approximation of sentence boundary detection.
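
A plausible end-to-end recipe, assuming sentence_split.pl reads a text
file and writes one sentence per line to standard output (do check its
documentation for the exact interface), would be:

sentence_split.pl running-text.txt > one-per-line.txt
count.pl --newLine mycounts.txt one-per-line.txt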

I hope this helps!

Cordially,
Ted

On Fri, Feb 6, 2009 at 1:01 AM, jayaram raji jayaram_raji2...@yahoo.com wrote:
 Dear Ted,

 In order to achieve what Christos has asked, Is it necessary to arrange the
 data in such a way that there is only one sentence per line?  If it is a
 running text, how does it identify the end of the sentence?

 Thanks
 Jayaram

 --- On Thu, 2/5/09, Ted Pedersen duluth...@gmail.com wrote:

 From: Ted Pedersen duluth...@gmail.com
 Subject: Re: [ngram] No ngram over sentence
 To: ngram@yahoogroups.com
 Date: Thursday, February 5, 2009, 9:41 PM

 Hi Christos,

 In order to count as you describe, you just need to use the --newLine
 option.

 If you run

 count.pl --help

 you can see all the command line options. Among them is ...

 --newLine Prevents n-grams from spanning across the
 new-line character.

 which should do exactly as you wish!

 Happy Counting, :)
 Ted

 On Thu, Feb 5, 2009 at 8:29 AM, christos.braeunle
christos.braeunle@yahoo.com wrote:
 Hello

 I started using the NSP package and i am realy impressed by its power.
 First of all thanks for that great tool!

 Now i run into a problem when building ngrams. I want to tell count.pl
 not to create ngrams over the end of a sentence.

 For example: i have two sentences.

 Vincent loves Honey Bunny
 A women snorts

 Now when building bigrams i would like to get:

Vincent<>loves
loves<>Honey
Honey<>Bunny
A<>women
women<>snorts

so I want the bigram Bunny<>A not to be created (and not counted)

 Is there a way to achieve this?

I hope my question is understandable and has not been asked before.

If I missed some relevant documentation, I would be glad to be pointed
to it.

 Thanks a lot

 Christos Bräunle



 --
 Ted Pedersen
http://www.d.umn.edu/~tpederse

 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] Re: plans for version 1.05

2008-02-15 Thread Ted Pedersen
Thanks for all this very interesting discussion. My inclination is to
try the encoding route suggested by Richard, although this will take
a bit longer since we'll need to study a little and experiment a bit
more. In addition to being a potentially more robust fix, I think
users whose problems would be solved by "use locale" can fairly easily
add that to their own programs. I can see where locale might cause
some unexpected issues to arise, and might end up further complicating
matters for some users. Most importantly, I don't feel like I fully
understand the issues here, so I need to try and educate myself a
little bit more before jumping into anything. These discussions have
been very helpful, and any other points of view are most welcome.

So, I think this means that 1.05 will be delayed just a bit while we
mull this over a bit more. Please do let us know if there are other
issues that we might want to address (relating to encodings or
anything else).

Thanks again,
Ted

On Feb 15, 2008 1:18 PM, Richard Jelinek [EMAIL PROTECTED] wrote:






 On Fri, Feb 15, 2008 at 04:24:17PM +0100, Björn Wilmsmann wrote:
   Thanks for elaborating on locale / Encode and the code example. I see
 your
   point.
  
   As for NSP this basically would mean that one would have to replace all
   open() and print calls with calls to custom methods that do the encoding
   magic before actually reading from or writing to an IO stream.

  decoding for open IN streams and encoding for open OUT streams.

  Yes. This is probably the most straightforward solution. Fortunately
  perl could help here. One can use open as pragma too:

  http://perldoc.perl.org/open.html

  Setting up default IN and OUT layers this way could save a lot of
  typing/transforming in the migration process.
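
  For example, the open pragma can set default encoding layers once,
  rather than touching every individual open() and print call (a sketch,
  assuming UTF-8 data):

  use open IN  => ':encoding(utf8)',
           OUT => ':encoding(utf8)';

  open my $fh, '<', 'corpus.txt' or die $!;  # input now decoded as UTF-8
  while (my $line = <$fh>) {
      # $line contains Perl characters rather than raw bytes
  }
  close $fh;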


  --
  Kind regards,

  Dipl.-Inf. Richard Jelinek

  - The PetaMem Group - Prague/Nuremberg - www.petamem.com -
  -= 2007-09-25: 49235653 Mind Units =-
  



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] Problem with CPANPLUS 0.076 misidentifying versions after installing Text::NSP 1.03 (fwd)

2006-12-23 Thread ted pedersen

-- 
--
Ted Pedersen
http://www.d.umn.edu/~tpederse

-- Forwarded message --
Date: Sat, 23 Dec 2006 10:34:03 -0800
From: Jonathan Leffler [EMAIL PROTECTED]
To: [EMAIL PROTECTED], Bugs in CPANPLUS via RT [EMAIL PROTECTED],
 [EMAIL PROTECTED]
Subject: Problem with CPANPLUS 0.076 misidentifying versions after
installing Text::NSP 1.03

Dear Ted and CPANPLUS Maintenance team,

I installed Text::NSP v1.03 without any problems, but after that, 
CPANPLUS complains that Text::NSP::Measures is out of date because the 
installed version is 0.01 and not the 0.97 that is recorded on CPAN.

Looking at the Measures.pm file, it appears that the trouble is because 
the POD precedes the code (specifically, it precedes the code that 
defines the version of Measures.pm), and CPANPLUS is not noticing that 
the VERSION code it is looking at is in POD and not in code.

I'm reporting that as a bug to the CPANPLUS team, Ted, but there's also 
an easy fix for Measures.pm, namely to move the code that defines the 
version of Measures.pm up to near the front of the module.  (I added an 
'=cut' line before '=head1 DESCRIPTION' and moved 6 paragraphs of code 
defining the package and its version up the file - the problem no longer 
occurs (CPANPLUS does not complain that T::NSP::M is at version 0.01).
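
For anyone curious what that looks like, the fix amounts to a layout
roughly like the following at the top of the module (a sketch of the
idea, not the actual Measures.pm source):

package Text::NSP::Measures;

use strict;
our $VERSION = '0.97';   # define the version before any POD begins

=head1 NAME

Text::NSP::Measures - (NAME and DESCRIPTION sections follow the code)

=cut

# the rest of the module's code and documentation continue below
1;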

For the CPANPLUS team - I'm using:

CPANPLUS::Shell::Default -- CPAN e[...] (v0.076)
*** Using CPANPLUS::Backend v0.076.  ReadLine support enabled.

This is with Perl 5.8.8 on MacOS X 10.4.8.  I'll supply 'perl -V' output 
if you need it.

-- 
Jonathan Leffler   #include disclaimer.h
Email: [EMAIL PROTECTED], [EMAIL PROTECTED]
Guardian of DBD::Informix v2005.02 -- http://dbi.perl.org/



Re: [ngram] Pb with tokenisation in nsp

2006-11-24 Thread ted pedersen

On Fri, 24 Nov 2006, b siham wrote:

 Hi,
 
 I use the NSP package but I have a problem with tokenization: I want to use 
 my own definition of tokenization, not the one defined in the NSP package. 
 Where can I change the token option?
 
 Thanks for your help
 
 Siham
 

Hi Siham,

Check out sections 2 and 3 in the README. They describe how to set your 
own tokenization scheme.

http://search.cpan.org/src/TPEDERSE/Text-NSP-1.03/README
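
In short, you put one Perl regular expression per line in a file and
hand that file to count.pl with the --token option; tokens are matched
in the order the regexes appear. For example (the regexes here are just
an illustration):

cat token.txt
/\w+-\w+/
/\w+/

count.pl output.txt input.txt --token token.txt

The first regex keeps hyphenated words together as single tokens, and
the second matches ordinary words.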

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse



[ngram] use of window size in count.pl

2006-09-22 Thread ted pedersen

Greetings all,

I was corresponding with someone about the --window option in count.pl,
and realized that this might be of general interest to NSP users, so
I have modified that note slightly and sent it here. 

When you are counting up the bigrams in a corpus, you can specify a  
--window size that will allow there to be some number of intervening words
between the two words that make up the bigram. For example...

count.pl --window 5 output input.txt

...will allow up to 3 intervening words between the words in the bigram.  
The window size is 5, and the two words in the bigram occupy the first  
and fifth position respectively, so you have up to three spaces for  
words left over. 

Now, if we use the --window 5 option, all we do is simply count all the  
possible bigrams that include 0, 1, 2, and 3 intervening words, and then  
figure out the sample size based on this count, and then the calculations  
of the measures proceed exactly as if you were doing it without any  
intervening words. This avoids, I think, any trickery or hacking of the 
measures to support a more flexible notion of what a bigram can be. 

For example, suppose this is your input:

my name is jim her name is sally

if you run count.pl without any window size, it defaults to allowing no 
intervening words (window size of 2). So, you could run...

count.pl output input.txt

...and you would get output like this:

7
name<>is<>2 2 2
is<>jim<>1 2 1
jim<>her<>1 1 1
is<>sally<>1 2 1
my<>name<>1 1 2
her<>name<>1 1 2

This tells us that there are 7 bigrams in the sample, and, for example,
the bigram "name is" occurs 2 times, where "name" occurs as the first 
word in any bigram 2 times and "is" occurs as the second word in any 
bigram 2 times...from that you can construct the 2x2 table thusly:

2  0 |  2
0  5 |  5
---------
2  5    7

...meaning that we have 7 bigrams in the sample, where 2 of them are "name
is", and the other 5 do not include "name" or "is".
 
Now, if you wanted to allow up to 3 intervening words in the bigrams, you  
could run count.pl like this

count.pl --window 5 output input.txt

and the output would be like this:

22
name<>is<>2 6 6
name<>name<>1 6 5
my<>is<>1 4 6
is<>name<>1 5 5
jim<>sally<>1 4 4
jim<>her<>1 4 4
name<>jim<>1 6 3
name<>sally<>1 6 4
is<>jim<>1 5 3
name<>her<>1 6 4
is<>her<>1 5 4
is<>sally<>1 5 4
her<>sally<>1 3 4
my<>name<>1 4 5
jim<>name<>1 4 5
her<>is<>1 3 6
my<>her<>1 4 4
jim<>is<>1 4 6
her<>name<>1 3 5
my<>jim<>1 4 3
is<>is<>1 5 6

Notice that our sample size is different, and we have a lot more bigrams. 
But, we can do log-likelihood exactly as we should (in my view) without
any tampering or manipulation of the basic formula. 

Note that the table for name is does change here...

2  4 |  6
4 12 | 16
---------
6 16   22

So, what this reflects is the fact that allowing intervening words has 
added bigrams to the sample. We still have only 2 occurrences of "name 
is", but we have 4 other bigrams where "name" is the first word, and 4 
other bigrams where "is" is the second word. That's because of the reach 
of the window size pulling in more bigrams. 

You could get log-likelihood values (for example) for either of the above 
outputs from count.pl via:  

statistic.pl ll output.ll output

I hope this helps clarify what happens when you use the --window option.
It's quite powerful I think, but hopefully fairly easy to understand. Do 
let us know if you have any questions about this (or anything else!)

Just a reminder, the most current version of NSP is now 1.03, and this is
available from links at :

http://www.d.umn.edu/~tpederse/nsp.html

Cordially,
Ted




 
 




[ngram] more details on performance issues that led to NSP 0.97 release

2006-06-21 Thread ted pedersen

=
Performance of ll on NSP version 0.97
=

Exporter::export has -4 unstacked calls in outer
Exporter::Heavy::heavy_export has 4 unstacked calls in outer
Total Elapsed Time = 4.759184 Seconds
  User+System Time = 2.769184 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 25.2   0.699  1.915  45672   0.0000 0.0000  Text::NSP::Measures::2D::MI::ll::calculateStatistic
 14.3   0.396  0.850  45672   0.0000 0.0000  Text::NSP::Measures::2D::MI::getValues
 13.2   0.366  0.366 182688   0.0000 0.0000  Text::NSP::Measures::2D::MI::computePMI
 9.43   0.261  0.261  45672   0.0000 0.0000  Text::NSP::Measures::2D::computeMarginalTotals
 8.31   0.230  0.230      1   0.2300 0.2300  main::unformattedPrinting
 5.45   0.151  0.151  45672   0.0000 0.0000  Text::NSP::Measures::2D::computeObservedValues
 2.20   0.061  0.061  45672   0.0000 0.0000  Text::NSP::Measures::getErrorCode
 1.48   0.041  0.041  45672   0.0000 0.0000  Text::NSP::Measures::2D::computeExpectedValues
 0.36   0.010  0.010      4   0.0025 0.0025  Exporter::as_heavy
 0.36   0.010  0.010      4   0.0025 0.0025  Text::NSP::Measures::2D::MI::BEGIN
 0.36   0.010  0.030      4   0.0025 0.0074  main::BEGIN
 0.36   0.010  0.010     19   0.0005 0.0005  Getopt::Long::BEGIN
 0.00   0.000  0.000      4   0.0000 0.0000  Exporter::Heavy::heavy_export
 0.00       -      -      1       -       -  Getopt::Long::ConfigDefaults
 0.00       -      -      1       -       -  Getopt::Long::Configure

--
Ted Pedersen
http://www.d.umn.edu/~tpederse



 




Re: [ngram] Re: Another Can you do this with NSP/Ngram

2006-05-28 Thread ted pedersen




Hi Leonardo.

Apologies for this very late reply. I have been out of town for a
few weeks now, and everything has fallen behind!

I think the answer to your question is no. :) I hate to say that,
but I think the regular expressions that are used in NSP are strictly
for tokenization. They let you chop up a file into tokens that might
be 2 letters long, or 2 words long, or be made up only of capitals,
etc. But, the counting step does not really look at the regular 
expressions used to tokenize, it simply counts up the tokens that are
found, and reports the totals for bigrams, etc. (whatever we are
counting).

One idea to do what you want to do might be to count ngrams using
NSP as usual, and then do some sort of edit distance calculations on
the resulting ngrams, and possibly merge together those ngrams that
are within some number of edits of each other. Or you could use
some other similarity measure to do something like that...
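
As a rough illustration of the edit distance idea, something like this
could flag ngrams for merging (it uses the CPAN module
Text::Levenshtein, which is not part of NSP, and the ngrams are made-up
examples):

#!/usr/bin/perl
use strict;
use Text::Levenshtein qw(distance);

# hypothetical ngrams as they might come out of count.pl
my @ngrams = ('John<>Edwards', 'Jon<>Edwards', 'drag<>and<>drop', 'Main<>St');

# report pairs within 2 edits of each other as merge candidates
for my $i (0 .. $#ngrams - 1) {
    for my $j ($i + 1 .. $#ngrams) {
        my $d = distance($ngrams[$i], $ngrams[$j]);
        print "merge candidate: $ngrams[$i] / $ngrams[$j] (distance $d)\n"
            if $d <= 2;
    }
}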

Sorry, I hope I am not misunderstanding the question. Please do let
us know if my answer seems to miss your point, or is unclear in some
way!

Thanks,
Ted

On Tue, 2 May 2006, Leonardo Fontenelle wrote:

 Or: _yet_ another can I do...

 I'm using regular expressions to match 4+ letter words or
 all-uppercase words; I believe they do a nice job for bigrams (file
 manager, for example, is one of the first hits) but they might be too
 restrictive for 3+-grams. How could I look for n-grams in which some
 (i.e. 2) tokens must match some regular expressions, but the others
 are also allowed to match some others? I'm trying to get expressions
 like drag and drop or press-and-hold, or create a new \w{4,}

 Thanks, again!

 Leonardo F. Fontenelle



 Yahoo! Groups Links







--
Ted Pedersen
http://www.d.umn.edu/~tpederse





  
  



  











Re: [ngram] Can you do this with NSP/Ngram type question: Name Matching?

2005-12-18 Thread ted pedersen

I think the short answer is that I'm not sure. I'm not entirely clear
what a match probability ratio is, so let me provide a brief accounting
of what NSP provides, and you can see if that is sufficient to compute
what you are after

In the case of

John Edwards
Jon Edwards

NSP (count.pl) would give you counts of how often John occurred,
how often Jon occurred, how often Edwards occurred, how often
John Edwards occurred, and how often Jon Edwards occurred. Now,
from that statistic.pl can compute measures of association that will
tell you how strongly associated John and Edwards are, and Jon
and Edwards are. It won't really tell you anything about the association
between John Edwards and Jon Edwards directly, I don't think.

So, I'm not sure if this helps. Can you describe more exactly what
you wish to compute (especially if it involves any of the quantities
mentioned above).

Also, keep in mind there are some utility programs that come with
NSP that *might* be useful - these include rank.pl and kocos.pl. But,
more about those if they seem relevant.

Thanks,
Ted

On Fri, 16 Dec 2005, dave1234870 wrote:

 Greetings all, my first post to the group hoping that it's an
 appropriate forum for this question ... if not, my apologies to the
 group.

 I'm a moderately proficient, self-taught Perl hacker working in the
 fraud examination type industry.  I work with large amounts of data to
 identify scenarios wherein Names and/or Addresses serve as nexus
 points for discrete network analysis.  Of course, my problem is that
 names and addresses are quite often misspelled or not consistent.
 Examples,

 John Edwards
 Jon Edwards

 123 Main Street
 123 Main St

 PO Box 123
 Post Office Box 123
 etc.

 I've read over the docs for the NSP package, but am having a hard time
 wrapping my brain around it.  Would it be possible for the NSP package
 (count.pl and statistic.pl) to accomplish a test upon a pair of names
 to achieve a match probability ratio?

 In a perfect world, I want to open a large file with 1 long list of
 names.  Starting at the first name, I want to iterate over the entire
 list and achieve ratio proabilities for each pair of names.  As each
 ratio is computed, I'll test it for a threshold and if the pair
 exceeds a threshold, I'll push it to an array.  Repeat for the 2nd
 name in the list, 3rd name in the list, etc.

 Thanks in advance for any wisdom you might have on this question :-)







 Yahoo! Groups Links







--
Ted Pedersen
http://www.d.umn.edu/~tpederse


 




[ngram] Re: [cpan #15862] Incorrect packaging practices

2005-11-16 Thread ted pedersen

On Wed, 16 Nov 2005,  via RT wrote:


 This message about Text-NSP was sent to you by GROUSSE [EMAIL PROTECTED] 
 via rt.cpan.org

 Full context and any attached attachments can be found at:
 URL: https://rt.cpan.org/Ticket/Display.html?id=15862 

 The distribution uses top-level namespaces for all its measurement packages. 
 This is a bad practice, whereas it could easily keep them under its own 
 Text::NSP namespace. For instance, leftFisher should rather be 
 Text::NSP::leftFisher. As a side effect, it would make the installation 
 procedure easier: just putting all pm files in a lib/ subdirectory ensures 
 MakeMaker correctly processes them.

 Also, those are modules, not executables, so I don't see the need to scan 
 PATH for them. The installation procedure is supposed to correctly install 
 them in @INC, where the 'use' directive will automatically find them at 
 compile time, which is more efficient than the 'require' directive at 
 execution time.


Thank you very much for your careful review and comments regarding
Text-NSP (the Ngram Statistics Package). In fact, it is interesting you
raise these concerns as we are currently working on a version 0.75 of
Text-NSP that will address the namespace concerns (by implementing a more
traditional hierarchy for the measure modules). We are also trying to
clean up the issues regarding our use of PATH, etc. as you describe below.
That said, we are grateful for the confirmation that we are heading in the
right direction, and we will make sure that our changes are in line with
your suggestions below.

To be honest, when we started NSP back in 2001, we knew very little about
Perl (which is no excuse really, there is lots of documentation, etc.) but
we made some choices that weren't so good, and clearly very non-standard.
Rest assured we want to resolve these asap! We are hoping that the 0.75
release will be ready by mid-December.

Cordially,
Ted and Saiyam

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


 




[ngram] suggestion for nsp from user

2005-10-04 Thread ted pedersen

An NSP user has the following idea:

--
I just thought it would be nice to have an option in NSP (specifically in
statistic.pl) to filter bigrams based on their p-values, like we currently
do by rank and score. Very often I need to find significant bigrams, and
it will be nice if I can just tell NSP to give me bigrams with, say, a 5% or
1% chance of being independent etc...
--

I think this is an excellent suggestion, and it is something we have
thought about doing in the past, and should revisit now that we are back
into NSP development mode (just started in Sept, hopefully some new
releases coming in October!).

In fact, there are some Perl modules that will give these values (assuming
we can give them log-likelihood or Pearson's values, which of course we
can). For example, the following seems promising:

http://search.cpan.org/~mikek/Statistics-Distributions-1.02/Distributions.pm

So, we aren't too far away from being able to do this, especially for
measures like ll and x2 (which can be assigned significance based on their
raw values using 1 degree of freedom and the chi-squared distribution).
However, the problem we would have is that it's less clear what it means
for some measures - like the dice coefficient, for example. I don't
*think* there is a clean way to assign significance to those values
(perhaps I'm wrong on that point?)
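
To make that concrete, here is a minimal sketch of turning a
log-likelihood or x2 score into a p-value with the module mentioned above
(the score here is a made-up number, not real statistic.pl output):

  use Statistics::Distributions;

  # hypothetical ll (or x2) score for some bigram, 1 degree of freedom
  my $score = 10.83;

  # chisqrprob returns the upper-tail probability of the chi-squared
  # distribution with the given degrees of freedom
  my $pvalue = Statistics::Distributions::chisqrprob(1, $score);

  print "p-value = $pvalue\n";    # roughly 0.001 for a score of 10.83

A filtering option could then simply keep the bigrams whose p-value falls
below the requested threshold.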

Anyway, an excellent suggestion. Thanks for making it - we'll make sure we
give it some serious consideration as we get into modifying statistic.pl,
which isn't too far off in the future.

If there are other suggestions along these lines, that is additional
features for statistic.pl, additional measures to support, etc. please
make them now as we are early in the development stages and it's a good
time to add items to the agenda. A few months from now it will probably be
a bit harder to do so.

Thanks!
Ted







[ngram] NSP bibliography now under construction

2005-09-27 Thread ted pedersen

I am finally getting to one of my New Year's Resolutions (yes, for 2005
even!). The construction of an online bibliography for the Ngram
Statistics Package is *finally* underway!

http://www.d.umn.edu/~tpederse/nsp-bib/

I have already put a few entries there, but I know there are lots that are
missing!! I will be working on this over the next few days, but would very
much appreciate it if you would check out the bibliography and let me know
of any written papers, articles, reports, theses, etc. that make use of
the Ngram Statistics Package in some way. Also, if you notice any errors
in the entries that are already there, please let me know.

The goal of this bibliography is to try and show the wide range of
problems and applications for which NSP has been used, and to try and make
those publications as widely available as possible. Please help out by
adding your own papers to the bibliography!

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse






[ngram] proposed re-design of Measures in Ngram Statistics Package

2005-09-20 Thread ted pedersen

The following is a description of our plan of attack for the first stage
of the NSP redesign, that is to organize the measures in an object
oriented hierarchical fashion. The description below is written by Saiyam
Kohli. Your comments and questions are of course most welcome, especially
at this time. We will get started on implementation in the very near
future.

Thanks,
Ted



The Ngram Statistics Package (aka Text-NSP) is being completely rewritten
using object oriented Perl. We will start with the measures for ranking
bigrams and trigrams that are now found in the /Measures directory.

These changes should be transparent to current users of NSP, since
statistic.pl will continue to be used as is. Internally the Measures
will be organized as a hierarchy of classes, and statistic.pl will serve
as a driver program that calls the methods appropriate for the measures
requested by the user.

The proposed hierarchy for Text-NSP measures is:

Text-NSP::Measure
  Text-NSP::Measure::2D
    Text-NSP::Measure::2D::fisher
      Text-NSP::Measure::2D::fisher::left
      Text-NSP::Measure::2D::fisher::right
    Text-NSP::Measure::2D::mi
      Text-NSP::Measure::2D::mi::ll
      Text-NSP::Measure::2D::mi::tmi
      Text-NSP::Measure::2D::mi::pmi
    Text-NSP::Measure::2D::phi
      Text-NSP::Measure::2D::phi::x2
      Text-NSP::Measure::2D::phi::phi2
    Text-NSP::Measure::2D::dice
    Text-NSP::Measure::2D::odds
    Text-NSP::Measure::2D::tscore
  Text-NSP::Measure::3D
    Text-NSP::Measure::3D::mi::ll
    Text-NSP::Measure::3D::mi::tmi

Text-NSP::Measure is the base class for all the measures and will
provide the basic framework and error checks. Most of the methods in this
class will have to be overridden.

To create an object of any of the measures, statistic.pl will have to pass
the name of that measure to the constructor of this class:

$ll = Text-NSP::Measure->new("Text-NSP::Measure::2D::ll");

Since statistic.pl will still be the driver program, it will act as the
abstraction between existing programs that use the Ngram Statistics
Package and the new implementation. This will also allow users to create
and release their own measures based on the Text-NSP::Measure framework.
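
As a purely hypothetical illustration (the module and method names here
are invented, since the interface is not finalized), a user-contributed
measure might look something like this:

  package Text::NSP::Measure::2D::mydice;    # hypothetical module name
  use strict;

  sub new { my $class = shift; return bless {}, $class; }

  # compute the dice coefficient from the joint count and the two marginals
  sub calculateStatistic {
      my ($self, $n11, $n1p, $np1) = @_;
      return 2 * $n11 / ($n1p + $np1);
  }

  1;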

Text-NSP::Measure::2D inherits from Text-NSP::Measure and will provide a
framework specific to bigram-based measures. This class will implement
methods to compute the observed and expected values from the marginal
totals. Error checks that are specific to bigrams will also be
implemented in this class. Similarly, Text-NSP::Measure::3D will provide
the framework for trigram measures.

Text-NSP::Measure::2D::fisher will implement methods specific to
Fisher's exact test, thus eliminating the duplicated code present in
the current implementation. We will also resolve the errors that have
been reported on the ngram users list regarding Fisher's exact test.
(These errors occur when n22 is not the maximum value in the 2x2 table
representation of the count data.)

The computation of the Log-Likelihood measure, Total Mutual Information
and Pointwise Mutual Information is quite similar.

ll  = 2 * SUM_ij [ nij * log( nij / mij ) ]

tmi = SUM_ij [ (nij / npp) * log( nij / mij ) / log 2 ]

pmi = log( n11 / m11 )

The computations common to these measures will be implemented in the
Text-NSP::Measure::2D::mi class. Here mi is just a placeholder; we
have not finalized a name for this class.
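
As a rough sketch of the shared computation (illustrative Perl with
made-up counts, not the Text-NSP source):

  use strict;

  # hypothetical bigram counts: joint count, marginals, total bigrams
  my ($n11, $n1p, $np1, $npp) = (10, 50, 60, 1000);

  # remaining observed cells, by algebra
  my $n12 = $n1p - $n11;
  my $n21 = $np1 - $n11;
  my $n22 = $npp - $n1p - $np1 + $n11;

  # expected values under independence: mij = row total * column total / npp
  my $m11 = $n1p * $np1 / $npp;
  my $m12 = $n1p * ($npp - $np1) / $npp;
  my $m21 = ($npp - $n1p) * $np1 / $npp;
  my $m22 = ($npp - $n1p) * ($npp - $np1) / $npp;

  # all three measures reuse the same nij/mij ratios
  my $ll  = 2 * ( $n11 * log($n11/$m11) + $n12 * log($n12/$m12)
                + $n21 * log($n21/$m21) + $n22 * log($n22/$m22) );
  my $tmi = ( ($n11/$npp) * log($n11/$m11) + ($n12/$npp) * log($n12/$m12)
            + ($n21/$npp) * log($n21/$m21) + ($n22/$npp) * log($n22/$m22) )
            / log(2);
  my $pmi = log($n11/$m11);

  printf "ll = %.4f  tmi = %.6f  pmi = %.4f\n", $ll, $tmi, $pmi;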

Similarly, the phi and x2 measures are related to each other, so the
phi class will provide the methods for computations and error checks
specific to both of these measures.

x2 = npp*phi^2
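
In code form (again an illustrative sketch with made-up counts):

  use strict;

  # the same style of 2x2 table as above
  my ($n11, $n12, $n21, $n22) = (10, 40, 50, 900);
  my $n1p = $n11 + $n12;
  my $np1 = $n11 + $n21;
  my $npp = $n11 + $n12 + $n21 + $n22;

  # phi^2 = (n11*n22 - n12*n21)^2 / (n1p * np1 * n2p * np2)
  my $phi2 = ($n11 * $n22 - $n12 * $n21) ** 2
           / ($n1p * $np1 * ($npp - $n1p) * ($npp - $np1));

  # Pearson's x2 is phi^2 scaled by the sample size
  my $x2 = $npp * $phi2;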

The remaining three measures do not have much in common, so they
will be implemented directly under the Text-NSP::Measure::2D class.

This hierarchy is not final; any suggestions regarding the proposed
changes will be appreciated.

Saiyam Kohli







[ngram] input format

2005-08-16 Thread ted pedersen

A user is wondering about how to manually create input files for
statistic.pl ...

 I have read your readme file which came with the package. It's well
 written and quite understandable even for a person ignorant in the field
 of Ngrams.
 But unfortunately, although I quickly understood the general idea, it's
 still not quite clear to me how to convert my data into the format of
 2x2 contingency tables suitable for the leftFisher library.
 here's what I have:
 for every gene (bigram in your terms) I have a typical 2x2 table with
 frequencies like:

   2  3
   4  1

 First, it's written in leftFisher.pm that only data from bigrams are
 accepted.
 In readme file the input format for bigrams is given as:

 first_line_total_number_of_bigrams
 word1<>word2<>n11 n1p np1
 ...
 where n1p, np1 represent marginal totals in a 2x2 contingency table,
 and n11 is the frequency of the bigram.

 Now in leftFisher.pm it's written:

   # Get the total number of bigrams; no need to check
   my $npp = measure2d::getTotalBigrams();

   # Get the marginal frequencies
   my ($n1p, $np1, $n2p, $np2) = measure2d::getMarginalTotals();

 Indeed, as one would expect, there should be 4 marginal totals in the
 input for the Fisher statistics computation.
 But how do I get them into the input file?
 I imagine that word1<>word2 would be a unique gene ID. What should I put
 as n11, n1p and np1, taking into account the freq table above?


Suppose your 2x2 table of data looks like this:

  2 3 |  5
  4 1 |  5
  --------
  6 4   10

To convert this into a format that statistic.pl can process, you
would do this:

10
x<>y<>2 5 6


You can imagine this as specifying the following 2x2 table:

   2   n12 |   5
  n21  n22 | n2p
  --------------
   6   np2    10

Now, given those 4 values, you can fill in the rest of the cell values
using simple algebra. So we only require that you specify this minimal set
of values to represent a 2x2 table!

In general you probably want to use the command line program statistic.pl
rather than trying to modify leftFisher.pm, etc.

If you put the following:

10
x<>y<>2 5 6

into a file mydata.input

Then you could run statistic.pl like this...

statistic.pl leftFisher mydata.output mydata.input
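
And if you have full tables for many genes, a short script along these
lines would write the input file for you (the file name and gene IDs are
just placeholders; note that this assumes every gene's table sums to the
same overall total, since statistic.pl takes a single total on the first
line):

  use strict;

  # one full 2x2 table per gene: [n11, n12, n21, n22]
  my %tables = ( geneA => [2, 3, 4, 1], geneB => [1, 4, 5, 0] );

  my $npp = 10;    # every table above sums to 10

  open my $out, '>', 'mydata.input' or die $!;
  print $out "$npp\n";
  for my $id (sort keys %tables) {
      my ($n11, $n12, $n21, $n22) = @{ $tables{$id} };
      my $n1p = $n11 + $n12;    # row marginal
      my $np1 = $n11 + $n21;    # column marginal
      print $out "$id<>na<>$n11 $n1p $np1\n";   # gene ID as word1, dummy word2
  }
  close $out;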

I hope this makes sense!

Good luck,
Ted






[ngram] overflow in fisher's test

2005-07-20 Thread ted pedersen

A user reported observing some cases of overflow in Fisher's exact test in
the Ngram Statistics Package (both the left and right variations). My own
conclusion is that there is a bit of rounding error at work here, since we
are summing together a potentially large number of hypergeometric
probabilities to arrive at the values. So, it's not an alarming situation,
but certainly one that needs to be fixed. Below you can see some specific
cases of overflow:

Right Fisher output:

934064
cat:cc<>h_position_direction:-<>2 1.0490 1728 20006 169317
h_role:object<>relative_position:3<>3 1.0050 144 68362 15501
h_cat:jj<>h_role:locative<>4 1.0032 511 48419 35842
h_group_type:na<>cat:pos<>5 1.0000 14 59756 8709
h_group_type:na<>cat:sym<>5 1.0000 38 59756 6910

Left Fisher output:

934064
cat:nn<>h_relative_position:3<>1 1.0916 801 301347 1890
h_group_type:np<>role:predeterminer<>2 1.0390 1133 445387 1135
group_type:na<>leafp:na<>3 1.0000 1 1 1
group_type:na<>h_leafp:na<>3 1.0000 1 1 59756
h_group_type:na<>leafp:na<>3 1.0000 1 59756 1
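
For anyone curious about the mechanics, here is a minimal illustration
(not the NSP code itself, and with made-up counts) of computing the
left-sided sum with each hypergeometric term evaluated in log space, plus
a final clamp; something along these lines is one way to keep the drift
in check:

  use strict;

  # log(n!) by summing logs; adequate for moderately sized tables
  sub log_fact {
      my $n = shift;
      my $s = 0;
      $s += log($_) for 2 .. $n;
      return $s;
  }

  # log probability of a 2x2 table under the hypergeometric distribution
  sub log_hypergeom {
      my ($n11, $n12, $n21, $n22) = @_;
      my $npp = $n11 + $n12 + $n21 + $n22;
      return log_fact($n11 + $n12) + log_fact($n21 + $n22)
           + log_fact($n11 + $n21) + log_fact($n12 + $n22)
           - log_fact($npp) - log_fact($n11) - log_fact($n12)
           - log_fact($n21) - log_fact($n22);
  }

  # left-sided Fisher: sum P(k) over all tables with k <= n11
  my ($n11, $n1p, $np1, $npp) = (2, 5, 6, 10);
  my $p = 0;
  for my $k (0 .. $n11) {
      my ($n12, $n21, $n22) = ($n1p - $k, $np1 - $k, $npp - $n1p - $np1 + $k);
      next if $n12 < 0 or $n21 < 0 or $n22 < 0;    # skip impossible tables
      $p += exp( log_hypergeom($k, $n12, $n21, $n22) );
  }
  $p = 1 if $p > 1;    # clamp away the accumulated rounding error
  print "left Fisher p = $p\n";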

The good news is that the Ngram Statistics Package is in line for a long
overdue facelift that will commence in August - there are a number of long
pending issues that will be resolved at that time, and some new
enhancements and features. As we get closer to starting that work, I'll be
posting our list of reported problems, etc. in order to make sure we
have caught everything. And of course, please feel free to let us know
of any other questions or concerns.

Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse




 




[ngram] Extensions to NSP for log-likelihood ratio

2005-05-01 Thread ted pedersen

Greetings all,

As a part of her MS thesis here at UMD (completed in Fall 2004), Bridget
made available her code that extends NSP to carry out the log-likelihood
ratio for 3, 4, and 5 word sequences. It's my intent to integrate that
into NSP more closely, but for now it is at least available as a
separate add-on here:

http://www.d.umn.edu/~tpederse/Code/modeling.tar.gz

You can find Bridget's thesis at the site below, to get more details on
what she did and what these extensions will provide:

http://www.d.umn.edu/~tpederse/Pubs/bridget-thesis.pdf

Please let us know if you have any questions about this. Sorry for not
making this available sooner; Bridget did a nice job on this and it just
fell through the cracks!

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse





 





Re: [ngram] Re: bash: ALL-TESTS.sh: command not found

2005-02-21 Thread ted pedersen


Hi Nancy,

I think you are having path issues, although it's a little strange since
it seems like the NSP directories are in your path. That said, I don't use
cygwin and I do know that it is a bit quirky, so perhaps there is someone
with more cygwin experience who can shed some light on what might be
happening.

But, anyway, here are some more ideas...

It looks like cygwin is using bash as your shell, whereas the commands in
our documentation assume the csh shell. So, you could either modify the
set path command for bash (you need to use the export command rather than
set, as shown below), or, more simply, you could run csh in cygwin and
then do the path-setting operations. If you aren't very familiar with bash
and don't really care whether you use it or not, this might be the easier
solution.
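
For example (the directory here is just a placeholder for wherever NSP
ended up on your machine), the bash form would be

  export PATH=$PATH:$HOME/NSP/bin

while the equivalent csh form is

  set path = ($path $HOME/NSP/bin)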

So, if you run

csh

from the command line, you will start a csh (C shell). If you don't
have csh (if this command fails), you can use tcsh instead.

BTW, you can tell if you are root or not by doing the command

whoami

If that says root you are the master of the universe. If it says
something else, then maybe you aren't the master and need to use the
PREFIX option with perl Makefile.PL (to specify where to install NSP).

Once you have started your csh, then go ahead and redo the installation
sequence of

perl Makefile.PL
make
make test
make install

(If you aren't root, then you will need to specify a PREFIX directory in
which to install NSP, and then use those three set path commands to point
the path at PREFIX and a few subdirectories; see the example below.)
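
For example (the directory is just a placeholder):

  perl Makefile.PL PREFIX=$HOME/NSP
  make
  make test
  make install

and then point your path at the directories created under $HOME/NSP (the
exact layout can vary a bit between Perl installations).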

So, if that all works correctly, do a few things like this from the
command line...

which count.pl

which statistic.pl

which ALL-TESTS.sh

If everything has gone well, you should see the directory where these
programs reside. If they are found, then when you run

ALL-TESTS.sh

things should go ok. If you don't find these files/programs via the which
command, it might mean that path problems remain.

As an aside, I think cygwin is great, if you don't have the option of
setting up your machine with dual booting capability (Windows and Linux).
However, if you do have this option, and plan to be using NSP and similar
tools a lot, it might make some sense to think about going the dual boot
route. Most of this sort of thing is pretty straightforward in Linux, and
I think over time Linux will evolve more nicely than cygwin. This is not
to dismiss cygwin, like I say it's a great idea, but I think life will get
easier if you are able to run on a Linux machine.

Good luck, and let us know what happens!

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


 





[ngram] new year's resolutions/ngram statistics package

2005-01-03 Thread ted pedersen


Happy 2005 from the Ngram Statistics Package...

There are a few items lingering in our todo queue, among them cleaning up
a few issues in the documentation, and then of course continuing to fix
up the code, add new features, etc. Right now NSP seems to be in a
somewhat stable state, but we do have plans and ambitions for the future.
If there are any features you would like to see included in future versions
of NSP, please let us know!

My New Year's resolution regarding NSP is to finally compile and post an
NSP bibliography - that is to list all papers, theses, dissertations,
articles, tech reports, etc. that have used NSP in some way. It's actually
fairly amazing how many such articles I have located thus far, but I'm
sure that I'm not finding everything.

So, if you are the author of something that uses or cites NSP, could you
please send me a note about this and let me know what you have done, and
where other NSP users can find your work? One thing that I've noticed is
that there are lots and lots of links on the web that point to NSP, so if
you are included in our bibliography this will probably help people find
your work (unless you are already very famous, in which case you don't
need our help, but we certainly need yours. :)

If you have already written to me about your paper, I still have that
information, and my apologies for being so slow with this bibliography.
But, I am hard at work at this in the new year, and hope to have something
available by Jan 21.

Also, if you happen to have code that uses NSP without a related
publication, and that code is distributed, we want to know about you too.
We'll have a separate section for software systems...

Happy New Year!
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


 