[ngram] Re: Fwd: ll4 giving me trouble with 4-grams
Hi Bridget,

I am using Text-NSP-1.25, Perl 5.10.1 and GNU/Linux.

Thanks for your help!
Mercè

--- In ngram@yahoogroups.com, Bridget McInnes btmcinnes@... wrote:

Hi Mercè,

That is puzzling. I get the following, which really doesn't help you much.

3214
de<>recuperación<>de<>información<>1 1531.9192 8 274 17 286 33 8 18 9 17 16 27 8 8 8 16
la<>recuperación<>de<>información<>2 1266.9553 7 115 17 286 33 8 54 11 17 16 27 8 7 11 16
recuperación<>de<>información<>y<>3 1009.9596 5 15 287 30 63 15 14 5 25 11 5 14 5 5 5
el<>procesamiento<>del<>lenguaje<>4 610.5006 6 95 19 57 22 11 9 6 10 9 20 6 6 6 9
procesamiento<>del<>lenguaje<>natural<>5 521.5067 9 19 55 22 19 10 9 9 20 16 18 9 9 9 16

What version of Text-NSP are you using? And I guess also what version of Perl? And your OS? Maybe if we can see the difference between your system and mine we can track down the error.

I am using: Text-NSP-1.27; Perl 5.10.1; Ubuntu. Do you have similar versions, especially with NSP?

In the meantime, I will take a look at the code where the error is being generated to see if something comes to light.

Thanks,
Bridget

On Wed, Mar 27, 2013 at 2:34 PM, mercevg mercevg@... wrote:

Hi Bridget,

I've been doing the same process as you, but the error continues to occur.

My test.4 file contains:

3214
procesamiento<>del<>lenguaje<>natural<>9 19 55 22 19 10 9 9 20 16 18 9 9 9 16
de<>recuperación<>de<>información<>8 274 17 286 33 8 18 9 17 16 27 8 8 8 16
la<>recuperación<>de<>información<>7 115 17 286 33 8 54 11 17 16 27 8 7 11 16
el<>procesamiento<>del<>lenguaje<>6 95 19 57 22 11 9 6 10 9 20 6 6 6 9
recuperación<>de<>información<>y<>5 15 287 30 63 15 14 5 25 11 5 14 5 5 5

Then I run the Log Likelihood for 4-grams:

statistic.pl --ngram 4 ll test.4ll test.4

And this is the error message:

Use of uninitialized value $Text::NSP::Measures::4D::expected_values in string eq at /etc/perl/Text/NSP/Measures/4D.pm line 869, <SRC> line 816.
^C

Thank you for your help!
Mercè

--- In ngram@yahoogroups.com, Bridget McInnes btmcinnes@ wrote:

Hi Mercè,

Would you send me your file? I am not able to reproduce the error. I apologize if you already sent it. I am not seeing it in the thread. I put what I did to test it below so you can reproduce what I have done on an example test set.

I will check on this:

In the folder MyNSP/man/man3 I've got Text::NSP::Measures::4D::MI::ll.3pm

There shouldn't be a ll.3pm in 4D. I must have something wrong in there.

Thanks,
Bridget

- Here is what I am doing:

The text file contains the following:

this is a test sentence just a sentence this is a test sentence

I save that to a file called test.txt. Then I run the following:

bridget@atlas:~/nsp-test$ count.pl --ngram 4 test.4 test.txt

The test.4 file contains:

10
this<>is<>a<>test<>2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
is<>a<>test<>sentence<>2 2 3 2 3 2 2 2 2 2 2 2 2 2 2
sentence<>this<>is<>a<>1 2 1 1 2 1 1 1 1 1 1 1 1 1 1
test<>sentence<>just<>a<>1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
just<>a<>sentence<>this<>1 1 3 2 1 1 1 1 1 1 1 1 1 1 1
a<>sentence<>this<>is<>1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
a<>test<>sentence<>just<>1 2 1 2 1 1 1 1 1 1 1 1 1 1 1
sentence<>just<>a<>sentence<>1 2 1 3 3 1 1 1 1 1 1 1 1 1 1

Then I run the Log Likelihood for 4-grams over it:

bridget@atlas:~/nsp-test$ statistic.pl --ngram 4 ll test.4ll test.4

Please note here that the input file is the count.pl output generated in the step above (test.4). This may be the cause of the error.

The test.4ll file contains:

10
is<>a<>test<>sentence<>1 29.8708 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2
this<>is<>a<>test<>2 29.6804 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
sentence<>just<>a<>sentence<>3 27.3805 1 2 1 3 3 1 1 1 1 1 1 1 1 1 1
just<>a<>sentence<>this<>4 22.4273 1 1 3 2 1 1 1 1 1 1 1 1 1 1 1
sentence<>this<>is<>a<>5 19.9354 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1
test<>sentence<>just<>a<>5 19.9354 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
a<>sentence<>this<>is<>5 19.9354 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
a<>test<>sentence<>just<>5 19.9354 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1

Let me know if you get anything different.

On Wed, Mar 27, 2013 at 11:45 AM, mercevg mercevg@ wrote:

Ted,

After changing the command line, the following error message appears:

Use of uninitialized value $Text::NSP::Measures::4D::expected_values in string eq at /etc/perl/Text/NSP/Measures/4D.pm line 839, <SRC> line 1265.
^C

Could it be due to files already installed? In the folder MyNSP/man/man3 I've got Text::NSP::Measures::4D::MI::ll.3pm

Thank you,
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen tpederse@ wrote:

I think there is a slight typo in your command:

statistic.pl --ngram 4 ll4.pm output.txt input.txt

(the module name should be ll4.pm)

I hope this helps! Let me know if you continue to have any trouble...

Good luck,
Ted

On Wed, Mar 27, 2013 at 9:06 AM, mercevg mercevg@ wrote
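A quick way to rule out a malformed input file before running statistic.pl is to check each n-gram line by hand. The sketch below is only an illustration in Python, not part of Text-NSP: it assumes the usual count.pl layout in which every token is followed by <> and the frequency values come last, and that a 4-gram line carries 2^4 - 1 = 15 frequency values, as in the test.4 listings in this thread. The function name parse_count_line is made up for this example.

```python
def parse_count_line(line, n):
    """Split one count.pl-style n-gram line into (tokens, frequencies).

    Assumes tokens are separated by "<>" and the last field holds the
    space-separated frequency values (an assumption about the file layout,
    not code taken from Text-NSP).
    """
    fields = line.strip().split("<>")
    tokens, tail = fields[:-1], fields[-1]
    freqs = [int(value) for value in tail.split()]
    if len(tokens) != n:
        raise ValueError(f"expected {n} tokens, found {len(tokens)}")
    expected = 2 ** n - 1  # one count per non-empty subset of positions
    if len(freqs) != expected:
        raise ValueError(f"expected {expected} counts, found {len(freqs)}")
    return tokens, freqs


tokens, freqs = parse_count_line(
    "this<>is<>a<>test<>2 2 2 3 2 2 2 2 2 2 2 2 2 2 2", 4
)
print(tokens, len(freqs))
```

This only applies to count.pl output; lines written by statistic.pl also carry a rank and a score before the counts, so they would need a different split.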
[ngram] Re: Fwd: ll4 giving me trouble with 4-grams
Ted,

After changing the command line, the following error message appears:

Use of uninitialized value $Text::NSP::Measures::4D::expected_values in string eq at /etc/perl/Text/NSP/Measures/4D.pm line 839, <SRC> line 1265.
^C

Could it be due to files already installed? In the folder MyNSP/man/man3 I've got Text::NSP::Measures::4D::MI::ll.3pm

Thank you,
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen tpederse@... wrote:

I think there is a slight typo in your command:

statistic.pl --ngram 4 ll4.pm output.txt input.txt

(the module name should be ll4.pm)

I hope this helps! Let me know if you continue to have any trouble...

Good luck,
Ted

On Wed, Mar 27, 2013 at 9:06 AM, mercevg mercevg@... wrote:

Ted,

I've received your answer without problem. I'll try to follow up with another email address.

A sample of my 4-grams file:

procesamiento<>del<>lenguaje<>natural<>9 19 55 22 19 10 9 9 20 16 18 9 9 9 16
recuperación<>de<>información<>textual<>4 15 287 30 5 15 14 4 25 4 5 14 4 4 4
estadístico<>del<>lenguaje<>natural<>3 5 55 22 19 3 3 3 20 16 18 3 3 3 16
aparición<>en<>el<>documento<>2 4 93 95 22 3 2 3 18 6 4 2 3 2 3

Command line:

statistic.pl --ngram 4 ll.3pm 4-grams-ll.txt 4-grams.txt

Program answer:

Measure not defined for 4-grams

I've got Text-NSP v.1.25.

Thank you.
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen tpederse@ wrote:

Merce,

I got an email error when responding directly to your yahoo.es account. Could you follow up with another email address or use the group...?

Thanks,
Ted

-- Forwarded message --
From: Ted Pedersen tpederse@
Date: Wed, Mar 27, 2013 at 8:29 AM
Subject: Re: ll4 giving me trouble with 4-grams
To: mercevg mercevg@

Hi Merce,

Could you send me whatever error output you are getting, plus a small sample of your ngram file?

Thanks!
Ted

On Wed, Mar 27, 2013 at 8:12 AM, mercevg mercevg@ wrote:

Hi,

I would like to know how to calculate the log-likelihood ratio for 4-grams with statistic.pl.

To calculate 3-grams I've run the program as follows:

statistic.pl --ngram 3 tmi3.pm three.ngram.tmi3 three.ngram

But using the log-likelihood ratio it doesn't work.

Thanks,
Mercè
[ngram] Re: ngrams with hyphen
Ted,

Thanks, I've added this regular expression to my tokens file and it works well.

One more comment about that. In my corpus I have some interesting bigrams such as:

in-band signalling
in-call rearrangement
in-slot signalling

If I filter "in" as a stopword, I can't get these kinds of bigrams from my corpus. On the other hand, if "in" is not on my stopwords list, I retrieve these bigrams but I also get more bigrams of no interest, such as:

in Recommendation
in Figure
in order

My question is: can I filter out the second group and still retrieve the first group at the same time?

Thank you for your help,
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen tpederse@... wrote:

Greetings Merce,

This is fairly easy to handle via the --token option. You simply specify a regular expression that says a token is a string followed by a - followed by a string. You can customize a --token file many ways, but the following example will handle hyphenated words. Please do let us know if additional questions arise!

linux@linux:~ count.pl test.out test.txt --token token.txt

linux@linux:~ more test.out
13
cell-phone<>It<>1 1 1
the<>village-shop<>1 1 1
s<>extra-nice<>1 1 1
village-shop<>today<>1 1 1
bought<>a<>1 1 1
went<>to<>1 1 1
a<>cell-phone<>1 1 1
i<>went<>1 1 1
today<>and<>1 1 1
It<>s<>1 1 1
and<>I<>1 1 1
I<>bought<>1 1 1
to<>the<>1 1 1

linux@linux:~ cat test.txt
i went to the village-shop today, and I bought a cell-phone. It's extra-nice.

linux@linux:~ cat token.txt
/\w+\-\w+/
/\w+/

Enjoy,
Ted

On Wed, Apr 20, 2011 at 2:20 PM, mercevg mercevg@... wrote:

Dear all,

I would like to know if it's possible to get a list of ngrams with a hyphen inside, maybe during the tokenization process.

For example, I want to get these bigrams:

- call-connected signal
- clear-back signal
- clear-forward signal

Instead of two bigrams for each one:

- call<>connected<>179 2608 527
  connected<>signal<>189 320 9176
- clear<>back<>283 1115 733
  back<>signal<>157 380 9176
- clear<>forward<>632 1115 877
  forward<>signal<>493 1547 9176

Thanks a lot,
Mercè

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
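The effect of listing the hyphenated pattern before the plain one can be imitated outside NSP. The following is a minimal Python sketch, not NSP code: it assumes that the two token definitions behave like an ordered alternation, and the names TOKEN_RE, tokenize, bigrams and drop_exact_stopword are invented for the illustration. The stopword filter at the end is a hypothetical exact-match filter, shown only to make the "in" versus "in-band" distinction concrete; it is not a claim about how count.pl's stop list works.

```python
import re

# Hyphenated words are tried first, plain words second (assumed to mirror
# the order of the two lines in token.txt).
TOKEN_RE = re.compile(r"\w+-\w+|\w+")


def tokenize(text):
    """Return the tokens matched left to right."""
    return TOKEN_RE.findall(text)


def bigrams(tokens):
    """Return adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def drop_exact_stopword(pairs, stopword):
    """Hypothetical filter: drop pairs in which a token equals the stopword."""
    return [pair for pair in pairs if stopword not in pair]


text = "i went to the village-shop today, and I bought a cell-phone. It's extra-nice."
tokens = tokenize(text)
pairs = bigrams(tokens)
print(len(tokens), len(pairs))

sample = bigrams(tokenize("in-band signalling in Recommendation"))
print(drop_exact_stopword(sample, "in"))
```

Because "in-band" is a single token here, an exact match on "in" leaves the pair (in-band, signalling) untouched while removing the pairs built on the standalone preposition.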
[ngram] Ngrams without line break
Dear all,

I would like to know if it's possible to get only those ngrams from the corpus that do not contain line breaks. I'll try to explain clearly. If the input text file is:

first line of text
second line
And a third line of text

then with count.pl we'll get two bigrams containing line breaks:

text second
line And

Or trigrams:

of text second
text second line
second line And

And so on.

Taking these outputs into account, and after reading the help text, I don't know if I can change the default count.pl options to get all ngrams from the corpus except those that combine words at the end of one line with words at the beginning of the next line. That is, ngrams that do not contain line breaks.

Best wishes,
Mercè
[ngram] Re: Ngrams without line break
Dear Ted,

In my case, I would like to get all the ngrams except those that cross over the end of line. In your example:

the cat is my friend the cat is my friend

I don't want to get "is my" and "the cat" as ngrams when they have a new line in the middle. As you said, by default count.pl simply ignores end of line markers. But is it possible not to ignore end of line markers?

Thanks a lot!
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen duluth...@... wrote:

Greetings Merce,

To make sure I understand correctly, it sounds like you *only* want to see those ngrams that contain a line break. For example, if you run count.pl as follows on your test file

first line of text
second line
And a third line of text

count.pl test.out test

talisker(8): more test.out
11
line<>of<>2 3 2
of<>text<>2 2 2
line<>And<>1 3 1
And<>a<>1 1 1
a<>third<>1 1 1
second<>line<>1 1 3
third<>line<>1 1 3
first<>line<>1 1 3
text<>second<>1 1 1

You will get the bigrams that cross over the end of line (text second, line And), but you also get all the other ngrams too... and so it sounds to me like you only want the ones that cross over the new line markers, and nothing else. Is that accurate?

By default count.pl simply ignores end of line markers (the behavior you see above). So it's not so much that the ngram includes the new line; it simply ignores it. So with a file like

the cat is my friend the cat is my friend

the 2 occurrences of "the cat" would be considered identical, even though the second could be thought of as having a new line in the middle of it (but we essentially ignore that).

So... at the moment at least I'm not sure how to limit the output to only those ngrams that are made by crossing over a new line marker. But let me make sure I am understanding things correctly (so do let me know if I'm wrong) and I'll give this a little more thought too.

Cordially,
Ted

On Wed, Jul 1, 2009 at 12:15 PM, mercevg merc...@... wrote:

Dear all,

I would like to know if it's possible to get only those ngrams from the corpus that do not contain line breaks. I'll try to explain clearly. If the input text file is:

first line of text
second line
And a third line of text

then with count.pl we'll get two bigrams containing line breaks:

text second
line And

Or trigrams:

of text second
text second line
second line And

And so on.

Taking these outputs into account, and after reading the help text, I don't know if I can change the default count.pl options to get all ngrams from the corpus except those that combine words at the end of one line with words at the beginning of the next line. That is, ngrams that do not contain line breaks.

Best wishes,
Mercè

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
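The behaviour being asked for, counting ngrams inside each line and never across a line break, can be sketched in a few lines of Python. This is only an illustration of the idea and not how count.pl is implemented; the function name ngrams_within_lines is made up here, and tokens are simply split on whitespace. It may also be worth checking count.pl --help for your version, since an option along the lines of --newLine may already cover this; treat that as something to verify rather than a confirmed answer from this thread.

```python
from collections import Counter


def ngrams_within_lines(text, n):
    """Count n-grams line by line, so none spans a line break."""
    counts = Counter()
    for line in text.splitlines():
        tokens = line.split()  # whitespace tokens, for illustration only
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


text = "first line of text\nsecond line\nAnd a third line of text\n"
bigram_counts = ngrams_within_lines(text, 2)
for pair, freq in bigram_counts.most_common():
    print(" ".join(pair), freq)
```

With the three-line example from the question, the pairs "text second" and "line And" never appear, because each line is windowed separately.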
[ngram] Re: Problem with a token
Patrick, Ted,

I added "use locale;" at line 83, but this didn't improve my results: words containing the character l·l (like intel·ligència) are still not included in the results list.

It is important to say, though, that I added as tokens all the accents, diaereses and apostrophes used in the Catalan corpus, and I got good results. I think that is the solution for this kind of character, except for the l·l (l geminada).

Best regards,
Mercè

Greetings all,

Thanks for the very interesting discussion. This is quite helpful. Just a short note to confirm that we have not yet added the "use locale;" directive to NSP. We haven't had a release in some time, but this will surely be included when we do. I am thinking it might not be a bad idea to have a release simply to take care of this.

Thanks to Patrick for pointing this out in the first place, and then reminding us of that earlier discussion. I would be very interested to know if this resolves the problems with Catalan, French and Spanish, btw. Please do update us and the rest of the list, as I suspect this is a fairly common problem.

Cordially,
Ted

On Feb 13, 2008 11:07 AM, mercevg [EMAIL PROTECTED] wrote:

Patrick,

I have checked the latest version of NSP (v.1.03) and count.pl doesn't contain "use locale;". I'll try to add "use locale;" at line 83; maybe your suggestion is my solution. We have more or less the same problems with accents and other kinds of characters when working with French and Catalan or Spanish.

Thank you very much!
Mercè

Mercè,

I have not checked the latest version of NSP to see if count.pl and the other files contain "use locale;" as I suggested some time ago. The simple inclusion of such a statement at the beginning of the Perl scripts fixed the problems I had for French. You can have a look at this for more information:

http://tech.groups.yahoo.com/group/ngram/message/159

Hope this helps...

Regards,
Patrick

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
[ngram] Re: plans for version 1.05
Ted,

I have two suggestions to improve the new version.

1. I have problems extracting bigrams using Fisher's exact test (left sided) and Fisher's exact test (right sided). Could you fix these two measures? The error message:

Can't locate Text/NSP/Measures/2D/left.pm in @INC (@INC contains: /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl .) at /usr/bin/statistic.pl line 452.

We don't know how to resolve this problem, because the Ngram package was installed correctly. Does anyone else have this problem?

2. It would be very interesting to extract trigrams using all twelve statistical measures. Would that be possible?

Best wishes,
Mercè

Greetings all,

I'm in the process of collecting up the various bug reports that we've gotten since version 1.03 was released in September 2006, and I'll resolve those in 1.05. Here's what I have so far...

1) Incorporate use locale throughout the package (suggested by Patrick Drouin long ago). This will make for more convenient handling of non-English text.

2) Fix the Testing/statistic/t2 missing message during install (reported most recently by Mary Taffet, others previously).

3) Fix Makefile.PL to allow for a cleaner Windows install (reported by Richard Churchill).

I will keep looking through the mailing list archives and my own email, but those seem like the main issues that have arisen. However, if you recall something else, or there is some feature or change you are interested in seeing, please let me know.

As you can tell, NSP releases have slowed considerably in recent years, so this is likely to be the only release for some time to come. Please do let me know asap if there are other issues. Comments and suggestions are of course welcome.

Cordially,
Ted
[ngram] Re: Problem with a token
Bjoern,

Yes, I think so! I work with UTF-8 (corpus, stop list, etc.).

I thought that the problem with the character l·l was similar to the one with accents, because I added as tokens all the kinds of accents used in Catalan and Spanish and that problem was solved, but not in this case. For this reason, I tried to add this character to my tokens file or to my stopwords list, but it doesn't work.

Mercè

Hi there,

mercevg wrote:

I have some problems filtering n-grams in a corpus that contains words with this character: l·l. This character is frequently used in Catalan documents. In my results list I can't retrieve n-grams with words that contain this character.

In my tokens file I have inserted the line /[a-zA-Z·]+/ (with ·), but the results are not satisfactory. I have also tried to insert the line /l·l/ in my stop list, but it doesn't work at all, because in my results list I have bigrams like intel<>ligència. In this case, one word is divided into two words. Do you know what the problem is?

This sounds like a character set / file encoding issue. All files involved (corpus, filters etc.) should have the same encoding. I am not sure about the specific ISO encoding for Catalan. However, I suppose Catalan is covered by iso-8859-1. utf-8 should work anyway, though.

--
Best regards,
Bjoern Wilmsmann
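The splitting described above can be reproduced with an ordinary regular expression engine, which may help separate the pattern question from the encoding question. The sketch below uses Python's Unicode-aware re module as a stand-in; it is an analogy and not a statement about how Perl or NSP treat the same bytes. The variable names are invented for the example, and the word is written with escapes so that the middle dot (U+00B7) and the è (U+00E8) are unambiguous.

```python
import re

WORD = "intel\u00b7lig\u00e8ncia"  # intel·ligència
MIDDLE_DOT = "\u00b7"              # the punt volat used in l·l

# ASCII letters plus the middle dot: the accented vowel is still excluded.
ascii_with_dot = re.compile(f"[a-zA-Z{MIDDLE_DOT}]+")
# Word characters only: the middle dot is punctuation, so the word splits.
word_only = re.compile(r"\w+")
# Word characters plus the middle dot: the whole word is one token.
word_with_dot = re.compile(rf"[\w{MIDDLE_DOT}]+")

print(ascii_with_dot.findall(WORD))
print(word_only.findall(WORD))
print(word_with_dot.findall(WORD))
```

In this stand-in, the pattern that keeps the word whole is the one that allows both accented letters and the middle dot in the same character class, which suggests checking whether the token definition covers the accents and the dot together.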
[ngram] Re: Problem with a token
Patrick,

I have checked the latest version of NSP (v.1.03) and count.pl doesn't contain "use locale;". I'll try to add "use locale;" at line 83; maybe your suggestion is my solution. We have more or less the same problems with accents and other kinds of characters when working with French and Catalan or Spanish.

Thank you very much!
Mercè

Mercè,

I have not checked the latest version of NSP to see if count.pl and the other files contain "use locale;" as I suggested some time ago. The simple inclusion of such a statement at the beginning of the Perl scripts fixed the problems I had for French. You can have a look at this for more information:

http://tech.groups.yahoo.com/group/ngram/message/159

Hope this helps...

Regards,
Patrick