[ngram] Re: Fwd: ll4 giving me trouble with 4-grams

2013-04-03 Thread mercevg
Hi Bridget,

I am using Text-NSP-1.25, Perl 5.10.1 and GNU/Linux.

Thanks for your help!
Mercè 
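Since the error shows up with Text-NSP 1.25 but not with Bridget's 1.27, one thing worth trying (a hedged suggestion rather than a confirmed fix from this thread) is installing the newer release and making sure the older copy under /etc/perl, which the error message points at, is not the one being picked up. A minimal sketch, assuming a standard CPAN setup:

cpan Text::NSP
# then re-run the failing command from below:
statistic.pl --ngram 4 ll test.4ll test.4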



--- In ngram@yahoogroups.com, Bridget McInnes btmcinnes@... wrote:

 Hi Mercè,
 
 That is puzzling. I get the following, which really doesn't help you much.
 
 3214
 de<>recuperación<>de<>información<>1 1531.9192 8 274 17 286 33 8 18 9 17 16 27 8 8 8 16
 la<>recuperación<>de<>información<>2 1266.9553 7 115 17 286 33 8 54 11 17 16 27 8 7 11 16
 recuperación<>de<>información<>y<>3 1009.9596 5 15 287 30 63 15 14 5 25 11 5 14 5 5 5
 el<>procesamiento<>del<>lenguaje<>4 610.5006 6 95 19 57 22 11 9 6 10 9 20 6 6 6 9
 procesamiento<>del<>lenguaje<>natural<>5 521.5067 9 19 55 22 19 10 9 9 20 16 18 9 9 9 16
 
 What version of Text-NSP are you using? And what version of Perl? And your
 OS? Maybe if we can see the difference between your system and mine, we can
 track down the error.
 
 I am using: Text-NSP-1.27; Perl 5.10.1; Ubuntu. Do you have similar
 versions, especially with NSP?
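 For reference, a quick way to report the Text-NSP and Perl versions together
 from the shell (a sketch that assumes Text::NSP is installed normally and
 defines $Text::NSP::VERSION) is:
 
 perl -MText::NSP -e 'print "Text-NSP $Text::NSP::VERSION on Perl $]\n"'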
 
 In the meantime, I will take a look at the code where the error is being
 generated to see if something comes to light.
 
 Thanks,
 
 Bridget
 
 On Wed, Mar 27, 2013 at 2:34 PM, mercevg mercevg@... wrote:
 
 
 
  Hi Bridget,
 
  I've been doing the same process as you, but the error continues to occur.
 
  My test.4 file contains:
 
  3214
  procesamiento<>del<>lenguaje<>natural<>9 19 55 22 19 10 9 9 20 16 18 9 9 9 16
  de<>recuperación<>de<>información<>8 274 17 286 33 8 18 9 17 16 27 8 8 8 16
  la<>recuperación<>de<>información<>7 115 17 286 33 8 54 11 17 16 27 8 7 11 16
  el<>procesamiento<>del<>lenguaje<>6 95 19 57 22 11 9 6 10 9 20 6 6 6 9
  recuperación<>de<>información<>y<>5 15 287 30 63 15 14 5 25 11 5 14 5 5 5
 
  Then I run the Log Likelihood for 4-grams
  statistic.pl --ngram 4 ll test.4ll test.4
 
  And this is the error message:
  Use of uninitialized value $Text::NSP::Measures::4D::expected_values in
  string eq at /etc/perl/Text/NSP/Measures/4D.pm line 869, SRC line 816.^C
 
  Thank you for your help!
 
  Mercè
 
  --- In ngram@yahoogroups.com, Bridget McInnes btmcinnes@ wrote:
  
   Hi Mercè,
  
   Would you send me your file? I am not able to reproduce the error. I
   apologize if you already sent it. I am not seeing it in the thread.
  
   I put what I did to test it below so you could reproduce what I have done
   on an example test set.
  
   I will check on:
   In the folder MyNSP/man/man3 I've got Text::NSP::Measures::4D::MI::ll.3pm
  
   There shouldn't be a ll.3pm in 4D. I must have something wrong in there.
  
   Thanks,
  
   Bridget
   -
  
   Here is what I am doing:
  
   The text file contains the following:
   this is a test sentence
   just a sentence
   this is a test sentence
  
   I save that to a file called test.txt.
  
   Then I run the following:
   bridget@atlas:~/nsp-test$ count.pl --ngram 4 test.4 test.txt
  
   The test.4 file contains:
   10
    this<>is<>a<>test<>2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
    is<>a<>test<>sentence<>2 2 3 2 3 2 2 2 2 2 2 2 2 2 2
    sentence<>this<>is<>a<>1 2 1 1 2 1 1 1 1 1 1 1 1 1 1
    test<>sentence<>just<>a<>1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
    just<>a<>sentence<>this<>1 1 3 2 1 1 1 1 1 1 1 1 1 1 1
    a<>sentence<>this<>is<>1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
    a<>test<>sentence<>just<>1 2 1 2 1 1 1 1 1 1 1 1 1 1 1
    sentence<>just<>a<>sentence<>1 2 1 3 3 1 1 1 1 1 1 1 1 1 1
  
   Then I run the Log Likelihood for 4-grams over it:
   bridget@atlas:~/nsp-test$ statistic.pl --ngram 4 ll test.4ll test.4
  
    Please note here that the input file is the output generated by count.pl
    in the step above (test.4). If yours was produced differently, that may be
    the cause of the error.
  
   The test.4ll contains:
   10
    is<>a<>test<>sentence<>1 29.8708 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2
    this<>is<>a<>test<>2 29.6804 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
    sentence<>just<>a<>sentence<>3 27.3805 1 2 1 3 3 1 1 1 1 1 1 1 1 1 1
    just<>a<>sentence<>this<>4 22.4273 1 1 3 2 1 1 1 1 1 1 1 1 1 1 1
    sentence<>this<>is<>a<>5 19.9354 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1
    test<>sentence<>just<>a<>5 19.9354 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
    a<>sentence<>this<>is<>5 19.9354 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
    a<>test<>sentence<>just<>5 19.9354 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1
  
   Let me know if you get anything different.
  
  
   On Wed, Mar 27, 2013 at 11:45 AM, mercevg mercevg@ wrote:
  
   
   
Ted,
   
After changing the command line, the following error message appears:
   
Use of uninitialized value $Text::NSP::Measures::4D::expected_values in
string eq at /etc/perl/Text/NSP/Measures/4D.pm line 839, SRC line 1265.^C
   
Could it be due to files that are already installed?
In the folder MyNSP/man/man3 I've got
Text::NSP::Measures::4D::MI::ll.3pm
   
Thank you,
Mercè
   
--- In ngram@yahoogroups.com, Ted Pedersen tpederse@ wrote:

 I think there is a slight typo in your command:

 statistic.pl --ngram 4 ll4.pm output.txt input.txt

 (the module name should be ll4.pm)

 I hope this helps! Let me know if you continue to have any trouble...

 Good luck,
 Ted

 On Wed, Mar 27, 2013 at 9:06 AM, mercevg mercevg@ wrote

[ngram] Re: Fwd: ll4 giving me trouble with 4-grams

2013-03-27 Thread mercevg
Ted,

After changing the command line, the following error message appears:

Use of uninitialized value $Text::NSP::Measures::4D::expected_values in string 
eq at /etc/perl/Text/NSP/Measures/4D.pm line 839, SRC line 1265.^C

Could it be due to files that are already installed?
In the folder MyNSP/man/man3 I've got Text::NSP::Measures::4D::MI::ll.3pm
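The .3pm files under man/man3 are generated manual pages rather than the modules themselves, so their presence doesn't say much by itself. As a rough check (using standard Perl tooling, nothing NSP-specific), the following should print the path of the installed measure module, or fail if it is missing:

perldoc -l Text::NSP::Measures::4D::MI::ll
perl -MText::NSP::Measures::4D::MI::ll -e 'print "module loads fine\n"'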

Thank you,
Mercè


--- In ngram@yahoogroups.com, Ted Pedersen tpederse@... wrote:

 I think there is a slight typo in your command:
 
 statistic.pl --ngram 4 ll4.pm output.txt input.txt
 
 (the module name should be ll4.pm)
 
 I hope this helps! Let me know if you continue to have any trouble...
 
 Good luck,
 Ted
 
 On Wed, Mar 27, 2013 at 9:06 AM, mercevg mercevg@... wrote:
  Ted,
 
  I've received your answer without problem. I'll try to follow up with 
  another email address.
 
  A sample of my 4-grams file:
  procesamiento<>del<>lenguaje<>natural<>9 19 55 22 19 10 9 9 20 16 18 9 9 9 16
  recuperación<>de<>información<>textual<>4 15 287 30 5 15 14 4 25 4 5 14 4 4 4
  estadístico<>del<>lenguaje<>natural<>3 5 55 22 19 3 3 3 20 16 18 3 3 3 16
  aparición<>en<>el<>documento<>2 4 93 95 22 3 2 3 18 6 4 2 3 2 3
 
  Command line:
  statistic.pl --ngram 4 ll.3pm 4-grams-ll.txt 4-grams.txt
 
  Program answer:
  Measure not defined for 4-grams
 
  I've got Text-NSP v.1.25.
 
  Thank you.
  Mercè
 
  --- In ngram@yahoogroups.com, Ted Pedersen tpederse@ wrote:
 
  Merce, I got an email error when responding directly to your yahoo.es
  account. Could you follow up with another email address or use the
  group...?
 
  Thanks,
  Ted
 
 
  -- Forwarded message --
  From: Ted Pedersen tpederse@
  Date: Wed, Mar 27, 2013 at 8:29 AM
  Subject: Re: ll4 giving me trouble with 4-grams
  To: mercevg mercevg@
 
 
  Hi Merce,
 
  Could you send me whatever error output you are getting, plus a small
  sample of your ngram file?
 
  Thanks!
  Ted
 
  On Wed, Mar 27, 2013 at 8:12 AM, mercevg mercevg@ wrote:
   Hi,
  
    I would like to know how to calculate 4-grams with statistic.pl using the
    log-likelihood ratio.
  
   To calculate 3-grams I've run the program as follows:
   statistic.pl --ngram 3 tmi3.pm three.ngram.tmi3 three.ngram
  
   But using log-likelihood ratio it doesn't work.
  
   Thanks
  
   Mercè
  
  
 
 
 





[ngram] Re: Fwd: ll4 giving me trouble with 4-grams

2013-03-27 Thread mercevg
Hi Bridget,

I've been doing the same process as you, but the error continues to occur.

My test.4 file contains:

3214
procesamiento<>del<>lenguaje<>natural<>9 19 55 22 19 10 9 9 20 16 18 9 9 9 16
de<>recuperación<>de<>información<>8 274 17 286 33 8 18 9 17 16 27 8 8 8 16
la<>recuperación<>de<>información<>7 115 17 286 33 8 54 11 17 16 27 8 7 11 16
el<>procesamiento<>del<>lenguaje<>6 95 19 57 22 11 9 6 10 9 20 6 6 6 9
recuperación<>de<>información<>y<>5 15 287 30 63 15 14 5 25 11 5 14 5 5 5

Then I run the Log Likelihood for 4-grams 
statistic.pl --ngram 4 ll test.4ll test.4

And this is the error message:
Use of uninitialized value $Text::NSP::Measures::4D::expected_values in string 
eq at /etc/perl/Text/NSP/Measures/4D.pm line 869, SRC line 816.^C

Thank you for your help!

Mercè

--- In ngram@yahoogroups.com, Bridget McInnes btmcinnes@... wrote:

 Hi Mercè,
 
 Would you send me your file? I am not able to reproduce the error. I
 apologize if you already sent it. I am not seeing it in the thread.
 
 I put what I did to test it below so you could reproduce what I have done
 on an example test set.
 
 I will check on:
 In the folder MyNSP/man/man3 I've got Text::NSP::Measures::4D::MI::ll.3pm
 
 There shouldn't be a ll.3pm in 4D. I must have something wrong in there.
 
 Thanks,
 
 Bridget
 -
 
 Here is what I am doing:
 
 The text file contains the following:
 this is a test sentence
 just a sentence
 this is a test sentence
 
 I save that to a file called test.txt.
 
 Then I run the following:
 bridget@atlas:~/nsp-test$ count.pl --ngram 4 test.4 test.txt
 
 The test.4 file contains:
 10
 this<>is<>a<>test<>2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
 is<>a<>test<>sentence<>2 2 3 2 3 2 2 2 2 2 2 2 2 2 2
 sentence<>this<>is<>a<>1 2 1 1 2 1 1 1 1 1 1 1 1 1 1
 test<>sentence<>just<>a<>1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
 just<>a<>sentence<>this<>1 1 3 2 1 1 1 1 1 1 1 1 1 1 1
 a<>sentence<>this<>is<>1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
 a<>test<>sentence<>just<>1 2 1 2 1 1 1 1 1 1 1 1 1 1 1
 sentence<>just<>a<>sentence<>1 2 1 3 3 1 1 1 1 1 1 1 1 1 1
 
 Then I run the Log Likelihood for 4-grams over it:
 bridget@atlas:~/nsp-test$ statistic.pl --ngram 4 ll test.4ll test.4
 
 Please note here that the input file is the output generated by count.pl in
 the step above (test.4). If yours was produced differently, that may be the
 cause of the error.
 
 The test.4ll contains:
 10
 is<>a<>test<>sentence<>1 29.8708 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2
 this<>is<>a<>test<>2 29.6804 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
 sentence<>just<>a<>sentence<>3 27.3805 1 2 1 3 3 1 1 1 1 1 1 1 1 1 1
 just<>a<>sentence<>this<>4 22.4273 1 1 3 2 1 1 1 1 1 1 1 1 1 1 1
 sentence<>this<>is<>a<>5 19.9354 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1
 test<>sentence<>just<>a<>5 19.9354 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
 a<>sentence<>this<>is<>5 19.9354 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
 a<>test<>sentence<>just<>5 19.9354 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1
 
 Let me know if you get anything different.
 
 
 On Wed, Mar 27, 2013 at 11:45 AM, mercevg mercevg@... wrote:
 
 
 
  Ted,
 
  After changing the command line, the following error message appears:
 
  Use of uninitialized value $Text::NSP::Measures::4D::expected_values in
  string eq at /etc/perl/Text/NSP/Measures/4D.pm line 839, SRC line 1265.^C
 
  Could it be due to files that are already installed?
  In the folder MyNSP/man/man3 I've got
  Text::NSP::Measures::4D::MI::ll.3pm
 
  Thank you,
  Mercè
 
  --- In ngram@yahoogroups.com, Ted Pedersen tpederse@ wrote:
  
   I think there is a slight typo in your command:
  
   statistic.pl --ngram 4 ll4.pm output.txt input.txt
  
   (the module name should be ll4.pm)
  
   I hope this helps! Let me know if you continue to have any trouble...
  
   Good luck,
   Ted
  
   On Wed, Mar 27, 2013 at 9:06 AM, mercevg mercevg@ wrote:
Ted,
   
I've received your answer without problem. I'll try to follow up with
  another email address.
   
A sample of my 4-grams file:
procesamiento<>del<>lenguaje<>natural<>9 19 55 22 19 10 9 9 20 16 18 9 9 9 16
recuperación<>de<>información<>textual<>4 15 287 30 5 15 14 4 25 4 5 14 4 4 4
estadístico<>del<>lenguaje<>natural<>3 5 55 22 19 3 3 3 20 16 18 3 3 3 16
aparición<>en<>el<>documento<>2 4 93 95 22 3 2 3 18 6 4 2 3 2 3
   
Command line:
statistic.pl --ngram 4 ll.3pm 4-grams-ll.txt 4-grams.txt
   
Program answer:
Measure not defined for 4-grams
   
I've got Text-NSP v.1.25.
   
Thank you.
Mercè
   
--- In ngram@yahoogroups.com, Ted Pedersen tpederse@ wrote:
   
Merce, I got an email error when responding directly to your yahoo.es
account. Could you follow up with another email address or use the
group...?
   
Thanks,
Ted
   
   
-- Forwarded message --
From: Ted Pedersen tpederse@
Date: Wed, Mar 27, 2013 at 8:29 AM
Subject: Re: ll4 giving me trouble with 4-grams
To: mercevg mercevg@
   
   
Hi Merce,
   
Could you send me whatever error output you are getting, plus a small
sample of your ngram file?
   
Thanks!
Ted
   
On Wed, Mar 27, 2013 at 8:12 AM, mercevg mercevg@ wrote:
 Hi,

 I would like to know how to calculate with Statistical.pl 4-grams

[ngram] Re: ngrams with hyphen

2011-04-22 Thread mercevg
Ted,

Thanks, I've added this regular expression to my tokens file and it works well.

One more comment about that:

In my corpus I have some interesting bigrams such as
in-band signalling
in-call rearrangement
in-slot signalling

If I filter in as a stopword, I can't get these kinds of bigrams from my 
corpus. On the other hand, if in is not on my stopwords list, I retrieve 
these bigrams but I also get more bigrams of no interest, such as

in Recommendation
in Figure  
in order

My question is: Can I filter and retrieve these two groups of bigrams at the 
same time? 
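One possibility, sketched here from how count.pl stop lists work rather than tested on this corpus: each line of a stop file is a Perl regular expression, and with @stop.mode=OR an ngram is removed when any of its words matches. If the hyphenated forms are kept as single tokens (using the --token file from the reply below), then an anchored entry matches the standalone token in but not in-band. A hypothetical stop file, call it stop-in.txt, could look like this:

@stop.mode=OR
/^in$/

count.pl --token token.txt --stop stop-in.txt test.out corpus.txt

The anchors matter: a pattern like /\bin\b/ would also match inside in-band and would throw away the bigrams you want to keep.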
 
Thank you for your help,

Mercè


--- In ngram@yahoogroups.com, Ted Pedersen tpederse@... wrote:

 Greetings Merce,
 
 This is fairly easy to handle via the --token option. You simply specify a
 regular expression that says a token is a string followed by a - followed by
 a string. You can customize a --token file many ways, but the following
 example will handle hyphenated words. Please do let us know if additional
 questions arise!
 
 linux@linux:~ count.pl test.out test.txt --token token.txt
 
 linux@linux:~ more test.out
 13
 cell-phone<>It<>1 1 1
 the<>village-shop<>1 1 1
 s<>extra-nice<>1 1 1
 village-shop<>today<>1 1 1
 bought<>a<>1 1 1
 went<>to<>1 1 1
 a<>cell-phone<>1 1 1
 i<>went<>1 1 1
 today<>and<>1 1 1
 It<>s<>1 1 1
 and<>I<>1 1 1
 I<>bought<>1 1 1
 to<>the<>1 1 1
 
 linux@linux:~ cat test.txt
 i went to the village-shop today, and I bought a cell-phone. It's
 extra-nice.
 
 linux@linux:~ cat token.txt
 /\w+\-\w+/
 /\w+/
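 A hedged note on ordering, in case it matters for anyone adapting this: the
 token definitions appear to be tried in the order they are listed, so the
 hyphen pattern needs to stay above the plain /\w+/ line. With the order
 reversed, as in the sketch below, village-shop would come out as the two
 tokens village and shop instead of one:
 
 /\w+/
 /\w+\-\w+/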
 
 Enjoy,
 Ted
 
 On Wed, Apr 20, 2011 at 2:20 PM, mercevg mercevg@... wrote:
 
 
 
  Dear all,
 
  I would like to know if it's possible to get a list of ngrams with a hyphen
  inside, maybe during the tokenization process.
 
  For example, I want to get these bigrams:
  - call-connected signal
  - clear-back signal
  - clear-forward signal
 
  Instead of two bigrams for each one:
   - call<>connected<>179 2608 527
   connected<>signal<>189 320 9176
  
   - clear<>back<>283 1115 733
   back<>signal<>157 380 9176
  
   - clear<>forward<>632 1115 877
   forward<>signal<>493 1547 9176
 
  Thanks a lot,
 
  Mercè
 
   
 
 
 
 
 -- 
 Ted Pedersen
 http://www.d.umn.edu/~tpederse





[ngram] Ngrams without line break

2009-07-01 Thread mercevg
Dear all,

I would like to know if it's possible to get ngrams that do not contain line 
breaks from the corpus. I'll try to explain clearly: if the input text file is

first line of text
 second line
 And a third line of text

Then, with count.pl we'll get two bigrams containing line breaks: 

text second
line And

Or trigrams:
of text second
text second line
second line And

And so on.

Taking into account these outputs, and after reading the help text, I don't know if 
I can change the default count.pl options to get all ngrams from the corpus except 
the ngrams that combine words at the end of one sentence with words at the 
beginning of the next. That is, ngrams that do not contain line breaks.

Best wishes,
Mercè









[ngram] Re: Ngrams without line break

2009-07-01 Thread mercevg
Dear Ted,

In my case, I would like to get all the ngrams except those that cross over the 
end of line. In your example:

the cat is
my friend the
cat is my friend

I don't want to get ngrams like is my and the cat, the ones that have a new 
line in the middle of them.

As you said, by default count.pl simply ignores end of line markers. But is it 
possible to not ignore end-of-line markers?
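For what it's worth, count.pl appears to have a switch for exactly this case: --newLine prevents ngrams from spanning the new-line character, so pairs like text second and line And are simply never generated. A minimal sketch, assuming a reasonably recent NSP release (check count.pl --help to confirm the option is in yours):

count.pl --newLine test.out test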

Thanks a lot!
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen duluth...@... wrote:

 Greetings Merce,
 
 To make sure I understand correctly, it sounds like you *only* want to
 see those ngrams that contain a line break. For example, if you run
 count.pl as follows on your test file
 
 first line of text
 second line
 And a third line of text
 
 count.pl test.out test
 
 talisker(8): more test.out
 11
 line<>of<>2 3 2
 of<>text<>2 2 2
 line<>And<>1 3 1
 And<>a<>1 1 1
 a<>third<>1 1 1
 second<>line<>1 1 3
 third<>line<>1 1 3
 first<>line<>1 1 3
 text<>second<>1 1 1
 
 You will get the bigrams that cross over the end of line (text<>second,
 line<>And), but you also get all the other ngrams too...and so
 it sounds to me like you only want the ones that cross over the new
 line markers, and nothing else. Is that accurate?
 
 By default count.pl simply ignores end of line markers (the behavior
 you see above). So, it's not so much that the ngram includes the new
 line, it simply ignores it. So with a file like
 
 the cat is
 my friend the
 cat is my friend
 
 the 2 occurrences of the cat would be considered identical, even
 though the second could be thought of as having a new line in the
 middle of it (but we essentially ignore that).
 
 So...at the moment at least I'm not sure how to limit the output to
 only those ngrams that are made by crossing over a new line
 marker... But, let me make sure I am understanding things correctly
 (so do let me know if I'm wrong) and I'll give this a little more
 thought too.
 
 Cordially,
 Ted
 
 
 On Wed, Jul 1, 2009 at 12:15 PM, mercevg merc...@... wrote:
 
 
  Dear all,
 
  I would like to know if it's possible to get ngrams that do not contain line
  breaks from the corpus. I'll try to explain clearly: if the input text file
  is
 
  first line of text
  second line
  And a third line of text
 
  Then, with count.pl we'll get two bigrams containing line breaks:
 
  text second
  line And
 
  Or trigrams:
  of text second
  text second line
  second line And
 
  And so on.
 
  Taking into account these outputs, and after reading the help text, I don't know
  if I can change the default count.pl options to get all ngrams from the corpus
  except the ngrams that combine words at the end of one sentence with words at
  the beginning of the next. That is, ngrams that do not contain line breaks.
 
  Best wishes,
  Mercè
 
  
 
 
 
 -- 
 Ted Pedersen
 http://www.d.umn.edu/~tpederse





[ngram] Re: Problem with a token

2008-02-14 Thread mercevg
Patrick, Ted,

I added use locale; in line 83 but this can't improve my results:
words containing the character l·l (like intel·ligència)are not
included in the results list.

But it is important to say that I added as tokens all the accents,
diaereses and apostrophes that are used in the Catalan corpus, and I have
had good results. I think that is the solution for those kinds of
characters, except for the l·l (l geminada).

Best regards,
Mercè
 

 Greetings all,
 
 Thanks for the very interesting discussion. This is quite helpful.
 
 Just a short note to confirm that we have not yet added the
 
 use locale;
 
 directive to NSP - we haven't had a release in some time, but this will surely
 be included when we do. I am thinking it might not be a bad idea to have a
 release simply to take care of this. Thanks to Patrick for pointing this out
 in the first place, and then reminding us of that earlier discussion.
 
 I would be very interested to know if this resolves the problems with Catalan,
 French, and Spanish, btw. Please do update us and the rest of the list, as I
 suspect this is a fairly common problem.
 
 Cordially,
 Ted
 
 On Feb 13, 2008 11:07 AM, mercevg [EMAIL PROTECTED] wrote:
 
 
 
  Patrick,
 
   I have checked the latest version of NSP (v.1.03) and count.pl doesn't
   contain use locale;. I'll try to add use locale; at line 83; maybe
   your suggestion is the solution for me.
  
   More or less we have the same problems with accents and other kinds of
   characters when working with French and Catalan or Spanish.
 
 
   Thank you very much!
 
   Mercè
 
   
Mercè,
   
I have not checked the latest version of NSP to see if count.pl and the
other files contain use locale; as I suggested some time ago. The
simple inclusion of such a statement at the beginning of the Perl
scripts fixed the problems I had for French. You can have a look at
this for more information:
   
http://tech.groups.yahoo.com/group/ngram/message/159
   
Hope this helps...
   
Regards,
Patrick
   
 
 
 
 -- 
 Ted Pedersen
 http://www.d.umn.edu/~tpederse





[ngram] Re: plans for version 1.05

2008-02-14 Thread mercevg
Ted,

I have two suggestions to improve the new version.

1. I have problems extracting bigrams using Fisher's exact test - left
sided and Fisher's exact test - right sided. Could you fix these two
measures?

The error message:

Can't locate Text/NSP/Measures/2D/left.pm in @INC (@INC contains:
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl .) at
/usr/bin/statistic.pl line 452.

We don't know how to resolve this problem, because NSP itself installed
correctly. Does anyone else have this problem?
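The path in the message suggests statistic.pl was handed the measure name left, which it expands to Text/NSP/Measures/2D/left.pm; the Fisher measures actually live one directory deeper, under 2D/Fisher/. A quick sanity check that they are installed (standard Perl tooling, nothing NSP-specific):

perldoc -l Text::NSP::Measures::2D::Fisher::left
perldoc -l Text::NSP::Measures::2D::Fisher::right

If both resolve, the problem is most likely the measure name given on the command line rather than the installation; the exact short name statistic.pl accepts for the Fisher measures varies between releases, so treat this as a guess to verify rather than a known fix.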

2. It would be very interesting to extract trigrams using all twelve
statistical measures. Would that be possible?

Best wishes,
Mercè


 Greetings all, 
 
 I'm in the process of collecting up the various bug reports that we've
 gotten since version 1.03 was released in September 2006, and I'll
 resolve those in 1.05. Here's what I have so far...
 
 1) Incorporate use locale throughout the package (suggested by Patrick
 Drouin long ago). This will make for more convenient handling of
 non-English text.
 
 2) fix Testing/statistic/t2 missing message during install (reported
 most recently by Mary Taffet, others previously)
 
 3) fix Makefile.PL to allow for cleaner Windows install (reported by
 Richard Churchill)
 
 I will keep looking through the mailing list archives and my own
 email, but those seem like the main issues that have arisen. However,
 if you recall something else, or there is some feature or change you
 are interested in seeing, please let me know. As you can tell NSP
 releases have slowed considerably in recent years, so this is likely
 to be the only release for some time to come, so please do let me know
 asap if there are other issues. Comments and suggestions are of course
 welcome. 
 
 Cordially,
 Ted





[ngram] Re: Problem with a token

2008-02-13 Thread mercevg
Bjoern,

Yes, I think so! 

I work with UTF-8 (corpus, stop list, etc.). I thought that the
problem with the character l·l was similar to the accents, because I
added as tokens all the kinds of accents used in Catalan and Spanish and
the problem was solved, but not in this case. For this reason, I tried
to add this character to my tokens file and to my stopwords list, but
it doesn't work.
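One more hedged thing to try on the token side: character classes spelled out as [a-zA-Z...] never cover every letter, while the Unicode property \p{L} does, and it is available in the Perl versions NSP runs on. Assuming the corpus really is read as UTF-8, a token file along these lines (the interpunct added explicitly) might keep intel·ligència as a single token:

/[\p{L}·]+/

count.pl --token token.txt results.txt corpus.txt

Here token.txt, results.txt and corpus.txt are just placeholder names for this sketch.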

Mercè



 Hi there,
 
 mercevg wrote:
  I have some problems filtering n-grams in a corpus that contains words
  with this character: l·l. This character is frequently used in
  Catalan documents. In my results list I can't retrieve n-grams with
  words that contain this character.
 
  In my tokens file I have inserted the line /[a-zA-Z·]+/ (with the ·),
  but the results are not satisfactory.
 
  I have also tried to insert in my stop list the line /l·l/, but it
  doesn't work at all, because in my results list I have bigrams like
  intel<>ligència. In this case, one word is divided into two words.
 
  Do you know what the problem is?
 
 
 This sounds like a character set / file encoding issue. All files  
 involved (corpus, filters etc.) should have the same encoding. I am  
 not sure about the specific ISO encoding for Catalan. However, I  
 suppose Catalan is covered by iso-8859-1. utf-8 should work anyway,  
 though.
 --
 Best regards,
 Bjoern Wilmsmann





[ngram] Re: Problem with a token

2008-02-13 Thread mercevg
Patrick,

I have checked the latest version of NSP (v.1.03) and count.pl doesn't
contain use locale;. I'll try to add use locale; at line 83; maybe
your suggestion is the solution for me.
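For what it's worth, the change being discussed is just the pragma near the top of the script, before any of the token matching runs. A minimal sketch of what that region of count.pl might look like after the edit; only the use locale; line is the addition, and the neighbouring lines are illustrative:

#!/usr/bin/perl -w
use Getopt::Long;   # already in count.pl
use locale;         # the added line, around line 83 in this version

The exact line number matters less than having the pragma in effect before the regular expressions that split the text into tokens are applied.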

More or less we have the same problems with accents and other kinds of
characters when working with French and Catalan or Spanish.

Thank you very much!

Mercè


 Mercè,
 
 I have not checked the latest version of NSP to see if count.pl and the
 other files contain use locale; as I suggested some time ago. The
 simple inclusion of such a statement at the beginning of the Perl
 scripts fixed the problems I had for French. You can have a look at this
 for more information:
 
 http://tech.groups.yahoo.com/group/ngram/message/159
 
 Hope this helps...
 
 Regards,
 Patrick