[ngram] Re: Problem with a token

2008-02-14 Thread mercevg
Patrick, Ted,

I added use locale; at line 83, but it did not improve my results:
words containing l·l (like intel·ligència) are still not
included in the results list.

That said, I have added as tokens all the accents, diaereses and
apostrophes used in my Catalan corpus, and I have had good results.
I think this is the solution for this kind of character, except for
the l·l (l geminada).

Best regards,
Mercè
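
A minimal sketch of the effect described above, written in Python rather than in NSP's Perl. The two patterns and the sample sentence are hypothetical and are not taken from NSP or from the actual tokens file. The sketch assumes the text has been decoded to Unicode characters and that the dot is U+00B7 (MIDDLE DOT). It only shows why listing the accented letters in the token definition keeps words whole; it does not reproduce the l·l failure.

```python
import re

# Hypothetical token definitions, for illustration only.
ASCII_ONLY = re.compile(r"[a-zA-Z]+")
# ASCII letters plus the accented letters, the apostrophe and U+00B7.
CATALAN = re.compile(r"[a-zA-ZàèéíïòóúüçÀÈÉÍÏÒÓÚÜÇ\u00b7']+")


def tokenize(pattern, text):
    """Return every maximal run of characters accepted by the pattern."""
    return pattern.findall(text)


# \u00b7 is written as an escape so the dot is unambiguous.
sample = "la intel\u00b7ligència artificial"

ascii_tokens = tokenize(ASCII_ONLY, sample)
catalan_tokens = tokenize(CATALAN, sample)

print(ascii_tokens)    # the word breaks at the dot and at the accent
print(catalan_tokens)  # the word stays in one piece
```

With the ASCII-only class the word is cut both at the dot and at the accented letter; with the extended class it survives as one token.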
 

 Greetings all,
 
 Thanks for the very interesting discussion. This is quite helpful.
 
 Just a short note to confirm that we have not yet added the
 
 use locale;
 
 directive to NSP - we haven't had a release in some time, but this will
 surely be included when we do. I am thinking it might not be a bad idea
 to have a release simply to take care of this. Thanks to Patrick for
 pointing this out in the first place, and then reminding us of that
 earlier discussion.
 
 I would be very interested to know if this resolves the problems with
 Catalan, French, Spanish, btw. Please do update us and the rest of the
 list, as I suspect this is a fairly common problem.
 
 Cordially,
 Ted
 
 On Feb 13, 2008 11:07 AM, mercevg [EMAIL PROTECTED] wrote:
 
 
 
  Patrick,
 
   I have checked the latest version of NSP (v.1.03) and count.pl doesn't
   contain use locale;. I'll try adding use locale; at line 83; maybe
   your suggestion is my solution.
 
   We have more or less the same problems with accents and other kinds of
   characters when working with French, Catalan or Spanish.
 
 
   Thank you very much!
 
   Mercè
 
   
Mercè,
   
I have not checked the latest version of NSP to see if count.pl and the
other files contain use locale; as I suggested some time ago. The
simple inclusion of such a statement at the beginning of the Perl
scripts fixed the problems I had for French. You can have a look at
this for more information:
   
http://tech.groups.yahoo.com/group/ngram/message/159
   
Hope this helps...
   
Regards,
Patrick
   
 
 
 
 -- 
 Ted Pedersen
 http://www.d.umn.edu/~tpederse





[ngram] Re: Problem with a token

2008-02-13 Thread mercevg
Bjoern,

Yes, I think so! 

I work with UTF-8 (corpus, stop list, etc.). I thought that the
problem with the character l·l was similar to the one with accents:
I added as tokens all the accents used in Catalan and Spanish and
that problem was solved, but the same fix does not work in this case.
For this reason, I tried adding this character to my tokens file and
to my stop list, but neither works.

Mercè



 Hi there,
 
 mercevg wrote:
  I have some problems filtering n-grams in a corpus that contains words
  with this character: l·l. This character is frequently used in
  Catalan documents. In my results list I can't retrieve n-grams with
  words that contain this character.
 
  In my tokens file I have inserted the line /[a-zA-Z·]+/ (with ·),
  but the results are not satisfactory.
 
  I have also tried inserting the line /l·l/ in my stop list, but it
  doesn't work at all, because my results list has bi-grams like
  intelligència. In this case, one word is divided into two words.
 
  Do you know what the problem is?
 
 
 This sounds like a character set / file encoding issue. All files
 involved (corpus, filters etc.) should have the same encoding. I am
 not sure about the specific ISO encoding for Catalan. However, I
 suppose Catalan is covered by ISO-8859-1. UTF-8 should work anyway,
 though.
 --
 Best regards,
 Bjoern Wilmsmann
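
One way to check the encoding hypothesis above is to list the exact code points that occur in the corpus and in the tokens file. The sketch below is in Python rather than NSP's Perl, and the helper name describe_non_ascii and both sample strings are hypothetical. It rests on an assumption that nothing in this thread confirms: that the dot in the corpus might not be the same character as the dot in the pattern. Unicode has both U+00B7 (MIDDLE DOT) and the single character U+0140 (LATIN SMALL LETTER L WITH MIDDLE DOT), and on screen they are hard to tell apart.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each distinct non-ASCII char."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in seen]


# Two hypothetical spellings that look alike when displayed.
with_middle_dot = "intel\u00b7lig\u00e8ncia"  # l + MIDDLE DOT + l
with_l_dot = "inte\u0140lig\u00e8ncia"        # single "l with middle dot" + l

for label, word in (("middle dot", with_middle_dot), ("l with dot", with_l_dot)):
    print(label, describe_non_ascii(word))

# NFKC normalization rewrites U+0140 as "l" followed by U+00B7, so both
# spellings become identical before any tokenization takes place.
normalized = unicodedata.normalize("NFKC", with_l_dot)
print(normalized == with_middle_dot)
```

If the corpus and the tokens file turn out to use different code points for the dot, normalizing both files the same way before counting is one possible remedy; if they already agree, the cause lies elsewhere.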





[ngram] Re: Problem with a token

2008-02-13 Thread mercevg
Patrick,

I have checked the latest version of NSP (v.1.03) and count.pl doesn't
contain use locale;. I'll try adding use locale; at line 83; maybe
your suggestion is my solution.

We have more or less the same problems with accents and other kinds of
characters when working with French, Catalan or Spanish.

Thank you very much!

Mercè


 Mercè,
 
 I have not checked the latest version of NSP to see if count.pl and the
 other files contain use locale; as I suggested some time ago. The
 simple inclusion of such a statement at the beginning of the Perl
 scripts fixed the problems I had for French. You can have a look at
 this for more information:
 
 http://tech.groups.yahoo.com/group/ngram/message/159
 
 Hope this helps...
 
 Regards,
 Patrick