[ngram] Re: Problem with a token
Patrick, Ted,

I added use locale; in line 83, but this did not improve my results: words containing the character l·l (like intel·ligència) are still not included in the results list.

It is important to say, though, that I added as tokens all the accents, diaereses and apostrophes used in my Catalan corpus, and I got good results. I think that is the solution for this kind of character, except for the l·l (l geminada).

Best regards,
Mercè

Greetings all,

Thanks for the very interesting discussion. This is quite helpful. Just a short note to confirm that we have not yet added the use locale; directive to NSP - we haven't had a release in some time, but this will surely be included when we do. I am thinking it might not be a bad idea to have a release simply to take care of this.

Thanks to Patrick for pointing this out in the first place, and then reminding us of that earlier discussion. I would be very interested to know if this resolves the problems with Catalan, French, Spanish, btw. Please do update us and the rest of the list, as I suspect this is a fairly common problem.

Cordially,
Ted

On Feb 13, 2008 11:07 AM, mercevg [EMAIL PROTECTED] wrote:

Patrick,

I have checked the latest version of NSP (v.1.03) and count.pl doesn't contain use locale;. I'll try to add use locale; in line 83; maybe your suggestion is my solution. We have more or less the same problems with accents and other kinds of characters when working with French, Catalan or Spanish.

Thank you very much!
Mercè

Mercè,

I have not checked the latest version of NSP to see if count.pl and the other files contain use locale; as I suggested some time ago. The simple inclusion of such a statement at the beginning of the Perl scripts fixed the problems I had for French. You can have a look at this for more information:
http://tech.groups.yahoo.com/group/ngram/message/159

Hope this helps...

Regards,
Patrick

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
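Mercè's approach of listing the accented characters in the token definition can be pictured with a small sketch. This is Python, not NSP: the character list, the names CATALAN_EXTRA and tokens, and the sample sentences are assumptions made for illustration, and the sketch works on already-decoded text, which the thread does not establish for count.pl.

```python
import re

# Hypothetical list of Catalan letters outside a-z/A-Z, plus the middle dot
# of "l·l" and the apostrophe. Not taken from any NSP token file.
CATALAN_EXTRA = "àèéíïòóúüçÀÈÉÍÏÒÓÚÜÇ·'"

# One token is a run of ASCII letters and the extra characters above.
TOKEN = re.compile("[a-zA-Z" + re.escape(CATALAN_EXTRA) + "]+")


def tokens(text):
    """Return the tokens found in an already-decoded (str) text."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    print(tokens("la intel·ligència artificial"))
    print(tokens("l'àvia és col·laboradora"))
```

On decoded text, a word with l·l stays in one piece once the middle dot is part of the character class. If it still splits in practice, the pattern and the corpus are probably not being read as the same characters.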
[ngram] Re: Problem with a token
Bjoern,

Yes, I think so! I work with UTF-8 (corpus, stop list, etc.). I thought that the problem with the character l·l was similar to the one with accents, because I added as tokens all the kinds of accents used in Catalan and Spanish and that problem was solved, but not in this case. For this reason, I tried to add this character to my tokens file or to my stop list, but it doesn't work.

Mercè

Hi there,

mercevg wrote:

I have some problems filtering n-grams in a corpus that contains words with this character: l·l. This character is frequently used in Catalan documents. In my results list I can't retrieve n-grams with words that contain this character. In my tokens file I have inserted the line /[a-zA-Z·]+/ (with ·), but the results are not satisfactory. I have also tried to insert the line /l·l/ in my stop list, but it doesn't work at all, because in my results list I have bi-grams like intelligència. In this case, one word is divided into two words. Do you know what the problem is?

This sounds like a character set / file encoding issue. All files involved (corpus, filters etc.) should have the same encoding. I am not sure about the specific ISO encoding for Catalan. However, I suppose Catalan is covered by iso-8859-1. utf-8 should work anyway, though.

--
Best regards,
Bjoern Wilmsmann
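Bjoern's point about mixed encodings can be illustrated with a small byte-level sketch. It is written in Python rather than Perl and is not NSP code: the function name pieces and the pattern are assumptions for illustration. The pattern is the token definition quoted above with è added, as Mercè describes doing for accents. The middle dot is a single byte (0xB7) in iso-8859-1 but two bytes (0xC2 0xB7) in utf-8, so a pattern saved in one encoding does not line up with a corpus saved in the other.

```python
import re

WORD = "intel·ligència"
# Token definition from the thread, extended with "è" for this illustration.
TOKEN_PATTERN = "[a-zA-Zè·]+"


def pieces(corpus_encoding, pattern_encoding):
    """Match byte against byte, as a tool that never decodes its input would.

    The word is encoded with corpus_encoding and the pattern with
    pattern_encoding; the result is the list of byte runs that matched.
    """
    corpus = WORD.encode(corpus_encoding)
    pattern = re.compile(TOKEN_PATTERN.encode(pattern_encoding))
    return pattern.findall(corpus)


if __name__ == "__main__":
    for corpus_enc, pattern_enc in [
        ("utf-8", "utf-8"),
        ("iso-8859-1", "iso-8859-1"),
        ("utf-8", "iso-8859-1"),
    ]:
        found = pieces(corpus_enc, pattern_enc)
        print(corpus_enc, pattern_enc, len(found), found)
```

When both files share an encoding the word comes back as one piece. With a utf-8 corpus and an iso-8859-1 pattern it breaks at the l·l and again at the è, which matches the kind of split described in the thread.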
[ngram] Re: Problem with a token
Patrick,

I have checked the latest version of NSP (v.1.03) and count.pl doesn't contain use locale;. I'll try to add use locale; in line 83; maybe your suggestion is my solution. We have more or less the same problems with accents and other kinds of characters when working with French, Catalan or Spanish.

Thank you very much!
Mercè

Mercè,

I have not checked the latest version of NSP to see if count.pl and the other files contain use locale; as I suggested some time ago. The simple inclusion of such a statement at the beginning of the Perl scripts fixed the problems I had for French. You can have a look at this for more information:
http://tech.groups.yahoo.com/group/ngram/message/159

Hope this helps...

Regards,
Patrick