[ngram] count.pl for unicode documents [1 Attachment]

amir.jad...@yahoo.com [ngram] Tue, 10 May 2016 06:20:13 -0700

I'm trying to run count.pl on a directory of Unicode documents (a sample 
document is attached) using Perl 5 (v5.18.2). The output is a list of 
digits and punctuation marks, without any of the Unicode words:
 2732
 .<>1589
 :<>626
 2<>19
 !<>17
 10<>16
 4<>14
 13<>13
 12<>13
 20<>12
 9<>11
 15<>11
 3<>10
 5<>10
Is it possible to make count.pl tokenize the input file on whitespace only?

There is a --token option which may be useful, but I don't know how to use it.
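
To make the request concrete, here is a minimal Python sketch of the behaviour I'm after. It is not count.pl and says nothing about how count.pl works internally; the function names (count_tokens, format_counts) and the sample string are made up for illustration. It splits on whitespace only, so non-ASCII words survive as tokens, and prints the total followed by token<>count lines like the output above.

```python
from collections import Counter


def count_tokens(text):
    """Count whitespace-delimited tokens.

    str.split() with no argument splits on any run of Unicode
    whitespace, so non-ASCII words are kept whole.
    """
    return Counter(text.split())


def format_counts(counts):
    """Return the total token count followed by 'token<>count' lines."""
    total = sum(counts.values())
    lines = [str(total)]
    # Most frequent first; ties are broken by token for stable output.
    for token, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"{token}<>{n}")
    return lines


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the attached document.
    sample = "سلام دنیا . سلام 2 !"
    print("\n".join(format_counts(count_tokens(sample))))
```

With whitespace-only splitting, the words are counted alongside the digits and punctuation instead of being dropped.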
