[ngram] formatting + punctuation removal

2011-08-17 Thread semiotica24
Sorry for the basic questions: 1. I need 2 versions of output for each list of bigrams and trigrams that I create using the various measures in count.pl and statistic.pl: one with the default statistics and one without. How do I format to exclude the statistics? e.g.:

Re: [ngram] formatting + punctuation removal

2011-08-17 Thread Ying Liu
Hi Patrick, You need to pre-process the text (data cleaning) to remove punctuation before running count.pl. In the same way, you need to post-process the output to get the bigrams or trigrams in the format you want. Thanks, Ying semiotica24 wrote: Sorry for the basic questions: 1. I need 2 versions of
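
The post-processing step could look roughly like the Python sketch below. It assumes each n-gram line in the output has the form `word1<>word2<>numbers` and that any line without `<>` (for example a leading total) can be skipped; that layout is an assumption here, and `strip_statistics` is a hypothetical helper, not part of the package.

```python
def strip_statistics(lines):
    """Keep only the words of each n-gram, dropping trailing scores and counts.

    Assumes lines look like 'w1<>w2<>numbers' (layout assumed, not confirmed
    in this thread). Lines without '<>' are skipped.
    """
    ngrams = []
    for line in lines:
        line = line.strip()
        if "<>" not in line:
            continue  # e.g. a header line holding only a total
        # Everything after the last '<>' is assumed to be statistics.
        words = line.rsplit("<>", 1)[0].split("<>")
        ngrams.append(" ".join(words))
    return ngrams


# Made-up sample lines, for illustration only.
sample = ["7", "i<>have<>1 2 1", "my<>friends<>1 1 1"]
for ngram in strip_statistics(sample):
    print(ngram)
```

Running the same function over both the count.pl and statistic.pl output would give the "without statistics" version of each list, while the untouched files remain the "with statistics" version.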

Re: [ngram] formatting + punctuation removal

2011-08-17 Thread Ted Pedersen
Hi Patrick, One additional idea might be to use the --token option, and say that you only want to consider alphanumerics as your tokens (which is what you will count). For example... marengo(129): cat test my friends, i have news i like ngrams Now without any token list, stop
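
To see why an alphanumeric-only token definition makes the punctuation disappear, here is a rough Python sketch of the idea outside the package. It assumes tokens are whatever matches the regex `\w+`; Python's `\w` is only an approximation of a Perl token definition, and `count_bigrams` is a hypothetical helper.

```python
import re
from collections import Counter


def count_bigrams(text, token_pattern=r"\w+"):
    """Count adjacent token pairs, where a token is any match of token_pattern."""
    tokens = re.findall(token_pattern, text)
    # Pair each token with its successor.
    return Counter(zip(tokens, tokens[1:]))


text = "my friends, i have news i like ngrams"

# Alphanumeric tokens only: the comma never becomes a token.
alnum_only = count_bigrams(text)

# Also treating punctuation marks as tokens (assumed pattern, for contrast).
with_punct = count_bigrams(text, r"\w+|[.,;:?!]")

print(("friends", "i") in alnum_only)
print(("friends", ",") in with_punct)
```

With the alphanumeric pattern, "friends" is followed directly by "i"; with punctuation allowed as a token, the comma sits between them and shows up in the bigrams.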

Re: [ngram] formatting + punctuation removal

2011-08-17 Thread bthomson
Hi Patrick, I thought I would throw my idea in as well :-) I tend to use the --nontoken option. It is kind of the flip side of Ted's. For example using Ted's example below: bridget@cheshire:~/test$ cat test.txt my friends, i have news i like ngrams and a nontoken file
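
The nontoken approach works from the other direction: anything matching the nontoken patterns is removed from the text before tokens are found. A minimal Python sketch of that idea, assuming a single pattern `[^\w\s]` (anything that is neither a word character nor whitespace); both the pattern and the `remove_nontokens` helper are illustrative assumptions, not the contents of the nontoken file used in this message.

```python
import re


def remove_nontokens(text, nontoken_pattern=r"[^\w\s]"):
    """Delete every match of nontoken_pattern, leaving the rest of the text as is."""
    return re.sub(nontoken_pattern, "", text)


cleaned = remove_nontokens("my friends, i have news i like ngrams")
print(cleaned)
```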