Sorry for the basic questions:
1. I need 2 versions of output for each list of bigrams and trigrams that
I create using the various measures in count.pl and statistic.pl: one with the
default statistics and one without. How do I format the output to exclude the statistics?
e.g.:
Hi Patrick,
You need to pre-process the text (data cleaning) to remove
punctuation before running count.pl. In the same way, you
need to post-process the output to get the bigrams or trigrams
in the format you want.
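As a rough sketch of that post-processing step (not part of NSP itself): the Python below assumes each output line looks like token<>token<>numbers, with the n-gram's tokens separated by "<>" and the counts or scores after the last "<>". The function name strip_statistics is made up for illustration, and lines without "<>" (such as a leading total-count line) are skipped.

```python
def strip_statistics(lines):
    """Return each n-gram as a space-joined string, dropping the trailing numbers.

    Assumed line layout (an assumption, check it against your own output):
        token1<>token2<>...<>numbers
    """
    ngrams = []
    for line in lines:
        line = line.rstrip("\n")
        if "<>" not in line:
            continue  # e.g. a first line holding only the total n-gram count
        # Everything before the last "<>" is the n-gram; the rest is statistics.
        ngram_part, _, _stats = line.rpartition("<>")
        ngrams.append(" ".join(ngram_part.split("<>")))
    return ngrams


if __name__ == "__main__":
    # Hypothetical sample lines, made up for illustration only.
    sample = [
        "4",
        "my<>friends<>1 1 1",
        "i<>have<>1 2 1",
    ]
    for ngram in strip_statistics(sample):
        print(ngram)
```

Running it on a saved output file would mean passing the open file object in place of the sample list; the version with statistics is simply the untouched file.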
Thanks,
Ying
semiotica24 wrote:
Sorry for the basic questions:
1. I need 2 versions of
Hi Patrick,
One additional idea might be to use the --token option, and say that
you only want to consider alphanumerics as your tokens (which is what
you will count).
For example...
marengo(129): cat test
my friends, i have news
i like ngrams
Now without any token list, stop
Hi Patrick,
I thought I would throw my idea in as well :-) I tend to use the
--nontoken option, which is kind of the flip side of Ted's. For example, using
Ted's example below:
bridget@cheshire:~/test$ cat test.txt
my friends, i have news
i like ngrams
and a nontoken file