[ngram] formatting + punctuation removal
Sorry for the basic questions:

1. I need 2 versions of output for each list of bigrams and trigrams that I create using the various measures in count.pl and statistic.pl: one with the default statistics and one without. How do I format the output to exclude the statistics? e.g.:

    mobile<>phones<>100 280 384
    cellular<>phones<>96 214 384

    mobile phones
    cellular phones

2. I need to remove punctuation (. and ,). I've tried within my stopword list, but I don't have the tags quite right. How should I enter them in my stop file?

Thanks!
Patrick
Re: [ngram] formatting + punctuation removal
Hi Patrick,

You need to pre-process the text (data cleaning) to remove punctuation before running count.pl. In the same way, you need to post-process the output to get the bigrams or trigrams in the format you want.

Thanks,
Ying

semiotica24 wrote:
> 1. I need 2 versions of output for each list of bigrams and trigrams that I create using the various measures in count.pl and statistic.pl: one with the default statistics and one without. How do I format to exclude the statistics?
> 2. I need to remove punctuation . and , I've tried within my stopword list, but I don't have the tags quite right. How should I enter into my stop file?

Yahoo! Groups Links
* To visit your group on the web, go to: http://groups.yahoo.com/group/ngram/
* Your email settings: Individual Email | Traditional
* To change settings online go to: http://groups.yahoo.com/group/ngram/join (Yahoo! ID required)
* To change settings via email: ngram-dig...@yahoogroups.com ngram-fullfeatu...@yahoogroups.com
* To unsubscribe from this group, send an email to: ngram-unsubscr...@yahoogroups.com
* Your use of Yahoo! Groups is subject to: http://docs.yahoo.com/info/terms/
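The pre-processing step suggested above could look something like the following. This is only a minimal sketch in Python, not part of the Ngram Statistics Package: the function name `strip_punctuation` is hypothetical, and it assumes that only periods and commas need to be removed, as in the original question.

```python
import re

# Assumption: only periods and commas are removed, as asked in the question.
PUNCTUATION = re.compile(r"[.,]")


def strip_punctuation(text: str) -> str:
    """Replace each period or comma with a space, then collapse runs of spaces."""
    without_marks = PUNCTUATION.sub(" ", text)
    # Collapse spaces and tabs only, so line breaks in the input are preserved.
    return re.sub(r"[ \t]+", " ", without_marks).strip()


if __name__ == "__main__":
    print(strip_punctuation("my friends, i have news. i like ngrams."))
```

The cleaned text can then be written to a file and passed to count.pl as usual. Note that this sketch also splits numbers such as 3.14 into two tokens, so the pattern would need adjusting if decimals matter in your data.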
Re: [ngram] formatting + punctuation removal
Hi Patrick,

One additional idea might be to use the --token option and say that you only want to consider alphanumerics as your tokens (tokens are what gets counted). For example...

    marengo(129): cat test
    my friends, i have news i like ngrams

Now without any token list, stop list, etc...

    marengo(130): count.pl outa test
    marengo(131): cat outa
    24
    !<>!<>10 11 12
    .<>.<>3 4 4
    news<>!<>1 1 12
    have<>news<>1 1 1
    .<>ngrams<>1 4 1
    !<>i<>1 11 2
    ,<>i<>1 1 2
    i<>have<>1 2 1
    ngrams<>!<>1 1 12
    like<>.<>1 1 4
    friends<>,<>1 1 1
    i<>like<>1 2 1
    my<>friends<>1 1 1

Now I define a token file...

    marengo(132): cat token.txt
    /\w+/
    marengo(133): count.pl out test --token token.txt
    marengo(134): cat out
    7
    i<>have<>1 2 1
    news<>i<>1 1 2
    have<>news<>1 1 1
    like<>ngrams<>1 1 1
    i<>like<>1 2 1
    friends<>i<>1 1 2
    my<>friends<>1 1 1

Note that we only have alphanumerics... that might be the simplest thing to try first...

Hope this helps...
Ted

On Wed, Aug 17, 2011 at 4:05 PM, Ying Liu liux0...@umn.edu wrote:
> You need to pre-process the text (data cleaning) to remove punctuation before running count.pl. In the same way, you need to post-process the output to get the bigrams or trigrams in the format you want.

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
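To see why a token definition of /\w+/ makes the punctuation disappear, the effect can be approximated outside count.pl. The sketch below is an illustration in Python, not how count.pl is implemented: the function name `bigram_counts` and the sample sentence are made up for the example, and Python's `\w` may differ slightly from Perl's for non-ASCII text.

```python
import re
from collections import Counter


def bigram_counts(text: str, token_pattern: str = r"\w+") -> Counter:
    """Count adjacent token pairs, keeping only text that matches token_pattern."""
    # Anything the pattern does not match (here: punctuation and spaces) is skipped.
    tokens = re.findall(token_pattern, text)
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    # Hypothetical input with punctuation added for illustration.
    counts = bigram_counts("my friends, i have news! i like ngrams.")
    for (first, second), n in counts.items():
        print(f"{first}<>{second}<>{n}")
```

Because only runs of alphanumerics are kept as tokens, pairs such as "friends" followed by "," can never be formed, which matches the shorter bigram list in the second run above.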
Re: [ngram] formatting + punctuation removal
Hi Patrick,

I thought I would throw my idea in as well :-) I tend to use the --nontoken option. It is kind of the flip side of Ted's. For example, using Ted's example below:

    bridget@cheshire:~/test$ cat test.txt
    my friends, i have news i like ngrams

and a nontoken file containing a regex for each punctuation mark that you want to remove:

    bridget@cheshire:~/test$ cat nontokenfile
    /\./
    /\!/
    /\,/

You can run count.pl with the --nontoken option as follows:

    bridget@cheshire:~/test$ count.pl --ngram 2 --nontoken nontokenfile test.2 test.txt
    bridget@cheshire:~/test$ cat test.2
    7
    i<>have<>1 2 1
    news<>i<>1 1 2
    have<>news<>1 1 1
    like<>ngrams<>1 1 1
    i<>like<>1 2 1
    friends<>i<>1 1 2
    my<>friends<>1 1 1

This gives you some control over which punctuation you remove and which you keep (hyphens, for example).

For your first question on formatting, I didn't completely understand what you were asking. Do you not want the statistics in the output file after running statistic.pl? Or would you like a program to remove the statistics and the markers after running statistic.pl?

Thanks,
Bridget

On Wed, 17 Aug 2011, Ted Pedersen wrote:
> One additional idea might be to use the --token option, and say that you only want to consider alphanumerics as your tokens (which is what you will count).
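If the goal is a second copy of the list with the statistics and markers removed, a small post-processing script is one option. The following is a hedged sketch in Python rather than anything shipped with the package: the name `strip_statistics` is hypothetical, and it assumes each n-gram line separates its tokens with the `<>` marker and carries all of its numbers in the final field.

```python
from typing import Iterable, Iterator


def strip_statistics(lines: Iterable[str]) -> Iterator[str]:
    """Yield each n-gram as space-separated words, dropping the numeric fields."""
    for line in lines:
        line = line.strip()
        if "<>" not in line:
            continue  # skips the leading total-count line and any blank lines
        # Everything before the last "<>" is a token; the last field holds the numbers.
        *words, _numbers = line.split("<>")
        yield " ".join(words)


if __name__ == "__main__":
    sample = ["7", "i<>have<>1 2 1", "news<>i<>1 1 2"]
    for ngram in strip_statistics(sample):
        print(ngram)
```

Running the original output file through this gives the plain list, while the untouched file remains the version with statistics. Since the function drops only the last field, the same approach should carry over to trigrams, provided no token itself contains `<>`.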