[ngram] formatting + punctuation removal

2011-08-17 Thread semiotica24
Sorry for the basic questions:
1. I need 2 versions of output for each list of bigrams and trigrams that 
I create using the various measures in count.pl and statistic.pl: one with the 
default statistics and one without.  How do I format the output to exclude the 
statistics?
e.g.:
mobile<>phones<>100 280 384
cellular<>phones<>96 214 384

mobile phones
cellular phones

2. I need to remove the punctuation marks . and , from the output. I've tried 
doing this within my stopword list, but I don't have the tags quite right. 
How should I enter them in my stop file?

Thanks!

Patrick



Re: [ngram] formatting + punctuation removal

2011-08-17 Thread Ying Liu
Hi Patrick,

You need to pre-process the text (data cleaning) to remove
punctuation before running count.pl. Likewise, you need to
post-process the output to get the bigrams or trigrams in the
format you want.

Thanks,
Ying
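
As a rough illustration of the post-processing step described above, here is a minimal Python sketch. It assumes each n-gram line has the form token<>token<>statistics and that any line without a <> marker (such as the total on the first line) can be skipped. The function name strip_statistics is hypothetical and is not part of the Ngram Statistics Package.

```python
def strip_statistics(lines):
    """Return plain n-grams ("mobile phones") from count.pl-style lines.

    Assumes the tokens are separated by "<>" and that the text after the
    last "<>" holds the counts or scores, which are discarded.
    """
    ngrams = []
    for line in lines:
        line = line.strip()
        if "<>" not in line:
            # e.g. the total n-gram count on the first line, or a blank line
            continue
        *tokens, _stats = line.split("<>")
        ngrams.append(" ".join(tokens))
    return ngrams


if __name__ == "__main__":
    sample = [
        "384",
        "mobile<>phones<>100 280 384",
        "cellular<>phones<>96 214 384",
    ]
    for ngram in strip_statistics(sample):
        print(ngram)
```

Running the same function over the unmodified output file leaves the original (with statistics) intact, so both versions are available.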

Yahoo! Groups Links

* To visit your group on the web, go to:
http://groups.yahoo.com/group/ngram/

* Your email settings:
Individual Email | Traditional

* To change settings online go to:
http://groups.yahoo.com/group/ngram/join
(Yahoo! ID required)

* To change settings via email:
ngram-dig...@yahoogroups.com 
ngram-fullfeatu...@yahoogroups.com

* To unsubscribe from this group, send an email to:
ngram-unsubscr...@yahoogroups.com

* Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/



Re: [ngram] formatting + punctuation removal

2011-08-17 Thread Ted Pedersen
Hi Patrick,

One additional idea might be to use the --token option, and say that
you only want to consider alphanumerics as your tokens (which is what
you will count).

For example...

marengo(129): cat test
my friends, i have news
i like  ngrams

Now without any token list, stop list, etc...

marengo(130): count.pl outa test

marengo(131): cat outa
24
!<>!<>10 11 12
.<>.<>3 4 4
news<>!<>1 1 12
have<>news<>1 1 1
.<>ngrams<>1 4 1
!<>i<>1 11 2
,<>i<>1 1 2
i<>have<>1 2 1
ngrams<>!<>1 1 12
like<>.<>1 1 4
friends<>,<>1 1 1
i<>like<>1 2 1
my<>friends<>1 1 1

Now I define a token file...

marengo(132): cat token.txt
/\w+/

marengo(133): count.pl out test --token token.txt

marengo(134): cat out
7
i<>have<>1 2 1
news<>i<>1 1 2
have<>news<>1 1 1
like<>ngrams<>1 1 1
i<>like<>1 2 1
friends<>i<>1 1 2
my<>friends<>1 1 1

Note that we only have alphanumerics...that might be the simplest
thing to try first...

Hope this helps...
Ted
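
To make the effect of the /\w+/ token definition concrete, here is a small Python sketch that imitates it. It is only an approximation of what count.pl does, not its actual implementation: it assumes every \w+ match is a token, everything else is ignored, and bigrams may span line breaks. The function name count_bigrams is hypothetical.

```python
import re
from collections import Counter


def count_bigrams(text, token_pattern=r"\w+"):
    """Count bigrams over the tokens matched by token_pattern.

    Returns (total, rows), where each row is
    (first, second, pair count, count of first in the left position,
     count of second in the right position).
    """
    tokens = re.findall(token_pattern, text)
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    left_counts = Counter(first for first, _ in pairs)
    right_counts = Counter(second for _, second in pairs)
    rows = [
        (first, second, n, left_counts[first], right_counts[second])
        for (first, second), n in pair_counts.items()
    ]
    return len(pairs), rows


if __name__ == "__main__":
    total, rows = count_bigrams("my friends, i have news\ni like  ngrams")
    print(total)
    for first, second, n, n_left, n_right in rows:
        # Same field layout as the count.pl output: token<>token<>counts
        print(f"{first}<>{second}<>{n} {n_left} {n_right}")
```

The rows are printed in order of first appearance here, so the ordering will not necessarily match the ordering in the count.pl output.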

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] formatting + punctuation removal

2011-08-17 Thread bthomson
Hi Patrick,

I thought I would throw my idea in as well :-) I tend to use the 
--nontoken option. It is kind of the flip side of Ted's --token approach. 
For example, using Ted's example text:

bridget@cheshire:~/test$ cat test.txt
my friends, i have news
i like  ngrams

and a nontoken file containing one regex for each punctuation mark that you 
want to remove:

bridget@cheshire:~/test$ cat nontokenfile
/\./
/\!/
/\,/


You can run count.pl with the --nontoken option as follows:

bridget@cheshire:~/test$ count.pl --ngram 2 --nontoken nontokenfile test.2 test.txt
bridget@cheshire:~/test$ cat test.2
7
i<>have<>1 2 1
news<>i<>1 1 2
have<>news<>1 1 1
like<>ngrams<>1 1 1
i<>like<>1 2 1
friends<>i<>1 1 2
my<>friends<>1 1 1

This gives some control over what punctuation you want to remove and what 
punctuation you would like to keep - for example hyphens.
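
The selective removal described above can be sketched in Python as follows. This is an illustration under assumptions, not count.pl's code: it assumes the default token definitions are /\w+/ and /[\.,;:\?!]/, and it replaces each nontoken match with a space, which may differ from how count.pl handles the removed text. The names strip_nontokens and tokenize are hypothetical.

```python
import re

# Assumed default token definitions (words and common punctuation marks).
DEFAULT_TOKEN_PATTERNS = [r"\w+", r"[.,;:?!]"]


def strip_nontokens(text, nontoken_patterns):
    """Remove every match of each nontoken regex before tokenizing."""
    for pattern in nontoken_patterns:
        # Replacing with a space is a choice made for this sketch.
        text = re.sub(pattern, " ", text)
    return text


def tokenize(text, token_patterns=DEFAULT_TOKEN_PATTERNS):
    """Return every substring matched by any of the token regexes."""
    combined = "|".join(f"(?:{p})" for p in token_patterns)
    return re.findall(combined, text)


if __name__ == "__main__":
    text = "my friends, i have news! i like ngrams."
    # Remove periods and commas only; exclamation marks stay as tokens.
    cleaned = strip_nontokens(text, [r"\.", r","])
    print(tokenize(cleaned))
```

Listing only some punctuation marks in the nontoken patterns is what lets the remaining ones survive as tokens.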

For your first question on formatting, I didn't completely understand what 
you were asking. Do you want to leave the statistics out of the output file 
when running statistic.pl? Or would you like a program that removes the 
statistics and the <> markers after running statistic.pl?

Thanks,

Bridget

