Re: [ngram] Re: ngrams with hyphen

2011-04-23 Thread Ted Pedersen
Hi Merce,

Ah, yes, I see what you mean. The problem with using \s in the stoplist is
that the tokenization done before the stop words are checked does not include a
trailing \s in the token, and so /\s[Ii]n\s/ is never matched.
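
A small sketch of why the \s version never fires, using Python's re module as a stand-in for Perl. The token list is typed in by hand from test.txt and the names (tokens, candidates, stopped) are made up for illustration; this only assumes that each stop regex is checked against one token at a time, and does not reproduce count.pl itself.

```python
import re

# Tokens for "i like in-line skating in late june.", assuming the
# token.txt definitions /\w+-\w+/ and /\w+/ (typed in by hand).
tokens = ["i", "like", "in-line", "skating", "in", "late", "june"]

# Candidate stop regexes discussed in this thread.
candidates = {
    "space_both": re.compile(r"\s[Ii]n\s"),
    "space_after": re.compile(r"\b[iI]n\s"),
    "default_b": re.compile(r"\bin\b"),
}

def stopped(pattern, toks):
    """Return the tokens that the given stop regex would flag."""
    return [t for t in toks if pattern.search(t)]

hits = {name: stopped(rx, tokens) for name, rx in candidates.items()}
for name, flagged in hits.items():
    print(name, flagged)
```

The two \s variants flag nothing because a token never carries its surrounding whitespace, while the plain \b version flags both in and in-line.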

The trick here is to redefine the \b word boundary so that - no longer counts
as a boundary. This involves a bit of regular expression tampering which looks
kind of awful but in fact works pretty nicely. What I have below is a regex (in
a stoplist) that redefines \b so that - and / are treated as word characters.

@stop.mode=OR
/\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))/

So we have a word boundary \b
followed by In or in
followed by a word boundary that treats - and / as word characters, so in is
not matched when it is followed directly by - or /
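
Here is a rough Python sketch of how such a stop regex behaves, with Python's re standing in for Perl. It assumes the boundary is written with lookbehind assertions, (?<!...) and (?<=...), as in the usual expansion of \b, and that OR mode drops a bigram when either token is a stop word. The token list is typed in by hand and the names STOP, is_stop and kept are made up for illustration.

```python
import re

# The redefined-boundary stop regex, written with lookbehinds; the class
# [\w/-] plays the role that \w plays in the ordinary \b.
STOP = re.compile(r"\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))")

# Tokens for "i like in-line skating in late june." (typed in by hand).
tokens = ["i", "like", "in-line", "skating", "in", "late", "june"]

def is_stop(token):
    """True if the stop regex matches anywhere in the token."""
    return STOP.search(token) is not None

# OR mode as understood here: drop a bigram if either token is a stop word.
bigrams = list(zip(tokens, tokens[1:]))
kept = [bg for bg in bigrams if not (is_stop(bg[0]) or is_stop(bg[1]))]
for left, right in kept:
    print(left, right)
```

The bare token in is stopped, while in-line survives because the in there is followed directly by a hyphen.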

ted@linux-zxku:~ count.pl out test.txt --stop stop.txt --token token.txt

ted@linux-zxku:~ more out
4
late<>june<>1 1 1
in-line<>skating<>1 1 1
i<>like<>1 1 1
like<>in-line<>1 1 1

ted@linux-zxku:~ cat test.txt
i like in-line skating in late june.

ted@linux-zxku:~ cat stop.txt
@stop.mode=OR
/\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))/

It's important to say this regex came from Perl Monks,
http://www.perlmonks.org/?node_id=308744

I hope this makes some sense, at least in a general way. I wouldn't worry
too much about the regex itself, although if you need it modified in some
way do let me know and we can work that out.

Enjoy,
Ted

On Sat, Apr 23, 2011 at 4:51 PM, mercevg merc...@yahoo.es wrote:



 Hi Ted,

 I've modified the stopword list to use \s/ instead of \b/, but the problem
 is not fully solved: my bigram list now contains interesting bigrams
 such as

 in-band<>signalling
 in-station<>modem

 but also new, uninteresting bigrams such as

 in Recommendation
 defined in
 shown in
 described in
 given in

 Is it possible to get just bigrams like

 in-band<>signalling
 in-station<>modem

 and not the other, uninteresting bigrams?

 Thanks for your help,


 Mercè

 --- In ngram@yahoogroups.com, Ted Pedersen tpederse@... wrote:
 
  Hi Merce,
 
  Yes, indeed, you can do as you describe. This gets into some important
  details about regular expressions that I'm happy to have a chance to
  mention. In the default stoplist the stop words are delimited by \b, as in
 
  /\bin\b/
 
  This means match in as a stop word when it is surrounded by word boundaries.
  A word boundary falls next to spaces as well as various punctuation marks,
  including the -.
 
  So, if you want to find bigrams like in-line but then exclude ones like
  in the, then you need to adjust the stoplist so that the stop words are
  perhaps just surrounded by spaces. I say perhaps since there are various
  ways to do this, but the simplest one is shown below...
 
  ted@linux-zxku:~ more stop.txt
  @stop.mode=OR
  /\b[iI]n\s/
 
  ted@linux-zxku:~ more token.txt
  /\w+-\w+/
  /\w+/
 
  ted@linux-zxku:~ more test.txt
  i like in-line skating in late june.
 
  ted@linux-zxku:~ count.pl output.txt test.txt --token token.txt --stop stop.txt
 
  ted@linux-zxku:~ more output.txt
  6
  in<>late<>1 1 1
  late<>june<>1 1 1
  skating<>in<>1 1 1
  in-line<>skating<>1 1 1
  i<>like<>1 1 1
  like<>in-line<>1 1 1
 
  I hope this helps.
 
  Enjoy,
  Ted
 

[ngram] Re: ngrams with hyphen

2011-04-22 Thread mercevg
Ted,

Thanks, I've added this regular expression to my token file and it works well.

One more comment about that:

In my corpus I have some interesting bigrams such as
in-band signalling
in-call rearrangement
in-slot signalling

If I filter in as a stopword, I can't get these kinds of bigrams from my
corpus. Conversely, if in is not on my stopword list, I retrieve
these bigrams but I also get more uninteresting bigrams such as

in Recommendation
in Figure
in order

My question is: Can I keep the first group of bigrams and filter out the
second group at the same time?
 
Thank you for your help,

Mercè


--- In ngram@yahoogroups.com, Ted Pedersen tpederse@... wrote:

 Greetings Merce,
 
 This is fairly easy to handle via the --token option. You simply specify a
 regular expression that says a token is a string followed by a - followed by
 a string. You can customize a --token file many ways, but the following
 example will handle hyphenated words. Please do let us know if additional
 questions arise!
 
 linux@linux:~ count.pl test.out test.txt --token token.txt
 
 linux@linux:~ more test.out
 13
 cell-phone<>It<>1 1 1
 the<>village-shop<>1 1 1
 s<>extra-nice<>1 1 1
 village-shop<>today<>1 1 1
 bought<>a<>1 1 1
 went<>to<>1 1 1
 a<>cell-phone<>1 1 1
 i<>went<>1 1 1
 today<>and<>1 1 1
 It<>s<>1 1 1
 and<>I<>1 1 1
 I<>bought<>1 1 1
 to<>the<>1 1 1
 
 linux@linux:~ cat test.txt
 i went to the village-shop today, and I bought a cell-phone. It's
 extra-nice.
 
 linux@linux:~ cat token.txt
 /\w+\-\w+/
 /\w+/
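
The effect of listing the hyphen pattern first can be approximated in Python, with re standing in for Perl. This is only a sketch: it assumes the token regexes are tried in the order given at each position, which an ordered alternation mimics, and the names TOKEN and bigram_counts are made up for illustration rather than taken from count.pl.

```python
import re
from collections import Counter

# Ordered alternation: the hyphenated pattern is tried before the plain one,
# in the same order as the two lines of token.txt.
TOKEN = re.compile(r"\w+-\w+|\w+")

def bigram_counts(text):
    """Tokenize the text and count adjacent token pairs."""
    tokens = TOKEN.findall(text)
    return Counter(zip(tokens, tokens[1:]))

text = ("i went to the village-shop today, and I bought a cell-phone. It's\n"
        "extra-nice.")
counts = bigram_counts(text)
for (left, right), freq in counts.items():
    print(left, right, freq)
```

With the order reversed (plain \w+ first), village-shop would be split into village and shop, since the first pattern that matches at a position wins.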
 
 Enjoy,
 Ted
 
 On Wed, Apr 20, 2011 at 2:20 PM, mercevg mercevg@... wrote:
 
 
 
  Dear all,
 
  I would like to know if it's possible to get a list of ngrams with a hyphen
  inside, maybe during the tokenization process.
 
  For example, I want to get these bigrams:
  - call-connected signal
  - clear-back signal
  - clear-forward signal
 
  Instead of two bigrams for each one:
  - call<>connected<>179 2608 527
  connected<>signal<>189 320 9176
 
  - clear<>back<>283 1115 733
  back<>signal<>157 380 9176
 
  - clear<>forward<>632 1115 877
  forward<>signal<>493 1547 9176
 
  Thanks a lot,
 
  Mercè
 
   
 
 
 
 
 -- 
 Ted Pedersen
 http://www.d.umn.edu/~tpederse