Hi Merce,
Ah, yes, I see what you mean. The problem with using \s in the stoplist is
that the tokenization done before checking for stop words does not include a
trailing \s, and so /\s[Ii]n\s/ is never matched.
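Here is a small Python sketch of why the \s version never fires. It assumes, as described above, that each token is checked against the stop regex on its own, with no surrounding whitespace. TOKEN and STOP_WITH_SPACE are illustrative names, and Python's re module is only standing in for Perl here.

```python
import re

# Illustrative stand-ins for the token and stop regexes in this thread.
TOKEN = re.compile(r"\w+-\w+|\w+")
STOP_WITH_SPACE = re.compile(r"\b[iI]n\s")

text = "i like in-line skating in late june."
tokens = TOKEN.findall(text)

# In the raw text, "in" is followed by a space, so the pattern matches there.
matches_in_text = STOP_WITH_SPACE.search(text) is not None

# A bare token carries no trailing whitespace, so nothing is ever stopped.
stopped = [t for t in tokens if STOP_WITH_SPACE.search(t)]

print(tokens)
print(matches_in_text, stopped)
```

The pattern matches the running text but none of the individual tokens, which is consistent with in staying in the bigram list.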
The trick here is to redefine the \b word boundary so that - no longer counts
as a boundary. This involves a bit of regular expression tampering which looks
kind of awful but in fact works pretty nicely. What I have below is a regex (in
a stoplist) that redefines the boundary so that - and / are treated as word
characters.
@stop.mode=OR
/\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))/
So we have a word boundary \b
followed by In or in
followed by a word boundary that treats - and / as word characters
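The behaviour of this pattern can be checked with a short Python sketch. It assumes the custom boundary is written with lookbehind assertions, (?<! and (?<=, paired with the lookaheads. WORD, BOUNDARY, STOP_IN and is_stopped are hypothetical names, and the tokens are made-up test inputs.

```python
import re

# Assumed form of the stoplist regex: the trailing boundary treats
# letters, digits, _, / and - as word characters.
WORD = r"[\w/-]"
BOUNDARY = rf"(?:(?<!{WORD})(?={WORD})|(?<={WORD})(?!{WORD}))"
STOP_IN = re.compile(rf"\b[iI]n{BOUNDARY}")

def is_stopped(token: str) -> bool:
    """Return True if the token matches the stop regex."""
    return STOP_IN.search(token) is not None

for token in ["in", "In", "in-line", "in/out", "inside"]:
    print(token, is_stopped(token))
```

Only the bare in and In are stopped. In in-line and in/out the character after in belongs to the extended word class, so the custom boundary does not hold there.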
ted@linux-zxku:~ count.pl out test.txt --stop stop.txt --token token.txt
ted@linux-zxku:~ more out
4
late<>june<>1 1 1
in-line<>skating<>1 1 1
i<>like<>1 1 1
like<>in-line<>1 1 1
ted@linux-zxku:~ cat test.txt
i like in-line skating in late june.
ted@linux-zxku:~ cat stop.txt
@stop.mode=OR
/\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))/
I should mention that this regex came from Perl Monks:
http://www.perlmonks.org/?node_id=308744
I hope this makes some sense, at least in a general way. I wouldn't worry
too much about the regex itself, although if you need it modified in some
way do let me know and we can work that out.
Enjoy,
Ted
On Sat, Apr 23, 2011 at 4:51 PM, mercevg merc...@yahoo.es wrote:
Hi Ted,
I've modified the stopword list to use \s/ instead of \b/, but the problem
is not completely solved. My bigram list now includes interesting bigrams
such as
in-band<>signalling
in-station<>modem
but also new, uninteresting bigrams such as
in Recommendation
defined in
shown in
described in
given in
Is it possible to get just bigrams like
in-band<>signalling
in-station<>modem
and not the other, uninteresting bigrams?
Thanks for your help,
Mercè
--- In ngram@yahoogroups.com, Ted Pedersen tpederse@... wrote:
Hi Merce,
Yes, indeed, you can do as you describe. This gets into some important
details about regular expressions that I'm happy to have a chance to
mention. In the default stoplist the stop words are delimited by \b, as
in
/\bin\b/
This means match in as a stop word when it is surrounded by word boundaries.
A word boundary is the position between a word character and a non-word
character, so it occurs next to spaces and various punctuation, including -.
So, if you want to find bigrams like in-line but then exclude ones like
in the, then you need to adjust the stoplist so that the stop words are
perhaps just surrounded by spaces. I say perhaps since there are various
ways to do this, but the simplest one is shown below...
ted@linux-zxku:~ more stop.txt
@stop.mode=OR
/\b[iI]n\s/
ted@linux-zxku:~ more token.txt
/\w+-\w+/
/\w+/
ted@linux-zxku:~ more test.txt
i like in-line skating in late june.
ted@linux-zxku:~ count.pl output.txt test.txt --token token.txt --stop stop.txt
ted@linux-zxku:~ more output.txt
6
in<>late<>1 1 1
late<>june<>1 1 1
skating<>in<>1 1 1
in-line<>skating<>1 1 1
i<>like<>1 1 1
like<>in-line<>1 1 1
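To see why the default /\bin\b/ entry also catches a token like in-line, here is a quick check in Python, whose re module treats \b the same way for this pattern. DEFAULT_STOP and results are illustrative names, and the strings are made-up test inputs.

```python
import re

# The default-style stop entry: "in" delimited by \b on both sides.
DEFAULT_STOP = re.compile(r"\bin\b")

# \b holds between "n" and "-", because "-" is not a word character.
results = {t: bool(DEFAULT_STOP.search(t))
           for t in ["in", "in-line", "inside", "in the"]}

for token, hit in results.items():
    print(repr(token), hit)
```

The hyphenated in-line matches just like the bare in, while inside does not. That is why a \b-delimited stop word removes the hyphenated bigrams too.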
I hope this helps.
Enjoy,
Ted
On Fri, Apr 22, 2011 at 11:41 AM, mercevg mercevg@... wrote:
Ted,
Thanks, I've added this regular expression to my tokens file and it works
well.
One more comment about that:
In my corpus I have some interesting bigrams such as
in-band signalling
in-call rearrangement
in-slot signalling
If I filter in as a stopword, I can't get these kinds of bigrams from my
corpus. On the other hand, if in is not on my stopword list, I retrieve
these bigrams, but I also get uninteresting bigrams such as
in Recommendation
in Figure
in order
My question is: can I filter out the second group of bigrams and still
retrieve the first group?
Thank you for your help,
Mercè
--- In ngram@yahoogroups.com, Ted Pedersen tpederse@ wrote:
Greetings Merce,
This is fairly easy to handle via the --token option. You simply specify a
regular expression that says a token is a string followed by a - followed by
a string. You can customize a --token file in many ways, but the following
example will handle hyphenated words. Please do let us know if additional
questions arise!
linux@linux:~ count.pl test.out test.txt --token token.txt
linux@linux:~ more test.out
13
cell-phone<>It<>1 1 1
the<>village-shop<>1 1 1
s<>extra-nice<>1 1 1
village-shop<>today<>1 1 1
bought<>a<>1 1 1
went<>to<>1 1 1
a<>cell-phone<>1 1 1
i<>went<>1 1 1
today<>and<>1 1 1
It<>s<>1 1 1
and<>I<>1 1 1
I<>bought<>1 1 1
to<>the<>1 1 1
linux@linux:~ cat test.txt
i went to the village-shop today, and I bought a cell-phone. It's
extra-nice.
linux@linux:~ cat token.txt
/\w+\-\w+/
/\w+/
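As a rough illustration of what these two token definitions do, here is a Python sketch. It assumes the token regexes are tried in file order at each position, so the hyphenated pattern wins over the plain one. TOKEN_PATTERNS, TOKEN and bigrams are hypothetical names, and this is not count.pl's actual code. It prints only the joint count per bigram, joined with <>, which is how I understand count.pl separates the words in its output.

```python
import re
from collections import Counter

# Token definitions in file order; earlier patterns take priority.
TOKEN_PATTERNS = [r"\w+-\w+", r"\w+"]
TOKEN = re.compile("|".join(f"(?:{p})" for p in TOKEN_PATTERNS))

def bigrams(text: str) -> Counter:
    """Count adjacent token pairs, skipping anything that is not a token."""
    tokens = TOKEN.findall(text)
    return Counter(zip(tokens, tokens[1:]))

text = ("i went to the village-shop today, and I bought a cell-phone. "
        "It's extra-nice.")
counts = bigrams(text)

print(sum(counts.values()))          # total number of bigrams
for (w1, w2), n in counts.items():
    print(f"{w1}<>{w2}<>{n}")
```

Swapping the order of the two patterns would split village-shop into village and shop, since \w+ would then match first.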
Enjoy,
Ted
On Wed, Apr 20, 2011 at 2:20 PM, mercevg mercevg@ wrote:
Dear all,
I would like to know if it's possible to get a list of ngrams with a hyphen
inside, maybe