Hi Leonardo.

Apologies for this very late reply. I have been out of town for a
few weeks now, and everything has fallen behind!

I think the answer to your question is "no". :) I hate to say that,
but I think the regular expressions that are used in NSP are strictly
for tokenization. They let you chop up a file into tokens that might
be 2 letters long, or 2 words long, or be made up only of capitals,
etc. But, the counting step does not really look at the regular
expressions used to tokenize, it simply counts up the tokens that are
found, and reports the totals for bigrams, etc. (whatever we are
counting).
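To make the tokenization point concrete: the regexes live in a plain-text token file, one Perl-style regex per line, which count.pl reads via its --token option (the filenames below are just placeholders of mine). A token file along these lines would keep only all-capital words and words of 4+ letters, which I believe is roughly what you are doing already:

```
/[A-Z]+/
/[a-zA-Z]{4,}/
```

and then something like `count.pl --token token.txt bigrams.txt input.txt`. But again, these regexes only decide what counts as a token; the counting step treats every token position the same.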

One way to do what you want might be to count ngrams with NSP as
usual, and then run some sort of edit distance calculation over the
resulting ngrams, merging together those ngrams that fall within some
number of edits of each other. Or you could use some other similarity
measure in the same way...
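In case it helps, here is a rough sketch of that idea in Python. The function names are mine, and I am assuming the ngrams come out of count.pl as strings with tokens joined by "<>" along with their counts; adjust to whatever your actual output looks like:

```python
def edit_distance(a, b):
    # Classic Levenshtein distance via dynamic programming:
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def merge_ngrams(counts, max_edits=2):
    """Greedily fold each ngram into the first already-kept ngram
    that lies within max_edits of it (most frequent ngrams first)."""
    merged = {}
    for ngram, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        for rep in merged:
            if edit_distance(ngram, rep) <= max_edits:
                merged[rep] += n
                break
        else:
            merged[ngram] = n
    return merged

# Hypothetical counts in NSP's "<>"-joined style:
counts = {"drag<>and<>drop": 10, "drag<>und<>drop": 1, "file<>manager": 7}
print(merge_ngrams(counts))
# → {'drag<>and<>drop': 11, 'file<>manager': 7}
```

This is just a post-processing pass over NSP's output, not anything NSP does itself, and a greedy merge like this is order-dependent; but it might be enough to collapse near-duplicate ngrams.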

Sorry, I hope I am not misunderstanding the question. Please do let
us know if my answer seems to miss your point, or is unclear in some
way!

Thanks,
Ted

On Tue, 2 May 2006, Leonardo Fontenelle wrote:

> Or: _yet_ another "can I do..."
>
> I'm using regular expressions to match 4+ letter words or
> all-uppercase words; I believe they do a nice job for bigrams ("file
> manager", for example, is one of the first hits), but they might be
> too restrictive for 3-grams and longer. How could I look for n-grams
> in which some (say, 2) tokens must match certain regular expressions,
> while the others are allowed to match some other ones? I'm trying to
> get expressions like "drag and drop" or "press-and-hold", or
> "create a new \w{4,}".
>
> Thanks, again!
>
> Leonardo F. Fontenelle
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

