Re: protwords.txt support in stemmers

Robert Muir Tue, 30 Mar 2010 06:24:06 -0700

On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:


>
> It would also be nice to make the token categories generated by
> tokenizers into tags (like StandardTokenizer's ACRONYM, etc).  A
> tokenizer that detected many of the properties could significantly
> speed up analysis because tokens would not have to be re-analyzed to
> see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
> path for WDF would be checking a bit per token).
>
>
I like this idea, but it does seem a little bit dangerous. For example, the
tokenizer could set one of these flags, but if some TokenFilter further down
the stream doesn't properly maintain it, you could introduce bugs (by assuming
a word has no numbers when in fact it now does, due to, say, a
PatternReplaceFilter).

So I think we would simply end up adding a lot of these redundant checks
back. For example, you would have to re-analyze the term after any regex
replacement from PatternReplaceFilter to properly set these flags, and it
might introduce a lot of subtle bugs.
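
To make the concern concrete, here is a rough sketch of the failure mode. It is
a toy model in Python, not Lucene's actual API: the Token class, the HAS_DIGIT
bit, and the tokenize / pattern_replace / fast_path_has_digit helpers are all
hypothetical names made up for illustration. pattern_replace only stands in for
what a regex-replacing filter like PatternReplaceFilter does to the term text.

```python
import re
from dataclasses import dataclass

# Hypothetical flag bit: "this token's text contains a digit".
HAS_DIGIT = 1 << 0


@dataclass
class Token:
    text: str
    flags: int = 0


def digit_flag(text):
    """Compute the HAS_DIGIT bit by actually scanning the text."""
    return HAS_DIGIT if any(c.isdigit() for c in text) else 0


def tokenize(s):
    """Toy whitespace tokenizer that sets the flag once, up front."""
    for word in s.split():
        yield Token(word, digit_flag(word))


def pattern_replace(tokens, pattern, replacement, refresh_flags=False):
    """Toy stand-in for a regex-replacing filter.

    With refresh_flags=False the text changes but the flag is left alone,
    which is the bug. With refresh_flags=True the term is re-scanned,
    which is the redundant check we were trying to avoid.
    """
    regex = re.compile(pattern)
    for tok in tokens:
        tok.text = regex.sub(replacement, tok.text)
        if refresh_flags:
            tok.flags = (tok.flags & ~HAS_DIGIT) | digit_flag(tok.text)
        yield tok


def fast_path_has_digit(tok):
    """What a downstream filter would do: trust the bit, skip the scan."""
    return bool(tok.flags & HAS_DIGIT)


stale = list(pattern_replace(tokenize("version two"), r"two", "2"))
fresh = list(pattern_replace(tokenize("version two"), r"two", "2",
                             refresh_flags=True))
```

In the stale case the second token's text is now "2", but
fast_path_has_digit still answers False because nothing updated the bit. The
fresh case gets the right answer only by re-scanning the term after the
replacement, so the saving from the fast path is lost for that filter.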

-- 
Robert Muir
rcm...@gmail.com
