On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> It would also be nice to make the token categories generated by
> tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A
> tokenizer that detected many of the properties could significantly
> speed up analysis because tokens would not have to be re-analyzed to
> see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
> path for WDF would be checking a bit per token).

I like this idea, but it does seem a little bit dangerous. For example, the
tokenizer could set one of these values, but if some TokenFilter further
down the stream doesn't handle it properly, you could introduce bugs (by
assuming a word has no numbers when in fact it now does, due to, say, a
PatternReplaceFilter).

So I think we would simply end up adding a lot of these redundant checks
back: you would have to re-analyze the term after any regex replacement
from PatternReplaceFilter to properly set these flags, and it might
introduce a lot of subtle bugs.

--
Robert Muir
rcm...@gmail.com
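
To make the stale-flag concern concrete, here is a minimal sketch in plain Python. It is not Lucene code and does not use any Lucene API: `Token`, `HAS_DIGIT`, `compute_flags`, `pattern_replace`, and `fast_path_has_digit` are all hypothetical names invented for illustration. It assumes a tokenizer that records a "contains a digit" bit per token, and a regex-replacing filter that either copies that bit through unchanged or recomputes it.

```python
import re
from dataclasses import dataclass

# Hypothetical flag bit, standing in for a per-token property tag.
HAS_DIGIT = 1 << 0


@dataclass
class Token:
    text: str
    flags: int = 0


def compute_flags(text):
    """Scan the term once and record its properties as bits."""
    flags = 0
    if any(ch.isdigit() for ch in text):
        flags |= HAS_DIGIT
    return flags


def tokenize(s):
    """Whitespace tokenizer that sets the flags up front."""
    return [Token(word, compute_flags(word)) for word in s.split()]


def pattern_replace(tokens, pattern, replacement, refresh_flags):
    """Regex-replacing filter; optionally re-analyzes the rewritten term."""
    out = []
    for tok in tokens:
        new_text = re.sub(pattern, replacement, tok.text)
        # Without the refresh, the old flags describe text that no longer exists.
        flags = compute_flags(new_text) if refresh_flags else tok.flags
        out.append(Token(new_text, flags))
    return out


def fast_path_has_digit(tok):
    """The 'check a bit per token' fast path a later filter would rely on."""
    return bool(tok.flags & HAS_DIGIT)


tokens = tokenize("wifi router")
stale = pattern_replace(tokens, r"wifi", "802.11", refresh_flags=False)
fresh = pattern_replace(tokens, r"wifi", "802.11", refresh_flags=True)

print(stale[0].text, fast_path_has_digit(stale[0]))
print(fresh[0].text, fast_path_has_digit(fresh[0]))
```

In the `stale` case the term now contains digits, yet the fast path still reports none. The `fresh` case is correct, but only because the filter re-scanned the term, which is exactly the redundant check the flags were meant to avoid.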