
Re: Enhance StandardTokenizer to support words which will not be tokenized

2009-06-03 Thread ami dudu

This can be a good solution, but it will have to be maintained with every update of the StandardAnalyzer rules. Is there a way to work around it?
Grant Ingersoll-6 wrote:
> You'd have to modify the JFlex grammar. I'd suggest adding in a
> generic "protected words" approach whereby you can pass in a

Re: Enhance StandardTokenizer to support words which will not be tokenized

2009-06-03 Thread Earwin Burrfoot

Not sure you can easily marry a generated JFlex grammar with a runtime-provided list of protected words. I took the approach of creating tokens for punctuation inside my tokenizer and later either gluing them to nearby text tokens or dropping them from the stream with a TokenFilter.
On Wed, Jun 3, 2009 at 20:10,
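The glue-or-drop idea described in this message can be sketched in plain Java, without Lucene on the classpath. This is only an illustration under stated assumptions, not the poster's actual code: the class `GlueSketch`, the `Tok` holder, and the methods `split` and `glue` are hypothetical names. Stage one emits text runs and single punctuation characters with their offsets; stage two joins adjacent tokens whose concatenation is a protected word and drops any punctuation left over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class GlueSketch {

    // Hypothetical token holder: text, character offsets, and a punctuation flag.
    public static final class Tok {
        final String text;
        final int start;
        final int end;
        final boolean punct;

        Tok(String text, int start, int end, boolean punct) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.punct = punct;
        }
    }

    // Stage 1: runs of letters/digits become text tokens, every other
    // non-whitespace character becomes its own punctuation token.
    public static List<Tok> split(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetterOrDigit(c)) {
                int j = i;
                while (j < input.length() && Character.isLetterOrDigit(input.charAt(j))) {
                    j++;
                }
                out.add(new Tok(input.substring(i, j), i, j, false));
                i = j;
            } else {
                out.add(new Tok(String.valueOf(c), i, i + 1, true));
                i++;
            }
        }
        return out;
    }

    // Stage 2: glue the longest run of touching tokens that spells a
    // protected word; otherwise keep text tokens and drop punctuation.
    public static List<String> glue(List<Tok> toks, Set<String> protectedWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < toks.size()) {
            StringBuilder sb = new StringBuilder();
            String best = null;
            int bestEnd = -1;
            for (int j = i; j < toks.size(); j++) {
                // Stop at the first gap (whitespace) between tokens.
                if (j > i && toks.get(j).start != toks.get(j - 1).end) {
                    break;
                }
                sb.append(toks.get(j).text);
                if (protectedWords.contains(sb.toString())) {
                    best = sb.toString();
                    bestEnd = j;
                }
            }
            if (best != null) {
                out.add(best);
                i = bestEnd + 1;
            } else {
                if (!toks.get(i).punct) {
                    out.add(toks.get(i).text);
                }
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> protectedWords = Set.of("C++", ".NET");
        System.out.println(glue(split("I use C++ and .NET, not c."), protectedWords));
    }
}
```

In a real analyzer chain the second stage would presumably live in a TokenFilter subclass and work on the stream's term and offset attributes; the list-based version here only shows the matching logic.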

Re: Enhance StandardTokenizer to support words which will not be tokenized

2009-06-03 Thread Grant Ingersoll

You'd have to modify the JFlex grammar. I'd suggest adding in a generic "protected words" approach whereby you can pass in a list of protected words. This would be a nice patch/improvement.
-Grant
On Jun 3, 2009, at 4:07 AM, ami dudu wrote:
> Hi, I'm using a StandardTokenizer which do gre
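As a rough illustration of what passing in a list of protected words could look like, here is a sketch that swaps the JFlex grammar for a `java.util.regex` pattern built at runtime. It is not the StandardTokenizer grammar and not a proposed patch; `ProtectedWordsSketch`, `compile`, and `tokenize` are hypothetical names, and the fallback word rule (runs of letters and digits) is a simplifying assumption. The protected words are tried before the general word rule, longest first, much as an earlier lexer rule takes priority over a later one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ProtectedWordsSketch {

    // Build one pattern: quoted protected words first (longest first),
    // then a simplified general word rule as the fallback alternative.
    public static Pattern compile(List<String> protectedWords) {
        String alternatives = protectedWords.stream()
                .sorted(Comparator.comparingInt(String::length).reversed())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        String word = "[\\p{L}\\p{N}]+";
        return Pattern.compile(alternatives.isEmpty() ? word : alternatives + "|" + word);
    }

    // Emit every match in order; anything the pattern skips is discarded.
    public static List<String> tokenize(String input, Pattern pattern) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern p = compile(List.of("C++", "AT&T"));
        System.out.println(tokenize("AT&T ships C++ code", p));
    }
}
```

The sketch does no word-boundary check, so a protected word that is also a prefix of an ordinary word would split that word; a real implementation would need to handle that case.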