Preserving punctuation tokens with ICUTokenizerFactory

2012-04-10 Thread Demian Katz
It has been brought to my attention that ICUTokenizerFactory drops tokens like the "++" in "The C++ Programming Language". Is there any way to persuade it to preserve these types of tokens?

Thanks,
Demian

Re: Preserving punctuation tokens with ICUTokenizerFactory

2012-04-10 Thread Robert Muir
You can plug in customized grammars and the like, but the simplest approach is to configure a MappingCharFilter before your tokenizer, with mappings like:

"c++" => "cplusplus"

On Tue, Apr 10, 2012 at 11:50 AM, Demian Katz demian.k...@villanova.edu wrote:
> It has been brought to my
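
For readers following along, the char-filter suggestion above might look roughly like this in a Solr schema.xml. This is only a sketch under stated assumptions: the field type name "text_icu_mapped" and the mapping file name "mapping-punct.txt" are made up for illustration, and the lowercase filter is just a typical companion, not something the thread prescribes.

```xml
<!-- Sketch only: "text_icu_mapped" and "mapping-punct.txt" are hypothetical names -->
<fieldType name="text_icu_mapped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters rewrite the raw text before the tokenizer sees it,
         so "C++" has already become "cplusplus" when ICU tokenizes it -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-punct.txt"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The mapping file (assumed here to sit in the core's conf directory) would then hold one rule per line. Because the char filter runs before any lowercasing and matches case-sensitively, both spellings are listed:

```
"c++" => "cplusplus"
"C++" => "cplusplus"
```

Since the same analyzer is applied at index and query time in this sketch, a search for "C++" would be rewritten the same way and should match the indexed "cplusplus" token.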