>>: I need to tokenize my field on whitespace, html, punctuation, apostrophe
>>
>>: but if I use HTMLStripStandardTokenizerFactory it strips only html....
>>: but not apostrophes
> you might consider using one of the HTML Tokenizers, and then use a
> PatternReplaceFilterFactory ... or if you know java write a
> simple Tokenizer that uses the HTMLStripReader.
>
> in the long run, changing the HTMLStripReader to be usable as a
> "CharFilter" so it can work with any Tokenizer is probably the way we'll
> go -- but i don't think anyone has started working on a patch for that.

thanks... I used HTMLStripStandardTokenizerFactory and then a PatternReplaceFilterFactory, and now it works
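
For anyone finding this thread later, a rough sketch of that setup in schema.xml might look like the following. The field type name "text_html" and the apostrophe-only pattern are assumptions for illustration; the thread does not give the actual field definition or regex, so adjust both to your data.

```xml
<!-- Hypothetical field type; the name and the pattern are illustrative only -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- Strips HTML markup, then tokenizes with the standard tokenizer -->
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <!-- Removes every apostrophe left inside a token -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="'" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

Note that a pattern replace filter rewrites the text of each token rather than splitting it, so this removes apostrophes from tokens instead of breaking tokens at them. If you need a real split at the apostrophe, a custom Tokenizer along the lines suggested above is probably still the way to go.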