>>: I need to tokenize my field on whitespace, html, punctuation, apostrophes
>>
>>: but if I use HTMLStripStandardTokenizerFactory it strips only the html,
>>: not the apostrophes

> you might consider using one of the HTML Tokenizers, and then use a 
> PatternReplaceFilterFactory ... or if you know java write a 
> simple Tokenizer that uses the HTMLStripReader.
> 
> in the long run, changing the HTMLStripReader to be usable as a 
> "CharFilter" so it can work with any Tokenizer is probably the way we'll 
> go -- but i don't think anyone has started working on a patch for that.

thanks... I used HTMLStripStandardTokenizerFactory followed by a 
PatternReplaceFilterFactory

now it works
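
for anyone finding this thread later, here is a rough sketch of what such a
field type might look like in schema.xml. this is not my exact config: the
field type name "text_html" and the apostrophe pattern are only illustrative
assumptions, and I'm assuming the filter meant above is
solr.PatternReplaceFilterFactory.

```xml
<!-- Hypothetical field type; the name "text_html" is an assumption -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- Strips HTML markup, then tokenizes with the standard tokenizer -->
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <!-- Removes apostrophes left inside tokens; the pattern is an assumption -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="'" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

note that a filter like this only rewrites each token after tokenization, so
it deletes the apostrophe rather than splitting the token in two. if you
really need a split at the apostrophe, a different tokenizer is probably
needed.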



      
