On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
> Hi,
>
> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
> tokens, based solely on lower-casing characters. Is there a way to tell it
> NOT to drop non-characters? It's amazingly frustrating that the
> TokenizerFactory and the FilterFactory have two entirely different modes of
> behavior. If I wanted it to tokenize based on non-lower case
> characters....
> wouldn't I use, say, LetterTokenizerFactory and tack on the
> LowerCaseFilterFactory? Or any number of combinations that would otherwise
> achieve that specific end-result?
>

I don't think you should use LowerCaseTokenizerFactory if you don't want to
divide text on non-letters; it's intended to do just that.

From the javadocs:

  LowerCaseTokenizer performs the function of LetterTokenizer and
  LowerCaseFilter together. It divides text at non-letters and converts them
  to lower case. While it is functionally equivalent to the combination of
  LetterTokenizer and LowerCaseFilter, there is a performance advantage to
  doing the two tasks at once, hence this (redundant) implementation.

> So... Is there a way for me to tell it to NOT split based on non-characters?
>

Use a different tokenizer that doesn't split on non-characters, followed by
a LowerCaseFilter.

-- 
Robert Muir
rcm...@gmail.com
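
For illustration, the suggestion above (a tokenizer that does not split on non-letters, followed by a lower-case filter) might look roughly like this in schema.xml. This is only a sketch: the field type name "text_ws_lower" is made up for the example, and the choice of WhitespaceTokenizerFactory is an assumption, since the thread does not say which tokenizer fits the data.

```xml
<!-- Hypothetical field type; the name "text_ws_lower" is illustrative only. -->
<fieldType name="text_ws_lower" class="solr.TextField">
  <analyzer>
    <!-- Splits on whitespace only, so digits and punctuation stay inside tokens. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lower-cases each token without changing the token boundaries. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the whole field value should stay a single token, solr.KeywordTokenizerFactory could be swapped in for the whitespace tokenizer; the lower-case filter stays the same.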