I went for a different route: https://issues.apache.org/jira/browse/LUCENE-2644
Scott

On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:

> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
>
> > Hi,
> >
> > I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
> > tokens, based solely on lower-casing characters. Is there a way to tell
> > it NOT to drop non-characters? It's amazingly frustrating that the
> > TokenizerFactory and the FilterFactory have two entirely different modes
> > of behavior. If I wanted it to tokenize based on non-lower case
> > characters.... wouldn't I use, say, LetterTokenizerFactory and tack on
> > the LowerCaseFilterFactory? Or any number of combinations that would
> > otherwise achieve that specific end-result?
> >
>
> I don't think you should use LowerCaseTokenizerFactory if you don't want to
> divide text on non-letters; it's intended to do just that.
>
> From the javadocs:
> LowerCaseTokenizer performs the function of LetterTokenizer and
> LowerCaseFilter together. It divides text at non-letters and converts them
> to lower case. While it is functionally equivalent to the combination of
> LetterTokenizer and LowerCaseFilter, there is a performance advantage to
> doing the two tasks at once, hence this (redundant) implementation.
>
> > So... Is there a way for me to tell it to NOT split based on
> > non-characters?
> >
>
> Use a different tokenizer that doesn't split on non-characters, followed by
> a LowerCaseFilter.
>
> --
> Robert Muir
> rcm...@gmail.com
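
For anyone reading this thread later, here is a rough sketch of the difference being discussed. It is plain Python that only imitates the two behaviors; it does not call Lucene or Solr. The function names are made up for illustration, and Python's `str.isalpha` is assumed to be a close enough stand-in for the "letter" test described in the quoted javadoc.

```python
from itertools import groupby


def letter_then_lowercase(text):
    """Imitates the behavior the javadoc describes for LowerCaseTokenizer:
    divide the text at non-letters, then lower-case each token.
    (Hypothetical helper, not a Lucene API.)"""
    runs = groupby(text, key=str.isalpha)
    return ["".join(chars).lower() for is_letter, chars in runs if is_letter]


def whitespace_then_lowercase(text):
    """Imitates the suggested alternative: a tokenizer that does not split
    on non-letters (here, whitespace only), followed by lower-casing.
    (Hypothetical helper, not a Lucene API.)"""
    return [token.lower() for token in text.split()]


sample = "Wi-Fi 802.11n Router"
print(letter_then_lowercase(sample))      # ['wi', 'fi', 'n', 'router']
print(whitespace_then_lowercase(sample))  # ['wi-fi', '802.11n', 'router']
```

The first function loses the hyphen and the digits because they are not letters; the second keeps them inside the tokens. In a Solr schema, the second behavior would presumably correspond to something like WhitespaceTokenizerFactory followed by LowerCaseFilterFactory, though which tokenizer fits depends on the field.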