Jonathan, you bring up an excellent point. I think it's worth our time to actually benchmark this LowerCaseTokenizer versus LetterTokenizer + LowerCaseFilter.
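
To make the comparison concrete, here is a minimal sketch in plain Java of the two approaches being weighed. It is not the Lucene code and does not use the Lucene classes: the class name `TokenizeSketch` and the methods `singlePass` and `twoPass` are hypothetical, and it works on `char` values only, so it ignores supplementary characters. It only shows that both routes yield the same tokens while the first touches each character once; it is not a benchmark.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not the Lucene implementation.
class TokenizeSketch {

    // Single pass: split at non-letters and lower-case each char as it is read.
    static List<String> singlePass(char[] text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Two passes: first split at non-letters, then lower-case every token.
    static List<String> twoPass(char[] text) {
        List<String> raw = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                raw.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            raw.add(current.toString());
        }

        // Second pass over the characters of each token.
        List<String> lowered = new ArrayList<>();
        for (String token : raw) {
            StringBuilder sb = new StringBuilder(token.length());
            for (int i = 0; i < token.length(); i++) {
                sb.append(Character.toLowerCase(token.charAt(i)));
            }
            lowered.add(sb.toString());
        }
        return lowered;
    }

    public static void main(String[] args) {
        char[] text = "Foo-Bar 42 BAZ".toCharArray();
        System.out.println(singlePass(text)); // prints [foo, bar, baz]
        System.out.println(twoPass(text));    // prints [foo, bar, baz]
    }
}
```

A real measurement would need to run the actual Lucene classes over representative text, with warm-up, rather than a toy like this.
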

This tokenizer is quite old, and although I have no doubt it's technically faster than LetterTokenizer + LowerCaseFilter even today (as it can just go through the char[] only a single time), I have my doubts that this brings any value these days...

On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Why would you want to do that, instead of just using another tokenizer and
> a lowercasefilter? It's more confusing less DRY code to leave them separate
> -- the LowerCaseTokenizerFactory combines anyway because someone decided it
> was such a common use case that it was worth it for the demonstrated
> performance advantage. (At least I hope that's what happened, otherwise
> there's no excuse for it!).
>
> Do you know you get a worthwhile performance benefit for what you're doing?
> If not, why do it?
>
> Jonathan
>
>
> Scott Gonyea wrote:
>
>> I went for a different route:
>>
>> https://issues.apache.org/jira/browse/LUCENE-2644
>>
>> Scott
>>
>> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:
>>
>>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
>>>> tokens, based solely on lower-casing characters. Is there a way to tell it
>>>> NOT to drop non-characters? It's amazingly frustrating that the
>>>> TokenizerFactory and the FilterFactory have two entirely different modes of
>>>> behavior. If I wanted it to tokenize based on non-lower case
>>>> characters....
>>>> wouldn't I use, say, LetterTokenizerFactory and tack on the
>>>> LowerCaseFilterFactory? Or any number of combinations that would otherwise
>>>> achieve that specific end-result?
>>>>
>>> I don't think you should use LowerCaseTokenizerFactory if you dont want to
>>> divide text on non-letters, its intended to do just that.
>>>
>>> from the javadocs:
>>> LowerCaseTokenizer performs the function of LetterTokenizer and
>>> LowerCaseFilter together. It divides text at non-letters and converts them
>>> to lower case. While it is functionally equivalent to the combination of
>>> LetterTokenizer and LowerCaseFilter, there is a performance advantage to
>>> doing the two tasks at once, hence this (redundant) implementation.
>>>
>>> So... Is there a way for me to tell it to NOT split based on
>>> non-characters?
>>>
>>> Use a different tokenizer that doesn't split on non-characters, followed by
>>> a LowerCaseFilter
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>>
>>
>

--
Robert Muir
rcm...@gmail.com