I'd agree with your point entirely. My attacking LowerCaseTokenizer was a result of not wanting to create yet more classes.
That said, rightfully dumping LowerCaseTokenizer would probably have me creating my own Tokenizer. I could very well be thinking about this wrong... but what if I wanted to create tokens based on non-whitespace/alpha/numeric character content? It looks like I could perhaps use the PatternTokenizer, but it didn't leave me with a comfortable feeling when I first looked into it. (I've pasted a rough sketch of what I mean below the quoted thread.)

Scott

On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir <rcm...@gmail.com> wrote:
> Jonathan, you bring up an excellent point.
>
> I think it's worth our time to actually benchmark this LowerCaseTokenizer
> versus LetterTokenizer + LowerCaseFilter.
>
> This tokenizer is quite old, and although I can understand there is no doubt
> it's technically faster than LetterTokenizer + LowerCaseFilter even today
> (as it can just go through the char[] only a single time), I have my doubts
> that this brings any value these days...
>
> On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> > Why would you want to do that, instead of just using another tokenizer and
> > a LowerCaseFilter? It's more confusing, less DRY code to leave them separate
> > -- the LowerCaseTokenizerFactory combines them anyway because someone decided
> > it was such a common use case that it was worth it for the demonstrated
> > performance advantage. (At least I hope that's what happened; otherwise
> > there's no excuse for it!)
> >
> > Do you know you get a worthwhile performance benefit for what you're doing?
> > If not, why do it?
> >
> > Jonathan
> >
> > Scott Gonyea wrote:
> >> I went for a different route:
> >>
> >> https://issues.apache.org/jira/browse/LUCENE-2644
> >>
> >> Scott
> >>
> >> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:
> >>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
> >>>> Hi,
> >>>>
> >>>> I'm tweaking my schema, and the LowerCaseTokenizerFactory doesn't create
> >>>> tokens based solely on lower-casing characters. Is there a way to tell it
> >>>> NOT to drop non-characters? It's amazingly frustrating that the
> >>>> TokenizerFactory and the FilterFactory have two entirely different modes
> >>>> of behavior. If I wanted it to tokenize based on non-lowercase characters...
> >>>> wouldn't I use, say, LetterTokenizerFactory and tack on the
> >>>> LowerCaseFilterFactory? Or any number of combinations that would otherwise
> >>>> achieve that specific end result?
> >>>>
> >>> I don't think you should use LowerCaseTokenizerFactory if you don't want
> >>> to divide text on non-letters; it's intended to do just that.
> >>>
> >>> From the javadocs:
> >>> LowerCaseTokenizer performs the function of LetterTokenizer and
> >>> LowerCaseFilter together. It divides text at non-letters and converts them
> >>> to lower case. While it is functionally equivalent to the combination of
> >>> LetterTokenizer and LowerCaseFilter, there is a performance advantage to
> >>> doing the two tasks at once, hence this (redundant) implementation.
> >>>
> >>> So... Is there a way for me to tell it to NOT split based on
> >>> non-characters?
> >>> Use a different tokenizer that doesn't split on non-characters,
> >>> followed by a LowerCaseFilter.
> >>>
> >>> --
> >>> Robert Muir
> >>> rcm...@gmail.com
>
> --
> Robert Muir
> rcm...@gmail.com
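P.S. To make the PatternTokenizer question above a little more concrete, here is roughly the fieldType I was imagining. It's an untested sketch; the field type name and the pattern are just placeholders for whatever "non-whitespace" rule I actually settle on:

  <!-- Sketch only: split on runs of whitespace, keep everything else
       (digits, punctuation, etc.) inside the token, then lower-case it. -->
  <fieldType name="text_pattern_lower" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- With no group attribute, the pattern is treated as a delimiter,
           much like String.split(). -->
      <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

If I understand the factory right, that would give me whitespace-delimited tokens with the non-letter characters kept, which is what I couldn't get out of LowerCaseTokenizerFactory.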
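And for anyone who does benchmark the two approaches Robert mentions, my understanding (from the javadocs quoted above) is that these two fieldTypes should produce identical tokens, with only the analysis cost differing -- again, the names are just placeholders:

  <!-- Single pass: splits at non-letters and lower-cases in one tokenizer. -->
  <fieldType name="text_lc_tokenizer" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <!-- Two passes: same output, but splitting and lower-casing are separate steps. -->
  <fieldType name="text_letter_lower" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.LetterTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>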