There doesn't seem to have been anything readily available. All of the tokenizers make their own assumptions about how I want to treat the data. The end result is that this felt like the most direct approach. The default behavior of "LowerCaseTokenizer"(+Factory) was retained, while allowing it to be extended in very small ways--at the user's discretion.
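
To make "extended in very small ways" concrete, here is a minimal standalone sketch of the idea. It is not the patch attached to the JIRA issue and it does not use any Lucene classes: the class name ExtendedLowerCaseSketch, the extraChars constructor argument, and the tokenize helper are all hypothetical, made up only to illustrate the behavior. With no extra characters configured it splits on non-letters and lower-cases, which is what LowerCaseTokenizer does by default; extra characters are only consulted after the letter check fails.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: not a Lucene class and not the actual patch.
final class ExtendedLowerCaseSketch {
    // Extra code points the user wants kept inside tokens (assumed option).
    private final Set<Integer> extraTokenChars = new HashSet<>();

    ExtendedLowerCaseSketch(String extraChars) {
        for (int i = 0; i < extraChars.length(); ) {
            int c = extraChars.codePointAt(i);
            extraTokenChars.add(c);
            i += Character.charCount(c);
        }
    }

    // Letters are tested first, so letter input never reaches the set lookup.
    boolean isTokenChar(int c) {
        if (Character.isLetter(c)) {
            return true;
        }
        return extraTokenChars.contains(c);
    }

    // Split on anything that is not a token char, lower-casing as we go.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Default behavior: letters only.
        System.out.println(new ExtendedLowerCaseSketch("").tokenize("Wi-Fi 802.11n"));
        // Extended behavior: the hyphen is kept inside tokens.
        System.out.println(new ExtendedLowerCaseSketch("-").tokenize("Wi-Fi 802.11n"));
    }
}
```

Constructing the sketch with an empty string leaves the letters-only behavior untouched, which is the point: anyone who doesn't opt in sees no difference.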

The comments noted that it was done for performance reasons, but I honestly cannot believe the performance gain is altogether worthwhile. Whether or not that's the case, I strongly believe that "LowerCaseTokenizer" should have (more correctly) been called "LowerCaseLetterTokenizer".

There's arguably zero negative impact from my change. Where the (inherited) isTokenChar(int) method from LetterTokenizer was simply:

  protected boolean isTokenChar(int c) {
    return Character.isLetter(c);
  }

I've (likewise) given the most-common use-case the first check in the method:

  protected boolean isTokenChar(int c) {
    if(Character.isLetter(c)) {
      return true;
    }

Scott

On Tue, Sep 14, 2010 at 2:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Why would you want to do that, instead of just using another tokenizer and
> a lowercasefilter? It's more confusing less DRY code to leave them separate
> -- the LowerCaseTokenizerFactory combines anyway because someone decided it
> was such a common use case that it was worth it for the demonstrated
> performance advantage. (At least I hope that's what happened, otherwise
> there's no excuse for it!).
>
> Do you know you get a worthwhile performance benefit for what you're doing?
> If not, why do it?
>
> Jonathan
>
> Scott Gonyea wrote:
>
>> I went for a different route:
>>
>> https://issues.apache.org/jira/browse/LUCENE-2644
>>
>> Scott
>>
>> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:
>>
>>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
>>>> tokens, based solely on lower-casing characters. Is there a way to tell it
>>>> NOT to drop non-characters? It's amazingly frustrating that the
>>>> TokenizerFactory and the FilterFactory have two entirely different modes of
>>>> behavior. If I wanted it to tokenize based on non-lower case characters....
>>>> wouldn't I use, say, LetterTokenizerFactory and tack on the
>>>> LowerCaseFilterFactory? Or any number of combinations that would otherwise
>>>> achieve that specific end-result?
>>>
>>> I don't think you should use LowerCaseTokenizerFactory if you dont want to
>>> divide text on non-letters, its intended to do just that.
>>>
>>> from the javadocs:
>>> LowerCaseTokenizer performs the function of LetterTokenizer and
>>> LowerCaseFilter together. It divides text at non-letters and converts them
>>> to lower case. While it is functionally equivalent to the combination of
>>> LetterTokenizer and LowerCaseFilter, there is a performance advantage to
>>> doing the two tasks at once, hence this (redundant) implementation.
>>>
>>> So... Is there a way for me to tell it to NOT split based on
>>> non-characters?
>>> Use a different tokenizer that doesn't split on non-characters, followed by
>>> a LowerCaseFilter
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com