Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?

Scott Gonyea Tue, 14 Sep 2010 14:53:10 -0700

There doesn't seem to have been anything readily available.  All of the
tokenizers make their own assumptions about how I want to treat the data.
In the end, this felt like the most direct approach.  The default behavior
of "LowerCaseTokenizer" (+Factory) is retained, while allowing it to be
extended in very small ways, at the user's discretion.


The comments note that tokenizing and lower-casing were combined for
performance reasons, but I honestly cannot believe the performance gain is
altogether worthwhile.  Whether or not that's the case, I strongly believe
that "LowerCaseTokenizer" should have (more correctly) been called
"LowerCaseLetterTokenizer".

There's arguably zero negative impact from my change.  Where the (inherited)
isTokenChar(int) method from LetterTokenizer was simply:

  protected boolean isTokenChar(int c) {
    return Character.isLetter(c);
  }

I've (likewise) given the most common use case the first check in the
method:

  protected boolean isTokenChar(int c) {
    if (Character.isLetter(c)) {
      return true;
    }
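
For illustration only, here is a rough, self-contained sketch of the idea in
plain Java.  It is not the LUCENE-2644 patch and uses no Lucene classes; the
class name ExtraCharTokenizerSketch, the extraChars parameter, and the
tokenize helper are all hypothetical.  It only shows the letter check coming
first, with an optional set of extra characters consulted afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Lucene-free sketch; all names here are illustrative only.
class ExtraCharTokenizerSketch {

  // Letters are checked first (the common case).  extraChars is an assumed,
  // user-supplied set of additional characters to keep inside tokens.
  static boolean isTokenChar(int c, String extraChars) {
    if (Character.isLetter(c)) {
      return true;
    }
    return extraChars.indexOf(c) >= 0;
  }

  // Splits text wherever isTokenChar is false and lower-cases what it keeps.
  static List<String> tokenize(String text, String extraChars) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int c = text.codePointAt(i);
      i += Character.charCount(c);
      if (isTokenChar(c, extraChars)) {
        current.appendCodePoint(Character.toLowerCase(c));
      } else if (current.length() > 0) {
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Default behavior: letters only.
    System.out.println(tokenize("Wi-Fi Foo_Bar 42", ""));   // [wi, fi, foo, bar]
    // Extended behavior: also keep '-' and '_' inside tokens.
    System.out.println(tokenize("Wi-Fi Foo_Bar 42", "-_")); // [wi-fi, foo_bar]
  }
}
```

With an empty extraChars the sketch behaves like a letters-only,
lower-casing tokenizer, so the default is unchanged; the extra lookup is
only reached for non-letters.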

Scott

On Tue, Sep 14, 2010 at 2:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Why would you want to do that, instead of just using another tokenizer and
> a lowercasefilter?  It's more confusing, less DRY code to leave them separate
> -- the LowerCaseTokenizerFactory  combines anyway because someone decided it
> was such a common use case that it was worth it for the demonstrated
> performance advantage. (At least I hope that's what happened, otherwise
> there's no excuse for it!).
>
> Do you know you get a worthwhile performance benefit for what you're doing?
>  If not, why do it?
>
> Jonathan
>
>
> Scott Gonyea wrote:
>
>> I went for a different route:
>>
>> https://issues.apache.org/jira/browse/LUCENE-2644
>>
>> Scott
>>
>> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:
>>
>>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
>>>> tokens, based solely on lower-casing characters.  Is there a way to tell
>>>> it NOT to drop non-characters?  It's amazingly frustrating that the
>>>> TokenizerFactory and the FilterFactory have two entirely different modes
>>>> of behavior.  If I wanted it to tokenize based on non-lower case
>>>> characters.... wouldn't I use, say, LetterTokenizerFactory and tack on
>>>> the LowerCaseFilterFactory?  Or any number of combinations that would
>>>> otherwise achieve that specific end-result?
>>>>
>>> I don't think you should use LowerCaseTokenizerFactory if you don't want
>>> to divide text on non-letters; it's intended to do just that.
>>>
>>> From the javadocs:
>>> LowerCaseTokenizer performs the function of LetterTokenizer and
>>> LowerCaseFilter together. It divides text at non-letters and converts
>>> them to lower case. While it is functionally equivalent to the
>>> combination of LetterTokenizer and LowerCaseFilter, there is a
>>> performance advantage to doing the two tasks at once, hence this
>>> (redundant) implementation.
>>>
>>>> So... Is there a way for me to tell it to NOT split based on
>>>> non-characters?
>>>
>>> Use a different tokenizer that doesn't split on non-characters, followed
>>> by a LowerCaseFilter.
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>>
>>
>
