I'd agree with your point entirely.  My attacking LowerCaseTokenizer was a
result of not wanting to create yet more classes.

That said, if LowerCaseTokenizer were rightfully dumped, I'd probably end up
creating my own Tokenizer.
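
Roughly, I picture something like the sketch below.  This is untested on my
part and assumes a recent Lucene CharTokenizer API, so the package location
and constructor may differ from what I'm actually running, and the class
name is just for illustration:

import org.apache.lucene.analysis.util.CharTokenizer; // package has moved in newer Lucene releases

/**
 * Hypothetical tokenizer: keeps any run of non-whitespace characters as a
 * single token, rather than splitting on non-letters the way
 * LetterTokenizer / LowerCaseTokenizer do.
 */
public final class NonWhitespaceTokenizer extends CharTokenizer {
  @Override
  protected boolean isTokenChar(int c) {
    // A character belongs to a token as long as it isn't whitespace.
    return !Character.isWhitespace(c);
  }
}

(Which is, admittedly, more or less what WhitespaceTokenizer already does.)
Lowercasing would then be left to a LowerCaseFilter behind it, per the
discussion below.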

I could very well be thinking about this wrong... but what if I wanted to
create tokens based on non-whitespace / alpha / numeric character content?

It looks like I could perhaps use the PatternTokenizer, but it didn't leave
me with a comfortable feeling when I first looked into it.
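
For reference, this is the kind of thing I was playing with -- a rough,
untested sketch against a recent Lucene API (the PatternTokenizer
constructor has changed between versions, so treat the details as my
assumptions), splitting on runs of whitespace so everything else stays
inside the tokens:

import java.io.StringReader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PatternTokenizerSketch {
  public static void main(String[] args) throws Exception {
    // group = -1 means "split on the pattern": whitespace runs become the
    // token boundaries and every other character is kept in the tokens.
    PatternTokenizer tokenizer =
        new PatternTokenizer(Pattern.compile("\\s+"), -1);
    tokenizer.setReader(new StringReader("Foo-bar BAZ_42 qux!"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // Foo-bar, BAZ_42, qux!
    }
    tokenizer.end();
    tokenizer.close();
  }
}

A LowerCaseFilter would still get tacked on after it in the analysis chain.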

Scott

On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir <rcm...@gmail.com> wrote:

> Jonathan, you bring up an excellent point.
>
> I think it's worth our time to actually benchmark this LowerCaseTokenizer
> versus LetterTokenizer + LowerCaseFilter.
>
> This tokenizer is quite old, and although I can understand there is no
> doubt it's technically faster than LetterTokenizer + LowerCaseFilter even
> today (as it can just go through the char[] only a single time), I have my
> doubts that this brings any value these days...
>
>
> On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>
> > Why would you want to do that, instead of just using another tokenizer
> > and a LowerCaseFilter?  It's more confusing, less-DRY code to leave them
> > separate -- the LowerCaseTokenizerFactory combines them anyway because
> > someone decided it was such a common use case that it was worth it for
> > the demonstrated performance advantage. (At least I hope that's what
> > happened; otherwise there's no excuse for it!)
> >
> > Do you know you get a worthwhile performance benefit for what you're
> > doing?  If not, why do it?
> >
> > Jonathan
> >
> >
> > Scott Gonyea wrote:
> >
> >> I went for a different route:
> >>
> >> https://issues.apache.org/jira/browse/LUCENE-2644
> >>
> >> Scott
> >>
> >> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:
> >>
> >>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't
> >>>> create tokens based solely on lower-casing characters.  Is there a
> >>>> way to tell it NOT to drop non-characters?  It's amazingly
> >>>> frustrating that the TokenizerFactory and the FilterFactory have two
> >>>> entirely different modes of behavior.  If I wanted it to tokenize
> >>>> based on non-lower-case characters... wouldn't I use, say,
> >>>> LetterTokenizerFactory and tack on the LowerCaseFilterFactory?  Or
> >>>> any number of combinations that would otherwise achieve that
> >>>> specific end-result?
> >>>
> >>> I don't think you should use LowerCaseTokenizerFactory if you don't
> >>> want to divide text on non-letters; it's intended to do just that.
> >>>
> >>> From the javadocs:
> >>> LowerCaseTokenizer performs the function of LetterTokenizer and
> >>> LowerCaseFilter together. It divides text at non-letters and converts
> >>> them to lower case. While it is functionally equivalent to the
> >>> combination of LetterTokenizer and LowerCaseFilter, there is a
> >>> performance advantage to doing the two tasks at once, hence this
> >>> (redundant) implementation.
> >>>
> >>> So... Is there a way for me to tell it to NOT split based on
> >>> non-characters?
> >>>
> >>> Use a different tokenizer that doesn't split on non-characters,
> >>> followed by a LowerCaseFilter.
> >>>
> >>> --
> >>> Robert Muir
> >>> rcm...@gmail.com
>
> --
> Robert Muir
> rcm...@gmail.com
>
