How about patching LetterTokenizer to tokenize the way you want, which can then be combined with a LowerCaseFilter (or not) as desired? Or indeed creating a new tokenizer that does exactly what you want (but one that doesn't embed a LowerCaseFilter inside it!), instead of patching LowerCaseTokenizer, which is of dubious value. Just brainstorming.
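To illustrate the composition being suggested, here is a minimal stdlib-only sketch. The class and method names are hypothetical, not Lucene API; it only shows why keeping the splitting step separate from the case-folding step composes cleanly, since the lowercase step can be applied or dropped without touching the tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Lucene API): tokenization and case folding
// as two independent, composable steps.
public class SeparatedSteps {
    // Split at non-letters, roughly what LetterTokenizer does.
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // flush at a non-letter
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    // The optional second step, analogous to a LowerCaseFilter.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase());
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = letterTokenize("Foo-Bar 42 baz");
        System.out.println(raw);             // [Foo, Bar, baz]
        System.out.println(lowerCase(raw));  // [foo, bar, baz]
    }
}
```

The point of the separation: a field that needs case-sensitive tokens just skips the second call, whereas a fused tokenizer forces lowercasing on everyone.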
Another way to tokenize on non-whitespace/alpha/numeric character content might be the existing PatternTokenizerFactory with a suitable regexp, as you mention. That could of course also do what LetterTokenizer does, though presumably not as efficiently. Is that what gives you an uncomfortable feeling? If it performs badly enough to matter, that's a reason for a custom tokenizer; beyond that, I'm not sure anything's undesirable about the PatternTokenizer.
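To make the regexp idea concrete, here is a stdlib-only sketch of the splitting semantics. It is not the PatternTokenizerFactory itself (that is configured in the schema, not called like this); it just shows that with a whitespace pattern, every run of non-whitespace characters survives as a token, punctuation and digits included.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: mimics the effect of a pattern-based tokenizer
// configured with a whitespace pattern such as "\s+".
public class PatternSplit {
    static List<String> tokenize(String text) {
        // Runs of non-whitespace become tokens; nothing else is dropped.
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C++ isn't e-mail_2 text"));
        // [C++, isn't, e-mail_2, text] -- punctuation and digits survive
    }
}
```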
Jonathan
Scott Gonyea wrote:
I'd agree with your point entirely. My attacking LowerCaseTokenizer was a result of not wanting to create yet more classes.
That said, rightfully dumping LowerCaseTokenizer would probably mean creating my own tokenizer.
I could very well be thinking about this wrong... but what if I wanted to create tokens based on non-whitespace/alpha/numeric character content? It looks like I could perhaps use the PatternTokenizer, but that didn't leave me with a comfortable feeling when I first looked into it.
Scott
On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir <rcm...@gmail.com> wrote:
Jonathan, you bring up an excellent point.
I think it's worth our time to actually benchmark this LowerCaseTokenizer versus LetterTokenizer + LowerCaseFilter.
This tokenizer is quite old, and although I can understand there is no doubt it's technically faster than LetterTokenizer + LowerCaseFilter even today (as it can go through the char[] only a single time), I have my doubts that this brings any value these days...
On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
Why would you want to do that, instead of just using another tokenizer and a LowerCaseFilter? It's more confusing, less DRY code to leave them separate -- the LowerCaseTokenizerFactory combines them anyway because someone decided it was such a common use case that it was worth it for the demonstrated performance advantage. (At least I hope that's what happened; otherwise there's no excuse for it!)
Do you know you get a worthwhile performance benefit for what you're doing? If not, why do it?
Jonathan
Scott Gonyea wrote:
I went for a different route:
https://issues.apache.org/jira/browse/LUCENE-2644
Scott
On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:
On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
Hi,
I'm tweaking my schema, and the LowerCaseTokenizerFactory doesn't create tokens based solely on lower-casing characters. Is there a way to tell it NOT to drop non-characters? It's amazingly frustrating that the TokenizerFactory and the FilterFactory have two entirely different modes of behavior. If I wanted it to tokenize based on non-lower-case characters... wouldn't I use, say, LetterTokenizerFactory and tack on the LowerCaseFilterFactory? Or any number of combinations that would otherwise achieve that specific end result?
I don't think you should use LowerCaseTokenizerFactory if you don't want to divide text on non-letters; it's intended to do just that.
From the javadocs:
LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts them to lower case. While it is functionally equivalent to the combination of LetterTokenizer and LowerCaseFilter, there is a performance advantage to doing the two tasks at once, hence this (redundant) implementation.
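The equivalence the javadoc describes can be sketched in plain JDK code (illustrative only, not the real Lucene implementation): the fused version folds each character to lower case during the same pass that splits at non-letters, so the input is scanned once instead of twice.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fused behavior the javadoc describes: splitting at
// non-letters and lowercasing in a single pass over the characters.
public class FusedTokenizer {
    static List<String> lowerCaseTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                // Fold case here, in the same pass that tokenizes.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseTokenize("Foo-Bar baz"));  // [foo, bar, baz]
    }
}
```

The output is identical to running a letter-only tokenizer and then lowercasing each token, which is exactly why the fused class is functionally redundant and valuable only if the single pass measurably matters.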
So... Is there a way for me to tell it to NOT split based on non-characters?
Use a different tokenizer that doesn't split on non-characters, followed by a LowerCaseFilter.
--
Robert Muir
rcm...@gmail.com