Because (just IMO, I'm not an expert here either) the basic framework in
Solr is that tokenizers tokenize, but they don't generally change bytes
inside values. What changes bytes (or adds or removes tokens from the
token stream initially created by a tokenizer, etc.) is filters. And
there's already a LowerCaseFilter to change your bytes by lower-casing
them, so if you want to lowercase your bytes, it's already there for you.

And if we're talking patches (rather than a custom thing you write for
yourself), users can already combine the tokenizing behavior they want
with the filtering behavior they want by mixing and matching tokenizers
and filters -- no need to write new code to provide options to
tokenizers, options which would have to be kept consistent with the
filters to avoid confusion and would duplicate existing filter behavior.
The "DRY" here applies to giving Solr users too many duplicative options
in their schema.xml too, not just to the underlying Java -- although
even Java code that just calls other existing Java code is still extra
Java code to be maintained.
So that's why I suggest using a LowerCaseFilter to, well, lower-case
your values, instead of building lower-casing into a tokenizer and
submitting that as a patch.
So if you need to build or change a tokenizer to get the tokenizing
behavior you want, then do that -- just don't build a LowerCaseFilter
into the tokenizer too. If what you want is basically the same as the
existing LetterTokenizer but with a few changes, then it might make
sense to add options to the LetterTokenizer. If what you want is so
different from LetterTokenizer that this is inconvenient, then maybe you
need to create a new tokenizer. But either way, there's no reason to
have a LowerCaseFilter built into a Tokenizer -- unless maybe there is a
demonstrated non-trivial performance advantage and (for a patch to Solr
rather than your own personal custom tokenizer) it's a common use case.
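To make the "create a new tokenizer" route concrete, a hypothetical
sketch (assuming the Lucene 2.9/3.0-era CharTokenizer API; the class
name and the letters-or-digits rule are my own invention):

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// Hypothetical: keep letters AND digits together in one token
// (LetterTokenizer would split on the digits). Note it does no
// case-folding -- that stays in LowerCaseFilter, later in the chain.
public class LetterOrDigitTokenizer extends CharTokenizer {
  public LetterOrDigitTokenizer(Reader in) {
    super(in);
  }

  @Override
  protected boolean isTokenChar(char c) {
    return Character.isLetterOrDigit(c);
  }
}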
That's all I was trying to suggest. Perhaps I've misunderstood what
your needs or plans are, though.
Scott Gonyea wrote:
There are a lot of reasons, with the performance hit being a notable
one -- but also because I feel that using a regex for something this
basic amounts to a lazy hack. I'm typically against regular expressions
in XML.
I'm vehemently opposed to them in cases where avoiding them should
otherwise be quite trivial. Regarding LowerCaseFilter, etc.:
My question is: Why should LowerCaseFilter be the means by which that work
is done? I fully agree with keeping things DRY, but I'm not quite sure I
agree with how that mantra is being employed. For instance, the two
tokenizer statements:
<tokenizer class="solr.WhiteSpaceTokenizer" downCase="true"/>
<tokenizer class="solr.LowerCaseLetterTokenizer"/>
can be written to utilize the same codebase, which makes things DRY and
*may* even be a bit more performant for less trivial transformations.
If nothing else, I think a "CharacterTokenizer" would be a good way to go:

<tokenizer class="solr.CharacterTokenizer" downCase="true"
    tokenizeSpecialCharacters="true" tokenizeWhiteSpace="true"
    tokenizedCharacterClasses="wd"/>
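To illustrate the "same codebase" argument, one hypothetical way such a
tokenizer could be written (Lucene 2.9/3.0-era CharTokenizer API; the
class and the option semantics are my guesses at what the XML above
intends -- none of this exists in Solr):

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// Hypothetical configurable tokenizer: one codebase, two behaviors.
public class CharacterTokenizer extends CharTokenizer {
  private final boolean downCase;

  public CharacterTokenizer(Reader in, boolean downCase) {
    super(in);
    this.downCase = downCase;
  }

  @Override
  protected boolean isTokenChar(char c) {
    // Guessed semantics: token characters are letters and digits.
    return Character.isLetterOrDigit(c);
  }

  @Override
  protected char normalize(char c) {
    // downCase="true" folds case in the same pass over the char[];
    // this is the same trick LowerCaseTokenizer uses internally.
    return downCase ? Character.toLowerCase(c) : c;
  }
}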
All that said :) I don't promote myself as an expert and I'm happy to be
shown the light / slapped across the head.
Scott
On Tue, Sep 14, 2010 at 3:10 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
How about patching the LetterTokenizer to be capable of tokenizing how
you want, which can then be combined with a LowerCaseFilter (or not) as
desired? Or possibly creating a new tokenizer that does exactly what you
want (but one that doesn't combine an embedded LowerCaseFilter in there
too!), instead of patching the LowerCaseTokenizer, which is of dubious
value. Just brainstorming.
Another way to tokenize based on "Non-Whitespace/Alpha/Numeric
character-content" might be using the existing PatternTokenizerFactory
with a suitable regexp, as you mention. That could of course do what the
LetterTokenizer does too, but presumably not as efficiently. Is that
what gives you an uncomfortable feeling? If it performs badly enough to
matter, then that's why you'd need a custom tokenizer; other than that,
I'm not sure anything's undesirable about the PatternTokenizer.
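(To make the regexp idea concrete -- this is just the underlying
java.util.regex behavior, not the PatternTokenizerFactory configuration
itself, and the pattern is my own guess at "split on anything that isn't
a letter or digit":)

import java.util.Arrays;

// Rough illustration: split on any run of characters that is
// neither a Unicode letter nor a Unicode digit.
public class RegexSplitDemo {
  public static void main(String[] args) {
    String[] tokens = "Foo-Bar_42 baz!".split("[^\\p{L}\\p{N}]+");
    System.out.println(Arrays.toString(tokens)); // [Foo, Bar, 42, baz]
  }
}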
Jonathan
Scott Gonyea wrote:
I'd agree with your point entirely. My attacking LowerCaseTokenizer was a
result of not wanting to create yet more Classes.
That said, rightfully dumping LowerCaseTokenizer would probably have me
creating my own Tokenizer.
I could very well be thinking about this wrong... But what if I wanted to
create tokens based on Non-Whitespace/Alpha/Numeric character-content?
It looks like I could perhaps use the PatternTokenizer, but that didn't
leave me with a comfortable feeling when I first looked into it.
Scott
On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir <rcm...@gmail.com> wrote:
Jonathan, you bring up an excellent point.

I think it's worth our time to actually benchmark this LowerCaseTokenizer
versus LetterTokenizer + LowerCaseFilter.

This tokenizer is quite old, and although I can understand there is no
doubt it's technically faster than LetterTokenizer + LowerCaseFilter even
today (as it can go through the char[] only a single time), I have my
doubts that this brings any value these days...
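(A rough sketch of what that benchmark might look like, assuming the
Lucene 2.9/3.0-era API -- the corpus, iteration count, and lack of any
JIT warm-up discipline are all placeholder assumptions, so treat any
numbers as indicative at best:)

import java.io.StringReader;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class LowerCaseBench {
  static final String TEXT =
      "The Quick Brown Fox Jumped Over The Lazy Dog ";
  static final int ITERATIONS = 100000;

  // Time how long it takes to fully consume the token stream.
  static long drainNanos(boolean combined) throws Exception {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      TokenStream ts = combined
          ? new LowerCaseTokenizer(new StringReader(TEXT))
          : new LowerCaseFilter(new LetterTokenizer(new StringReader(TEXT)));
      ts.reset();
      while (ts.incrementToken()) {
        // just drain; we only care about throughput
      }
      ts.close();
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("combined: " + drainNanos(true) / 1000000 + " ms");
    System.out.println("separate: " + drainNanos(false) / 1000000 + " ms");
  }
}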
On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

Why would you want to do that, instead of just using another tokenizer
and a lowercasefilter? It's more confusing, less DRY code not to leave
them separate -- the LowerCaseTokenizerFactory combines them anyway
because someone decided it was such a common use case that it was worth
it for the demonstrated performance advantage. (At least I hope that's
what happened, otherwise there's no excuse for it!)

Do you know you get a worthwhile performance benefit for what you're
doing? If not, why do it?
Jonathan
Scott Gonyea wrote:
I went for a different route:
https://issues.apache.org/jira/browse/LUCENE-2644
Scott
On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:

On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
Hi,

I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
tokens based solely on lower-casing characters. Is there a way to tell
it NOT to drop non-characters? It's amazingly frustrating that the
TokenizerFactory and the FilterFactory have two entirely different modes
of behavior. If I wanted it to tokenize based on non-lower-case
characters... wouldn't I use, say, LetterTokenizerFactory and tack on
the LowerCaseFilterFactory? Or any number of combinations that would
otherwise achieve that specific end-result?
I don't think you should use LowerCaseTokenizerFactory if you don't want
to divide text on non-letters; it's intended to do just that.

From the javadocs:

LowerCaseTokenizer performs the function of LetterTokenizer and
LowerCaseFilter together. It divides text at non-letters and converts
them to lower case. While it is functionally equivalent to the
combination of LetterTokenizer and LowerCaseFilter, there is a
performance advantage to doing the two tasks at once, hence this
(redundant) implementation.
So... Is there a way for me to tell it to NOT split based on
non-characters?

Use a different tokenizer that doesn't split on non-characters, followed
by a LowerCaseFilter.
--
Robert Muir
rcm...@gmail.com
--
Robert Muir
rcm...@gmail.com