[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Hoss Man (JIRA) Sun, 03 Jan 2010 21:51:18 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796087#action_12796087
 ]


Hoss Man commented on SOLR-1677:
--------------------------------


bq. The problem is the default value. If you leave out the version parameter 
instance-wise, you will get 2.4. And because of that all solr users will get 
stuck with that version and will never upgrade (because they leave the default 
and do not specify a different value).

That feels like a misleading statement ... the "Version" property on these 
objects is really more about getting the "recommended" behavior as of a 
particular version of Lucene ... saying that users will be "stuck with that 
version" is like saying users will be "stuck with StandardAnalyzer" instead of 
getting "NewHotnessAnalyzer" because they have to edit their config to use the 
newer/better analyzer -- Lucene-Java has opted to use a Version property on 
existing classes instead of adding new classes, but it's still conceptually the 
same thing: they get the behavior they've always gotten, unless they change 
their config to get something different.

Besides which: 99.9% of Solr users copy the example config when they first 
start using Solr: we can set a "version" property on every Analyzer/Factory 
used in the example schema.xml and update them all when we upgrade the Lucene 
jars just as easily as we can update a single "global" value (it's a 
search+replaceAll instead of a search+replace)


bq. Why are you so against a default value? 

My concern is that it introduces action at a distance -- and not in a good way.

Here's the scenario that seems guaranteed to happen quite a bit if we add some 
new {{<luceneAnalyzerVersionDefault/>}} syntax to schema.xml...

{panel}

{{<luceneAnalyzerVersionDefault>2.9</luceneAnalyzerVersionDefault>}} is added 
to the example schema.xml, and users start using it as a result of 
copying/modifying the example configs.  Time passes, new bugs are fixed, and 
the example configs evolve to contain 
{{<luceneAnalyzerVersionDefault>3.4</luceneAnalyzerVersionDefault>}} 

A little while after that, User Bob emails solr-user with a question like...

{quote}
Hey, I'm using FooTokenFilterFactory and i noticed that at query time i see 
behaviorX when it really seems like i should see BehaviorY 
{quote}

User Carl helpfully replies...

{quote}
That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, 
but the default behavior was left as is for backcompatibility.  If you change 
your {{<luceneAnalyzerVersionDefault/>}} value to 3.1 (or 3.2) you'll get the 
newer/better behavior -- but if you used FooTokenFilterFactory in an _index_ 
analyzer you'll need to reindex.
{quote}

Bob makes the change to 3.2 that Carl recommended, and is happy to see that his 
queries now work.  He only uses FooTokenFilterFactory at _query_ time, so he 
doesn't bother to reindex, and everything seems fine.

What Bob doesn't realize (and what Carl wasn't aware of) is that elsewhere in 
his schema.xml file, Bob is also using the YakTokenizerFactory on a different 
field (yakField), and the behavior of the YakTokenizer changed in Lucene 3.0. 
Now _some_ documents/queries that use yakField are failing -- and *failing 
silently.*

{panel}
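
To make the action at a distance concrete, here is a rough Java sketch of the scenario above. It is only a toy model, not Solr or Lucene code: the class, the methods, and the integer version encoding (29 for 2.9, 32 for 3.2) are all made up for illustration, and FooTokenFilterFactory and YakTokenizerFactory are the hypothetical factories from the story.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the Bob/Carl scenario. Nothing here is a real Solr/Lucene API.
class VersionDefaultSketch {

    // Hypothetical: the version at which each factory's behavior changed.
    // Versions are encoded as major*10+minor (31 means "3.1") for easy comparison.
    static final Map<String, Integer> CHANGED_IN = new LinkedHashMap<>();
    static {
        CHANGED_IN.put("FooTokenFilterFactory", 31); // bug fixed in 3.1
        CHANGED_IN.put("YakTokenizerFactory", 30);   // behavior changed in 3.0
    }

    /** An explicit per-factory version wins; otherwise the global default applies. */
    static int effectiveVersion(Integer explicit, int globalDefault) {
        return explicit != null ? explicit : globalDefault;
    }

    /** True if the factory uses its newer behavior at the given version. */
    static boolean usesNewBehavior(String factory, int version) {
        return version >= CHANGED_IN.get(factory);
    }

    /** Factories whose behavior flips when the global default moves. */
    static List<String> affected(Map<String, Integer> explicit,
                                 int oldDefault, int newDefault) {
        List<String> result = new ArrayList<>();
        for (String factory : CHANGED_IN.keySet()) {
            Integer pinned = explicit.get(factory); // null when not set explicitly
            boolean before = usesNewBehavior(factory, effectiveVersion(pinned, oldDefault));
            boolean after = usesNewBehavior(factory, effectiveVersion(pinned, newDefault));
            if (before != after) {
                result.add(factory);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Bob bumps the global default from 2.9 to 3.2 with nothing set explicitly.
        System.out.println(affected(new HashMap<>(), 29, 32));

        // Same bump, but YakTokenizerFactory carries its own explicit version.
        Map<String, Integer> explicit = new HashMap<>();
        explicit.put("YakTokenizerFactory", 29);
        System.out.println(affected(explicit, 29, 32));
    }
}
```

In the first call both factories flip, even though Bob only meant to change one of them; in the second, only FooTokenFilterFactory does, because the Yak setting is written down where it is used.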

Things just get a lot simpler when all of the configuration for an Analyzer, 
TokenizerFactory, or Tokenizer is explicit in its declaration -- indirect 
initialization is fine, as long as it's obvious.  E.g.: <field/> declarations 
referencing fieldTypes by name -- It's easy to fuck up a bunch of fields by 
making a single change to one fieldType, but at least you can grep for the name 
of the fieldType to see all the fields you are affecting.  

Even if "Carl" knows/remembers to warn "Bob" that changing 
{{<luceneAnalyzerVersionDefault/>}} might change/break other things in his 
schema.xml the situation doesn't get much better: Unless Bob (or Carl) skims the 
code for every Analyzer, Tokenizer, and TokenFilter used in Bob's schema, they 
can't be sure what might get affected by making a small increase to the 
"global" luceneAnalyzerVersion setting ... which means the only safe thing for 
Bob to do is to set the property individually in the one place he really wants 
to make the change.

So why have the "global" in the first place?  It really just seems like more 
trouble than it's worth.

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.
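
For readers unfamiliar with the "helper map" mentioned in the issue description, the following is a minimal sketch of what such a decoder could look like. It is not the code from the attached patch: the enum is a stand-in for o.a.lucene.util.Version, and the accepted string forms ("2.9" as well as "LUCENE_29") are assumptions. Only the LUCENE_24 default when no version is given comes from the description.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only: not the patch's implementation.
class VersionDecodeSketch {

    // Stand-in for org.apache.lucene.util.Version (hypothetical subset).
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Helper map from version strings to constants.
    static final Map<String, Version> BY_NAME = new HashMap<>();
    static {
        // Assumed short forms.
        BY_NAME.put("2.4", Version.LUCENE_24);
        BY_NAME.put("2.9", Version.LUCENE_29);
        BY_NAME.put("3.0", Version.LUCENE_30);
        // Constant names such as "LUCENE_29" are accepted as well.
        for (Version v : Version.values()) {
            BY_NAME.put(v.name(), v);
        }
    }

    /** Decodes a version string; a missing value falls back to LUCENE_24. */
    static Version parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Version.LUCENE_24; // default named in the issue description
        }
        Version v = BY_NAME.get(value.trim().toUpperCase(Locale.ROOT));
        if (v == null) {
            throw new IllegalArgumentException("Unknown Lucene version: " + value);
        }
        return v;
    }
}
```

Rejecting unknown strings, rather than silently falling back to the default, is a design choice of this sketch; the description does not say how the patch handles them.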

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
