[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Hoss Man (JIRA) Tue, 19 Jan 2010 17:37:20 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802609#action_12802609
 ]


Hoss Man commented on SOLR-1677:
--------------------------------

I'm definitely of two minds on this.

On the one hand...

Robert's clarification of his concerns convinces me that we don't need a global 
setting.  The issue of multiple related components in an analysis chain (e.g.: 
EsperantoTokenizer, EsperantoStopFilter, and EsperantoStemmerFilter) not being 
well tested in Lucene-Java when those components use different Version 
properties doesn't seem like a compelling argument, because we've never made 
any claims that arbitrary combinations of analysis components will work 
together.  People can easily construct Analyzers in their schema.xml that make 
no sense and don't work at all; we'll never be able to solve that problem for 
everyone.  Worrying about people mismatching version numbers doesn't seem any 
different than worrying about them using inconsistent stopword files between 
an index analyzer and a query analyzer on the same field: buyer beware.

On the other hand...

I view the Version property of all these Lucene-Java classes as an 
implementation detail of the generalized ideal of providing multiple solutions 
for a similar problem that have subtly different behavior.  To my mind: adding 
a Version property to StandardTokenizer is just an alternate approach to 
deprecating StandardTokenizer and providing a new StandardTokenizer2 where the 
behavior is "improved" based on the subjective opinion of the Lucene community. 
The Version property approach is easier to maintain in the Lucene source tree, 
but still requires roughly the same amount of work on the part of client app 
maintainers when upgrading: consider whether you think the "improved" behavior 
is better for your application, and modify your code as needed.  I've been 
looking at how this should be supported in Solr with that perspective, putting 
the schema.xml owner in the role of the client app maintainer.

But I'm realizing now that I'm clearly in the minority in viewing these 
multiple versions as "alternate implementations" ... everyone else seems to 
have a very fixed view that these Version-based changes are genuine 
improvements/bug-fixes, w/o any expectation that clients might/could 
subjectively decide "I want the old behavior", and that older "Versions" are 
supported purely for back-compatibility.

If that's how Version is really going to be used in Lucene-Java moving forward, 
then I can definitely understand the push for having it globally configured in 
Solr for simplification.

----

I won't fight you guys on this ... if I'm the only one that feels like a global 
value is bad, then I concede that probably says more about me than about the 
idea.

But I'm still really worried about the problem of (opaque) action at a 
distance, and the difficulties in understanding what effects there will be when 
changing the luceneMatchVersion property from one value to another.

This comment from Mark illustrates what scares me the most...

bq. it should say, if you change this, you must reindex. No worries about 
action at a distance. The action is to get the latest and greatest Lucene has 
to offer rather than older buggy or back compat behavior.

...that mindset, that as long as you reindex you'll be fine, totally downplays 
the fact that changes will happen in places the user may not realize.  W/o a 
clear way of knowing what exactly is changing when you modify that (global) 
value, users will have no idea what to look for when they "upgrade" it.  They 
won't have any visibility into the full set of behavior changes to expect as a 
result of that update, so they won't know what they should test to make sure 
it still works the way they need it to.

If they read in a mailing list thread that they need to switch from 
{{<luceneMatchVersion>2.4</luceneMatchVersion>}} to 
{{<luceneMatchVersion>2.9</luceneMatchVersion>}} and completely reindex in 
order to get positions to be preserved in StopFilterFactory, that doesn't help 
them realize that they should also do relevancy testing on fieldA and fieldB, 
which use some language-specific stemmer whose behavior changed in a small but 
significant way.

As a user, that's the nightmare scenario I don't want to have to deal with: 
grepping through every class in Lucene-Java that has a Version property to see 
which ones have different behavior between the luceneMatchVersion value I'm 
currently using and the luceneMatchVersion value I've been told I should 
upgrade to in order to fix a bug ... just so I know what things I need to test 
after I make my change.
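
To make that concern concrete, here is a minimal, self-contained sketch of how 
one global value can change the behavior of two unrelated components at once. 
Everything in it is hypothetical: MatchVersion is only a stand-in for 
o.a.lucene.util.Version, and SketchStopFilter and SketchStemmer are made-up 
components whose "changes" are invented for illustration, not the behavior of 
any real Lucene class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for o.a.lucene.util.Version (assumption, not the real enum).
enum MatchVersion { V24, V29 }

// Made-up stop filter: from V29 on it leaves a "_" placeholder so positions
// are preserved; before that it silently drops the stop word.
class SketchStopFilter {
    static List<String> filter(List<String> tokens, MatchVersion v) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.equals("the")) {
                if (v.compareTo(MatchVersion.V29) >= 0) {
                    out.add("_");
                }
            } else {
                out.add(t);
            }
        }
        return out;
    }
}

// Made-up stemmer: from V29 on it strips a trailing "es"; otherwise only a trailing "s".
class SketchStemmer {
    static String stem(String token, MatchVersion v) {
        if (v.compareTo(MatchVersion.V29) >= 0 && token.endsWith("es")) {
            return token.substring(0, token.length() - 2);
        }
        if (token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }
}

class ActionAtADistance {
    // Runs the same input through both components using one global version.
    static List<String> analyze(List<String> tokens, MatchVersion v) {
        List<String> out = new ArrayList<>();
        for (String t : SketchStopFilter.filter(tokens, v)) {
            out.add(SketchStemmer.stem(t, v));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the", "boxes");
        for (MatchVersion v : MatchVersion.values()) {
            System.out.println(v + " -> " + analyze(input, v));
        }
    }
}
```

In this toy setup, a user who bumps the version only to get the stop filter's 
position handling also gets a different stem for "boxes", and nothing in the 
configuration tells them so.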

I guess this will just be a documentation problem, but it seems like a 
pretty fucking big one.



> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.
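
The version-string decoding described in the quoted issue could look roughly 
like the sketch below. It is not the code from the attached patch: 
SketchVersion is a hypothetical stand-in for o.a.lucene.util.Version, 
VersionDecoder is a made-up helper name, and the accepted strings ("2.4", 
"2.9", "3.0", plus the enum constant names) are assumptions. Only the 
LUCENE_24 default comes from the issue text.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for o.a.lucene.util.Version (assumption).
enum SketchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

// Made-up helper; the name and accepted inputs are illustrative only.
class VersionDecoder {
    // Default when nothing is configured, mirroring the LUCENE_24 default
    // the issue describes for the no-version constructors.
    static final SketchVersion DEFAULT = SketchVersion.LUCENE_24;

    // Helper map from short version strings to enum constants.
    private static final Map<String, SketchVersion> BY_NAME = new HashMap<>();
    static {
        BY_NAME.put("2.4", SketchVersion.LUCENE_24);
        BY_NAME.put("2.9", SketchVersion.LUCENE_29);
        BY_NAME.put("3.0", SketchVersion.LUCENE_30);
    }

    static SketchVersion decode(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT;
        }
        String key = raw.trim();
        SketchVersion mapped = BY_NAME.get(key);
        if (mapped != null) {
            return mapped;
        }
        // Fall back to the enum constant name, as valueOf(String) allows
        // once the version type is a real enum.
        try {
            return SketchVersion.valueOf(key.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown version: " + raw, e);
        }
    }
}
```

Failing loudly on an unknown string, rather than quietly falling back to the 
default, is a design choice of this sketch; a silent fallback would make a typo 
in the configuration indistinguishable from an intentional LUCENE_24.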

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
