[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802609#action_12802609 ]
Hoss Man commented on SOLR-1677:
--------------------------------

I'm definitely of two minds on this.

On the one hand... Robert's clarification of his concerns convinces me that we don't need a global setting. The issue of multiple related components in an analysis chain (ie: EsperantoTokenizer, EsperantoStopFilter, and EsperantoStemmerFilter) not being well tested in Lucene-Java when those components use different Version properties doesn't seem like a compelling argument, because we've never made any claims that any combination of analysis components will work together. People can easily construct Analyzers in their schema.xml that make no sense and don't work at all; we'll never be able to solve that problem for everyone. Worrying about people mismatching version numbers doesn't seem any different than worrying about them using inconsistent stopword files between an index analyzer and a query analyzer on the same field: buyer beware.

On the other hand... I view the Version property of all these Lucene-Java classes as an implementation detail of the generalized ideal of providing multiple solutions for a similar problem that have subtly different behavior. To my mind: adding a Version property to StandardTokenizer is just an alternate approach to deprecating StandardTokenizer and providing a new StandardTokenizer2 where the behavior is "improved" based on the subjective opinion of the Lucene community. The Version property approach is easier to maintain in the Lucene source tree, but still requires roughly the same amount of work on the part of client app maintainers when upgrading: consider whether you think the "improved" behavior is better for your application, and modify your code as needed. I've been looking at how this should be supported in Solr with that perspective, putting the schema.xml owner in the role of the client app maintainer.

But I'm realizing now that I'm clearly in the minority in viewing these multiple versions as "alternate implementations" ... everyone else seems to have a very fixed view that these Version based changes are genuine improvements/bug-fixes, w/o any expectation that clients might/could subjectively decide "i want the old behavior", and that older "Versions" are supported purely for back-compatibility. If that's how Version is really going to be used in Lucene-Java moving forward, then I can definitely understand the push for having it globally configured in Solr for simplification.

----

I won't fight you guys on this ... if I'm the only one that feels like a global value is bad, then i concede that probably says more about me than about the idea. But I'm still really worried about the problem of (opaque) action at a distance, and the difficulties in understanding what effects there will be when changing the luceneMatchVersion property from one value to another. This comment from Mark illustrates what scares me the most...

bq. it should say, if you change this, you must reindex. No worries about action at a distance. The action is to get the latest and greatest Lucene has to offer rather than older buggy or back compat behavior.

...that mindset, that as long as you reindex you'll be fine, totally downplays the fact that changes will happen in places the user may not realize. w/o a clear way of knowing what exactly is changing when you modify that (global) value, users will have no idea what to look for when they "upgrade" it. they won't have any visibility into the full set of behavior changes to expect as a result of that update, so they won't know what they should test to make sure it still works the way they need it to.

If they read in a mailing list thread that they need to switch from {{<luceneMatchVersion>2.4</luceneMatchVersion>}} to {{<luceneMatchVersion>2.9</luceneMatchVersion>}} and completely reindex in order to get positions to be preserved in StopFilterFactory, that doesn't help them realize that they should do relevancy testing on fieldA and fieldB, which use some language specific stemmer whose behavior changed in a small but significant way.

As a user, that's the nightmare scenario i don't want to have to deal with: grepping through every class in Lucene-Java that has a Version property to see which ones have different behavior between the luceneMatchVersion value i'm currently using and the luceneMatchVersion value i've been told i should upgrade to in order to fix a bug ... just so i know what things i need to test after i make my change.

I guess this will just be a documentation problem, but it seems like a pretty fucking big one.

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch,
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards
> compatibility with old indexes created using older versions of Lucene. The
> most important example is StandardTokenizer, which changed its behaviour with
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with
> much more Unicode support, almost every Tokenizer/TokenFilter needs this
> Version parameter. In 2.9, the deprecated old ctors without Version take
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in
> 3.0) / Parameter (in 2.9) for constructing TokenStreams. The code currently
> contains a helper map to decode the version strings, but in 3.0 it can be
> replaced by Version.valueOf(String), as the Version is a subclass of Java5
> enums. The default value is Version.LUCENE_24 (as this is the default for the
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed
> to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
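
For readers who have not looked at the patch: the issue description above mentions a helper map that decodes version strings, an enum valueOf(String) alternative, and a LUCENE_24 default. The following is only a rough sketch of that idea, not the code from SOLR-1677.patch. The MatchVersion enum is a hypothetical stand-in for o.a.lucene.util.Version, and the class name, the decode method, and the accepted "2.4"/"2.9"/"3.0" strings are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class VersionDecoderSketch {

    // Hypothetical stand-in for org.apache.lucene.util.Version (NOT the real class).
    public enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Helper map from short version strings to enum constants.
    // The accepted spellings here are an assumption for this sketch.
    private static final Map<String, MatchVersion> BY_NAME = new HashMap<>();
    static {
        BY_NAME.put("2.4", MatchVersion.LUCENE_24);
        BY_NAME.put("2.9", MatchVersion.LUCENE_29);
        BY_NAME.put("3.0", MatchVersion.LUCENE_30);
    }

    // Decode a configured value; a missing value falls back to LUCENE_24,
    // mirroring the default named in the issue description.
    public static MatchVersion decode(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return MatchVersion.LUCENE_24;
        }
        String key = raw.trim();
        MatchVersion fromMap = BY_NAME.get(key);
        if (fromMap != null) {
            return fromMap;
        }
        try {
            // Enum-name form, e.g. "LUCENE_29" (the valueOf route mentioned for 3.0).
            return MatchVersion.valueOf(key.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Fail loudly instead of silently picking a behavior for the user.
            throw new IllegalArgumentException("Unknown luceneMatchVersion: " + raw, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(null));
        System.out.println(decode("2.9"));
        System.out.println(decode("LUCENE_30"));
    }
}
```

Note that the silent fallback to LUCENE_24 for a missing value is a small instance of the "action at a distance" concern raised in the comment: nothing in the configuration tells the schema.xml owner which behavior they ended up with.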