[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802973#action_12802973 ]
Hoss Man commented on SOLR-1677:
--------------------------------

bq. I think I am slightly offended with some of your statements about 'subjective opinion of the Lucene Community' and 'they should do relevancy testing which use some language-specific stemmer whose behavior changed in a small but significant way'.

That was not at all my intention, I'm sorry about that. I was in fact trying to speak entirely in generalities and theoretical examples.

The point I was trying to make is that the types of bug fixes we make in Lucene are not mathematical absolutes -- we're not fixing bugs where 1+1=3. Even if everyone on java-dev and java-user agrees that behavior A is broken and behavior B is correct, that is still (to me) a subjective opinion -- 1000 men's trash may be one man's treasure, and there could be users out there who have come to expect/rely on behavior A.

I tried to use a stemmer as an example because it's the type of class where making behavior more correct (ie: making the stemming match the semantics of the language more accurately) doesn't necessarily improve the perceived behavior for all users -- someone could be very happy with the "sloppy stemming" in the 3.1 version of a (hypothetical) EsperantoStemmer because it gives him really "loose" matches. And if you (or anyone else) put in a lot of hard work making that stemmer "better" by all conceivable metrics in 3.4, then I've got no problem telling that person "Sorry dude, if you don't want those fixes don't upgrade, or here are some other suggestions for getting 'loose' matching on that field."

My concern is that there may be people who don't even realize they are depending on behavior like this. Without an easy way for users to understand which objects have improved/fixed behavior between luceneMatchVersion=X and luceneMatchVersion=Y, they won't know the full list of things they should be considering/testing when they do change luceneMatchVersion.

bq. I'm also not that worried that users won't know what changed - they will just know that they are in the same boat as those downloading Lucene latest greatest for the first time.

But that's not true: a person downloading for the first time won't have any preconceived expectations of how something will behave. That's a very different boat from a person upgrading, who is going to expect things that were working to keep working -- those things may have actually been bugs in earlier versions, but if they _seemed_ to be working for their use cases, it's going to feel like it's broken when the behavior changes.

For a user who is consciously upgrading I'm ok with that. But when there is no easy way of knowing what behavior will change as a result of setting luceneMatchVersion=X, that doesn't feel fair to the user.

Robert mentioned in an earlier comment that StopFilter's position increment behavior changes depending on the luceneMatchVersion -- what if an existing Solr 1.3 user notices a bug in some Tokenizer, and adds {{<luceneMatchVersion>3.0</luceneMatchVersion>}} to his schema.xml to fix it? Without clear documentation on _everything_ that is affected when doing that, he may not realize that StopFilter changed at all -- and even though the position increment behavior may now be more correct, it might drastically change the results he gets when using dismax with a particular qs or ps value.

Hence my point that this becomes a serious documentation concern: finding a way to make it clear to users what they need to consider when modifying luceneMatchVersion.

bq. I'm still all for allowing Version per component for experts use. But man, I wouldn't want to be in the boat, managing all my components as they mimic various bugs/bad behavior for various components.

But if the example configs only show a global setting that isn't directly "linked" to any of the individual object configurations, then normal users won't have any idea which objects could have/use individual luceneMatchVersion settings anyway (even if they wanted to manage it piecemeal).

Like I said: I've come around to the idea of having/advocating a global value. Once I got past my mistaken thinking of "Version" as controlling "alternate versions" (as Miller very clearly put it) I started to understand what you are all saying and I agree with you: a single global value is a good idea. My concern is just how to document things so that people don't get confused when they do need to change it.

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards
> compatibility with old indexes created using older versions of Lucene. The
> most important example is StandardTokenizer, which changed its behaviour with
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with
> much more Unicode support, almost every Tokenizer/TokenFilter needs this
> Version parameter. In 2.9, the deprecated old ctors without Version take
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently
> contains a helper map to decode the version strings, but in 3.0 it can be
> replaced by Version.valueOf(String), as Version is a Java 5 enum there.
> The default value is Version.LUCENE_24 (as this is the default for the
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed
> to match Lucene 3.0.
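
For readers following the issue summary, the version-string decoding it describes could look roughly like the sketch below. This is an illustration only, not the code from SOLR-1677.patch: the nested Version enum is a stand-in for org.apache.lucene.util.Version (with only a few illustrative constants), the class and method names are hypothetical, and accepting the dotted form "3.0" is an assumption based on the schema.xml example earlier in this comment.

```java
import java.util.Locale;

// Hypothetical sketch; not taken from the SOLR-1677 patch.
final class MatchVersionSketch {

    // Stand-in for org.apache.lucene.util.Version (assumption: only these constants).
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Decode a configured luceneMatchVersion string.
    // A missing or blank value falls back to LUCENE_24, the default named in the issue summary.
    static Version parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Version.LUCENE_24;
        }
        String s = raw.trim().toUpperCase(Locale.ROOT);
        // Assumed convenience: map a dotted form such as "3.0" to "LUCENE_30".
        if (s.matches("\\d+\\.\\d+")) {
            s = "LUCENE_" + s.replace(".", "");
        }
        // Enum.valueOf throws IllegalArgumentException for unknown names.
        return Version.valueOf(s);
    }

    public static void main(String[] args) {
        System.out.println(parse(null));
        System.out.println(parse("3.0"));
        System.out.println(parse("LUCENE_29"));
    }
}
```

Because an unknown value fails loudly instead of silently falling back, a typo in the configured version would surface at load time rather than as a quiet change in analysis behavior.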