[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Hoss Man (JIRA) Wed, 20 Jan 2010 12:27:17 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802973#action_12802973
 ]


Hoss Man commented on SOLR-1677:
--------------------------------


bq. I think I am slightly offended with some of your statements about 
'subjective opinion of the Lucene Community' and 'they should do relevancy 
testing which use some language-specific stemmer whose behavior changed in a 
small but significant way'.

That was not at all my intention, i'm sorry about that.  I was in fact trying 
to speak entirely in generalities and theoretical examples.

The point I was trying to make is that the types of bug fixes we make in Lucene 
are not mathematical absolutes -- we're not fixing bugs where 1+1=3.  Even if 
everyone on java-dev and java-user agrees that behavior A is broken and 
behavior B is correct, that is still (to me) a subjective opinion -- 1000 men's 
trash may be one man's treasure, and there could be users out there who have 
come to expect/rely on that behavior A.

I tried to use a stemmer as an example because it's the type of class where 
making behavior more correct (ie: making the stemming match the semantics of 
the language more accurately) doesn't necessarily improve the perceived 
behavior for all users -- someone could be very happy with the "sloppy 
stemming" in the 3.1 version of a (hypothetical) EsperantoStemmer because it 
gives him really "loose" matches.  And if you (or anyone else) put in a lot of 
hard work making that stemmer "better" by all conceivable metrics in 3.4, then 
i've got no problem telling that person "Sorry dude, if you don't want those 
fixes don't upgrade, or here are some other suggestions for getting 'loose' 
matching on that field."

My concern is that there may be people who don't even realize they are 
depending on behavior like this.  Without an easy way for users to understand 
what objects have improved/fixed behavior between luceneMatchVersion=X and 
luceneMatchVersion=Y they won't know the full list of things they should be 
considering/testing when they do change luceneMatchVersion.

bq. I'm also not that worried that users won't know what changed - they will 
just know that they are in the same boat as those downloading Lucene latest 
greatest for the first time.

But that's not true:  a person downloading for the first time won't have any 
preconceived expectations of how something will behave; that's a very 
different boat from a person upgrading, who is going to expect things that were 
working to keep working -- those things may have actually been bugs in earlier 
versions, but if they _seemed_ to be working for their use cases, it's going to 
feel like it's broken when the behavior changes.  For a user who is consciously 
upgrading i'm ok with that.  But when there is no easy way of knowing what 
behavior will change as a result of setting luceneMatchVersion=X, that doesn't 
feel fair to the user.

Robert mentioned in an earlier comment that StopFilter's position increment 
behavior changes depending on the luceneMatchVersion -- what if an existing 
Solr 1.3 user notices a bug in some Tokenizer, and adds 
{{<luceneMatchVersion>3.0</luceneMatchVersion>}} to his schema.xml to fix it?  
Without clear documentation of _everything_ that is affected when doing that, he 
may not realize that StopFilter changed at all -- and even though the position 
increment behavior may now be more correct, it might drastically change the 
results he gets when using dismax with a particular qs or ps value.  Hence my 
point that this becomes a serious documentation concern: finding a way to make 
it clear to users what they need to consider when modifying luceneMatchVersion.
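
To make the StopFilter example concrete, here is a minimal, self-contained sketch of how a change in position increment handling can shift token positions. It does not use the Lucene API: the class {{StopPositionsSketch}}, the method {{positions}} and the flag {{keepGaps}} are hypothetical names, and the assumption that the version-dependent change is "removed stop words leave a gap in the positions" is mine for illustration, not something stated in this issue.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a stop filter; NOT the Lucene StopFilter class.
public class StopPositionsSketch {

    // Maps each surviving token to its position in the stream.
    // keepGaps=true  : a removed stop word still consumes a position.
    // keepGaps=false : surviving tokens are packed next to each other.
    // (Repeated tokens would overwrite each other; fine for this sketch.)
    public static Map<String, Integer> positions(
            List<String> tokens, Set<String> stopWords, boolean keepGaps) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int position = -1;
        int pendingGap = 0;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                if (keepGaps) {
                    pendingGap++;
                }
                continue;
            }
            position += 1 + pendingGap;
            pendingGap = 0;
            out.put(token, position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("man", "of", "the", "year");
        Set<String> stops = Set.of("of", "the");
        System.out.println(positions(tokens, stops, false)); // {man=0, year=1}
        System.out.println(positions(tokens, stops, true));  // {man=0, year=3}
    }
}
```

Under these assumptions "man" and "year" are adjacent in one mode and two positions further apart in the other, so a phrase match that used to work with no slop would now need a larger slop value, which is the kind of surprise a qs or ps setting could expose.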

bq. I'm still all for allowing Version per component for experts use. But man, 
I wouldn't want to be in the boat, managing all my components as they mimic 
various bugs/bad behavior for various components.

But if the example configs only show a global setting that isn't directly 
"linked" to any of the individual object configurations, then normal users 
won't have any idea what could have/use individual luceneMatchVersion settings 
anyway (even if they wanted to manage it piecemeal).

Like i said: i've come around to the idea of having/advocating a global value.  
Once i got past my mistaken thinking of "Version" as controlling "alternate 
versions" (as miller very clearly put it) I started to understand what you are 
all saying and i agree with you: a single global value is a good idea.

My concern is just how to document things so that people don't get confused 
when they do need to change it.


> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.
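
The version decoding described in the issue summary above could be sketched roughly as follows. This is not the code from the attached patch: {{MatchVersionSketch}} and its nested {{MatchVersion}} enum are hypothetical stand-ins for o.a.lucene.util.Version, and the accepted string format (enum constant names, case-insensitive) is an assumption. Only the two decoding approaches (helper map versus valueOf) and the LUCENE_24 default come from the description.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; MatchVersion stands in for o.a.lucene.util.Version.
public class MatchVersionSketch {

    public enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Helper map from version string to constant (the "2.9 style" approach).
    private static final Map<String, MatchVersion> BY_NAME = new HashMap<>();
    static {
        for (MatchVersion v : MatchVersion.values()) {
            BY_NAME.put(v.name(), v);
        }
    }

    // Decode via the helper map; a missing setting falls back to LUCENE_24.
    public static MatchVersion decodeWithMap(String configured) {
        if (configured == null) {
            return MatchVersion.LUCENE_24;
        }
        MatchVersion v = BY_NAME.get(configured.trim().toUpperCase(Locale.ROOT));
        if (v == null) {
            throw new IllegalArgumentException("Unknown version: " + configured);
        }
        return v;
    }

    // Decode via the enum's own valueOf (the "3.0 style" approach).
    public static MatchVersion decodeWithValueOf(String configured) {
        if (configured == null) {
            return MatchVersion.LUCENE_24;
        }
        return MatchVersion.valueOf(configured.trim().toUpperCase(Locale.ROOT));
    }
}
```

Both methods reject unknown strings with an IllegalArgumentException, so a typo in the configured value fails loudly instead of silently falling back to the default.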

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
