[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Hoss Man (JIRA) Tue, 26 Jan 2010 12:04:57 -0800

    [ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805167#action_12805167 ]


Hoss Man commented on SOLR-1677:
--------------------------------

bq. And here are the JIRA issues for stemming bugs, since you didn't take my 
hint to go and actually read them.

sigh.  I read both those issues when you filed them, and I agreed with your 
assessment that they are bugs we should fix -- if I had thought you were wrong 
I would have said so in the issue comments.

But that doesn't change the fact that sometimes people depend on buggy behavior 
-- and sometimes those people depend on the buggy behavior without even 
realizing it.  Bug fixes in a stemmer might make it more correct according to 
the stemmer algorithm specification, or the language semantics, but in some 
peculiar use cases an application might find the "correct" implementation less 
useful than the previous buggy version.

This is one reason why things like CHANGES.txt are important: to draw attention 
to what has changed between two versions of a piece of software, so people can 
make informed decisions about what they should test in their own applications 
when they upgrade things under the covers.  luceneMatchVersion should be no 
different.  We should try to find a simple way to inform people "when you 
switch from luceneMatchVersion=X to luceneMatchVersion=Y here are the bug fixes 
you will get" so they know what to test to determine if they are adversely 
affected by that bug fix in some way (and find their own workaround).
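
As a rough sketch of what such a "what changes between X and Y" listing could look like: the class, the method, and the version each entry is filed under are all assumptions made for illustration, and the enum is only a stand-in for o.a.lucene.util.Version. A real list would have to be curated from CHANGES.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class MatchVersionChangesSketch {
    // Stand-in for org.apache.lucene.util.Version, so the sketch has no dependencies.
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Placeholder entries: both the wording and the version they are attached to
    // are assumptions, not an authoritative changelog.
    static final Map<Version, List<String>> CHANGES = new EnumMap<>(Version.class);
    static {
        CHANGES.put(Version.LUCENE_29, Arrays.asList(
            "StandardTokenizer: posIncr and host token type behaviour",
            "StopFilter: position increment behaviour"));
    }

    // Collects every change introduced after 'from', up to and including 'to'.
    static List<String> changesBetween(Version from, Version to) {
        List<String> out = new ArrayList<>();
        for (Version v : Version.values()) {
            if (v.compareTo(from) > 0 && v.compareTo(to) <= 0) {
                out.addAll(CHANGES.getOrDefault(v, Collections.<String>emptyList()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Things a user should re-test when moving from 2.4 to 3.0 semantics.
        for (String change : changesBetween(Version.LUCENE_24, Version.LUCENE_30)) {
            System.out.println(change);
        }
    }
}
```

Something this simple would be enough to print a "things to re-test" list at startup or in the docs, whatever form the final documentation takes.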

bq. Perhaps you should come up with a better example than stemming, as you 
don't know what you are talking about.

1) It's true, I frequently don't know what I'm talking about ... this issue was 
a prime example, and I thank you, Uwe, and Miller for helping me realize that I 
was completely wrong in my understanding about the intended purpose of 
o.a.l.Version, and that a global setting for it in Solr makes total sense -- 
but that doesn't make my concerns about documenting the effects of that global 
setting any less valid.

2) Perhaps you should read the StopFilter example I already posted in my last 
comment...

{quote}
bq. Robert mentioned in an earlier comment that StopFilter's position increment 
behavior changes depending on the luceneMatchVersion -- what if an existing 
Solr 1.3 user notices a bug in some Tokenizer, and adds 
{{<luceneMatchVersion>3.0</luceneMatchVersion>}} to his schema.xml to fix it.  
Without clear documentation of _everything_ that is affected when doing that, he 
may not realize that StopFilter changed at all -- and even though the position 
increment behavior may now be more correct, it might drastically change the 
results he gets when using dismax with a particular qs or ps value.  Hence my 
point that this becomes a serious documentation concern: finding a way to make 
it clear to users what they need to consider when modifying luceneMatchVersion.
{quote}
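
To make the StopFilter concern concrete, here is a minimal simulation of how token positions differ depending on whether removed stop words leave a gap. It does not use Lucene; the stop word list, class name and method are hypothetical, and it only illustrates why the distance between surviving terms (which slop settings such as qs and ps are measured against) can change.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class StopPositionsSketch {
    // Hypothetical stop word list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "and", "of"));

    // Returns the position assigned to each surviving token.
    // keepGaps=true : a removed stop word still advances the position counter.
    // keepGaps=false: surviving tokens are numbered consecutively.
    static Map<String, Integer> positions(String text, boolean keepGaps) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int position = -1;
        int pendingGap = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(token)) {
                if (keepGaps) {
                    pendingGap++;
                }
                continue;
            }
            position += 1 + pendingGap;
            pendingGap = 0;
            result.put(token, position);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "the quick and the dead";
        System.out.println(positions(text, false)); // {quick=0, dead=1}
        System.out.println(positions(text, true));  // {quick=1, dead=4}
    }
}
```

With gaps preserved, "quick" and "dead" sit three positions apart instead of one, so a phrase or slop setting that matched before the switch may stop matching, or the reverse.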

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing TokenStreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.
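
The version-string decoding described in the issue summary above might look roughly like the following. This is a sketch and not the code from the attached patch: the enum is a local stand-in for o.a.lucene.util.Version, and the accepted input forms ("3.0" as well as "LUCENE_30") are assumptions.

```java
import java.util.Locale;

class MatchVersionSketch {
    // Stand-in for org.apache.lucene.util.Version, so the sketch has no dependencies.
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // The issue summary names LUCENE_24 as the default when nothing is configured.
    static final Version DEFAULT = Version.LUCENE_24;

    // Decodes a configured luceneMatchVersion string.
    // Assumed accepted forms: "3.0", "LUCENE_30", "lucene_30".
    static Version parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT;
        }
        String s = raw.trim().toUpperCase(Locale.ROOT);
        if (s.matches("\\d+\\.\\d+")) {
            // "3.0" -> "LUCENE_30"
            s = "LUCENE_" + s.replace(".", "");
        }
        // Throws IllegalArgumentException for an unknown version string.
        return Version.valueOf(s);
    }

    public static void main(String[] args) {
        System.out.println(parse("3.0"));       // LUCENE_30
        System.out.println(parse("LUCENE_29")); // LUCENE_29
        System.out.println(parse(null));        // LUCENE_24
    }
}
```

Relying on the enum's own valueOf keeps the list of known versions in one place, which is the simplification the summary anticipates for 3.0.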

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
