[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Hoss Man (JIRA) Tue, 05 Jan 2010 13:28:17 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796854#action_12796854
 ]


Hoss Man commented on SOLR-1677:
--------------------------------

bq. User Carl isn't helpful, user Carl is an idiot.

Oh come on now ... that's not really a fair criticism of the example: there are 
plenty of legitimate ways to use (some) TokenFilters only at search time and I 
specifically structured my example to point out potential problems in cases 
just like that -- Carl was very clear that "if you used FooTokenFilterFactory 
in an index analyzer you'll need to reindex."


But fine, I'll amend my example to do it your way...


{panel}
...
Bob Asks his question (see previous example)

User Carl is on vacation and never sees Bob's email

User Dwight helpfully replies...

bq. That was identified as a bug with FooTokenFilter that was fixed in Lucene 
3.1, but the default behavior was left as is for backcompatibility. If you 
change your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get 
the newer/better behavior - but you _must_ reindex all of your data after you 
make this change.

Bob makes the change to 3.2 that Dwight recommended, reindexes all of his data, 
and is happy to see that his queries now work and everything seems fine.

What Bob doesn't realize (and what Dwight wasn't aware of) is that elsewhere in 
his schema.xml file, Bob is also using the YakTokenizerFactory on a different 
field (yakField), and the behavior of the YakTokenizer changed in Lucene 3.0.  
This change is generally considered "better" behavior than YakTokenizer had 
before, but in combination with another TokenFilter Bob is using on the 
yakField it causes behavior that is not what Bob wants.  Now some types of 
queries that use the yakField are failing, and *failing silently*.

{panel}

You could now argue that User Dwight is an idiot because he didn't warn Bob 
that other Analyzers/Tokenizers/TokenFilters might be affected.  But that just 
leads us to scenarios that reiterate my point that this type of global value 
is something that would be dangerous to ever change....

{panel}
...
Bob Asks his question (see previous examples)

User Carl has unsubscribed from the solr-user list (because a Bill Murray 
look-a-like hurt his feelings) and never sees Bob's email.

User Dwight is on vacation and never sees Bob's email.

User Ernest helpfully replies...

{quote}
That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, 
but the default behavior was left as is for backcompatibility. If you change 
your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get the 
newer/better behavior -- *But this is Very VERY Dangerous:* It could potentially 
affect the behavior of other analyzers you are using.  You need to check the 
javadocs for each and every Analyzer, Tokenizer, and TokenFilter you use to see 
what their behavior is with various values of the Version property before you 
make a change like this.

Personally I never change the value of <luceneAnalyzerVersionDefault/> once I 
have an existing schema.xml file.  Instead I suggest you add 
{{luceneVersion="3.2"}} to your 
{{<filter class="solr.FooTokenFilterFactory" />}} declaration so that you know 
you are only changing the behavior you want to change.

BTW: You _must_ reindex all of your data after doing either of these things in 
order for it to work.
{quote}

Bob follows Ernest's advice, and everything is fine ... but Bob is left 
wondering what the point is of a config option that's so dangerous to change, 
and wishes there was an easy way to know which of his Analyzers and Factories 
are depending on that scary "global" value.

{panel}
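
To make Ernest's suggestion concrete, here is a rough sketch of what Bob's 
schema.xml might look like afterwards.  Treat all of it as assumption: the 
{{luceneVersion}} attribute and the {{<luceneAnalyzerVersionDefault/>}} element 
are just the names used in the examples above (the way the element carries its 
value is a guess), the field type names are invented, and the Foo/Yak factories 
are made up.

{code:xml}
<!-- Hypothetical schema.xml fragment, illustrative only. -->

<!-- Global default: left alone once the schema exists (value syntax assumed) -->
<luceneAnalyzerVersionDefault>2.9</luceneAnalyzerVersionDefault>

<fieldType name="fooText" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Only this filter opts in to the fixed 3.2 behavior -->
    <filter class="solr.FooTokenFilterFactory" luceneVersion="3.2"/>
  </analyzer>
</fieldType>

<fieldType name="yakText" class="solr.TextField">
  <analyzer>
    <!-- No luceneVersion here, so this keeps using the global default
         and yakField behaves exactly as it did before -->
    <tokenizer class="solr.YakTokenizerFactory"/>
  </analyzer>
</fieldType>
{code}

With the override on the one filter, the only thing Bob has to reason about 
(and reindex for) is the field that actually uses FooTokenFilterFactory.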

At the end of the day it just seems like a bigger risk than a feature ... I 
feel like I must still be misunderstanding the motivation you guys have for 
adding it, because it really seems like it boils down to "easier than having 
the property 2.9 set on every analyzer/factory".

I guess I ultimately have no strong objection to a global schema.xml setting 
like this existing as an expert level feature (for people who want really 
compact config files I guess), I just don't want to see it used in the example 
schema.xml file(s) where it's likely to screw novice users over.



> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
