[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Hoss Man (JIRA) Mon, 11 Jan 2010 15:13:17 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798916#action_12798916
 ]


Hoss Man commented on SOLR-1677:
--------------------------------

bq. I don't think Version is intended so you can use X.Y on this part and Y.Z 
on this part and have any chance of anything working, for example it controls 
position increments on stopfilter but also in queryparser, if you use wacky 
combinations, things might not work.

How is that any different from letting users pass any Analyzer they want to the 
QueryParser constructor?  There's no guarantee that anything will ever work if 
you do something crazy (like uppercase all terms when indexing, and lowercase 
all terms when searching), but Lucene exposes that to the developer and lets 
them make the choice -- likewise Solr happily lets you configure a query 
analyzer that's completely different from your index analyzer -- if that's what 
you want, that's what you get: being able to set different Version params 
should be no different.  If the QueryParser you are using says that version=X.Y 
will only work with StopFilter if it's version=X.Y as well that's fine -- but 
maybe you've solved that problem a completely different way with a completely 
alternate implementation of StopFilter (that doesn't care about version).  The 
user should be in control.

bq. sometimes things interact in ways we cannot detect automatically

which is why i think it's a bad idea to have a global default for this ... 
there may be situations where people explicitly want different behavior in 
different instances (ie: in this field i want the legacy 2.4 StopFilter 
behavior, but in this field i want the current 2.9 stop filter behavior) and 
having a default will mask the ability to do this, and make it easy to 
inadvertently break it.

bq. its my understanding that things like this are why Version was created in 
the first place.

My understanding is vastly different than yours ... All the discussions i 
remember about it were along the lines of preventing Class proliferation -- 
that people didn't like the idea of creating StandardAnalyzer2 just because 
StandardAnalyzer had some behavior that was considered buggy but couldn't be 
removed -- so now there is a constructor arg instead, and static constants that 
let you pick a fixed behavior, or a constant that lets you pick "current" no 
matter what it is -- so applications that always want the "current recommended 
behavior" can just upgrade a jar and get it.
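
A minimal sketch of the "constructor arg instead of StandardAnalyzer2" idea 
described above. Everything in it is hypothetical: {{MatchVersion}} stands in 
for o.a.lucene.util.Version, {{SketchStopFilter}} is not Lucene's StopFilter, 
and the choice of V_29 as the cut-over point is an assumption made only for 
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for o.a.lucene.util.Version; not the real class.
enum MatchVersion {
    V_24, V_29, CURRENT;

    // True if this version is the same as, or newer than, the other one.
    boolean onOrAfter(MatchVersion other) {
        return compareTo(other) >= 0;
    }
}

// Hypothetical filter: one class, two behaviors selected by a constructor arg.
class SketchStopFilter {
    private final boolean keepPositionGaps;
    private final Set<String> stopWords;

    SketchStopFilter(MatchVersion matchVersion, Set<String> stopWords) {
        // Assumption for illustration: the newer behavior starts at V_29.
        this.keepPositionGaps = matchVersion.onOrAfter(MatchVersion.V_29);
        this.stopWords = stopWords;
    }

    // Returns the position assigned to each token that survives filtering.
    List<Integer> positions(List<String> tokens) {
        List<Integer> out = new ArrayList<>();
        int pos = 0;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                if (keepPositionGaps) {
                    pos++; // leave a gap where the stop word was removed
                }
                continue;
            }
            out.add(pos++);
        }
        return out;
    }
}

class VersionArgDemo {
    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the"));
        List<String> tokens = Arrays.asList("quick", "the", "fox");
        // Two instances with different versions can coexist in one application.
        System.out.println(new SketchStopFilter(MatchVersion.V_24, stops).positions(tokens));
        System.out.println(new SketchStopFilter(MatchVersion.CURRENT, stops).positions(tokens));
    }
}
```

Each instance carries its own version, so nothing in this sketch forces two 
filters in the same application to agree with each other.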

But I don't remember any implication that it was expected that every object 
would have the same Version settings as every other object -- if that was the 
intention then shouldn't there be a standard interface for "Versionable" or 
"VersionAware" objects so they can test compatibility with one another (ie: 
QueryParser and Analyzers that might wrap StopFilter) ? ... or a {{public 
static void setCurrentOperatingVersion(Version)}} method in the Version class, 
instead of letting each constructor take in an independent value?

----

FWIW: Even though I'm still convinced that having any sort of "global" default 
value for luceneMatchVersion is a bad idea -- and i'm going to keep trying to 
convince other people as well -- I want to make some comments about how i think 
it should be implemented if we do wind up doing it (just in case i get hit by a 
bus)

Making the Base*Factory analysis classes SolrCoreAware is really overkill for 
this -- there was a real conscious choice not to let things declared in 
schema.xml be SolrCoreAware, because it pulls back the curtain and exposes a 
lot of plumbing related APIs in a way that could make it hard to refactor away 
SolrCore functionality later.  The list of plugin types that can be made 
SolrCoreAware is deliberately small, and confined to plugins that are already 
exposed to the full SolrCore API at some other time in their life cycle -- 
being SolrCoreAware just gives them access to the core during initialization.

If there is really going to be one uber-default global "luceneMatchVersion" 
then i think the place it makes the most sense to declare something like this 
is in the schema.xml -- many different solrconfig.xml files might be used with 
the same schema.xml, so if we're expecting that the "typical" behavior is to 
set this once and have it just work it should propagate from the IndexSchema 
object to the SolrCore and not vice-versa.

My suggestion for how to implement this would be...

# Add a new "luceneMatchVersion" attribute to the existing <schema/> tag.
# Add a new getLuceneMatchVersion() to the IndexSchema class ... SolrCore can 
use this to get the default.
# When init()ing new objects, include the key=>value pair of 
{{"luceneMatchVersion"=>schema.getLuceneMatchVersion()}} to the init method of 
the object if it's not already an init param for that particular instance.
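
The third step above could be sketched roughly as follows. {{SketchSchema}} 
and {{withDefaultVersion}} are hypothetical names used only for illustration; 
they are not part of IndexSchema or of the attached patch, and the version is 
kept as a plain String to avoid assuming anything about how it is parsed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for IndexSchema, holding the schema-wide default.
class SketchSchema {
    private final String luceneMatchVersion;

    SketchSchema(String luceneMatchVersion) {
        this.luceneMatchVersion = luceneMatchVersion;
    }

    String getLuceneMatchVersion() {
        return luceneMatchVersion;
    }

    // Copies the init args and adds the schema default only when the
    // instance did not declare its own luceneMatchVersion.
    Map<String, String> withDefaultVersion(Map<String, String> initArgs) {
        Map<String, String> args = new HashMap<>(initArgs);
        if (!args.containsKey("luceneMatchVersion")) {
            args.put("luceneMatchVersion", luceneMatchVersion);
        }
        return args;
    }
}

class SchemaDefaultDemo {
    public static void main(String[] unused) {
        SketchSchema schema = new SketchSchema("LUCENE_29");

        Map<String, String> plain = new HashMap<>();
        plain.put("ignoreCase", "true");

        Map<String, String> pinned = new HashMap<>();
        pinned.put("luceneMatchVersion", "LUCENE_24");

        // The first factory inherits the default, the second keeps its own value.
        System.out.println(schema.withDefaultVersion(plain));
        System.out.println(schema.withDefaultVersion(pinned));
    }
}
```

Because an explicit per-instance value always wins, a field type can still 
pin a legacy version while the rest of the schema uses the default.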

This would eliminate the need to make any of the Analysis Factories 
SolrCoreAware (or even ResourceLoaderAware) just to know what the 
luceneMatchVersion should be -- the Base*Factories could still contain a 
{{protected Version luceneMatchVersion}} set by the base init() method that 
subclasses could use as needed.
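
A rough sketch of what such a base class might look like, assuming the 
version arrives as an ordinary init arg. {{SketchVersion}}, 
{{SketchBaseFactory}} and {{SketchStopFactory}} are hypothetical names, not 
the real Solr or Lucene classes; the LUCENE_24 fallback mirrors the default 
named in the issue description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for o.a.lucene.util.Version.
enum SketchVersion { LUCENE_24, LUCENE_29 }

// Hypothetical base class; not Solr's actual BaseTokenFilterFactory.
abstract class SketchBaseFactory {
    protected SketchVersion luceneMatchVersion;

    // Reads the version from the init args, with no reference to SolrCore.
    public void init(Map<String, String> args) {
        String value = args.get("luceneMatchVersion");
        // Assumption: fall back to LUCENE_24 when nothing was supplied.
        luceneMatchVersion = (value == null)
                ? SketchVersion.LUCENE_24
                : SketchVersion.valueOf(value);
    }
}

// A subclass simply uses the protected field when it builds its stream.
class SketchStopFactory extends SketchBaseFactory {
    String describe() {
        return "StopFilter behaving as " + luceneMatchVersion;
    }
}

class BaseFactoryDemo {
    public static void main(String[] unused) {
        Map<String, String> args = new HashMap<>();
        args.put("luceneMatchVersion", "LUCENE_29");

        SketchStopFactory factory = new SketchStopFactory();
        factory.init(args);
        System.out.println(factory.describe());
    }
}
```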

NOTE: This still doesn't solve the "Analyzers must have no-arg 
constructors" part of the issue -- but it doesn't make it worse.  We can make 
IndexSchema pass this.getLuceneMatchVersion() to any Analyzer with a single arg 
"Version" constructor fairly easily.  If/When we provide a more general 
mechanism for passing constructor args to Analyzers, any Version params could 
be defaulted just like with the factory init() methods.
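
One way the single-arg constructor lookup could work is plain reflection, 
sketched below. {{SketchAnalyzerLoader}}, {{SketchMatchVersion}} and the two 
analyzer classes are hypothetical placeholders; this is not how IndexSchema 
is known to instantiate Analyzers, only an illustration of the fallback order.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for o.a.lucene.util.Version.
enum SketchMatchVersion { LUCENE_24, LUCENE_29 }

class SketchAnalyzerLoader {
    // Prefers a public single-arg (version) constructor and falls back to
    // the public no-arg constructor when the class does not declare one.
    static Object newAnalyzer(Class<?> clazz, SketchMatchVersion version) {
        try {
            try {
                Constructor<?> ctor = clazz.getConstructor(SketchMatchVersion.class);
                return ctor.newInstance(version);
            } catch (NoSuchMethodException e) {
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + clazz.getName(), e);
        }
    }
}

// Hypothetical analyzer that accepts a version.
class VersionedAnalyzer {
    final SketchMatchVersion version;

    public VersionedAnalyzer(SketchMatchVersion version) {
        this.version = version;
    }
}

// Hypothetical analyzer with only a no-arg constructor.
class PlainAnalyzer {
    public PlainAnalyzer() {
    }
}

class AnalyzerLoaderDemo {
    public static void main(String[] unused) {
        Object a = SketchAnalyzerLoader.newAnalyzer(VersionedAnalyzer.class, SketchMatchVersion.LUCENE_29);
        Object b = SketchAnalyzerLoader.newAnalyzer(PlainAnalyzer.class, SketchMatchVersion.LUCENE_29);
        System.out.println(a.getClass().getSimpleName() + " / " + b.getClass().getSimpleName());
    }
}
```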

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
