[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Robert Muir (JIRA) Tue, 26 Jan 2010 12:39:00 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805187#action_12805187
 ]


Robert Muir commented on SOLR-1677:
-----------------------------------

bq. 2) Perhaps you should read the StopFilter example i already posted in my 
last comment...

https://issues.apache.org/jira/browse/LUCENE-2094?focusedCommentId=12783932&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12783932

as far as this one goes, i specifically commented before on this not being 
'hidden' by Version (with Solr users in mind) but instead its own option that 
every user should consider, regardless of defaults.

For the stopfilter posInc the user should think it through, its pretty strange, 
like i mention in my comment, that a definite article like 'the' gets a posInc 
bump in one language but not another, simply because it happens to be separated 
by a space.

I guess I could care less what the default is, if you care about such things 
you shouldn't be using the defaults and instead specifying this yourself in the 
schema, and Version has no effect. I can't really defend the whole stopfilter 
posInc thing, as again i think it doesn't make a whole lot of sense, maybe it 
works good for english I guess, I won't argue about it.


> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 is can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Reply via email to