[ 
https://issues.apache.org/jira/browse/LUCENE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Heisey updated LUCENE-6689:
---------------------------------
    Comment: was deleted

(was: The reason that phrase searches don't match after LUCENE-5111 is that the 
query analysis on my real fieldType is slightly different -- catenateWords, 
catenateNumbers, and preserveOriginal are all disabled on the query analysis.  
With those settings and the previously given input of "aaa-bbb: ccc", aaa ends 
up at position 1 and bbb at position 2, which is not the same as the index 
analysis with the settings above.)

> Odd analysis problem with WDF, appears to be triggered by preceding analysis 
> components
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6689
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.8
>            Reporter: Shawn Heisey
>
> This problem shows up for me in Solr, but I believe the issue is down at the 
> Lucene level, so I've opened the issue in the LUCENE project.  We can move it 
> if necessary.
> I've boiled the problem down to this minimum Solr fieldType:
> {noformat}
>     <fieldType name="testType" class="solr.TextField"
> sortMissingLast="true" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer
> class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="1"
>           catenateNumbers="1"
>           catenateAll="0"
>           preserveOriginal="1"
>         />
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer
> class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="0"
>           catenateNumbers="0"
>           catenateAll="0"
>           preserveOriginal="0"
>         />
>       </analyzer>
>     </fieldType>
> {noformat}
> On Solr 4.7, if this type is given the input "aaa-bbb: ccc" then aaa ends up 
> at term position 1 and bbb at term position 2.  This seems perfectly 
> reasonable to me.  In Solr 4.9, both terms end up at position 2.  This causes 
> phrase queries which used to work to return zero hits.  The exact text of the 
> phrase query is in the original documents that match on 4.7.
> If the custom rbbi (which is included unmodified from the lucene icu analysis 
> source code) is not used, then the problem doesn't happen, because the 
> punctuation doesn't make it to the PRF.  If the PatternReplaceFilterFactory 
> is not present, then the problem doesn't happen.
> I can work around the problem by setting luceneMatchVersion to 4.7, but I 
> think the behavior is a bug, and I would rather not continue to use 4.7 
> analysis when I upgrade to 5.x, which I hope to do soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to