[ https://issues.apache.org/jira/browse/LUCENE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shawn Heisey updated LUCENE-6689:
---------------------------------
    Comment: was deleted

(was: The reason that phrase searches don't match after LUCENE-5111 is that
the query analysis on my real fieldType is slightly different --
catenateWords, catenateNumbers, and preserveOriginal are all disabled on the
query analysis.  With those settings and the previously given input of
"aaa-bbb: ccc", aaa ends up at position 1 and bbb at position 2, which is not
the same as the index analysis with the settings above.)

> Odd analysis problem with WDF, appears to be triggered by preceding analysis components
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6689
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.8
>            Reporter: Shawn Heisey
>
> This problem shows up for me in Solr, but I believe the issue is down at the
> Lucene level, so I've opened the issue in the LUCENE project.  We can move it
> if necessary.
>
> I've boiled the problem down to this minimum Solr fieldType:
> {noformat}
>     <fieldType name="testType" class="solr.TextField"
>         sortMissingLast="true" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer
>         class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
>         rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="1"
>           catenateNumbers="1"
>           catenateAll="0"
>           preserveOriginal="1"
>         />
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer
>         class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
>         rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="0"
>           catenateNumbers="0"
>           catenateAll="0"
>           preserveOriginal="0"
>         />
>       </analyzer>
>     </fieldType>
> {noformat}
>
> On Solr 4.7, if this type is given the input "aaa-bbb: ccc" then aaa ends up
> at term position 1 and bbb at term position 2.  This seems perfectly
> reasonable to me.  In Solr 4.9, both terms end up at position 2.  This causes
> phrase queries which used to work to return zero hits.  The exact text of the
> phrase query is in the original documents that match on 4.7.
>
> If the custom rbbi (which is included unmodified from the lucene icu analysis
> source code) is not used, then the problem doesn't happen, because the
> punctuation doesn't make it to the PRF.  If the PatternReplaceFilterFactory
> is not present, then the problem doesn't happen.
>
> I can work around the problem by setting luceneMatchVersion to 4.7, but I
> think the behavior is a bug, and I would rather not continue to use 4.7
> analysis when I upgrade to 5.x, which I hope to do soon.
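
For readers following along, the first two analysis stages in the fieldType can be approximated in plain Java. This is only a rough sketch under stated assumptions: splitting on whitespace stands in for the ICUTokenizer with Latin-break-only-on-whitespace.rbbi, and java.util.regex stands in for PatternReplaceFilterFactory with the same pattern and replacement. The class and method names are hypothetical. It does not reproduce the term position change; it only shows the token text that reaches WordDelimiterFilterFactory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PunctStripSketch {
    // Same pattern string as the PatternReplaceFilterFactory in the fieldType.
    static final Pattern EDGE_PUNCT =
        Pattern.compile("^(\\p{Punct}*)(.*?)(\\p{Punct}*)$");

    // Hypothetical helper: keeps group 2, dropping leading and trailing punctuation.
    public static String stripEdgePunct(String token) {
        return EDGE_PUNCT.matcher(token).replaceAll("$2");
    }

    // Hypothetical helper: a whitespace split is an assumed stand-in for the
    // ICU tokenizer rules; each token then goes through the pattern replace.
    public static List<String> analyze(String input) {
        List<String> out = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            out.add(stripEdgePunct(token));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [aaa-bbb, ccc]
        System.out.println(analyze("aaa-bbb: ccc"));
    }
}
```

Under these assumptions the trailing colon is removed before the word delimiter stage, so that stage receives "aaa-bbb" and "ccc" as token text.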