Re: Antw: Re: Behaviour of punctuation marks in phrase queries

Michael Gibney Fri, 17 May 2019 09:29:34 -0700

Getting a SpanNearQuery for the "a.b." input with WDGF
(WordDelimiterGraphFilter) is expected behavior: WDGF turns that part
of the query into ("ab")|("a" "b"), i.e. a path of 1 token or a path of
2 tokens, respectively. The whitespace-separated "a. b." input is
tokenized simply as "a" "b" (2 tokens), so it sticks with the more
straightforward PhraseQuery implementation.
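
As a rough illustration of the difference between the two inputs, here
is a toy model in Python. It is only a sketch, not Lucene's actual
analysis chain: it assumes whitespace tokenization, splitting on
non-alphanumeric characters, and catenation of the parts (roughly
generateWordParts=1 plus catenateWords=1). The function names are made
up for this example.

```python
import re

def word_delimiter_paths(token):
    """Toy stand-in for WDGF: return the alternative token paths for one
    whitespace-separated token (lowercased, as LowerCaseFilter would do)."""
    parts = [p.lower() for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if len(parts) <= 1:
        # Nothing to catenate: zero or one path with a single token.
        return [parts] if parts else []
    # Two alternatives: the catenated form, or the individual parts.
    return [["".join(parts)], parts]

def analyze_query(text):
    """One entry per input token, each entry a list of alternative paths."""
    graph = (word_delimiter_paths(t) for t in text.split())
    return [paths for paths in graph if paths]

def query_kind(text):
    """A graph with alternative paths needs spans; a flat one does not."""
    graph = analyze_query(text)
    has_alternatives = any(len(paths) > 1 for paths in graph)
    return "SpanNearQuery" if has_alternatives else "PhraseQuery"

print(query_kind("Neuburg a. d. Donau"))
print(query_kind("Neuburg a.d. Donau"))
print(analyze_query("a.d."))
```

In this model only the "a.d." token produces two alternative paths,
which is what pushes the whole phrase into the span-based form.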


That said, the problem you're encountering is related to a couple of issues:
https://issues.apache.org/jira/browse/LUCENE-7398
https://issues.apache.org/jira/browse/LUCENE-4312

For this case specifically, the problem is that NearSpansOrdered
lazily returns one match per position *for the first subclause*.
Because positionLength is not indexed, the "or" clause
("ab"|"a" "b") will always return "ab" first, with an implicit
positionLength of 1. The actual positionLength of 2 that index-time
WDGF assigned to "ab" is not stored in the index, so at query time the
implicit positionLength of 1 gives the impression of a gap between
"ab" and "isar", violating the "slop=0" constraint.

Because NearSpansOrdered.nextStartPosition() always advances by
calling nextStartPosition() on the first subclause (without exploring
for variant matches in other subclauses), the top-level
NearSpansOrdered advances after one attempt at matching, and the valid
match is missed.
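
To make that concrete, here is a small Python simulation using the
tokens from the original "Moosburg a.d. Isar" query (so "ad" plays the
role of "ab" above). It is a simplified model of the behavior described
here, not a port of NearSpansOrdered: the index contents, the
(start, end) span representation, and both function names are
assumptions made for illustration.

```python
from itertools import product

def term_spans(index, term):
    """Each indexed term covers one position: implicit positionLength of 1."""
    return [(pos, pos + 1) for pos in index.get(term, [])]

def lazy_ordered_near(clauses, slop=0):
    """One attempt per span of the first clause: every later clause
    contributes only its first span that starts at or after the previous end."""
    first, rest = clauses[0], clauses[1:]
    for span in first:
        chain = [span]
        for spans in rest:
            nxt = next((s for s in spans if s[0] >= chain[-1][1]), None)
            if nxt is None:
                break
            chain.append(nxt)
        else:
            gaps = sum(b[0] - a[1] for a, b in zip(chain, chain[1:]))
            if gaps <= slop:
                return chain
    return None

def exhaustive_ordered_near(clauses, slop=0):
    """Try every combination of spans, including variants in later clauses."""
    for chain in product(*clauses):
        gaps = [b[0] - a[1] for a, b in zip(chain, chain[1:])]
        if all(g >= 0 for g in gaps) and sum(gaps) <= slop:
            return list(chain)
    return None

# Assumed positions after index-time WDGF: "ad" and "a" share position 1,
# and the positionLength of 2 for "ad" is not stored.
INDEX = {"moosburg": [0], "ad": [1], "a": [1], "d": [2], "isar": [3]}

# Inner spanNear([a, d], 0, true) collapses to a single span (1, 3).
a_d = lazy_ordered_near([term_spans(INDEX, "a"), term_spans(INDEX, "d")])
or_clause = sorted(term_spans(INDEX, "ad") + [(a_d[0][0], a_d[-1][1])])
clauses = [term_spans(INDEX, "moosburg"), or_clause, term_spans(INDEX, "isar")]

print(lazy_ordered_near(clauses))        # "ad" is tried first, leaving a gap
print(exhaustive_ordered_near(clauses))  # the "a" "d" variant closes the gap
print(lazy_ordered_near(clauses, slop=1))
```

In this model the lazy variant gives up after the "ad" attempt, while
the exhaustive variant finds the "a" "d" path. Allowing slop=1 lets the
"ad" attempt through, which is the intuition behind the slop>0
mitigation.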

Pending fixes to address the underlying issue (there is a candidate
patch for LUCENE-7398 that incorporates a workaround for LUCENE-4312),
you could mitigate the problem to some extent in one of two ways. You
could force slop>0, which as of 7.6 will be expanded into
MultiPhraseQuery (see
https://issues.apache.org/jira/browse/LUCENE-8531). Or you could set
preserveOriginal=true on both index-time and query-time WDGF and
upgrade to 8.1, which would prevent the extreme case of an *exact*
character-for-character matching query turning up no results (see
https://issues.apache.org/jira/browse/LUCENE-8730).
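
For the slop>0 option, a request could be built along these lines. This
is only a sketch: the base URL and core name are hypothetical
placeholders, the helper name is made up, and the field name is taken
from the debug output quoted below. The "~1" suffix is the standard
query parser's phrase slop syntax, and debug=query is the parameter
Erick suggested.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical Solr base URL and core name; adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"

def phrase_query_url(field, phrase, slop=0, base=SOLR_SELECT):
    """Build a select URL for a phrase query, optionally with slop.

    The phrase is assumed not to contain double quotes; no escaping is done.
    """
    q = f'{field}:"{phrase}"'
    if slop > 0:
        q += f"~{slop}"  # e.g. "Moosburg a.d. Isar"~1
    return base + "?" + urlencode({"q": q, "debug": "query"})

url = phrase_query_url("all_places_txt", "Moosburg a.d. Isar", slop=1)
params = parse_qs(urlsplit(url).query)  # decode again to inspect the parameters
print(url)
print(params["q"][0])
```

Comparing the parsedquery in the debug output with and without the slop
should show whether the query is still being built as a SpanNearQuery.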

On Fri, May 17, 2019 at 11:47 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> I’ll leave that explanation to someone who understands query parsers ;)
>
> > On May 17, 2019, at 7:57 AM, Doris Peter <doris.pe...@bsb-muenchen.de> 
> > wrote:
> >
> > Thanks a lot! I tried the debug parameter, which shows interesting 
> > differences:
> >
> > "debug": {
> >
> >    "rawquerystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> >    "querystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> >    "parsedquery": "PhraseQuery(all_places_txt:\"neuburg a d donau\")",
> >    "parsedquery_toString": "all_places_txt:\"neuburg a d donau\"",
> >    "QParser": "LuceneQParser"
> > }
> >
> > "debug": {
> >        "rawquerystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> >        "querystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> >        "parsedquery": "SpanNearQuery(spanNear([all_places_txt:neuburg, 
> > spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 
> > 0, true)]), all_places_txt:donau], 0, true))",
> >        "parsedquery_toString": "spanNear([all_places_txt:neuburg, 
> > spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 
> > 0, true)]), all_places_txt:donau], 0, true)",
> >        "QParser": "LuceneQParser"
> >    }
> >
> >
> > Something seems to go wrong here, as the parsedquery contains the 
> > SpanNearQuery instead of a PhraseQuery.
> >
> >>>> Erick Erickson <erickerick...@gmail.com> 5/17/2019 4:27 PM >>>
> > Three things:
> >
> > 1> WordDelimiterGraphFilterFactory requires FlattenGraphFilterFactory after 
> > it in the index config
> >
> > 2> It is usually unnecessary to have the exact same parameters at both 
> > query and index time for WDGFF. If you’ve split parts up at index time and 
> > also mashed them all back together, you usually only need to split them up 
> > at query time.
> >
> > 3> Try adding &debug=query to the query and see what the results show for 
> > the parsed query. That usually gives you a clue what is really happening 
> > vs. what you think is happening.
> >
> > Best,
> > Erick
> >
> >> On May 17, 2019, at 12:59 AM, Doris Peter <doris.pe...@bsb-muenchen.de> 
> >> wrote:
> >>
> >> Hello,
> >>
> >> We use Solr 7.6.0 to build our index, and I have a question about
> >> phrase queries:
> >>
> >> We use the following configuration in schema.xml:
> >>
> >>   <!-- Text Standard -->
> >>   <fieldType name="text" class="solr.TextField"
> >> positionIncrementGap="1000" sortMissingLast="true"
> >> autoGeneratePhraseQueries="true">
> >>     <analyzer type="index">
> >>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>       <charFilter class="solr.MappingCharFilterFactory"
> >> mapping="mapping-FoldToASCII.txt"/>
> >>       <filter class="solr.CJKBigramFilterFactory"/>
> >>       <filter class="solr.WordDelimiterGraphFilterFactory"
> >> protected="protectedword.txt"
> >>            preserveOriginal="0" splitOnNumerics="1"
> >> splitOnCaseChange="0"
> >>            catenateWords="1" catenateNumbers="1" catenateAll="1"
> >>            generateWordParts="1" generateNumberParts="1"
> >> stemEnglishPossessive="1"
> >>            types="wdfftypes.txt" />
> >>       <filter class="solr.LengthFilterFactory" min="1"
> >> max="2147483647"/>
> >>       <filter class="solr.LowerCaseFilterFactory"/>
> >>     </analyzer>
> >>     <analyzer type="query">
> >>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>       <charFilter class="solr.MappingCharFilterFactory"
> >> mapping="mapping-FoldToASCII.txt"/>
> >>       <filter class="solr.CJKBigramFilterFactory"/>
> >>       <filter class="solr.WordDelimiterGraphFilterFactory"
> >> protected="protectedword.txt"
> >>            preserveOriginal="0" splitOnNumerics="1"
> >> splitOnCaseChange="0"
> >>            catenateWords="1" catenateNumbers="1" catenateAll="1"
> >>            generateWordParts="1" generateNumberParts="1"
> >> stemEnglishPossessive="1"
> >>            types="wdfftypes.txt" />
> >>       <filter class="solr.LengthFilterFactory" min="1"
> >> max="2147483647"/>
> >>       <filter class="solr.LowerCaseFilterFactory"/>
> >>     </analyzer>
> >>   </fieldType>
> >>
> >>
> >>   If we search for a phrase like "Moosburg a.d. Isar" we don't get a
> >> match, though it's definitely in our Index.
> >>   If we search for "Moosburg a. d. Isar" with a blank between "a."
> >> and "d." we get a match.
> >>
> >>   This also happens for other non-word characters, like ' or , for
> >> example.
> >>
> >>   The strange thing is that the Solr Analysis tool reports a match
> >> for the first version, but when we send a Solr query, we get no
> >> result documents.
> >>
> >>   Does anyone have an idea what this could be?
> >>
> >>   Thank you very much in advance,
> >>
> >>   Doris Peter
> >
> >
>
