The SpanNearQuery you see for the "a.b." input with WDGF is expected behavior: WDGF causes the query to search for ("ab")|("a" "b"), i.e. as 1 or 2 tokens, respectively. The whitespace-separated "a. b." input is tokenized simply as "a" "b" (2 tokens), so it sticks with the more straightforward PhraseQuery implementation.
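
To make the difference between the two inputs concrete, here is a rough Python sketch. It is only a toy stand-in for WDGF with word parts plus catenation and no preserveOriginal; the function names (wdgf_like, analyze, query_shape) are made up for illustration and are not Lucene or Solr APIs, and the real query builder has more cases than the two modeled here.

```python
import re

def wdgf_like(token):
    """Toy stand-in for WDGF: return the alternative paths for one
    whitespace-separated token. Each path is a list of lowercased terms.
    (Hypothetical helper, not the real filter.)"""
    parts = [p.lower() for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if len(parts) <= 1:
        # Nothing to split: a single path (or none for pure punctuation).
        return [parts] if parts else []
    # Catenated form (1 token) OR the individual word parts (N tokens).
    return [["".join(parts)], parts]

def analyze(phrase):
    """Split on whitespace, then apply the toy filter to every token."""
    return [alts for alts in (wdgf_like(t) for t in phrase.split()) if alts]

def query_shape(phrase):
    """Assumed rule of thumb: any position with more than one path means
    the token stream is a graph, which ends up as a span query."""
    has_graph = any(len(alts) > 1 for alts in analyze(phrase))
    return "SpanNearQuery" if has_graph else "PhraseQuery"

if __name__ == "__main__":
    for q in ("Neuburg a. d. Donau", "Neuburg a.d. Donau"):
        print(q, "->", query_shape(q), analyze(q))
```

Only the "a.d." token produces two alternative paths, which is what turns the whole phrase into the spanNear/spanOr structure.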
That said, the problem you're encountering is related to a couple of known issues:

https://issues.apache.org/jira/browse/LUCENE-7398
https://issues.apache.org/jira/browse/LUCENE-4312

For this case specifically, the problem is that NearSpansOrdered lazily returns one match per position *for the first subclause*. Because positionLength is not indexed, the "or" clause ("ab")|("a" "b") will always return "ab" first, with an implicit positionLength of 1. The actual positionLength of 2 that index-time WDGF assigned to "ab" is not stored in the index, so the implicit positionLength of 1 at query time gives the impression of a gap between "ab" and "isar", violating the slop=0 constraint. Because NearSpansOrdered.nextStartPosition() always advances by calling nextStartPosition() on the first subclause (without exploring variant matches in the other subclauses), the top-level NearSpansOrdered moves on after one attempt at matching, and the valid match is missed.

Pending fixes for the underlying issue (there is a candidate patch for LUCENE-7398 that incorporates a workaround for LUCENE-4312), you could mitigate the problem to some extent in one of two ways. You could force slop>0, which as of 7.6 will be expanded into a MultiPhraseQuery (see https://issues.apache.org/jira/browse/LUCENE-8531). Or you could set preserveOriginal=true on both index-time and query-time WDGF and upgrade to 8.1, which would prevent the extreme case of an *exact* character-for-character matching query turning up no results (see https://issues.apache.org/jira/browse/LUCENE-8730).

On Fri, May 17, 2019 at 11:47 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> I’ll leave that explanation to someone who understands query parsers ;)
>
> > On May 17, 2019, at 7:57 AM, Doris Peter <doris.pe...@bsb-muenchen.de> wrote:
> >
> > Thanks a lot! I tried the debug parameter, which shows interesting
> > differences:
> >
> > "debug": {
> >   "rawquerystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> >   "querystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> >   "parsedquery": "PhraseQuery(all_places_txt:\"neuburg a d donau\")",
> >   "parsedquery_toString": "all_places_txt:\"neuburg a d donau\"",
> >   "QParser": "LuceneQParser"
> > }
> >
> > "debug": {
> >   "rawquerystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> >   "querystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> >   "parsedquery": "SpanNearQuery(spanNear([all_places_txt:neuburg, spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 0, true)]), all_places_txt:donau], 0, true))",
> >   "parsedquery_toString": "spanNear([all_places_txt:neuburg, spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 0, true)]), all_places_txt:donau], 0, true)",
> >   "QParser": "LuceneQParser"
> > }
> >
> > Something seems to go wrong here, as the parsedquery contains the
> > SpanNearQuery instead of a PhraseQuery.
> >
> > >>> Erick Erickson <erickerick...@gmail.com> 5/17/2019 4:27 PM >>>
> > Three things:
> >
> > 1> WordDelimiterGraphFilterFactory requires FlattenGraphFilterFactory
> > after it in the index config
> >
> > 2> It is usually unnecessary to have the exact same parameters at both
> > query and index time for WDGFF. If you’ve split parts up at index time
> > then mashed them all back together, you can usually only split them up
> > at query time.
> >
> > 3> try adding &debug=query to the query and see what the results show
> > for the parsed query. That usually gives you a clue what is really
> > happening vs. what you think is happening.
> >
> > Best,
> > Erick
> >
> >> On May 17, 2019, at 12:59 AM, Doris Peter <doris.pe...@bsb-muenchen.de> wrote:
> >>
> >> Hello,
> >>
> >> We use Solr 7.6.0 to build our index, and I have got a Question about
> >> Phrase Queries:
> >>
> >> We use the following configuration in schema.xml:
> >>
> >> <!-- Text Standard -->
> >> <fieldType name="text" class="solr.TextField"
> >>     positionIncrementGap="1000" sortMissingLast="true"
> >>     autoGeneratePhraseQueries="true">
> >>   <analyzer type="index">
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>     <charFilter class="solr.MappingCharFilterFactory"
> >>         mapping="mapping-FoldToASCII.txt"/>
> >>     <filter class="solr.CJKBigramFilterFactory"/>
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>         protected="protectedword.txt"
> >>         preserveOriginal="0" splitOnNumerics="1"
> >>         splitOnCaseChange="0"
> >>         catenateWords="1" catenateNumbers="1" catenateAll="1"
> >>         generateWordParts="1" generateNumberParts="1"
> >>         stemEnglishPossessive="1"
> >>         types="wdfftypes.txt" />
> >>     <filter class="solr.LengthFilterFactory" min="1"
> >>         max="2147483647"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>     <charFilter class="solr.MappingCharFilterFactory"
> >>         mapping="mapping-FoldToASCII.txt"/>
> >>     <filter class="solr.CJKBigramFilterFactory"/>
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>         protected="protectedword.txt"
> >>         preserveOriginal="0" splitOnNumerics="1"
> >>         splitOnCaseChange="0"
> >>         catenateWords="1" catenateNumbers="1" catenateAll="1"
> >>         generateWordParts="1" generateNumberParts="1"
> >>         stemEnglishPossessive="1"
> >>         types="wdfftypes.txt" />
> >>     <filter class="solr.LengthFilterFactory" min="1"
> >>         max="2147483647"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>   </analyzer>
> >> </fieldType>
> >>
> >> If we search for a phrase like "Moosburg a.d. Isar" we don't get a
> >> match, though it's definitely in our Index.
> >> If we search for "Moosburg a. d. Isar" with a blank between "a."
> >> and "d." we get a match.
> >>
> >> This also happens for other non-word characters, like ' or , for
> >> example.
> >>
> >> The strange thing about it is, that the Solr Analysis-Tool reports
> >> a match for the first version, but when we send a Solr Query, we get no
> >> result Documents.
> >>
> >> Has anyone got an idea, what this could be?
> >>
> >> Thank you very much in advance,
> >>
> >> Doris Peter
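
For anyone who wants to see the missed match on the "Moosburg a.d. Isar" example from the original question, below is a small Python toy model. It is a sketch under stated assumptions, not Lucene code: the index layout (moosburg@0, ad@1, a@1, d@2, isar@3, every term with an implicit length of 1) is what I would expect from the posted config without stored positionLength, and all function names are hypothetical. lazy_near tries only the first candidate span of each later clause per start position of the first clause; exhaustive_near tries every combination.

```python
# Toy index: term -> start positions. positionLength is NOT stored,
# so every term occupies the span [pos, pos + 1).
INDEX = {"moosburg": [0], "ad": [1], "a": [1], "d": [2], "isar": [3]}

def term_spans(term):
    return [(p, p + 1) for p in INDEX.get(term, [])]

def ordered_pairs(left, right):
    """All ordered, slop=0 concatenations of two span lists."""
    return sorted((s1, e2) for (s1, e1) in left
                  for (s2, e2) in right if s2 == e1)

def span_or(*clauses):
    """Union of spans, ordered by start and then end, so the shorter
    "ad" span comes before the longer "a" "d" span."""
    return sorted({sp for clause in clauses for sp in clause})

def lazy_near(clauses):
    """Simplified model of the lazy behavior: the first clause drives, and
    each later clause contributes only its first span starting at or after
    the previous end. (Rescans from the start for simplicity.)"""
    matches = []
    for start, end in clauses[0]:
        for spans in clauses[1:]:
            cand = next((sp for sp in spans if sp[0] >= end), None)
            if cand is None or cand[0] != end:  # a gap violates slop=0
                break
            end = cand[1]
        else:
            matches.append((start, end))
    return matches

def exhaustive_near(clauses):
    """Reference behavior: consider every combination of subclause spans."""
    result = clauses[0]
    for spans in clauses[1:]:
        result = ordered_pairs(result, spans)
    return result

# spanNear([moosburg, spanOr([ad, spanNear([a, d])]), isar], 0, true)
graph_query = [
    term_spans("moosburg"),
    span_or(term_spans("ad"),
            ordered_pairs(term_spans("a"), term_spans("d"))),
    term_spans("isar"),
]

if __name__ == "__main__":
    print("lazy:      ", lazy_near(graph_query))
    print("exhaustive:", exhaustive_near(graph_query))
```

In this model the lazy variant picks "ad" as span (1, 2), sees a gap before "isar" at position 3, and gives up, while the exhaustive variant also tries "a" "d" as span (1, 3) and finds the match.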