Re: Proximity Search with Phrases

Mikhail Khludnev Fri, 12 Sep 2025 12:43:54 -0700

I've checked the surround parser. It turns out that it lacks support for
braces. I've also added a reproducer for the nested spans issue, which
intervals are able to handle:
https://github.com/mkhludnev/solr-flexible-qparser/blob/860e17c16153b1d3ef337f099b0d9f572620e9b1/src/test/java/org/apache/solr/flexibleqp/TestCompeteWithSpans.java#L49
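
For anyone following along, here is a minimal sketch of the manual position
check described further down the thread, applied to the quoted sample text.
It assumes the twelve analyzed tokens occupy consecutive positions (the
quoted analysis output lists offsets, not positions). The helper names
phrase_positions and phrases_within are made up for illustration, and the
gap computation is a simplification, not Lucene's own slop calculation.

```python
def phrase_positions(tokens, phrase):
    """Return the start indexes where `phrase` occurs as adjacent, in-order tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def phrases_within(tokens, first, second, slop):
    """True if an exact occurrence of each phrase exists, in either order,
    with at most `slop` tokens between the two occurrences."""
    for a in phrase_positions(tokens, first):
        for b in phrase_positions(tokens, second):
            a_end, b_end = a + len(first), b + len(second)
            if a_end <= b:
                gap = b - a_end
            elif b_end <= a:
                gap = a - b_end
            else:
                continue  # the two occurrences overlap; skip them
            if gap <= slop:
                return True
    return False


# Analyzed tokens of the sample text, as listed in the quoted analysis output.
tokens = "separate device there will not be any load on the email server".split()

# Every query word is present in the document ...
all_present = all(w in tokens for w in ["separate", "email", "will", "be"])
print(all_present)

# ... but neither exact phrase occurs, so the nested proximity cannot hold.
print(phrase_positions(tokens, ["separate", "email"]))
print(phrase_positions(tokens, ["will", "be"]))
print(phrases_within(tokens, ["separate", "email"], ["will", "be"], 10))
```

Under these assumptions the sample text contains all four words but neither
exact phrase, so a real nested span query should not return it. That fits
the parsed query quoted further down, where the function-style text appears
to be handled as ordinary terms on the body field.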


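The XML query parser mentioned further down the thread is one way to send
nested spans over HTTP. Below is a rough sketch that builds such a request.
The element and attribute names (SpanNear, SpanTerm, slop, inOrder,
fieldName) are as I understand Lucene's XML CoreParser and should be checked
against the Solr version in use; the helper names are hypothetical. As far
as I know, SpanTerm values are not analyzed, so they have to match the
indexed tokens. Nested spans built this way may still be subject to
LUCENE-7398.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def phrase_span(parent, words):
    """Append an ordered, zero-slop SpanNear (an exact phrase) under `parent`."""
    near = ET.SubElement(parent, "SpanNear", slop="0", inOrder="true")
    for word in words:
        ET.SubElement(near, "SpanTerm").text = word
    return near


def nested_span_xml(field, first, second, slop):
    """Build an unordered SpanNear over two exact-phrase SpanNear children."""
    outer = ET.Element("SpanNear", fieldName=field, slop=str(slop), inOrder="false")
    phrase_span(outer, first)
    phrase_span(outer, second)
    return ET.tostring(outer, encoding="unicode")


query_xml = nested_span_xml("body", ["separate", "email"], ["will", "be"], 10)

# URL-encode the query for a /select request; debug=true shows the parsed query.
params = urlencode({"q": "{!xmlparser}" + query_xml, "debug": "true"})
print("/select?" + params)
```

Checking parsedquery in the debug output should show whether the request
was really turned into span queries rather than plain term clauses.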

On Tue, Sep 9, 2025 at 1:12 PM Mikhail Khludnev <[email protected]> wrote:

> Right. complexphrase is not an option for nesting.
> I'm wondering whether you have run into
> https://issues.apache.org/jira/browse/LUCENE-7398. Please let us know if
> you do.
> I'm interested in whether intervals are an option for such cases.
>
> On Mon, Sep 8, 2025 at 6:31 PM Matt Kuiper <[email protected]> wrote:
>
>> Thanks for the feedback!
>>
>> Mikhail - I did not see the complex phrase query parser supporting
>> proximity between 2 phrases; however, the XmlQParser might via spans.
>> Thanks for the tip!
>>
>> Gus - we currently use the Surround query parser for proximity between
>> two terms. Do you know of a means to use it for proximity between phrases?
>> This would be ideal as we have a search client tool already using this
>> syntax.
>>
>> Dave - This type of approach might work for us (possibly like the complex
>> phrase query parser): it does not exactly find proximity between two
>> phrases, but verifies that all the words within the two phrases are within
>> a proximity range. As you say, this could keep stop words that may still
>> be in the index from blocking a match.
>>
>> Matt
>>
>> On Mon, Sep 8, 2025 at 7:29 AM Dave <[email protected]> wrote:
>>
>> > There are other clever ways to do it too, using the within parameter and
>> > other things I don’t remember off the top of my head, but I gave a
>> > presentation a few years ago that utilized it. It uses more raw Solr
>> > parameters: you take in a phrase, tokenize it, and find documents that
>> > contain that phrase but may have other words inside it. You restrict the
>> > results to documents that have all the words in the phrase within that
>> > number of words plus 2 or 3, to take care of stop words that may show up.
>> > For example, “red house hill” would still find “red house on top of the
>> > hill”, with the words within a proximity of about 7 of each other.
>> >
>> > > On Sep 7, 2025, at 7:15 PM, Gus Heck <[email protected]> wrote:
>> > >
>> > > Or
>> > >
>> >
>> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#surround-query-parser
>> > >
>> > >> On Sun, Sep 7, 2025 at 4:32 PM Mikhail Khludnev <[email protected]>
>> > wrote:
>> > >>
>> > >> Hi
>> > >> I might be missing the point, but the ways to create spans in Solr are:
>> > >>
>> > >> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#xml-query-parser
>> > >>
>> > >> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#complex-phrase-query-parser
>> > >>
>> > >>
>> > >>> On Fri, Sep 5, 2025 at 6:32 PM mtn search <[email protected]>
>> wrote:
>> > >>>
>> > >>> I may have found what I am running up against, if ChatGPT's
>> > >>> diagnosis is correct.
>> > >>>
>> > >>> *My sample query*
>> > >>> /select?debug=true&indent=true&q={!lucene}spanNear(
>> > >>>  spanNear(spanTerm(body:separate),spanTerm(body:email),0,true),
>> > >>>  spanNear(spanTerm(body:will),spanTerm(body:be),0,true),
>> > >>>  10,false)
>> > >>>
>> > >>> *Text from the body field of a message that is returned by the
>> > >>> spanNear query above (I believe incorrectly)*
>> > >>>       "separate device there will not be any load on the email servers"
>> > >>>
>> > >>> *Same text through analyzer*
>> > >>> text      raw_bytes                  start  end
>> > >>> separate  [73 65 70 61 72 61 74 65]  5      13
>> > >>> device    [64 65 76 69 63 65]        14     20
>> > >>> there     [74 68 65 72 65]           21     26
>> > >>> will      [77 69 6c 6c]              27     31
>> > >>> not       [6e 6f 74]                 32     35
>> > >>> be        [62 65]                    36     38
>> > >>> any       [61 6e 79]                 39     42
>> > >>> load      [6c 6f 61 64]              43     47
>> > >>> on        [6f 6e]                    48     50
>> > >>> the       [74 68 65]                 51     54
>> > >>> email     [65 6d 61 69 6c]           55     60
>> > >>> server    [73 65 72 76 65 72]        61     68
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> *Chatgpt assessment*
>> > >>>
>> > >>>    Now, let’s check the spans:
>> > >>>
>> > >>>   - Inner spanNear(separate, email, 0, true) is *not* going to match
>> > >>>     directly, because email isn’t right after separate.
>> > >>>   - But Lucene is allowed to *reposition* the spans when used as children
>> > >>>     of the outer spanNear. Each child span doesn’t need to be contiguous
>> > >>>     unless it resolves to a valid match somewhere in the text.
>> > >>>
>> > >>> *Conclusion:* This last point may explain why the message above was
>> > >>> returned by the query above, even though the match appears to be
>> > >>> incorrect. While the words/tokens in the query are in the message,
>> > >>> they do not honor the proximity specified. Apparently child spans do
>> > >>> not have to honor the proximity rules specified. AI suggested this
>> > >>> query for proximity; I am now concluding it is not a valid approach.
>> > >>>
>> > >>> I am not seeing a Solr/Lucene HTTP query approach for a proximity
>> > >>> search between phrases, other than possibly using the Lucene Java API
>> > >>> for more control.
>> > >>>
>> > >>> If others have found a workable solution, please let me know.
>> > >>>
>> > >>> Thanks,
>> > >>> Matt
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>> On Thu, Sep 4, 2025 at 3:26 PM mtn search <[email protected]>
>> > wrote:
>> > >>>
>> > >>>> Also, I am using the Solr Admin Analysis UI to verify how Solr is
>> > >>>> tokenizing the messages and to manually verify the positions of the
>> > >>>> tokens.
>> > >>>>
>> > >>>> Debug view of the query side:
>> > >>>> For query:
>> > >>>> "*params*":{
>> > >>>>      "q":"{!lucene}SpanNearQuery(body,(money question),5,true)",
>> > >>>>      "df":"body",
>> > >>>>      "debug":"true",
>> > >>>>      "indent":"true",
>> > >>>>      "q.op":"OR",
>> > >>>>      "wt":"json"}},
>> > >>>>
>> > >>>> It seems odd that in the parsed query the "body" field name is
>> > >>>> prepended to the value 5 and the text true.
>> > >>>>  "*debug*":{
>> > >>>>    "rawquerystring":"{!lucene}SpanNearQuery(body,(money question),5,true)",
>> > >>>>    "querystring":"{!lucene}SpanNearQuery(body,(money question),5,true)",
>> > >>>>    "parsedquery":"body:spannearquery (body:body (body:money body:question) (body:5 body:true))",
>> > >>>>    "*parsedquery_toString*":"body:spannearquery (body:body (body:money body:question) (body:5 body:true))",
>> > >>>>    "explain":{
>> > >>>>
>> > >>>> On Thu, Sep 4, 2025 at 12:04 PM mtn search <[email protected]>
>> > >> wrote:
>> > >>>>
>> > >>>>> Thanks Tim! Yes, I have tried a variety of values and am aware of
>> > >>>>> ordering vs. non-ordering. I am getting more results than expected,
>> > >>>>> and some that do not match the proximity criteria. So when I set it
>> > >>>>> to a small value like 2, I was expecting to see the result count
>> > >>>>> drop significantly, as many would not match the criteria.
>> > >>>>> Unfortunately, the count does not drop. It looks like a fundamental
>> > >>>>> problem with how I am using the syntax. Still researching, and open
>> > >>>>> to suggestions.
>> > >>>>>
>> > >>>>> Matt
>> > >>>>>
>> > >>>>> On Thu, Sep 4, 2025 at 11:54 AM Tim Casey <[email protected]>
>> wrote:
>> > >>>>>
>> > >>>>>> Usually span and proximity problems are off-by-one issues.
>> > >>>>>> Specifically, the order of the tokens will change the distance
>> > >>>>>> calculation. I do not have an example off the top of my head. But
>> > >>>>>> when I was doing this, I usually started with a larger span and
>> > >>>>>> brought it down by looking at the results.
>> > >>>>>>
>> > >>>>>> This is the case for the old "phrase words"~5 syntax.
>> > >>>>>>
>> > >>>>>> As an aside, I take "not working" to mean you are not getting
>> > >>>>>> results but the query passes parsing. "Not working" could mean a
>> > >>>>>> lot more in this context. So I am suggesting, instead of 2, try 10.
>> > >>>>>>
>> > >>>>>> On Thu, Sep 4, 2025 at 10:43 AM mtn search <[email protected]>
>> > >>> wrote:
>> > >>>>>>
>> > >>>>>>> Hello,
>> > >>>>>>>
>> > >>>>>>> Looking for guidance on approaches to implement a proximity
>> > >>>>>>> search between phrases.
>> > >>>>>>>
>> > >>>>>>> Initially tried:
>> > >>>>>>>
>> > >>>>>>> "q":"{!lucene}spanNear(spanNear(spanNear(spanTerm(body:off),spanTerm(body:the),0,true),
>> > >>>>>>> spanTerm(body: record),0,true), spanNear(spanTerm(body:new),spanTerm(body:
>> > >>>>>>> information),0,true) , 2N,false)",
>> > >>>>>>>      "defType":"lucene",
>> > >>>>>>>      "df":"body",
>> > >>>>>>>
>> > >>>>>>> However then simplified to just two terms:
>> > >>>>>>>
>> > >>>>>>> "q":"{!lucene}spanNear(spanTerm(body:off),spanTerm(body:call),2,true)",
>> > >>>>>>>      "defType":"lucene",
>> > >>>>>>>      "df":"body",
>> > >>>>>>>
>> > >>>>>>> Neither is working. Any tips? Currently on Solr 9.4, but will
>> > >>>>>>> likely need to run for some time on a Solr 6 instance.
>> > >>>>>>>
>> > >>>>>>> Thanks,
>> > >>>>>>> Matt
>> > >>>>>>>
>> > >>>>>>
>> > >>>>>
>> > >>>
>> > >>
>> > >>
>> > >> --
>> > >> Sincerely yours
>> > >> Mikhail Khludnev
>> > >>
>> > >
>> > >
>> > > --
>> > > http://www.needhamsoftware.com (work)
>> > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
>> >
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Sincerely yours
Mikhail Khludnev
