Re: Proximity Search with Phrases

David Hastings Mon, 08 Sep 2025 09:21:09 -0700

If you want to get really clever, use a new field that applies a stopword
filter and tokenizes on a number of words, so that entire phrases up to a
certain length are indexed as single terms. Then search against this field
and boost the matches.


So "find a red fox" turns into find_red_fox, find_red, and red_fox as three
separate terms in the index. Use those to match against the other phrase
within a certain distance.
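
A minimal sketch of that tokenization, written in plain Python rather than as
an actual Solr analysis chain. The stopword list, the function name shingles,
and the 2-to-3-word size limits are assumptions made for illustration. In Solr
itself the usual building blocks would be a stop filter followed by a shingle
filter (ShingleFilterFactory), and the exact terms it emits depend on how
that filter is configured.

```python
from typing import List

# Assumed stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "on"}


def shingles(text: str, min_size: int = 2, max_size: int = 3,
             sep: str = "_") -> List[str]:
    """Drop stopwords, then join runs of min_size..max_size adjacent words."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    terms = []
    # Emit the longest phrases first, then the shorter ones.
    for size in range(max_size, min_size - 1, -1):
        for i in range(len(tokens) - size + 1):
            terms.append(sep.join(tokens[i:i + size]))
    return terms


print(shingles("find a red fox"))
# → ['find_red_fox', 'find_red', 'red_fox']
```

Indexing these terms in a separate field lets a whole phrase behave like a
single token, so an ordinary term-to-term proximity query may be able to stand
in for phrase-to-phrase proximity.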






On Mon, Sep 8, 2025 at 11:32 AM Matt Kuiper <[email protected]> wrote:

> Thanks for the feedback!
>
> Mikhail - I did not see the complex query parser supporting proximity
> between 2 phrases, however the XmlQParser might via spans.  Thanks for the
> tip!
>
> Gus - we currently use the Surround query parser for proximity between two
> terms. Do you know of a means to use it for proximity between phrases?
> This would be ideal as we have a search client tool already using this
> syntax.
>
> Dave - This type of approach might work for us (possibly like the complex
> query parser): it does not exactly find proximity between two phrases, but
> verifies that all the words within the two phrases are within a proximity
> range.  As you say, this would keep stop words that may still be in the
> index from blocking a match.
>
> Matt
>
> On Mon, Sep 8, 2025 at 7:29 AM Dave <[email protected]> wrote:
>
> > There are other clever ways to do it too, using the within parameter and
> > other things I don’t remember off the top of my head, but I gave a
> > presentation a few years ago that used it.  It uses more raw Solr
> > parameters: you take in a phrase, tokenize it, and find documents that
> > contain the words of that phrase, possibly with other words in between.
> > You restrict the results to documents that have all the words in the
> > phrase within that number of words plus 2 or 3, to allow for stop words
> > that may show up.  For example, “red house hill” would still find “red
> > house on top of the hill” with a proximity of about 7.
> >
> > > On Sep 7, 2025, at 7:15 PM, Gus Heck <[email protected]> wrote:
> > >
> > > Or
> > >
> > > https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#surround-query-parser
> > >
> > >> On Sun, Sep 7, 2025 at 4:32 PM Mikhail Khludnev <[email protected]> wrote:
> > >>
> > >> Hi
> > >> I might be missing the point, but the ways to create spans in Solr are:
> > >>
> > >> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#xml-query-parser
> > >>
> > >> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#complex-phrase-query-parser
> > >>
> > >>
> > >>> On Fri, Sep 5, 2025 at 6:32 PM mtn search <[email protected]> wrote:
> > >>>
> > >>> I may have found what I am running up against, if ChatGPT is correct
> > >>> in its diagnosis.
> > >>>
> > >>> *My sample query*
> > >>> /select?debug=true&indent=true&q={!lucene}spanNear(
> > >>>  spanNear(spanTerm(body:separate),spanTerm(body:email),0,true),
> > >>>  spanNear(spanTerm(body:will),spanTerm(body:be),0,true),
> > >>>  10,false)
> > >>>
> > >>> *Text from the body field of a message that is returned by the
> > >>> spanNear query above (I believe incorrectly)*
> > >>>       "separate device there will not be any load on the email servers"
> > >>>
> > >>> *Same text through analyzer*
> > >>> text      raw_bytes                  start  end
> > >>> separate  [73 65 70 61 72 61 74 65]  5      13
> > >>> device    [64 65 76 69 63 65]        14     20
> > >>> there     [74 68 65 72 65]           21     26
> > >>> will      [77 69 6c 6c]              27     31
> > >>> not       [6e 6f 74]                 32     35
> > >>> be        [62 65]                    36     38
> > >>> any       [61 6e 79]                 39     42
> > >>> load      [6c 6f 61 64]              43     47
> > >>> on        [6f 6e]                    48     50
> > >>> the       [74 68 65]                 51     54
> > >>> email     [65 6d 61 69 6c]           55     60
> > >>> server    [73 65 72 76 65 72]        61     68
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> *Chatgpt assessment*
> > >>>
> > >>>    Now, let’s check the spans:
> > >>>
> > >>>   - Inner spanNear(separate, email, 0, true) is *not* going to match
> > >>>     directly, because email isn’t right after separate.
> > >>>   - But Lucene is allowed to *reposition* the spans when used as
> > >>>     children of the outer spanNear. Each child span doesn’t need to be
> > >>>     contiguous unless it resolves to a valid match somewhere in the
> > >>>     text.
> > >>>
> > >>> *Conclusion:* This last point may explain why the message above was
> > >>> returned by the query above, but this appears to be incorrect.  While
> > >>> the words/tokens in the query are in the message, they do not honor the
> > >>> proximity specified.  Apparently child spans do not have to honor the
> > >>> proximity rules specified.  AI suggested this query for proximity; I am
> > >>> now concluding it is not a valid approach.
> > >>>
> > >>> I am not seeing a Solr/Lucene HTTP query approach for a proximity
> > >>> search between phrases, other than possibly using the Lucene Java API
> > >>> for more control.
> > >>>
> > >>> If others have found a workable solution, please let me know.
> > >>>
> > >>> Thanks,
> > >>> Matt
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>> On Thu, Sep 4, 2025 at 3:26 PM mtn search <[email protected]> wrote:
> > >>>
> > >>>> Also, I am using the Solr Admin Analysis UI to verify how Solr is
> > >>>> tokenizing the messages and to manually verify the positions of tokens.
> > >>>>
> > >>>> Debug view of the query side:
> > >>>> For query:
> > >>>> "*params*":{
> > >>>>      "q":"{!lucene}SpanNearQuery(body,(money question),5,true)",
> > >>>>      "df":"body",
> > >>>>      "debug":"true",
> > >>>>      "indent":"true",
> > >>>>      "q.op":"OR",
> > >>>>      "wt":"json"}},
> > >>>>
> > >>>> It seems odd that in the parsed query the "body" field name is
> > >>>> prepended to the value 5 and the text true.
> > >>>>  "*debug*":{
> > >>>>    "rawquerystring":"{!lucene}SpanNearQuery(body,(money question),5,true)",
> > >>>>    "querystring":"{!lucene}SpanNearQuery(body,(money question),5,true)",
> > >>>>    "parsedquery":"body:spannearquery (body:body (body:money body:question) (body:5 body:true))",
> > >>>>    "*parsedquery_toString*":"body:spannearquery (body:body (body:money body:question) (body:5 body:true))",
> > >>>>    "explain":{
> > >>>>
> > >>>> On Thu, Sep 4, 2025 at 12:04 PM mtn search <[email protected]> wrote:
> > >>>>
> > >>>>> Thanks Tim!  Yes, I have tried a variety of values and am aware of
> > >>>>> ordering vs. non-ordering.  I am getting more results than expected,
> > >>>>> and some that do not match the proximity criteria.  So when I set it
> > >>>>> to a small value like 2, I expected to see the result count drop
> > >>>>> significantly, as many would not match the criteria.  Unfortunately,
> > >>>>> the count does not drop.  Looks like a fundamental problem with how
> > >>>>> I am using the syntax.  Still researching, and open to suggestions.
> > >>>>>
> > >>>>> Matt
> > >>>>>
> > >>>>> On Thu, Sep 4, 2025 at 11:54 AM Tim Casey <[email protected]> wrote:
> > >>>>>
> > >>>>>> Usually span and proximity problems are off-by-one issues.
> > >>>>>> Specifically, the order of the tokens will change the distance
> > >>>>>> calculation.  I do not have an example off the top of my head, but
> > >>>>>> when I was doing this, I usually started with a larger span and
> > >>>>>> brought it down by looking at results.
> > >>>>>>
> > >>>>>> This is the case for the old "phrase words"~5 syntax.
> > >>>>>>
> > >>>>>> As an aside, I take "not working" to mean you are not getting
> > >>>>>> results but the query passes parsing.  "Not working" could mean a
> > >>>>>> lot more in this context.  So I am suggesting, instead of 2, try 10.
> > >>>>>>
> > >>>>>> On Thu, Sep 4, 2025 at 10:43 AM mtn search <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> Looking for guidance on approaches to implement a proximity search
> > >>>>>>> between phrases.
> > >>>>>>>
> > >>>>>>> Initially tried:
> > >>>>>>>
> > >>>>>>> "q":"{!lucene}spanNear(spanNear(spanNear(spanTerm(body:off),spanTerm(body:the),0,true),
> > >>>>>>> spanTerm(body: record),0,true),
> > >>>>>>> spanNear(spanTerm(body:new),spanTerm(body: information),0,true) , 2N,false)",
> > >>>>>>>      "defType":"lucene",
> > >>>>>>>      "df":"body",
> > >>>>>>>
> > >>>>>>> However, I then simplified it to just two terms:
> > >>>>>>>
> > >>>>>>> "q":"{!lucene}spanNear(spanTerm(body:off),spanTerm(body:call),2,true)",
> > >>>>>>>      "defType":"lucene",
> > >>>>>>>      "df":"body",
> > >>>>>>>
> > >>>>>>> Both are not working.  Any tips?  Currently on Solr 9.4, but will
> > >>>>>>> likely need to run for some time on a Solr 6 instance.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Matt
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Sincerely yours
> > >> Mikhail Khludnev
> > >>
> > >
> > >
> > > --
> > > http://www.needhamsoftware.com (work)
> > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> >
>
