If you want to get really clever, add a new field that tokenizes on a number of words (word shingles) and also uses a stopword filter. This will index entire phrases, up to a certain length, as single terms. Then search against this field and boost the matches.
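A rough sketch of what such a field would end up holding, in plain Python rather than a Solr analysis chain. The stopword list, the function name `phrase_shingles`, and the 2-to-3 word range are assumptions made for illustration; in Solr the equivalent would normally be configured on the field type (a stop filter followed by a shingle filter).

```python
# Illustrative stopword list only; not Solr's default stopwords file.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "to"}


def phrase_shingles(text, min_size=2, max_size=3, sep="_"):
    """Drop stopwords, then emit every run of min_size..max_size
    consecutive remaining words as a single joined term."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    shingles = []
    # Longest shingles first, then shorter ones, left to right.
    for size in range(max_size, min_size - 1, -1):
        for start in range(len(words) - size + 1):
            shingles.append(sep.join(words[start:start + size]))
    return shingles


print(phrase_shingles("find a red fox"))
# ['find_red_fox', 'find_red', 'red_fox']
```

Each joined term then behaves like a single token, so an ordinary two-term proximity query between two such tokens approximates proximity between two phrases.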
So "find a red fox" turns into find_red_fox, find_red, and red_fox as three separate terms in the index. Use that to match against the other phrase within a certain distance.

On Mon, Sep 8, 2025 at 11:32 AM Matt Kuiper <kuipe...@gmail.com> wrote:

> Thanks for the feedback!
>
> Mikhail - I did not see the complex query parser supporting proximity
> between 2 phrases, however the XmlQParser might via spans. Thanks for the
> tip!
>
> Gus - we currently use the Surround query parser for proximity between two
> terms. Do you know of a means to use it for proximity between phrases?
> This would be ideal as we have a search client tool already using this
> syntax.
>
> Dave - This type of approach might work for us (possibly like the complex
> query parser) where it is not exactly finding proximity between two
> phrases. But verifying that all the words within two phrases are within a
> proximity range. As you say this could handle stop words that may still be
> in the index from not blocking a match.
>
> Matt
>
> On Mon, Sep 8, 2025 at 7:29 AM Dave <hastings.recurs...@gmail.com> wrote:
>
> > There are other clever ways to do it too, using the within parameter, and
> > other things I don’t remember off the top of my head but I gave a
> > presentation a few years ago that utilized it. It uses more raw solr
> > parameters that you can take in a phrase but tokenize them and find out
> > documents that have that phrase but may have words inside them, so you
> > restrict the results to only documents that have all the words in the
> > phrase but within that number of words plus 2 or 3 to take care of stop
> > words that may show up, like “red house hill” would still find “red house
> > on top of the hill” within a proximity to each other of about 7.
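The rule Dave describes (all words of the phrase present, within the phrase length plus a few extra positions) can be illustrated outside Solr. This is only a sketch of the idea: the function names `smallest_window` and `within` are hypothetical, tokenization is a plain whitespace split, and it is not how Lucene itself scores slop.

```python
def smallest_window(tokens, terms):
    """Return the length (in tokens) of the shortest span of `tokens`
    that contains every term in `terms`, or None if a term is missing."""
    needed = set(terms)
    last_seen = {}  # term -> most recent position
    best = None
    for pos, tok in enumerate(tokens):
        if tok in needed:
            last_seen[tok] = pos
            if len(last_seen) == len(needed):
                # Shortest window ending at `pos` starts at the oldest
                # of the most recent occurrences.
                width = pos - min(last_seen.values()) + 1
                if best is None or width < best:
                    best = width
    return best


def within(text, phrase, max_window):
    """True if all words of `phrase` occur in `text` inside max_window tokens."""
    width = smallest_window(text.lower().split(), phrase.lower().split())
    return width is not None and width <= max_window


print(smallest_window("red house on top of the hill".split(),
                      ["red", "house", "hill"]))  # 7
```

Here the three-word phrase needs a window of 7 tokens because of the four intervening words, which matches the "about 7" in the example above.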
> >
> > > On Sep 7, 2025, at 7:15 PM, Gus Heck <gus.h...@gmail.com> wrote:
> > >
> > > Or
> > >
> > > https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#surround-query-parser
> > >
> > >> On Sun, Sep 7, 2025 at 4:32 PM Mikhail Khludnev <m...@apache.org> wrote:
> > >>
> > >> Hi
> > >> I might be missing a point. But the ways to create spans in Solr are:
> > >>
> > >> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#xml-query-parser
> > >>
> > >> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#complex-phrase-query-parser
> > >>
> > >>> On Fri, Sep 5, 2025 at 6:32 PM mtn search <search...@gmail.com> wrote:
> > >>>
> > >>> I may have found what I am running up against - if ChatGPT is correct
> > >>> on diagnosis?
> > >>>
> > >>> *My sample query*
> > >>> /select?debug=true&indent=true&q={!lucene}spanNear(
> > >>> spanNear(spanTerm(body:separate),spanTerm(body:email),0,true),
> > >>> spanNear(spanTerm(body:will),spanTerm(body:be),0,true),
> > >>> 10,false)
> > >>>
> > >>> *Text from body field from a message where the message is returned
> > >>> from the spanNear query above (I believe incorrectly)*
> > >>> "separate device there will not be any load on the email servers"
> > >>>
> > >>> *Same text through analyzer*
> > >>> text      raw_bytes                  start  end
> > >>> separate  [73 65 70 61 72 61 74 65]  5      13
> > >>> device    [64 65 76 69 63 65]        14     20
> > >>> there     [74 68 65 72 65]           21     26
> > >>> will      [77 69 6c 6c]              27     31
> > >>> not       [6e 6f 74]                 32     35
> > >>> be        [62 65]                    36     38
> > >>> any       [61 6e 79]                 39     42
> > >>> load      [6c 6f 61 64]              43     47
> > >>> on        [6f 6e]                    48     50
> > >>> the       [74 68 65]                 51     54
> > >>> email     [65 6d 61 69 6c]           55     60
> > >>> server    [73 65 72 76 65 72]        61     68
> > >>>
> > >>> *ChatGPT assessment*
> > >>>
> > >>> Now, let’s check the spans:
> > >>>
> > >>> - Inner spanNear(separate, email, 0, true) is *not* going to match
> > >>>   directly, because email isn’t right after separate.
> > >>> - But Lucene is allowed to *reposition* the spans when used as children
> > >>>   of the outer spanNear. Each child span doesn’t need to be contiguous
> > >>>   unless it resolves to a valid match somewhere in the text.
> > >>>
> > >>> *Conclusion: *This last line may explain why the message above was
> > >>> returned by the query above, but appears to be incorrect. While the
> > >>> words/tokens in the query are in the message they do not honor the
> > >>> proximity specified. But apparently children spans do not have to
> > >>> honor the proximity rules specified. AI suggested this query for
> > >>> proximity, I am now concluding it is not a valid approach.
> > >>>
> > >>> I am not seeing a Solr/Lucene http query approach for a proximity
> > >>> search between phrases, other than possibly to use the Lucene Java API
> > >>> for more control.
> > >>>
> > >>> If others have found a workable solution, please let me know.
> > >>>
> > >>> Thanks,
> > >>> Matt
> > >>>
> > >>>> On Thu, Sep 4, 2025 at 3:26 PM mtn search <search...@gmail.com> wrote:
> > >>>>
> > >>>> Also, I am using the SolrAdmin Analysis UI to verify how Solr is
> > >>>> tokenizing the messages and verifying manually position between
> > >>>> tokens.
> > >>>>
> > >>>> Debug view of the query side:
> > >>>> For query:
> > >>>> "*params*":{
> > >>>> "q":"{!lucene}SpanNearQuery(body,(money question),5,true)",
> > >>>> "df":"body",
> > >>>> "debug":"true",
> > >>>> "indent":"true",
> > >>>> "q.op":"OR",
> > >>>> "wt":"json"}},
> > >>>>
> > >>>> It seems odd that in the parsed query the "body" field name is
> > >>>> prepended to the value 5 and the text true.
> > >>>> "*debug*":{
> > >>>> "rawquerystring":"{!lucene}SpanNearQuery(body,(money question),5,true)",
> > >>>> "querystring":"{!lucene}SpanNearQuery(body,(money question),5,true)",
> > >>>> "parsedquery":"body:spannearquery (body:body (body:money body:question) (body:5 body:true))",
> > >>>> "*parsedquery_toString*":*"body:spannearquery *(body:body (body:money body:question)* (body:5 body:true*))",
> > >>>> "explain":{
> > >>>>
> > >>>> On Thu, Sep 4, 2025 at 12:04 PM mtn search <search...@gmail.com> wrote:
> > >>>>
> > >>>>> Thanks Tim! Yes I have tried a variety of values and am aware
> > >>>>> of ordering vs non ordering. I am getting more results than expected
> > >>>>> and some that do not match the proximity criteria. So when I set it
> > >>>>> to a small value like 2, I was seeking to see the result count drop
> > >>>>> significantly as many would not match criteria. Unfortunately, the
> > >>>>> count does not drop. Looks like a fundamental problem with how I am
> > >>>>> using the syntax. Still researching, and open to suggestions.
> > >>>>>
> > >>>>> Matt
> > >>>>>
> > >>>>> On Thu, Sep 4, 2025 at 11:54 AM Tim Casey <tca...@gmail.com> wrote:
> > >>>>>
> > >>>>>> Usually the span and proximity problems are off-by-one issues.
> > >>>>>> Specifically, the order of the tokens will change the distance
> > >>>>>> calculation. I do not have an example off the top of my head. But,
> > >>>>>> when I was doing this, I usually started with a larger span and
> > >>>>>> brought it down through looking at results.
> > >>>>>>
> > >>>>>> This is the case for the old "phrase words"~5 syntax.
> > >>>>>>
> > >>>>>> As an aside, "Not working" is taken by me to mean you are not
> > >>>>>> getting results but the query passes parse. Not working could mean a
> > >>>>>> lot more in this context. So I am suggesting, instead of 2, try 10.
> > >>>>>>
> > >>>>>> On Thu, Sep 4, 2025 at 10:43 AM mtn search <search...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> Looking for guidance on approaches to implement a proximity search
> > >>>>>>> between phrases.
> > >>>>>>>
> > >>>>>>> Initially tried:
> > >>>>>>>
> > >>>>>>> "q":"{!lucene}spanNear(spanNear(spanNear(spanTerm(body:off),spanTerm(body:the),0,true),
> > >>>>>>> spanTerm(body: record),0,true),
> > >>>>>>> spanNear(spanTerm(body:new),spanTerm(body:
> > >>>>>>> information),0,true) , 2N,false)",
> > >>>>>>> "defType":"lucene",
> > >>>>>>> "df":"body",
> > >>>>>>>
> > >>>>>>> However then simplified to just two terms:
> > >>>>>>> "q":"{!lucene}spanNear(spanTerm(body:off),spanTerm(body:call),2,true)",
> > >>>>>>> "defType":"lucene",
> > >>>>>>> "df":"body",
> > >>>>>>>
> > >>>>>>> Both are not working. Any tips? Currently on Solr 9.4, but will
> > >>>>>>> likely need to run for some time on a Solr 6 instance.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Matt
> > >>>>>>>
> > >>
> > >> --
> > >> Sincerely yours
> > >> Mikhail Khludnev
> > >>
> > >
> > > --
> > > http://www.needhamsoftware.com (work)
> > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> > >
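For the original question (proximity between two phrases), the XML query parser route Mikhail links to might look roughly like the sketch below. The parsed-query debug output quoted in the thread suggests that the `{!lucene}` parser treats `spanNear(...)` text as ordinary terms, so the span structure has to be expressed some other way. Assumptions, all untested here: that the parser accepts nested `SpanNear` elements with `slop` and `inOrder` attributes and `SpanTerm` elements with a `fieldName` attribute, as in Lucene's XML query syntax; that `defType=xmlparser` selects it; and that the words match the indexed tokens exactly, since span terms are not analyzed. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def phrase_span(field, phrase):
    """Build an ordered, zero-slop SpanNear over the words of one phrase."""
    node = ET.Element("SpanNear", slop="0", inOrder="true")
    for word in phrase.split():
        term = ET.SubElement(node, "SpanTerm", fieldName=field)
        term.text = word
    return node


def phrases_near(field, phrase_a, phrase_b, slop, in_order=False):
    """Wrap two phrase spans in an outer SpanNear with the given slop."""
    outer = ET.Element("SpanNear", slop=str(slop),
                       inOrder=str(in_order).lower())
    outer.append(phrase_span(field, phrase_a))
    outer.append(phrase_span(field, phrase_b))
    return ET.tostring(outer, encoding="unicode")


xml_query = phrases_near("body", "off the record", "new information", 2)
# Request parameters one might send to /select (assumed, not verified):
params = {"defType": "xmlparser", "q": xml_query, "debug": "true"}
print(xml_query)
```

Checking the `parsedquery` entry in the debug output would show whether Solr actually built a span query from this, which is the same check that exposed the problem with the `{!lucene}` attempts.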