Three things off the top of my head, in order of how long it’d take to implement:
*** If it's _always_ some distance from the start or end, index special
beginning and end tags, perhaps nonsense strings like
BEGINslkdjfhsldkfhsdkfh and ENDslakshalskdfhj. Now your searches become
phrase queries with slop. Searching for "erick in the first 100 words"
becomes:

"BEGINslkdjfhsldkfhsdkfh erick"~100

*** Index each term with a payload indicating its position and use a
payload function to determine whether the term should count as a hit.
You'd probably also need a field telling you how long the field is, so
you know what offset "50 words from the end" is.

*** Get into the low-level Lucene code. After all, if you index the
position information to support phrase queries, you have exactly the
position of each word. NOTE: you'd also probably have to index a
separate field with the total length of the field in it so you know
what position "100 words from the end" is. I suspect this could be the
most efficient option, but I wouldn't go there unless your performance
is poor, as it'd take some development work.

Note: I haven't thought these out very carefully, so caveat emptor.

Here's a place to get started with payloads if you decide to go that
route: https://lucidworks.com/post/solr-payloads/

Best,
Erick

> On Oct 16, 2019, at 5:47 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
> So are these really text locations, or rather actually sections of the
> document? If the latter, can you parse out sections during indexing?
>
> Regards,
>    Alex
>
> On Wed, Oct 16, 2019, 3:57 AM Kaminski, Adi, <adi.kamin...@verint.com>
> wrote:
>
>> Hi,
>> Thanks for the responses.
>>
>> It's a soft boundary which results from dynamic syntax in our
>> application, so it may vary between user searches: one user can
>> search for "word1" in the first 30 words, and another can search for
>> "word2" in the first 10 words. The use case is to match some
>> terms/phrases in specific document places in order to identify
>> scripts/specific word occurrences.
>>
>> So I guess copy field won't work here.
>>
>> Any other suggestions/thoughts?
>> Maybe some hidden position filters at the native level to limit from
>> the start/end of the document?
>>
>> Thanks,
>> Adi
>>
>> -----Original Message-----
>> From: Tim Casey <tca...@gmail.com>
>> Sent: Tuesday, October 15, 2019 11:05 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Position search
>>
>> If this is about a normalized query, I would put the normalized text
>> into a specific field. The reason for this is you may want to search
>> the overall text during any expansion phase of searching for data.
>> That is, maybe you want to know the context up to the 120th word. At
>> least you have both.
>> Also, you may want to note which normalized fields were truncated or
>> were simply too small. This would give some guidance as to the bias
>> of the normalization. If 95% of the fields were not truncated, there
>> is a chance you are not normalizing well because you have a set of
>> particularly short messages. So I would expect a small set of side
>> fields remarking on this, which would allow you to carry the measures
>> along with the data.
>>
>> tim
>>
>> On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch
>> <arafa...@gmail.com> wrote:
>>
>>> Is the 100 words a hard boundary or a soft one?
>>>
>>> If it is a hard one (always 100 words), the easiest is probably a
>>> copy field where, in the (unstored) copy, you trim off whatever you
>>> don't want to search, possibly using regular expressions. Of course,
>>> "what's a word" is an important question here.
>>>
>>> Similarly, you could do that with Update Request Processors and
>>> clone/process the field even before it hits the schema. Then you
>>> could store the extract for highlighting purposes.
>>>
>>> Regards,
>>>    Alex.
>>>
>>> On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi <adi.kamin...@verint.com>
>>> wrote:
>>>>
>>>> Hi,
>>>> What's the recommended way to search in Solr (assuming 8.2 is used)
>>>> for specific terms/phrases/expressions while limiting the search
>>>> from a position perspective?
>>>> For example, to search only in the first/last 100 words of the
>>>> document?
>>>>
>>>> Is there any built-in functionality for that?
>>>>
>>>> Thanks in advance,
>>>> Adi
>>>>
>>>>
>>>> This electronic message may contain proprietary and confidential
>>>> information of Verint Systems Inc., its affiliates and/or
>>>> subsidiaries. The information is intended to be for the use of the
>>>> individual(s) or entity(ies) named above. If you are not the
>>>> intended recipient (or authorized to receive this e-mail for the
>>>> intended recipient), you may not use, copy, disclose or distribute
>>>> to anyone this message or any information contained in this
>>>> message. If you have received this electronic message in error,
>>>> please notify us by replying to this e-mail.
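To make the sentinel-token suggestion concrete, here is a rough pure-Python stand-in (not Solr code; the function names are made up for illustration) showing how prepending/appending sentinel tokens at index time turns "term within the first/last N words" into proximity against a fixed anchor token, which is exactly what the phrase-with-slop query expresses:

```python
# Solr-free sketch of the sentinel-token idea: wrap each document in
# nonsense BEGIN/END tokens at index time, then "term in the first N
# words" becomes the phrase query "BEGIN term"~slop. The sentinel
# strings are just Erick's example nonsense; any unique tokens work.

BEGIN = "BEGINslkdjfhsldkfhsdkfh"
END = "ENDslakshalskdfhj"

def index_tokens(text: str) -> list[str]:
    """Naive whitespace tokenization plus the two sentinel tokens."""
    return [BEGIN] + text.split() + [END]

def in_first_n_words(text: str, term: str, n: int) -> bool:
    """Stand-in for the phrase query "BEGIN term"~slop: the term
    matches only if it sits within n positions of the BEGIN sentinel.
    (Exact slop semantics are off by one from n; slop = n - 1.)"""
    tokens = index_tokens(text)
    return any(tok == term and pos <= n  # BEGIN sits at position 0
               for pos, tok in enumerate(tokens))

def in_last_n_words(text: str, term: str, n: int) -> bool:
    """Stand-in for "term END"~slop, measured back from the END
    sentinel at the last position."""
    tokens = index_tokens(text)
    end_pos = len(tokens) - 1
    return any(tok == term and end_pos - pos <= n
               for pos, tok in enumerate(tokens))
```

Note the off-by-one detail: a phrase query with slop s matches the term up to s + 1 positions from the sentinel, so a strict "first 100 words" needs slop 99, not 100.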
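The payload suggestion can be sketched the same way. This is plain Python, not Solr's payload API (see the lucidworks link above for the real thing); it just shows the shape of the idea, including the separate total-length field needed to resolve "N words from the end":

```python
# Sketch of the payload idea: index each term with its position as the
# payload, plus a separate field holding the total word count, which
# is required to evaluate "within the last n words" at query time.

def index_with_payloads(text: str):
    """Return (term, position-payload) postings and the total length,
    which would be stored in its own field in a real schema."""
    words = text.split()
    return [(w, i) for i, w in enumerate(words)], len(words)

def payload_hit_last_n(postings, total_len: int, term: str, n: int) -> bool:
    """Payload-function stand-in: the term counts as a hit only when
    its position payload falls within the last n words."""
    return any(t == term and total_len - pos <= n
               for t, pos in postings)
```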
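For completeness, here is a sketch of the copyField/Update-Request-Processor route Alexandre described, combined with the "truncated" side fields Tim suggested. It only applies if the boundary were a fixed N (which Adi says it isn't), and the field names are invented for illustration:

```python
# Index-time sketch of the hard-boundary approach: clone the body into
# trimmed copies so an ordinary term query against the trimmed field
# is automatically position-limited, and record side fields noting
# whether trimming happened (Tim's normalization-bias indicator).
# All field names here are hypothetical.

def build_position_fields(body: str, n: int = 100) -> dict:
    words = body.split()  # "what's a word" matters; this is naive
    return {
        "body_first_n": " ".join(words[:n]),   # query for start-of-doc
        "body_last_n": " ".join(words[-n:]),   # query for end-of-doc
        "body_truncated": len(words) > n,      # was anything cut off?
        "body_word_count": len(words),
    }
```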