Three things off the top of my head, in order of how long it’d take to 
implement:

***
If it’s _always_ some distance from the start or end, index special beginning 
and end tags, perhaps nonsense strings like BEGINslkdjfhsldkfhsdkfh and 
ENDslakshalskdfhj. Now your searches become phrase queries with slop. Searching 
for "erick in the first 100 words" becomes:

"BEGINslkdjfhsldkfhsdkfh erick"~100
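A rough sketch of the idea in code (the field name "body" and the helper names are my own invention; in practice the sentinel injection would live in your ingest code or an update request processor):

```python
# Sketch: inject sentinel tokens at index time, then build slop queries
# against them. Sentinel strings and field name are arbitrary examples.
BEGIN = "BEGINslkdjfhsldkfhsdkfh"
END = "ENDslakshalskdfhj"

def prepare_for_index(text):
    """Wrap the document text in sentinel tokens before indexing."""
    return f"{BEGIN} {text} {END}"

def first_n_words_query(term, n, field="body"):
    """Phrase query with slop: term within the first n words."""
    return f'{field}:"{BEGIN} {term}"~{n}'

def last_n_words_query(term, n, field="body"):
    """Phrase query with slop: term within the last n words."""
    return f'{field}:"{term} {END}"~{n}'
```

So `first_n_words_query("erick", 100)` yields `body:"BEGINslkdjfhsldkfhsdkfh erick"~100`, matching the example above.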

***
Index each term with a payload indicating its position and use a payload 
function to determine whether the term should count as a hit. You’d probably 
also need a field telling you how long the field is, so you know what offset 
“50 words from the end” is.
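For instance, Lucene/Solr’s DelimitedPayloadTokenFilterFactory consumes tokens of the form term|payload, so you could pre-tokenize the text with each term’s position as its payload. A sketch of preparing that input (the field names here are invented, and this only shows the client-side transformation, not the schema or query side):

```python
def with_position_payloads(text):
    """Rewrite 'quick brown fox' as 'quick|0 brown|1 fox|2' so a field
    analyzed with DelimitedPayloadTokenFilterFactory stores each term's
    position as its payload. Also returns the word count, needed to
    resolve offsets like '50 words from the end'."""
    words = text.split()
    payloaded = " ".join(f"{w}|{i}" for i, w in enumerate(words))
    # Hypothetical field names: a payload field plus an int length field.
    return {"body_payloads": payloaded, "body_length_i": len(words)}
```

Note this naive split() sidesteps real tokenization; whatever analyzer defines “a word” for you would need to agree with how these positions are computed.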

***
Get into the low-level Lucene code. After all, if you index position 
information to support phrase queries, you have exactly the position of the 
word. NOTE: you’d also probably have to index a separate field with the total 
length of the field in it, so you know what position “100 words from the end” 
is. I suspect this could be the most efficient option, but I wouldn’t go here 
unless your performance is poor, as it’d take some development work.
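Whichever route you take, the offset arithmetic is the same. A sketch of the position test you’d end up implementing (pure illustration, not actual Lucene API calls):

```python
def position_matches(pos, total_len, n, from_end=False):
    """True if a 0-based term position falls within the first n words
    of the document, or within the last n words when from_end=True.
    total_len is the document's word count, which is why a separate
    length field is needed for the from_end case."""
    if from_end:
        return pos >= total_len - n
    return pos < n
```

In the real thing, pos would come from the postings (e.g. via PostingsEnum with positions enabled) and total_len from the stored length field.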

Note: I haven’t thought these out very carefully so caveat emptor.

Here’s a place to get started with payloads if you decide to go that route:

https://lucidworks.com/post/solr-payloads/

Best,
Erick


> On Oct 16, 2019, at 5:47 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> 
> So are these really text locations, or actually sections of the
> document? If the latter, can you parse out sections during indexing?
> 
> Regards,
>     Alex
> 
> On Wed, Oct 16, 2019, 3:57 AM Kaminski, Adi, <adi.kamin...@verint.com>
> wrote:
> 
>> Hi,
>> Thanks for the responses.
>> 
>> It's a soft boundary which results from dynamic syntax in our
>> application, so it may vary between user searches: one user can search for
>> some "word1" in the first 30 words, and another can search for "word2" in
>> the first 10 words. The use case is to match terms/phrases in specific
>> document places in order to identify scripts/specific word occurrences.
>> 
>> So I guess copy field won't work here.
>> 
>> Any other suggestions/thoughts ?
>> Maybe some hidden position filters in native level to limit from start/end
>> of the document ?
>> 
>> Thanks,
>> Adi
>> 
>> -----Original Message-----
>> From: Tim Casey <tca...@gmail.com>
>> Sent: Tuesday, October 15, 2019 11:05 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Position search
>> 
>> If this is about a normalized query, I would put the normalization text
>> into a specific field.  The reason for this is you may want to search the
>> overall text during any form of expansion phase of searching for data.
>> That is, maybe you want to know the context of up to the 120th word.  At
>> least you have both.
>> Also, you may want to note which normalized fields were truncated or were
>> simply too small. This would give some guidance as to the bias of the
>> normalization.  If 95% of the fields were not truncated, there is a chance
>> you are not normalizing well because you have a set of
>> particularly short messages.  So I would expect a small set of side fields
>> remarking this.  This would allow you to carry the measures along with the
>> data.
>> 
>> tim
>> 
>> On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch <arafa...@gmail.com
>>> 
>> wrote:
>> 
>>> Is the 100 words a hard boundary or a soft one?
>>> 
>>> If it is a hard one (always 100 words), the easiest is probably copy
>>> field and in the (unstored) copy, trim off whatever you don't want to
>>> search. Possibly using regular expressions. Of course, "what's a word"
>>> is an important question here.
>>> 
>>> Similarly, you could do that with Update Request Processors and
>>> clone/process field even before it hits the schema. Then you could
>>> store the extract for highlighting purposes.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi <adi.kamin...@verint.com>
>>> wrote:
>>>> 
>>>> Hi,
>>>> What's the recommended way to search in Solr (assuming 8.2 is used)
>>>> for
>>> specific terms/phrases/expressions while limiting the search from
>>> position perspective.
>>>> For example to search only in the first/last 100 words of the document
>> ?
>>>> 
>>>> Is there any built-in functionality for that ?
>>>> 
>>>> Thanks in advance,
>>>> Adi
>>>> 
>>>> 
>>>> This electronic message may contain proprietary and confidential
>>> information of Verint Systems Inc., its affiliates and/or
>>> subsidiaries. The information is intended to be for the use of the
>>> individual(s) or
>>> entity(ies) named above. If you are not the intended recipient (or
>>> authorized to receive this e-mail for the intended recipient), you may
>>> not use, copy, disclose or distribute to anyone this message or any
>>> information contained in this message. If you have received this
>>> electronic message in error, please notify us by replying to this e-mail.
>>> 
>> 
>> 
>> 
