Re: Heap Size Space and Span Queries

Sjoerd Smeets Mon, 19 Dec 2022 06:53:11 -0800

Btw, is it worthwhile creating a ticket for SpanQueries going mental with
the heap in certain cases?


On Mon, Dec 19, 2022 at 8:51 AM Sjoerd Smeets <[email protected]> wrote:

> Thanks Uwe! There is no requirement yet to support a FIRST operator, but
> I get your point. I'll use this as feedback and see what we will do.
>
> On Mon, Dec 19, 2022 at 8:43 AM Uwe Schindler <[email protected]> wrote:
>
>> Hi,
>>
>> the approach I am using for patent search (the syntax you have posted
>> looks like patent search) is a general transformation to Intervals,
>> except when the operands left and right of the NEAR operator are just plain
>> terms without any special structure or subqueries. In that case I transform
>> it to a phrase with slop=5 (for NEAR5). All other cases get an IntervalQuery.
>>
>> The well-known FIRST/FIRST5 operator (first 5 words of the document) in
>> patent search is also doable with an IntervalQuery instead of the legacy
>> SpanFirst class, but there's no implementation of it for intervals yet, so
>> you have to implement it on your own (but that's easy). I can help with a
>> simple implementation for it.
>>
>> Uwe
>> Am 19.12.2022 um 15:35 schrieb Sjoerd Smeets:
>>
>> Thanks everybody. I do indeed have the memory dumps of these, and I'm happy
>> to share them with you. They are pretty big files (3 GB compressed, 32 GB
>> uncompressed).
>>
>> I built a query parser that basically supports a syntax for distance
>> searches between stemmed and unstemmed terms, both unordered and ordered. E.g.
>> (term1 NEAR5 term5) OR ("term7" NEAR20 "term15") OR ("term3" ONEAR20
>> term6), where NEAR stands for an unordered SpanNear and ONEAR for an
>> ordered SpanNear.
>>
>> I've implemented a generic approach where all of these get converted to a
>> SpanQuery; as you can see, it can be a mix of everything in one query. So your
>> suggestion is to replace these with PhraseQueries and IntervalQueries and
>> combine them?
>>
>> Thanks again for your help,
>> Sjoerd
>>
>> On Fri, Dec 16, 2022 at 2:51 PM Mikhail Khludnev <[email protected]> wrote:
>>
>>> Forwarding the note to users. Thanks Uwe for sharing your observations.
>>> Thanks to Mr Woodward who brought intervals to the party.
>>>
>>>
>>> On Fri, Dec 16, 2022 at 7:33 PM Uwe Schindler <[email protected]> wrote:
>>>
>>>> Spans seem to have the problem of creating huge "List<Something>"
>>>> during query iteration to track some stuff. I never understood the code,
>>>> but to me it was always crazy to have Lists populated during execution. We
>>>> replaced all SpanQueries by Intervals in patent search and speed is much
>>>> faster and heap usage is tiny.
>>>>
>>>> A span/phrase with inOrder=false can always be replaced by a phrase with
>>>> slop. The slop is always without order, as it is an "edit distance" only
>>>> (see documentation). If you need in-order matching, an interval is required.
>>>>
>>>> Phrases are only in order for "slop=0". Compare to "slop=1" which means
>>>> "next to each other" and is no longer in order.
>>>>
>>>> Uwe
>>>> Am 15.12.2022 um 16:44 schrieb Mikhail Khludnev:
>>>>
>>>> Michael, thanks for stepping in!
>>>>
>>>> > it seems that simple phrase queries would suffice here in place of
>>>> > spanNear?
>>>>
>>>> I think it wouldn't. It seems to me 4 is slop, and false is inOrder.
>>>> Sjoerd, can you comment on the particular span queries you use?
>>>> Also, do you have any heap dump summary to confirm high memory
>>>> consumption by spans?
>>>>
>>>> On Thu, Dec 15, 2022 at 5:33 PM Michael Gibney <[email protected]> wrote:
>>>>
>>>>> I don't think that nested boolean disjunctions consisting of isolated
>>>>> spanNear queries at the leaves should have memory issues (as opposed
>>>>> to nested spanNear queries around disjunctions, which might well do).
>>>>> Am I misreading the string representation of that query? A little bit
>>>>> more explicit information about how the query is built, so that we can
>>>>> be certain of what we're dealing with, would be helpful.
>>>>>
>>>>> It'd certainly be worth trying IntervalQuery -- but part of what
>>>>> makes me think I must be missing something in interpreting the string
>>>>> representation of the query provided: it seems that simple phrase
>>>>> queries would suffice here in place of spanNear?
>>>>>
>>>>> Regarding SpanQuery vs. IntervalQuery performance and
>>>>> characteristics, there's some possibly relevant discussion on
>>>>> LUCENE-9204:
>>>>>
>>>>>
>>>>> https://issues.apache.org/jira/browse/LUCENE-9204?focusedCommentId=17352589#comment-17352589
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> On Wed, Dec 14, 2022 at 1:27 PM Mikhail Khludnev <[email protected]> wrote:
>>>>> >
>>>>> > Developers,
>>>>> > Is this expected for Spans? Can IntervalQuery help here?
>>>>> >
>>>>> > On Wed, Dec 14, 2022 at 5:41 PM Sjoerd Smeets <[email protected]> wrote:
>>>>> >>
>>>>> >> Hi,
>>>>> >>
>>>>> >> I've implemented a Span Query parser and when running the below query,
>>>>> >> I'm seeing Heap Size Space messages on certain shards:
>>>>> >>
>>>>> >> o.a.s.s.HttpSolrCall null:java.lang.RuntimeException:
>>>>> >> java.lang.OutOfMemoryError: Java heap space
>>>>> >>
>>>>> >> The span query that I'm running is the following:
>>>>> >>
>>>>> >> ((spanNear([unstemmed_text:charge, unstemmed_text:account], 4, false)
>>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:account], 4, false))
>>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:deposit], 4, false))
>>>>> >> spanNear([unstemmed_text:charge, unstemmed_text:deposit], 4, false)
>>>>> >>
>>>>> >> The heap size at the moment is set to 48 GB. We are running 4 shards in 1
>>>>> >> JVM and the 4 shards combined have 24M docs evenly distributed across the
>>>>> >> shards. We do use the collapse feature as well.
>>>>> >>
>>>>> >> This is on Solr 8.6.0
>>>>> >>
>>>>> >> What are the considerations for running Span Queries and heap sizes?
>>>>> >>
>>>>> >> Any suggestions are welcome
>>>>> >>
>>>>> >> Sjoerd
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Sincerely yours
>>>>> > Mikhail Khludnev
>>>>>
>>>>
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>>
>>>> --
>>>> Uwe Schindler
>>>> Achterdiek 19, D-28357 Bremen
>>>> https://www.thetaphi.de
>>>> eMail: [email protected]
>>>>
>>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>>
>> --
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: [email protected]
>>
>>
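
The rewrite rule described in this thread (an unordered NEAR between two
plain terms becomes a phrase with slop; ordered NEAR, or NEAR over
subqueries, becomes an interval) can be sketched roughly as below. This is
only an illustration under stated assumptions, not anyone's actual parser:
the Node, Term, Or and Near types and the targetFor method are hypothetical
names, and no Lucene or Solr classes are used. In a real parser the two
targets would presumably map to a PhraseQuery with slop and an
IntervalQuery.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parse-tree model for the NEAR/ONEAR syntax from the thread.
// None of these types are Lucene or Solr classes.
final class NearRewriteSketch {

    interface Node {}

    static final class Term implements Node {
        final String text;
        Term(String text) { this.text = text; }
    }

    static final class Or implements Node {
        final List<Node> clauses;
        Or(Node... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    static final class Near implements Node {
        final Node left;
        final Node right;
        final int distance;     // the N in NEARN / ONEARN
        final boolean ordered;  // true for ONEAR, false for NEAR
        Near(Node left, Node right, int distance, boolean ordered) {
            this.left = left;
            this.right = right;
            this.distance = distance;
            this.ordered = ordered;
        }
    }

    enum Target { PHRASE_WITH_SLOP, INTERVAL }

    // Unordered NEAR over two plain terms -> phrase with slop = distance.
    // Ordered NEAR, or any operand that is a subquery -> interval.
    static Target targetFor(Near near) {
        boolean plainTerms = near.left instanceof Term && near.right instanceof Term;
        return (plainTerms && !near.ordered) ? Target.PHRASE_WITH_SLOP : Target.INTERVAL;
    }

    public static void main(String[] args) {
        Near unorderedTerms = new Near(new Term("charge"), new Term("account"), 4, false);
        Near orderedTerms = new Near(new Term("term3"), new Term("term6"), 20, true);
        Near nested = new Near(
                new Or(new Term("charge"), new Term("pledge")), new Term("account"), 4, false);

        System.out.println(targetFor(unorderedTerms)); // PHRASE_WITH_SLOP
        System.out.println(targetFor(orderedTerms));   // INTERVAL
        System.out.println(targetFor(nested));         // INTERVAL
    }
}
```

Keeping the decision in one small function makes it easy to move a case from
one target to the other later, for example if a FIRST operator is added.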
