Re: Heap Size Space and Span Queries

Sjoerd Smeets Mon, 19 Dec 2022 06:51:44 -0800

Thanks Uwe! There is no requirement yet to support a FIRST
operator, but I get your point. I'll use this as feedback and see what we
will do.


On Mon, Dec 19, 2022 at 8:43 AM Uwe Schindler <[email protected]> wrote:

> Hi,
>
> the approach I am using for patent search (the syntax you have posted looks
> like patent search) is to have a general transformation to Intervals, except
> when the operands left/right of the NEAR operator are just plain terms without
> any special structure or subqueries. In that case I transform it to a phrase
> with slop=5 (for NEAR5). All other cases get IntervalQuery.
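
A rough sketch of that dispatch rule, in case it helps others reading the thread. It only models the decision itself and contains no Lucene code: Operand, NearDispatch and Target are made-up names for illustration, and a real parser would build the actual phrase or interval query at the point where this returns an enum value.

```java
import java.util.List;

/** Decides which query type a NEARn node should become (names are hypothetical). */
final class NearDispatch {
    enum Target { SLOPPY_PHRASE, INTERVALS }

    /** Plain term on both sides of NEARn: sloppy phrase. Anything else: intervals. */
    static Target targetFor(Operand left, Operand right) {
        return (left.isPlainTerm() && right.isPlainTerm())
                ? Target.SLOPPY_PHRASE
                : Target.INTERVALS;
    }

    public static void main(String[] args) {
        Operand charge = new Operand("charge");
        Operand account = new Operand("account");
        // A nested subquery, e.g. (pledge NEAR5 deposit) used as an operand.
        Operand nested = new Operand(List.of(new Operand("pledge"), new Operand("deposit")));

        System.out.println(targetFor(charge, account)); // SLOPPY_PHRASE
        System.out.println(targetFor(charge, nested));  // INTERVALS
    }
}

/** Hypothetical parse-tree node: either a plain term or a list of sub-operands. */
final class Operand {
    final String term;            // non-null only for a plain term
    final List<Operand> children; // non-empty only for a nested subquery

    Operand(String term) {
        this.term = term;
        this.children = List.of();
    }

    Operand(List<Operand> children) {
        this.term = null;
        this.children = children;
    }

    boolean isPlainTerm() {
        return term != null;
    }
}
```

The sketch covers the unordered NEAR case only; an ordered operator would go to intervals regardless of its operands.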
>
> The well-known FIRST/FIRST5 operator (first 5 words of document) in patent
> search is also doable with an IntervalQuery instead of the legacy SpanFirst
> class, but there's an implementation missing for intervals, so you have to
> implement it on your own (but that's easy). I can help with a simple
> implementation for it.
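
To make the FIRST semantics concrete, here is a minimal sketch of the filtering idea on plain position pairs. It is not an interval source implementation and does not use any Lucene class: FirstFilter is a made-up name, and the convention that a match is a 0-based, inclusive [start, end] pair that must lie entirely inside the first n positions is an assumption on my side.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of FIRSTn on already-computed match positions. */
final class FirstFilter {

    /**
     * Keeps only matches that lie entirely within the first n token positions.
     * Each match is an int[]{start, end}, 0-based and inclusive (assumed convention).
     */
    static List<int[]> first(List<int[]> matches, int n) {
        List<int[]> kept = new ArrayList<>();
        for (int[] m : matches) {
            if (m[1] < n) { // the end position must still be among the first n words
                kept.add(m);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> matches = List.of(
                new int[] {0, 1},    // inside the first 5 words
                new int[] {3, 5},    // ends at the 6th word, so it is dropped
                new int[] {10, 11}); // far beyond the first 5 words
        System.out.println(first(matches, 5).size()); // 1
    }
}
```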
>
> Uwe
> Am 19.12.2022 um 15:35 schrieb Sjoerd Smeets:
>
> Thanks everybody. I indeed have the memory dumps of these. I'm happy to
> share that with you. These are pretty big files (3g compressed - 32g
> uncompressed).
>
> I built a query parser that basically supports a syntax for distance
> searches between stemmed and unstemmed, and unordered and ordered. E.g.
> (term1 NEAR5 term5) OR ("term7" NEAR20 "term15") OR ("term3" ONEAR20
> term6). Where NEAR stands for an unordered SpanNear and ONEAR for an
> ordered SpanNear.
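
For readers unfamiliar with the two operators, the difference can be sketched on raw token positions. This is a simplified model, not the parser and not Lucene code: NearSemantics is a made-up name, and "within n" is assumed here to mean that the two positions differ by at most n, which is not necessarily how Lucene counts slop.

```java
/** Hypothetical model of NEARn / ONEARn over token positions of two terms. */
final class NearSemantics {

    /** NEARn: some occurrence of a and some occurrence of b are at most n apart, in either order. */
    static boolean near(int[] a, int[] b, int n) {
        for (int p : a) {
            for (int q : b) {
                if (p != q && Math.abs(p - q) <= n) {
                    return true;
                }
            }
        }
        return false;
    }

    /** ONEARn: as NEARn, but the occurrence of a must come before the occurrence of b. */
    static boolean onear(int[] a, int[] b, int n) {
        for (int p : a) {
            for (int q : b) {
                if (q > p && q - p <= n) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] pledge = {7};  // "pledge" occurs at position 7
        int[] deposit = {3}; // "deposit" occurs at position 3

        System.out.println(near(pledge, deposit, 5));  // true: 4 positions apart
        System.out.println(onear(pledge, deposit, 5)); // false: pledge comes after deposit
        System.out.println(onear(deposit, pledge, 5)); // true: deposit comes first
    }
}
```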
>
> I've implemented a generic approach where all of these get converted to a
> SpanQuery; as you can see, it can be a mix of everything in one query. So
> your suggestion is to replace these with PhraseQueries and IntervalQueries
> and combine these?
>
> Thanks again for your help,
> Sjoerd
>
> On Fri, Dec 16, 2022 at 2:51 PM Mikhail Khludnev <[email protected]> wrote:
>
>> Forwarding the note to users. Thanks Uwe for sharing your observations.
>> Thanks to Mr Woodward who brought intervals to the party.
>>
>>
>> On Fri, Dec 16, 2022 at 7:33 PM Uwe Schindler <[email protected]> wrote:
>>
>>> Spans seem to have the problem of creating huge "List<Something>" during
>>> query iteration to track some stuff. I never understood the code, but to me
>>> it was always crazy to have Lists populated during execution. We replaced
>>> all SpanQueries by Intervals in patent search and speed is much faster and
>>> heap usage is tiny.
>>>
>>> A span/phrase with inOrder=false can always be replaced by a phrase with
>>> slop. The slop is always without order, as it is an "edit distance" only
>>> (see documentation). If you need in order, an interval is required.
>>>
>>> Phrases are only in order for "slop=0". Compare to "slop=1" which means
>>> "next to each other" and is no longer in order.
>>>
>>> Uwe
>>> Am 15.12.2022 um 16:44 schrieb Mikhail Khludnev:
>>>
>>> Michael, thanks for stepping in!
>>>
>>> > it seems that simple phrase queries would suffice here in place of
>>> > spanNear?
>>>
>>> I think it wouldn't. It seems to me 4 is slop, and false is inOrder.
>>> Sjoerd, can you comment on the particular span queries you use?
>>> Also, do you have any heap dump summary to confirm high memory
>>> consumption by spans?
>>>
>>> On Thu, Dec 15, 2022 at 5:33 PM Michael Gibney <
>>> [email protected]> wrote:
>>>
>>>> I don't think that nested boolean disjunctions consisting of isolated
>>>> spanNear queries at the leaves should have memory issues (as opposed
>>>> to nested spanNear queries around disjunctions, which might well do).
>>>> Am I misreading the string representation of that query? A little bit
>>>> more explicit information about how the query is built, so that we can
>>>> be certain of what we're dealing with, would be helpful.
>>>>
>>>> It'd certainly be worth trying IntervalsQuery -- but part of what
>>>> makes me think I must be missing something in interpreting the string
>>>> representation of the query provided: it seems that simple phrase
>>>> queries would suffice here in place of spanNear?
>>>>
>>>> Regarding SpanQuery vs. IntervalsQuery performance and
>>>> characteristics, there's some possibly-relevant discussion on
>>>> LUCENE-9204:
>>>>
>>>>
>>>> https://issues.apache.org/jira/browse/LUCENE-9204?focusedCommentId=17352589#comment-17352589
>>>>
>>>> Michael
>>>>
>>>>
>>>> On Wed, Dec 14, 2022 at 1:27 PM Mikhail Khludnev <[email protected]> wrote:
>>>> >
>>>> > Developers,
>>>> > Is it expected for Spans? Can IntervalsQuery help here?
>>>> >
>>>> > On Wed, Dec 14, 2022 at 5:41 PM Sjoerd Smeets <[email protected]> wrote:
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> I've implemented a Span Query parser and when running the below
>>>> >> query, I'm seeing Heap Size Space messages on certain shards:
>>>> >>
>>>> >> o.a.s.s.HttpSolrCall null:java.lang.RuntimeException:
>>>> >> java.lang.OutOfMemoryError: Java heap space
>>>> >>
>>>> >> The span query that I'm running is the following:
>>>> >>
>>>> >> ((spanNear([unstemmed_text:charge, unstemmed_text:account], 4, false)
>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:account], 4, false))
>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:deposit], 4, false))
>>>> >> spanNear([unstemmed_text:charge, unstemmed_text:deposit], 4, false)
>>>> >>
>>>> >> The heap size at the moment is set to 48 GB. We are running 4 shards
>>>> >> in 1 JVM and the 4 shards combined have 24M docs evenly distributed
>>>> >> across the shards. We do use the collapse feature as well.
>>>> >>
>>>> >> This is on Solr 8.6.0
>>>> >>
>>>> >> What are the considerations for running Span Queries and heap sizes?
>>>> >>
>>>> >> Any suggestions are welcome
>>>> >>
>>>> >> Sjoerd
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Sincerely yours
>>>> > Mikhail Khludnev
>>>>
>>>>
>>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>>
>>> --
>>> Uwe Schindler
>>> Achterdiek 19, D-28357 Bremen
>>> https://www.thetaphi.de
>>> eMail: [email protected]
>>>
>>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: [email protected]
>
>
