Re: Heap Size Space and Span Queries

Uwe Schindler Mon, 19 Dec 2022 06:44:01 -0800

Hi,

The approach I use for patent search (the syntax you posted looks like patent search) is a general transformation to Intervals, except when the operands to the left and right of the NEAR operator are just plain terms without any special structure or subqueries. In that case I transform it to a phrase with slop=5 (for NEAR5). All other cases get an IntervalQuery.
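
To make the rule above concrete, here is a minimal sketch of the dispatch decision only. It does not use Lucene: `Node`, `NearDispatch` and `targetFor` are hypothetical names invented for this illustration, and the returned strings merely name the query family a real parser would build.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NearDispatch {
    /** Hypothetical parse-tree node: a plain term, or a composite such as an OR group. */
    static final class Node {
        final String term;          // set only for plain terms
        final List<Node> children;  // non-empty only for composites

        private Node(String term, List<Node> children) {
            this.term = term;
            this.children = children;
        }
        static Node term(String text) { return new Node(text, Collections.<Node>emptyList()); }
        static Node composite(Node... children) { return new Node(null, Arrays.asList(children)); }
        boolean isPlainTerm() { return term != null; }
    }

    /** Names the query family a "left NEAR<distance> right" node would be translated to. */
    static String targetFor(Node left, Node right, int distance) {
        // Both operands are bare terms: a sloppy phrase is enough.
        if (left.isPlainTerm() && right.isPlainTerm()) {
            return "PhraseQuery(slop=" + distance + ")";
        }
        // Anything with structure or subqueries goes to intervals.
        return "IntervalQuery";
    }

    public static void main(String[] args) {
        Node charge = Node.term("charge");
        Node account = Node.term("account");
        Node either = Node.composite(Node.term("pledge"), Node.term("deposit"));
        System.out.println(targetFor(charge, account, 5)); // PhraseQuery(slop=5)
        System.out.println(targetFor(charge, either, 5));  // IntervalQuery
    }
}
```

The sketch covers the unordered NEAR case only; an operator that must preserve term order would presumably always take the interval branch.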

The well-known FIRST/FIRST5 operator (first 5 words of the document) in patent search is also doable with an IntervalQuery instead of the legacy SpanFirst class, but there is no implementation for intervals yet, so you have to implement it on your own (but that's easy). I can help with a simple implementation for it.
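
As a rough illustration of what a FIRST5-style restriction does, here is a stand-alone model of the filtering step. It is not the Lucene intervals API: `Interval` and `firstN` are hypothetical names, positions are assumed to be 0-based token positions with inclusive ends, and a real implementation would have to plug into Lucene's interval iteration instead of filtering a list.

```java
import java.util.ArrayList;
import java.util.List;

class FirstNFilter {
    /** Hypothetical match interval over 0-based token positions, both ends inclusive. */
    static final class Interval {
        final int start;
        final int end;
        Interval(int start, int end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + "," + end + "]"; }
    }

    /** Keeps only the matches that lie entirely within the first n token positions. */
    static List<Interval> firstN(List<Interval> matches, int n) {
        List<Interval> kept = new ArrayList<>();
        for (Interval m : matches) {
            if (m.end < n) {   // last covered position must be one of 0 .. n-1
                kept.add(m);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Interval> matches = new ArrayList<>();
        matches.add(new Interval(1, 3));    // inside the first 5 positions
        matches.add(new Interval(4, 6));    // starts inside, ends outside
        matches.add(new Interval(40, 41));  // far outside
        System.out.println(firstN(matches, 5)); // [[1,3]]
    }
}
```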


Uwe

Am 19.12.2022 um 15:35 schrieb Sjoerd Smeets:

Thanks everybody. I do indeed have the memory dumps of these, and I'm happy to share them with you. They are pretty big files (3g compressed, 32g uncompressed).

I built a query parser that basically supports a syntax for distance searches between stemmed and unstemmed terms, both unordered and ordered. E.g. (term1 NEAR5 term5) OR ("term7" NEAR20 "term15") OR ("term3" ONEAR20 term6), where NEAR stands for an unordered SpanNear and ONEAR for an ordered SpanNear.
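
For readers unfamiliar with this operator syntax, here is a minimal sketch of how a single operator token could be decoded. It assumes the "O" prefix marks the ordered variant and that the trailing number is the maximum distance; `ProximityOperator` and `parse` are hypothetical names for this illustration, not part of the parser described in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProximityOperator {
    // Optional "O" prefix (ordered), the keyword NEAR, then a distance of 1 to 4 digits.
    private static final Pattern OP = Pattern.compile("(O?)NEAR(\\d{1,4})");

    final boolean ordered;
    final int distance;

    ProximityOperator(boolean ordered, int distance) {
        this.ordered = ordered;
        this.distance = distance;
    }

    /** Parses a token such as "NEAR5" or "ONEAR20"; returns null for anything else. */
    static ProximityOperator parse(String token) {
        Matcher m = OP.matcher(token);
        if (!m.matches()) {
            return null;
        }
        boolean ordered = !m.group(1).isEmpty();
        int distance = Integer.parseInt(m.group(2));
        return new ProximityOperator(ordered, distance);
    }

    public static void main(String[] args) {
        ProximityOperator op = parse("ONEAR20");
        System.out.println(op.ordered + " " + op.distance); // true 20
    }
}
```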

I've implemented a generic approach where all of these get converted to a SpanQuery; as you can see, it can be a mix of everything in one query. So your suggestion is to replace these with PhraseQueries and IntervalQueries and combine them?


Thanks again for your help,
Sjoerd

On Fri, Dec 16, 2022 at 2:51 PM Mikhail Khludnev <[email protected]> wrote:

    Forwarding the note to users. Thanks Uwe for sharing your
    observations. Thanks to Mr Woodward who brought intervals to the
    party.


    On Fri, Dec 16, 2022 at 7:33 PM Uwe Schindler <[email protected]> wrote:

        Spans seem to have the problem of creating huge
        "List<Something>" during query iteration to track some stuff.
        I never understood the code, but to me it was always crazy to
        have Lists populated during execution. We replaced all
        SpanQueries by Intervals in patent search and speed is much
        faster and heap usage is tiny.

        A span/phrase with inOrder=false can always be replaced by a
        phrase with slop. The slop is always without order, as it is
        an "edit distance" only (see documentation). If you need in
        order, an interval is required.

        Phrases are only in order for "slop=0". Compare to "slop=1"
        which means "next to each other" and is no longer in order.

        Uwe

        Am 15.12.2022 um 16:44 schrieb Mikhail Khludnev:

        Michael, thanks for stepping in!

        > it seems that simple phrase queries would suffice here in
        > place of spanNear?

        I think it wouldn't. It seems to me 4 is slop, and false is
        inOrder.
        Sjoerd, can you comment on the particular span queries you use?
        Also, do you have any heap dump summary to confirm high
        memory consumption by spans?

        On Thu, Dec 15, 2022 at 5:33 PM Michael Gibney
        <[email protected]> wrote:

            I don't think that nested boolean disjunctions consisting of
            isolated spanNear queries at the leaves should have memory
            issues (as opposed to nested spanNear queries around
            disjunctions, which might well do). Am I misreading the string
            representation of that query? A little bit more explicit
            information about how the query is built, so that we can be
            certain of what we're dealing with, would be helpful.

            It'd certainly be worth trying IntervalsQuery -- but part of
            what makes me think I must be missing something in interpreting
            the string representation of the query provided: it seems that
            simple phrase queries would suffice here in place of spanNear?

            Regarding SpanQuery vs. IntervalsQuery performance and
            characteristics, there's some possibly-relevant discussion on
            LUCENE-9204:

            https://issues.apache.org/jira/browse/LUCENE-9204?focusedCommentId=17352589#comment-17352589

            Michael


            On Wed, Dec 14, 2022 at 1:27 PM Mikhail Khludnev
            <[email protected]> wrote:
            >
            > Developers,
            > Is it expected for Spans? Can IntervalsQuery help here?
            >
            > On Wed, Dec 14, 2022 at 5:41 PM Sjoerd Smeets
            > <[email protected]> wrote:
            >>
            >> Hi,
            >>
            >> I've implemented a Span Query parser and when running the below query, I'm
            >> seeing Heap Size Space messages on certain shards:
            >>
            >> o.a.s.s.HttpSolrCall null:java.lang.RuntimeException:
            >> java.lang.OutOfMemoryError: Java heap space
            >>
            >> The span query that I'm running is the following:
            >>
            >> ((spanNear([unstemmed_text:charge, unstemmed_text:account], 4, false)
            >> spanNear([unstemmed_text:pledge, unstemmed_text:account], 4, false))
            >> spanNear([unstemmed_text:pledge, unstemmed_text:deposit], 4, false))
            >> spanNear([unstemmed_text:charge, unstemmed_text:deposit], 4, false)
            >>
            >> The heap size at the moment is set to 48Gb. We are running 4 shards in 1
            >> JVM and the 4 shards combined have 24M docs evenly distributed across the
            >> shards. We do use the collapse feature as well.
            >>
            >> This is on Solr 8.6.0
            >>
            >> What are the considerations for running Span Queries and heap sizes?
            >>
            >> Any suggestions are welcome
            >>
            >> Sjoerd
            >
            >
            >
            > --
            > Sincerely yours
            > Mikhail Khludnev

            

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:[email protected]
