I'd certainly be happy to take a closer look at this, esp. if you're
able to specify more explicitly how the query is built. So far you've
shared string representations that could be pseudocode, or output of
query toString(), etc. I confess I'm a bit surprised by what you're
seeing, because as far as I understand, whatever legitimate criticism
of SpanQueries there might be, I would not expect the behavior you've
described.

FWIW, I don't see or recall anywhere that "List<Something>" instances are
created/modified in a hot path in the current main branch (it's been a
really long time since I looked at the code, but at one point I
understood it!).

> replaced all SpanQueries by Intervals in patent search and speed is much
> faster and heap usage is tiny

I'm curious whether this was with or without rewriting ("pulling up")
internal disjunctions
(https://issues.apache.org/jira/browse/LUCENE-8477)?
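
To make concrete what I mean by "pulling up", here is a minimal sketch in plain Java. It is a toy model only: it uses no Lucene classes, the class and method names (PullUpDisjunctions, pullUp) are made up for illustration, and it is not how Lucene implements the rewrite. It distributes a NEAR operator over operands that are each a disjunction of plain terms, and renders every combination in the same string form as the query you posted. Whether your parser produced the four clauses this way is an assumption on my part.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT Lucene code, and no Lucene classes are used.
final class PullUpDisjunctions {

    /**
     * Expands a NEAR over operands that are each a disjunction of plain
     * terms into a flat list of NEAR clauses, one per combination of
     * alternatives (a cartesian product over the operands).
     */
    static List<String> pullUp(List<List<String>> operands, int slop, boolean inOrder) {
        // Start with a single empty combination and extend it operand by operand.
        List<List<String>> combos = new ArrayList<>();
        combos.add(new ArrayList<>());
        for (List<String> alternatives : operands) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : combos) {
                for (String term : alternatives) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(term);
                    next.add(extended);
                }
            }
            combos = next;
        }
        // Render each combination in the toString()-like form used in this thread.
        List<String> clauses = new ArrayList<>();
        for (List<String> combo : combos) {
            clauses.add("spanNear([" + String.join(", ", combo) + "], " + slop + ", " + inOrder + ")");
        }
        return clauses;
    }

    public static void main(String[] args) {
        // (charge OR pledge) NEAR4 (account OR deposit), unordered
        List<List<String>> operands = Arrays.asList(
            Arrays.asList("unstemmed_text:charge", "unstemmed_text:pledge"),
            Arrays.asList("unstemmed_text:account", "unstemmed_text:deposit"));
        for (String clause : pullUp(operands, 4, false)) {
            System.out.println(clause);
        }
    }
}
```

With the two operands above, this yields four spanNear clauses that match the four in the reported query (in a different order). The number of clauses is the product of the alternatives per operand, so it grows quickly once the disjunctions get larger, which is why it matters whether the rewrite was in play for the Intervals comparison.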
On Mon, Dec 19, 2022 at 9:53 AM Sjoerd Smeets <[email protected]> wrote:
>
> Btw, is it worthwhile creating a ticket for SpanQueries going mental with
> the heap in certain cases?
>
> On Mon, Dec 19, 2022 at 8:51 AM Sjoerd Smeets <[email protected]> wrote:
>
> > Thanks Uwe! There is no requirement yet to support a FIRST
> > operator, but I get your point. I'll use this as feedback and see what we
> > will do.
> >
> > On Mon, Dec 19, 2022 at 8:43 AM Uwe Schindler <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> the approach I am using for patent search (the syntax you have posted
> >> looks like patent search) is to have a general transformation to Intervals,
> >> except when operands left/right of NEAR operator are just plain terms
> >> without any special structure or subqueries. In that case I transform it to
> >> a phrase with slop=5 (for NEAR5). All other cases get IntervalQuery.
> >>
> >> The well-known FIRST/FIRST5 operator (first 5 words of document) in patent
> >> search is also doable with an IntervalQuery instead of the legacy SpanFirst
> >> class, but there's an implementation missing for intervals, so you have to
> >> implement it on your own (but that's easy). I can help with a simple
> >> implementation for it.
> >>
> >> Uwe
> >> Am 19.12.2022 um 15:35 schrieb Sjoerd Smeets:
> >>
> >> Thanks everybody. I indeed have the memory dumps of these. I'm happy to
> >> share that with you. These are pretty big files (3g compressed - 32g
> >> uncompressed).
> >>
> >> I built a query parser that basically supports a syntax for distance
> >> searches over stemmed and unstemmed terms, both unordered and ordered. E.g.
> >> (term1 NEAR5 term5) OR ("term7" NEAR20 "term15") OR ("term3" ONEAR20
> >> term6), where NEAR stands for an unordered SpanNear and ONEAR for an
> >> ordered SpanNear.
> >>
> >> I've implemented a generic approach where all of these get converted to a
> >> SpanQuery; as you can see, it can be a mix of everything in one query. So
> >> your suggestion is to replace these with PhraseQueries and IntervalQueries
> >> and combine these?
> >>
> >> Thanks again for your help,
> >> Sjoerd
> >>
> >> On Fri, Dec 16, 2022 at 2:51 PM Mikhail Khludnev <[email protected]> wrote:
> >>
> >>> Forwarding the note to users. Thanks Uwe for sharing your observations.
> >>> Thanks to Mr Woodward who brought intervals to the party.
> >>>
> >>>
> >>> On Fri, Dec 16, 2022 at 7:33 PM Uwe Schindler <[email protected]> wrote:
> >>>
> >>>> Spans seem to have the problem of creating huge "List<Something>"
> >>>> during query iteration to track some stuff. I never understood the code,
> >>>> but to me it was always crazy to have Lists populated during execution.
> >>>> We replaced all SpanQueries by Intervals in patent search and speed is
> >>>> much faster and heap usage is tiny.
> >>>>
> >>>> A span/phrase with inOrder=false can always be replaced by a phrase with
> >>>> slop. The slop is always without order, as it is an "edit distance" only
> >>>> (see documentation). If you need in order, an interval is required.
> >>>>
> >>>> Phrases are only in order for "slop=0". Compare to "slop=1" which means
> >>>> "next to each other" and is no longer in order.
> >>>>
> >>>> Uwe
> >>>> Am 15.12.2022 um 16:44 schrieb Mikhail Khludnev:
> >>>>
> >>>> Michael, thanks for stepping in!
> >>>>
> >>>> > it seems that simple phrase queries would suffice here in place of
> >>>> > spanNear?
> >>>>
> >>>> I think it wouldn't. It seems to me 4 is slop, and false is inOrder.
> >>>> Sjoerd, can you comment on the particular span queries you use?
> >>>> Also, do you have any heap dump summary to confirm high memory
> >>>> consumption by spans?
> >>>>
> >>>> On Thu, Dec 15, 2022 at 5:33 PM Michael Gibney <[email protected]> wrote:
> >>>>
> >>>>> I don't think that nested boolean disjunctions consisting of isolated
> >>>>> spanNear queries at the leaves should have memory issues (as opposed
> >>>>> to nested spanNear queries around disjunctions, which might well do).
> >>>>> Am I misreading the string representation of that query? A little bit
> >>>>> more explicit information about how the query is built, so that we can
> >>>>> be certain of what we're dealing with, would be helpful.
> >>>>>
> >>>>> It'd certainly be worth trying IntervalsQuery -- but part of what
> >>>>> makes me think I must be missing something in interpreting the string
> >>>>> representation of the query provided: it seems that simple phrase
> >>>>> queries would suffice here in place of spanNear?
> >>>>>
> >>>>> Regarding SpanQuery vs. IntervalsQuery performance and
> >>>>> characteristics, there's some possibly-relevant discussion on
> >>>>> LUCENE-9204:
> >>>>>
> >>>>>
> >>>>> https://issues.apache.org/jira/browse/LUCENE-9204?focusedCommentId=17352589#comment-17352589
> >>>>>
> >>>>> Michael
> >>>>>
> >>>>>
> >>>>> On Wed, Dec 14, 2022 at 1:27 PM Mikhail Khludnev <[email protected]> wrote:
> >>>>> >
> >>>>> > Developers,
> >>>>> > Is this expected for Spans? Can IntervalsQuery help here?
> >>>>> >
> >>>>> > On Wed, Dec 14, 2022 at 5:41 PM Sjoerd Smeets <[email protected]> wrote:
> >>>>> >>
> >>>>> >> Hi,
> >>>>> >>
> >>>>> >> I've implemented a Span Query parser, and when running the query
> >>>>> >> below, I'm seeing heap space errors on certain shards:
> >>>>> >>
> >>>>> >> o.a.s.s.HttpSolrCall null:java.lang.RuntimeException:
> >>>>> >> java.lang.OutOfMemoryError: Java heap space
> >>>>> >>
> >>>>> >> The span query that I'm running is the following:
> >>>>> >>
> >>>>> >> ((spanNear([unstemmed_text:charge, unstemmed_text:account], 4, false)
> >>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:account], 4, false))
> >>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:deposit], 4, false))
> >>>>> >> spanNear([unstemmed_text:charge, unstemmed_text:deposit], 4, false)
> >>>>> >>
> >>>>> >> The heap size at the moment is set to 48 GB. We are running 4 shards
> >>>>> >> in 1 JVM, and the 4 shards combined have 24M docs evenly distributed
> >>>>> >> across the shards. We do use the collapse feature as well.
> >>>>> >>
> >>>>> >> This is on Solr 8.6.0
> >>>>> >>
> >>>>> >> What are the considerations for running Span Queries and heap sizes?
> >>>>> >>
> >>>>> >> Any suggestions are welcome
> >>>>> >>
> >>>>> >> Sjoerd
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > --
> >>>>> > Sincerely yours
> >>>>> > Mikhail Khludnev
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Sincerely yours
> >>>> Mikhail Khludnev
> >>>>
> >>>> --
> >>>> Uwe Schindler
> >>>> Achterdiek 19, D-28357 Bremen
> >>>> https://www.thetaphi.de
> >>>> eMail: [email protected]
> >>>>
> >>>>
> >>>
> >>> --
> >>> Sincerely yours
> >>> Mikhail Khludnev
> >>>
> >> --
> >> Uwe Schindler
> >> Achterdiek 19, D-28357 Bremen
> >> https://www.thetaphi.de
> >> eMail: [email protected]
> >>
> >>