Btw, is it worthwhile creating a ticket for SpanQueries going mental with the heap in certain cases?
On Mon, Dec 19, 2022 at 8:51 AM Sjoerd Smeets <[email protected]> wrote:

> Thanks Uwe! There is no requirement yet to support a FIRST
> operator, but I get your point. I'll use this as feedback and see what we
> will do.
>
> On Mon, Dec 19, 2022 at 8:43 AM Uwe Schindler <[email protected]> wrote:
>
>> Hi,
>>
>> the approach I am using for patent search (the syntax you have posted
>> looks like patent search) is to have a general transformation to Intervals,
>> except when the operands left/right of the NEAR operator are just plain terms
>> without any special structure or subqueries. In that case I transform it to
>> a phrase with slop=5 (for NEAR5). All other cases get an IntervalQuery.
>>
>> The well-known FIRST/FIRST5 operator (first 5 words of the document) in patent
>> search is also doable with an IntervalQuery instead of the legacy SpanFirst class,
>> but there's an implementation missing for intervals, so you have to
>> implement it on your own (but that's easy). I can help with a simple
>> implementation for it.
>>
>> Uwe
>>
>> On 19.12.2022 at 15:35, Sjoerd Smeets wrote:
>>
>> Thanks everybody. I indeed have the memory dumps of these. I'm happy to
>> share them with you. These are pretty big files (3g compressed - 32g
>> uncompressed).
>>
>> I built a query parser that basically supports a syntax for distance
>> searches between stemmed and unstemmed, and unordered and ordered. E.g.
>> (term1 NEAR5 term5) OR ("term7" NEAR20 "term15") OR ("term3" ONEAR20
>> term6). Here NEAR stands for an unordered SpanNear and ONEAR for an
>> ordered SpanNear.
>>
>> I've implemented a generic approach where all these get converted to a
>> SpanQuery; as you can see, it can be a mix of everything in 1 query. So your
>> suggestion is to replace these with PhraseQueries and IntervalQueries and
>> combine these?
>>
>> Thanks again for your help,
>> Sjoerd
>>
>> On Fri, Dec 16, 2022 at 2:51 PM Mikhail Khludnev <[email protected]> wrote:
>>
>>> Forwarding the note to users.
>>> Thanks Uwe for sharing your observations.
>>> Thanks to Mr Woodward who brought intervals to the party.
>>>
>>> On Fri, Dec 16, 2022 at 7:33 PM Uwe Schindler <[email protected]> wrote:
>>>
>>>> Spans seem to have the problem of creating huge "List<Something>"
>>>> during query iteration to track some stuff. I never understood the code,
>>>> but to me it was always crazy to have Lists populated during execution. We
>>>> replaced all SpanQueries by Intervals in patent search and speed is much
>>>> faster and heap usage is tiny.
>>>>
>>>> A span/phrase with inOrder=false can always be replaced by a phrase with
>>>> slop. The slop is always without order, as it is an "edit distance" only
>>>> (see documentation). If you need in order, an interval is required.
>>>>
>>>> Phrases are only in order for "slop=0". Compare to "slop=1" which means
>>>> "next to each other" and is no longer in order.
>>>>
>>>> Uwe
>>>>
>>>> On 15.12.2022 at 16:44, Mikhail Khludnev wrote:
>>>>
>>>> Michael, thanks for stepping in!
>>>>
>>>> > it seems that simple phrase
>>>> > queries would suffice here in place of spanNear?
>>>>
>>>> I think it wouldn't. It seems to me 4 is slop, and false is inOrder.
>>>> Sjoerd, can you comment on the particular span queries you use?
>>>> Also, do you have any heap dump summary to confirm high memory
>>>> consumption by spans?
>>>>
>>>> On Thu, Dec 15, 2022 at 5:33 PM Michael Gibney <[email protected]> wrote:
>>>>
>>>>> I don't think that nested boolean disjunctions consisting of isolated
>>>>> spanNear queries at the leaves should have memory issues (as opposed
>>>>> to nested spanNear queries around disjunctions, which might well do).
>>>>> Am I misreading the string representation of that query? A little bit
>>>>> more explicit information about how the query is built, so that we can
>>>>> be certain of what we're dealing with, would be helpful.
>>>>>
>>>>> It'd certainly be worth trying IntervalQuery -- but part of what
>>>>> makes me think I must be missing something in interpreting the string
>>>>> representation of the query provided: it seems that simple phrase
>>>>> queries would suffice here in place of spanNear?
>>>>>
>>>>> Regarding SpanQuery vs. IntervalQuery performance and
>>>>> characteristics, there's some possibly-relevant discussion on
>>>>> LUCENE-9204:
>>>>>
>>>>> https://issues.apache.org/jira/browse/LUCENE-9204?focusedCommentId=17352589#comment-17352589
>>>>>
>>>>> Michael
>>>>>
>>>>> On Wed, Dec 14, 2022 at 1:27 PM Mikhail Khludnev <[email protected]> wrote:
>>>>> >
>>>>> > Developers,
>>>>> > Is this expected for Spans? Can IntervalQuery help here?
>>>>> >
>>>>> > On Wed, Dec 14, 2022 at 5:41 PM Sjoerd Smeets <[email protected]> wrote:
>>>>> >>
>>>>> >> Hi,
>>>>> >>
>>>>> >> I've implemented a Span Query parser and when running the below
>>>>> >> query, I'm seeing Java heap space errors on certain shards:
>>>>> >>
>>>>> >> o.a.s.s.HttpSolrCall null:java.lang.RuntimeException:
>>>>> >> java.lang.OutOfMemoryError: Java heap space
>>>>> >>
>>>>> >> The span query that I'm running is the following:
>>>>> >>
>>>>> >> ((spanNear([unstemmed_text:charge, unstemmed_text:account], 4, false)
>>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:account], 4, false))
>>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:deposit], 4, false))
>>>>> >> spanNear([unstemmed_text:charge, unstemmed_text:deposit], 4, false)
>>>>> >>
>>>>> >> The heap size at the moment is set to 48Gb. We are running 4 shards
>>>>> >> in 1 JVM and the 4 shards combined have 24M docs evenly distributed
>>>>> >> across the shards. We do use the collapse feature as well.
>>>>> >>
>>>>> >> This is on Solr 8.6.0
>>>>> >>
>>>>> >> What are the considerations for running Span Queries and heap sizes?
>>>>> >>
>>>>> >> Any suggestions are welcome
>>>>> >>
>>>>> >> Sjoerd
>>>>> >
>>>>> > --
>>>>> > Sincerely yours
>>>>> > Mikhail Khludnev
>>>>
>>>> --
>>>> Uwe Schindler
>>>> Achterdiek 19, D-28357 Bremen
>>>> https://www.thetaphi.de
>>>> eMail: [email protected]
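
For reference, the rewrite rule Uwe describes in the thread (plain terms on both sides of an unordered NEAR become a phrase with slop, everything else becomes an interval) could be sketched roughly as below. This is only an illustration under stated assumptions, not the parser from the thread: the Node, Term, Or and Near types are hypothetical, the code uses no Lucene classes and only renders the chosen target as text, and mapping NEARn to maxgaps(n, ...) is an assumption about how the distance would be expressed with intervals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NearRewriteSketch {
    /** Hypothetical parser output node; not a Lucene type. */
    interface Node {}

    static final class Term implements Node {
        final String text;
        Term(String text) { this.text = text; }
    }

    static final class Or implements Node {
        final List<Node> clauses;
        Or(Node... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    static final class Near implements Node {
        final Node left;
        final Node right;
        final int distance;    // the n in NEARn / ONEARn
        final boolean ordered; // true for ONEAR, false for NEAR
        Near(Node left, Node right, int distance, boolean ordered) {
            this.left = left;
            this.right = right;
            this.distance = distance;
            this.ordered = ordered;
        }
    }

    /** Renders, as plain text, which kind of query a node would be rewritten to. */
    static String describe(Node node) {
        if (node instanceof Term) {
            return "term(" + ((Term) node).text + ")";
        }
        if (node instanceof Or) {
            List<String> parts = new ArrayList<>();
            for (Node clause : ((Or) node).clauses) {
                parts.add(describe(clause));
            }
            return "or(" + String.join(", ", parts) + ")";
        }
        Near near = (Near) node;
        boolean plainTerms = near.left instanceof Term && near.right instanceof Term;
        if (plainTerms && !near.ordered) {
            // Unordered NEAR over two plain terms: a phrase with slop is enough.
            return "phrase(\"" + ((Term) near.left).text + " " + ((Term) near.right).text
                    + "\", slop=" + near.distance + ")";
        }
        // Ordered, or operands with structure: fall back to an interval.
        // Assumption: NEARn maps to maxgaps(n, ...).
        String combinator = near.ordered ? "ordered" : "unordered";
        return "maxgaps(" + near.distance + ", " + combinator + "("
                + describe(near.left) + ", " + describe(near.right) + "))";
    }

    public static void main(String[] args) {
        System.out.println(describe(new Near(new Term("charge"), new Term("account"), 4, false)));
        System.out.println(describe(new Near(new Term("term3"), new Term("term6"), 20, true)));
        System.out.println(describe(
                new Near(new Or(new Term("charge"), new Term("pledge")), new Term("account"), 4, false)));
    }
}
```

In a real parser the three branches would presumably build a PhraseQuery with slop, or an IntervalQuery over Intervals.ordered / Intervals.unordered wrapped in Intervals.maxgaps, instead of strings. Keeping the decision in one place makes it easy to check which queries still need the more expensive positional machinery.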
