Thanks Uwe! There is no requirement yet to support a FIRST operator, but I get your point. I'll take this as feedback and see what we will do.
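
For anyone reading along, here is a minimal sketch of the rewrite rule Uwe describes in the quoted message below: an unordered NEAR between two plain terms becomes a phrase with slop, and everything else (ordered operators, operands with subqueries) becomes an interval query. This is not the actual parser. `NearRewriteSketch`, `Node`, `Target`, and `chooseTarget` are hypothetical names, the code does not call Lucene at all, and routing ordered operators to intervals is one reading of Uwe's earlier note that in-order matching needs an interval.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of the rewrite rule from the thread; all names here are hypothetical. */
final class NearRewriteSketch {

    /** Which kind of query a NEAR/ONEAR node would be turned into. */
    enum Target { PHRASE_WITH_SLOP, INTERVAL_QUERY }

    /** Hypothetical parse-tree node: either a plain term or a subquery with children. */
    static final class Node {
        final String term;         // set for plain terms, null for subqueries
        final List<Node> children; // empty for plain terms

        private Node(String term, List<Node> children) {
            this.term = term;
            this.children = children;
        }

        static Node term(String text) {
            return new Node(text, Collections.<Node>emptyList());
        }

        static Node subquery(Node... children) {
            return new Node(null, Arrays.asList(children));
        }

        boolean isPlainTerm() {
            return term != null;
        }
    }

    /**
     * Unordered NEAR between two plain terms maps to a phrase with slop;
     * ordered operators and anything with nested structure map to intervals.
     */
    static Target chooseTarget(Node left, Node right, boolean inOrder) {
        if (!inOrder && left.isPlainTerm() && right.isPlainTerm()) {
            return Target.PHRASE_WITH_SLOP;
        }
        return Target.INTERVAL_QUERY;
    }

    public static void main(String[] args) {
        Node charge = Node.term("charge");
        Node account = Node.term("account");
        Node either = Node.subquery(Node.term("pledge"), Node.term("deposit"));

        System.out.println(chooseTarget(charge, account, false)); // NEAR, two plain terms
        System.out.println(chooseTarget(charge, account, true));  // ONEAR, two plain terms
        System.out.println(chooseTarget(charge, either, false));  // NEAR with a subquery operand
    }
}
```

In a real parser the PHRASE_WITH_SLOP branch would presumably build a PhraseQuery whose slop is the NEAR distance, and the other branch an IntervalQuery. That wiring is left out here because it depends on the Lucene version in use.
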

On Mon, Dec 19, 2022 at 8:43 AM Uwe Schindler <[email protected]> wrote:

> Hi,
>
> the approach I am using for patent search (the syntax you have posted looks
> like patent search) is to have a general transformation to Intervals, except
> when the operands left/right of the NEAR operator are just plain terms without
> any special structure or subqueries. In that case I transform it to a phrase
> with slop=5 (for NEAR5). All other cases get IntervalQuery.
>
> The well-known FIRST/FIRST5 operator (first 5 words of the document) in patent
> search is also doable with an IntervalQuery instead of the legacy SpanFirst
> class, but there's an implementation missing for intervals, so you have to
> implement it on your own (but that's easy). I can help with a simple
> implementation for it.
>
> Uwe
>
> On 19.12.2022 at 15:35, Sjoerd Smeets wrote:
>
> Thanks everybody. I indeed have the memory dumps of these. I'm happy to
> share them with you. These are pretty big files (3 GB compressed, 32 GB
> uncompressed).
>
> I built a query parser that basically supports a syntax for distance
> searches between stemmed and unstemmed, and unordered and ordered. E.g.
> (term1 NEAR5 term5) OR ("term7" NEAR20 "term15") OR ("term3" ONEAR20
> term6), where NEAR stands for an unordered SpanNear and ONEAR for an
> ordered SpanNear.
>
> I've implemented a generic approach where all these get converted to a
> SpanQuery; as you can see, it can be a mix of everything in one query. So your
> suggestion is to replace these with PhraseQueries and IntervalQueries and
> combine these?
>
> Thanks again for your help,
> Sjoerd
>
> On Fri, Dec 16, 2022 at 2:51 PM Mikhail Khludnev <[email protected]> wrote:
>
>> Forwarding the note to users. Thanks Uwe for sharing your observations.
>> Thanks to Mr Woodward who brought intervals to the party.
>>
>> On Fri, Dec 16, 2022 at 7:33 PM Uwe Schindler <[email protected]> wrote:
>>
>>> Spans seem to have the problem of creating huge "List<Something>" during
>>> query iteration to track some stuff. I never understood the code, but to me
>>> it was always crazy to have Lists populated during execution. We replaced
>>> all SpanQueries by Intervals in patent search and it is much faster and
>>> heap usage is tiny.
>>>
>>> A span/phrase with inOrder=false can always be replaced by a phrase with
>>> slop. The slop is always without order, as it is an "edit distance" only
>>> (see documentation). If you need in order, an interval is required.
>>>
>>> Phrases are only in order for "slop=0". Compare to "slop=1", which means
>>> "next to each other" and is no longer in order.
>>>
>>> Uwe
>>>
>>> On 15.12.2022 at 16:44, Mikhail Khludnev wrote:
>>>
>>> Michael, thanks for stepping in!
>>>
>>> > it seems that simple phrase queries would suffice here in place of
>>> > spanNear?
>>>
>>> I think it wouldn't. It seems to me 4 is slop, and false is inOrder.
>>> Sjoerd, can you comment on the particular span queries you use?
>>> Also, do you have any heap dump summary to confirm high memory
>>> consumption by spans?
>>>
>>> On Thu, Dec 15, 2022 at 5:33 PM Michael Gibney <[email protected]> wrote:
>>>
>>>> I don't think that nested boolean disjunctions consisting of isolated
>>>> spanNear queries at the leaves should have memory issues (as opposed
>>>> to nested spanNear queries around disjunctions, which might well do).
>>>> Am I misreading the string representation of that query? A little bit
>>>> more explicit information about how the query is built, so that we can
>>>> be certain of what we're dealing with, would be helpful.
>>>>
>>>> It'd certainly be worth trying IntervalsQuery -- but part of what
>>>> makes me think I must be missing something in interpreting the string
>>>> representation of the query provided: it seems that simple phrase
>>>> queries would suffice here in place of spanNear?
>>>>
>>>> Regarding SpanQuery vs. IntervalsQuery performance and
>>>> characteristics, there's some possibly-relevant discussion on
>>>> LUCENE-9204:
>>>>
>>>> https://issues.apache.org/jira/browse/LUCENE-9204?focusedCommentId=17352589#comment-17352589
>>>>
>>>> Michael
>>>>
>>>>
>>>> On Wed, Dec 14, 2022 at 1:27 PM Mikhail Khludnev <[email protected]> wrote:
>>>> >
>>>> > Developers,
>>>> > Is this expected for Spans? Can IntervalsQuery help here?
>>>> >
>>>> > On Wed, Dec 14, 2022 at 5:41 PM Sjoerd Smeets <[email protected]> wrote:
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> I've implemented a Span Query parser and when running the query
>>>> >> below, I'm seeing Java heap space errors on certain shards:
>>>> >>
>>>> >> o.a.s.s.HttpSolrCall null:java.lang.RuntimeException:
>>>> >> java.lang.OutOfMemoryError: Java heap space
>>>> >>
>>>> >> The span query that I'm running is the following:
>>>> >>
>>>> >> ((spanNear([unstemmed_text:charge, unstemmed_text:account], 4, false)
>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:account], 4, false))
>>>> >> spanNear([unstemmed_text:pledge, unstemmed_text:deposit], 4, false))
>>>> >> spanNear([unstemmed_text:charge, unstemmed_text:deposit], 4, false)
>>>> >>
>>>> >> The heap size at the moment is set to 48 GB. We are running 4 shards
>>>> >> in 1 JVM and the 4 shards combined have 24M docs evenly distributed
>>>> >> across the shards. We do use the collapse feature as well.
>>>> >>
>>>> >> This is on Solr 8.6.0.
>>>> >>
>>>> >> What are the considerations for running Span Queries and heap sizes?
>>>> >>
>>>> >> Any suggestions are welcome.
>>>> >>
>>>> >> Sjoerd
>>>> >
>>>> > --
>>>> > Sincerely yours
>>>> > Mikhail Khludnev
>>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>>
>>> --
>>> Uwe Schindler
>>> Achterdiek 19, D-28357 Bremen
>>> https://www.thetaphi.de
>>> eMail: [email protected]
>>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: [email protected]
