Re: Querying into a Collector visits documents multiple times
Ah, sorry, never mind. I confused Collector and CollectorManager.

On Fri, Sep 24, 2021, 6:51 AM Michael Sokolov wrote:

> Separate issue, but this collector is not going to work with concurrent
> search since the sum is not updated in a thread-safe manner. Maybe you
> don't care, since you don't use a thread pool to execute your queries, but
> you probably should!
Re: Querying into a Collector visits documents multiple times
Separate issue, but this collector is not going to work with concurrent
search since the sum is not updated in a thread-safe manner. Maybe you
don't care, since you don't use a thread pool to execute your queries, but
you probably should!

On Wed, Sep 22, 2021, 8:38 AM Adrien Grand wrote:

> Hi Steven,
>
> This collector looks correct to me. Resetting the counter to 0 on the
> first segment is indeed not necessary.
>
> We have plenty of collectors that are very similar to this one and we
> never observed any double-counting issue. I would suspect an issue in the
> code that calls this collector. Maybe try to print the stack trace under
> the `if (context.docBase == 0) {` check to see why your collector is being
> called twice?
>
> --
> Adrien
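A minimal sketch of the thread-safety concern raised in the message above, in plain Java with no Lucene dependency. `SliceSum`, `SumManager`, and `ConcurrentSumSketch` are hypothetical stand-ins that only mirror the `newCollector()` / `reduce(...)` shape of Lucene's `CollectorManager`; they are not Lucene classes. The idea is that each task writes to its own private accumulator and the partial sums are merged on one thread at the end, so no `long` field is ever updated from two threads.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentSumSketch {

    /** Hypothetical per-slice accumulator: each instance is used by one task only. */
    static final class SliceSum {
        long sum = 0;

        void collect(long value) {
            sum += value; // safe: this instance is never shared between threads
        }
    }

    /** Hypothetical stand-in mirroring the newCollector()/reduce() shape of a collector manager. */
    static final class SumManager {
        SliceSum newCollector() {
            return new SliceSum();
        }

        long reduce(Collection<SliceSum> collectors) {
            long total = 0;
            for (SliceSum c : collectors) {
                total += c.sum;
            }
            return total;
        }
    }

    /** Sums each slice in its own task, then merges the partial sums on the calling thread. */
    public static long sumConcurrently(List<long[]> slices, int threads) {
        SumManager manager = new SumManager();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<SliceSum> collectors = new ArrayList<>();
            List<Future<?>> pending = new ArrayList<>();
            for (long[] slice : slices) {
                SliceSum collector = manager.newCollector();
                collectors.add(collector);
                pending.add(pool.submit(() -> {
                    for (long v : slice) {
                        collector.collect(v);
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the task; get() also makes its writes visible here
            }
            return manager.reduce(collectors);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("slice task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<long[]> slices = List.of(new long[] {1, 2, 3}, new long[] {10, 20}, new long[] {100});
        System.out.println(sumConcurrently(slices, 3)); // prints 136
    }
}
```

The alternative of keeping one shared counter and making it atomic would also avoid lost updates; the per-task accumulator is shown here only because it matches the collector-per-slice pattern discussed in the thread.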
Re: Querying into a Collector visits documents multiple times
Hi Steven,

This collector looks correct to me. Resetting the counter to 0 on the
first segment is indeed not necessary.

We have plenty of collectors that are very similar to this one and we
never observed any double-counting issue. I would suspect an issue in the
code that calls this collector. Maybe try to print the stack trace under
the `if (context.docBase == 0) {` check to see why your collector is being
called twice?

On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <stevenschlans...@gmail.com> wrote:

> While first testing, I observed that seemingly the collector is called
> twice for each document, and the sum is exactly double what you would
> expect.
> [...]
> What is the right pattern to ensure my Collector only observes result
> documents once, no matter the input query?

--
Adrien
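The stack-trace suggestion above could be sketched roughly as follows, in plain Java with no Lucene dependency. `FirstSegmentTracer` and `onLeaf` are hypothetical names; the assumption is that the real collector would call `onLeaf(context.docBase)` at the top of `getLeafCollector`. Each time the first segment (docBase 0) is requested, the helper prints a stack trace, so two traces for what should be a single search point at the caller that triggers the second pass.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstSegmentTracer {

    // One recorded stack per request for the first segment (docBase == 0).
    private final List<StackTraceElement[]> traces = new ArrayList<>();

    /** Hypothetical hook: call with context.docBase from getLeafCollector. */
    public synchronized void onLeaf(int docBase) {
        if (docBase != 0) {
            return; // only the first segment is interesting for this diagnosis
        }
        Throwable here = new Throwable("first segment requested, call #" + (traces.size() + 1));
        here.printStackTrace(); // goes to stderr; shows who asked for the leaf collector
        traces.add(here.getStackTrace());
    }

    /** Number of times the first segment has been requested so far. */
    public synchronized int firstSegmentCalls() {
        return traces.size();
    }

    public static void main(String[] args) {
        FirstSegmentTracer tracer = new FirstSegmentTracer();
        tracer.onLeaf(0);   // first pass starts
        tracer.onLeaf(128); // a later segment, ignored
        tracer.onLeaf(0);   // a second pass over the first segment
        System.out.println(tracer.firstSegmentCalls()); // prints 2
    }
}
```

If the count stays at 1 per search, the collector is only being driven once and the doubling would have to come from somewhere else.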
Querying into a Collector visits documents multiple times
Hi Lucene users,

I am developing a search application that needs to do some basic
summary statistics. We use Lucene 8.9.0.
To improve performance for e.g. summing a value across 10,000
documents, we are using DocValues as columnar storage.

In order to retrieve the DocValues without collecting all hits into a
TopDocs, which we determined to cause a lot of memory pressure and
consume much time, we are using the expert Collector query interface.

Here's the code, simplified a bit for the list:

    final var collector = new Collector() {
        long sum = 0;

        @Override
        public ScoreMode scoreMode() {
            return ScoreMode.COMPLETE_NO_SCORES;
        }

        @Override
        public LeafCollector getLeafCollector(final LeafReaderContext context) throws IOException {
            if (context.docBase == 0) {
                sum = 0; // XXX: this should not be necessary?
            }
            final var subtotalValue = context.reader().getNumericDocValues("subtotal");
            return new LeafCollector() {
                @Override
                public void setScorer(final Scorable scorer) throws IOException {
                }

                @Override
                public void collect(final int doc) throws IOException {
                    if (subtotalValue.docID() > doc
                            || !subtotalValue.advanceExact(doc)
                            || subtotalValue.longValue() == 0) {
                        return;
                    }
                    sum += subtotalValue.longValue();
                }
            };
        }
    };
    searcher.search(myQuery, collector);
    return collector.sum;

The query is a moderately complicated Boolean query with some
TermQuery and MultiTermQuery instances combined together.
While first testing, I observed that seemingly the collector is called
twice for each document, and the sum is exactly double what you would
expect.

It seems that the Collector is observing every matched document twice,
and by printing out the Scorer, I see that it's done with two
different BooleanScorer instances.
You can see my hack that resets the collector every time it starts at
docBase 0, which I am sure is not the right approach, but seems to
work.
What is the right pattern to ensure my Collector only observes result
documents once, no matter the input query? I see a note in the
documentation that state is supposed to be stored on the Scorer
implementation, but I am not providing a custom Scorer, nor do I
actually want any scoring at all.

Thank you for any guidance!
Steven

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org