[jira] [Commented] (LUCENE-9107) CommonsTermsQuery with huge no. of terms slower with top-k scoring
[ https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173187#comment-17173187 ]

Vincenzo D'Amore commented on LUCENE-9107:
------------------------------------------

Hi, I went a step further in trying to identify the performance difference when using CommonTermsQuery with different versions of Solr (7.3.1 vs 8.6.0).

I used the test_8.6.0 branch of this fork of the anserini repo: [https://github.com/freedev/anserini/blob/test_8.6.0]

There I tried the ann sample. Here are the steps to reproduce the problem:

copy and build
{quote}{{git clone [https://github.com/freedev/anserini.git]}}
{{cd anserini}}
{{git checkout test_8.6.0}}
{{mvn -Prelease clean package}}
{quote}

create the lucene index
{quote}{{java -cp target/anserini-0.9.5-SNAPSHOT-fatjar.jar io.anserini.ann.IndexVectors -input glove.6B.300d.txt -path glove300-idx-8.6.0 -encoding fw}}
{quote}

reproduce the issue (the vector used for the word apple is hardcoded into the ApproximateNearestNeighborSearch main)
{quote}{{java -cp target/anserini-0.9.5-SNAPSHOT-fatjar.jar io.anserini.ann.ApproximateNearestNeighborSearch -input glove.6B.300d.txt -path glove300-idx-8.6.0 -encoding fw -word apple}}
{quote}

This is the VisualVM Sampler output after having monitored {{ApproximateNearestNeighborSearch}} with Java Flight Recorder:

!image-2020-08-07-16-54-27-905.png|width=921,height=609!

Changing the line [186 in ApproximateNearestNeighborSearch|https://github.com/freedev/anserini/blob/test_8.6.0/src/main/java/io/anserini/ann/ApproximateNearestNeighborSearch.java#L186]

from:

{{TopScoreDocCollector.create(indexArgs.depth, 0);}}

to:

{{TopScoreDocCollector.create(indexArgs.depth, Integer.MAX_VALUE);}}

greatly reduces the time spent (from ~2 seconds to 300-400 milliseconds), see the screenshot:

!Screenshot 2020-08-07 at 16.20.05.png|width=927,height=613!

> CommonsTermsQuery with huge no. of terms slower with top-k scoring
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9107
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9107
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 8.3
>            Reporter: Tommaso Teofili
>            Priority: Major
>         Attachments: Screenshot 2020-08-07 at 16.20.01.png, Screenshot 2020-08-07 at 16.20.05.png, image-2020-08-07-16-54-27-905.png
>
> In [1] a {{CommonTermsQuery}} is used in order to perform a query with lots
> of (duplicate) terms. Using a max term frequency cutoff of 0.999 for low
> frequency terms, the query, although big, finishes in around 2-300ms with
> Lucene 7.6.0.
> However, when upgrading the code to Lucene 8.x, the query runs in 2-3s
> instead [2].
> After digging a bit into it it seems that the regression in speed comes from
> the fact that top-k scoring introduced by default in version 8 is causing
> that, not sure "where" exactly in the code though.
> When switching back to complete hit scoring [3], the speed goes back to the
> initial 2-300ms also in Lucene 8.3.x.
> It'd be nice to understand the reason why this is happening and if it is only
> concerning {{CommonTermsQuery}} or affecting {{BooleanQuery}} as well.
> If this is a case that depends on the data and application involved (Anserini
> in this case), the application should handle it, otherwise if it is a
> regression/bug in Lucene it'd be nice to fix it.
>
> [1] : https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
> [2] : https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
> [3] : https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-9107) CommonsTermsQuery with huge no. of terms slower with top-k scoring
[ https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006815#comment-17006815 ]

Tommaso Teofili commented on LUCENE-9107:
-----------------------------------------

Thanks Adrien for looking into this. I've tried with a pure disjunction (BooleanQuery) and the numbers are about the same as with {{CommonTermsQuery}}.

The contribution of {{ClassicSimilarity}} to the slowness is non-trivial: top-k scoring takes 2 to 2.5 seconds with {{ClassicSimilarity}}, versus 1.5 to 2 seconds with {{BM25Similarity}}.

[jira] [Commented] (LUCENE-9107) CommonsTermsQuery with huge no. of terms slower with top-k scoring
[ https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002848#comment-17002848 ]

Adrien Grand commented on LUCENE-9107:
--------------------------------------

CommonTermsQuery probably makes the issue worse by having clauses on multiple levels of boolean queries (see e.g. how the nested boolean queries perform worse than single-level boolean queries in the nightly benchmarks http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html), but this is an issue with BooleanQuery too.

We have complex logic that tries to skip as many hits as possible, but when this logic is defeated, which is typically the case when
- there are lots of clauses,
- or clauses have about the same max scores,
- or maximum score upper bounds are highly overestimated (ClassicSimilarity might contribute a bit here too),

then we need to pay the price for this overhead without getting any benefits.

What latency do you get if you run a pure disjunction with these clauses instead of a CommonTermsQuery?

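
To make the three conditions above a bit more concrete, here is a toy sketch of a MaxScore-style partition of clauses. It is not Lucene's actual scorer code: the class name, the method name, the max scores and the threshold are all made up for illustration. The idea it shows is that clauses whose summed score upper bounds stay below the current minimum competitive score cannot produce a top-k hit on their own, so only those clauses give the skipping logic something to work with.

```java
import java.util.Arrays;

// Hypothetical illustration only, not part of Lucene.
class SkipSketch {

    // Returns how many clauses are "non-essential": the longest prefix of
    // clauses, sorted by ascending max score, whose summed upper bounds stay
    // strictly below the minimum competitive score. A document that matches
    // only such clauses cannot enter the top-k.
    static int nonEssentialClauses(double[] maxScores, double minCompetitiveScore) {
        double[] sorted = maxScores.clone();
        Arrays.sort(sorted);
        double sum = 0;
        int count = 0;
        for (double upperBound : sorted) {
            if (sum + upperBound >= minCompetitiveScore) {
                break;
            }
            sum += upperBound;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        double minCompetitiveScore = 5.0; // assumed score of the current k-th best hit

        // Few clauses with very different max scores: most can be skipped.
        double[] skewed = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 6.0};

        // Many clauses with about the same max score: almost nothing can be skipped.
        double[] uniform = new double[300];
        Arrays.fill(uniform, 3.0);

        // Same clauses with upper bounds overestimated by 2x: nothing can be skipped.
        double[] overestimated = new double[300];
        Arrays.fill(overestimated, 6.0);

        System.out.println("skewed: " + nonEssentialClauses(skewed, minCompetitiveScore) + " of " + skewed.length);
        System.out.println("uniform: " + nonEssentialClauses(uniform, minCompetitiveScore) + " of " + uniform.length);
        System.out.println("overestimated: " + nonEssentialClauses(overestimated, minCompetitiveScore) + " of " + overestimated.length);
    }
}
```

With these made-up numbers, 8 of the 9 skewed clauses are non-essential, but only 1 of the 300 uniform clauses and none of the overestimated ones, so the bookkeeping for skipping would be paid for with little or nothing skipped.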