[jira] [Commented] (LUCENE-9107) CommonsTermsQuery with huge no. of terms slower with top-k scoring

2020-08-07 Thread Vincenzo D'Amore (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173187#comment-17173187
 ] 

Vincenzo D'Amore commented on LUCENE-9107:
--

Hi, I went a step further in trying to pin down the performance difference 
when using CommonTermsQuery with different versions of Solr (7.3.1 vs 
8.6.0).

I used the test_8.6.0 branch of this fork of the anserini repo: 
[https://github.com/freedev/anserini/blob/test_8.6.0]

There I tried the ANN sample. Here are the steps to reproduce the problem.

Clone and build:
{quote}{{git clone https://github.com/freedev/anserini.git}}
 {{cd anserini}}
 {{git checkout test_8.6.0}}
 {{mvn -Prelease clean package}}
{quote}
Create the Lucene index:
{quote}{{java -cp target/anserini-0.9.5-SNAPSHOT-fatjar.jar 
io.anserini.ann.IndexVectors -input glove.6B.300d.txt -path glove300-idx-8.6.0 
-encoding fw}}
{quote}
Reproduce the issue (the vector used for the word apple is hardcoded into the 
ApproximateNearestNeighborSearch main):
{quote}{{java -cp target/anserini-0.9.5-SNAPSHOT-fatjar.jar 
io.anserini.ann.ApproximateNearestNeighborSearch -input glove.6B.300d.txt -path 
glove300-idx-8.6.0 -encoding fw -word apple}}
{quote}
 

This is the VisualVM Sampler output after having monitored 
{{ApproximateNearestNeighborSearch}} with Java Flight Recorder

!image-2020-08-07-16-54-27-905.png|width=921,height=609!

Changing [line 186 in 
ApproximateNearestNeighborSearch|https://github.com/freedev/anserini/blob/test_8.6.0/src/main/java/io/anserini/ann/ApproximateNearestNeighborSearch.java#L186]

from:

{{TopScoreDocCollector.create(indexArgs.depth, 0);}}

to:

{{TopScoreDocCollector.create(indexArgs.depth, Integer.MAX_VALUE);}}

greatly reduces the time spent (from ~2 s to 300-400 ms); see the 
screenshot:

 

!Screenshot 2020-08-07 at 16.20.05.png|width=927,height=613!
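
What the second argument in the change above switches can be sketched with a toy collector. This is not Lucene code: {{ToyTopK}}, {{Result}}, {{collect}} and the {{upperBounds}} array are hypothetical names, and the sketch assumes (as I understand the Lucene 8.x behaviour) that a small {{totalHitsThreshold}} lets the collector skip hits that cannot enter the top k, while {{Integer.MAX_VALUE}} forces every hit to be scored.

```java
import java.util.PriorityQueue;

// Toy model, NOT Lucene code: ToyTopK, Result and collect are hypothetical names.
final class ToyTopK {

    /** How many documents were fully scored and how many were skipped. */
    static final class Result {
        final int scored;
        final int skipped;

        Result(int scored, int skipped) {
            this.scored = scored;
            this.skipped = skipped;
        }
    }

    /**
     * Collects the top k (k must be at least 1) of scores[]. upperBounds[i]
     * stands in for a cheap upper bound on the score of document i. Skipping
     * is only allowed once more than totalHitsThreshold documents were scored.
     */
    static Result collect(float[] scores, float[] upperBounds, int k, int totalHitsThreshold) {
        PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap: head is the k-th best score
        float minCompetitive = Float.NEGATIVE_INFINITY;
        int scored = 0;
        int skipped = 0;
        for (int doc = 0; doc < scores.length; doc++) {
            if (upperBounds[doc] <= minCompetitive) {
                skipped++; // cannot enter the top k, so the exact score is never computed
                continue;
            }
            scored++;
            if (topK.size() < k) {
                topK.add(scores[doc]);
            } else if (scores[doc] > topK.peek()) {
                topK.poll();
                topK.add(scores[doc]);
            }
            // Raise the bar only when the heap is full and the threshold was exceeded.
            if (topK.size() == k && scored > totalHitsThreshold) {
                minCompetitive = topK.peek();
            }
        }
        return new Result(scored, skipped);
    }

    public static void main(String[] args) {
        float[] scores = {5f, 4f, 3f, 2f, 1f};
        Result pruned = collect(scores, scores, 2, 0);
        Result complete = collect(scores, scores, 2, Integer.MAX_VALUE);
        System.out.println("threshold 0: scored=" + pruned.scored + " skipped=" + pruned.skipped);
        System.out.println("threshold MAX_VALUE: scored=" + complete.scored + " skipped=" + complete.skipped);
    }
}
```

The toy only shows which mode the argument selects. It does not reproduce the slowdown itself, which in the real code would come from the bookkeeping needed to attempt skipping.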

> CommonsTermsQuery with huge no. of terms slower with top-k scoring
> --
>
> Key: LUCENE-9107
> URL: https://issues.apache.org/jira/browse/LUCENE-9107
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.3
>Reporter: Tommaso Teofili
>Priority: Major
> Attachments: Screenshot 2020-08-07 at 16.20.01.png, Screenshot 
> 2020-08-07 at 16.20.05.png, image-2020-08-07-16-54-27-905.png
>
>
> In [1] a {{CommonTermsQuery}} is used to perform a query with lots of 
> (duplicate) terms. Using a max term frequency cutoff of 0.999 for low 
> frequency terms, the query, although big, finishes in around 200-300 ms with 
> Lucene 7.6.0. 
> However, when upgrading the code to Lucene 8.x, the query runs in 2-3 s 
> instead [2].
> After digging into it a bit, it seems that the regression in speed comes from 
> the top-k scoring introduced by default in version 8, though I'm not sure 
> where exactly in the code.
> When switching back to complete hit scoring [3], the speed goes back to the 
> initial 200-300 ms in Lucene 8.3.x as well.
> It'd be nice to understand why this is happening and whether it only 
> concerns {{CommonTermsQuery}} or affects {{BooleanQuery}} as well.
> If this depends on the data and application involved (Anserini in this 
> case), the application should handle it; otherwise, if it is a 
> regression/bug in Lucene, it'd be nice to fix it.
> [1] : 
> https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
> [2] : 
> https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
> [3] : 
> https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9107) CommonsTermsQuery with huge no. of terms slower with top-k scoring

2020-01-02 Thread Tommaso Teofili (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006815#comment-17006815
 ] 

Tommaso Teofili commented on LUCENE-9107:
-

Thanks Adrien for looking into this. I've tried a pure disjunction 
(BooleanQuery) and the numbers are about the same as with {{CommonTermsQuery}}. 
The contribution of {{ClassicSimilarity}} to the slowness is non-trivial: top-k 
scoring takes 2 to 2.5 seconds with {{ClassicSimilarity}}, versus 1.5 to 2 
seconds with {{BM25Similarity}}.







[jira] [Commented] (LUCENE-9107) CommonsTermsQuery with huge no. of terms slower with top-k scoring

2019-12-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002848#comment-17002848
 ] 

Adrien Grand commented on LUCENE-9107:
--

CommonTermsQuery probably makes the issue worse by having clauses on multiple 
levels of boolean queries (see e.g. how the nested boolean queries perform 
worse than single-level boolean queries in the nightly benchmarks 
http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html), but 
this is an issue with BooleanQuery too. We have complex logic that tries to 
skip as many hits as possible, but when this logic is defeated, which is 
typically the case when
 - there are lots of clauses,
 - or clauses have about the same max scores,
 - or maximum score upper bounds are highly overestimated (ClassicSimilarity 
might contribute a bit here too),
then we need to pay the price for this overhead without getting any benefits.
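
Two of the conditions above (similar max scores, overestimated upper bounds) can be illustrated with a toy version of the classic MaxScore partitioning step. This is only a sketch under stated assumptions: as far as I know Lucene 8.x uses a WAND-style scorer for these disjunctions rather than this exact algorithm, and {{ToyMaxScorePartition}}, {{nonEssentialClauses}}, {{scaled}} and all the numbers are hypothetical. A clause is "non-essential" when documents matching only such clauses cannot reach the current minimum competitive score, so they can be skipped.

```java
import java.util.Arrays;

// Toy model, NOT Lucene code: all names and numbers here are hypothetical.
final class ToyMaxScorePartition {

    /**
     * Sorts the per-clause score upper bounds in ascending order and returns
     * the length of the longest prefix whose sum stays below minCompetitive.
     * Documents that match only clauses from that prefix can be skipped.
     */
    static int nonEssentialClauses(double[] maxScores, double minCompetitive) {
        double[] sorted = maxScores.clone();
        Arrays.sort(sorted);
        double prefixSum = 0;
        int count = 0;
        for (double maxScore : sorted) {
            prefixSum += maxScore;
            if (prefixSum >= minCompetitive) {
                break; // this clause alone could make a document competitive
            }
            count++;
        }
        return count;
    }

    /** Returns a copy of bounds with every entry multiplied by factor. */
    static double[] scaled(double[] bounds, double factor) {
        double[] out = new double[bounds.length];
        for (int i = 0; i < bounds.length; i++) {
            out[i] = bounds[i] * factor;
        }
        return out;
    }

    public static void main(String[] args) {
        double minCompetitive = 3.5;
        double[] skewed = {0.5, 1.0, 1.5, 5.0}; // tight bounds, very different max scores
        double[] uniform = {2.0, 2.0, 2.0, 2.0}; // same total, about the same max scores
        System.out.println("skewed:      " + nonEssentialClauses(skewed, minCompetitive));
        System.out.println("uniform:     " + nonEssentialClauses(uniform, minCompetitive));
        System.out.println("skewed x3:   " + nonEssentialClauses(scaled(skewed, 3), minCompetitive));
        System.out.println("skewed x10:  " + nonEssentialClauses(scaled(skewed, 10), minCompetitive));
    }
}
```

In this toy, fewer non-essential clauses means fewer documents that can be skipped, while the sorting and summing still has to be done every time the minimum competitive score changes.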

What latency do you get if you run a pure disjunction with these clauses 
instead of a CommonTermsQuery?



