[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-23 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570417#comment-17570417
 ] 

Zach Chen commented on LUCENE-10480:


From the latest nightly benchmark results, the negative impact on nested 
boolean queries has been resolved, and the performance boost to top-level 
disjunction queries has been maintained. Thanks for all the guidance, 
[~jpountz]!

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Zach Chen
>Priority: Minor
>  Time Spent: 11h 40m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568798#comment-17568798
 ] 

ASF subversion and git services commented on LUCENE-10480:
--

Commit 8ebb3305648aea8f551c2dd144d5a527b8982638 in lucene's branch 
refs/heads/branch_9x from Zach Chen
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8ebb3305648 ]

LUCENE-10480: (Backporting) Use BulkScorer to limit BMMScorer to only top-level 
disjunctions (#1037)






[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568781#comment-17568781
 ] 

ASF subversion and git services commented on LUCENE-10480:
--

Commit 28ce8abb5105dba5bc08b7f800f86f3741268bc9 in lucene's branch 
refs/heads/main from Zach Chen
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=28ce8abb510 ]

LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions 
(#1018)






[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-12 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566149#comment-17566149
 ] 

Zach Chen commented on LUCENE-10480:


{quote}I wouldn't say blocker, but maybe we could give ourselves time indeed by 
only using this new scorer on top-level disjunctions for now so that we have 
more time to figure out whether we should stick to BMW or switch to BMM for 
inner disjunctions.
{quote}
Sounds good. I tried a few quick approaches to limit the BMM scorer to top-level 
disjunctions in *BooleanWeight* or *Boolean2ScorerSupplier*, but they didn't 
work because of the recursive logic of the weights and queries. So I ended up 
wrapping the scorer inside a bulk scorer 
([https://github.com/apache/lucene/pull/1018], pending a tests update) like your 
other PR. Please let me know if this approach looks good to you, or if there's 
a better approach.
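
To illustrate why going through the bulk scorer restricts a scorer to top-level use, here is a minimal sketch. It does not use Lucene's real classes: {{SketchWeight}}, {{SketchDisjunctionWeight}}, {{SketchConjunctionWeight}} and {{SketchSearcher}} are hypothetical stand-ins, and the string labels "BMM" and "WAND" only name which scorer would be picked. The sketch assumes that only the top-level caller asks a weight for its bulk scorer, while parent queries ask their children for plain scorers.

```java
import java.util.List;

/** Hypothetical stand-in for a query weight; not Lucene's actual API. */
interface SketchWeight {
  /** Scorer used when this weight is nested inside another query. */
  String scorerName();

  /** Scorer used when this weight is the top-level query; defaults to the nested one. */
  default String bulkScorerName() {
    return scorerName();
  }
}

/** Two-clause disjunction: the specialized scorer is only reachable on the bulk path. */
final class SketchDisjunctionWeight implements SketchWeight {
  @Override
  public String scorerName() {
    return "WAND"; // nested use keeps the existing scorer
  }

  @Override
  public String bulkScorerName() {
    return "BMM"; // top-level use gets the specialized scorer
  }
}

/** Conjunction that asks its children for plain scorers, never for bulk scorers. */
final class SketchConjunctionWeight implements SketchWeight {
  private final List<SketchWeight> clauses;

  SketchConjunctionWeight(List<SketchWeight> clauses) {
    this.clauses = clauses;
  }

  @Override
  public String scorerName() {
    StringBuilder sb = new StringBuilder("AND(");
    for (int i = 0; i < clauses.size(); i++) {
      if (i > 0) {
        sb.append(", ");
      }
      sb.append(clauses.get(i).scorerName());
    }
    return sb.append(")").toString();
  }
}

final class SketchSearcher {
  /** The searcher is the only caller of bulkScorerName(), so it only sees top-level weights. */
  static String search(SketchWeight topLevel) {
    return topLevel.bulkScorerName();
  }
}
```

In this sketch a disjunction searched on its own resolves to "BMM", while the same disjunction nested in a conjunction is reached through {{scorerName()}} and stays on "WAND", without the weight having to know how deeply it is nested.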

 




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565375#comment-17565375
 ] 

Adrien Grand commented on LUCENE-10480:
---

+1 to explore this in a separate issue.

bq. Do you think this slowdown to AndHighOrMedMed may be considered a blocker 
to the 9.3 release? 

I wouldn't say blocker, but maybe we could give ourselves time indeed by only 
using this new scorer on top-level disjunctions for now so that we have more 
time to figure out whether we should stick to BMW or switch to BMM for inner 
disjunctions.




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-11 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565261#comment-17565261
 ] 

Zach Chen commented on LUCENE-10480:


{quote}Another thing that changes performance sometimes is the doc ID order, 
were you using multiple indexing threads maybe?
{quote}
OK, this is actually the case for me. I was previously using 10 threads to index 
(INDEX_NUM_THREADS = 10), and after I commented that out and reindexed with the 
default setting, I was able to reproduce the slowdown:

 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed       91.27      (4.3%)       85.52      (4.3%)   -6.3% ( -14% -    2%) 0.000
                        PKLookup      333.25      (4.3%)      329.48      (3.8%)   -1.1% (  -8% -    7%) 0.380
                     AndHighHigh      104.25      (2.9%)      103.11      (3.0%)   -1.1% (  -6% -    5%) 0.247
                        SpanNear       16.52      (3.8%)       16.36      (3.1%)   -0.9% (  -7% -    6%) 0.396
                    TermGroup10K       23.99      (3.3%)       23.78      (3.0%)   -0.9% (  -6% -    5%) 0.384
                          Phrase      234.74      (2.7%)      232.71      (1.8%)   -0.9% (  -5% -    3%) 0.235
                      AndHighMed      163.80      (3.5%)      162.42      (4.3%)   -0.8% (  -8% -    7%) 0.496
                    TermBGroup1M       48.02      (3.5%)       47.65      (3.7%)   -0.8% (  -7% -    6%) 0.496
                    SloppyPhrase        4.82      (3.4%)        4.78      (2.7%)   -0.7% (  -6% -    5%) 0.460
                    TermGroup100       41.90      (3.9%)       41.63      (3.3%)   -0.7% (  -7% -    6%) 0.569
                            Term     2680.42      (4.7%)     2664.05      (3.3%)   -0.6% (  -8% -    7%) 0.632
                     TermGroup1M       39.95      (2.9%)       39.71      (3.2%)   -0.6% (  -6% -    5%) 0.531
                  TermBGroup1M1P       84.21      (6.1%)       83.82      (5.7%)   -0.5% ( -11% -   12%) 0.801
                         Respell      113.78      (1.9%)      113.44      (1.7%)   -0.3% (  -3% -    3%) 0.603
     BrowseRandomLabelSSDVFacets       20.75      (8.2%)       20.74     (10.3%)   -0.0% ( -17% -   20%) 0.989
                          Fuzzy2       83.12      (1.8%)       83.11      (1.1%)   -0.0% (  -2% -    2%) 0.976
       BrowseDayOfYearSSDVFacets       26.69     (12.0%)       26.70     (11.6%)    0.0% ( -21% -   26%) 0.995
                        Wildcard      115.84      (5.1%)      115.96      (5.8%)    0.1% ( -10% -   11%) 0.951
               TermDayOfYearSort      260.70      (5.4%)      260.99      (2.8%)    0.1% (  -7% -    8%) 0.937
         AndHighMedDayTaxoFacets      136.32      (2.6%)      136.63      (2.3%)    0.2% (  -4% -    5%) 0.773
                IntervalsOrdered      128.13      (7.5%)      128.45      (7.7%)    0.3% ( -13% -   16%) 0.916
        AndHighHighDayTaxoFacets       13.82      (2.8%)       13.87      (2.6%)    0.4% (  -4% -    5%) 0.657
                          Fuzzy1       79.16      (2.7%)       79.60      (1.8%)    0.6% (  -3% -    5%) 0.433
                   TermMonthSort      360.17      (6.4%)      362.83      (7.1%)    0.7% ( -11% -   15%) 0.728
                   TermTitleSort      191.21      (6.8%)      192.70      (7.1%)    0.8% ( -12% -   15%) 0.723
                      TermDTSort      208.40      (2.9%)      210.39      (2.9%)    1.0% (  -4% -    7%) 0.301
            MedTermDayTaxoFacets       78.66      (5.2%)       79.59      (4.4%)    1.2% (  -7% -   11%) 0.436
                  TermDateFacets       41.04      (5.4%)       41.61      (4.7%)    1.4% (  -8% -   12%) 0.385
                          IntNRQ      122.00      (8.1%)      124.08      (8.3%)    1.7% ( -13% -   19%) 0.513
          OrHighMedDayTaxoFacets       23.16      (8.4%)       23.71      (4.9%)    2.4% ( -10% -   17%) 0.272
           BrowseMonthSSDVFacets       28.68     (13.8%)       29.55     (16.8%)    3.0% ( -24% -   39%) 0.531
       BrowseDayOfYearTaxoFacets       30.40     (32.2%)       31.67     (34.2%)    4.2% ( -47% -  103%) 0.690
            BrowseDateTaxoFacets       30.26     (32.2%)       31.57     (34.4%)    4.3% ( -47% -  104%) 0.680
                         Prefix3      402.14      (8.6%)      419.96      (8.9%)    4.4% ( -12% -   23%) 0.109
                AndMedOrHighHigh       94.79      (4.0%)       99.03      (4.5%)    4.5% (  -3% -   13%) 0.001
     BrowseRandomLabelTaxoFacets       32.45     (49.2%)       35.05     (53.4%)    8.0% ( -63% -  217%) 0.622
           BrowseMonthTaxoFacets       28.68     (35.3%)       31.37     (39.1%)    9.4% ( -48% -  129%) 0.425
            BrowseDateSSDVFacets        3.96     (28.1%)        4.54     (26.3%)   14.7% ( -31% -   96%) 0.089
                   

[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-11 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564885#comment-17564885
 ] 

Adrien Grand commented on LUCENE-10480:
---

I haven't tried to reproduce it, but the steps you took by running on wikibigall 
with the nightly tasks file sound good to me. Another thing that sometimes 
changes performance is the doc ID order; were you using multiple indexing 
threads, maybe?

Ignoring the fact that we cannot reproduce the slowdown, if I try to think of 
the main differences between WANDScorer and BlockMaxMaxscoreScorer for 
AndHighOrMedMed, I think the main one is the way that {{advanceShallow}} is 
computed. Conjunctions use the block boundaries of the clause that has the 
lowest cost, so this could explain why we are seeing a slowdown with 
AndHighOrMedMed (since the conjunction uses the block boundaries of OrMedMed) 
and not AndMedOrHighHigh (since the conjunction uses the block boundaries of 
Med). Maybe we could explore other approaches for {{advanceShallow}}, such as 
taking the minimum block boundary across essential clauses only instead of all 
clauses.
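
The two options for the disjunction's block boundary can be sketched as follows. This is only an illustration under stated assumptions, not the scorer's actual code: {{ClauseBlock}} and {{BlockBoundarySketch}} are hypothetical names, {{blockEnd}} stands for the last doc ID covered by a clause's current block, and {{essential}} is assumed to be known for each clause.

```java
import java.util.List;

/** Hypothetical view of one disjunction clause; not a Lucene class. */
final class ClauseBlock {
  final int blockEnd;      // last doc ID covered by the clause's current block
  final boolean essential; // whether the clause is currently essential

  ClauseBlock(int blockEnd, boolean essential) {
    this.blockEnd = blockEnd;
    this.essential = essential;
  }
}

final class BlockBoundarySketch {
  /** Boundary taken as the minimum block end across all clauses. */
  static int minOverAll(List<ClauseBlock> clauses) {
    int min = Integer.MAX_VALUE;
    for (ClauseBlock c : clauses) {
      min = Math.min(min, c.blockEnd);
    }
    return min;
  }

  /** Alternative: minimum block end across essential clauses only. */
  static int minOverEssential(List<ClauseBlock> clauses) {
    int min = Integer.MAX_VALUE;
    for (ClauseBlock c : clauses) {
      if (c.essential) {
        min = Math.min(min, c.blockEnd);
      }
    }
    return min;
  }
}
```

With made-up values, a non-essential clause whose block ends at doc 127 and an essential clause whose block ends at doc 511 give a boundary of 127 for {{minOverAll}} and 511 for {{minOverEssential}}, so an enclosing conjunction that follows the disjunction's boundaries would see longer blocks in the second case.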




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-10 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564747#comment-17564747
 ] 

Zach Chen commented on LUCENE-10480:


{quote}I'll see if I can run the original nightly benchmark code / tests from 
my machine to see if there's any difference.
{quote}
I tried to run *nightlyBench.py* locally on my machine over the weekend, but 
that turns out to require some changes to the script itself, and I haven't 
been able to run it fully so far.

On the other hand, I tried a few more run configurations with *localrun.py*, 
including running it in a virtual Ubuntu box (as the nightly benchmark runs on 
a Linux box), but have still had no luck reproducing the 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] 
slowdown.

[~jpountz], just curious, are you able to reproduce the slowdown locally on 
your end as well?




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-09 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564611#comment-17564611
 ] 

Zach Chen commented on LUCENE-10480:


{quote}[AndMedOrHighHigh|https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html]
 recovered fully, but 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 only a bit. I'm unsure what explains why there is still a slowdown compared to 
BMW.
{quote}
Hmm, this is quite strange. It looks like 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 still shows about a -13% (5 / 38) impact. I just ran the full suite of 
wikinightly tasks a few times (by copying *wikinightly.tasks* into 
*wikimedium.10M.nostopwords.tasks*, running *localrun.py* with source 
*wikimedium10m*, and removing the *VectorSearch* queries, as they were causing 
an NPE failure for me) but couldn't reproduce the slowdown (the baseline is 
head before all BMM changes):

 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
     BrowseRandomLabelSSDVFacets       20.83      (3.8%)       20.09      (6.5%)   -3.6% ( -13% -    6%) 0.034
           BrowseMonthSSDVFacets       30.36     (10.6%)       29.56     (12.7%)   -2.7% ( -23% -   23%) 0.473
                         Prefix3      402.70      (9.3%)      397.59      (9.9%)   -1.3% ( -18% -   19%) 0.674
               TermDayOfYearSort      183.55      (6.5%)      181.61      (6.9%)   -1.1% ( -13% -   13%) 0.617
                   TermTitleSort      195.99      (7.2%)      194.25      (8.1%)   -0.9% ( -15% -   15%) 0.713
                        PKLookup      293.80      (3.7%)      291.47      (4.8%)   -0.8% (  -8% -    7%) 0.555
                   TermMonthSort      283.86      (7.1%)      281.74      (8.0%)   -0.7% ( -14% -   15%) 0.755
                        Wildcard      227.26      (6.2%)      225.87      (6.4%)   -0.6% ( -12% -   12%) 0.759
                            Term     2227.50      (3.7%)     2219.57      (3.3%)   -0.4% (  -7% -    6%) 0.748
                          Fuzzy1      134.77      (2.8%)      134.37      (2.3%)   -0.3% (  -5% -    4%) 0.712
                    TermGroup100       53.61      (3.7%)       53.47      (4.6%)   -0.3% (  -8% -    8%) 0.846
                      TermDTSort      143.16      (3.2%)      142.89      (3.3%)   -0.2% (  -6% -    6%) 0.857
                  TermBGroup1M1P       79.44      (5.5%)       79.29      (5.5%)   -0.2% ( -10% -   11%) 0.917
        AndHighHighDayTaxoFacets       45.01      (2.3%)       44.94      (2.1%)   -0.1% (  -4% -    4%) 0.833
     BrowseRandomLabelTaxoFacets       30.94     (50.0%)       30.92     (46.8%)   -0.0% ( -64% -  193%) 0.998
         AndHighMedDayTaxoFacets       78.11      (3.2%)       78.11      (3.0%)   -0.0% (  -6% -    6%) 0.998
                          Phrase      202.17      (2.7%)      202.18      (2.0%)    0.0% (  -4% -    4%) 0.996
                          Fuzzy2       76.10      (2.6%)       76.15      (2.0%)    0.1% (  -4% -    4%) 0.933
                     TermGroup1M       22.65      (3.8%)       22.67      (3.2%)    0.1% (  -6% -    7%) 0.919
                  TermDateFacets       32.50      (5.3%)       32.60      (5.5%)    0.3% (  -9% -   11%) 0.861
       BrowseDayOfYearSSDVFacets       26.31      (5.9%)       26.39      (8.5%)    0.3% ( -13% -   15%) 0.897
                         Respell       88.21      (2.2%)       88.49      (2.1%)    0.3% (  -3% -    4%) 0.642
                        SpanNear       16.14      (4.0%)       16.22      (4.2%)    0.5% (  -7% -    9%) 0.706
            MedTermDayTaxoFacets       73.42      (4.8%)       73.85      (4.9%)    0.6% (  -8% -   10%) 0.708
                    TermBGroup1M       48.92      (4.2%)       49.23      (2.8%)    0.6% (  -6% -    8%) 0.581
                IntervalsOrdered       22.42      (5.8%)       22.59      (4.2%)    0.7% (  -8% -   11%) 0.651
          OrHighMedDayTaxoFacets       25.27      (6.1%)       25.46      (6.6%)    0.7% ( -11% -   14%) 0.711
                    TermGroup10K       30.26      (4.2%)       30.50      (2.9%)    0.8% (  -6% -    8%) 0.494
                    SloppyPhrase       91.40      (5.6%)       92.16      (6.3%)    0.8% ( -10% -   13%) 0.662
                          IntNRQ      152.74     (20.3%)      154.86     (17.1%)    1.4% ( -29% -   48%) 0.815
                      AndHighMed       88.55      (2.6%)       89.98      (3.1%)    1.6% (  -3% -    7%) 0.073
                     AndHighHigh       29.10      (2.7%)       29.68      (3.1%)    2.0% (  -3% -    8%) 0.032
       BrowseDayOfYearTaxoFacets       31.29     (40.0%)       31.93     (38.0%)    2.0% ( -54% -  133%) 0.869
            BrowseDateTaxoFacets       31.18     (40.3%)       31.87     (38.5%)    2.2% ( -54% -  

[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-09 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564565#comment-17564565
 ] 

Adrien Grand commented on LUCENE-10480:
---

[AndMedOrHighHigh|https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html]
 recovered fully, but 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 only a bit. I'm unsure what explains why there is still a slowdown compared to 
BMW.




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564326#comment-17564326
 ] 

ASF subversion and git services commented on LUCENE-10480:
--

Commit 090cbc50dd7e5659494149f470378ab7f6a90cf1 in lucene's branch 
refs/heads/branch_9x from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=090cbc50dd7 ]

LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve 
disjunction within conjunction (#1006) (#1008)

(cherry picked from commit da8143bfa38cd5fadae4b4712b9e639e79016021)




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563627#comment-17563627
 ] 

ASF subversion and git services commented on LUCENE-10480:
--

Commit da8143bfa38cd5fadae4b4712b9e639e79016021 in lucene's branch 
refs/heads/main from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=da8143bfa38 ]

LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve 
disjunction within conjunction (#1006)






[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-06 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563536#comment-17563536
 ] 

Zach Chen commented on LUCENE-10480:


Ok I see. Maybe I can also try to run some benchmark experiments with different 
JVM compilation / code cache parameters to further test things out. Will report 
back if I find something interesting!




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-06 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563022#comment-17563022
 ] 

Adrien Grand commented on LUCENE-10480:
---

I still suspect that one issue when only running queries that are very good at 
dynamic pruning is that the JVM doesn't have time to warm up. These queries can 
figure out the top 10 hits by only evaluating a few thousand hits, so parts of 
the logic probably still run in interpreted mode. The fact that queries 
run slower when you run them in isolation further suggests that this is the 
problematic scenario, not the case when the benchmark includes multiple types 
of queries?
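
As a generic illustration of the warm-up concern (not how the benchmark scripts in this thread are implemented), a timing helper can run a task untimed first so that hot code has a chance to be JIT-compiled before measurements start. {{WarmupSketch}} and {{meanNanos}} are hypothetical names, and suitable iteration counts depend on the JVM and the workload.

```java
import java.util.function.IntSupplier;

/** Minimal timing helper with a warm-up phase; an illustration only. */
final class WarmupSketch {
  /**
   * Runs the task warmupIters times without timing it, then returns the mean
   * nanoseconds per call over timedIters timed runs.
   */
  static double meanNanos(IntSupplier task, int warmupIters, int timedIters) {
    long sink = 0; // accumulate results so the calls are not trivially dead code
    for (int i = 0; i < warmupIters; i++) {
      sink += task.getAsInt();
    }
    long start = System.nanoTime();
    for (int i = 0; i < timedIters; i++) {
      sink += task.getAsInt();
    }
    long elapsed = System.nanoTime() - start;
    if (sink == Long.MIN_VALUE) {
      System.out.println(sink); // practically never taken; keeps sink observable
    }
    return (double) elapsed / timedIters;
  }
}
```

A task that only evaluates a few thousand hits per query may need many more warm-up calls than a heavier task before its code paths are compiled, which is one way a benchmark that mixes several query types could behave differently from one that runs a single query type in isolation.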




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-05 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562944#comment-17562944
 ] 

Zach Chen commented on LUCENE-10480:


{quote}maybe there are bits from advance() that we could move to matches() so 
that we would hand it over to the other clause before we start doing expensive 
operations like computing scores.
{quote}
This approach does help stabilize performance for disjunction within 
conjunction queries (and also provides some small gains)! I have opened a PR 
for it: [https://github.com/apache/lucene/pull/1006].
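
The idea of deferring expensive work from advance() to matches() can be sketched as below. This is a simplified model rather than the code from the PR: {{TwoPhaseSketch}}, its fields, and {{countMatches}} are hypothetical, and the "score" is just a precomputed array lookup standing in for real score computation.

```java
/** Hypothetical clause with a cheap approximation and a costly confirmation step. */
final class TwoPhaseSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs;     // sorted candidate doc IDs
  private final float[] scores; // score of each candidate, same order as docs
  private final float minCompetitiveScore;
  private int idx = 0;
  int scoreComputations = 0;    // counts how often the costly step ran

  TwoPhaseSketch(int[] docs, float[] scores, float minCompetitiveScore) {
    this.docs = docs;
    this.scores = scores;
    this.minCompetitiveScore = minCompetitiveScore;
  }

  /** Cheap step: position on the first candidate >= target, without scoring. */
  int advance(int target) {
    while (idx < docs.length && docs[idx] < target) {
      idx++;
    }
    return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
  }

  /** Costly step: only called once the other clause agreed on the current doc. */
  boolean matches() {
    scoreComputations++;
    return scores[idx] >= minCompetitiveScore;
  }

  /** Intersects with the lead clause's docs; scoring is skipped for docs the lead rejects. */
  static int countMatches(int[] lead, TwoPhaseSketch other) {
    int count = 0;
    for (int doc : lead) {
      int d = other.advance(doc);
      if (d == NO_MORE_DOCS) {
        break;
      }
      if (d == doc && other.matches()) {
        count++;
      }
    }
    return count;
  }
}
```

In this model the disjunction hands control back to the other clause as soon as the doc IDs disagree, so the costly step runs only for docs that both clauses contain.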




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-05 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562919#comment-17562919
 ] 

Zach Chen commented on LUCENE-10480:


{quote}Nightly benchmarks picked up the change and top-level disjunctions are 
seeing massive speedups, see 
[OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] 
or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. 
However disjunctions within conjunctions got a slowdown, see 
[AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 or 
[AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].
{quote}
The results look encouraging and interesting! I copied and pasted the boolean 
queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks* and 
ran the benchmark, and was able to reproduce the slowdown:

 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed      108.16      (6.5%)      100.44      (5.4%)   -7.1% ( -17% -    5%) 0.000
                AndMedOrHighHigh       68.37      (4.5%)       63.92      (5.0%)   -6.5% ( -15% -    3%) 0.000
                     AndHighHigh      122.90      (5.5%)      122.77      (5.5%)   -0.1% ( -10% -   11%) 0.952
                      AndHighMed      113.27      (6.4%)      114.63      (6.2%)    1.2% ( -10% -   14%) 0.546
                        PKLookup      228.08     (14.4%)      232.90     (14.7%)    2.1% ( -23% -   36%) 0.646
                      OrHighHigh       26.89      (5.7%)       48.62     (12.2%)   80.8% (  59% -  104%) 0.000
                       OrHighMed       81.18      (5.9%)      187.05     (12.2%)  130.4% ( 105% -  157%) 0.000
{code}
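
For anyone checking the numbers: the Pct diff column appears to be the relative QPS change of the modified version over the baseline. A minimal sketch, assuming that definition (the class and method names are hypothetical and this is not the benchmark tool's own code), recomputes two rows of the table above:

```java
/** Hypothetical helper for re-deriving the "Pct diff" column from the two QPS columns. */
final class PctDiffSketch {

    /** Relative QPS change of the candidate over the baseline, in percent. */
    static double pctDiff(double baselineQps, double candidateQps) {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // QPS values copied from the AndHighOrMedMed and OrHighMed rows.
        System.out.printf("AndHighOrMedMed: %.1f%%%n", pctDiff(108.16, 100.44));
        System.out.printf("OrHighMed: %.1f%%%n", pctDiff(81.18, 187.05));
    }
}
```

The parenthesised range and the p-value come from the per-run variance and are not reproduced here.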
 

 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       85.67      (5.3%)       73.23      (5.7%)  -14.5% ( -24% -   -3%) 0.000
                        PKLookup      260.08     (13.4%)      253.74     (14.9%)   -2.4% ( -27% -   29%) 0.586
                     AndHighHigh       73.68      (4.7%)       72.70      (4.1%)   -1.3% (  -9% -    7%) 0.339
                      AndHighMed       89.52      (5.1%)       88.55      (4.4%)   -1.1% ( -10% -    8%) 0.470
                 AndHighOrMedMed       63.27      (6.5%)       70.48      (5.7%)   11.4% (   0% -   25%) 0.000
                      OrHighHigh       19.60      (5.3%)       25.62      (7.6%)   30.8% (  16% -   46%) 0.000
                       OrHighMed      121.08      (5.7%)      236.34     (10.2%)   95.2% (  74% -  117%) 0.000
{code}
 

 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       86.88      (3.4%)       76.60      (3.1%)  -11.8% ( -17% -   -5%) 0.000
                     AndHighHigh       30.49      (3.5%)       30.36      (3.5%)   -0.4% (  -7% -    6%) 0.697
                      AndHighMed      192.76      (3.4%)      193.72      (3.9%)    0.5% (  -6% -    8%) 0.671
                        PKLookup      262.59      (5.5%)      264.52      (7.9%)    0.7% ( -11% -   14%) 0.731
                 AndHighOrMedMed       65.47      (3.8%)       73.43      (3.0%)   12.2% (   5% -   19%) 0.000
                      OrHighHigh       21.47      (4.1%)       36.94      (8.3%)   72.1% (  57% -   88%) 0.000
                       OrHighMed       99.91      (4.3%)      292.05     (12.9%)  192.3% ( 167% -  218%) 0.000
{code}
 

 

However, when I narrowed the tasks down to just conjunction + disjunction 
(with the default number of search threads), the results actually turned 
positive and were similar to what I saw earlier in 
[https://github.com/apache/lucene/pull/972#issuecomment-1166188875] 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed       58.65     (37.3%)       71.63     (28.9%)   22.1% ( -32% -  140%) 0.036
                AndMedOrHighHigh       36.43     (39.3%)       44.61     (30.7%)   22.4% ( -34% -  152%) 0.044
                        PKLookup      163.58     (34.4%)      211.88     (32.7%)   29.5% ( -27% -  147%) 0.005
{code}
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      146.51     (22.0%)      188.92     (30.1%)   28.9% ( -18% -  103%) 0.001
                AndMedOrHighHigh       35.59     (27.1%)       49.99     (37.5%)   40.4% ( -18% -  144%) 0.000
                 AndHighOrMedMed       44.47     (26.6%)       63.37     (35.8%)   

[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562730#comment-17562730
 ] 

Adrien Grand commented on LUCENE-10480:
---

Looking at this new scorer from the perspective of disjunctions within 
conjunctions, maybe there are bits from advance() that we could move to 
matches() so that we would hand it over to the other clause before we start 
doing expensive operations like computing scores. What do you think 
[~zacharymorn]?
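
To illustrate the idea, here is a minimal sketch of deferring expensive work from advance() to matches(). It is a toy over sorted int arrays and does not use Lucene's classes; the class name, the counter and the predicate standing in for "expensive operations" are all assumptions made for the example. The required clause drives iteration, and the expensive check only runs on docs where both clauses already agree:

```java
import java.util.function.IntPredicate;

/** Toy two-phase intersection; hypothetical names, not Lucene's API. */
final class TwoPhaseSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Counts how often the expensive confirmation step ran. */
    static int expensiveCalls = 0;

    /** Cheap approximation: first doc >= target in a sorted array. */
    static int advance(int[] sortedDocs, int target) {
        for (int doc : sortedDocs) {
            if (doc >= target) return doc;
        }
        return NO_MORE_DOCS;
    }

    /** Expensive confirmation, only called once the other clause agrees on the doc. */
    static boolean matches(int doc, IntPredicate expensiveCheck) {
        expensiveCalls++;
        return expensiveCheck.test(doc);
    }

    /** Intersects a required clause with a two-phase clause and counts confirmed docs. */
    static int countMatches(int[] required, int[] approx, IntPredicate expensiveCheck) {
        int count = 0;
        int doc = advance(required, 0);
        while (doc != NO_MORE_DOCS) {
            int candidate = advance(approx, doc);
            if (candidate == doc) {
                // Both clauses are on the same doc: now it is worth paying for the check.
                if (matches(doc, expensiveCheck)) count++;
                doc = advance(required, doc + 1);
            } else {
                // Hand control back to the required clause without any expensive work.
                doc = advance(required, candidate);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int n = countMatches(new int[] {1, 5, 9, 12}, new int[] {2, 5, 7, 9, 12, 20}, d -> d % 2 == 1);
        System.out.println(n + " matches, " + expensiveCalls + " expensive checks");
    }
}
```

In this toy the second clause has six candidate docs, but the expensive check runs only on the docs that the required clause also contains.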




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562711#comment-17562711
 ] 

Adrien Grand commented on LUCENE-10480:
---

Nightly benchmarks picked up the change and top-level disjunctions are seeing 
massive speedups, see 
[OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] 
or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. 
However disjunctions within conjunctions got a slowdown, see 
[AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 or 
[AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562266#comment-17562266
 ] 

ASF subversion and git services commented on LUCENE-10480:
--

Commit a5c99aca1abc9b73a0c68d4f23533311382b718c in lucene's branch 
refs/heads/branch_9x from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a5c99aca1ab ]

LUCENE-10480: Use BMM scorer for 2 clauses disjunction (#972) (#1002)

(cherry picked from commit 503ec5597331454bf8b6af79b9701cfdccf5)




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561787#comment-17561787
 ] 

ASF subversion and git services commented on LUCENE-10480:
--

Commit 503ec5597331454bf8b6af79b9701cfdccf5 in lucene's branch 
refs/heads/main from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=503ec559733 ]

LUCENE-10480: Use BMM scorer for 2 clauses disjunction (#972)






[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-06-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553494#comment-17553494
 ] 

Adrien Grand commented on LUCENE-10480:
---

Good question. Looking at your BlockMaxMaxScoreScorer, it looks like it also 
has potential for being specialized in the 2-clauses case by having two sub 
scorers and tracking during document collection whether the scorer that 
produces lower scores is optional or required. I didn't have concrete plans in 
mind when opening the issue; I was just observing that we pay significant 
overhead for supporting arbitrary numbers of clauses when disjunctions often 
have only two clauses.
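
A rough sketch of what such a 2-clauses specialization could look like, under stated assumptions: this is a toy over in-memory postings, not the BlockMaxMaxScoreScorer code, and every name in it (TwoClauseSketch, Postings, scoredDocs) is hypothetical. It uses a single global max score per clause rather than block maxima. Once the k-th best score reaches the lower clause's max score, that clause becomes optional and only contributes score to docs of the other clause; once it reaches the higher clause's max score as well, the lower clause becomes required.

```java
import java.util.PriorityQueue;

/** Toy top-k over exactly two clauses; hypothetical names, not Lucene's implementation. */
final class TwoClauseSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Sorted doc ids with per-doc scores and a global score upper bound. */
    static final class Postings {
        final int[] docs;
        final float[] scores;
        final float maxScore;
        int pos = 0;

        Postings(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
            float max = 0f;
            for (float s : scores) max = Math.max(max, s);
            this.maxScore = max;
        }

        int doc() { return pos < docs.length ? docs[pos] : NO_MORE_DOCS; }
        float score() { return scores[pos]; }
        void advance(int target) { while (doc() < target) pos++; }
    }

    /** Number of docs that were actually scored. */
    static int scoredDocs = 0;

    /** Returns a min-heap of the top-k scores; assumes high.maxScore >= low.maxScore. */
    static PriorityQueue<Float> topK(Postings high, Postings low, int k) {
        PriorityQueue<Float> heap = new PriorityQueue<>();
        float minCompetitive = Float.NEGATIVE_INFINITY;
        while (true) {
            // A hit must score strictly above minCompetitive to enter the heap.
            boolean lowRequired = minCompetitive >= high.maxScore;
            boolean lowOptional = !lowRequired && minCompetitive >= low.maxScore;
            int doc = (lowRequired || lowOptional) ? high.doc() : Math.min(high.doc(), low.doc());
            if (doc == NO_MORE_DOCS) break;
            low.advance(doc);
            if (lowRequired && low.doc() != doc) {
                high.advance(low.doc()); // a doc matching only one clause cannot compete
                continue;
            }
            float score = 0f;
            if (high.doc() == doc) score += high.score();
            if (low.doc() == doc) score += low.score();
            scoredDocs++;
            if (heap.size() < k) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll();
                heap.add(score);
            }
            if (heap.size() == k) minCompetitive = heap.peek();
            high.advance(doc + 1);
            low.advance(doc + 1);
        }
        return heap;
    }

    public static void main(String[] args) {
        Postings high = new Postings(new int[] {1, 4, 7, 10}, new float[] {2.0f, 3.0f, 1.0f, 2.5f});
        Postings low = new Postings(new int[] {2, 4, 5, 8, 10, 11}, new float[] {0.5f, 1.0f, 0.75f, 0.5f, 0.25f, 1.0f});
        System.out.println(topK(high, low, 2) + " after scoring " + scoredDocs + " docs");
    }
}
```

With only two clauses the state reduces to two booleans derived from the current minimum competitive score, with no linked list or priority queues of sub scorers to maintain.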




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-06-09 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552556#comment-17552556
 ] 

Zach Chen commented on LUCENE-10480:


Hi [~jpountz], this issue reminded me of our experiments last year 
implementing a BMM scorer for pure disjunctions, which [showed about 20% ~ 40% 
improvement for OrHighHigh and OrHighMed 
queries|https://github.com/apache/lucene/pull/101#issuecomment-840255508]. Do 
you think we should continue to explore in that direction, or might there be 
better / simpler algorithms we could look into?
