[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-12-29 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306125#comment-16306125
 ] 

ASF subversion and git services commented on LUCENE-7993:
-

Commit c95dc6d95743f4a9a1ffe9baa04c3a9d1e3acdf9 in lucene-solr's branch 
refs/heads/master from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c95dc6d ]

LUCENE-7993: Faster phrases if total hit counts are not required.


> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch, LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-12-07 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281873#comment-16281873
 ] 

David Smiley commented on LUCENE-7993:
--

Nice work!

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch, LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-10-26 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220743#comment-16220743
 ] 

Adrien Grand commented on LUCENE-7993:
--

Benchmarks on wikibig this time, which is more appropriate since artificially 
truncated documents defeat the purpose of this optimization. HighPrase is now 
3x faster.

{noformat}
TaskQPS baseline  StdDev   QPS patch  StdDev
Pct diff
  OrHighHigh   97.15  (3.7%)   85.83  (3.6%)  
-11.7% ( -18% -   -4%)
  Fuzzy2  142.85  (8.7%)  131.63 (11.0%)   
-7.9% ( -25% -   12%)
  Fuzzy1  216.22  (9.6%)  200.10  (8.1%)   
-7.5% ( -22% -   11%)
 MedSloppyPhrase8.02  (7.4%)7.78 (10.1%)   
-3.0% ( -19% -   15%)
HighSloppyPhrase   31.23  (5.7%)   30.59  (7.7%)   
-2.0% ( -14% -   12%)
 MedSpanNear  124.68  (4.7%)  122.26  (4.7%)   
-1.9% ( -10% -7%)
 LowSpanNear   34.39  (8.2%)   33.90  (8.0%)   
-1.4% ( -16% -   16%)
 LowSloppyPhrase   27.55  (5.1%)   27.28  (6.8%)   
-1.0% ( -12% -   11%)
  IntNRQ  164.57  (7.2%)  163.10  (8.5%)   
-0.9% ( -15% -   16%)
HighSpanNear   48.43  (4.5%)   48.03  (4.2%)   
-0.8% (  -9% -8%)
 Respell  226.20  (3.1%)  225.11  (4.7%)   
-0.5% (  -8% -7%)
  AndHighLow 1211.79  (3.9%) 1211.37  (3.1%)   
-0.0% (  -6% -7%)
  AndHighMed  130.59  (2.0%)  130.71  (1.8%)
0.1% (  -3% -3%)
   HighTermMonthSort  307.88  (7.8%)  308.47  (8.4%)
0.2% ( -14% -   17%)
 MedTerm  361.52  (2.9%)  362.23  (2.8%)
0.2% (  -5% -6%)
 AndHighHigh  114.80  (1.9%)  115.38  (1.8%)
0.5% (  -3% -4%)
 Prefix3  248.47  (5.0%)  249.86  (5.3%)
0.6% (  -9% -   11%)
HighTerm  201.95  (2.9%)  203.53  (2.9%)
0.8% (  -4% -6%)
Wildcard  224.17  (4.4%)  226.12  (3.9%)
0.9% (  -7% -9%)
 LowTerm 1862.62  (3.6%) 1903.87  (4.2%)
2.2% (  -5% -   10%)
   OrHighMed  106.09  (4.6%)  145.10  (5.5%)   
36.8% (  25% -   49%)
   LowPhrase   81.86  (5.9%)  112.43  (3.5%)   
37.4% (  26% -   49%)
   HighTermDayOfYearSort  227.00  (7.3%)  312.89 (10.6%)   
37.8% (  18% -   60%)
   MedPhrase   17.95 (14.2%)   43.93 (15.1%)  
144.7% ( 101% -  202%)
  HighPhrase   29.28  (7.5%)   87.43  (8.6%)  
198.6% ( 169% -  231%)
   OrHighLow  110.21  (3.9%)  835.01 (34.0%)  
657.6% ( 596% -  723%)
{noformat}

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-10-24 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216607#comment-16216607
 ] 

Adrien Grand commented on LUCENE-7993:
--

bq. where we look at maximum tf value versus the minimum one

I think it is a bit more complicated than that? For instance {{"a b a b"}} has 
3 matches for {{"a b"~2}}?

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-10-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213858#comment-16213858
 ] 

Robert Muir commented on LUCENE-7993:
-

I see. I wonder if we could try a simple "degraded" form of the optimization at 
first, where we look at maximum tf value versus the minimum one. In other 
words, if freq(a)=4 and freq(b)=2, we'd test score(4) for sloppyPhrase instead 
of score(2) like we do for exactPhrase.

I realize this is not very good and really makes the optimization significantly 
less potent, but perhaps still avoids reading a lot of positions, safe and easy 
as a start?

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-10-21 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213772#comment-16213772
 ] 

Adrien Grand commented on LUCENE-7993:
--

Yes, I think it would be possible. I started with exact phrases which are 
easier to reason about but we should definitely think about sloppy phrases too. 
I think it just needs a bit more thinking given that a term can count twice in 
the frequency of sloppy freqs, eg. if you search for "a b"~3 and your document 
contains "a b a" (two matches in spite of a freq of 1 for b)?

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-10-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211856#comment-16211856
 ] 

Robert Muir commented on LUCENE-7993:
-

Do we think the opto can be extended to sloppy phrase queries as well? E.G. 
assuming we had a similar requirement on sloppyFreq such that "more slop only 
makes scores worse" and so on? I feel like its possible. Just thinking thru how 
we should exactly state the requirements on the Similarity here.

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-10-17 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208508#comment-16208508
 ] 

Adrien Grand commented on LUCENE-7993:
--

Thanks for looking! I opened LUCENE-7997.

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-10-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204642#comment-16204642
 ] 

Robert Muir commented on LUCENE-7993:
-

feel free to open separate issue for the sim testing/cleanup, i can try to 
assist with that stuff. I have done battle with many of these before over this 
same stuff.

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-10-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204641#comment-16204641
 ] 

Robert Muir commented on LUCENE-7993:
-

ok, i get it, the optimization is ok since we actually call score for the doc 
with the theoretically maximum possible TF (taking its norm into account), just 
before reading positions.

Note that this optimization is definitely unsafe for some broken similarities 
(specifically the ones documented to be broken in this way, such as DFR model 
P), and probably also for certain parameters to e.g. DFR NormalizationXXX. 
Additionally some similarities (such as AxiomaticXYZ) are not in our random 
test framework, so its unknown there. We could use some explicit tests rather 
than relying on the test suite in that way, too.

But the requirement is 100% reasonable, we can't let some fundamentally broken 
formulas get in our way here :) I would go a step further and say that maybe 
these broken ones should move to the sandbox?

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

2017-10-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204639#comment-16204639
 ] 

Robert Muir commented on LUCENE-7993:
-

{quote}
if we can assume that the score increases (or stagnates) when the term freq 
increases, which sounds like an ok requirement for a sane Similarity?
{quote}

But what about length normalization?

> Speed up phrase queries when total hit count is not needed
> --
>
> Key: LUCENE-7993
> URL: https://issues.apache.org/jira/browse/LUCENE-7993
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7993.patch
>
>
> Follow-up of LUCENE-4100: When thinking about the API that we needed to 
> introduce to support MAXSCORE, I wondered whether the same API could support 
> other optimizations. The idea is that when running phrase queries, before we 
> start reading positions, we already have access to the term frequency of each 
> term. And the frequency of the phrase is bounded by the minimum term 
> frequency of the involved terms. So if the score for that minimum term 
> frequency is not competitive then it means that the score for the phrase is 
> not competitive either if we can assume that the score increases (or 
> stagnates) when the term freq increases, which sounds like an ok requirement 
> for a sane Similarity?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org