[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306125#comment-16306125 ] ASF subversion and git services commented on LUCENE-7993: - Commit c95dc6d95743f4a9a1ffe9baa04c3a9d1e3acdf9 in lucene-solr's branch refs/heads/master from [~jpountz] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c95dc6d ] LUCENE-7993: Faster phrases if total hit counts are not required. > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch, LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281873#comment-16281873 ] David Smiley commented on LUCENE-7993: -- Nice work! > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch, LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220743#comment-16220743 ] Adrien Grand commented on LUCENE-7993: -- Benchmarks on wikibig this time, which is more appropriate since artificially truncated documents defeat the purpose of this optimization. HighPrase is now 3x faster. {noformat} TaskQPS baseline StdDev QPS patch StdDev Pct diff OrHighHigh 97.15 (3.7%) 85.83 (3.6%) -11.7% ( -18% - -4%) Fuzzy2 142.85 (8.7%) 131.63 (11.0%) -7.9% ( -25% - 12%) Fuzzy1 216.22 (9.6%) 200.10 (8.1%) -7.5% ( -22% - 11%) MedSloppyPhrase8.02 (7.4%)7.78 (10.1%) -3.0% ( -19% - 15%) HighSloppyPhrase 31.23 (5.7%) 30.59 (7.7%) -2.0% ( -14% - 12%) MedSpanNear 124.68 (4.7%) 122.26 (4.7%) -1.9% ( -10% -7%) LowSpanNear 34.39 (8.2%) 33.90 (8.0%) -1.4% ( -16% - 16%) LowSloppyPhrase 27.55 (5.1%) 27.28 (6.8%) -1.0% ( -12% - 11%) IntNRQ 164.57 (7.2%) 163.10 (8.5%) -0.9% ( -15% - 16%) HighSpanNear 48.43 (4.5%) 48.03 (4.2%) -0.8% ( -9% -8%) Respell 226.20 (3.1%) 225.11 (4.7%) -0.5% ( -8% -7%) AndHighLow 1211.79 (3.9%) 1211.37 (3.1%) -0.0% ( -6% -7%) AndHighMed 130.59 (2.0%) 130.71 (1.8%) 0.1% ( -3% -3%) HighTermMonthSort 307.88 (7.8%) 308.47 (8.4%) 0.2% ( -14% - 17%) MedTerm 361.52 (2.9%) 362.23 (2.8%) 0.2% ( -5% -6%) AndHighHigh 114.80 (1.9%) 115.38 (1.8%) 0.5% ( -3% -4%) Prefix3 248.47 (5.0%) 249.86 (5.3%) 0.6% ( -9% - 11%) HighTerm 201.95 (2.9%) 203.53 (2.9%) 0.8% ( -4% -6%) Wildcard 224.17 (4.4%) 226.12 (3.9%) 0.9% ( -7% -9%) LowTerm 1862.62 (3.6%) 1903.87 (4.2%) 2.2% ( -5% - 10%) OrHighMed 106.09 (4.6%) 145.10 (5.5%) 36.8% ( 25% - 49%) LowPhrase 81.86 (5.9%) 112.43 (3.5%) 37.4% ( 26% - 49%) HighTermDayOfYearSort 227.00 (7.3%) 312.89 (10.6%) 37.8% ( 18% - 60%) MedPhrase 17.95 (14.2%) 43.93 (15.1%) 144.7% ( 101% - 202%) HighPhrase 29.28 (7.5%) 87.43 (8.6%) 198.6% ( 169% - 231%) OrHighLow 110.21 (3.9%) 835.01 (34.0%) 657.6% ( 596% - 723%) {noformat} > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216607#comment-16216607 ] Adrien Grand commented on LUCENE-7993: -- bq. where we look at maximum tf value versus the minimum one I think it is a bit more complicated than that? For instance {{"a b a b"}} has 3 matches for {{"a b"~2}}? > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213858#comment-16213858 ] Robert Muir commented on LUCENE-7993: - I see. I wonder if we could try a simple "degraded" form of the optimization at first, where we look at maximum tf value versus the minimum one. In other words, if freq(a)=4 and freq(b)=2, we'd test score(4) for sloppyPhrase instead of score(2) like we do for exactPhrase. I realize this is not very good and really makes the optimization significantly less potent, but perhaps still avoids reading a lot of positions, safe and easy as a start? > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213772#comment-16213772 ] Adrien Grand commented on LUCENE-7993: -- Yes, I think it would be possible. I started with exact phrases which are easier to reason about but we should definitely think about sloppy phrases too. I think it just needs a bit more thinking given that a term can count twice in the frequency of sloppy freqs, eg. if you search for "a b"~3 and your document contains "a b a" (two matches in spite of a freq of 1 for b)? > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211856#comment-16211856 ] Robert Muir commented on LUCENE-7993: - Do we think the opto can be extended to sloppy phrase queries as well? E.G. assuming we had a similar requirement on sloppyFreq such that "more slop only makes scores worse" and so on? I feel like its possible. Just thinking thru how we should exactly state the requirements on the Similarity here. > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208508#comment-16208508 ] Adrien Grand commented on LUCENE-7993: -- Thanks for looking! I opened LUCENE-7997. > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204642#comment-16204642 ] Robert Muir commented on LUCENE-7993: - feel free to open separate issue for the sim testing/cleanup, i can try to assist with that stuff. I have done battle with many of these before over this same stuff. > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204641#comment-16204641 ] Robert Muir commented on LUCENE-7993: - ok, i get it, the optimization is ok since we actually call score for the doc with the theoretically maximum possible TF (taking its norm into account), just before reading positions. Note that this optimization is definitely unsafe for some broken similarities (specifically the ones documented to be broken in this way, such as DFR model P), and probably also for certain parameters to e.g. DFR NormalizationXXX. Additionally some similarities (such as AxiomaticXYZ) are not in our random test framework, so its unknown there. We could use some explicit tests rather than relying on the test suite in that way, too. But the requirement is 100% reasonable, we can't let some fundamentally broken formulas get in our way here :) I would go a step further and say that maybe these broken ones should move to the sandbox? > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7993) Speed up phrase queries when total hit count is not needed
[ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204639#comment-16204639 ] Robert Muir commented on LUCENE-7993: - {quote} if we can assume that the score increases (or stagnates) when the term freq increases, which sounds like an ok requirement for a sane Similarity? {quote} But what about length normalization? > Speed up phrase queries when total hit count is not needed > -- > > Key: LUCENE-7993 > URL: https://issues.apache.org/jira/browse/LUCENE-7993 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-7993.patch > > > Follow-up of LUCENE-4100: When thinking about the API that we needed to > introduce to support MAXSCORE, I wondered whether the same API could support > other optimizations. The idea is that when running phrase queries, before we > start reading positions, we already have access to the term frequency of each > term. And the frequency of the phrase is bounded by the minimum term > frequency of the involved terms. So if the score for that minimum term > frequency is not competitive then it means that the score for the phrase is > not competitive either if we can assume that the score increases (or > stagnates) when the term freq increases, which sounds like an ok requirement > for a sane Similarity? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org