[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?
[ https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883470#comment-16883470 ]

ASF subversion and git services commented on LUCENE-8875:
---------------------------------------------------------

Commit ee79a20174528a99b1a805af5ce2212276db1630 in lucene-solr's branch refs/heads/jira/SOLR-13565 from Atri Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ee79a20 ]

LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits (#754)

This commit introduces a new collector which is optimized for cases when the number of hits is large and/or the actual hits collected are sparse in comparison to the number of hits requested.

> Should TopScoreDocCollector Always Populate Sentinel Values?
> ------------------------------------------------------------
>
>                 Key: LUCENE-8875
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8875
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Atri Sharma
>            Priority: Major
>             Fix For: 8.2
>
>          Time Spent: 9h
>  Remaining Estimate: 0h
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation,
> and instructs HitQueue to populate itself with sentinels. While this is a
> great safety mechanism, for very large datasets where the query's
> selectivity is high, the sentinel population can be redundant and can
> become a large enough bottleneck in itself. Does it make sense to introduce
> a new parameter in TopScoreDocCollector which uses a heuristic (say number
> of hits > 10k) and does not populate sentinels?

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
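The commit summary above mentions the case where the hits actually collected are sparse compared to the number requested. A minimal sketch of one way a collector could avoid up-front queue work in that case follows. It is not the committed collector: `LazyTopNSketch`, `Hit`, and every method name here are hypothetical, and only `java.util` is used. Hits are buffered in a plain list, and a priority queue is built only once the buffer reaches the requested size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch; not the collector committed for LUCENE-8875. */
final class LazyTopNSketch {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Orders the weakest hit first: lower score, then higher doc ID on ties. */
    static int compareWeakestFirst(Hit a, Hit b) {
        int c = Float.compare(a.score, b.score);
        return c != 0 ? c : Integer.compare(b.doc, a.doc);
    }

    private final int requested;
    private List<Hit> buffer = new ArrayList<>(); // used while fewer than `requested` hits were seen
    private PriorityQueue<Hit> queue;             // built only once the buffer fills up

    LazyTopNSketch(int requested) {
        if (requested < 1) throw new IllegalArgumentException("requested must be >= 1");
        this.requested = requested;
    }

    void collect(int doc, float score) {
        Hit hit = new Hit(doc, score);
        if (queue == null) {
            // Sparse phase: no sentinels, no heap maintenance, just append.
            buffer.add(hit);
            if (buffer.size() == requested) {
                queue = new PriorityQueue<>(requested, LazyTopNSketch::compareWeakestFirst);
                queue.addAll(buffer);
                buffer = null;
            }
        } else if (compareWeakestFirst(hit, queue.peek()) > 0) {
            // Dense phase: the new hit beats the current weakest one, so replace it.
            queue.poll();
            queue.add(hit);
        }
    }

    boolean usedQueue() { return queue != null; }

    /** Returns the collected hits, best first. */
    List<Hit> topHits() {
        List<Hit> out = new ArrayList<>();
        if (queue != null) out.addAll(queue); else out.addAll(buffer);
        out.sort((a, b) -> compareWeakestFirst(b, a));
        return out;
    }
}
```

With this shape, a query that requests 10,000 hits but matches only a handful never allocates or fills a 10,000-slot queue; the trade-off is a second code path once the buffer is full.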
[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?
[ https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882295#comment-16882295 ]

ASF subversion and git services commented on LUCENE-8875:
---------------------------------------------------------

Commit 7339eb272c30e993e0a8e73154fdfca8ef9879e4 in lucene-solr's branch refs/heads/branch_8x from Atri Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7339eb2 ]

LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits (#754)

This commit introduces a new collector which is optimized for cases when the number of hits is large and/or the actual hits collected are sparse in comparison to the number of hits requested.
[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?
[ https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882278#comment-16882278 ]

ASF subversion and git services commented on LUCENE-8875:
---------------------------------------------------------

Commit ee79a20174528a99b1a805af5ce2212276db1630 in lucene-solr's branch refs/heads/master from Atri Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ee79a20 ]

LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits (#754)

This commit introduces a new collector which is optimized for cases when the number of hits is large and/or the actual hits collected are sparse in comparison to the number of hits requested.
[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?
[ https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871732#comment-16871732 ]

Adrien Grand commented on LUCENE-8875:
--------------------------------------

Aggregates don't have this issue since they don't track top hits?

+1 to having a separate collector for large N values in sandbox.
[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?
[ https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871620#comment-16871620 ]

Atri Sharma commented on LUCENE-8875:
-------------------------------------

I meant Elasticsearch aggregates (although, on second thought, I am not sure the proposed collector brings a direct improvement on that front).

The main point is that I believe the least invasive path is to introduce a new collector which clearly states that it is meant for cases where N is very large (>10k?), and which lists its benefits and trade-offs clearly.

Are there any catches that apply here?
[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?
[ https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871560#comment-16871560 ]

Adrien Grand commented on LUCENE-8875:
--------------------------------------

What do you mean by bucket aggregates?
[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?
[ https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871543#comment-16871543 ]

Atri Sharma commented on LUCENE-8875:
-------------------------------------

While I agree that collecting a very large number of hits is not what top N hits are intended for, some increasingly popular use cases lean in that direction (bucket aggregates?).

I think it would be fair to let such users use a different Collector which optimises their case without muddling the commonly used code path. WDYT?
[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?
[ https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871508#comment-16871508 ]

Adrien Grand commented on LUCENE-8875:
--------------------------------------

I like pre-populating the hit queue mostly because it makes the collector code simpler and likely a bit faster. As a comparison, TopFieldCollector can't pre-populate the hit queue, which forces it to have different code paths for the case where the priority queue is full (the common path) and the case where the queue is not full yet.

In general I see a large number of hits as an abuse case.
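The difference between the two collection styles described in this comment can be sketched as follows. This is an illustration only, not Lucene's code: `PrefillSketch`, `ScoreHeap`, and the two `collect*` methods are hypothetical names, and the heap tracks bare scores rather than ScoreDoc objects. The pre-filled variant pays an O(capacity) fill up front and then needs a single comparison per hit; the lazy variant skips the fill but must check on every hit whether the queue is full.

```java
import java.util.Arrays;

/** Hypothetical sketch of the two collection styles; not Lucene's code. */
final class PrefillSketch {
    /** Fixed-capacity min-heap of scores stored in heap[0..size). */
    static final class ScoreHeap {
        final float[] heap;
        int size;

        ScoreHeap(int capacity, boolean prefill) {
            heap = new float[capacity];
            if (prefill) {
                // Sentinels: O(capacity) work before the first hit arrives.
                Arrays.fill(heap, Float.NEGATIVE_INFINITY);
                size = capacity;
            }
        }

        float top() { return heap[0]; }

        void add(float score) {
            heap[size] = score;
            siftUp(size);
            size++;
        }

        void replaceTop(float score) {
            heap[0] = score;
            siftDown(0);
        }

        private void siftUp(int i) {
            float v = heap[i];
            while (i > 0) {
                int parent = (i - 1) >>> 1;
                if (heap[parent] <= v) break;
                heap[i] = heap[parent];
                i = parent;
            }
            heap[i] = v;
        }

        private void siftDown(int i) {
            float v = heap[i];
            while (true) {
                int child = 2 * i + 1;
                if (child >= size) break;
                if (child + 1 < size && heap[child + 1] < heap[child]) child++;
                if (heap[child] >= v) break;
                heap[i] = heap[child];
                i = child;
            }
            heap[i] = v;
        }
    }

    /** Pre-filled queue: the queue is always "full", so there is one code path. */
    static void collectPrefilled(ScoreHeap pq, float score) {
        if (score > pq.top()) pq.replaceTop(score);
    }

    /** Lazily filled queue: every hit first checks whether the queue is full. */
    static void collectLazy(ScoreHeap pq, float score) {
        if (pq.size < pq.heap.length) pq.add(score);
        else if (score > pq.top()) pq.replaceTop(score);
    }

    /** Sorted copy of the live heap entries (sentinels included if still present). */
    static float[] contents(ScoreHeap pq) {
        float[] out = Arrays.copyOf(pq.heap, pq.size);
        Arrays.sort(out);
        return out;
    }
}
```

Both variants end up with the same top scores once more hits than the capacity have been seen. They differ only when hits are sparse: the pre-filled heap still holds sentinels that have to be dropped when results are read.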
[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?
[ https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870679#comment-16870679 ]

Atri Sharma commented on LUCENE-8875:
-------------------------------------

Another thing to explore is to use a sleek set of arrays instead of ScoreDocs:

[https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/]

Maybe have a new implementation of a PQ using this idea, and a new Collector which uses threshold-based sentinel filling plus the new PQ? It would only be used for very large N.
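To make the "arrays instead of ScoreDocs" idea concrete, here is a rough sketch of a top-N heap that stores each hit as one packed `long` in a `long[]`, so no object is allocated per hit. It is an assumption-laden illustration, not the design from the linked post or from any patch: `PackedHitsSketch` and its methods are hypothetical, and the packing assumes non-negative scores and non-negative doc IDs so that comparing the packed longs orders hits by score first and then prefers the lower doc ID on ties.

```java
import java.util.Arrays;

/** Hypothetical sketch of an array-backed hit queue; not Lucene's code. */
final class PackedHitsSketch {
    /** Packs a hit so that a larger long means a better hit (assumes score >= 0, doc >= 0). */
    static long pack(int doc, float score) {
        return ((long) Float.floatToIntBits(score) << 32) | (Integer.MAX_VALUE - doc);
    }

    static int doc(long packed) { return Integer.MAX_VALUE - (int) packed; }

    static float score(long packed) { return Float.intBitsToFloat((int) (packed >>> 32)); }

    /** Returns the best n hits, best first, using a long[] min-heap. */
    static long[] topN(int[] docs, float[] scores, int n) {
        if (n <= 0) return new long[0];
        long[] heap = new long[n];
        int size = 0;
        for (int i = 0; i < docs.length; i++) {
            long hit = pack(docs[i], scores[i]);
            if (size < n) {
                heap[size] = hit;
                siftUp(heap, size);
                size++;
            } else if (hit > heap[0]) {
                heap[0] = hit;
                siftDown(heap, size, 0);
            }
        }
        long[] out = Arrays.copyOf(heap, size);
        Arrays.sort(out); // ascending, i.e. weakest first
        for (int lo = 0, hi = out.length - 1; lo < hi; lo++, hi--) {
            long tmp = out[lo]; out[lo] = out[hi]; out[hi] = tmp; // reverse to best first
        }
        return out;
    }

    private static void siftUp(long[] heap, int i) {
        long v = heap[i];
        while (i > 0) {
            int parent = (i - 1) >>> 1;
            if (heap[parent] <= v) break;
            heap[i] = heap[parent];
            i = parent;
        }
        heap[i] = v;
    }

    private static void siftDown(long[] heap, int size, int i) {
        long v = heap[i];
        while (true) {
            int child = 2 * i + 1;
            if (child >= size) break;
            if (child + 1 < size && heap[child + 1] < heap[child]) child++;
            if (heap[child] >= v) break;
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = v;
    }
}
```

A single primitive array keeps the heap contiguous in memory and removes per-hit allocation, which is where a very large N would be expected to hurt most; whether that pays off in practice would need benchmarking.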