[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-07-11 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883470#comment-16883470
 ] 

ASF subversion and git services commented on LUCENE-8875:
-

Commit ee79a20174528a99b1a805af5ce2212276db1630 in lucene-solr's branch 
refs/heads/jira/SOLR-13565 from Atri Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ee79a20 ]

LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits (#754)

This commit introduces a new collector which is optimized for
cases when the number of hits is large and/or the actual hits
collected are sparse in comparison to the number of hits
requested.

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: 8.2
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instruct HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-07-10 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882295#comment-16882295
 ] 

ASF subversion and git services commented on LUCENE-8875:
-

Commit 7339eb272c30e993e0a8e73154fdfca8ef9879e4 in lucene-solr's branch 
refs/heads/branch_8x from Atri Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7339eb2 ]

LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits (#754)

This commit introduces a new collector which is optimized for
cases when the number of hits is large and/or the actual hits
collected are sparse in comparison to the number of hits
requested.

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instruct HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-07-10 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882278#comment-16882278
 ] 

ASF subversion and git services commented on LUCENE-8875:
-

Commit ee79a20174528a99b1a805af5ce2212276db1630 in lucene-solr's branch 
refs/heads/master from Atri Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ee79a20 ]

LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits (#754)

This commit introduces a new collector which is optimized for
cases when the number of hits is large and/or the actual hits
collected are sparse in comparison to the number of hits
requested.

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instruct HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-24 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871732#comment-16871732
 ] 

Adrien Grand commented on LUCENE-8875:
--

Aggregates don't have this issue since they don't track top hits?

+1 to having a separate collector for large N values in sandbox.

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instruct HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-24 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871620#comment-16871620
 ] 

Atri Sharma commented on LUCENE-8875:
-

I meant Elasticsearch aggregates (although I am not sure if  this new proposed 
collector has a direct improvement in that front,  on second thought).

The meat of the point here is that I believe it is the path of minimal invasion 
if we introduced a new collector which clearly calls out that it is meant for 
cases when N is very large (>10k?), and lists out the benefits and trade offs 
clearly.

Are there any catches that are applicable here?

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instruct HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-24 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871560#comment-16871560
 ] 

Adrien Grand commented on LUCENE-8875:
--

What do you mean by bucket aggregates?

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instruct HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-24 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871543#comment-16871543
 ] 

Atri Sharma commented on LUCENE-8875:
-

While I do agree that too many hits are not what top N hits are intended for, 
but some increasing popular use cases are inclined in that direction (bucket 
aggregates?) I think it would be fair to allow such users to use a different 
Collector which optimises their case while not muddling with the commonly used 
code path. WDYT?

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instruct HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-24 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871508#comment-16871508
 ] 

Adrien Grand commented on LUCENE-8875:
--

I like pre-populating the hit queue mostly because it makes the collector code 
simpler and likely a bit faster. As a comparison TopFieldCollector can't 
pre-populate the hit queue, which forces it to have different code paths for 
the case that the priority queue is full (common path) or that the queue is not 
full yet. In general I'm seeing large number of hits as an abuse case.

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instruct HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-23 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870679#comment-16870679
 ] 

Atri Sharma commented on LUCENE-8875:
-

Another thing to explore is to have a sleek set of arrays instead of ScoreDocs: 
[https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/]

 

Maybe have a new implementation of a PQ using this idea, and a new Collector 
which uses the threshold sentinel filling + the new PQ? Only used for very 
large N?

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instruct HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org