Mike Sokolov created LUCENE-8681:
------------------------------------

             Summary: Prorated early termination
                 Key: LUCENE-8681
                 URL: https://issues.apache.org/jira/browse/LUCENE-8681
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/search
            Reporter: Mike Sokolov


In this issue we'll exploit the distribution of top K documents among segments 
to extract performance gains when using early termination. The basic idea is we 
do not need to collect K documents from every segment and then merge. Rather we 
can collect a number of documents that is proportional to the segment's size 
plus an error bound derived from the combinatorics seen as a (multinomial) 
probability distribution.

https://github.com/apache/lucene-solr/pull/564 has the proposed change.

[~rcmuir] pointed out on the mailing list that this patch confounds two 
settings: (1) whether to collect all hits, ensuring correct hit counts, and (2) 
whether to guarantee that the top K hits are precisely the top K.

The current patch treats this as the same thing. It takes the position that if 
the user says it's OK to have approximate counts, then it's also OK to 
introduce some small chance of ranking error; occasionally some of the top K we 
return may draw from the top K + epsilon.

Instead we could provide some additional knobs to the user. Currently the 
public API is {{TopFieldCOllector.create(Sort, int, FieldDoc, int threshold)}}. 
The threshold parameter controls when to apply early termination; it allows the 
collector to terminate once the given number of documents have been collected.

Instead of using the same threshold to control leaf-level early termination, we 
could provide an additional leaf-level parameter. For example, this could be a 
scale factor on the error bound, eg a number of standard deviations to apply. 
The patch uses 3, but a much more conservative bound would be 4 or even 5. With 
these values, some speedup would still result, but with a much lower level of 
ranking errors. A value of MAX_INT would ensure no leaf-level termination would 
ever occur.

We could also hide the precise numerical bound and offer users a three-way enum 
(EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to apply 
this optimization, using some predetermined error bound.

I posted the patch without any user-level tuning since I think the user has 
already indicated a preference for speed over precision by specifying a finite 
(global) threshold, but if we want to provide finer control, these two options 
seem to make the most sense to me. Providing access to the number of standard 
deviation to allow from the expected distribution gives the user the finest 
control, but it could be hard to explain its proper use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to