Nikolay Khitrin created LUCENE-8178:
---------------------------------------

             Summary: Bulk operations for LongValues and Sorted[Set]DocValues
                 Key: LUCENE-8178
                 URL: https://issues.apache.org/jira/browse/LUCENE-8178
             Project: Lucene - Core
          Issue Type: Improvement
    Affects Versions: 7.2.1
            Reporter: Nikolay Khitrin


One-by-one DocValues iteration with {{advanceExact}} and {{nextOrd}}/{{ordValue}} 
is really slow for bulk operations such as faceting. Reading and unpacking 
integers in blocks is substantially faster, but DocValues can currently be queried 
for only a single document at a time.

To apply document-based bulk processing, {{DocIdSetIterator}} matches have to be 
split into runs of sequential docIDs and remapped to the underlying {{LongValues}} 
positions.
After this transformation, relatively large linear scans can be performed over 
the packed integers.
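
The run-splitting step described above can be sketched roughly as follows. This is only an illustration, not code from the patch: {{DocIdRuns}} and {{split}} are hypothetical names, and a plain ascending {{int[]}} stands in for the {{DocIdSetIterator}} so that the example compiles without Lucene. It also assumes the dense case, where a docID maps directly to a value position; sparse fields would need an extra remapping step.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the patch): groups ascending docIDs into runs. */
final class DocIdRuns {
    private DocIdRuns() {}

    /**
     * Splits strictly ascending docIDs into runs of consecutive IDs.
     * Each run is returned as {startInclusive, endExclusive}.
     */
    static List<int[]> split(int[] sortedDocIds) {
        List<int[]> runs = new ArrayList<>();
        if (sortedDocIds.length == 0) {
            return runs;
        }
        int start = sortedDocIds[0];
        int prev = start;
        for (int i = 1; i < sortedDocIds.length; i++) {
            int doc = sortedDocIds[i];
            if (doc != prev + 1) {
                // Gap found: close the current run and open a new one.
                runs.add(new int[] {start, prev + 1});
                start = doc;
            }
            prev = doc;
        }
        // Close the final run.
        runs.add(new int[] {start, prev + 1});
        return runs;
    }
}
```

For the matches 3, 4, 5, 9, 10, 20 this yields the runs [3, 6), [9, 11) and [20, 21), each of which could then be handed to a range-based scan.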

 

To do this, two new interfaces are introduced:

1. {{LongValuesCollector}} ({{collectValue(long index, long value)}}).
2. {{OrdStatsCollector}} ({{collectOrd(long ord)}}, {{collectMissing(int count)}}).

along with three new functions:

1. {{LongValues.forRange(long begin, long end, LongValuesCollector collector)}}
2. {{SortedDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer collector)}}
3. {{SortedSetDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer collector)}}

The functions come with reference implementations.
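
As a rough illustration of what a reference implementation of {{forRange}} might look like, here is a minimal sketch. It is not the code from the patch: {{SimpleLongValues}} and {{ArrayLongValues}} are hypothetical stand-ins for Lucene's {{LongValues}} so that the example is self-contained, and treating {{end}} as exclusive is an assumption.

```java
/** Callback shape taken from the issue text. */
interface LongValuesCollector {
    void collectValue(long index, long value);
}

/** Hypothetical stand-in for Lucene's LongValues, so the sketch compiles alone. */
abstract class SimpleLongValues {
    abstract long get(long index);

    /**
     * Naive fallback: one get() call per position in [begin, end).
     * An optimized subclass could override this to decode whole blocks instead.
     */
    void forRange(long begin, long end, LongValuesCollector collector) {
        for (long i = begin; i < end; i++) {
            collector.collectValue(i, get(i));
        }
    }
}

/** Array-backed implementation, used only to exercise the sketch. */
final class ArrayLongValues extends SimpleLongValues {
    private final long[] values;

    ArrayLongValues(long[] values) {
        this.values = values;
    }

    @Override
    long get(long index) {
        return values[(int) index];
    }
}
```

The point of routing every caller through one range method is that a format-specific subclass can replace the per-value loop with a block decode without changing the callers.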

Optimized versions of these functions are provided for:

1. {{DirectReader}}, for cases other than 32 or 64 bits per value (using {{PackedInts.Decoder}}).
2. {{Lucene70DocValuesProducer}} {{getSorted}} and {{getSortedSet}} (both sparse and dense).

 

The measured Solr faceting speedup is up to 2-2.5x on a real index.
A patch for Solr {{DocValuesFacets}} is also provided as a separate file.

 

Implementation notes:
 * {{OrdStatsCollector}} does not accept a document ID, because passing one would 
ruin performance for {{SortedSetDocValues}} due to excessive position lookups.
 * This patch is fully compatible with the Lucene 7.0 DocValues format.
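
To illustrate why a collector without a document ID is still enough for faceting, here is a minimal sketch of a counting collector. The interface shape follows the method names given in this issue; {{CountingCollector}} and its fields are hypothetical and not taken from the patch.

```java
/** Callback shape taken from the issue text. */
interface OrdStatsCollector {
    void collectOrd(long ord);

    void collectMissing(int count);
}

/** Hypothetical facet counter: only per-ordinal totals are kept, no docIDs. */
final class CountingCollector implements OrdStatsCollector {
    final int[] counts;
    int missing;

    CountingCollector(int valueCount) {
        this.counts = new int[valueCount];
    }

    @Override
    public void collectOrd(long ord) {
        // One hit for this ordinal; which document produced it is irrelevant here.
        counts[(int) ord]++;
    }

    @Override
    public void collectMissing(int count) {
        // Documents in the scanned range that had no value for the field.
        missing += count;
    }
}
```

Since facet counts are plain totals per ordinal, the producer is free to stream ordinals in bulk without tracking which document each one came from.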



