Toke Eskildsen created LUCENE-8585:
--------------------------------------

             Summary: Create jump-tables for DocValues at index-time
                 Key: LUCENE-8585
                 URL: https://issues.apache.org/jira/browse/LUCENE-8585
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/codecs
    Affects Versions: master (8.0)
            Reporter: Toke Eskildsen


As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid 
long iterative walks. This is implemented in LUCENE-8374 at search-time (first 
request for DocValues from a field in a segment), with the benefit of working 
without changes to existing Lucene 7 indexes and the downside of introducing a 
startup time penalty and a memory overhead.

As discussed in LUCENE-8374, the codec should be updated to create these 
jump-tables at index time. This eliminates the segment-open time & memory 
penalties, with the potential downside of increasing index-time for DocValues.

The three elements of LUCENE-8374 should be transferable to index-time without 
much alteration of the core structures:
 * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for every 
65536 documents, containing the offset of the block in 33 bits and the index 
(number of set bits) up to the block in 31 bits.
 It can be build sequentially and should be stored as a simple sequence of 
consecutive longs for caching of lookups.
 As it is fairly small, relative to document count, it might be better to 
simply memory cache it?
 * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 
bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. 
Each {{short}} represents the number of set bits up to right before the 
corresponding sub-block of 512 docIDs.
 The \{{shorts}} can be computed sequentially or when the DENSE block is 
flushed (probably the easiest). They should be stored as a simple sequence of 
consecutive shorts for caching of lookups, one logically independent sequence 
for each DENSE block. The logical position would be one sequence at the start 
of every DENSE block.
 Whether it is best to read all the 16 {{shorts}} up front when a DENSE block 
is accessed or whether it is best to only read any individual {{short}} when 
needed is not clear at this point.
 * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric 
values. Each {{long}} holds the offset to the corresponding block of values.
 The offsets can be computed sequentially and should be stored as a simple 
sequence of consecutive {{longs}} for caching of lookups.
 The vBPV-offsets has the largest space overhead og the 3 jump-tables and a lot 
of the 64 bits in each long are not used for most indexes. They could be 
represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, 
with the downsides of a potential lookup-time overhead and the need for doing 
the compression after all offsets has been determined.

I have no experience with the codec-parts responsible for creating 
index-structures. I'm quite willing to take a stab at this, although I probably 
won't do much about it before January 2019. Should anyone else wish to adopt 
this JIRA-issue or co-work on it, I'll be happy to share.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to