Toke Eskildsen created LUCENE-8374:
--------------------------------------

             Summary: Reduce reads for sparse DocValues
                 Key: LUCENE-8374
                 URL: https://issues.apache.org/jira/browse/LUCENE-8374
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/codecs
    Affects Versions: 7.3.1, master (8.0)
            Reporter: Toke Eskildsen


The {{Lucene70DocValuesProducer}} has the internal classes 
{{SparseNumericDocValues}} and {{BaseSortedSetDocValues}} (sparse code path), 
which again uses {{IndexedDISI}} to handle the docID -> value-ordinal lookup. 
The value-ordinal is the index of the docID assuming an abstract tightly packed 
monotonically increasing list of docIDs: If the docIDs with corresponding 
values are {{[0, 4, 1432]}}, their value-ordinals will be {{[0, 1, 2]}}.
h2. Outer blocks

The lookup structure of {{IndexedDISI}} consists of blocks of 2^16 values 
(65536), where each block can be either {{ALL}}, {{DENSE}} (2^12 to 2^16 
values) or {{SPARSE}} (< 2^12 values ~= 6%). Consequently blocks vary quite a 
lot in size and ordinal resolving strategy.

When a sparse Numeric DocValue is needed, the code first locates the block 
containing the wanted docID flag. It does so by iterating blocks one-by-one 
until it reaches the needed one, where each iteration requires a lookup in the 
underlying {{IndexSlice}}. For a common memory mapped index, this translates to 
either a cached request or a read operation. If a segment has 6M documents, 
worst-case is 91 lookups. In our web archive, our segments has ~300M values: A 
worst-case of 4577 lookups!

One obvious solution is to use a lookup-table for blocks: A long[]-array with 
an entry for each block. For 6M documents, that is < 1KB and would allow for 
direct jumping (a single lookup) in all instances. Unfortunately this 
lookup-table cannot be generated upfront when the writing of values is purely 
streaming. It can be appended to the end of the stream before it is closed, but 
without knowing the position of the lookup-table the reader cannot seek to it.

One strategy for creating such a lookup-table would be to generate it during 
reads and cache it for next lookup. This does not fit directly into how 
{{IndexedDISI}} currently works (it is created anew for each invocation), but 
could probably be added with a little work. An advantage to this is that this 
does not change the underlying format and thus could be used with existing 
indexes.
h2. The lookup structure inside each block

If {{ALL}} of the 2^16 values are defined, the structure is empty and the 
ordinal is simply the requested docID with some modulo and multiply math. 
Nothing to improve there.

If the block is {{DENSE}} (2^12 to 2^16 values are defined), a bitmap is used 
and the number of set bits up to the wanted index (the docID modulo the block 
origo) are counted. That bitmap is a long[1024], meaning that worst case is to 
lookup and count all set bits for 1024 longs!

One known solution to this is to use a [rank 
structure|[https://en.wikipedia.org/wiki/Succinct_data_structure]]. I 
[implemented 
it|[https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java]]
 for a related project and with that (), the rank-overhead for a {{DENSE}} 
block would be long[32] and would ensure a maximum of 9 lookups. It is not 
trivial to build the rank-structure and caching it (assuming all blocks are 
dense) for 6M documents would require 22 KB (3.17% overhead). It would be far 
better to generate the rank-structure at index time and store it immediately 
before the bitset (this is possible with streaming as each block is fully 
resolved before flushing), but of course that would require a change to the 
codec.

If {{SPARSE}} (< 2^12 values ~= 6%) are defined, the docIDs are simply in the 
form of a list. As a comment in the code suggests, a binary search through 
these would be faster, although that would mean seeking backwards. If that is 
not acceptable, I don't have any immediate idea for avoiding the full iteration.

I propose implementing query-time caching of both block-jumps and inner-block 
lookups for {{DENSE}} (using rank) as first improvement and an index-time 
{{DENSE}}-rank structure for future improvement. As query-time caching is 
likely to be too costly for rapidly-changing indexes, it should probably be an 
opt-in in solrconfig.xml.
h2. Some real-world observations

This analysis was triggered by massive (10x) slowdown problems with both simple 
querying and large exports from our webarchive index after upgrading from Solr 
4.10 to 7.3.1. The query-matching itself takes ½-2 seconds, but returning the 
top-10 documents takes 5-20 seconds (~50 non-stored DocValues fields), up from 
½-2 seconds in total from Solr 4.10 (more of a mix of stored vs. DocValues, so 
might not be directly comparable).

Measuring with VisualVM points to {{NIOFSIndexInput.readInternal}} as *the* 
hotspot.  We ran some tests with simple queries on a single 307,171,504 
document segment with different single-value DocValued fields in the fl and got

 
||Field||Type||Docs with value||Docs w/ val %||Speed in docs/sec||
|url|String|307,171,504|100%|12,500|
|content_type_ext|String|224,375,378|73%|360|
|author|String|1,506,365|0.5%|1,100|
|crawl_date|DatePoint|307,171,498|~100%|90|
|content_text_length|IntPoint|285,800,212|93%|410|
|content_length|IntPoint|307,016,816|99.9%|100|
|crawl_year|IntPoint|307,171,498|~100%|14,500|
|last_modified|DatePoint|6,835,065|2.2%|570|
|source_file_offset|LongPoint|307,171,504|100%|28,000|

 Note how both url and source_file_offset are very fast and also has a value 
for _all_ documents. Contrary to this, content_type_ext is very slow and 
crawl_date is extremely slow and as they both have _nearly_ all documents, I 
presume they are using {{IndexedDISI#DENSE}}. last_modified is also quite slow 
and presumably uses {{IndexedDISI#SPARSE}}.

The only mystery is crawl_year which is also present in _nearly_ all documents, 
but is very fast. I have no explanation for that one yet.

I hope to take a stab at this around August 2018, but no promises.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to