On Mon, Sep 4, 2017 at 6:38 AM, Toke Eskildsen <t...@kb.dk> wrote:
> On Mon, 2017-09-04 at 13:21 +0300, Ere Maijala wrote:
>> Thanks for the insight, Yonik. I can confirm that #2 is true. I ran
>>
>> <optimize maxSegments="1" waitSearcher="true"/>
>>
>> and after it completed I was able to retrieve 2000 values in 17ms.
>
> Very interesting. Is this on spinning disks or SSD? Is your index data
> cached in memory? What I am aiming at is if this is primarily a "many
> relatively slow random access"-thing or more due to the way DocValues
> are represented in the segments (the codec).

It's due to this (see comments in UnInvertedField):
*   To further save memory, the terms (the actual string values) are
*   not all stored in memory, but a TermIndex is used to convert term
*   numbers to term values only for the terms needed after faceting has
*   completed.  Only every 128th term value is stored, along with its
*   corresponding term number, and this is used as an index to find the
*   closest term and iterate until the desired number is hit
There are probably a number of ways we can speed this up somewhat:
- optimize how much memory is used to store the term index and use the
savings to store more than every 128th term
- store the terms contiguously in block(s)
- don't store the whole term, only store what's needed to seek to the
Nth term correctly
- when retrieving many terms, sort them first and convert from ord->str in order
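
The last idea in the list above can be sketched as follows. This is only an
illustration under assumptions, not a patch: the function name, the returned
step count and the sorted Python list standing in for the term dictionary are
all hypothetical, and it assumes unique sorted terms and valid ordinals.

```python
import bisect


def ords_to_terms_sorted(sorted_terms, ords, interval=128):
    """Resolve many ordinals in one forward pass.

    Returns ({ord: term}, steps), where steps counts the terms stepped
    over while iterating between seeks.
    """
    # In-memory sparse index: every `interval`-th term.
    indexed = sorted_terms[::interval]
    result = {}
    steps = 0
    pos = None  # current position of the simulated term enumerator

    for o in sorted(set(ords)):
        slot_start = (o // interval) * interval
        if pos is None or slot_start > pos:
            # The nearest indexed term is ahead of us: seek to it by value.
            pos = bisect.bisect_left(sorted_terms, indexed[o // interval])
        # Otherwise keep iterating from where the previous lookup stopped.
        steps += o - pos
        pos = o
        result[o] = sorted_terms[pos]
    return result, steps
```

Because the ordinals are visited in ascending order, two ordinals that share
the same indexed term also share the scan: the second one continues from the
first instead of seeking back and stepping over the same terms again.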

-Yonik
