On Mon, Sep 4, 2017 at 6:38 AM, Toke Eskildsen <t...@kb.dk> wrote:
> On Mon, 2017-09-04 at 13:21 +0300, Ere Maijala wrote:
>> Thanks for the insight, Yonik. I can confirm that #2 is true. I ran
>>
>> <optimize maxSegments="1" waitSearcher="true"/>
>>
>> and after it completed I was able to retrieve 2000 values in 17ms.
>
> Very interesting. Is this on spinning disks or SSD? Is your index data
> cached in memory? What I am aiming at is if this is primarily a "many
> relatively slow random access"-thing or more due to the way DocValues
> are represented in the segments (the codec).

It's due to this (see comments in UnInvertedField):

 * To further save memory, the terms (the actual string values) are not all stored in
 * memory, but a TermIndex is used to convert term numbers to term values only
 * for the terms needed after faceting has completed. Only every 128th term value
 * is stored, along with its corresponding term number, and this is used as an
 * index to find the closest term and iterate until the desired number is hit

There are probably a number of ways we can speed this up somewhat:
- optimize how much memory is used to store the term index and use the
  savings to store more than every 128th term
- store the terms contiguously in block(s)
- don't store the whole term, only store what's needed to seek to the
  Nth term correctly
- when retrieving many terms, sort them first and convert from ord->str
  in order

-Yonik
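
To make the cost concrete, here is a toy Python sketch of the scheme described in the UnInvertedField comment, plus the "sort them first" idea from the list above. It is not Solr's code: the class SparseTermIndex, its methods, and the steps counter are made up for illustration, and a plain in-memory list stands in for the real term dictionary.

```python
from typing import Dict, Iterable, List

INTERVAL = 128  # every 128th term is kept in memory, per the quoted comment


class SparseTermIndex:
    """Toy model only; names here are hypothetical, not Solr's API."""

    def __init__(self, sorted_terms: List[str], interval: int = INTERVAL) -> None:
        self._terms = sorted_terms               # stand-in for the full term dictionary
        self._interval = interval
        self.indexed = sorted_terms[::interval]  # the sampled terms held in memory
        self.steps = 0                           # terms iterated past after seeking

    def lookup(self, ord_: int) -> str:
        """Seek to the closest indexed term at or before ord_, then scan forward."""
        pos = (ord_ // self._interval) * self._interval
        while pos < ord_:
            pos += 1
            self.steps += 1
        return self._terms[pos]

    def lookup_many_sorted(self, ords: Iterable[int]) -> Dict[int, str]:
        """Resolve ords in ascending order, reusing the scan position when possible."""
        out: Dict[int, str] = {}
        pos = -1  # no current position yet
        for ord_ in sorted(set(ords)):
            block_start = (ord_ // self._interval) * self._interval
            if pos < block_start:
                pos = block_start  # a seek is needed: jump to the indexed term
            while pos < ord_:      # otherwise continue from where we already are
                pos += 1
                self.steps += 1
            out[ord_] = self._terms[pos]
        return out


if __name__ == "__main__":
    terms = [f"term{i:05d}" for i in range(1000)]
    wanted = [100, 5, 130]

    one_by_one = SparseTermIndex(terms)
    for o in wanted:
        one_by_one.lookup(o)

    batched = SparseTermIndex(terms)
    batched.lookup_many_sorted(wanted)

    print("steps, one by one:", one_by_one.steps)
    print("steps, sorted batch:", batched.steps)
```

In this model each independent lookup scans up to 127 terms past the indexed one, whereas the sorted batch never re-scans terms it has already passed within a block. How much that helps in practice depends on how the real term dictionary seeks and iterates, which this sketch does not model.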