bq: Do I have an incorrect understanding of how this works?

If I take "OS disk cache" to include the OS's memory available as a
result of MMapDirectory, you're spot on.

I want to quibble a bit with (1) above. If you search on a
docValues=true indexed=false field it's terrible unless you have a
tiny, tiny, tiny data set. Think "table scan" here. DocValues answer
"for doc X, what is the term(s) in field Y" efficiently, which is what
you need for sorting, grouping and faceting since you've already
answered "what doc does term X appear in" through scoring.
Conceptually, docValues are just an array indexed by the internal
Lucene doc ID, each cell holding the value for the field. So to _search_ on it
you have to examine every cell in the array. That's the "uninverted"
bit you sometimes see thrown around when people discuss docValues.

The inverted structure built when indexed=true is what makes answering
"for term Y, what documents does it appear in" efficient.

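To make the two access patterns concrete, here's a toy sketch in plain
Python. It is not Lucene code: the names (doc_values, inverted,
search_doc_values and so on) are made up for illustration, and real
docValues are column-oriented on-disk structures rather than Python
lists. It only shows why a lookup by doc ID is cheap on the uninverted
side while a search by term has to touch every cell.

```python
from collections import defaultdict

# Toy corpus: the list position stands in for the internal Lucene doc ID.
docs = ["red", "blue", "red", "green", "blue", "red"]

# "Uninverted" docValues-style column: doc ID -> value.
doc_values = list(docs)

# Inverted structure, as built when indexed=true: term -> list of doc IDs.
inverted = defaultdict(list)
for doc_id, term in enumerate(docs):
    inverted[term].append(doc_id)


def value_for_doc(doc_id):
    """'For doc X, what is the value in field Y': one array access."""
    return doc_values[doc_id]


def search_inverted(term):
    """'For term Y, what docs does it appear in': one dictionary lookup."""
    return inverted.get(term, [])


def search_doc_values(term):
    """Same question answered from the column: examines every cell."""
    return [doc_id for doc_id, value in enumerate(doc_values) if value == term]
```
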
Anyway.... I think the original question is how Julian can ensure that
if the fl list specifies a field where docValues=true and stored=true,
the dv value is returned rather than the stored value. I don't know of
any way offhand either. I can say that the entire streaming world would
fall completely apart, since it would slow to a complete crawl if it had
to export stored fields. That's indirect evidence at best, though.

And frankly I haven't looked at the tests for useDocValuesAsStored and the like.

Best
Erick

On Tue, Oct 17, 2017 at 10:01 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 10/17/2017 2:09 AM, Julian Ohrt wrote:
>>
>> The Solr 6.6 documentation states:
>>
>> In cases where the query is returning only docValues fields performance
>> may improve since returning stored fields requires disk reads and
>> decompression whereas returning docValues fields in the fl list only
>> requires memory access.
>
>
> I'm curious how this guarantee (that docValues are accessed from memory not
> disk) could possibly exist.  I think the only way that this could be
> guaranteed is for Lucene to keep docValues data in the heap, but using
> docValues is supposed to *reduce* heap requirements, not increase them, so I
> don't think that's going to happen.  If the data's not in the heap, then
> you're reliant on the OS disk cache as to whether or not the data is in
> memory, and that would be the case either way.  Do I have an incorrect
> understanding of how this works?
>
> As I understand it, the potential advantage to docValues over stored data is
> two-fold:  1) docValues are accessed differently because all the values for
> one field across the entire Lucene segment are in one place.  This can be a
> good thing or a bad thing depending on the query and the data
> characteristics, and it may not be obvious which way that will go.  2)
> docValues data is not compressed, so there's less CPU required.  In cases
> where OS disk caching is insufficient and the compression ratio is really
> good, stored data might actually be faster.
>
> Thanks,
> Shawn
>
