Re: Costs/benefits of DocValues

2015-11-09 Thread Erick Erickson
bq: But if we are keeping the indexed=true, then docValues=true will STILL
use at least as much memory however efficient docValues are
themselves, right?

AFAIK, kinda. The big difference is that with docValues="false", you're
building these structures on the JVM heap, whereas with docValues="true",
the structures are at least partially in OS memory, thus relieving
the pressure on Java's heap, GC and the rest.
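
To make the heap vs. OS memory distinction concrete, here is a minimal
sketch of a memory-mapped column. It is not Lucene's docValues file format:
the file name, the fixed-width int64 layout and the lookup() helper are all
assumptions made for illustration. The point is only that values are read
from a file mapping that the OS page cache backs, rather than from a
structure built inside the process.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width column file: one little-endian int64 per docid.
values = [42, 7, 19, 3]
path = os.path.join(tempfile.mkdtemp(), "price.col")
with open(path, "wb") as f:
    for v in values:
        f.write(struct.pack("<q", v))


def lookup(column, docid):
    """Read the value for one docid straight from the mapped file."""
    return struct.unpack_from("<q", column, docid * 8)[0]


with open(path, "rb") as f:
    # The mapping is backed by the OS page cache, not by a heap-allocated list.
    column = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    looked_up = [lookup(column, d) for d in range(len(values))]
    column.close()
```

Whether such reads are fast depends on the OS keeping the file's pages
cached, which is the same caveat that applies to the rest of the index.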

On Mon, Nov 9, 2015 at 9:06 AM, Alexandre Rafalovitch
 wrote:
> Thank you Yonik.
>
> So I would probably advise then to "keep your indexed=true" and think
> about _adding_ docValues when there is memory pressure or when there
> is a clear performance issue for the ...specific... uses.
>
> But if we are keeping the indexed=true, then docValues=true will STILL
> use at least as much memory however efficient docValues are
> themselves, right? Or will something that is normally loaded and uses
> memory stay unloaded in this combination scenario?
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 9 November 2015 at 11:57, Yonik Seeley  wrote:
>> On Mon, Nov 9, 2015 at 11:19 AM, Alexandre Rafalovitch
>>  wrote:
>>> I thought docValues were per segment, so the price of un-inversion was
>>> effectively paid on each commit for all the segments, as opposed to
>>> just the updated one.
>>
>> Both the field cache (i.e. uninverting indexed values) and docValues
>> are mostly per-segment (I say mostly because some uses still require
>> building a global ord map).
>>
>> But even when things are mostly per-segment, you hit major segment
>> merges and the cost of un-inversion (when you aren't using docValues)
>> is non-trivial.
>>
>>> I admit I also find the story around docValues to be very confusing at
>>> the moment. Especially on the interplay with "indexed=false".
>>
>> You still need "indexed=true" for efficient filters on the field.
>> Hence if you're faceting on a field and want to use docValues, you
>> probably want to keep the "indexed=true" on the field as well.
>>
>> -Yonik
>>
>>
>>> It would
>>> make a VERY good article to have this clarified somehow by people in
>>> the know.
>>>
>>> Regards,
>>>Alex.
>>> 
>>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>>> http://www.solr-start.com/
>>>
>>>
>>> On 9 November 2015 at 11:04, Yonik Seeley  wrote:
>>>> On Mon, Nov 9, 2015 at 10:55 AM, Demian Katz
>>>> wrote:
>>>>> I understand that by adding "docValues=true" to some of my fields, I can
>>>>> improve sorting/faceting performance.
>>>>
>>>> I don't think this is true in the general sense.
>>>> docValues are built at index-time, so what you will save is initial
>>>> un-inversion time (i.e. the first time a field is used after a new
>>>> searcher is opened).
>>>> After that point, docValues may be slightly slower.
>>>>
>>>> The other advantage of docValues is memory use... much/most of it is
>>>> essentially "off-heap", being memory-mapped from disk.  This cuts down
>>>> on memory issues and helps reduce longer GC pauses.
>>>>
>>>> docValues are good in general, and I think we should default to them
>>>> more for Solr 6, but they are not better in all ways.
>>>>
>>>>> However, I have a couple of questions:
>>>>>
>>>>>
>>>>> 1.)Will Solr always take proper advantage of docValues when it is
>>>>> turned on
>>>>
>>>> Yes.
>>>>
>>>>> , or will I gain greater performance by turning off stored/indexed in
>>>>> situations where only docValues are necessary (e.g. a sort-only field)?
>>>>>
>>>>> 2.)Will adding docValues to a field introduce significant performance
>>>>> penalties for non-docValues uses of that field, beyond the obvious fact
>>>>> that the additional data will consume more disk and memory?
>>>>
>>>> No, it's a separate part of the index.
>>>>
>>>> -Yonik
>>>>
>>>>
>>>>> I'm asking this question because the existing schema has some
>>>>> multi-purpose fields, and I'm trying to determine whether I should just
>>>>> add "docValues=true" wherever it might help, or if I need to take a more
>>>>> thoughtful approach and potentially split some fields with copyFields,
>>>>> etc. This is particularly significant because my schema makes use of some
>>>>> dynamic field suffixes, and I'm not sure if I need to add new suffixes to
>>>>> differentiate docValues/non-docValues fields, or if it's okay to turn on
>>>>> docValues across the board "just in case."
>>>>>
>>>>> Apologies if these questions have already been answered - I couldn't find
>>>>> a totally clear answer in the places I searched.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> - Demian


Re: Costs/benefits of DocValues

2015-11-09 Thread Yonik Seeley
On Mon, Nov 9, 2015 at 12:06 PM, Alexandre Rafalovitch
 wrote:
> Thank you Yonik.
>
> So I would probably advise then to "keep your indexed=true" and think
> about _adding_ docValues when there is memory pressure or when there
> is a clear performance issue for the ...specific... uses.
>
> But if we are keeping the indexed=true, then docValues=true will STILL
> use at least as much memory however efficient docValues are
> themselves, right? Or will something that is normally loaded and uses
> memory stay unloaded in this combination scenario?

Think about it this way: for something like sorting, we need a column
for fast docid->value lookup.
Enabling docValues means building this column at index time.  At
search time, it gets memory mapped, just like most other parts of the
index.  The required memory is off-heap... the OS needs to keep the
file in its buffer cache for good performance.
If docValues aren't enabled, this means that we need to build the
column on-the-fly on-heap (i.e. FieldCache entry is built from
un-inverting the indexed values).

An indexed field by itself only takes up disk space, just like
docValues.  Of course for searches to be fast, off-heap RAM (in the
form of OS buffer cache / disk cache) is still needed.

-Yonik
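
The "column for docid->value lookup" idea above can be sketched in a few
lines. This is a toy model, not Solr or Lucene code: the docs dictionary,
the uninvert() helper and the single-valued string field are assumptions
chosen to keep the example small. It shows what un-inverting an indexed
field amounts to, and why the same column could instead be written once at
index time.

```python
from collections import defaultdict

# Toy single-valued field: docid -> value, as the documents were indexed.
docs = {0: "banana", 1: "apple", 2: "cherry", 3: "apple"}

# What an indexed field gives you: term -> list of docids (inverted index).
inverted = defaultdict(list)
for docid, term in sorted(docs.items()):
    inverted[term].append(docid)


def uninvert(inverted_index, max_doc):
    """Rebuild the docid -> value column by walking every term's postings."""
    column = [None] * max_doc
    for term, postings in inverted_index.items():
        for docid in postings:
            column[docid] = term
    return column


# Without docValues, this work happens at search time and the result lives
# on the heap. With docValues, an equivalent column is built at index time.
column = uninvert(inverted, max_doc=len(docs))

# Sorting needs exactly this lookup: the value for each docid.
order = sorted(range(len(column)), key=lambda d: column[d])
```

The cost of uninvert() grows with the number of postings, which is why the
first use of a field after a new searcher is opened is the expensive one.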


Re: Costs/benefits of DocValues

2015-11-09 Thread Yonik Seeley
On Mon, Nov 9, 2015 at 10:55 AM, Demian Katz  wrote:
> I understand that by adding "docValues=true" to some of my fields, I can 
> improve sorting/faceting performance.

I don't think this is true in the general sense.
docValues are built at index-time, so what you will save is initial
un-inversion time (i.e. the first time a field is used after a new
searcher is opened).
After that point, docValues may be slightly slower.

The other advantage of docValues is memory use... much/most of it is
essentially "off-heap", being memory-mapped from disk.  This cuts down
on memory issues and helps reduce longer GC pauses.

docValues are good in general, and I think we should default to them
more for Solr 6, but they are not better in all ways.

> However, I have a couple of questions:
>
>
> 1.)Will Solr always take proper advantage of docValues when it is turned 
> on

Yes.

> , or will I gain greater performance by turning off stored/indexed in
> situations where only docValues are necessary (e.g. a sort-only field)?
>
> 2.)Will adding docValues to a field introduce significant performance 
> penalties for non-docValues uses of that field, beyond the obvious fact that 
> the additional data will consume more disk and memory?

No, it's a separate part of the index.

-Yonik


> I'm asking this question because the existing schema has some multi-purpose 
> fields, and I'm trying to determine whether I should just add 
> "docValues=true" wherever it might help, or if I need to take a more 
> thoughtful approach and potentially split some fields with copyFields, etc. 
> This is particularly significant because my schema makes use of some dynamic 
> field suffixes, and I'm not sure if I need to add new suffixes to 
> differentiate docValues/non-docValues fields, or if it's okay to turn on 
> docValues across the board "just in case."
>
> Apologies if these questions have already been answered - I couldn't find a 
> totally clear answer in the places I searched.
>
> Thanks!
>
> - Demian


Re: Costs/benefits of DocValues

2015-11-09 Thread Alexandre Rafalovitch
Thank you Yonik.

So I would probably advise then to "keep your indexed=true" and think
about _adding_ docValues when there is memory pressure or when there
is a clear performance issue for the ...specific... uses.

But if we are keeping the indexed=true, then docValues=true will STILL
use at least as much memory however efficient docValues are
themselves, right? Or will something that is normally loaded and uses
memory stay unloaded in this combination scenario?

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 9 November 2015 at 11:57, Yonik Seeley  wrote:
> On Mon, Nov 9, 2015 at 11:19 AM, Alexandre Rafalovitch
>  wrote:
>> I thought docValues were per segment, so the price of un-inversion was
>> effectively paid on each commit for all the segments, as opposed to
>> just the updated one.
>
> Both the field cache (i.e. uninverting indexed values) and docValues
> are mostly per-segment (I say mostly because some uses still require
> building a global ord map).
>
> But even when things are mostly per-segment, you hit major segment
> merges and the cost of un-inversion (when you aren't using docValues)
> is non-trivial.
>
>> I admit I also find the story around docValues to be very confusing at
>> the moment. Especially on the interplay with "indexed=false".
>
> You still need "indexed=true" for efficient filters on the field.
> Hence if you're faceting on a field and want to use docValues, you
> probably want to keep the "indexed=true" on the field as well.
>
> -Yonik
>
>
>> It would
>> make a VERY good article to have this clarified somehow by people in
>> the know.
>>
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 9 November 2015 at 11:04, Yonik Seeley  wrote:
>>> On Mon, Nov 9, 2015 at 10:55 AM, Demian Katz  
>>> wrote:
>>>> I understand that by adding "docValues=true" to some of my fields, I can
>>>> improve sorting/faceting performance.
>>>
>>> I don't think this is true in the general sense.
>>> docValues are built at index-time, so what you will save is initial
>>> un-inversion time (i.e. the first time a field is used after a new
>>> searcher is opened).
>>> After that point, docValues may be slightly slower.
>>>
>>> The other advantage of docValues is memory use... much/most of it is
>>> essentially "off-heap", being memory-mapped from disk.  This cuts down
>>> on memory issues and helps reduce longer GC pauses.
>>>
>>> docValues are good in general, and I think we should default to them
>>> more for Solr 6, but they are not better in all ways.
>>>
>>>> However, I have a couple of questions:
>>>>
>>>>
>>>> 1.)Will Solr always take proper advantage of docValues when it is
>>>> turned on
>>>
>>> Yes.
>>>
>>>> , or will I gain greater performance by turning off stored/indexed in
>>>> situations where only docValues are necessary (e.g. a sort-only field)?
>>>>
>>>> 2.)Will adding docValues to a field introduce significant performance
>>>> penalties for non-docValues uses of that field, beyond the obvious fact
>>>> that the additional data will consume more disk and memory?
>>>
>>> No, it's a separate part of the index.
>>>
>>> -Yonik
>>>
>>>
>>>> I'm asking this question because the existing schema has some
>>>> multi-purpose fields, and I'm trying to determine whether I should just
>>>> add "docValues=true" wherever it might help, or if I need to take a more
>>>> thoughtful approach and potentially split some fields with copyFields,
>>>> etc. This is particularly significant because my schema makes use of some
>>>> dynamic field suffixes, and I'm not sure if I need to add new suffixes to
>>>> differentiate docValues/non-docValues fields, or if it's okay to turn on
>>>> docValues across the board "just in case."
>>>>
>>>> Apologies if these questions have already been answered - I couldn't find
>>>> a totally clear answer in the places I searched.
>>>>
>>>> Thanks!
>>>>
>>>> - Demian


Re: Costs/benefits of DocValues

2015-11-09 Thread Alexandre Rafalovitch
I thought docValues were per segment, so the price of un-inversion was
effectively paid on each commit for all the segments, as opposed to
just the updated one.

I admit I also find the story around docValues to be very confusing at
the moment. Especially on the interplay with "indexed=false". It would
make a VERY good article to have this clarified somehow by people in
the know.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 9 November 2015 at 11:04, Yonik Seeley  wrote:
> On Mon, Nov 9, 2015 at 10:55 AM, Demian Katz  
> wrote:
>> I understand that by adding "docValues=true" to some of my fields, I can 
>> improve sorting/faceting performance.
>
> I don't think this is true in the general sense.
> docValues are built at index-time, so what you will save is initial
> un-inversion time (i.e. the first time a field is used after a new
> searcher is opened).
> After that point, docValues may be slightly slower.
>
> The other advantage of docValues is memory use... much/most of it is
> essentially "off-heap", being memory-mapped from disk.  This cuts down
> on memory issues and helps reduce longer GC pauses.
>
> docValues are good in general, and I think we should default to them
> more for Solr 6, but they are not better in all ways.
>
>> However, I have a couple of questions:
>>
>>
>> 1.)Will Solr always take proper advantage of docValues when it is turned 
>> on
>
> Yes.
>
>> , or will I gain greater performance by turning off stored/indexed in
>> situations where only docValues are necessary (e.g. a sort-only field)?
>>
>> 2.)Will adding docValues to a field introduce significant performance 
>> penalties for non-docValues uses of that field, beyond the obvious fact that 
>> the additional data will consume more disk and memory?
>
> No, it's a separate part of the index.
>
> -Yonik
>
>
>> I'm asking this question because the existing schema has some multi-purpose 
>> fields, and I'm trying to determine whether I should just add 
>> "docValues=true" wherever it might help, or if I need to take a more 
>> thoughtful approach and potentially split some fields with copyFields, etc. 
>> This is particularly significant because my schema makes use of some dynamic 
>> field suffixes, and I'm not sure if I need to add new suffixes to 
>> differentiate docValues/non-docValues fields, or if it's okay to turn on 
>> docValues across the board "just in case."
>>
>> Apologies if these questions have already been answered - I couldn't find a 
>> totally clear answer in the places I searched.
>>
>> Thanks!
>>
>> - Demian


Re: Costs/benefits of DocValues

2015-11-09 Thread Yonik Seeley
On Mon, Nov 9, 2015 at 11:19 AM, Alexandre Rafalovitch
 wrote:
> I thought docValues were per segment, so the price of un-inversion was
> effectively paid on each commit for all the segments, as opposed to
> just the updated one.

Both the field cache (i.e. uninverting indexed values) and docValues
are mostly per-segment (I say mostly because some uses still require
building a global ord map).

But even when things are mostly per-segment, you hit major segment
merges and the cost of un-inversion (when you aren't using docValues)
is non-trivial.
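
As a rough illustration of why "mostly per-segment" still leaves some
global work, here is a simplified model of a global ord map. It is only a
sketch: the segments structure and the build_global_ord_map() and
facet_counts() helpers are invented for this example and do not mirror
Lucene's actual classes. Each segment numbers its own terms, so counting
across segments needs a translation from segment ordinals to one shared
numbering.

```python
# Each segment has its own sorted term dictionary; an ordinal is a position
# in that dictionary, so ordinals are not comparable across segments.
segments = [
    {"terms": ["apple", "cherry"], "doc_ords": [0, 1, 0]},
    {"terms": ["banana", "cherry"], "doc_ords": [1, 0]},
]


def build_global_ord_map(segments):
    """Return the merged term list and, per segment, segment ord -> global ord."""
    global_terms = sorted({t for seg in segments for t in seg["terms"]})
    position = {term: i for i, term in enumerate(global_terms)}
    maps = [[position[t] for t in seg["terms"]] for seg in segments]
    return global_terms, maps


def facet_counts(segments):
    """Count documents per term across all segments using global ordinals."""
    global_terms, maps = build_global_ord_map(segments)
    counts = [0] * len(global_terms)
    for seg, ord_map in zip(segments, maps):
        for seg_ord in seg["doc_ords"]:
            counts[ord_map[seg_ord]] += 1
    return dict(zip(global_terms, counts))


counts = facet_counts(segments)
```

In this model the map has to be rebuilt whenever the set of segments
changes, which is the part that is not per-segment.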

> I admit I also find the story around docValues to be very confusing at
> the moment. Especially on the interplay with "indexed=false".

You still need "indexed=true" for efficient filters on the field.
Hence if you're faceting on a field and want to use docValues, you
probably want to keep the "indexed=true" on the field as well.

-Yonik
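
For readers who want to try the two combinations discussed here, the
following sketch builds Schema API "add-field" commands as plain Python
dicts. The field names (category, title_sort) and the add_field_command()
helper are hypothetical, and it assumes a "string" field type is already
defined in the schema. Nothing is sent to a server; the resulting JSON would
be POSTed to the collection's /schema endpoint.

```python
import json


def add_field_command(name, field_type, indexed, stored, doc_values):
    """Build one Schema API "add-field" command as a plain dict."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
            "docValues": doc_values,
        }
    }


# Faceting plus filtering: keep indexed=true alongside docValues=true.
facet_and_filter = add_field_command("category", "string", True, False, True)

# Sort-only field: docValues alone, assuming it is never searched or filtered.
sort_only = add_field_command("title_sort", "string", False, False, True)

payload = json.dumps(facet_and_filter, indent=2)
```

Splitting a multi-purpose field this way usually means a copyField from the
original, which is the trade-off raised in the original question.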


> It would
> make a VERY good article to have this clarified somehow by people in
> the know.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 9 November 2015 at 11:04, Yonik Seeley  wrote:
>> On Mon, Nov 9, 2015 at 10:55 AM, Demian Katz  
>> wrote:
>>> I understand that by adding "docValues=true" to some of my fields, I can 
>>> improve sorting/faceting performance.
>>
>> I don't think this is true in the general sense.
>> docValues are built at index-time, so what you will save is initial
>> un-inversion time (i.e. the first time a field is used after a new
>> searcher is opened).
>> After that point, docValues may be slightly slower.
>>
>> The other advantage of docValues is memory use... much/most of it is
>> essentially "off-heap", being memory-mapped from disk.  This cuts down
>> on memory issues and helps reduce longer GC pauses.
>>
>> docValues are good in general, and I think we should default to them
>> more for Solr 6, but they are not better in all ways.
>>
>>> However, I have a couple of questions:
>>>
>>>
>>> 1.)Will Solr always take proper advantage of docValues when it is 
>>> turned on
>>
>> Yes.
>>
>>> , or will I gain greater performance by turning off stored/indexed in
>>> situations where only docValues are necessary (e.g. a sort-only field)?
>>>
>>> 2.)Will adding docValues to a field introduce significant performance 
>>> penalties for non-docValues uses of that field, beyond the obvious fact 
>>> that the additional data will consume more disk and memory?
>>
>> No, it's a separate part of the index.
>>
>> -Yonik
>>
>>
>>> I'm asking this question because the existing schema has some multi-purpose 
>>> fields, and I'm trying to determine whether I should just add 
>>> "docValues=true" wherever it might help, or if I need to take a more 
>>> thoughtful approach and potentially split some fields with copyFields, etc. 
>>> This is particularly significant because my schema makes use of some 
>>> dynamic field suffixes, and I'm not sure if I need to add new suffixes to 
>>> differentiate docValues/non-docValues fields, or if it's okay to turn on 
>>> docValues across the board "just in case."
>>>
>>> Apologies if these questions have already been answered - I couldn't find a 
>>> totally clear answer in the places I searched.
>>>
>>> Thanks!
>>>
>>> - Demian


Re: Costs/benefits of DocValues

2015-11-09 Thread Mikhail Khludnev
On Mon, Nov 9, 2015 at 6:55 PM, Demian Katz 
wrote:

> I have a legacy Solr schema that I would like to update to take advantage
> of DocValues. I understand that by adding "docValues=true" to some of my
> fields, I can improve sorting/faceting performance.


Demian,
If an index has many segments (let's say more than 5 or 10), docValues
faceting performance is prohibitive with the old facet.field=... parameter.
You either need to wait for Solr 5.4 (see
https://issues.apache.org/jira/browse/SOLR-7730) or switch to JSON Facets.
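
A minimal sketch of what switching to JSON Facets could look like from a
client, assuming a field called "category" (a made-up name) and a
hypothetical terms_facet_params() helper. It only builds the request
parameters and does not contact Solr.

```python
import json
from urllib.parse import urlencode


def terms_facet_params(field, limit=10, query="*:*"):
    """Build query parameters for a JSON Facet API terms facet on one field."""
    facet = {"by_" + field: {"type": "terms", "field": field, "limit": limit}}
    return {"q": query, "rows": 0, "json.facet": json.dumps(facet)}


params = terms_facet_params("category")

# URL-encoded form, ready to append to a /select request.
query_string = urlencode(params)
```

The old-style equivalent would be facet=true&facet.field=category; the
facet key "by_category" is just a label for the result bucket.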


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics