On Thu, Dec 16, 2010 at 2:09 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Mike,
>
>>>But, if you are doing deletions (or updateDocument, which is just a
>>>delete + add under-the-hood), then this will force the terms index of
>>>the segment readers to be loaded, thus consuming more RAM.
>
> Out of 700,000 docs, by the time we get to doc 600,000, there is a good 
> chance a few documents have been updated, which would cause a delete + add.

OK, so you should use IndexWriter.updateDocument, not addDocument.

>>>One workaround for a large terms index is to set the terms index divisor
>>>that IndexWriter should use whenever it loads a terms index (this is
>>>IndexWriter.setReaderTermsIndexDivisor).
>
> I always get confused about the two different divisors and their names in the 
> solrconfig.xml file
>
> We are setting  termInfosIndexDivisor, which I think translates to the Lucene 
> IndexWriter.setReaderTermsIndexDivisor
>
> <indexReaderFactory name="IndexReaderFactory" 
> class="org.apache.solr.core.StandardIndexReaderFactory">
>    <int name="termInfosIndexDivisor">8</int>
>  </indexReaderFactory>
>
> The other one is termIndexInterval which is set on the writer and determines 
> what gets written to the tii file.  I don't remember how to set this in Solr.
>
> Are we setting the right one to reduce RAM usage during merging?

It's even more confusing!

There are three settings.  The first, termIndexInterval, tells IW how
frequent the indexed terms are (default is 128, i.e. every 128th term
is written to the terms index).  The second tells IndexReader whether
to sub-sample these on load (default is 1, meaning load all indexed
terms; if you set it to 2 then only every 2*128 = 256th term is
loaded).  The third is IW's copy of that same divisor (subsampling),
used whenever it must internally open a reader (eg to apply deletes).

The last two are really the same setting, just that one is passed when
you open IndexReader yourself, and the other is passed whenever IW
needs to open a reader.
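To make the arithmetic concrete, here's a rough back-of-the-envelope
sketch (plain Java, illustrative numbers only -- this models the
interval/divisor math, not Lucene's actual internals):

```java
// How the writer-side interval and reader-side divisor interact:
// the writer records one "indexed" term per termIndexInterval terms
// into the terms index, and a reader with divisor D loads only every
// Dth of those into RAM.
public class TermsIndexMath {

    // Indexed terms written by IndexWriter: one every termIndexInterval terms.
    static long indexedTerms(long uniqueTerms, int termIndexInterval) {
        return uniqueTerms / termIndexInterval;
    }

    // Indexed terms a reader actually loads into RAM with the given divisor:
    // divisor 1 loads them all, divisor 2 loads every other one, etc.
    static long loadedTerms(long uniqueTerms, int termIndexInterval, int divisor) {
        return indexedTerms(uniqueTerms, termIndexInterval) / divisor;
    }

    public static void main(String[] args) {
        long unique = 1_000_000_000L;  // e.g. a huge OCR'd corpus
        System.out.println(loadedTerms(unique, 128, 1)); // 7812500 -> every 128th term in RAM
        System.out.println(loadedTerms(unique, 128, 8)); // 976562  -> every 1024th term in RAM
    }
}
```

So bumping the divisor from 1 to 8 cuts the RAM held by the loaded
terms index roughly eightfold, at the cost of a longer in-memory scan
per term lookup.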

But, I'm not sure how these settings are named in solrconfig.xml.
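For what it's worth, and with that caveat about naming: I believe older
Solr example configs exposed the writer-side interval under the
indexDefaults section, something like the fragment below (do check the
docs for your Solr version before relying on the exact element name):

```xml
<!-- solrconfig.xml (element name may vary by Solr version) -->
<indexDefaults>
  <!-- Writer-side setting: write every 256th term to the terms index
       (.tii) instead of the default every 128th. -->
  <termIndexInterval>256</termIndexInterval>
</indexDefaults>
```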

>> So I think the gist is... the RAM usage will be in proportion to the
>> net size of the merge (mergeFactor + how big each merged segment is),
>> how many merges you allow concurrently, and whether or not you are
>> doing deletions
>
> Does an optimize do something differently?

No, optimize is the same deal.  But, because it's a big merge
(especially the final one), it has the highest RAM usage of all merges.

Mike
