The first thing to try is turning on soft commits. You need to open new searchers while indexing to free up the memory used to support real-time-get queries. Real-time get serves queries on uncommitted data, so a memory component is needed for records that have been indexed but are not yet visible to a searcher. Opening a new searcher makes these records visible and frees that memory.
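For example, re-enabling a soft commit interval in solrconfig.xml for the duration of the bulk load might look roughly like this; the 5-minute interval is only an illustrative value, not a recommendation from this thread, and should be tuned to your own indexing load:

<autoSoftCommit>
  <!-- Illustrative value (5 minutes): periodically open a new searcher during bulk
       indexing so the memory backing real-time-get for uncommitted records is released. -->
  <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
</autoSoftCommit>

The hard commit can keep openSearcher=false; per the point above, it is the periodic soft commit (new searcher) that makes the buffered records visible and releases that memory.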
Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jul 23, 2021 at 2:54 PM Pratik Patel <pra...@semandex.net> wrote:

> Solr Cloud version is 8.5. I have also attached the Solr log with GC
> enabled and our app log, which shows that there was a SocketTimeoutException.
>
> On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net> wrote:
>
>> Hi All,
>>
>> *tl;dr*: Running into long GC pauses and Solr client socket timeouts when
>> bulk indexing documents into Solr. The commit strategy, in essence, is to do
>> hard commits at an interval of 50k documents (maxDocs=50k) and to disable
>> soft commits altogether during bulk indexing. Simple SolrCloud setup with
>> one node and one shard.
>>
>> *Details*:
>> We have about 6 million documents which we are trying to index into Solr.
>> Of these, about 500k documents have a text field which holds abstracts of
>> scientific papers/articles. We extract keywords from these abstracts and
>> index those keywords into Solr as well.
>>
>> We have a many-to-many relationship between Articles and Keywords. To
>> store this, we have the following structure:
>>
>> Article documents
>> Keyword documents
>> Article-Keyword Join documents
>>
>> We use block join to index Articles with "Article-Keyword" Join documents,
>> and Keyword documents are indexed independently.
>>
>> In other words, we have blocks of "Article + Article-Keyword Joins", and we
>> have Keyword documents (which hold some additional metadata about the
>> keyword).
>>
>> We have a bulk processing operation which creates these documents and
>> indexes them into Solr. During this bulk indexing, we don't need the
>> documents to be searchable. We need to search against them only after ALL
>> the documents are indexed.
>>
>> *Based on this, this is our current strategy:*
>> Soft commits are disabled and hard commits are done at an interval of 50k
>> documents with openSearcher=false. Our code triggers explicit commits 4
>> times after various stages of bulk indexing. Transaction logs are enabled
>> and have default settings.
>>
>> <autoCommit>
>>   <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
>>   <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>>
>> <autoSoftCommit>
>>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>> </autoSoftCommit>
>>
>> Other environmental details:
>> Xms=8g and Xmx=14g, Solr client socketTimeout=7 minutes, and
>> zkClientTimeout=2 minutes.
>> Our indexing operation triggers many "add" operations in parallel using
>> RxJava (15 to 30 threads); each "add" operation is passed about 1000
>> documents.
>>
>> Currently, when we run this indexing operation, we notice that after a
>> while Solr goes into long GC pauses (longer than our socketTimeout of 7
>> minutes) and we get SocketTimeoutExceptions.
>>
>> *What could be causing such long GC pauses?*
>>
>> *Does this commit strategy make sense? If not, what is the recommended
>> strategy that we can look into?*
>>
>> *Any help on this is much appreciated. Thanks.*
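For reference, the block structure described in the quoted message, a parent Article with its Article-Keyword Join children plus independently indexed Keyword documents, would look roughly like the sketch below in Solr's XML update format; the field names and ids are made up for illustration and are not taken from the original message:

<add>
  <doc>
    <field name="id">article-1</field>
    <field name="type">Article</field>
    <!-- child documents of the block: the Article-Keyword Join records -->
    <doc>
      <field name="id">article-1-join-1</field>
      <field name="type">ArticleKeywordJoin</field>
      <field name="keywordId">keyword-42</field>
    </doc>
  </doc>
  <!-- Keyword documents are indexed independently, outside any block -->
  <doc>
    <field name="id">keyword-42</field>
    <field name="type">Keyword</field>
  </doc>
</add>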