Whether you use real-time-get or not, you still need to soft commit to
release the memory used to support real-time-get.
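
A simple way to do that, while keeping your hard commits at
openSearcher=false, is an autoSoftCommit interval in solrconfig.xml. A
minimal sketch along these lines (the 5 minute / 300000 ms value is only an
illustration, tune it for your indexing run):

    <autoSoftCommit>
      <!-- example interval only: opens a new searcher every 5 minutes -->
      <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
    </autoSoftCommit>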


Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Jul 23, 2021 at 3:39 PM Pratik Patel <pra...@semandex.net> wrote:

> Thanks for the response, Joel.
>
> We do not use "Real-time-get" queries. Also, we don't query the index while
> a particular stage of bulk indexing is going on. Would it still help to
> enable soft commits?
>
> On Fri, Jul 23, 2021 at 3:16 PM Joel Bernstein <joels...@gmail.com> wrote:
>
> > The first thing to try is turning on soft commits. You need to open new
> > searchers while indexing to free up the memory used to support
> > real-time-get queries. Real-time-get supports queries on uncommitted
> > data, so a memory component is needed for records that have been
> > indexed but are not yet visible. Opening a new searcher makes these
> > records visible and frees that memory.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > > On Fri, Jul 23, 2021 at 2:54 PM Pratik Patel <pra...@semandex.net> wrote:
> >
> > > SolrCloud version is 8.5. I have also attached the Solr log with GC
> > > logging enabled and our app log, which shows that there was a
> > > SocketTimeoutException.
> > >
> > > On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net> wrote:
> > >
> > >> Hi All,
> > >>
> > >> *tl;dr*: running into long GC pauses and Solr client socket timeouts
> > >> when bulk indexing documents into Solr. The commit strategy, in
> > >> essence, is to do hard commits every 50k documents (maxDocs=50000)
> > >> and disable soft commits altogether during bulk indexing. Simple
> > >> SolrCloud setup with one node and one shard.
> > >>
> > >> *Details*:
> > >> We have about 6 million documents which we are trying to index into
> > >> Solr. Of these, about 500k documents have a text field which holds
> > >> abstracts of scientific papers/Articles. We extract keywords from
> > >> these abstracts and index those keywords into Solr as well.
> > >>
> > >> We have a many-to-many relationship between Articles and Keywords.
> > >> To store this, we have the following structure:
> > >>
> > >> Article documents
> > >> Keyword documents
> > >> Article-Keyword Join documents
> > >>
> > >> We use block join indexing to index Articles together with their
> > >> "Article-Keyword" Join documents, while Keyword documents are indexed
> > >> independently.
> > >>
> > >> In other words, we have blocks of "Article + Article-Keyword Joins",
> > >> and we have Keyword documents (they hold some additional metadata
> > >> about the keyword).
> > >>
> > >> We have a bulk processing operation which creates these documents and
> > >> indexes them into Solr. During this bulk indexing, we don't need the
> > >> documents to be searchable. We need to search against them only after
> > >> ALL the documents are indexed.
> > >>
> > >> *Based on this, our current strategy is as follows.*
> > >> Soft commits are disabled and hard commits are done every 50k
> > >> documents with openSearcher=false. Our code triggers explicit commits
> > >> 4 times, after various stages of bulk indexing. Transaction logs are
> > >> enabled and have default settings.
> > >>
> > >>     <autoCommit>
> > >>       <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
> > >>       <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
> > >>       <openSearcher>false</openSearcher>
> > >>     </autoCommit>
> > >>
> > >>     <autoSoftCommit>
> > >>       <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > >>     </autoSoftCommit>
> > >>
> > >> Other environmental details:
> > >> Xms=8g and Xmx=14g, Solr client socketTimeout=7 minutes, and
> > >> zkClientTimeout=2 minutes.
> > >> Our indexing operation triggers many "add" operations in parallel
> > >> using RxJava (15 to 30 threads); each "add" operation is passed about
> > >> 1000 documents.
> > >>
> > >> Currently, when we run this indexing operation, we notice that after a
> > >> while Solr goes into long GC pauses (longer than our socketTimeout of
> > >> 7 minutes) and we get SocketTimeoutExceptions.
> > >>
> > >> *What could be causing such long GC pauses?*
> > >>
> > >> *Does this commit strategy make sense? If not, what is the
> > >> recommended strategy that we can look into?*
> > >>
> > >> *Any help on this is much appreciated. Thanks.*
> > >>
> > >>
> >
>
