Re: Commit strategy for Heavy Bulk Indexing into solr

Pratik Patel Fri, 23 Jul 2021 16:11:06 -0700

Interesting! I will certainly test this. What interval would you suggest
for the soft commits? Also, is there a way to disable real-time get so that
we can disable soft commits?


Triggering a soft commit would open a new searcher and recreate the caches.
We would like to avoid that if possible, as there is no need for it during
a bulk indexing operation.

Some more clarity on hard and soft commits would be immensely helpful. From
what I know,

soft commit - makes the new documents available for search without waiting
for the changes to be written to disk. Each time a soft commit happens, a
new searcher is opened and the caches are renewed. Transaction logs are NOT
rolled over after a soft commit. Based on your comments, the memory
allocated for real-time-get is not freed until a soft commit happens.

hard commit - updates the index on disk. Setting openSearcher=false ensures
that a new searcher is not opened. Transaction logs are rolled over ONLY
when a hard commit takes place. Would it be correct to say that the memory
allocated for real-time-get is not freed when a hard commit takes place?
(Is there any documentation or post that I can read to learn more about
real-time-get with respect to indexing and memory consumption?)

If what I have described above is true then it would mean that during bulk
indexing, we must do hard commits at some interval so that transaction logs
are rolled over and we must do the soft commits to free up the memory used
by the real-time-get feature.

Also, how should we go about deciding the intervals at which soft and hard
commits must take place? Earlier we had hard commits every 15 seconds and
soft commits every 10 seconds; bulk indexing would slow down badly at times
and end in SocketTimeouts or OOM errors.
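
For concreteness, here is a minimal sketch of the kind of solrconfig.xml
change under discussion: soft commits re-enabled alongside time-based hard
commits. The 60-second and 5-minute values are assumptions chosen purely for
illustration. Nobody in this thread has recommended them, and they would
need to be tuned by testing.

```xml
<!-- Sketch only: both intervals below are illustrative assumptions. -->
<autoCommit>
  <!-- Hard commit every 60 s: writes to disk, rolls over the tlog. -->
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <!-- Do not open a new searcher on hard commits. -->
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <!-- Soft commit every 5 min: opens a new searcher, which (per Joel's
       reply) is what releases the memory held for real-time-get. -->
  <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
</autoSoftCommit>
```

Since both values are read through property substitution, they could
presumably be overridden with system properties at startup (for example
-Dsolr.autoSoftCommit.maxTime=...) for the duration of the bulk load,
without editing the file.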

Thanks a lot for your help!


On Fri, Jul 23, 2021 at 6:47 PM Joel Bernstein <joels...@gmail.com> wrote:

> Whether you use real-time-get or not you still need to soft commit to
> release the memory used to support real-time-get.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Fri, Jul 23, 2021 at 3:39 PM Pratik Patel <pra...@semandex.net> wrote:
>
> > Thanks for the response Joel.
> >
> > We do not use "Real-time-get" queries. Also, we don't query the index
> > while a particular stage of bulk indexing is going on. Would it still
> > help to enable soft commits?
> >
> > On Fri, Jul 23, 2021 at 3:16 PM Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > First thing to try is turning on softcommits. You need to open new
> > > searchers while indexing to free up the memory used to support
> > > real-time-get queries. Real-time-get supports queries on uncommitted
> > > data, so to support this a memory component is needed for records
> > > that are indexed, but not yet visible. Opening a new searcher will
> > > make these records visible and free the memory.
> > >
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > >
> > > On Fri, Jul 23, 2021 at 2:54 PM Pratik Patel <pra...@semandex.net>
> > > wrote:
> > >
> > > > Solr Cloud version is 8.5. I have also attached the solr log with gc
> > > > enabled and our app log which shows that there was a
> > > > SocketTimeoutException.
> > > >
> > > > On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net>
> > > > wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > > >> *tl;dr* : running into long GC pauses and solr client socket
> > > > >> timeouts when bulk indexing documents into solr. Commit strategy
> > > > >> in essence is to do hard commits at an interval of 50k documents
> > > > >> (maxDocs=50k) and disable soft commits altogether during bulk
> > > > >> indexing. Simple solr cloud setup with one node and one shard.
> > > >>
> > > >> *Details*:
> > > > >> We have about 6 million documents which we are trying to index
> > > > >> into solr. From these, about 500k documents have a text field
> > > > >> which holds Abstracts of scientific papers/Articles. We extract
> > > > >> keywords from these Abstracts and we index these keywords as well
> > > > >> into solr.
> > > >>
> > > >> We have a many to many kind of relationship between Articles and
> > > >> keywords. To store this, we have following structure.
> > > >>
> > > >> Article documents
> > > >> Keyword documents
> > > >> Article-Keyword Join documents
> > > >>
> > > > >> We use block join to index Articles with "Article-Keyword" join
> > > > >> documents and Keyword documents are indexed independently.
> > > > >>
> > > > >> In other words, we have blocks of "Article + Article-Keyword Joins"
> > > > >> and we have Keyword documents (they hold some additional metadata
> > > > >> about keyword).
> > > >>
> > > > >> We have a bulk processing operation which creates these documents
> > > > >> and indexes them into solr. During this bulk indexing, we don't
> > > > >> need documents to be searchable. We need to search against them
> > > > >> only after ALL the documents are indexed.
> > > >>
> > > > >> *Based on this, this is our current strategy.*
> > > > >> Soft commits are disabled and Hard commits are done at an interval
> > > > >> of 50k documents with openSearcher=false. Our code triggers
> > > > >> explicit commits 4 times after various stages of bulk indexing.
> > > > >> Transaction logs are enabled and have default settings.
> > > >>
> > > >>     <autoCommit>
> > > >>       <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
> > > >>       <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
> > > >>       <openSearcher>false</openSearcher>
> > > >>     </autoCommit>
> > > >>
> > > >>     <autoSoftCommit>
> > > >>       <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > > >>     </autoSoftCommit>
> > > >>
> > > >> Other Environmental Details:
> > > > >> Xms=8g and Xmx=14g, solr client socketTimeout=7 minutes and
> > > > >> zkClientTimeout=2 mins
> > > > >> Our indexing operation triggers many "add" operations in parallel
> > > > >> using RxJava (15 to 30 threads); each "add" operation is passed
> > > > >> about 1000 documents.
> > > >>
> > > > >> Currently, when we run this indexing operation, we notice that
> > > > >> after a while solr goes into long GC pauses (longer than our
> > > > >> socketTimeout of 7 minutes) and we get SocketTimeoutExceptions.
> > > >>
> > > >> *What could be causing such long GC pauses?*
> > > >>
> > > > >> *Does this commit strategy make sense? If not, what is the
> > > > >> recommended strategy that we can look into?*
> > > >>
> > > >> *Any help on this is much appreciated. Thanks.*
> > > >>
> > > >>
> > >
> >
>
