Whether or not you use real-time-get, you still need to soft commit to release the memory used to support it.
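For reference, here is a minimal sketch of what enabling soft commits could look like in solrconfig.xml, reusing the autoCommit block quoted below. The 60000 ms (60 second) soft commit interval is an illustrative assumption, not a value recommended anywhere in this thread; it would need tuning against the indexing rate and heap size.

```xml
<!-- Hard commit, unchanged from the quoted config: triggers every 50k docs
     and does not open a new searcher. -->
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
  <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: opens a new searcher on a time interval.
     60000 ms is an assumed placeholder value, not a recommendation. -->
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
</autoSoftCommit>
```

Because the default is read from the solr.autoSoftCommit.maxTime system property, the same interval can also be supplied at startup (for example -Dsolr.autoSoftCommit.maxTime=60000) without editing the file.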

Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Jul 23, 2021 at 3:39 PM Pratik Patel <pra...@semandex.net> wrote:

> Thanks for the response Joel.
>
> We do not use "Real-time-get" queries. Also, we don't query the index while
> a particular stage of bulk indexing is going on. Would it still help to
> enable soft commits?
>
> On Fri, Jul 23, 2021 at 3:16 PM Joel Bernstein <joels...@gmail.com> wrote:
>
> > First thing to try is turning on softcommits. You need to open new
> > searchers while indexing to free up the memory used to support
> > real-time-get queries. Real-time-get supports queries on uncommitted
> > data, so to support this a memory component is needed for records that
> > are indexed, but not yet visible. Opening a new searcher will make these
> > records visible and free the memory.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Jul 23, 2021 at 2:54 PM Pratik Patel <pra...@semandex.net> wrote:
> >
> > > Solr Cloud version is 8.5. I have also attached the solr log with gc
> > > enabled and our app log which shows that there was
> > > SocketTimeoutException.
> > >
> > > On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net> wrote:
> > >
> > >> Hi All,
> > >>
> > >> *tl;dr* : running into long GC pauses and solr client socket timeouts
> > >> when indexing bulk of documents into solr. Commit strategy in essence
> > >> is to do hard commits at the interval of 50k documents (maxDocs=50k)
> > >> and disable soft commit altogether during bulk indexing. Simple solr
> > >> cloud set up with one node and one shard.
> > >>
> > >> *Details*:
> > >> We have about 6 million documents which we are trying to index into
> > >> solr. From these, about 500k documents have a text field which holds
> > >> Abstracts of scientific papers/Articles. We extract keywords from
> > >> these Abstracts and we index these keywords as well into solr.
> > >>
> > >> We have a many to many kind of relationship between Articles and
> > >> keywords. To store this, we have following structure.
> > >>
> > >> Article documents
> > >> Keyword documents
> > >> Article-Keyword Join documents
> > >>
> > >> We use block join to index Articles with "Article-Keyword" join
> > >> documents and Keyword documents are indexed independently.
> > >>
> > >> In other words, we have blocks of "Article + Article-Keyword Joins"
> > >> and we have Keyword documents(they hold some additional metadata about
> > >> keyword ).
> > >>
> > >> We have a bulk processing operation which creates these documents and
> > >> indexes them into solr. During this bulk indexing, we don't need
> > >> documents to be searchable. We need to search against them only after
> > >> ALL the documents are indexed.
> > >>
> > >> *Based on this, this is our current strategy. *
> > >> Soft commits are disabled and Hard commits are done at an interval of
> > >> 50k documents with openSearcher=false. Our code triggers explicit
> > >> commits 4 times after various stages of bulk indexing. Transaction
> > >> logs are enabled and have default settings.
> > >>
> > >> <autoCommit>
> > >>   <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
> > >>   <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
> > >>   <openSearcher>false</openSearcher>
> > >> </autoCommit>
> > >>
> > >> <autoSoftCommit>
> > >>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > >> </autoSoftCommit>
> > >>
> > >> Other Environmental Details:
> > >> Xms=8g and Xmx=14g, solr client socketTimeout=7 minutes and
> > >> zkClienttimeout=2 mins
> > >> Our indexing operation triggers many "add" operations in parallel
> > >> using RxJava (15 to 30 threads) each "add" operation is passed about
> > >> 1000 documents.
> > >>
> > >> Currently, when we run this indexing operation, we notice that after a
> > >> while solr goes into long GC pauses (longer than our sockeTimeout of 7
> > >> minutes) and we get SocketTimeoutExceptions.
> > >>
> > >> *What could be causing such long GC pauses?*
> > >>
> > >> *Does this commit strategy make sense ? If not, what is the
> > >> recommended strategy that we can look into? *
> > >>
> > >> *Any help on this is much appreciated. Thanks.*