The first thing to try is turning on soft commits. You need to open new searchers while indexing to free up the memory used to support real-time-get queries. Real-time get serves queries on uncommitted data, so a memory component is needed for records that have been indexed but are not yet visible to a searcher. Opening a new searcher makes these records visible and frees that memory.
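For example, re-enabling a soft commit interval in solrconfig.xml for the duration of the bulk load might look roughly like this; the 5-minute interval is only an illustrative value, not a recommendation from this thread, and should be tuned to your own indexing load:

<autoSoftCommit>
  <!-- Illustrative value (5 minutes): periodically open a new searcher during bulk
       indexing so the memory backing real-time-get for uncommitted records is released. -->
  <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
</autoSoftCommit>

The hard commit can keep openSearcher=false; per the point above, it is the periodic soft commit (new searcher) that makes the buffered records visible and releases that memory.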
Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jul 23, 2021 at 2:54 PM Pratik Patel <pra...@semandex.net> wrote:

> Solr Cloud version is 8.5. I have also attached the Solr log with GC
> enabled and our app log, which shows that there was a SocketTimeoutException.
>
> On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net> wrote:
>
>> Hi All,
>>
>> *tl;dr*: Running into long GC pauses and Solr client socket timeouts when
>> bulk indexing documents into Solr. The commit strategy, in essence, is to do
>> hard commits at an interval of 50k documents (maxDocs=50k) and to disable
>> soft commits altogether during bulk indexing. Simple SolrCloud setup with
>> one node and one shard.
>>
>> *Details*:
>> We have about 6 million documents which we are trying to index into Solr.
>> Of these, about 500k documents have a text field which holds abstracts of
>> scientific papers/articles. We extract keywords from these abstracts and
>> index those keywords into Solr as well.
>>
>> We have a many-to-many relationship between Articles and Keywords. To
>> store this, we have the following structure:
>>
>> Article documents
>> Keyword documents
>> Article-Keyword Join documents
>>
>> We use block join to index Articles with "Article-Keyword" Join documents,
>> and Keyword documents are indexed independently.
>>
>> In other words, we have blocks of "Article + Article-Keyword Joins", and we
>> have Keyword documents (which hold some additional metadata about the
>> keyword).
>>
>> We have a bulk processing operation which creates these documents and
>> indexes them into Solr. During this bulk indexing, we don't need the
>> documents to be searchable. We need to search against them only after ALL
>> the documents are indexed.
>>
>> *Based on this, this is our current strategy:*
>> Soft commits are disabled and hard commits are done at an interval of 50k
>> documents with openSearcher=false. Our code triggers explicit commits 4
>> times after various stages of bulk indexing. Transaction logs are enabled
>> and have default settings.
>>
>> <autoCommit>
>>   <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
>>   <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>>
>> <autoSoftCommit>
>>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>> </autoSoftCommit>
>>
>> Other environmental details:
>> Xms=8g and Xmx=14g, Solr client socketTimeout=7 minutes, and
>> zkClientTimeout=2 minutes.
>> Our indexing operation triggers many "add" operations in parallel using
>> RxJava (15 to 30 threads); each "add" operation is passed about 1000
>> documents.
>>
>> Currently, when we run this indexing operation, we notice that after a
>> while Solr goes into long GC pauses (longer than our socketTimeout of 7
>> minutes) and we get SocketTimeoutExceptions.
>>
>> *What could be causing such long GC pauses?*
>>
>> *Does this commit strategy make sense? If not, what is the recommended
>> strategy that we can look into?*
>>
>> *Any help on this is much appreciated. Thanks.*
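For reference, the block structure described in the quoted message, a parent Article with its Article-Keyword Join children plus independently indexed Keyword documents, would look roughly like the sketch below in Solr's XML update format; the field names and ids are made up for illustration and are not taken from the original message:

<add>
  <doc>
    <field name="id">article-1</field>
    <field name="type">Article</field>
    <!-- child documents of the block: the Article-Keyword Join records -->
    <doc>
      <field name="id">article-1-join-1</field>
      <field name="type">ArticleKeywordJoin</field>
      <field name="keywordId">keyword-42</field>
    </doc>
  </doc>
  <!-- Keyword documents are indexed independently, outside any block -->
  <doc>
    <field name="id">keyword-42</field>
    <field name="type">Keyword</field>
  </doc>
</add>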