There are a number of timeouts that can trip this; the ZK timeout is only one.
For instance, when a leader sends an update to a follower and that request times out, the leader may put the follower into "Leader Initiated Recovery" (LIR). 60G heaps are, by and large, not recommended for this very reason. Consider running more JVMs with less memory, each hosting fewer Solr replicas. Two sketches of what I mean follow below the quoted mail.

Best,
Erick

> On May 13, 2019, at 9:26 AM, Maulin Rathod <mrat...@asite.com> wrote:
> 
> Hi,
> 
> We are using Solr 6.1 with 2 shards. Each shard has 1 replica, i.e. we have 4 server nodes in total (each node is assigned 60 GB of RAM).
> 
> Recently we have been observing an issue where a Solr node (any random node) automatically goes into recovery mode and stops responding.
> 
> We have enough memory allocated to Solr (60 GB) and the system also has enough memory (300 GB)...
> 
> We analyzed the GC logs and found a GC pause of 29.6583943 seconds when the problem happened. Can this GC pause make the node unavailable / push it into recovery mode, or could there be some other reason?
> 
> Please note we have set zkClientTimeout to 10 minutes (zkClientTimeout=600000) so that ZooKeeper will not consider this node unavailable during long GC pauses.
> 
> Solr GC Logs
> ==========
> 
> {Heap before GC invocations=10940 (full 14):
>  par new generation   total 17476288K, used 14724911K [0x0000000080000000, 0x0000000580000000, 0x0000000580000000)
>   eden space 13981056K, 100% used [0x0000000080000000, 0x00000003d5560000, 0x00000003d5560000)
>   from space 3495232K,  21% used [0x00000003d5560000, 0x0000000402bcbdb0, 0x00000004aaab0000)
>   to   space 3495232K,   0% used [0x00000004aaab0000, 0x00000004aaab0000, 0x0000000580000000)
>  concurrent mark-sweep generation total 62914560K, used 27668932K [0x0000000580000000, 0x0000001480000000, 0x0000001480000000)
>  Metaspace       used 47602K, capacity 48370K, committed 49860K, reserved 51200K
> 2019-05-13T12:23:19.103+0100: 174643.550: [GC (Allocation Failure) 174643.550: [ParNew
> Desired survivor size 3221205808 bytes, new threshold 8 (max 8)
> - age   1:   52251504 bytes,   52251504 total
> - age   2:  208183784 bytes,  260435288 total
> - age   3:  274752960 bytes,  535188248 total
> - age   4:   12176528 bytes,  547364776 total
> - age   5:    6135968 bytes,  553500744 total
> - age   6:    3903152 bytes,  557403896 total
> - age   7:   15341896 bytes,  572745792 total
> - age   8:    5518880 bytes,  578264672 total
> : 14724911K->762845K(17476288K), 24.7822734 secs] 42393844K->28434889K(80390848K), 24.7825687 secs] [Times: user=157.97 sys=25.63, real=24.78 secs]
> Heap after GC invocations=10941 (full 14):
>  par new generation   total 17476288K, used 762845K [0x0000000080000000, 0x0000000580000000, 0x0000000580000000)
>   eden space 13981056K,   0% used [0x0000000080000000, 0x0000000080000000, 0x00000003d5560000)
>   from space 3495232K,  21% used [0x00000004aaab0000, 0x00000004d93a76a8, 0x0000000580000000)
>   to   space 3495232K,   0% used [0x00000003d5560000, 0x00000003d5560000, 0x00000004aaab0000)
>  concurrent mark-sweep generation total 62914560K, used 27672043K [0x0000000580000000, 0x0000001480000000, 0x0000001480000000)
>  Metaspace       used 47602K, capacity 48370K, committed 49860K, reserved 51200K
> }
> 2019-05-13T12:23:44.456+0100: 174668.901: Total time for which application threads were stopped: 29.6583943 seconds, Stopping threads took: 4.3050775 seconds
> 
> ==============================
> 
> Regards,
> 
> Maulin
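First, a minimal sketch of where those timeouts live, in the <solrcloud> section of solr.xml. The element names are the standard solr.xml settings; the default values shown are illustrative, not a recommendation, with zkClientTimeout=600000 matching what Maulin set. The point is that zkClientTimeout only keeps the ZooKeeper session alive; the distribUpdate* timeouts govern the leader-to-replica requests that can trigger LIR.

    <solrcloud>
      <str name="host">${host:}</str>
      <int name="hostPort">${jetty.port:8983}</int>
      <!-- ZooKeeper session timeout: raising this only stops ZK from
           expiring the node's session during a long pause -->
      <int name="zkClientTimeout">${zkClientTimeout:600000}</int>
      <!-- leader-to-replica update timeouts: if an update to a follower
           times out here, the leader can put that follower into LIR no
           matter how large zkClientTimeout is -->
      <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
      <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
    </solrcloud>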
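Second, a sketch of the "more, smaller JVMs" idea. The quoted log shows why: a 24.78-second ParNew collection inside a 29.66-second total stop (about 4.3 seconds of which was just bringing threads to a safepoint), and a pause that long will blow through inter-node timeouts regardless of zkClientTimeout. Something like the following, where the Solr home directories, ports, heap size, and ZK ensemble address are all examples for illustration:

    # Instead of one JVM with a 60g heap, run several smaller nodes per box,
    # each with its own Solr home and port:
    bin/solr start -cloud -s /var/solr/node1 -p 8983 -m 8g -z zk1:2181,zk2:2181,zk3:2181
    bin/solr start -cloud -s /var/solr/node2 -p 8984 -m 8g -z zk1:2181,zk2:2181,zk3:2181

With an 8g heap the young generation is a fraction of the 17 GB one in the quoted log, so individual ParNew pauses shrink correspondingly, and losing one small node takes far fewer replicas into recovery with it.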