Thanks for your reply. You are right. I checked the GC log with GCViewer and noticed that the pause time was 111.4546597 secs.
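As an aside, long pauses like this can be pulled out of a GC log mechanically rather than by eye. A minimal sketch (my own illustration, not part of Solr; the regex assumes the standard HotSpot safepoint summary lines of the form "Total time for which application threads were stopped: N seconds", as seen in the log below):

```python
import re

# Matches HotSpot safepoint summary lines, e.g.:
#   ... Total time for which application threads were stopped: 111.4638238 seconds, ...
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

def long_pauses(log_lines, threshold_secs=2.0):
    """Return stop-the-world pause durations (seconds) at or above threshold_secs."""
    pauses = []
    for line in log_lines:
        m = PAUSE_RE.search(line)
        if m:
            secs = float(m.group(1))
            if secs >= threshold_secs:
                pauses.append(secs)
    return pauses

# Two sample lines in the format from the log below (one long pause, one short).
sample = [
    "2019-04-08T13:54:01.394+0100: 796911.885: Total time for which "
    "application threads were stopped: 111.4638238 seconds, "
    "Stopping threads took: 0.0069189 seconds",
    "2019-04-08T13:54:02.100+0100: Total time for which application "
    "threads were stopped: 0.0031200 seconds, "
    "Stopping threads took: 0.0001000 seconds",
]
print(long_pauses(sample))  # only the 111-second pause exceeds the 2-second threshold
```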
GC log:

2019-04-08T13:52:09.198+0100: 796799.689: [CMS-concurrent-mark: 1.676/30.552 secs] [Times: user=93.42 sys=34.11, real=30.55 secs]
2019-04-08T13:52:09.198+0100: 796799.689: [CMS-concurrent-preclean-start]
2019-04-08T13:52:09.603+0100: 796800.094: [CMS-concurrent-preclean: 0.387/0.405 secs] [Times: user=8.47 sys=1.13, real=0.40 secs]
2019-04-08T13:52:09.603+0100: 796800.095: [CMS-concurrent-abortable-preclean-start]
{Heap before GC invocations=112412 (full 55591):
 par new generation   total 13107200K, used 11580169K [0x0000000080000000, 0x0000000440000000, 0x0000000440000000)
  eden space 10485760K, 100% used [0x0000000080000000, 0x0000000300000000, 0x0000000300000000)
  from space  2621440K,  41% used [0x0000000300000000, 0x0000000342cc2600, 0x00000003a0000000)
  to   space  2621440K,   0% used [0x00000003a0000000, 0x00000003a0000000, 0x0000000440000000)
 concurrent mark-sweep generation total 47185920K, used 28266850K [0x0000000440000000, 0x0000000f80000000, 0x0000000f80000000)
 Metaspace       used 49763K, capacity 50614K, committed 53408K, reserved 55296K
2019-04-08T13:52:09.939+0100: 796800.430: [GC (Allocation Failure) 796800.431: [ParNew
Desired survivor size 2415919104 bytes, new threshold 8 (max 8)
- age   1:  197413992 bytes,  197413992 total
- age   2:  170743472 bytes,  368157464 total
- age   3:  218531128 bytes,  586688592 total
- age   4:    3636992 bytes,  590325584 total
- age   5:   18608784 bytes,  608934368 total
- age   6:  163869560 bytes,  772803928 total
- age   7:   55349616 bytes,  828153544 total
- age   8:    5124472 bytes,  833278016 total
: 11580169K->985493K(13107200K), 111.4543849 secs] 39847019K->29253720K(60293120K), 111.4546597 secs] [Times: user=302.38 sys=109.81, real=111.46 secs]
Heap after GC invocations=112413 (full 55591):
 par new generation   total 13107200K, used 985493K [0x0000000080000000, 0x0000000440000000, 0x0000000440000000)
  eden space 10485760K,   0% used [0x0000000080000000, 0x0000000080000000, 0x0000000300000000)
  from space  2621440K,  37% used [0x00000003a0000000, 0x00000003dc265470, 0x0000000440000000)
  to   space  2621440K,   0% used [0x0000000300000000, 0x0000000300000000, 0x00000003a0000000)
 concurrent mark-sweep generation total 47185920K, used 28268227K [0x0000000440000000, 0x0000000f80000000, 0x0000000f80000000)
 Metaspace       used 49763K, capacity 50614K, committed 53408K, reserved 55296K
}
2019-04-08T13:54:01.394+0100: 796911.885: Total time for which application threads were stopped: 111.4638238 seconds, Stopping threads took: 0.0069189 seconds

Can I set a maximum timeout in solr.xml, or in any ZooKeeper file, so that a GC pause of around 2 seconds is tolerated? And what should I do when the GC pause time is longer than that?

________________________________
From: Erick Erickson <erickerick...@gmail.com>
Sent: Thursday, April 18, 2019 7:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Replica becomes leader when shard was taking a time to update document - Solr 6.1.0

Specifically, a _leader_ being put into the down or recovering state is almost always because ZooKeeper cannot ping it and get a response back before it times out. This also points to large GC pauses on the Solr node. Using something like GCViewer on the GC logs at the time of the problem will help a lot.

A _follower_ can go into recovery when an update takes too long, but that's "leader initiated recovery" and originates _from_ the leader, which is much different than the leader going into a down state.

Best,
Erick

> On Apr 17, 2019, at 7:54 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 4/17/2019 6:25 AM, vishal patel wrote:
>> Why did shard1 take 1.8 minutes for the update? And if the update took that long, why did replica1 try to become leader? Do I need to change any timeout?
>
> There's no information here that can tell us why the update took so long. My best guess would be long GC pauses due to the heap size being too small. But there might be other causes.
>
> Indexing a single document should be VERY fast. Even a large document should only take a handful of milliseconds.
>
> If the request included "commit=true" as a parameter, then it might be the commit that was slow, not the indexing. You'll need to check the logs to determine that.
>
> The reason that the leader changed was almost certainly the fact that the update took so long. SolrCloud would have decided that the node was down if any operation took that long.
>
> Thanks,
> Shawn
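For reference on the timeout question raised above: in SolrCloud the ZooKeeper session timeout is controlled by zkClientTimeout in the solrcloud section of solr.xml. A rough sketch of that setting (values illustrative; note that raising it only masks pauses shorter than the timeout, so a 111-second pause cannot realistically be covered this way, and the underlying GC behaviour still needs fixing):

```xml
<solr>
  <solrcloud>
    <!-- ZooKeeper session timeout in milliseconds (30000 shown as an
         illustrative default). A GC pause longer than this value will
         still cause the ZooKeeper session to expire and the node to be
         marked down. -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>
```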