Thanks for your reply.

You are right. I checked the GC log and, using GCViewer, I noticed that the
pause time was 111.4546597 secs.

GC log:

2019-04-08T13:52:09.198+0100: 796799.689: [CMS-concurrent-mark: 1.676/30.552 
secs] [Times: user=93.42 sys=34.11, real=30.55 secs]
2019-04-08T13:52:09.198+0100: 796799.689: [CMS-concurrent-preclean-start]
2019-04-08T13:52:09.603+0100: 796800.094: [CMS-concurrent-preclean: 0.387/0.405 
secs] [Times: user=8.47 sys=1.13, real=0.40 secs]
2019-04-08T13:52:09.603+0100: 796800.095: 
[CMS-concurrent-abortable-preclean-start]
{Heap before GC invocations=112412 (full 55591):
 par new generation   total 13107200K, used 11580169K [0x0000000080000000, 
0x0000000440000000, 0x0000000440000000)
  eden space 10485760K, 100% used [0x0000000080000000, 0x0000000300000000, 
0x0000000300000000)
  from space 2621440K,  41% used [0x0000000300000000, 0x0000000342cc2600, 
0x00000003a0000000)
  to   space 2621440K,   0% used [0x00000003a0000000, 0x00000003a0000000, 
0x0000000440000000)
 concurrent mark-sweep generation total 47185920K, used 28266850K 
[0x0000000440000000, 0x0000000f80000000, 0x0000000f80000000)
 Metaspace       used 49763K, capacity 50614K, committed 53408K, reserved 55296K
2019-04-08T13:52:09.939+0100: 796800.430: [GC (Allocation Failure) 796800.431: 
[ParNew
Desired survivor size 2415919104 bytes, new threshold 8 (max 8)
- age   1:  197413992 bytes,  197413992 total
- age   2:  170743472 bytes,  368157464 total
- age   3:  218531128 bytes,  586688592 total
- age   4:    3636992 bytes,  590325584 total
- age   5:   18608784 bytes,  608934368 total
- age   6:  163869560 bytes,  772803928 total
- age   7:   55349616 bytes,  828153544 total
- age   8:    5124472 bytes,  833278016 total
: 11580169K->985493K(13107200K), 111.4543849 secs] 
39847019K->29253720K(60293120K), 111.4546597 secs] [Times: user=302.38 
sys=109.81, real=111.46 secs]
Heap after GC invocations=112413 (full 55591):
 par new generation   total 13107200K, used 985493K [0x0000000080000000, 
0x0000000440000000, 0x0000000440000000)
  eden space 10485760K,   0% used [0x0000000080000000, 0x0000000080000000, 
0x0000000300000000)
  from space 2621440K,  37% used [0x00000003a0000000, 0x00000003dc265470, 
0x0000000440000000)
  to   space 2621440K,   0% used [0x0000000300000000, 0x0000000300000000, 
0x00000003a0000000)
 concurrent mark-sweep generation total 47185920K, used 28268227K 
[0x0000000440000000, 0x0000000f80000000, 0x0000000f80000000)
 Metaspace       used 49763K, capacity 50614K, committed 53408K, reserved 55296K
}
2019-04-08T13:54:01.394+0100: 796911.885: Total time for which application 
threads were stopped: 111.4638238 seconds, Stopping threads took: 0.0069189 
seconds
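
For reference, a GC log with this level of detail (date stamps, tenuring
distribution, heap before/after GC, stopped time) is typically produced with
flags along these lines; this is only a sketch, and the exact set and log path
on our nodes may differ:

-Xloggc:/path/to/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintHeapAtGC
-XX:+PrintGCApplicationStoppedTime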


Can I set a maximum timeout in solr.xml or in any ZooKeeper file for when the
GC pause is around 2 seconds? And what should I do when the GC pause time is
longer than that?
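
For example, here is a minimal sketch of where the ZooKeeper client timeout
lives in solr.xml (the value is only illustrative, not a recommendation):

<solr>
  <solrcloud>
    <!-- how long a node may be unresponsive (e.g. during a GC pause)
         before its ZooKeeper session is considered expired -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>

If I understand correctly, the effective session timeout is also capped by
maxSessionTimeout in the ZooKeeper server's zoo.cfg, so both would need to
allow the same value.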

________________________________
From: Erick Erickson <erickerick...@gmail.com>
Sent: Thursday, April 18, 2019 7:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Replica becomes leader when shard was taking a time to update 
document - Solr 6.1.0

Specifically, a _leader_ being put into the down or recovering state is almost
always because ZooKeeper cannot ping it and get a response back before it times
out. This also points to large GC pauses on the Solr node. Using something like
GCViewer on the GC logs at the time of the problem will help a lot.

A _follower_ can go into recovery when an update takes too long but that’s 
“leader initiated recovery” and originates _from_ the leader, which is much 
different than the leader going into a down state.

Best,
Erick

> On Apr 17, 2019, at 7:54 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 4/17/2019 6:25 AM, vishal patel wrote:
>> Why did shard1 take 1.8 minutes for an update? And if the update took that
>> long, why did replica1 try to become leader? Do I need to adjust any
>> timeout?
>
> There's no information here that can tell us why the update took so long.  My 
> best guess would be long GC pauses due to the heap size being too small.  But 
> there might be other causes.
>
> Indexing a single document should be VERY fast.  Even a large document should 
> only take a handful of milliseconds.
>
> If the request included "commit=true" as a parameter, then it might be the 
> commit that was slow, not the indexing.  You'll need to check the logs to 
> determine that.
>
> The reason that the leader changed was almost certainly the fact that the 
> update took so long.  SolrCloud would have decided that the node was down if 
> any operation took that long.
>
> Thanks,
> Shawn
