Hi Yizhou,

Yes, this might be causing the failovers. I've seen situations where the
download of a large fsimage from the SBNN, plus additional requests to the ANN,
led to higher disk latency, which caused any service RPC request that requires
the HDFS write lock to take longer to process. This can cause a failover if all
service RPC handlers stay busy for longer than the FC's 45-second timeout, so
that the FC's request sits in the queue for that whole time.

You may be able to confirm this by collecting jstacks of the ANN (you would
need a few jstacks covering the failover period). The pattern in the jstacks
would be that all but one service RPC handler thread are waiting on the same
lock, while only one is runnable.
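
As a rough sketch of how you could check for that pattern across several
dumps (this is my own helper, not a Hadoop tool; the thread-name prefix
"IPC Server handler" and the optional port filter are assumptions, so adjust
them to whatever your jstacks actually show):

```python
import re
import sys
from collections import Counter

# Assumption: RPC handler threads are named like "IPC Server handler 12 on 8022".
HANDLER_PREFIX = "IPC Server handler"

THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
THREAD_STATE = re.compile(r'java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')
WAITING_ON = re.compile(
    r'- (?:parking to wait for|waiting to lock|waiting on)\s+<(?P<addr>0x[0-9a-f]+)>'
)


def summarize_handlers(dump_text, port=None):
    """Return (state counts, lock-address counts) for handler threads in one dump."""
    states = Counter()
    locks = Counter()
    in_handler = False
    seen_lock = False
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            # A new thread entry starts here; decide whether it is a handler.
            name = header.group("name")
            in_handler = name.startswith(HANDLER_PREFIX) and (
                port is None or name.endswith(" %d" % port)
            )
            seen_lock = False
            continue
        if not in_handler:
            continue
        state = THREAD_STATE.search(line)
        if state:
            states[state.group("state")] += 1
            continue
        lock = WAITING_ON.search(line)
        if lock and not seen_lock:
            # Count only the first lock each handler thread is waiting for.
            locks[lock.group("addr")] += 1
            seen_lock = True
    return states, locks


if __name__ == "__main__":
    # Usage: python handlers.py jstack1.txt jstack2.txt ...
    for path in sys.argv[1:]:
        with open(path) as f:
            states, locks = summarize_handlers(f.read())
        print(path, dict(states), locks.most_common(1))
```

If most handlers show up as WAITING on one lock address in every dump taken
around the failover, with a single RUNNABLE handler, that would match what I
described above.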

You might also want to check the dmesg output for "blocked" process messages.
If there are none, set hung_task_timeout_secs to 40 seconds until the next
failover, so that you can catch a potential OS pause causing the failover. Such
a pause may be an indication of file system cache flushes, as described
below:

https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
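
For the dmesg check, a minimal sketch along these lines might help (again my
own helper; it assumes the usual kernel message wording "task NAME:PID blocked
for more than N seconds" and that your kernel exposes
/proc/sys/kernel/hung_task_timeout_secs). The timeout itself can be changed as
root with "sysctl -w kernel.hung_task_timeout_secs=40".

```python
import re
import subprocess

# Kernel hung-task messages normally look like:
#   INFO: task jbd2/sda1-8:1234 blocked for more than 120 seconds.
BLOCKED = re.compile(
    r'task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds'
)


def find_blocked_tasks(dmesg_text):
    """Return a list of (task name, pid, seconds) for each hung-task message."""
    return [
        (m.group("task"), int(m.group("pid")), int(m.group("secs")))
        for m in BLOCKED.finditer(dmesg_text)
    ]


def read_hung_task_timeout(path="/proc/sys/kernel/hung_task_timeout_secs"):
    """Read the current hung-task threshold in seconds."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    print("hung_task_timeout_secs =", read_hung_task_timeout())
    # dmesg may need root on some systems; an empty result then means "no access".
    out = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    ).stdout
    for task, pid, secs in find_blocked_tasks(out):
        print("blocked: %s (pid %d) for more than %d seconds" % (task, pid, secs))
```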
 

Regards,
Wellington.


> On 26 Apr 2017, at 23:41, Anu Engineer <[email protected]> wrote:
> 
> 1. ANN (active namenode) downloading fsimage.ckpt_* from SNN (standby
> namenode) leads to very high disk IO; at the same time, zkfc fails to
> monitor the health of the ANN due to a timeout. Is there any relationship
> between high disk IO and the zkfc monitor request timeout? Every failover
> happened during a ckpt download, but not every ckpt download leads to a
> failover.
