Re: hdfs namenode fails over frequently due to timeout with zkfc

Wenqi Ma Wed, 18 Sep 2019 23:34:20 -0700

Sure, I checked that. It is the NameNode health monitoring that is timing
out, for example:


2019-09-19 09:15:03,823 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at dphadoop20/192.168.1.20:8020 to
active state
2019-09-19 10:48:55,898 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
dphadoop20/192.168.1.20:8020: java.net.SocketTimeoutException: 45000 millis
timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622
remote=dphadoop20/192.168.1.20:8020] Call From dphadoop20/192.168.1.20 to
dphadoop20:8020 failed on socket timeout exception:
java.net.SocketTimeoutException: 45000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622
remote=dphadoop20/192.168.1.20:8020]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
2019-09-19 10:48:55,898 INFO org.apache.hadoop.ha.HealthMonitor: Entering
state SERVICE_NOT_RESPONDING
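
To see how long the NameNode stays active before each health-monitor timeout, the ZKFC log can be scanned for the two messages shown above. The following is only a minimal sketch: the function name `active_durations` and the sample list are made up for illustration, and it assumes each log entry sits on a single line in the real log file (the excerpt above is wrapped by the mail client).

```python
import re
from datetime import datetime

# Matches the log prefix seen in the excerpt above, e.g.
# "2019-09-19 09:15:03,823 INFO org.apache.hadoop.ha.ZKFailoverController: ..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) (\S+): ?(.*)$"
)


def active_durations(lines):
    """Return the time between each 'transitioned to active' message and the
    next 'Entering state SERVICE_NOT_RESPONDING' message."""
    durations = []
    became_active = None
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # not the start of a log entry (e.g. a stack trace line)
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        msg = m.group(4)
        if "Successfully transitioned NameNode" in msg and msg.endswith("to active state"):
            became_active = ts
        elif "Entering state SERVICE_NOT_RESPONDING" in msg and became_active is not None:
            durations.append(ts - became_active)
            became_active = None
    return durations


# The two relevant entries from the excerpt above, unwrapped to one line each.
sample = [
    "2019-09-19 09:15:03,823 INFO org.apache.hadoop.ha.ZKFailoverController: "
    "Successfully transitioned NameNode at dphadoop20/192.168.1.20:8020 to active state",
    "2019-09-19 10:48:55,898 INFO org.apache.hadoop.ha.HealthMonitor: "
    "Entering state SERVICE_NOT_RESPONDING",
]

for d in active_durations(sample):
    print(d)
```

For the excerpt above this gives roughly 1 hour 34 minutes between becoming active and the health check timing out.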

Then the standby NameNode is transitioned to the active state, while the
original active NameNode gets the following FATAL error and quits:
  IPC's epoch 353 is less than the last promised epoch 354
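
The epoch error is the JournalNodes fencing off the old active NameNode: once the new active has claimed a higher epoch, writes carrying the old epoch are refused. The sketch below is a simplified model of that check, not the actual Hadoop code; the class `JournalModel`, its methods, and `StaleEpochError` are hypothetical names, and only the error message format is taken from the log line above.

```python
class StaleEpochError(Exception):
    """Raised when a writer presents an epoch older than the promised one."""


class JournalModel:
    """Hypothetical, simplified model of a JournalNode's epoch bookkeeping."""

    def __init__(self, last_promised_epoch=0):
        self.last_promised_epoch = last_promised_epoch

    def new_epoch(self, epoch):
        # A NameNode that is becoming active must claim a strictly higher epoch.
        if epoch <= self.last_promised_epoch:
            raise StaleEpochError(
                f"Proposed epoch {epoch} <= last promise {self.last_promised_epoch}"
            )
        self.last_promised_epoch = epoch

    def check_request(self, epoch):
        # Every write from a NameNode carries its epoch; older epochs are refused.
        if epoch < self.last_promised_epoch:
            raise StaleEpochError(
                f"IPC's epoch {epoch} is less than the last promised epoch "
                f"{self.last_promised_epoch}"
            )


jn = JournalModel(last_promised_epoch=353)
jn.check_request(353)      # the original active NameNode is still accepted
jn.new_epoch(354)          # the standby takes over and claims epoch 354
try:
    jn.check_request(353)  # the original active NameNode tries to write again
    rejected = None
except StaleEpochError as e:
    rejected = str(e)
print(rejected)
```

In this model the old active NameNode has no way to continue once epoch 354 is promised, which is consistent with it logging the FATAL error and exiting.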

BTW, the stopped NameNode is restarted immediately. However, because the
fsimage file is huge (about 26 GB), it needs about 30 minutes to load the
fsimage and another 30 minutes to process block reports before it leaves
safe mode.


HK <hemakumar.sunn...@gmail.com> wrote on Thu, Sep 19, 2019 at 12:19 PM:

> Are you checking the ZKFC process logs and jstack?
> At what stage is ZKFC timing out? Is the ZK session timing out, or is the
> NameNode health monitoring timing out?
>
-- 
Best Regards!
Wenqi
