Thanks for the helpful advice. I got that "RpcQueueTimeNumOps" : 33612239,
"RpcQueueTimeAvgTime" : 7.782384127056349, "RpcProcessingTimeNumOps" : 33612239, "RpcProcessingTimeAvgTime" : 32.94238776763952, It is very close to the timeout time: 45s. I will monitor these value for a while and then try to find a chance to enable the service-rpc. Thanks again. HK <hemakumar.sunn...@gmail.com> 于2019年9月19日周四 下午3:28写道: > Do you push Namenode JMX metrics to somewhere? Please check RPC avg > processing time and RPC queue time avg time. If it is higher than the time > out, health monitor request is waiting more time in the RPC queue to get it > served. > Enabling service RPC will definitely resolve this issue. > You can also enable QOS on 8020, it will make sure heavy users does not > impact the other users. > More info abut QOS, > https://tech.ebayinc.com/engineering/quality-of-service-in-hadoop/ > > -- Hema Kumar > > On Thu, Sep 19, 2019 at 12:31 PM Wenqi Ma <mawenqi...@gmail.com> wrote: > >> More information: >> 1. The balancer is running. And if we stop it, failover would only happen >> about 2-3 times a day. But, we have to run it since the datanodes usage is >> like: 14.65% / 78.37% / 83.18% / 23.27% >> 2. Jvm pause log is not often, and all pauses are less than 2 seconds >> >> Wenqi Ma <mawenqi...@gmail.com> 于2019年9月19日周四 下午2:33写道: >> >>> Sure I checked that, and it is namenode health monitoring timing out, >>> like: >>> >>> 2019-09-19 09:15:03,823 INFO org.apache.hadoop.ha.ZKFailoverController: >>> Successfully transitioned NameNode at dphadoop20/192.168.1.20:8020 to >>> active state >>> 2019-09-19 10:48:55,898 WARN org.apache.hadoop.ha.HealthMonitor: >>> Transport-level exception trying to monitor health of NameNode at >>> dphadoop20/192.168.1.20:8020: java.net.SocketTimeoutException: 45000 >>> millis timeout while waiting for channel to be ready for read. 
ch : >>> java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622 >>> remote=dphadoop20/192.168.1.20:8020] Call From dphadoop20/192.168.1.20 >>> to dphadoop20:8020 failed on socket timeout exception: >>> java.net.SocketTimeoutException: 45000 millis timeout while waiting for >>> channel to be ready for read. ch : >>> java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622 >>> remote=dphadoop20/192.168.1.20:8020]; For more details see: >>> http://wiki.apache.org/hadoop/SocketTimeout >>> 2019-09-19 10:48:55,898 INFO org.apache.hadoop.ha.HealthMonitor: >>> Entering state SERVICE_NOT_RESPONDING >>> >>> Then the standby namenode will be transitioned to active state, while >>> the original active namenode will get following FATAL error and quit: >>> IPC's epoch 353 is less than the last promised epoch 354 >>> >>> BTW, the stopped namenode wil be started up immediately, however, since >>> the fsimage file is huge, about 26GB, so it needs about 30 minutes to load >>> the fsimage and another 30 minutes to handle block report to quit the safe >>> mode. >>> >>> >>> HK <hemakumar.sunn...@gmail.com> 于2019年9月19日周四 下午12:19写道: >>> >>>> Are you checking ZKFC process logs and jstack? >>>> At what stage ZKFC timing out? zk session timing out? or namenode >>>> health monitoring timing out? >>>> >>>> >>>>> >>> -- >>> Best Regards! >>> Wenqi >>> >>> >> >> -- >> Best Regards! >> Wenqi >> >> -- Best Regards! Wenqi
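
[Editor's note: for when the thread's suggestion to enable the service RPC is picked up, a minimal hdfs-site.xml sketch follows. The nameservice ID, NameNode IDs, second hostname, and port are placeholders invented for illustration, not taken from the thread.]

```xml
<!-- Dedicated service RPC endpoint so DataNode/ZKFC traffic no longer
     queues behind client RPCs on 8020. All values are placeholders. -->
<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
  <value>dphadoop20:8021</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
  <value>dphadoop21:8021</value>
</property>
```

After adding this, the NameNodes, ZKFCs, and DataNodes need a restart; on HA clusters the ZKFC state in ZooKeeper typically also has to be re-initialized (`hdfs zkfc -formatZK`) after the address change — verify against the documentation for your Hadoop version before doing this in production.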
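
[Editor's note: the check suggested in the thread — comparing the JMX RPC metrics against the health-monitor timeout — can be sketched as below. This is a hypothetical helper, not code from the thread; the JMX URL in the comment and the bean name are assumptions based on typical NameNode deployments, while the metric values are the ones quoted above.]

```python
import json

# Sample JMX payload using the figures quoted in the thread. In a live
# cluster this JSON would come from the NameNode's JMX servlet, e.g.
# http://<namenode>:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020
# (URL, port, and bean name are assumptions; adjust for your deployment).
SAMPLE_JMX = json.dumps({
    "beans": [{
        "name": "Hadoop:service=NameNode,name=RpcActivityForPort8020",
        "RpcQueueTimeNumOps": 33612239,
        "RpcQueueTimeAvgTime": 7.782384127056349,
        "RpcProcessingTimeNumOps": 33612239,
        "RpcProcessingTimeAvgTime": 32.94238776763952,
    }]
})

# Matches the 45000 ms in the SocketTimeoutException above
# (ha.health-monitor.rpc-timeout.ms).
HEALTH_MONITOR_TIMEOUT_MS = 45000


def rpc_latency_ms(jmx_json: str) -> float:
    """Average queue time + average processing time, in ms, per RPC call."""
    bean = json.loads(jmx_json)["beans"][0]
    return bean["RpcQueueTimeAvgTime"] + bean["RpcProcessingTimeAvgTime"]


latency = rpc_latency_ms(SAMPLE_JMX)
print(f"avg RPC latency: {latency:.1f} ms (timeout {HEALTH_MONITOR_TIMEOUT_MS} ms)")
```

Note that these averages are in milliseconds, so the headroom against the 45 000 ms timeout looks large on average; what matters is the tail. A single health-check RPC stuck behind a burst of balancer-driven calls on port 8020 can still exceed the timeout even when the averages look healthy, which is exactly why a separate service RPC port helps.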