Do you push NameNode JMX metrics anywhere? Please check the RPC average
processing time and the RPC average queue time. If they are higher than the
timeout, the health monitor request is waiting too long in the RPC queue to
get served.
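
As a quick check, here is a minimal sketch (not from the original thread; it
assumes the NameNode web UI is on the default Hadoop 2.x HTTP port 50070 and
the client RPC port is 8020, so adjust both for your cluster) that reads the
two averages from the NameNode's /jmx servlet:

import json
import urllib.request

# NameNode web UI address (assumption: default Hadoop 2.x HTTP port 50070).
NN_HTTP = "http://dphadoop20:50070"

# RPC metrics for the client RPC port (8020); the bean name follows the
# pattern Hadoop:service=NameNode,name=RpcActivityForPort<port>.
URL = NN_HTTP + "/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020"

with urllib.request.urlopen(URL) as resp:
    beans = json.loads(resp.read().decode())["beans"]

for bean in beans:
    # Average time (ms) a call spends waiting in the queue vs. being handled.
    print("RpcQueueTimeAvgTime     :", bean.get("RpcQueueTimeAvgTime"))
    print("RpcProcessingTimeAvgTime:", bean.get("RpcProcessingTimeAvgTime"))

If either average creeps toward the 45000 ms health-monitor timeout seen in
your log, the health check is simply stuck behind other calls in the queue.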
Enabling a dedicated service RPC port will definitely resolve this issue.
You can also enable QoS on port 8020; it will make sure heavy users do not
impact the other users. A sketch of both settings follows the link below.
More info about QoS:
https://tech.ebayinc.com/engineering/quality-of-service-in-hadoop/
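
For reference, a rough sketch of the two settings (the property names are,
as far as I know, the standard ones; the service RPC port 8021 is just an
example choice, and depending on your Hadoop version the scheduler property
may already default to DecayRpcScheduler):

<!-- hdfs-site.xml: dedicated service RPC port, so ZKFC health checks and
     DataNode traffic no longer queue behind heavy client calls on 8020.
     In an HA setup the property name is suffixed per NameNode, e.g.
     dfs.namenode.servicerpc-address.<nameservice>.<nn-id>. -->
<property>
  <name>dfs.namenode.servicerpc-address</name>
  <value>dphadoop20:8021</value>
</property>

<!-- core-site.xml: FairCallQueue (QoS) on the client RPC port 8020. -->
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
  <name>ipc.8020.scheduler.impl</name>
  <value>org.apache.hadoop.ipc.DecayRpcScheduler</value>
</property>

If I remember correctly, adding a service RPC address to an existing HA
cluster also requires resetting the ZKFC state in ZooKeeper
(hdfs zkfc -formatZK) before restarting the ZKFCs, so please double-check
the docs for your version.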

-- Hema Kumar

On Thu, Sep 19, 2019 at 12:31 PM Wenqi Ma <mawenqi...@gmail.com> wrote:

> More information:
> 1. The balancer is running. If we stop it, failover only happens about
> 2-3 times a day, but we have to run it since datanode usage is very
> uneven: 14.65% / 78.37% / 83.18% / 23.27%
> 2. JVM pause logs are infrequent, and all pauses are less than 2 seconds
>
> Wenqi Ma <mawenqi...@gmail.com> wrote on Thu, Sep 19, 2019 at 2:33 PM:
>
>> Sure, I checked that, and it is the NameNode health monitoring that is
>> timing out, like:
>>
>> 2019-09-19 09:15:03,823 INFO org.apache.hadoop.ha.ZKFailoverController:
>> Successfully transitioned NameNode at dphadoop20/192.168.1.20:8020 to
>> active state
>> 2019-09-19 10:48:55,898 WARN org.apache.hadoop.ha.HealthMonitor:
>> Transport-level exception trying to monitor health of NameNode at
>> dphadoop20/192.168.1.20:8020: java.net.SocketTimeoutException: 45000
>> millis timeout while waiting for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622
>> remote=dphadoop20/192.168.1.20:8020] Call From dphadoop20/192.168.1.20
>> to dphadoop20:8020 failed on socket timeout exception:
>> java.net.SocketTimeoutException: 45000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622
>> remote=dphadoop20/192.168.1.20:8020]; For more details see:
>> http://wiki.apache.org/hadoop/SocketTimeout
>> 2019-09-19 10:48:55,898 INFO org.apache.hadoop.ha.HealthMonitor: Entering
>> state SERVICE_NOT_RESPONDING
>>
>> Then the standby namenode will be transitioned to the active state, while
>> the original active namenode will get the following FATAL error and quit:
>>   IPC's epoch 353 is less than the last promised epoch 354
>>
>> BTW, the stopped namenode will be started up again immediately; however,
>> since the fsimage file is huge (about 26GB), it needs about 30 minutes to
>> load the fsimage and another 30 minutes to process block reports before it
>> can leave safe mode.
>>
>>
>> HK <hemakumar.sunn...@gmail.com> wrote on Thu, Sep 19, 2019 at 12:19 PM:
>>
>>> Are you checking the ZKFC process logs and jstack?
>>> At what stage is ZKFC timing out? Is the ZK session timing out, or is the
>>> NameNode health monitoring timing out?
>>>
>>>
>> --
>> Best Regards!
>> Wenqi
>>
>>
>
> --
> Best Regards!
> Wenqi
>
>
