Thanks for the helpful advice.

I got these values from the NameNode JMX:

    "RpcQueueTimeNumOps" : 33612239,
    "RpcQueueTimeAvgTime" : 7.782384127056349,
    "RpcProcessingTimeNumOps" : 33612239,
    "RpcProcessingTimeAvgTime" : 32.94238776763952,

That is very close to the timeout of 45s. I will monitor these values for a
while and then try to find a chance to enable the service RPC.
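
In case it is useful, this is roughly how I plan to watch those two
counters: poll the RpcActivityForPort8020 bean on the NameNode /jmx
endpoint and print the queue and processing averages. It is only a
sketch; the web port (50070 on Hadoop 2.x, 9870 on 3.x) and the 30s
warning threshold are placeholders for my cluster, not anything from
the docs.

    import json
    import time
    import urllib.request

    # NameNode web address is a placeholder for my cluster; the
    # NameNode HTTP UI listens on 50070 in Hadoop 2.x and 9870 in 3.x.
    JMX_URL = ("http://dphadoop20:50070/jmx"
               "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020")

    def rpc_times():
        """Return (queue_avg_ms, processing_avg_ms) from the RPC activity bean."""
        with urllib.request.urlopen(JMX_URL, timeout=10) as resp:
            bean = json.load(resp)["beans"][0]
        return bean["RpcQueueTimeAvgTime"], bean["RpcProcessingTimeAvgTime"]

    if __name__ == "__main__":
        # Sample once a minute and flag values creeping toward the
        # 45s ha.health-monitor.rpc-timeout.ms limit.
        while True:
            queue_ms, proc_ms = rpc_times()
            warn = "  <-- getting close to the 45s timeout" if queue_ms + proc_ms > 30000 else ""
            print("queue=%.2f ms  processing=%.2f ms%s" % (queue_ms, proc_ms, warn))
            time.sleep(60)

For the service RPC, my understanding is that it is configured with
dfs.namenode.servicerpc-address in hdfs-site.xml (per NameNode, e.g.
dfs.namenode.servicerpc-address.<nameservice>.<nn-id> in an HA setup),
so the ZKFC health checks and DataNode traffic stop competing with
client calls on 8020; please correct me if that is wrong.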

Thanks again.

HK <hemakumar.sunn...@gmail.com> wrote on Thu, Sep 19, 2019 at 3:28 PM:

> Do you push NameNode JMX metrics somewhere? Please check the RPC avg
> processing time and the RPC queue time avg time. If either is higher than
> the timeout, the health monitor request is waiting too long in the RPC
> queue to get served.
> Enabling the service RPC will definitely resolve this issue.
> You can also enable QoS on 8020; it makes sure heavy users do not impact
> the other users.
> More info about QoS:
> https://tech.ebayinc.com/engineering/quality-of-service-in-hadoop/
>
> -- Hema Kumar
>
> On Thu, Sep 19, 2019 at 12:31 PM Wenqi Ma <mawenqi...@gmail.com> wrote:
>
>> More information:
>> 1. The balancer is running. If we stop it, failover only happens about 2-3
>> times a day, but we have to run it since the DataNode usage looks like:
>> 14.65% / 78.37% / 83.18% / 23.27%
>> 2. JVM pauses are not logged often, and all pauses are less than 2 seconds
>>
>> Wenqi Ma <mawenqi...@gmail.com> wrote on Thu, Sep 19, 2019 at 2:33 PM:
>>
>>> Sure, I checked that, and it is the NameNode health monitoring that is
>>> timing out, like:
>>>
>>> 2019-09-19 09:15:03,823 INFO org.apache.hadoop.ha.ZKFailoverController:
>>> Successfully transitioned NameNode at dphadoop20/192.168.1.20:8020 to
>>> active state
>>> 2019-09-19 10:48:55,898 WARN org.apache.hadoop.ha.HealthMonitor:
>>> Transport-level exception trying to monitor health of NameNode at
>>> dphadoop20/192.168.1.20:8020: java.net.SocketTimeoutException: 45000
>>> millis timeout while waiting for channel to be ready for read. ch :
>>> java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622
>>> remote=dphadoop20/192.168.1.20:8020] Call From dphadoop20/192.168.1.20
>>> to dphadoop20:8020 failed on socket timeout exception:
>>> java.net.SocketTimeoutException: 45000 millis timeout while waiting for
>>> channel to be ready for read. ch :
>>> java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622
>>> remote=dphadoop20/192.168.1.20:8020]; For more details see:
>>> http://wiki.apache.org/hadoop/SocketTimeout
>>> 2019-09-19 10:48:55,898 INFO org.apache.hadoop.ha.HealthMonitor:
>>> Entering state SERVICE_NOT_RESPONDING
>>>
>>> The standby NameNode is then transitioned to the active state, while
>>> the original active NameNode gets the following FATAL error and quits:
>>>   IPC's epoch 353 is less than the last promised epoch 354
>>>
>>> BTW, the stopped NameNode will be started up again immediately; however,
>>> since the fsimage file is huge (about 26 GB), it needs about 30 minutes to
>>> load the fsimage and another 30 minutes to process block reports before
>>> leaving safe mode.
>>>
>>>
>>> HK <hemakumar.sunn...@gmail.com> wrote on Thu, Sep 19, 2019 at 12:19 PM:
>>>
>>>> Are you checking the ZKFC process logs and jstack?
>>>> At what stage is ZKFC timing out? Is the ZK session timing out, or the
>>>> NameNode health monitoring?
>>>>
>>>>
>>> --
>>> Best Regards!
>>> Wenqi
>>>
>>>
>>
>> --
>> Best Regards!
>> Wenqi
>>
>>

-- 
Best Regards!
Wenqi
