Tsuyoshi Ozawa updated YARN-4348:
    Attachment: YARN-4348-branch-2.7.002.patch

The test failure I mentioned is caused by using zkResyncWaitTime as the timeout 
value of the sync operation - the default value of zkResyncWaitTime is smaller 
than zkSessionTimeout. We should use a timeout value that is larger than 
zkSessionTimeout, so I am just changing it to use zkSessionTimeout * 3.
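
To illustrate the timeout choice described above, here is a minimal sketch, not 
the code in the patch. It assumes the sync completion is signalled through a 
latch; the class name SyncTimeoutSketch, the helpers syncTimeoutMs and 
awaitSync, and the 100 ms session timeout are all hypothetical and exist only 
for illustration. Only the multiplier of 3 comes from the comment above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names exist in ZKRMStateStore.
final class SyncTimeoutSketch {

    // Derive the sync wait from the session timeout (zkSessionTimeout * 3),
    // so the wait is always longer than the session timeout itself.
    static long syncTimeoutMs(long zkSessionTimeoutMs) {
        return zkSessionTimeoutMs * 3;
    }

    // Block until the sync callback counts the latch down or the timeout
    // elapses. Returns true only if the callback arrived in time.
    static boolean awaitSync(CountDownLatch syncDone, long timeoutMs) {
        try {
            return syncDone.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the sync as not completed.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        long zkSessionTimeoutMs = 100L; // illustrative value, not a Hadoop default
        CountDownLatch syncDone = new CountDownLatch(1);
        // Stand-in for the asynchronous ZooKeeper sync callback.
        new Thread(syncDone::countDown).start();
        boolean ok = awaitSync(syncDone, syncTimeoutMs(zkSessionTimeoutMs));
        System.out.println("sync completed in time: " + ok);
    }
}
```

Deriving the wait from zkSessionTimeout means it can never be shorter than the 
session timeout, which is the condition that made the test fail.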

In addition to this, we should handle a failure of the sync operation at 
startup time to prevent the RM from continuing to run in an illegal state, 
i.e. with an inconsistent view of ZK.
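
The startup handling described above can be sketched as a fail-fast check. This 
is an assumption about the general shape, not the code in the patch: the class 
StartupSyncCheck and the method failFastIfSyncFailed are hypothetical, and the 
boolean stands in for whatever result the real sync call reports.

```java
// Hypothetical sketch: these names do not exist in ZKRMStateStore.
final class StartupSyncCheck {

    // Abort startup when the initial sync did not complete, instead of
    // letting the caller continue with a possibly stale view of ZK.
    static void failFastIfSyncFailed(boolean syncSucceeded) {
        if (!syncSucceeded) {
            throw new IllegalStateException(
                "ZK sync did not complete at startup; refusing to continue");
        }
    }

    public static void main(String[] args) {
        failFastIfSyncFailed(true); // sync succeeded: startup continues
        try {
            failFastIfSyncFailed(false); // sync failed: startup is aborted
        } catch (IllegalStateException e) {
            System.out.println("startup aborted: " + e.getMessage());
        }
    }
}
```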

Attaching a patch to fix the test failure and the error handling at startup 
time (startInternal). [~jianhe], could you take a look?

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of 
> zkSessionTimeout
> ----------------------------------------------------------------------------------------
>                 Key: YARN-4348
>                 URL: https://issues.apache.org/jira/browse/YARN-4348
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.2, 2.6.2
>            Reporter: Tsuyoshi Ozawa
>            Assignee: Tsuyoshi Ozawa
>         Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, 
> YARN-4348.001.patch, log.txt
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore 
> can cause the following situation:
> 1. syncInternal times out, 
> 2. but the sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.

This message was sent by Atlassian JIRA
