[
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725080#comment-16725080
]
Hadoop QA commented on YARN-9151:
---------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m
0s{color} | {color:green} The patch appears to include 1 new or modified test
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m
20s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}
15m 38s{color} | {color:green} branch has no errors when building and testing
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m
7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m
3s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m
38s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}
1m 27s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch
generated 9 new + 48 unchanged - 0 fixed = 57 total (was 48) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m
0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}
13m 17s{color} | {color:green} patch has no errors when building and testing
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m
1s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 33s{color}
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 25m 36s{color}
| {color:red} hadoop-yarn-client in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m
43s{color} | {color:green} The patch does not generate ASF License warnings.
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}199m 21s{color} |
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests |
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption
|
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9151 |
| JIRA Patch URL |
https://issues.apache.org/jira/secure/attachment/12952343/YARN-9151.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall
mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 3cc045645602 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / cf57113 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| checkstyle |
https://builds.apache.org/job/PreCommit-YARN-Build/22921/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn.txt
|
| unit |
https://builds.apache.org/job/PreCommit-YARN-Build/22921/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
|
| unit |
https://builds.apache.org/job/PreCommit-YARN-Build/22921/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client.txt
|
| Test Results |
https://builds.apache.org/job/PreCommit-YARN-Build/22921/testReport/ |
| Max. process+thread count | 929 (vs. ulimit of 10000) |
| modules | C:
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client U:
hadoop-yarn-project/hadoop-yarn |
| Console output |
https://builds.apache.org/job/PreCommit-YARN-Build/22921/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
> Standby RM hangs (not retry or crash) forever due to forever lost from leader
> election
> --------------------------------------------------------------------------------------
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.9.2
> Reporter: Yuqi Wang
> Assignee: Yuqi Wang
> Priority: Major
> Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
> The standby RM hangs forever (it neither retries nor crashes) once it has
> dropped out of leader election.
>
> {color:#205081}*Issue Repro Steps:*{color}
> # Start multiple RMs in HA mode
> # Change all hostnames in the zk connect string to different values in DNS.
> (In reality, this happens when we replace old/bad zk machines with new/good
> zk machines, so their DNS hostnames change.)
>
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in the attachment, yarn_rm.zip (the RM is BN4SCH101222318).
> To make it clear, the whole sequence of events is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
> Start to becomeActive
> Start RMActiveServices
> Start CommonNodeLabelsManager failed due to zk connect
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
> Fail to becomeActive
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (Here the
> exception is swallowed and only an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Already in standby state
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (Here the
> exception is swallowed and only an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Found RMActiveServices's StandByTransitionRunnable object has already run
> previously, so immediately return
>
> {noformat}
> The standby RM failed to rejoin the election, and it never retries or
> crashes afterwards. *No further zk-related logs appear and the standby RM
> hangs forever, even if the zk connect string hostnames are changed back to
> the original ones in DNS.*
> This is a bug in the RM, because the *RM should always try to join the
> election* (giving up on joining should happen only when the RM decides to
> crash). Otherwise, an RM that is outside the election can never become
> active again and do real work.
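
The last two entries of the log story above suggest a run-once guard: the first standby transition rejoins the election, and every later one returns immediately. A minimal sketch of that behavior follows, assuming a flag-based guard; the class name `RunOnceStandbyTransition`, the field `hasAlreadyRun`, and the rejoin counter are hypothetical and only illustrate the pattern, they are not taken from the RM source or the patch.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a standby-transition runnable that is guarded
// so that its body executes at most once per object.
final class RunOnceStandbyTransition implements Runnable {
  // Assumed guard flag: set on the first run and never reset.
  private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
  // Counts how many times the election was rejoined (illustration only).
  private int rejoinAttempts = 0;

  @Override
  public void run() {
    // The second and later calls return here, so no rejoin is attempted.
    if (hasAlreadyRun.getAndSet(true)) {
      return;
    }
    rejoinElection();
  }

  // Placeholder for "ReJoin Election"; in the story this step fails with
  // UnknownHostException and only raises another fatal event.
  private void rejoinElection() {
    rejoinAttempts++;
  }

  int rejoinAttempts() {
    return rejoinAttempts;
  }
}
```

If the same object handles every later EMBEDDED_ELECTOR_FAILED event, as the log story indicates, only one rejoin is ever attempted, which matches the hang described above.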
>
> {color:#205081}*Caused By:*{color}
> This was introduced by YARN-3742.
> That JIRA intended that, when a STATE_STORE_OP_FAILED RMFatalEvent
> happens, the RM should transition to standby instead of crashing.
> *However, in fact, it makes ALL kinds of RMFatalEvent ONLY transition
> to standby, instead of crashing.* (In contrast, before that change, the RM
> crashed on all of them instead of transitioning to standby.)
> So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the
> RM is left in standby without working, for example stuck in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent
> handler to transition to standby as the default reaction, *with shutdown as a
> special case for certain types of failures.*
> {quote}
> But the patch as implemented was *too optimistic.*
>
> {color:#205081}*The Patch's Solution:*{color}
> So, to be *conservative*, we had better *only transition to standby for the
> failures in the {color:#14892c}whitelist{color}:*
> public enum RMFatalEventType {
> {color:#14892c}// Source <- Store{color}
> {color:#14892c}STATE_STORE_FENCED,{color}
> {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
> EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
> {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
> CRITICAL_THREAD_CRASH
> }
> All others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and
> failure types added in the future (until we have triaged them into the
> whitelist), should crash the RM, because we *cannot ensure* that they will
> *never* leave the RM unable to work in standby state, and the
> *conservative* way is to crash the RM.
> Besides, after a crash, the RM's external watchdog service can detect it and
> try to repair the RM machine, send alerts, etc.
> After restarting, the RM can also reload the latest zk connect string config
> with the latest hostnames.
> For more details, please check the patch.
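
The whitelist rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in YARN-9151.001.patch: the enum `FatalEventType` mirrors the values quoted in the description, while `Reaction`, `FatalEventPolicy`, and `reactionFor` are hypothetical names introduced only for this example.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical mirror of the enum values quoted in the description.
enum FatalEventType {
  STATE_STORE_FENCED,
  STATE_STORE_OP_FAILED,
  EMBEDDED_ELECTOR_FAILED,
  TRANSITION_TO_ACTIVE_FAILED,
  CRITICAL_THREAD_CRASH
}

// Hypothetical reaction type: what the RM does on a fatal event.
enum Reaction { TRANSITION_TO_STANDBY, CRASH }

final class FatalEventPolicy {
  // Only the types highlighted as whitelisted in the description.
  private static final Set<FatalEventType> STANDBY_WHITELIST = EnumSet.of(
      FatalEventType.STATE_STORE_FENCED,
      FatalEventType.STATE_STORE_OP_FAILED,
      FatalEventType.TRANSITION_TO_ACTIVE_FAILED);

  private FatalEventPolicy() {}

  // Anything outside the whitelist, including types added later and not
  // yet triaged, falls through to CRASH.
  static Reaction reactionFor(FatalEventType type) {
    return STANDBY_WHITELIST.contains(type)
        ? Reaction.TRANSITION_TO_STANDBY
        : Reaction.CRASH;
  }
}
```

Keeping CRASH as the fall-through means a newly added failure type crashes the RM until someone explicitly adds it to the whitelist, which is the conservative default the description argues for.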