[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2020-02-26 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045229#comment-17045229
 ] 

Hadoop QA commented on YARN-9198:
-

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 28s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 21m 0s | trunk passed |
| +1 | compile | 0m 42s | trunk passed |
| +1 | checkstyle | 0m 35s | trunk passed |
| +1 | mvnsite | 0m 45s | trunk passed |
| +1 | shadedclient | 15m 27s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 36s | trunk passed |
| +1 | javadoc | 0m 28s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 42s | the patch passed |
| +1 | compile | 0m 36s | the patch passed |
| +1 | javac | 0m 36s | the patch passed |
| +1 | checkstyle | 0m 28s | the patch passed |
| +1 | mvnsite | 0m 41s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 4s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 36s | the patch passed |
| +1 | javadoc | 0m 26s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 55m 48s | hadoop-yarn-server-resourcemanager in the patch passed. |
| -1 | asflicense | 1m 13s | The patch generated 1 ASF License warnings. |
| | | 116m 27s | |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | YARN-9198 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12954806/YARN-9198.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 1cf9f4dcdb70 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 900430b |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25576/testReport/ |
| asflicense | https://builds.apache.org/job/PreCommit-YARN-Build/25576/artifact/out/patch-asflicense-problems.txt |
| Max. process+thread count | 819 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Con

[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2019-01-16 Thread Kai Zheng (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744766#comment-16744766
 ] 

Kai Zheng commented on YARN-9198:
-

While it is certainly desirable to find and fix the root cause (which is not 
always possible), it is also worthwhile to have the check so that the 
scheduler and RM can recover sooner instead of being blocked by it.

+1 for the patch. Would anybody take an additional look? Thanks.

> Corrupted state from a previous version can still cause RM to fail with NPE 
> on FairScheduler
> 
>
> Key: YARN-9198
> URL: https://issues.apache.org/jira/browse/YARN-9198
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 3.1.0, 2.8.5
>Reporter: Dapeng Sun
>Assignee: Dapeng Sun
>Priority: Major
> Attachments: YARN-9198.001.patch
>
>
> Previously, RM may fail with NPE due to YARN-4347,YARN-4000. After these 
> fixes, FairScheduler still has the same potential issue.
>  
> 201x-xx-xx xx:xx:xx,xxx ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:serviceStart) - Failed to load/recover state
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2019-01-15 Thread Dapeng Sun (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743618#comment-16743618
 ] 

Dapeng Sun commented on YARN-9198:
--

{quote}
Not restoring an application is irreversible. There is no way to get that 
application back. If that would be an application that had been running for 
some time (like days) processing petabytes of data not restoring the 
application could be far more costly than some extra down time.
{quote}

Yes, in this scenario we should not skip the failing application. 

How about adding a config option, with a key like 
"xxx.resourcemanager.fair-scheduler.skip-error-apps", so that users can 
choose between two behaviors: "Stop the RM and recover the failing app" or 
"Skip the error and continue starting the RM"? The option could be false by 
default. When the exception occurs, the log would show the id(s) of the 
failing applications, and the user could decide to "fix" or "skip" based on 
the logs.
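
To make the proposal concrete, here is a minimal, self-contained sketch of how such a switch might gate recovery. It is not taken from the patch or from the Hadoop code base: the class, the `recover` method, the `StoredApp` and `RecoveryResult` types, and the use of a missing queue as the failure condition are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: class, method, and field names are illustrative, not Hadoop APIs. */
public class SkipErrorAppsSketch {

    /** A stored application: its id and the queue it was running in. */
    static final class StoredApp {
        final String id;
        final String queue;

        StoredApp(String id, String queue) {
            this.id = id;
            this.queue = queue;
        }
    }

    /** Ids of the applications that were restored and of those that were skipped. */
    static final class RecoveryResult {
        final List<String> recovered = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    /**
     * Restores every application whose queue is still known.
     * With skipErrorApps=false (the proposed default) the first failure aborts
     * startup; with skipErrorApps=true the failing id is logged and skipped.
     */
    static RecoveryResult recover(List<StoredApp> apps, Set<String> knownQueues,
                                  boolean skipErrorApps) {
        RecoveryResult result = new RecoveryResult();
        for (StoredApp app : apps) {
            if (knownQueues.contains(app.queue)) {
                result.recovered.add(app.id);
            } else if (skipErrorApps) {
                System.err.println("Skipping application " + app.id
                        + ": queue " + app.queue + " not found");
                result.skipped.add(app.id);
            } else {
                throw new IllegalStateException("Cannot recover application "
                        + app.id + ": queue " + app.queue + " not found");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<StoredApp> apps = Arrays.asList(
                new StoredApp("app_1", "root.a"),
                new StoredApp("app_2", "root.removed"));
        Set<String> queues = new HashSet<>(Arrays.asList("root.a"));

        RecoveryResult result = recover(apps, queues, true);
        System.out.println("recovered=" + result.recovered
                + " skipped=" + result.skipped);
    }
}
```

Keeping the default at false preserves the "never silently drop an application" behavior; an operator would have to opt in to skipping after reading the application id in the error message.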




[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2019-01-15 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743604#comment-16743604
 ] 

Wilfred Spiegelenburg commented on YARN-9198:
-

Not restoring an application is irreversible. There is no way to get that 
application back. If that would be an application that had been running for 
some time (like days) processing petabytes of data not restoring the 
application could be far more costly than some extra down time.

Until we have a way to handle changes correctly we should not start up. 
YARN-4000 makes a lot of other changes to handle a failed restore of the 
application. We need something like that on the FS side, which really is 
YARN-7913. Just changing this one thing is not the correct thing to do.




[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2019-01-14 Thread Dapeng Sun (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742132#comment-16742132
 ] 

Dapeng Sun commented on YARN-9198:
--

Most of your points sound reasonable to me. It is better to find and fix the 
underlying issue around FairScheduler, such as the queue issue, the config, or 
other causes that break restoring the app state.

But in production, recovering the RM is more important most of the time. If 
the RM cannot work properly, all work is blocked, which is much worse than 
one application that cannot be restored. Users who care about why the 
application was not restored can check the reason in the log and dig into it. 
Do you have any ideas?




[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2019-01-14 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742077#comment-16742077
 ] 

Wilfred Spiegelenburg commented on YARN-9198:
-

As I [commented in the previous 
jira|https://issues.apache.org/jira/browse/YARN-7913?focusedCommentId=16483490&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16483490]:
 the CS and FS work differently and this can happen for a number of reasons. 
ACL changes or a change in queue configuration are among those. Just removing a 
running application on restore is not correct. It really breaks the restore, as 
you can no longer rely on it to pull back all running applications on a 
failover. We need to go back and fix the underlying issue around the queues 
and config.

BTW: The CS forces you to roll back the configuration change and make sure that 
it always works. That might be a solution but with the FS doing queue 
management in a more dynamic way that might not work.




[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2019-01-14 Thread Dapeng Sun (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742049#comment-16742049
 ] 

Dapeng Sun commented on YARN-9198:
--

Hi [~wilfreds], thank you for your comments :)
The exception reported here is also thrown by 
[FairScheduler.java#L494|https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494]
  as mentioned in YARN-7913. It happened on RM failover: the RM could not 
change its state from standby to active because of the NPE until I reformatted 
the state. I just followed how the Capacity Scheduler 
([CapacityScheduler.java#L875|https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L875])
 handles this kind of exception, as a quick fix.
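
The kind of check being discussed can be sketched as follows. This is a toy model, not the contents of YARN-9198.001.patch or of either scheduler: the class, its maps, and the log message are hypothetical, and it assumes the NPE comes from dereferencing the result of an application lookup that returns null during recovery.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a toy scheduler, not the FairScheduler or CapacityScheduler code. */
public class AttemptRecoveryGuardSketch {

    // Application id -> submitting user, standing in for the scheduler's application map.
    private final Map<String, String> applications = new HashMap<>();
    // Attempt id -> user, standing in for the registered attempts.
    private final Map<String, String> attempts = new HashMap<>();

    void addApplication(String appId, String user) {
        applications.put(appId, user);
    }

    /**
     * Registers an attempt for a known application.
     * Returns false, after logging, when the application is unknown,
     * instead of dereferencing a null lookup result.
     */
    boolean addApplicationAttempt(String appId, int attemptNumber) {
        String user = applications.get(appId);
        if (user == null) {
            System.err.println("Application " + appId
                    + " cannot be found in scheduler");
            return false;
        }
        attempts.put(appId + "_attempt_" + attemptNumber, user);
        return true;
    }

    int attemptCount() {
        return attempts.size();
    }

    public static void main(String[] args) {
        AttemptRecoveryGuardSketch scheduler = new AttemptRecoveryGuardSketch();
        scheduler.addApplication("app_1", "alice");

        // app_2 was never restored, so its attempt is skipped rather than causing an NPE.
        System.out.println(scheduler.addApplicationAttempt("app_1", 1));
        System.out.println(scheduler.addApplicationAttempt("app_2", 1));
    }
}
```

The guard trades a startup failure for a skipped attempt, which is exactly the trade-off debated in this thread.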




[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2019-01-14 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742031#comment-16742031
 ] 

Wilfred Spiegelenburg commented on YARN-9198:
-

[~dapengsun] can you explain a bit more about how you got to the state? The 
mentioned jiras are for the CapacityScheduler and they should not have affected 
the FairScheduler. I don't think we have ever tested restoring a CS generated 
state store in the FS.

This also looks far more like YARN-7998 / YARN-7913. Neither of them has 
anything to do with a corrupt state store. Can you provide a little more 
background (logs)?

I have YARN-7913 on my todo list...




[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2019-01-14 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741993#comment-16741993
 ] 

Hadoop QA commented on YARN-9198:
-

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 21s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 22m 31s | trunk passed |
| +1 | compile | 0m 46s | trunk passed |
| +1 | checkstyle | 0m 38s | trunk passed |
| +1 | mvnsite | 0m 51s | trunk passed |
| +1 | shadedclient | 13m 37s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 22s | trunk passed |
| +1 | javadoc | 0m 29s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 53s | the patch passed |
| +1 | compile | 0m 48s | the patch passed |
| +1 | javac | 0m 48s | the patch passed |
| +1 | checkstyle | 0m 33s | the patch passed |
| +1 | mvnsite | 0m 45s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 13m 18s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 25s | the patch passed |
| +1 | javadoc | 0m 27s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 92m 45s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 28s | The patch does not generate ASF License warnings. |
| | | 151m 38s | |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9198 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12954806/YARN-9198.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux a1db98f070c8 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 3bb745d |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/23071/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23071/testReport/ |
| Max. process+thread count | 901 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-

[jira] [Commented] (YARN-9198) Corrupted state from a previous version can still cause RM to fail with NPE on FairScheduler

2019-01-14 Thread Dapeng Sun (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741887#comment-16741887
 ] 

Dapeng Sun commented on YARN-9198:
--

Attached a simple patch to fix it.
