[jira] [Commented] (YARN-6827) [ATS1/1.5] NPE exception while publishing recovering applications into ATS during RM restart.

2018-04-19 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445384#comment-16445384
 ] 

Rohith Sharma K S commented on YARN-6827:
-

Cherry-picked to branch-2 as well. Thanks to [~sunilg] for reviewing and 
committing the patch. 

> [ATS1/1.5] NPE exception while publishing recovering applications into ATS 
> during RM restart.
> -
>
> Key: YARN-6827
> URL: https://issues.apache.org/jira/browse/YARN-6827
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 3.0.3
>
> Attachments: YARN-6827.01.patch
>
>
> While recovering applications, an NPE is thrown, as shown below.
> {noformat}
> 2017-07-13 14:08:12,476 ERROR org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher: Error when publishing entity [YARN_APPLICATION,application_1499929227397_0001]
> java.lang.NullPointerException
>   at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:178)
>   at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.putEntity(TimelineServiceV1Publisher.java:368)
> {noformat}
> This happens because, in the non-HA case, RM service start brings up the 
> active services first and the ATSv1 services later. In the HA case, the 
> transitionToActive event arrives before the ATS services are started.
> Either way, the active services have enough time to recover applications, 
> and recovery tries to publish them to ATSv1. Because the ATS services are 
> not started yet, the publish throws an NPE.
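
The startup-ordering problem described in the issue can be reproduced in miniature. The following is a minimal sketch, not the Hadoop code: `SketchTimelineClient` and `StartupOrderSketch` are hypothetical names, and the `writer` field merely stands in for whatever state the real timeline client initializes when its service starts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical driver: "recovers" an application by publishing it to the client.
class StartupOrderSketch {
    // Returns true if the publish succeeded, false if it hit a NullPointerException.
    static boolean recoverAndPublish(SketchTimelineClient client, String appId) {
        try {
            client.putEntity("YARN_APPLICATION," + appId);
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        SketchTimelineClient client = new SketchTimelineClient();
        // Recovery runs before the client is started: the publish fails.
        boolean early = recoverAndPublish(client, "application_1499929227397_0001");
        client.start();
        // The same publish after start() succeeds.
        boolean late = recoverAndPublish(client, "application_1499929227397_0001");
        System.out.println("published before start: " + early);
        System.out.println("published after start: " + late);
    }
}

// Hypothetical stand-in for a timeline client; NOT the real TimelineClientImpl.
class SketchTimelineClient {
    private List<String> writer; // assumption: internal state stays null until start()

    void start() {
        writer = new ArrayList<>();
    }

    void putEntity(String entity) {
        writer.add(entity); // dereferences null if start() has not run yet
    }

    int published() {
        return writer == null ? 0 : writer.size();
    }
}
```

The sketch only shows why publishing during recovery is unsafe when the publishing side is started later; it says nothing about how the real client is implemented internally.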



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6827) [ATS1/1.5] NPE exception while publishing recovering applications into ATS during RM restart.

2018-04-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444580#comment-16444580
 ] 

Hudson commented on YARN-6827:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #14029 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14029/])
YARN-6827. [ATS1/1.5] NPE exception while publishing recovering (sunilg: rev 
7d06806dfdeb3252ac0defe23e8c468eabfa8b5e)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java





[jira] [Commented] (YARN-6827) [ATS1/1.5] NPE exception while publishing recovering applications into ATS during RM restart.

2018-04-18 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442567#comment-16442567
 ] 

Sunil G commented on YARN-6827:
---

Patch looks fine.

If there are no objections, I will commit the patch tomorrow. Thank you.




[jira] [Commented] (YARN-6827) [ATS1/1.5] NPE exception while publishing recovering applications into ATS during RM restart.

2018-04-18 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442263#comment-16442263
 ] 

genericqa commented on YARN-6827:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 27s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 30s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 34s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 66m 27s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}118m 5s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8620d2b |
| JIRA Issue | YARN-6827 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12919580/YARN-6827.01.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux c83b0b21f9d2 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 034da8f |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_162 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/20389/testReport/ |
| Max. process+thread count | 815 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/20389/console |
| Powered by | Apache Yetus 0.8

[jira] [Commented] (YARN-6827) [ATS1/1.5] NPE exception while publishing recovering applications into ATS during RM restart.

2018-04-18 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442137#comment-16442137
 ] 

Rohith Sharma K S commented on YARN-6827:
-

Updated the patch so that, in non-HA deployments only, the transition to 
active happens after RM service start. I tested the patch on a real cluster. 
[~sunilg], could you review the patch? 
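
The ordering described in this comment can be illustrated with a small sketch. This is not the actual patch to `ResourceManager.java`: `TransitionOrderSketch`, its methods, and the event names are all hypothetical, and the sketch assumes only what the comment states, namely that in a non-HA deployment the transition to active runs after the other RM services have started.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of RM startup ordering; names are illustrative only.
class TransitionOrderSketch {
    // Records the order in which startup steps run.
    final List<String> events = new ArrayList<>();
    private final boolean haEnabled;
    private boolean timelineStarted = false;

    TransitionOrderSketch(boolean haEnabled) {
        this.haEnabled = haEnabled;
    }

    // Start the always-on services first; only then, in non-HA mode, go active.
    void serviceStart() {
        startAlwaysOnServices();
        if (!haEnabled) {
            transitionToActive();
        }
    }

    private void startAlwaysOnServices() {
        timelineStarted = true;
        events.add("timeline-publisher-started");
    }

    // Going active starts the active services, which recover applications
    // and publish them; this requires the publisher to be running already.
    void transitionToActive() {
        if (!timelineStarted) {
            throw new IllegalStateException("publisher not started");
        }
        events.add("active-services-started");
        events.add("applications-recovered");
    }

    public static void main(String[] args) {
        TransitionOrderSketch rm = new TransitionOrderSketch(false);
        rm.serviceStart();
        System.out.println(rm.events);
    }
}
```

In this model an HA-enabled instance does not go active inside `serviceStart()`; something external (in a real cluster, the elector) would call `transitionToActive()` later. How the real code sequences that call is outside the scope of the sketch.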




[jira] [Commented] (YARN-6827) [ATS1/1.5] NPE exception while publishing recovering applications into ATS during RM restart.

2017-07-14 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088468#comment-16088468
 ] 

Rohith Sharma K S commented on YARN-6827:
-

Attaching the failure trace below. It shows that applications are recovered 
before the ATS services are started. 

{noformat}
2017-07-15 10:19:35,200 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state
2017-07-15 10:19:35,245 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery started
2017-07-15 10:19:35,253 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Loaded RM state version info 1.4
2017-07-15 10:19:35,431 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Unknown child node with name: HIERARCHIES
2017-07-15 10:19:35,452 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: recovering RMDelegationTokenSecretManager.
2017-07-15 10:19:35,455 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 16 applications
2017-07-15 10:19:35,518 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Priority '0' is acceptable in queue : default for application: application_1499929227397_0001
2017-07-15 10:19:35,578 ERROR org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher: Error when publishing entity [YARN_APPLICATION,application_1499929227397_0001]
java.lang.NullPointerException
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:178)
    at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.putEntity(TimelineServiceV1Publisher.java:368)
    at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.appFinished(TimelineServiceV1Publisher.java:156)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalTransition.transition(RMAppImpl.java:1472)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1073)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1062)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:887)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:383)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:590)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1372)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:749)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1131)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1171)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1167)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:317)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:143)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:893)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:472)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:607)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)

{noformat}
