[jira] [Updated] (YARN-9855) Fix ApplicationReportProto submitTime id in branch-2.8/branch-2.7

2019-09-25 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9855:
---
Attachment: YARN-9855-branch-2.7.001.patch

> Fix ApplicationReportProto submitTime id in branch-2.8/branch-2.7
> -
>
> Key: YARN-9855
> URL: https://issues.apache.org/jira/browse/YARN-9855
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9855-branch-2.7.001.patch, 
> YARN-9855-branch-2.7.001.patch, YARN-9855-branch-2.8.001.patch
>
>
> As per 
> [http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-dev/201909.mbox/%3cCAAaVJWUKTBXEYV_-yWs2PT8aqhjQXq=garav+yzjxq0nx36...@mail.gmail.com%3e].
>  Update this field to use the same id as in branch-2.9 and above.
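For context, a minimal sketch of the wire-compatibility concern (the field id
below is hypothetical, not the actual number used in the branches): protobuf
encodes fields by id, not by name, so the id is the wire contract.

{code}
message ApplicationReportProto {
  // ... existing fields elided ...
  // If branch-2.7 assigned submit_time a different id than branch-2.9+,
  // a 2.7 client decoding a 2.9+ response would read another field's bytes
  // (or nothing) as submit_time. The id, not the name, must match.
  optional int64 submit_time = 22;
}
{code}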



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9855) Fix ApplicationReportProto submitTime id in branch-2.8/branch-2.7

2019-09-25 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937482#comment-16937482
 ] 

Bibin A Chundatt commented on YARN-9855:


Re-uploaded the 2.7 patch to trigger Jenkins.

> Fix ApplicationReportProto submitTime id in branch-2.8/branch-2.7
> -
>
> Key: YARN-9855
> URL: https://issues.apache.org/jira/browse/YARN-9855
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9855-branch-2.7.001.patch, 
> YARN-9855-branch-2.7.001.patch, YARN-9855-branch-2.8.001.patch
>
>
> As per 
> [http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-dev/201909.mbox/%3cCAAaVJWUKTBXEYV_-yWs2PT8aqhjQXq=garav+yzjxq0nx36...@mail.gmail.com%3e].
>  Update this field to use the same id as in branch-2.9 and above.






[jira] [Commented] (YARN-9855) Fix ApplicationReportProto submitTime id in branch-2.8/branch-2.7

2019-09-25 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937474#comment-16937474
 ] 

Bibin A Chundatt commented on YARN-9855:


Thank you [~ebadger] for finding the issue and [~jhung] for handling it.

+1 LGTM.



> Fix ApplicationReportProto submitTime id in branch-2.8/branch-2.7
> -
>
> Key: YARN-9855
> URL: https://issues.apache.org/jira/browse/YARN-9855
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9855-branch-2.7.001.patch, 
> YARN-9855-branch-2.8.001.patch
>
>
> As per 
> [http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-dev/201909.mbox/%3cCAAaVJWUKTBXEYV_-yWs2PT8aqhjQXq=garav+yzjxq0nx36...@mail.gmail.com%3e].
>  Update this field to use the same id as in branch-2.9 and above.






[jira] [Comment Edited] (YARN-9851) Make execution type check compatiable

2019-09-25 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937117#comment-16937117
 ] 

Bibin A Chundatt edited comment on YARN-9851 at 9/25/19 7:10 AM:
-

[~cane]

Didn't YARN-9547 fix this issue? Could you check with that patch applied?


was (Author: bibinchundatt):
Didn't YARN-9547 fix this issue?

> Make execution type check compatiable
> -
>
> Key: YARN-9851
> URL: https://issues.apache.org/jira/browse/YARN-9851
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.1.2
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: YARN-9851-001.patch
>
>
> During upgrade from 2.6 to 3.1, we encountered a problem:
> {code:java}
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568719110875_6460_08_01, status: RUNNING, 
> execution type: null
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568886618758_11172_01_62, status: RUNNING, 
> execution type: null
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568886618758_11172_01_63, status: RUNNING, 
> execution type: null
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568886618758_11172_01_64, status: RUNNING, 
> execution type: null
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568886618758_30617_01_06, status: RUNNING, 
> execution type: null
> for (ContainerStatus remoteContainer : containerStatuses) {
>   if (remoteContainer.getState() == ContainerState.RUNNING
>   && remoteContainer.getExecutionType() == ExecutionType.GUARANTEED) {
> nodeContainers.add(remoteContainer.getContainerId());
>   } else {
> LOG.warn("Lost container " + remoteContainer.getContainerId()
> + ", status: " + remoteContainer.getState()
> + ", execution type: " + remoteContainer.getExecutionType());
>   }
> }
> {code}
> The cause is that we have NMs running version 2.6, which do not report an 
> executionType in the container status.
> We should add a check here to make the upgrade process more transparent 
> (see the sketch below).
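A minimal sketch of a backward-compatible check, mirroring the snippet above
(assuming the intended semantics: a null executionType from an old NM is
treated as GUARANTEED rather than as a lost container):

{code:java}
for (ContainerStatus remoteContainer : containerStatuses) {
  ExecutionType execType = remoteContainer.getExecutionType();
  // A 2.6 NM never sets executionType, so null defaults to GUARANTEED.
  boolean guaranteed = execType == null || execType == ExecutionType.GUARANTEED;
  if (remoteContainer.getState() == ContainerState.RUNNING && guaranteed) {
    nodeContainers.add(remoteContainer.getContainerId());
  } else {
    LOG.warn("Lost container " + remoteContainer.getContainerId()
        + ", status: " + remoteContainer.getState()
        + ", execution type: " + execType);
  }
}
{code}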






[jira] [Commented] (YARN-9851) Make execution type check compatiable

2019-09-24 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937117#comment-16937117
 ] 

Bibin A Chundatt commented on YARN-9851:


Didn't YARN-9547 fix this issue?

> Make execution type check compatiable
> -
>
> Key: YARN-9851
> URL: https://issues.apache.org/jira/browse/YARN-9851
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.1.2
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: YARN-9851-001.patch
>
>
> During upgrade from 2.6 to 3.1, we encountered a problem:
> {code:java}
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568719110875_6460_08_01, status: RUNNING, 
> execution type: null
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568886618758_11172_01_62, status: RUNNING, 
> execution type: null
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568886618758_11172_01_63, status: RUNNING, 
> execution type: null
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568886618758_11172_01_64, status: RUNNING, 
> execution type: null
> 2019-09-23,19:29:05,303 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Lost 
> container container_e35_1568886618758_30617_01_06, status: RUNNING, 
> execution type: null
> for (ContainerStatus remoteContainer : containerStatuses) {
>   if (remoteContainer.getState() == ContainerState.RUNNING
>   && remoteContainer.getExecutionType() == ExecutionType.GUARANTEED) {
> nodeContainers.add(remoteContainer.getContainerId());
>   } else {
> LOG.warn("Lost container " + remoteContainer.getContainerId()
> + ", status: " + remoteContainer.getState()
> + ", execution type: " + remoteContainer.getExecutionType());
>   }
> }
> {code}
> The cause is that we have NMs running version 2.6, which do not report an 
> executionType in the container status.
> We should add a check here to make the upgrade process more transparent.






[jira] [Comment Edited] (YARN-9011) Race condition during decommissioning

2019-09-24 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936842#comment-16936842
 ] 

Bibin A Chundatt edited comment on YARN-9011 at 9/24/19 2:19 PM:
-

{quote}
But even if you have to wait, it's a very small tiny window which is probably 
just milliseconds
{quote}
That depends on the time taken to process events. In large clusters we can't 
expect that to be milliseconds.

*Alternate approach*

NodesListManager is the source of the *GRACEFUL_DECOMMISSION* event, based on 
which the state transition of RMNodeImpl to DECOMMISSIONING happens. I think, 
as per YARN-3212, that state prevents containers from being killed during the 
DECOMMISSIONING period.

* We could maintain in NodesListManager the list of nodes to be decommissioned 
for which *GRACEFUL_DECOMMISSION* was fired.
* HostsFileReader sets the refreshed *HostDetails* only after the event is fired.

This way the HostsFileReader and node state could be kept in sync (see the 
sketch below). Thoughts?
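A minimal sketch of the proposed ordering (simplified, hypothetical types; the
real flow would go through the RMNode dispatcher):

{code:java}
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Fire GRACEFUL_DECOMMISSION for newly excluded hosts first, remember them,
// and only then publish the refreshed exclude list. A reader such as
// ResourceTrackerService then never observes a host that is excluded but
// whose RMNode has not yet started leaving RUNNING.
public class NodesListManagerSketch {
  private final Set<String> pendingGracefulDecommission =
      ConcurrentHashMap.newKeySet();
  private volatile Set<String> excludedHosts = Collections.emptySet();

  public void refreshNodes(Set<String> refreshedExcludes) {
    for (String host : refreshedExcludes) {
      if (!excludedHosts.contains(host)
          && pendingGracefulDecommission.add(host)) {
        fireGracefulDecommission(host); // async RMNodeImpl state transition
      }
    }
    excludedHosts = refreshedExcludes; // published only after events are fired
  }

  // Callback once the RMNode reaches DECOMMISSIONING.
  public void onDecommissioningStarted(String host) {
    pendingGracefulDecommission.remove(host);
  }

  private void fireGracefulDecommission(String host) {
    // stand-in for dispatcher.handle(new RMNodeEvent(nodeId,
    // RMNodeEventType.GRACEFUL_DECOMMISSION))
  }
}
{code}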



was (Author: bibinchundatt):
{quote}
But even if you have to wait, it's a very small tiny window which is probably 
just milliseconds
{quote}
That depends on the time taken to process events. In large clusters we can't 
expect that to be milliseconds.

*Alternate approach*

NodesListManager is the source of the *GRACEFUL_DECOMMISSION* event, based on 
which the state transition of RMNodeImpl to DECOMMISSIONING happens. I think, 
as per YARN-3212, that state prevents containers from being killed during the 
DECOMMISSIONING period.

* We could maintain in NodesListManager the list of nodes to be decommissioned 
for which *GRACEFUL_DECOMMISSION* was fired.
* HostsFileReader sets the refreshed *HostDetails* only after the event is fired.

This was the HostsFileReader and nodeState could be sync. Thoughts??


> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> 

[jira] [Commented] (YARN-9011) Race condition during decommissioning

2019-09-24 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936842#comment-16936842
 ] 

Bibin A Chundatt commented on YARN-9011:


{quote}
But even if you have to wait, it's a very small tiny window which is probably 
just milliseconds
{quote}
That depends on the time taken to process events. In large clusters we can't 
expect that to be milliseconds.

*Alternate approach*

NodesListManager is the source of the *GRACEFUL_DECOMMISSION* event, based on 
which the state transition of RMNodeImpl to DECOMMISSIONING happens. I think, 
as per YARN-3212, that state prevents containers from being killed during the 
DECOMMISSIONING period.

* We could maintain in NodesListManager the list of nodes to be decommissioned 
for which *GRACEFUL_DECOMMISSION* was fired.
* HostsFileReader sets the refreshed *HostDetails* only after the event is fired.

This way the HostsFileReader and node state could be kept in sync. Thoughts?


> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.






[jira] [Commented] (YARN-9011) Race condition during decommissioning

2019-09-24 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936710#comment-16936710
 ] 

Bibin A Chundatt commented on YARN-9011:


[~pbacsko]

I just glanced through the discussion points. Could you explain the race in 
detail?

Is there a major impact in terms of functionality?

{quote}
ResourceTrackerService uses NodesListManager to determine what nodes are 
enabled. But sometimes it sees an inconsistent state: NodesListManager returns 
that a certain node is in the excluded list, but it's state is not 
DECOMMISSIONING. *So we have to wait for this state change.*
{quote}

*Points to consider*

* Any wait in ResourceTrackerService is costly, since we have a limited number 
of handlers for it (consider 10k+ nodes with 100/200 handlers and each NM 
heartbeating at a 1-second interval).
* As per the current implementation, the ResourceTrackerService handler will 
wait till the state is changed, right?


> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.






[jira] [Updated] (YARN-9424) Change getDeclaredMethods to getMethods in FederationClientInterceptor#invokeConcurrent()

2019-09-24 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9424:
---
Fix Version/s: 3.3.0

> Change getDeclaredMethods to getMethods in 
> FederationClientInterceptor#invokeConcurrent()
> -
>
> Key: YARN-9424
> URL: https://issues.apache.org/jira/browse/YARN-9424
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Reporter: Shen Yinjie
>Assignee: Shen Yinjie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9124_1.patch
>
>
> In YARN-8699, FederationClientInterceptor#invokeConcurrent uses 
> getDeclaredMethods(), which cannot recognize some methods in 
> ApplicationBaseProtocol (ApplicationClientProtocol extends 
> ApplicationBaseProtocol).
> We have implemented some methods in FederationClientInterceptor, such as 
> getApplications(), getQueueUserAclsInfo(), etc. When I run "yarn application 
> -list" against the YARN Router, the Router throws an exception.
> So change getDeclaredMethods() to getMethods() (see the demo below).
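For reference, a minimal self-contained demo of the reflection difference
(hypothetical interface names):

{code:java}
import java.lang.reflect.Method;

// getDeclaredMethods() returns only methods declared directly on the type;
// getMethods() also returns public methods inherited from superinterfaces --
// the case that made the Router miss ApplicationBaseProtocol methods.
public class ReflectionDemo {
  interface Base { void inheritedOp(); }
  interface Child extends Base { void ownOp(); }

  public static void main(String[] args) {
    for (Method m : Child.class.getDeclaredMethods()) {
      System.out.println("declared: " + m.getName()); // ownOp only
    }
    for (Method m : Child.class.getMethods()) {
      System.out.println("public:   " + m.getName()); // ownOp and inheritedOp
    }
  }
}
{code}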






[jira] [Comment Edited] (YARN-9627) DelegationTokenRenewer could block transitionToStandy

2019-09-23 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935581#comment-16935581
 ] 

Bibin A Chundatt edited comment on YARN-9627 at 9/23/19 8:47 AM:
-

[~maniraj...@gmail.com] 

This issue is more about what we should do with renewal requests submitted 
when there are a large number of pending apps.



was (Author: bibinchundatt):
[~maniraj...@gmail.com] 

This issue is more about what we should do with renewals submitted if we have 
lots of pending apps.


> DelegationTokenRenewer could block transitionToStandy
> -
>
> Key: YARN-9627
> URL: https://issues.apache.org/jira/browse/YARN-9627
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9627.001.patch, YARN-9627.002.patch, 
> YARN-9627.003.patch
>
>
> Cluster size: 5K
> Running containers: 55K
> *Scenario*: Large number of pending applications (around 50K) while 
> performing RM switchover.
> Exception below:
> {noformat}
> 2019-06-13 17:39:27,594 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renew Kind: HDFS_DELEGATION_TOKEN, Service: X:1616, Ident: (token 
> for root: HDFS_DELEGATION_TOKEN owner=root/had...@hadoop.com, renewer=yarn, 
> realUser=, issueDate=1560361265181, maxDate=1560966065181, 
> sequenceNumber=104708, masterKeyId=3);exp=1560533965360; 
> apps=[application_1560346941775_20702] in 86397766 ms, appId = 
> [application_1560346941775_20702]
> 2019-06-13 17:39:27,609 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:522)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>  
> 2019-06-13 17:58:20,878 ERROR org.apache.zookeeper.ClientCnxn: Time out error 
> occurred for the packet 'clientPath:null serverPath:null finished:false 
> header:: 27,4  replyHeader:: 27,4295687588,0  request:: 
> '/rmstore1/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot/DelegationKey_49,F
>   response:: 
> #31ff8a16b74ffe129768ffdbffe949ff8dffd517ffcafffa,s{4295423577,4295423577,1560342837789,1560342837789,0,0,0,0,17,0,4295423577}
>  '.
> 2019-06-13 17:58:20,877 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: 
> X:1616, Ident: (token for root: HDFS_DELEGATION_TOKEN 
> owner=root/had...@hadoop.com, renewer=yarn, realUser=, 
> issueDate=1560366110990, maxDate=1560970910990, sequenceNumber=111891, 
> masterKeyId=3);exp=1560534896413; apps=[application_1560346941775_28115]]
> 2019-06-13 17:58:20,924 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.IllegalStateException: Timer already cancelled.
> at java.util.Timer.sched(Timer.java:397)
> at java.util.Timer.schedule(Timer.java:208)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.setTimerForTokenRenewal(DelegationTokenRenewer.java:612)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:523)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> 

[jira] [Commented] (YARN-9627) DelegationTokenRenewer could block transitionToStandy

2019-09-23 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935581#comment-16935581
 ] 

Bibin A Chundatt commented on YARN-9627:


[~maniraj...@gmail.com] 

This issue is more about what we should do with renewals submitted if we have 
lots of pending apps.


> DelegationTokenRenewer could block transitionToStandy
> -
>
> Key: YARN-9627
> URL: https://issues.apache.org/jira/browse/YARN-9627
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9627.001.patch, YARN-9627.002.patch, 
> YARN-9627.003.patch
>
>
> Cluster size: 5K
> Running containers: 55K
> *Scenario*: Large number of pending applications (around 50K) while 
> performing RM switchover.
> Exception below:
> {noformat}
> 2019-06-13 17:39:27,594 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renew Kind: HDFS_DELEGATION_TOKEN, Service: X:1616, Ident: (token 
> for root: HDFS_DELEGATION_TOKEN owner=root/had...@hadoop.com, renewer=yarn, 
> realUser=, issueDate=1560361265181, maxDate=1560966065181, 
> sequenceNumber=104708, masterKeyId=3);exp=1560533965360; 
> apps=[application_1560346941775_20702] in 86397766 ms, appId = 
> [application_1560346941775_20702]
> 2019-06-13 17:39:27,609 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:522)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>  
> 2019-06-13 17:58:20,878 ERROR org.apache.zookeeper.ClientCnxn: Time out error 
> occurred for the packet 'clientPath:null serverPath:null finished:false 
> header:: 27,4  replyHeader:: 27,4295687588,0  request:: 
> '/rmstore1/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot/DelegationKey_49,F
>   response:: 
> #31ff8a16b74ffe129768ffdbffe949ff8dffd517ffcafffa,s{4295423577,4295423577,1560342837789,1560342837789,0,0,0,0,17,0,4295423577}
>  '.
> 2019-06-13 17:58:20,877 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: 
> X:1616, Ident: (token for root: HDFS_DELEGATION_TOKEN 
> owner=root/had...@hadoop.com, renewer=yarn, realUser=, 
> issueDate=1560366110990, maxDate=1560970910990, sequenceNumber=111891, 
> masterKeyId=3);exp=1560534896413; apps=[application_1560346941775_28115]]
> 2019-06-13 17:58:20,924 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.IllegalStateException: Timer already cancelled.
> at java.util.Timer.sched(Timer.java:397)
> at java.util.Timer.schedule(Timer.java:208)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.setTimerForTokenRenewal(DelegationTokenRenewer.java:612)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:523)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748
> {noformat}
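For reference, a minimal sketch (hypothetical names, not the actual
DelegationTokenRenewer code) of guarding timer scheduling against the
cancelled-timer race seen in the trace above:

{code:java}
import java.util.Timer;
import java.util.TimerTask;

// Guard scheduling with a stopped flag so recovery events processed during
// transitionToStandby are skipped instead of hitting "Timer already cancelled".
public class RenewerTimerSketch {
  private final Object timerLock = new Object();
  private final Timer renewalTimer = new Timer(true);
  private boolean stopped = false;

  void setTimerForTokenRenewal(TimerTask task, long delayMs) {
    synchronized (timerLock) {
      if (stopped) {
        return; // the standby transition already cancelled the timer
      }
      renewalTimer.schedule(task, delayMs);
    }
  }

  void stop() {
    synchronized (timerLock) {
      stopped = true;
      renewalTimer.cancel();
    }
  }
}
{code}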





[jira] [Assigned] (YARN-8387) Support offline compilation of yarn ui2

2019-09-18 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt reassigned YARN-8387:
--

Assignee: (was: Bibin A Chundatt)

> Support offline compilation of yarn ui2
> ---
>
> Key: YARN-8387
> URL: https://issues.apache.org/jira/browse/YARN-8387
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Reporter: Bibin A Chundatt
>Priority: Major
>







[jira] [Assigned] (YARN-4029) Update LogAggregationStatus to store on finish

2019-09-18 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt reassigned YARN-4029:
--

Assignee: (was: Bibin A Chundatt)

> Update LogAggregationStatus to store on finish
> --
>
> Key: YARN-4029
> URL: https://issues.apache.org/jira/browse/YARN-4029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Reporter: Bibin A Chundatt
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-4029.patch, 0002-YARN-4029.patch, 
> 0003-YARN-4029.patch, 0004-YARN-4029.patch, Image.jpg
>
>
> Currently the log aggregation status is not getting updated in the store. 
> When the RM is restarted, it will show NOT_START.
> Steps to reproduce:
> 
> 1. Submit a MapReduce application
> 2. Wait for completion
> 3. Once the application is completed, switch the RM
> *Log Aggregation Status* changes from SUCCESS to NOT_START






[jira] [Commented] (YARN-4029) Update LogAggregationStatus to store on finish

2019-09-18 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932264#comment-16932264
 ] 

Bibin A Chundatt commented on YARN-4029:


[~adam.antal] 

Unassigned from my name. Please go ahead.

IIRC, during an offline discussion [~rohithsharma] mentioned that updating the 
status on finish will increase the ZK load.

cc:// [~sunilg]/[~rohithsharma] 




> Update LogAggregationStatus to store on finish
> --
>
> Key: YARN-4029
> URL: https://issues.apache.org/jira/browse/YARN-4029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-4029.patch, 0002-YARN-4029.patch, 
> 0003-YARN-4029.patch, 0004-YARN-4029.patch, Image.jpg
>
>
> Currently the log aggregation status is not getting updated in the store. 
> When the RM is restarted, it will show NOT_START.
> Steps to reproduce:
> 
> 1. Submit a MapReduce application
> 2. Wait for completion
> 3. Once the application is completed, switch the RM
> *Log Aggregation Status* changes from SUCCESS to NOT_START






[jira] [Created] (YARN-9831) NMTokenSecretManagerInRM#createNMToken blocks ApplicationMasterService allocate flow

2019-09-12 Thread Bibin A Chundatt (Jira)
Bibin A Chundatt created YARN-9831:
--

 Summary: NMTokenSecretManagerInRM#createNMToken blocks 
ApplicationMasterService allocate flow
 Key: YARN-9831
 URL: https://issues.apache.org/jira/browse/YARN-9831
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin A Chundatt


Currently an attempt's NMToken cannot be generated independently.

Each attempt's allocate flow blocks the others. We should improve this.






[jira] [Comment Edited] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling

2019-09-12 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928305#comment-16928305
 ] 

Bibin A Chundatt edited comment on YARN-9830 at 9/12/19 7:36 AM:
-

AbstractLivelinessMonitor methods are synchronized, which blocks concurrent 
access for different containerIds.

The PingThread actually monitors the entries in the *running* map.

Could AbstractLivelinessMonitor#running be changed to a ConcurrentHashMap so 
that the synchronization at the object level can be removed (see the sketch 
below)?

[~rohithsharma]/[~sunil.gov...@gmail.com]
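A minimal sketch of the idea (a simplified, hypothetical monitor shape, not
the actual AbstractLivelinessMonitor code):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Keep last-ping timestamps in a ConcurrentMap so register/unregister and the
// expiry scan no longer contend on a single object monitor.
public abstract class ConcurrentLivelinessMonitorSketch<O> {
  private final ConcurrentMap<O, Long> running = new ConcurrentHashMap<>();

  public void register(O ob) {
    running.put(ob, System.currentTimeMillis()); // no synchronized needed
  }

  public void unregister(O ob) {
    running.remove(ob);
  }

  public void receivedPing(O ob) {
    running.replace(ob, System.currentTimeMillis()); // only if still tracked
  }

  protected abstract void expire(O ob);

  // Invoked periodically by the ping/checker thread.
  void checkExpired(long expireIntervalMs) {
    long now = System.currentTimeMillis();
    for (Map.Entry<O, Long> e : running.entrySet()) {
      if (now - e.getValue() > expireIntervalMs
          && running.remove(e.getKey(), e.getValue())) { // atomic remove
        expire(e.getKey());
      }
    }
  }
}
{code}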



was (Author: bibinchundatt):
AbstractLivelinessMonitor methods are synchronized, which blocks concurrent 
access for different containerIds.

The PingThread actually monitors the entries in the *running* map.

Could AbstractLivelinessMonitor#running be changed to a ConcurrentHashMap so 
that the synchronization at the class level can be removed?

[~rohithsharma]/[~sunil.gov...@gmail.com]


> Improve ContainerAllocationExpirer it blocks scheduling
> ---
>
> Key: YARN-9830
> URL: https://issues.apache.org/jira/browse/YARN-9830
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Priority: Critical
>  Labels: perfomance
>
> {quote}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106)
> - waiting to lock <0x7fa348749550> (a 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x7fc8852f8200> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
> {quote}






[jira] [Commented] (YARN-9823) NodeManager cannot get right ResourceTrack address in Federation mode

2019-09-12 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928312#comment-16928312
 ] 

Bibin A Chundatt commented on YARN-9823:


[~lichaojacobs] YARN-8434  should help you.

> NodeManager cannot get right ResourceTrack address in Federation mode
> -
>
> Key: YARN-9823
> URL: https://issues.apache.org/jira/browse/YARN-9823
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation, nodemanager
>Affects Versions: 2.9.2
> Environment: h2. Hadoop:
> Hadoop 2.9.2 (some line numbers may not be right because we have merged some 
> 3.0+ patches)
> Security with Kerberos
> configure from 
> [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html]
> h2. Java:
> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
> Kerberos:
>  
>  
>Reporter: qiwei huang
>Priority: Major
>
> {{the NM will infinitely try to connect the wrong RM's resource tracker port}}
> {quote}{{INFO [main:RetryInvocationHandler@411] - java.net.ConnectException: 
> Call From standby.rm.server/10.122.138.139 to standby.rm.server:8031 failed 
> on connection exception: java.net.ConnectException: Connection refused; For 
> more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while 
> invoking ResourceTrackerPBClientImpl.registerNodeManager over dev1 after 19 
> failover attempts. Trying to failover after sleeping for 40497ms.}}
> {quote}
>  
> {{After changing *yarn.client.failover-proxy-provider* to 
> *org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider*,
>  the NodeManager cannot find the right ResourceTracker address:}}
> {quote}{{getRMHAId:233, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfKeyForRMInstance:294, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfValueForRMInstance:302, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfValueForRMInstance:314, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getSocketAddr:3341, YarnConfiguration (org.apache.hadoop.yarn.conf)}}
> {{getRMAddress:77, ServerRMProxy (org.apache.hadoop.yarn.server.api)}}
> {{run:144, FederationRMFailoverProxyProvider$1 
> (org.apache.hadoop.yarn.server.federation.failover)}}
> {{doPrivileged:-1, AccessController (java.security)}}
> {{doAs:422, Subject (javax.security.auth)}}
> {{doAs:1893, UserGroupInformation (org.apache.hadoop.security)}}
> {{getProxyInternal:141, FederationRMFailoverProxyProvider 
> (org.apache.hadoop.yarn.server.federation.failover)}}
> {{performFailover:192, FederationRMFailoverProxyProvider 
> (org.apache.hadoop.yarn.server.federation.failover)}}
> {{failover:217, RetryInvocationHandler$ProxyDescriptor 
> (org.apache.hadoop.io.retry)}}
> {{processRetryInfo:149, RetryInvocationHandler$Call 
> (org.apache.hadoop.io.retry)}}
> {{processWaitTimeAndRetryInfo:142, RetryInvocationHandler$Call 
> (org.apache.hadoop.io.retry)}}
> {{invokeOnce:107, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)}}
> {{invoke:359, RetryInvocationHandler (org.apache.hadoop.io.retry)}}
> {{registerNodeManager:-1, $Proxy85 (com.sun.proxy)}}
> {{registerWithRM:378, NodeStatusUpdaterImpl 
> (org.apache.hadoop.yarn.server.nodemanager)}}
> {{serviceStart:252, NodeStatusUpdaterImpl 
> (org.apache.hadoop.yarn.server.nodemanager)}}
> {{start:194, AbstractService (org.apache.hadoop.service)}}
> {{serviceStart:121, CompositeService (org.apache.hadoop.service)}}
> {{start:194, AbstractService (org.apache.hadoop.service)}}
> {{initAndStartNodeManager:864, NodeManager 
> (org.apache.hadoop.yarn.server.nodemanager)}}
> {{main:931, NodeManager (org.apache.hadoop.yarn.server.nodemanager)}}
> {quote}
> {{the provider tries to find the active RM address in}} *{{getRMHAId:233}}*{{, 
> but it cannot find the right address because it can only return the local 
> address:}}
> {quote}{{if (!s.isUnresolved() && NetUtils.isLocalAddress(s.getAddress())) {}}
> {{ currentRMId = rmId.trim();}}
> {{ found++;}}
> {{}}}
> {quote}
> {{If the NM and the RM are on the same node, and this RM is in the standby 
> state, the NM will infinitely retry the RPC to the RM.}}






[jira] [Comment Edited] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling

2019-09-12 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928305#comment-16928305
 ] 

Bibin A Chundatt edited comment on YARN-9830 at 9/12/19 7:22 AM:
-

AbstractLivelinessMonitor methods are synchronized, which blocks concurrent 
access for different containerIds.

The PingThread actually monitors the entries in the *running* map.

Could AbstractLivelinessMonitor#running be changed to a ConcurrentHashMap so 
that the synchronization at the class level can be removed?

[~rohithsharma]/[~sunil.gov...@gmail.com]



was (Author: bibinchundatt):
AbstractLivelinessMonitor and its methods are synchronized, which blocks 
concurrent access for different containerIds.

The PingThread actually monitors the entries in the *running* map.

Could AbstractLivelinessMonitor#running be changed to a ConcurrentHashMap so 
that the synchronization at the class level can be removed?


> Improve ContainerAllocationExpirer it blocks scheduling
> ---
>
> Key: YARN-9830
> URL: https://issues.apache.org/jira/browse/YARN-9830
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Priority: Critical
>  Labels: perfomance
>
> {quote}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106)
> - waiting to lock <0x7fa348749550> (a 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x7fc8852f8200> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
> {quote}






[jira] [Commented] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling

2019-09-12 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928305#comment-16928305
 ] 

Bibin A Chundatt commented on YARN-9830:


AbstractLivelinessMonitor and its methods are synchronized, which blocks 
concurrent access for different containerIds.

The PingThread actually monitors the entries in the *running* map.

Could AbstractLivelinessMonitor#running be changed to a ConcurrentHashMap so 
that the synchronization at the class level can be removed?


> Improve ContainerAllocationExpirer it blocks scheduling
> ---
>
> Key: YARN-9830
> URL: https://issues.apache.org/jira/browse/YARN-9830
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Priority: Critical
>  Labels: perfomance
>
> {quote}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106)
> - waiting to lock <0x7fa348749550> (a 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x7fc8852f8200> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
> {quote}






[jira] [Updated] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling

2019-09-12 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9830:
---
Labels: perfomance  (was: )

> Improve ContainerAllocationExpirer it blocks scheduling
> ---
>
> Key: YARN-9830
> URL: https://issues.apache.org/jira/browse/YARN-9830
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Priority: Critical
>  Labels: perfomance
>
> {quote}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106)
> - waiting to lock <0x7fa348749550> (a 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x7fc8852f8200> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
> {quote}






[jira] [Created] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling

2019-09-12 Thread Bibin A Chundatt (Jira)
Bibin A Chundatt created YARN-9830:
--

 Summary: Improve ContainerAllocationExpirer it blocks scheduling
 Key: YARN-9830
 URL: https://issues.apache.org/jira/browse/YARN-9830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt


{quote}
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106)
- waiting to lock <0x7fa348749550> (a 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
- locked <0x7fc8852f8200> (a 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
{quote}








[jira] [Commented] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission

2019-09-03 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921334#comment-16921334
 ] 

Bibin A Chundatt commented on YARN-9738:


[~sunilg] Could you please take a look?

> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> ---
>
> Key: YARN-9738
> URL: https://issues.apache.org/jira/browse/YARN-9738
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9738-001.patch, YARN-9738-002.patch, 
> YARN-9738-003.patch
>
>
> *Env :*
> Server OS :- UBUNTU
> No. of Cluster Node:- 9120 NMs
> Env Mode:- Secure
> *Preconditions:*
> ~9120 NMs were running
> ~1250 applications were in the running state
> 35K applications were in the pending state
> *Test Steps:*
> 1. Submit applications from 5 clients, each client with 2 threads, across a 
> total of 10 queues
> 2. As application submissions increase (each distributed shell application 
> calls getClusterNodes)
> *ClientRMService#getClusterNodes calls ClusterNodeTracker#getNodeReport, 
> where the nodes map is read-locked.*
> {quote}
> "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
> {quote}
> *Instead we can make the nodes map a ConcurrentHashMap and remove the read 
> lock (see the sketch below)*
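
A minimal sketch of the proposed direction, assuming a heavily simplified
stand-in for ClusterNodeTracker (the real class tracks RMNode state and much
more); the point is that a ConcurrentHashMap lets getNodeReport read without
the shared read lock that application submissions contend on:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for ClusterNodeTracker.
public class NodeReportTracker {
  // Placeholder for org.apache.hadoop.yarn.api.records.NodeReport.
  public static class Report {
    final String nodeId;
    Report(String nodeId) { this.nodeId = nodeId; }
  }

  // ConcurrentHashMap gives thread-safe reads without the fair
  // ReentrantReadWriteLock that getClusterNodes callers pile up on.
  private final Map<String, Report> nodes = new ConcurrentHashMap<>();

  public void addNode(String nodeId) {
    nodes.put(nodeId, new Report(nodeId));
  }

  public Report getNodeReport(String nodeId) {
    // ConcurrentHashMap throws NullPointerException on null keys, so the
    // null case must be handled explicitly (see the later review note in
    // this thread about gets with a null key).
    return nodeId == null ? null : nodes.get(nodeId);
  }
}
{code}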



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-09-03 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921190#comment-16921190
 ] 

Bibin A Chundatt commented on YARN-9797:


Thank you [~sunilg] and [~tangzhankun] for the review.

Will commit it soon.

> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9797-001.patch, YARN-9797-002.patch, 
> YARN-9797-003.patch, YARN-9797-004.patch, YARN-9797-005.patch
>
>
> The dominant resource calculator's compare function checks lessThan only on 
> the dominant resource.
> In case of the AM limit, we should activate an application only when all the 
> resource values are less than the AM limit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-30 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919328#comment-16919328
 ] 

Bibin A Chundatt commented on YARN-9785:


[~leftnoteasy] Added a test case to verify the zero-resource case.

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-30 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9785:
---
Attachment: YARN-9785.003.patch

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-08-29 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919135#comment-16919135
 ] 

Bibin A Chundatt commented on YARN-9797:


Thank you [~BilwaST] for working on this.

A few comments:

# Can you change 16384 to 16 * 1024, defining 1024 as a GB constant?
{code}
Resource clusterResource = Resource.newInstance(16384L, 64, res);
{code}
# Add asserts for memory usage and CPU usage too after activation (a sketch 
follows below).
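
A sketch of what the two comments above suggest, assuming the test's existing
{{res}} map and {{queue}} object and JUnit's assertEquals; the expected values
are placeholders, not the actual test's numbers:

{code:java}
// Illustrative fragment inside the test class, not the actual test code.
private static final int GB = 1024;

// 16 * GB instead of the bare 16384L literal.
Resource clusterResource = Resource.newInstance(16 * GB, 64, res);
// ... submit applications and trigger LeafQueue#activateApplications ...

// Also assert AM memory and vcore usage after activation, not just the
// number of activated applications (expected values are placeholders).
assertEquals(1 * GB,
    queue.getQueueResourceUsage().getAMUsed().getMemorySize());
assertEquals(1,
    queue.getQueueResourceUsage().getAMUsed().getVirtualCores());
{code}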



> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9797-001.patch, YARN-9797-002.patch, 
> YARN-9797-003.patch, YARN-9797-004.patch
>
>
> The dominant resource calculator's compare function checks lessThan only on 
> the dominant resource.
> In case of the AM limit, we should activate an application only when all the 
> resource values are less than the AM limit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-08-29 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918718#comment-16918718
 ] 

Bibin A Chundatt commented on YARN-9797:


+1 LGTM.

[~sunil.gov...@gmail.com]/[~tangzhankun] please take a look.

> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9797-001.patch, YARN-9797-002.patch, 
> YARN-9797-003.patch
>
>
> The dominant resource calculator's compare function checks lessThan only on 
> the dominant resource.
> In case of the AM limit, we should activate an application only when all the 
> resource values are less than the AM limit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-08-29 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918718#comment-16918718
 ] 

Bibin A Chundatt edited comment on YARN-9797 at 8/29/19 3:41 PM:
-

+1 LGTM for 003.

[~sunil.gov...@gmail.com]/[~tangzhankun] please take a look.


was (Author: bibinchundatt):
+1 LGTM . 

[~sunil.gov...@gmail.com]/[~tangzhankun] please do take a look .

> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9797-001.patch, YARN-9797-002.patch, 
> YARN-9797-003.patch
>
>
> The dominant resource calculator's compare function checks lessThan only on 
> the dominant resource.
> In case of the AM limit, we should activate an application only when all the 
> resource values are less than the AM limit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-29 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918444#comment-16918444
 ] 

Bibin A Chundatt commented on YARN-9785:


Renamed wip patch to 002.patch and uploaded again.

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-29 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9785:
---
Attachment: YARN-9785.002.patch

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-08-29 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918426#comment-16918426
 ] 

Bibin A Chundatt edited comment on YARN-9797 at 8/29/19 9:09 AM:
-

Thank you [~BilwaST] for working on the patch.

Changes look fine to me.

A small clarification is required related to the old code:

{code}
if (getNumActiveApplications() < 1
    || (Resources.lessThanOrEqual(resourceCalculator,
        lastClusterResource, queueUsage.getAMUsed(partitionName),
        Resources.none()))) {
{code}

I think the second condition is not required: either the number of active 
applications is zero or the AM used resource for the partition is 0, and once 
one application is activated I don't think the queue AM usage can be zero.

[~sunil.gov...@gmail.com]/[~wangda] 


was (Author: bibinchundatt):
Thank you [~BilwaST] for working on the patch

Changes looks fine to me .

Small clarification is related to old old code 

{code}
if (getNumActiveApplications() < 1
    || (Resources.lessThanOrEqual(resourceCalculator,
        lastClusterResource, queueUsage.getAMUsed(partitionName),
        Resources.none()))) {
{code}

I think the second condition is not required.. Either active application can be 
zero / am used resource by partition ==0 
Once 1 application is activated i dont think the queueAMUsage can be zero.

[~sunil.gov...@gmail.com]/[~wangda] 

> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9797-001.patch
>
>
> The dominant resource calculator's compare function checks lessThan only on 
> the dominant resource.
> In case of the AM limit, we should activate an application only when all the 
> resource values are less than the AM limit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-08-29 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918426#comment-16918426
 ] 

Bibin A Chundatt commented on YARN-9797:


Thank you [~BilwaST] for working on the patch.

Changes look fine to me.

A small clarification is required related to the old code:

{code}
if (getNumActiveApplications() < 1
    || (Resources.lessThanOrEqual(resourceCalculator,
        lastClusterResource, queueUsage.getAMUsed(partitionName),
        Resources.none()))) {
{code}

I think the second condition is not required: either the number of active 
applications is zero or the AM used resource for the partition is 0, and once 
one application is activated I don't think the queue AM usage can be zero.

[~sunil.gov...@gmail.com]/[~wangda] 

> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9797-001.patch
>
>
> The dominant resource calculator's compare function checks lessThan only on 
> the dominant resource.
> In case of the AM limit, we should activate an application only when all the 
> resource values are less than the AM limit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9627) DelegationTokenRenewer could block transitionToStandy

2019-08-29 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918331#comment-16918331
 ] 

Bibin A Chundatt commented on YARN-9627:


[~rohithsharma]

Do you think this issue should be considered for 3.2.1? The issue could block 
switchover when HDFS token renewal takes time and we have too many apps.

> DelegationTokenRenewer could block transitionToStandy
> -
>
> Key: YARN-9627
> URL: https://issues.apache.org/jira/browse/YARN-9627
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9627.001.patch, YARN-9627.002.patch, 
> YARN-9627.003.patch
>
>
> Cluster size: 5K
> Running containers: 55K
> *Scenario*: Large number of pending applications (around 50K) and performing 
> RM switchover
> Below exception:
> {noformat}
> 2019-06-13 17:39:27,594 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renew Kind: HDFS_DELEGATION_TOKEN, Service: X:1616, Ident: (token 
> for root: HDFS_DELEGATION_TOKEN owner=root/had...@hadoop.com, renewer=yarn, 
> realUser=, issueDate=1560361265181, maxDate=1560966065181, 
> sequenceNumber=104708, masterKeyId=3);exp=1560533965360; 
> apps=[application_1560346941775_20702] in 86397766 ms, appId = 
> [application_1560346941775_20702]
> 2019-06-13 17:39:27,609 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:522)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>  
> 2019-06-13 17:58:20,878 ERROR org.apache.zookeeper.ClientCnxn: Time out error 
> occurred for the packet 'clientPath:null serverPath:null finished:false 
> header:: 27,4  replyHeader:: 27,4295687588,0  request:: 
> '/rmstore1/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot/DelegationKey_49,F
>   response:: 
> #31ff8a16b74ffe129768ffdbffe949ff8dffd517ffcafffa,s{4295423577,4295423577,1560342837789,1560342837789,0,0,0,0,17,0,4295423577}
>  '.
> 2019-06-13 17:58:20,877 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: 
> X:1616, Ident: (token for root: HDFS_DELEGATION_TOKEN 
> owner=root/had...@hadoop.com, renewer=yarn, realUser=, 
> issueDate=1560366110990, maxDate=1560970910990, sequenceNumber=111891, 
> masterKeyId=3);exp=1560534896413; apps=[application_1560346941775_28115]]
> 2019-06-13 17:58:20,924 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.IllegalStateException: Timer already cancelled.
> at java.util.Timer.sched(Timer.java:397)
> at java.util.Timer.schedule(Timer.java:208)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.setTimerForTokenRenewal(DelegationTokenRenewer.java:612)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:523)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748
> {noformat}



--
This message was sent by Atlassian Jira

[jira] [Updated] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-08-29 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9797:
---
Target Version/s: 3.2.1, 3.1.3

> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
>
> The dominant resource calculator's compare function checks lessThan only on 
> the dominant resource.
> In case of the AM limit, we should activate an application only when all the 
> resource values are less than the AM limit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-29 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918308#comment-16918308
 ] 

Bibin A Chundatt edited comment on YARN-9785 at 8/29/19 6:04 AM:
-

[~sunilg]/[~rohithsharma]

Raised YARN-9797 to fix the AM activation issue


was (Author: bibinchundatt):
[~sunilg]/[~rohithsharma]

Raise YARN-9797 to fix the AM activation issue

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-29 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918308#comment-16918308
 ] 

Bibin A Chundatt commented on YARN-9785:


[~sunilg]/[~rohithsharma]

Raised YARN-9797 to fix the AM activation issue

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-08-29 Thread Bibin A Chundatt (Jira)
Bibin A Chundatt created YARN-9797:
--

 Summary: LeafQueue#activateApplications should use 
resourceCalculator#fitsIn
 Key: YARN-9797
 URL: https://issues.apache.org/jira/browse/YARN-9797
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt


The dominant resource calculator's compare function checks lessThan only on 
the dominant resource.
In case of the AM limit, we should activate an application only when all the 
resource values are less than the AM limit.




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-28 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9785:
---
Summary: Fix DominantResourceCalculator when one resource is zero  (was: 
Application gets activated even when AM memory has reached)

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-28 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917706#comment-16917706
 ] 

Bibin A Chundatt edited comment on YARN-9785 at 8/28/19 4:18 PM:
-

[~sunilg]

Checked the wip patch. Found the following issue with the approach when 
considering only the compare method:
||CASE||Cluster Resource||lhs||rhs||Detail||STATUS||
|1|<0,0,0>|<1240,2,0>|<0,0,0>|NA|(/)|
|2|<10240,10,0>|<2048,2,0>|<2048,8,0>|calculateShares -> firstShares (lhs) 
.2,.2,INF
 calculateShares -> secondShares (rhs) .2,.8,INF
 compareShares -> .2,.2,INF to .2,.8,INF
 compareShares -> INF,.2,.2 INF,.8,.2|(/)|
|3|<10240,10,0>|<4096,2,0>|<2048,8,0>|calculateShares -> firstShares (lhs) 
.4,.2,INF
 calculateShares -> secondShares (rhs) .2,.8,INF
 compareShares -> .4,.2,INF to .2,.8,INF
 compareShares sort -> INF,.4,.2 INF,.8,.2
 diff -.4, return -1|(x)|

*Detail in the activate-application case:*

*rhs* -> amLimit
 *lhs* -> amIfStarted

_Resources.lessThanOrEqual(resourceCalculator, lastClusterResource, amIfStarted, 
amLimit)_ -> true in *case 3* and the application gets activated.


was (Author: bibinchundatt):
Sunil

Check the wip patch . Finding the following issue with approach considering 
only the compare method
|CASE||Cluster Resource||lhs||rhs||Detail||STATUS||
|1|<0,0,0>|<1240,2,0>|<0,0,0>|NA|(/)|
|2|<10240,10,0>|<2048,2,0>|<2048,8,0>|calculateShares->firstShares lhs 
..2,.2,INF
 calculateShares-> secondShares rhs .2,.8,INF
 compareShares -> .2,.2,INF to .2,.8,INF
 compareShares -> INF,..2,.2 INF,.8,.2|(/)|
|3|<10240,10,0>|<4096,2,0>|<2048,8,0>|calculateShares->firstShares lhs 
..4,.2,INF
 calculateShares-> secondShares rhs .2,.8,INF
 compareShares -> .4,.2,INF to .2,.8,INF
 compareShares sort -> INF,.4,.2 INF,.8,.2
 diff -.4 return -1|(x)|

*Detail in case of activate application:*

*rhs* -> amlimit
 *lhs* -> amifStarted

_Resources.lessThanOrEqual(resourceCalculator, lastClusterResource,amIfStarted, 
amLimit)_ -> true in *case 3* and application gets activated.

> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-28 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917711#comment-16917711
 ] 

Bibin A Chundatt commented on YARN-9785:


Looks like we have to do 2 fixes:

* Fix the compare method
* For validations which need every resourceInformation to be less than the 
other ResourceInformation, *fitsIn* looks like the best fit (see the sketch 
below)
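
A minimal sketch of the fitsIn semantics over plain long arrays (illustrative
only; the real ResourceCalculator#fitsIn iterates the registered
ResourceInformation entries): activation should pass only when every resource
of amIfStarted fits within amLimit.

{code:java}
// Simplified illustration of the fitsIn semantics.
public class FitsInSketch {
  // True only when EVERY resource value of "smaller" is <= "bigger",
  // unlike a dominant-share compare, which can report <= even when
  // one dimension overflows.
  static boolean fitsIn(long[] smaller, long[] bigger) {
    for (int i = 0; i < smaller.length; i++) {
      if (smaller[i] > bigger[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    long[] amIfStarted = {4096, 2, 0}; // memory, vcores, gpu
    long[] amLimit = {2048, 8, 0};
    // Case 3 from the table above: a dominant-share lessThanOrEqual says
    // the AM fits, but fitsIn correctly rejects it (4096 > 2048).
    System.out.println(fitsIn(amIfStarted, amLimit)); // false
  }
}
{code}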

> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-28 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917706#comment-16917706
 ] 

Bibin A Chundatt edited comment on YARN-9785 at 8/28/19 11:52 AM:
--

Sunil

Check the wip patch. Finding the following issue with the approach considering 
only the compare method:
||CASE||Cluster Resource||lhs||rhs||Detail||STATUS||
|1|<0,0,0>|<1240,2,0>|<0,0,0>|NA|(/)|
|2|<10240,10,0>|<2048,2,0>|<2048,8,0>|calculateShares -> firstShares (lhs) 
.2,.2,INF
 calculateShares -> secondShares (rhs) .2,.8,INF
 compareShares -> .2,.2,INF to .2,.8,INF
 compareShares -> INF,.2,.2 INF,.8,.2|(/)|
|3|<10240,10,0>|<4096,2,0>|<2048,8,0>|calculateShares -> firstShares (lhs) 
.4,.2,INF
 calculateShares -> secondShares (rhs) .2,.8,INF
 compareShares -> .4,.2,INF to .2,.8,INF
 compareShares sort -> INF,.4,.2 INF,.8,.2
 diff -.4, return -1|(x)|

*Detail in the activate-application case:*

*rhs* -> amLimit
 *lhs* -> amIfStarted

_Resources.lessThanOrEqual(resourceCalculator, lastClusterResource, amIfStarted, 
amLimit)_ -> true in *case 3* and the application gets activated.


was (Author: bibinchundatt):
Sunil

Check the wip patch . Finding the following issue with approach considering 
only the compare method

||Cluster Resource||lhs||rhs||Detail||STATUS||
|<0,0,0>|<1240,2,0>|<0,0,0>|NA|(/)|
|<10240,10,0>|<2048,2,0>|<2048,8,0>|calculateShares->firstShares  lhs   
..2,.2,INF
calculateShares-> secondShares  rhs   .2,.8,INF
compareShares  ->   .2,.2,INF   to  .2,.8,INF
compareShares  ->   INF,..2,.2 INF,.8,.2|(/)|
|<10240,10,0>|<4096,2,0>|<2048,8,0>|calculateShares->firstShares  lhs   
..4,.2,INF
calculateShares-> secondShares  rhs   .2,.8,INF
compareShares  ->   .4,.2,INF   to  .2,.8,INF
compareShares sort ->   INF,.4,.2 INF,.8,.2
diff -.4 return -1 |(x)|

*Detail in case of activate application:*

*rhs* -> amlimit
*lhs* -> amifStarted

_Resources.lessThanOrEqual(resourceCalculator, lastClusterResource,amIfStarted, 
amLimit)_ -> true and application gets activated.


> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-28 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917706#comment-16917706
 ] 

Bibin A Chundatt commented on YARN-9785:


Sunil

Check the wip patch. Finding the following issue with the approach considering 
only the compare method:

||Cluster Resource||lhs||rhs||Detail||STATUS||
|<0,0,0>|<1240,2,0>|<0,0,0>|NA|(/)|
|<10240,10,0>|<2048,2,0>|<2048,8,0>|calculateShares -> firstShares (lhs) 
.2,.2,INF
calculateShares -> secondShares (rhs) .2,.8,INF
compareShares -> .2,.2,INF to .2,.8,INF
compareShares -> INF,.2,.2 INF,.8,.2|(/)|
|<10240,10,0>|<4096,2,0>|<2048,8,0>|calculateShares -> firstShares (lhs) 
.4,.2,INF
calculateShares -> secondShares (rhs) .2,.8,INF
compareShares -> .4,.2,INF to .2,.8,INF
compareShares sort -> INF,.4,.2 INF,.8,.2
diff -.4, return -1|(x)|

*Detail in case of activate application:*

*rhs* -> amLimit
*lhs* -> amIfStarted

_Resources.lessThanOrEqual(resourceCalculator, lastClusterResource, amIfStarted, 
amLimit)_ -> true and the application gets activated.


> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-27 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916544#comment-16916544
 ] 

Bibin A Chundatt commented on YARN-9785:


Thank you [~BilwaST] for raising this

The issue looks like an API interface-level issue in 
DominantResourceCalculator#compare, or wrong usage of it. *0* should be 
returned only when all the resources are equal, but in this case, if one 
resource is greater and the other is less, the compare returns *0*.


{code:java}
  /**
   * Compare two resources - if the value for every resource type for the lhs
   * is greater than that of the rhs, return 1. If the value for every resource
   * type in the lhs is less than the rhs, return -1. Otherwise, return 0
   *
   * @param lhs resource to be compared
   * @param rhs resource to be compared
   * @return 0, 1, or -1
   */

 private int compare(Resource lhs, Resource rhs) {

  public int compare(Resource clusterResource, Resource lhs, Resource rhs,
  boolean singleType) {
{code}


Cluster resource: <10,10,0> (memory, cpu, gpu)

||lhs||rhs||result||
|<1,0>|<0,1>|returns 0|

ResourceCalculator#compare expects *0* only if values are equal.

All the callers that expect all the fields to be 
lessThanOrEqual/greaterThanOrEqual are affected (see the sketch below).

[~rohithsharma]/[~tangzhankun]/[~sunil.gov...@gmail.com]
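
A small self-contained sketch of why that contract matters, using a
dominant-share style compare over two-resource vectors (illustrative, not the
actual DominantResourceCalculator code):

{code:java}
// Illustrative dominant-share compare over two-resource vectors.
public class CompareContractSketch {
  // Shares are value/clusterTotal; resources compare by their max share.
  static int dominantCompare(int[] lhs, int[] rhs, int[] cluster) {
    double lhsMax = 0, rhsMax = 0;
    for (int i = 0; i < cluster.length; i++) {
      lhsMax = Math.max(lhsMax, (double) lhs[i] / cluster[i]);
      rhsMax = Math.max(rhsMax, (double) rhs[i] / cluster[i]);
    }
    return Double.compare(lhsMax, rhsMax);
  }

  public static void main(String[] args) {
    int[] cluster = {10, 10};
    // lhs <1,0> vs rhs <0,1>: dominant shares are both 0.1, so compare
    // returns 0 even though neither resource "fits in" the other --
    // callers that read 0 as "all fields equal" are misled.
    System.out.println(dominantCompare(
        new int[] {1, 0}, new int[] {0, 1}, cluster)); // 0
  }
}
{code}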


> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9640) Slow event processing could cause too many attempt unregister events

2019-08-27 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9640:
---
Attachment: YARN-9640-branch-3.2.001.patch

> Slow event processing could cause too many attempt unregister events
> 
>
> Key: YARN-9640
> URL: https://issues.apache.org/jira/browse/YARN-9640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
>  Labels: scalability
> Fix For: 3.3.0
>
> Attachments: YARN-9640-branch-3.2.001.patch, YARN-9640.001.patch, 
> YARN-9640.002.patch, YARN-9640.003.patch
>
>
> We found in one of our test cluster verifications that the number of attempt 
> unregister events is about 300k+.
>  # All of the AM's containers completed.
>  # AMRMClientImpl sends finishApplicationMaster.
>  # The AMRMClient polls the finish status every 100 ms using a 
> finishApplicationMaster request.
>  # AMRMClientImpl#unregisterApplicationMaster:
> {code:java}
>   while (true) {
> FinishApplicationMasterResponse response =
> rmClient.finishApplicationMaster(request);
> if (response.getIsUnregistered()) {
>   break;
> }
> LOG.info("Waiting for application to be successfully unregistered.");
> Thread.sleep(100);
>   }
> {code}
>  # The ApplicationMasterService finishApplicationMaster interface sends 
> unregister events on every status update.
> We should send the unregister event only once, cache that it was sent, ignore 
> subsequent requests, and send a not-yet-unregistered response back to the AM 
> instead of overloading the event queue (see the sketch below).
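
A minimal sketch of the proposed de-duplication, assuming a hypothetical
helper on the RM side (field and event names here are illustrative, not the
actual ApplicationMasterService code):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: remember which attempts already had an unregister
// event dispatched, so the repeated finishApplicationMaster polls from the
// AMRMClient loop above do not flood the RM event queue.
public class UnregisterDeduper {
  private final Map<String, Boolean> unregisterSent =
      new ConcurrentHashMap<>();

  /** Returns true if the caller should dispatch the unregister event. */
  public boolean shouldDispatch(String attemptId) {
    // putIfAbsent is atomic: only the first poll per attempt wins.
    return unregisterSent.putIfAbsent(attemptId, Boolean.TRUE) == null;
  }

  public void onFinishApplicationMaster(String attemptId) {
    if (shouldDispatch(attemptId)) {
      // dispatcher.handle(new AttemptUnregistrationEvent(attemptId));
      System.out.println("dispatching unregister for " + attemptId);
    }
    // Otherwise just answer the poll with isUnregistered=false until the
    // attempt actually reaches its final state.
  }
}
{code}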



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission

2019-08-26 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916038#comment-16916038
 ] 

Bibin A Chundatt commented on YARN-9738:


[~BilwaST] 

As discussed offline, we need to handle gets on the nodes map with a null key.

> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> ---
>
> Key: YARN-9738
> URL: https://issues.apache.org/jira/browse/YARN-9738
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9738-001.patch, YARN-9738-002.patch
>
>
> *Env:*
> Server OS: UBUNTU
> No. of cluster nodes: 9120 NMs
> Env mode: Secure
> *Preconditions:*
> ~9120 NMs were running
> ~1250 applications were in running state
> 35K applications were in pending state
> *Test Steps:*
> 1. Submit applications from 5 clients, each client with 2 threads, across 10 
> queues in total
> 2. Once application submission increases (each distributed shell application 
> calls getClusterNodes)
> *ClientRMService#getClusterNodes tries to get 
> ClusterNodeTracker#getNodeReport, where the nodes map is locked.*
> {quote}
> "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
> {quote}
> *Instead we can make the nodes map a ConcurrentHashMap and remove the read 
> lock*



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9642) Fix Memory Leak in AbstractYarnScheduler caused by timer

2019-08-26 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916014#comment-16916014
 ] 

Bibin A Chundatt edited comment on YARN-9642 at 8/26/19 5:58 PM:
-

Committed to trunk, branch-3.2 and branch-3.1.

Thank you [~sunilg], [~rohithsharma], [~Tao Yang], [~tangzhankun] for the 
reviews.


was (Author: bibinchundatt):
Committed to trunk , branch-3.2 and branch-3.

 

Thank you [~sunilg] , [~rohithsharma] , [~Tao Yang] , [~tangzhankun]  for 
reviews

> Fix Memory Leak in AbstractYarnScheduler caused by timer
> 
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>  Labels: memory-leak
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in case of a fast 
> switchover too.
>  AbstractYarnScheduler should make sure the scheduled Timer is cancelled on 
> serviceStop (see the sketch below).
> This causes a memory leak too:
> !image-2019-06-22-16-05-24-114.png!
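
A minimal sketch of the shape of the fix, assuming a simplified service around
java.util.Timer (illustrative, not the actual AbstractYarnScheduler code):

{code:java}
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical, simplified scheduler service: the key point is that the
// Timer created in start is cancelled in stop, so its TimerTask no longer
// pins the scheduler instance after a transition to standby.
public class TimerOwningService {
  private Timer releaseCache;

  public void serviceStart() {
    releaseCache = new Timer("release-cache-cleanup", /* isDaemon */ true);
    releaseCache.schedule(new TimerTask() {
      @Override
      public void run() {
        // periodic pending-container cache cleanup would go here
      }
    }, 10_000L, 10_000L);
  }

  public void serviceStop() {
    if (releaseCache != null) {
      // Without this, the timer thread keeps a strong reference to the
      // TimerTask (and through it, this service) -> memory leak.
      releaseCache.cancel();
      releaseCache = null;
    }
  }
}
{code}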



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9642) Fix Memory Leak in AbstractYarnScheduler caused by timer

2019-08-26 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9642:
---
Summary: Fix Memory Leak in AbstractYarnScheduler caused by timer  (was: 
AbstractYarnScheduler#clearPendingContainerCache cause memory leak)

> Fix Memory Leak in AbstractYarnScheduler caused by timer
> 
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in case of a fast 
> switchover too.
>  AbstractYarnScheduler should make sure the scheduled Timer is cancelled on 
> serviceStop.
> This causes a memory leak too:
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache cause memory leak

2019-08-26 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9642:
---
Summary: AbstractYarnScheduler#clearPendingContainerCache cause memory leak 
 (was: AbstractYarnScheduler#clearPendingContainerCache could run even after 
transitiontostandby)

> AbstractYarnScheduler#clearPendingContainerCache cause memory leak
> --
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in case of a fast 
> switchover too.
>  AbstractYarnScheduler should make sure the scheduled Timer is cancelled on 
> serviceStop.
> This causes a memory leak too:
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-26 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9785:
---
Target Version/s: 3.2.1, 3.1.3

> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-08-26 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915955#comment-16915955
 ] 

Bibin A Chundatt commented on YARN-9642:


[~tangzhankun] and [~rohithsharma]

Tried running it locally

{code}
[INFO] 
[INFO] --- maven-surefire-plugin:3.0.0-M1:test (default-test) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 53.808 s 
- in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
{code}

Failure seems random

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in case of a fast 
> switchover too.
>  AbstractYarnScheduler should make sure the scheduled Timer is cancelled on 
> serviceStop.
> This causes a memory leak too:
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9765) SLS runner crashes when run with metrics turned off.

2019-08-21 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912043#comment-16912043
 ] 

Bibin A Chundatt commented on YARN-9765:


Thank you [~abmodi] for working on this.
Committed to trunk, branch-3.2 and branch-3.1.

> SLS runner crashes when run with metrics turned off.
> 
>
> Key: YARN-9765
> URL: https://issues.apache.org/jira/browse/YARN-9765
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9765.001.patch
>
>
> When sls metrics is turned off, creation of AM fails with NPE.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9640) Slow event processing could cause too many attempt unregister events

2019-08-21 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9640:
---
Target Version/s: 3.2.1

> Slow event processing could cause too many attempt unregister events
> 
>
> Key: YARN-9640
> URL: https://issues.apache.org/jira/browse/YARN-9640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
>  Labels: scalability
> Attachments: YARN-9640.001.patch, YARN-9640.002.patch, 
> YARN-9640.003.patch
>
>
> We found in one of our test cluster verifications that the number of attempt 
> unregister events is about 300k+.
>  # All of the AM's containers completed.
>  # AMRMClientImpl sends finishApplicationMaster.
>  # The AMRMClient polls the finish status every 100 ms using a 
> finishApplicationMaster request.
>  # AMRMClientImpl#unregisterApplicationMaster:
> {code:java}
>   while (true) {
> FinishApplicationMasterResponse response =
> rmClient.finishApplicationMaster(request);
> if (response.getIsUnregistered()) {
>   break;
> }
> LOG.info("Waiting for application to be successfully unregistered.");
> Thread.sleep(100);
>   }
> {code}
>  # The ApplicationMasterService finishApplicationMaster interface sends 
> unregister events on every status update.
> We should send the unregister event only once, cache that it was sent, ignore 
> subsequent requests, and send a not-yet-unregistered response back to the AM 
> instead of overloading the event queue.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-21 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9714:
---
Target Version/s: 3.2.1

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating 
> the memory dump and jstack, I found two places in the RM that may cause 
> memory leaks after the RM transitions to standby:
>  # The release cache cleanup timer in AbstractYarnScheduler is never 
> canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> the services are stopping (see the sketch below).
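
A minimal sketch of the shape of the second fix, assuming Curator as the
ZooKeeper client (field and method names are illustrative, not the actual
ZKRMStateStore code):

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryNTimes;

// Hypothetical, simplified state store illustrating the fix: the Curator
// client opened in start is closed in stop, so each active->standby
// transition does not leak a live ZooKeeper connection.
public class ZkBackedStore {
  private CuratorFramework curator;

  public void serviceStart() {
    curator = CuratorFrameworkFactory.newClient(
        "localhost:2181", new RetryNTimes(3, 1000));
    curator.start();
  }

  public void serviceStop() {
    if (curator != null) {
      // Closing releases the ZK session and its send/event threads.
      curator.close();
      curator = null;
    }
  }
}
{code}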



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-08-21 Thread Bibin A Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9642:
---
Target Version/s: 3.2.1, 3.1.3

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in case of a fast 
> switchover too.
>  AbstractYarnScheduler should make sure the scheduled Timer is cancelled on 
> serviceStop.
> This causes a memory leak too:
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9765) SLS runner crashes when run with metrics turned off.

2019-08-20 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911300#comment-16911300
 ] 

Bibin A Chundatt commented on YARN-9765:


+1 LGTM. Will commit once we have the Jenkins results.

> SLS runner crashes when run with metrics turned off.
> 
>
> Key: YARN-9765
> URL: https://issues.apache.org/jira/browse/YARN-9765
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9765.001.patch
>
>
> When sls metrics is turned off, creation of AM fails with NPE.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9755) RM fails to start with FileSystemBasedConfigurationProvider

2019-08-20 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911215#comment-16911215
 ] 

Bibin A Chundatt commented on YARN-9755:


Good catch [~eyang]!

> RM fails to start with FileSystemBasedConfigurationProvider
> ---
>
> Key: YARN-9755
> URL: https://issues.apache.org/jira/browse/YARN-9755
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9755-001.patch, YARN-9755-002.patch, 
> YARN-9755-003.patch
>
>
> RM fails to start with below exception when 
> FileSystemBasedConfigurationProvider is used.
> *Exception:*
> {code}
> 2019-08-16 12:05:33,802 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting 
> ResourceManager
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> java.io.IOException: Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:868)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1281)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(ResourceManager.java:1312)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1335)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1328)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1379)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1567)
> Caused by: java.io.IOException: java.io.IOException: Filesystem closed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.FileBasedCSConfigurationProvider.loadConfiguration(FileBasedCSConfigurationProvider.java:64)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:346)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:445)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> ... 14 more
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:475)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1682)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1586)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1598)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1701)
> at 
> org.apache.hadoop.yarn.FileSystemBasedConfigurationProvider.getConfigurationInputStream(FileSystemBasedConfigurationProvider.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.FileBasedCSConfigurationProvider.loadConfiguration(FileBasedCSConfigurationProvider.java:56)
> {code}
> FileSystemBasedConfigurationProvider uses the cached FileSystem, which causes
> the issue.
> *Configs:*
> {code}
> <property>
>   <name>yarn.resourcemanager.configuration.provider-class</name>
>   <value>org.apache.hadoop.yarn.FileSystemBasedConfigurationProvider</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.configuration.file-system-based-store</name>
>   <value>/yarn/conf</value>
> </property>
> [yarn@yarndocker-1 yarn]$ hadoop fs -ls /yarn/conf
> -rw-r--r--   3 yarn supergroup   4138 2019-08-16 13:09 
> /yarn/conf/capacity-scheduler.xml
> -rw-r--r--   3 
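
For illustration only: the usual remedy for a cached FileSystem being closed underneath a reader is to take a non-cached instance. A hedged sketch, with a hypothetical helper class (the attached patches are the authoritative fix):

{code:java}
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: load a config file through a non-cached FileSystem so
// a close() elsewhere in the process cannot invalidate the client under us.
public class NonCachedConfLoader {

  public InputStream open(Configuration bootstrapConf, Path confPath)
      throws Exception {
    // FileSystem.newInstance() bypasses the JVM-wide cache that
    // FileSystem.get() consults, returning a private, still-open client.
    FileSystem fs = FileSystem.newInstance(confPath.toUri(), bootstrapConf);
    if (!fs.exists(confPath)) {
      fs.close();
      return null;
    }
    // Caller is responsible for closing the stream and then the FileSystem.
    return fs.open(confPath);
  }
}
{code}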

[jira] [Commented] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission

2019-08-19 Thread Bibin A Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911010#comment-16911010
 ] 

Bibin A Chundatt commented on YARN-9738:


[~BilwaST]

Could you please look into the test case failures?

> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> ---
>
> Key: YARN-9738
> URL: https://issues.apache.org/jira/browse/YARN-9738
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9738-001.patch, YARN-9738-002.patch
>
>
> *Env:*
> Server OS: Ubuntu
> No. of Cluster Nodes: 9120 NMs
> Env Mode: Secure
> *Preconditions:*
> ~9120 NMs were running
> ~1250 applications were in running state
> 35K applications were in pending state
> *Test Steps:*
> 1. Submit applications from 5 clients, each client with 2 threads, across a
> total of 10 queues
> 2. Once application submission increases (each distributed-shell application
> will call getClusterNodes)
> *ClientRMService#getClusterNodes calls
> ClusterNodeTracker#getNodeReport, where the nodes map is locked.*
> {quote}
> "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
> {quote}
> *Instead, we can make nodes a ConcurrentHashMap and remove the read lock.*



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission

2019-08-16 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908908#comment-16908908
 ] 

Bibin A Chundatt commented on YARN-9738:


[~sunilg]

Currently only the nodes map is changed to a ConcurrentHashMap, which gives
bucket-level locking.
Also, the read lock is removed only for getNodeReport invoked from
ClientRMService, NodeInfo and ApplicationMasterService; an explicit null check
is handled too.

Patch looks good to me. Can we go ahead?
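
For illustration, a minimal sketch of the locking change under discussion (a simplified, hypothetical tracker; the real ClusterNodeTracker keeps much more state):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.yarn.api.records.NodeId;

// Simplified tracker: reads no longer queue behind a fair read-write lock;
// ConcurrentHashMap gives lock-free gets and bucket-level locking on writes.
public class NodeTrackerSketch<N> {

  private final ConcurrentMap<NodeId, N> nodes = new ConcurrentHashMap<>();

  public void addNode(NodeId nodeId, N node) {
    nodes.put(nodeId, node);
  }

  public N getNode(NodeId nodeId) {
    // May return null for an unknown or removed node, hence the explicit
    // null checks mentioned above in the callers.
    return nodes.get(nodeId);
  }
}
{code}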



> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> ---
>
> Key: YARN-9738
> URL: https://issues.apache.org/jira/browse/YARN-9738
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9738-001.patch, YARN-9738-002.patch
>
>
> *Env:*
> Server OS: Ubuntu
> No. of Cluster Nodes: 9120 NMs
> Env Mode: Secure
> *Preconditions:*
> ~9120 NMs were running
> ~1250 applications were in running state
> 35K applications were in pending state
> *Test Steps:*
> 1. Submit applications from 5 clients, each client with 2 threads, across a
> total of 10 queues
> 2. Once application submission increases (each distributed-shell application
> will call getClusterNodes)
> *ClientRMService#getClusterNodes calls
> ClusterNodeTracker#getNodeReport, where the nodes map is locked.*
> {quote}
> "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
> {quote}
> *Instead, we can make nodes a ConcurrentHashMap and remove the read lock.*



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5857) TestLogAggregationService.testFixedSizeThreadPool fails intermittently on trunk

2019-08-16 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908791#comment-16908791
 ] 

Bibin A Chundatt commented on YARN-5857:


Thank you [~BilwaST] for updating the patch.

+1 LGTM. Will wait for [~adam.antal]'s comments too.

> TestLogAggregationService.testFixedSizeThreadPool fails intermittently on 
> trunk
> ---
>
> Key: YARN-5857
> URL: https://issues.apache.org/jira/browse/YARN-5857
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-5857-001.patch, YARN-5857-002.patch, 
> testFixedSizeThreadPool failure reproduction
>
>
> {noformat}
> testFixedSizeThreadPool(org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService)
>   Time elapsed: 0.11 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<3> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.testFixedSizeThreadPool(TestLogAggregationService.java:1139)
> {noformat}
> Refer to https://builds.apache.org/job/PreCommit-YARN-Build/13829/testReport/



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-2599) Standby RM should also expose some jmx and metrics

2019-08-14 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907179#comment-16907179
 ] 

Bibin A Chundatt edited comment on YARN-2599 at 8/14/19 11:15 AM:
--

[~sunilg]

# The change could cause incompatibility w.r.t. monitoring systems. Redirects
need to be handled explicitly. Should we make this configurable?
# Could you point to the HTTPServlet handling */metrics*?



was (Author: bibinchundatt):
[~sunilg]

# Change could incompatability w.r.t monitoring systems. Redirect need to be 
handled explicitly. Should we make this configurable ??
# Could you point to HTTPServlet handling */metrics*


> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics

2019-08-14 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907179#comment-16907179
 ] 

Bibin A Chundatt commented on YARN-2599:


[~sunilg]

# The change could cause incompatibility w.r.t. monitoring systems. Redirects
need to be handled explicitly. Should we make this configurable?
# Could you point to the HTTPServlet handling */metrics*?


> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9747) Reduce additional namenode call by EntityGroupFSTimelineStore#cleanLogs

2019-08-14 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906931#comment-16906931
 ] 

Bibin A Chundatt commented on YARN-9747:


Thank you [~Prabhu Joseph] for the patch.

+1 LGTM. Will wait for the Jenkins results.



> Reduce additional namenode call by EntityGroupFSTimelineStore#cleanLogs
> ---
>
> Key: YARN-9747
> URL: https://issues.apache.org/jira/browse/YARN-9747
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9747-001.patch
>
>
> EntityGroupFSTimelineStore#cleanLogs creates an additional Namenode RPC call.
> {code}
> cleanLogs:
>  while (iter.hasNext()) {
>    FileStatus stat = iter.next();
>    Path clusterTimeStampPath = stat.getPath();
>    if (isValidClusterTimeStampDir(clusterTimeStampPath)) {
>      MutableBoolean appLogDirPresent = new MutableBoolean(false);
> {code}
> {{fs.getFileStatus(clusterTimeStampPath)}} in *isValidClusterTimeStampDir*
> creates an additional Namenode RPC call.
> cc [~bibinchundatt]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates

2019-08-13 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906913#comment-16906913
 ] 

Bibin A Chundatt commented on YARN-9080:


[~Prabhu Joseph] Thank you for updating the patch. Could you handle this in a
new JIRA?

> Bucket Directories as part of ATS done accumulates
> --
>
> Key: YARN-9080
> URL: https://issues.apache.org/jira/browse/YARN-9080
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, 
> 0003-YARN-9080.patch, YARN-9080-004.patch, YARN-9080-005.patch, 
> YARN-9080-006.patch, YARN-9080-007.patch, YARN-9080-008.patch, 
> YARN-9080.addendum-001.patch
>
>
> Have observed that older bucket directories (cluster_timestamp, bucket1 and
> bucket2) accumulate as part of the ATS done directory. The cleanLogs part of
> EntityLogCleaner removes only the app directories and not the bucket
> directories.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates

2019-08-13 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906884#comment-16906884
 ] 

Bibin A Chundatt commented on YARN-9080:


Thank you [~Prabhu Joseph] for working on this.

I have a query regarding this. Sorry to come in really late.

{code}
while (iter.hasNext()) {
  FileStatus stat = iter.next();
  Path clusterTimeStampPath = stat.getPath();
  if (isValidClusterTimeStampDir(clusterTimeStampPath)) {
    MutableBoolean appLogDirPresent = new MutableBoolean(false);
{code}
{{fs.getFileStatus(clusterTimeStampPath)}} in *isValidClusterTimeStampDir*
creates an additional Namenode RPC call.

Can we pass the FileStatus instead of the path to
{{isValidClusterTimeStampDir}} to reduce the Namenode RPC calls?

Thoughts?
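
For illustration, a sketch of the proposed signature change (helper names are hypothetical, and treating cluster timestamp dirs as numeric is an assumption). The listing iterator already yields a FileStatus, so validity can be checked without a second getFileStatus() round trip:

{code:java}
import org.apache.hadoop.fs.FileStatus;

// Hypothetical sketch: accept the FileStatus from the listing instead of a
// Path, so no extra fs.getFileStatus() RPC to the NameNode is needed.
public class CleanerSketch {

  boolean isValidClusterTimeStampDir(FileStatus stat) {
    // stat came straight from the directory listing; reuse it directly.
    return stat.isDirectory() && isClusterTimestamp(stat.getPath().getName());
  }

  private boolean isClusterTimestamp(String name) {
    try {
      Long.parseLong(name);  // assumption: timestamp dir names are numeric
      return true;
    } catch (NumberFormatException e) {
      return false;
    }
  }
}
{code}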

> Bucket Directories as part of ATS done accumulates
> --
>
> Key: YARN-9080
> URL: https://issues.apache.org/jira/browse/YARN-9080
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, 
> 0003-YARN-9080.patch, YARN-9080-004.patch, YARN-9080-005.patch, 
> YARN-9080-006.patch, YARN-9080-007.patch, YARN-9080-008.patch
>
>
> Have observed that older bucket directories (cluster_timestamp, bucket1 and
> bucket2) accumulate as part of the ATS done directory. The cleanLogs part of
> EntityLogCleaner removes only the app directories and not the bucket
> directories.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission

2019-08-12 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905067#comment-16905067
 ] 

Bibin A Chundatt edited comment on YARN-9738 at 8/12/19 10:59 AM:
--

Did offline testing with sample code.

With 10K nodes and concurrent getNodeReport for all nodes, the time taken is
~28 secs vs ~88 ms when a *ConcurrentHashMap* is used.
[~BilwaST], I think it's safe to remove the read lock and change
ClusterNodeTracker#nodes to a ConcurrentHashMap.

cc: [~sunil.gov...@gmail.com]


was (Author: bibinchundatt):
Did an offline testing with sample code . 

With 10K nodes + concurrent getNodeReport for all nodes the time take ~28 secs 
Vs 88ms when *concurrentHashMap* is used.
[~BilwaST] its safe to remove the readlock and make ClusterNodeTracker#nodes to 
concurrenthashMap.

cc: [~sunil.gov...@gmail.com]

> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> ---
>
> Key: YARN-9738
> URL: https://issues.apache.org/jira/browse/YARN-9738
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
>
> *Env:*
> Server OS: Ubuntu
> No. of Cluster Nodes: 9120 NMs
> Env Mode: Secure
> *Preconditions:*
> ~9120 NMs were running
> ~1250 applications were in running state
> 35K applications were in pending state
> *Test Steps:*
> 1. Submit applications from 5 clients, each client with 2 threads, across a
> total of 10 queues
> 2. Once application submission increases (each distributed-shell application
> will call getClusterNodes)
> *ClientRMService#getClusterNodes calls
> ClusterNodeTracker#getNodeReport, where the nodes map is locked.*
> {quote}
> "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
> {quote}
> *Instead, we can make nodes a ConcurrentHashMap and remove the read lock.*



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission

2019-08-12 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905067#comment-16905067
 ] 

Bibin A Chundatt commented on YARN-9738:


Did offline testing with sample code.

With 10K nodes and concurrent getNodeReport for all nodes, the time taken is
~28 secs vs ~88 ms when a *ConcurrentHashMap* is used.
[~BilwaST], it's safe to remove the read lock and change
ClusterNodeTracker#nodes to a ConcurrentHashMap.

cc: [~sunil.gov...@gmail.com]
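
For context, a rough, self-contained sketch of that kind of read comparison (not the author's actual test code; thread and key counts are made up):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Toy read benchmark: with ConcurrentHashMap the readers never queue behind a
// ReentrantReadWriteLock, which is where gaps like ~28 s vs ~88 ms come from.
public class MapReadBench {

  public static void main(String[] args) throws InterruptedException {
    final Map<Integer, String> nodes = new ConcurrentHashMap<>();
    for (int i = 0; i < 10_000; i++) {
      nodes.put(i, "node-" + i);
    }

    final int readers = 64;
    final CountDownLatch done = new CountDownLatch(readers);
    long start = System.nanoTime();
    for (int t = 0; t < readers; t++) {
      new Thread(() -> {
        for (int i = 0; i < 10_000; i++) {
          nodes.get(i);  // lock-free read
        }
        done.countDown();
      }).start();
    }
    done.await();
    System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
  }
}
{code}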

> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> ---
>
> Key: YARN-9738
> URL: https://issues.apache.org/jira/browse/YARN-9738
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
>
> *Env:*
> Server OS: Ubuntu
> No. of Cluster Nodes: 9120 NMs
> Env Mode: Secure
> *Preconditions:*
> ~9120 NMs were running
> ~1250 applications were in running state
> 35K applications were in pending state
> *Test Steps:*
> 1. Submit applications from 5 clients, each client with 2 threads, across a
> total of 10 queues
> 2. Once application submission increases (each distributed-shell application
> will call getClusterNodes)
> *ClientRMService#getClusterNodes calls
> ClusterNodeTracker#getNodeReport, where the nodes map is locked.*
> {quote}
> "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
> {quote}
> *Instead, we can make nodes a ConcurrentHashMap and remove the read lock.*



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9644) First RMContext object is always leaked during switch over

2019-07-30 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9644:
---
Priority: Blocker  (was: Critical)

> First RMContext object is always leaked during switch over
> --
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>  Labels: memory-leak
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9644-branch-3.2.001.patch, YARN-9644.001.patch, 
> YARN-9644.002.patch, YARN-9644.003.patch
>
>
> As per my understanding, the following 2 issues cause the leak:
> * WebApp holds a reference to the first ApplicationMasterService instance,
> which has an RMContext with an ActiveServiceContext (holding the RMApps and
> nodes maps). WebApp remains alive for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and
> never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext gets created,
> but the above 2 issues cause the first RMContext to persist until RM shutdown.
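
As an illustration of the second point, a minimal sketch assuming RMNMInfo-style registration through Hadoop's MBeans helper (class and method names here are illustrative, not the committed patch):

{code:java}
import javax.management.ObjectName;

import org.apache.hadoop.metrics2.util.MBeans;

// Standard MBean contract: the management interface must be named
// <ClassName>MBean for the registration below to be valid.
interface RmNmInfoSketchMBean {
  int getNumLiveNodeManagers();
}

// Sketch: keep the ObjectName returned by register() so the bean can be
// unregistered on transition to standby; otherwise the MBean server pins the
// old bean (and everything it references) for the life of the JVM.
public class RmNmInfoSketch implements RmNmInfoSketchMBean {

  private ObjectName mbeanName;

  @Override
  public int getNumLiveNodeManagers() {
    return 0;  // placeholder attribute so this is a valid standard MBean
  }

  public void onTransitionToActive() {
    mbeanName = MBeans.register("ResourceManager", "RMNMInfo", this);
  }

  public void onTransitionToStandby() {
    if (mbeanName != null) {
      MBeans.unregister(mbeanName);
      mbeanName = null;
    }
  }
}
{code}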



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-07-30 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896394#comment-16896394
 ] 

Bibin A Chundatt commented on YARN-9714:


[~Tao Yang]

{quote}
ZooKeeper connection in ZKRMStateStore never be closed. 
{quote}
IIUC the ZooKeeper state store is not an active service, and the ZooKeeper
connection is shared with leader election too.

Do we really need to close the connection?

A few other issues in 3.1.1 which got fixed recently are YARN-9644 and YARN-9639.
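
Purely to illustrate the point under discussion, and assuming a store that owns a private Curator client (the real ZKRMStateStore connection may be shared with leader election, which is exactly the concern above; names and the connect string are illustrative):

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryNTimes;
import org.apache.hadoop.service.AbstractService;

// Sketch only: a service that owns its ZooKeeper client closes it on stop,
// releasing the session, its I/O threads and its watch references.
public class ZkStoreSketch extends AbstractService {

  private CuratorFramework curator;

  public ZkStoreSketch() {
    super("ZkStoreSketch");
  }

  @Override
  protected void serviceStart() throws Exception {
    curator = CuratorFrameworkFactory.newClient(
        "localhost:2181", new RetryNTimes(3, 1000));  // connect string assumed
    curator.start();
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    if (curator != null) {
      curator.close();
      curator = null;
    }
    super.serviceStop();
  }
}
{code}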

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating
> the memory dump and jstack, I found two places in the RM that may cause
> memory leaks after the RM transitioned to standby:
>  # The release-cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To solve those leaks, we should close the connection or cancel the timer when
> services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-07-30 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9642:
---
Labels: memory-leak  (was: )

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast
> switch over too.
> AbstractYarnScheduler should make sure the scheduled Timer is cancelled on
> serviceStop.
> This causes a memory leak too.
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-07-30 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896293#comment-16896293
 ] 

Bibin A Chundatt commented on YARN-9642:


Thank you [~Tao Yang] for the review. Updated the patch to set a proper timer thread name.

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast
> switch over too.
> AbstractYarnScheduler should make sure the scheduled Timer is cancelled on
> serviceStop.
> This causes a memory leak too.
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-07-30 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9642:
---
Attachment: YARN-9642.003.patch

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast
> switch over too.
> AbstractYarnScheduler should make sure the scheduled Timer is cancelled on
> serviceStop.
> This causes a memory leak too.
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-07-30 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9642:
---
Priority: Blocker  (was: Critical)

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast
> switch over too.
> AbstractYarnScheduler should make sure the scheduled Timer is cancelled on
> serviceStop.
> This causes a memory leak too.
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9644) First RMContext object is always leaked during switch over

2019-07-30 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9644:
---
Labels: memory-leak  (was: )

> First RMContext object is always leaked during switch over
> --
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
>  Labels: memory-leak
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9644-branch-3.2.001.patch, YARN-9644.001.patch, 
> YARN-9644.002.patch, YARN-9644.003.patch
>
>
> As per my understanding, the following 2 issues cause the leak:
> * WebApp holds a reference to the first ApplicationMasterService instance,
> which has an RMContext with an ActiveServiceContext (holding the RMApps and
> nodes maps). WebApp remains alive for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and
> never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext gets created,
> but the above 2 issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-07-30 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9714:
---
Labels: memory-leak  (was: )

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating
> the memory dump and jstack, I found two places in the RM that may cause
> memory leaks after the RM transitioned to standby:
>  # The release-cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To solve those leaks, we should close the connection or cancel the timer when
> services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-07-30 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896058#comment-16896058
 ] 

Bibin A Chundatt commented on YARN-9642:


[~sunilg] / [~cheersyang]/[~Tao Yang]

Attached the updated patch.

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast
> switch over too.
> AbstractYarnScheduler should make sure the scheduled Timer is cancelled on
> serviceStop.
> This causes a memory leak too.
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-07-30 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9642:
---
Attachment: YARN-9642.002.patch

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast
> switch over too.
> AbstractYarnScheduler should make sure the scheduled Timer is cancelled on
> serviceStop.
> This causes a memory leak too.
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9714) Memory leaks after RM transitioned to standby

2019-07-30 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896008#comment-16896008
 ] 

Bibin A Chundatt edited comment on YARN-9714 at 7/30/19 11:09 AM:
--

The timer issue is handled by the following JIRA.

The timer start needs to be moved to serviceStart in the following patch:

https://issues.apache.org/jira/browse/YARN-9642


was (Author: bibinchundatt):
ReleaseAll is handled by following jira.. timer start need to be moved to 
serviceStart in following patch..

 

https://issues.apache.org/jira/browse/YARN-9642

> Memory leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-9714.001.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating
> the memory dump and jstack, I found two places in the RM that may cause
> memory leaks after the RM transitioned to standby:
>  # The release-cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To solve those leaks, we should close the connection or cancel the timer when
> services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9714) Memory leaks after RM transitioned to standby

2019-07-30 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896008#comment-16896008
 ] 

Bibin A Chundatt edited comment on YARN-9714 at 7/30/19 11:08 AM:
--

ReleaseAll is handled by the following JIRA; the timer start needs to be moved
to serviceStart in the following patch:

https://issues.apache.org/jira/browse/YARN-9642


was (Author: bibinchundatt):
https://issues.apache.org/jira/browse/YARN-9642

> Memory leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-9714.001.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating
> the memory dump and jstack, I found two places in the RM that may cause
> memory leaks after the RM transitioned to standby:
>  # The release-cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To solve those leaks, we should close the connection or cancel the timer when
> services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9714) Memory leaks after RM transitioned to standby

2019-07-30 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896008#comment-16896008
 ] 

Bibin A Chundatt commented on YARN-9714:


https://issues.apache.org/jira/browse/YARN-9642

> Memory leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-9714.001.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating
> the memory dump and jstack, I found two places in the RM that may cause
> memory leaks after the RM transitioned to standby:
>  # The release-cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To solve those leaks, we should close the connection or cancel the timer when
> services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9681) AM resource limit is incorrect for queue

2019-07-25 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt reassigned YARN-9681:
--

Assignee: ANANDA G B

> AM resource limit is incorrect for queue
> 
>
> Key: YARN-9681
> URL: https://issues.apache.org/jira/browse/YARN-9681
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1, 3.1.2
>Reporter: ANANDA G B
>Assignee: ANANDA G B
>Priority: Major
>  Labels: patch
> Attachments: After running job on queue1.png, Before running job on 
> queue1.png, YARN-9681.0001.patch
>
>
> After running a job on Queue1 of Partition1, Queue1 of
> DEFAULT_PARTITION's 'Max Application Master Resources' is calculated wrongly.
> Please find the attachments.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9690) Invalid AMRM token when distributed scheduling is enabled.

2019-07-22 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890681#comment-16890681
 ] 

Bibin A Chundatt commented on YARN-9690:


[~Babbleshack]

Looks like the AM is trying to connect to the RM. As per the configuration
mentioned in the following document,
[Reference|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html],
the AM should connect to the *AMRMProxy* in the NodeManager:

yarn.resourcemanager.scheduler.address = localhost:8049 (redirects jobs to the
Node Manager’s AMRMProxy port)

This is a client-side property in the case of a MapReduce application.
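
A hedged sketch of that client-side override in code (the address value is the one from the document above; the helper class is hypothetical, but YarnConfiguration.RM_SCHEDULER_ADDRESS resolves to yarn.resourcemanager.scheduler.address):

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Job-client sketch: point the AM at the local AMRMProxy port instead of the
// RM scheduler address, as the opportunistic-containers guide describes.
public class AmrmProxyClientConf {

  public static YarnConfiguration create() {
    YarnConfiguration conf = new YarnConfiguration();
    conf.set(YarnConfiguration.RM_SCHEDULER_ADDRESS, "localhost:8049");
    return conf;
  }
}
{code}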

> Invalid AMRM token when distributed scheduling is enabled.
> --
>
> Key: YARN-9690
> URL: https://issues.apache.org/jira/browse/YARN-9690
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-scheduling, yarn
>Affects Versions: 2.9.2, 3.1.2
> Environment: OS: Ubuntu 18.04
> JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03
>Reporter: Babble Shack
>Priority: Major
> Attachments: applicationlog, yarn-site.xml
>
>
> Applications fail to start due to an invalid AMRM token from the application
> attempt. I have tested this with 0/100% opportunistic maps and the same issue
> occurs regardless.
> {code:java}
> <configuration>
>   <property>
>     <name>yarn.nodemanager.aux-services</name>
>     <value>mapreduce_shuffle</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.address</name>
>     <value>yarn-master-0.yarn-service.yarn:8032</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address</name>
>     <value>0.0.0.0:8049</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
>     <value>10</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.distributed-scheduling.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.webapp.ui2.enable</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address</name>
>     <value>yarn-master-0.yarn-service.yarn:8031</value>
>   </property>
>   <property>
>     <name>yarn.log-aggregation-enable</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.aux-services</name>
>     <value>mapreduce_shuffle</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.resource.memory-mb</name>
>     <value>7168</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.minimum-allocation-mb</name>
>     <value>3584</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.maximum-allocation-mb</name>
>     <value>7168</value>
>   </property>
>   <property>
>     <name>yarn.app.mapreduce.am.resource.mb</name>
>     <value>7168</value>
>   </property>
>   <property>
>     <name>yarn.app.mapreduce.am.command-opts</name>
>     <value>-Xmx5734m</value>
>   </property>
>   <property>
>     <name>yarn.timeline-service.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.timeline-service.generic-application-history.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.timeline-service.bind-host</name>
>     <value>0.0.0.0</value>
>   </property>
> </configuration>
> {code}
> Relevant logs:
> {code:java}
> 2019-07-22 14:56:37,104 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 100% of the 
> mappers will be scheduled using OPPORTUNISTIC containers
> 2019-07-22 14:56:37,117 INFO [main] org.apache.hadoop.yarn.client.RMProxy: 
> Connecting to ResourceManager at 
> yarn-master-0.yarn-service.yarn/10.244.1.134:8030
> 2019-07-22 14:56:37,150 WARN [main] org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  Invalid AMRMToken from appattempt_1563805140414_0002_02
> 2019-07-22 14:56:37,152 ERROR [main] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while 
> registering
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid 
> AMRMToken from appattempt_1563805140414_0002_02
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
>     at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> 

[jira] [Updated] (YARN-9645) Fix Invalid event FINISHED_CONTAINERS_PULLED_BY_AM at NEW on NM restart

2019-07-10 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9645:
---
Summary: Fix Invalid event FINISHED_CONTAINERS_PULLED_BY_AM at NEW on NM 
restart  (was: Restaring NM's throwing Invalid event: 
FINISHED_CONTAINERS_PULLED_BY_AM at NEW)

> Fix Invalid event FINISHED_CONTAINERS_PULLED_BY_AM at NEW on NM restart
> ---
>
> Key: YARN-9645
> URL: https://issues.apache.org/jira/browse/YARN-9645
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9645-001.patch, YARN-9645-002.patch
>
>
> *Description:* While restarting NMs, they throw
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event:
> FINISHED_CONTAINERS_PULLED_BY_AM at NEW
> *Environment:*
> Server OS: Ubuntu
> No. of Cluster Nodes: 2 RMs / 4850 NMs
> Total 240 machines, each machine running 21 Docker containers (1 DN & 20 NMs)
> *Steps:*
> 1. Total number of containers in running state: ~53000
> 2. Restart the NMs and check the log
> {noformat}
> 2019-06-24 09:37:35,345 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application 
> with id 32744 submitted by user root
> 2019-06-24 09:37:35,346 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root 
> IP=255.255.19.245   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1561358926330_32744 
>   QUEUENAME=default
> 2019-06-24 09:37:35,345 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle 
> this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> FINISHED_CONTAINERS_PULLED_BY_AM at NEW
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:669)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:99)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1091)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:221)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:143)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
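
The usual shape of such a fix is to register the stray event as a no-op arc so the state machine ignores it instead of throwing. A self-contained toy using YARN's StateMachineFactory (the enums and class are illustrative, not the actual RMNodeImpl change):

{code:java}
import org.apache.hadoop.yarn.state.StateMachine;
import org.apache.hadoop.yarn.state.StateMachineFactory;

// Toy FSM: the NEW -> NEW arc swallows FINISHED_CONTAINERS_PULLED_BY_AM so a
// restarted node in NEW no longer triggers InvalidStateTransitionException.
public class NodeFsmSketch {

  enum S { NEW, RUNNING }

  enum E { STARTED, FINISHED_CONTAINERS_PULLED_BY_AM }

  static class Ev {
    final E type;
    Ev(E type) { this.type = type; }
  }

  private static final StateMachineFactory<NodeFsmSketch, S, E, Ev> FACTORY =
      new StateMachineFactory<NodeFsmSketch, S, E, Ev>(S.NEW)
          // the previously missing arc: ignore the event while still NEW
          .addTransition(S.NEW, S.NEW, E.FINISHED_CONTAINERS_PULLED_BY_AM)
          .addTransition(S.NEW, S.RUNNING, E.STARTED)
          .installTopology();

  private final StateMachine<S, E, Ev> sm = FACTORY.make(this);

  public void handle(Ev event) {
    sm.doTransition(event.type, event);
  }
}
{code}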



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9663) ApplicationID may be duplicated in YARN Federation

2019-07-03 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878313#comment-16878313
 ] 

Bibin A Chundatt commented on YARN-9663:


The ApplicationId change could be an incompatible change, right? Also, it's not
required to expose the backend details to the client.
I think the other state store implementations should do a similar fix as the
SQL one.


> ApplicationID may be duplicated in YARN Federation
> --
>
> Key: YARN-9663
> URL: https://issues.apache.org/jira/browse/YARN-9663
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation, yarn
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
>
> ApplicationId represents the globally unique identifier for an application.
> The globally unique nature of the identifier is achieved by using the cluster 
> timestamp, i.e. the start time of the ResourceManager, along with a 
> monotonically increasing counter for the application.
> But in YARN Federation, the applicationId will be duplicated if the 
> timestamps of two subClusters are the same.
> Shall we add the clusterId to the applicationId, like 
> application_clusterId_timestamp_xxx1?
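
To make the proposal concrete, a minimal sketch of a sub-cluster-qualified id 
(the format and method below are assumptions for illustration, not an agreed 
design):
{code:java}
// Sketch only: qualify the application id with the sub-cluster id so two
// RMs that happen to share a start timestamp cannot mint colliding ids,
// e.g. application_sc1_1561358926330_32744.
static String toFederatedAppId(String subClusterId, long clusterTimestamp,
    int id) {
  return String.format("application_%s_%d_%04d",
      subClusterId, clusterTimestamp, id);
}
{code}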



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9663) ApplicationID may be duplicated in YARN Federation

2019-07-03 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878005#comment-16878005
 ] 

Bibin A Chundatt commented on YARN-9663:


Duplicate of YARN-9528 ??

> ApplicationID may be duplicated in YARN Federation
> --
>
> Key: YARN-9663
> URL: https://issues.apache.org/jira/browse/YARN-9663
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation, yarn
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
>
> ApplicationId represents the globally unique identifier for an application.
> The globally unique nature of the identifier is achieved by using the cluster 
> timestamp, i.e. the start time of the ResourceManager, along with a 
> monotonically increasing counter for the application.
> But in YARN Federation, the applicationId will be duplicated if the 
> timestamps of two subClusters are the same.
> Shall we add the clusterId to the applicationId, like 
> application_clusterId_timestamp_xxx1?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9644) First RMContext object is always leaked during switch over

2019-07-03 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878001#comment-16878001
 ] 

Bibin A Chundatt commented on YARN-9644:


Thank you [~sunilg] for the review and commit. Attached a patch for branch-3.2.

> First RMContext object is always leaked during switch over
> --
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644-branch-3.2.001.patch, YARN-9644.001.patch, 
> YARN-9644.002.patch, YARN-9644.003.patch
>
>
> As per my understanding, the following two issues cause the leak.
> * WebApp holds a reference to the first ApplicationMasterService instance, 
> which has an RMContext with an ActiveServiceContext (holding the RMApps and 
> nodes maps). WebApp remains alive for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> is never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext is created, but 
> the above two issues cause the first RMContext to persist until RM shutdown.
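
For the second point, a hedged sketch of the MBean side of the cleanup (the 
ObjectName follows the usual Hadoop naming convention but is an assumption 
here, as is the method shape; this is not the attached patch):
{code:java}
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch only: unregister the RMNMInfo MBean when leaving active state so
// the old RMContext it references becomes garbage-collectable.
void unregisterRMNMInfo() throws Exception {
  MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
  ObjectName name =
      new ObjectName("Hadoop:service=ResourceManager,name=RMNMInfo");
  if (mbs.isRegistered(name)) {
    mbs.unregisterMBean(name);
  }
}
{code}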



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9644) First RMContext object is always leaked during switch over

2019-07-03 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9644:
---
Attachment: YARN-9644-branch-3.2.001.patch

> First RMContext object is always leaked during switch over
> --
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644-branch-3.2.001.patch, YARN-9644.001.patch, 
> YARN-9644.002.patch, YARN-9644.003.patch
>
>
> As per my understanding, the following two issues cause the leak.
> * WebApp holds a reference to the first ApplicationMasterService instance, 
> which has an RMContext with an ActiveServiceContext (holding the RMApps and 
> nodes maps). WebApp remains alive for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> is never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext is created, but 
> the above two issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9657) AbstractLivelinessMonitor add serviceName to PingChecker thread

2019-07-01 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-9657:
--

 Summary: AbstractLivelinessMonitor add serviceName to PingChecker 
thread
 Key: YARN-9657
 URL: https://issues.apache.org/jira/browse/YARN-9657
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin A Chundatt






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9645) Restarting NMs throwing Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW

2019-06-26 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873797#comment-16873797
 ] 

Bibin A Chundatt commented on YARN-9645:


Thank you [~BilwaST] for the updated patch.

+1 LGTM for YARN-9645-002.patch. I will wait for a day before committing.

> Restarting NMs throwing Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
> --
>
> Key: YARN-9645
> URL: https://issues.apache.org/jira/browse/YARN-9645
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9645-001.patch, YARN-9645-002.patch
>
>
> *Description:* While restarting NMs, the RM throws 
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> FINISHED_CONTAINERS_PULLED_BY_AM at NEW
> *Environment:*
> Server OS: Ubuntu
> Cluster nodes: 2 RMs / 4850 NMs
> Total 240 machines; each machine runs 21 Docker containers (1 DN and 20 NMs)
> *Steps:*
> 1. Total number of containers in running state: ~53000
> 2. Restart the NMs and check the log
> {noformat}
> 2019-06-24 09:37:35,345 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application 
> with id 32744 submitted by user root
> 2019-06-24 09:37:35,346 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root 
> IP=255.255.19.245   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1561358926330_32744 
>   QUEUENAME=default
> 2019-06-24 09:37:35,345 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle 
> this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> FINISHED_CONTAINERS_PULLED_BY_AM at NEW
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:669)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:99)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1091)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:221)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:143)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9650) Set thread names for CapacityScheduler AsyncScheduleThread

2019-06-26 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-9650:
--

 Summary: Set thread names for CapacityScheduler AsyncScheduleThread
 Key: YARN-9650
 URL: https://issues.apache.org/jira/browse/YARN-9650
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin A Chundatt






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9645) Restarting NMs throwing Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW

2019-06-26 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872965#comment-16872965
 ] 

Bibin A Chundatt commented on YARN-9645:


Thank you [~BilwaST] for working on this.

# Fix the checkstyle and whitespace issues.
# Could you check why the CapacityScheduler test cases are failing?

> Restarting NMs throwing Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
> --
>
> Key: YARN-9645
> URL: https://issues.apache.org/jira/browse/YARN-9645
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9645-001.patch
>
>
> *Description:* While restarting NMs, the RM throws 
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> FINISHED_CONTAINERS_PULLED_BY_AM at NEW
> *Environment:*
> Server OS: Ubuntu
> Cluster nodes: 2 RMs / 4850 NMs
> Total 240 machines; each machine runs 21 Docker containers (1 DN and 20 NMs)
> *Steps:*
> 1. Total number of containers in running state: ~53000
> 2. Restart the NMs and check the log
> {noformat}
> 2019-06-24 09:37:35,345 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application 
> with id 32744 submitted by user root
> 2019-06-24 09:37:35,346 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root 
> IP=255.255.19.245   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1561358926330_32744 
>   QUEUENAME=default
> 2019-06-24 09:37:35,345 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle 
> this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> FINISHED_CONTAINERS_PULLED_BY_AM at NEW
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:669)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:99)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1091)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:221)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:143)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9639) DecommissioningNodesWatcher causes memory leak

2019-06-24 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871504#comment-16871504
 ] 

Bibin A Chundatt commented on YARN-9639:


[~sunilg], any other comments? Can I go ahead and commit?

> DecommissioningNodesWatcher causes memory leak
> -
>
> Key: YARN-9639
> URL: https://issues.apache.org/jira/browse/YARN-9639
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-9639-001.patch
>
>
> The missing cancel() of the Timer task in DecommissioningNodesWatcher could 
> lead to a memory leak.
> PollTimerTask holds a reference to the RMContext.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9639) DecommissioningNodesWatcher causes memory leak

2019-06-23 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870774#comment-16870774
 ] 

Bibin A Chundatt commented on YARN-9639:


[~sunilg], I think assigning null is required. 
{{FileSystemTimelineWriter}} and {{MetricsSystemImpl}} do the same.
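
Putting the cancel() and the null assignment together, a minimal sketch of the 
shutdown path under discussion (field and method names are assumptions, not 
the attached patch):
{code:java}
// Sketch only: stop the poll timer and drop the RMContext reference so
// PollTimerTask can no longer pin the old context.
void stopWatcher() {
  if (pollTimer != null) {
    pollTimer.cancel();   // prevents PollTimerTask from firing again
    pollTimer = null;
  }
  rmContext = null;       // release the reference held by the watcher
}
{code}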


> DecommissioningNodesWatcher causes memory leak
> -
>
> Key: YARN-9639
> URL: https://issues.apache.org/jira/browse/YARN-9639
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-9639-001.patch
>
>
> The missing cancel() of the Timer task in DecommissioningNodesWatcher could 
> lead to a memory leak.
> PollTimerTask holds a reference to the RMContext.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9640) Slow event processing could cause too many attempt unregister events

2019-06-23 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870768#comment-16870768
 ] 

Bibin A Chundatt commented on YARN-9640:


[~tangzhankun]

IMHO the server-side fix is mandatory, since finishApplicationMaster is a 
public interface and {{AMRMClientImpl}} is only one implementation of it.

As additional handling we could make the client-side retry interval 
configurable, but then the load depends entirely on the client configuring 
that property.
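
As a hedged illustration of that alternative (the property name below is 
hypothetical, invented only for illustration; the loop itself is the existing 
AMRMClientImpl code quoted in the description):
{code:java}
// Sketch only: make the unregister polling interval configurable instead
// of the hard-coded 100ms sleep. The property name is hypothetical.
long waitMs = conf.getLong(
    "yarn.client.am-unregister.retry-interval-ms", 100);
while (true) {
  FinishApplicationMasterResponse response =
      rmClient.finishApplicationMaster(request);
  if (response.getIsUnregistered()) {
    break;
  }
  LOG.info("Waiting for application to be successfully unregistered.");
  Thread.sleep(waitMs);
}
{code}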

> Slow event processing could cause too many attempt unregister events
> 
>
> Key: YARN-9640
> URL: https://issues.apache.org/jira/browse/YARN-9640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
>  Labels: scalability
> Attachments: YARN-9640.001.patch, YARN-9640.002.patch, 
> YARN-9640.003.patch
>
>
> During verification on one of our test clusters we found that the number of 
> attempt unregister events was about 300k+.
>  # All of the AM's containers completed.
>  # AMRMClientImpl sends finishApplicationMaster.
>  # AMRMClient checks the finish status every 100ms using a 
> finishApplicationMaster request.
>  # AMRMClientImpl#unregisterApplicationMaster
> {code:java}
>   while (true) {
> FinishApplicationMasterResponse response =
> rmClient.finishApplicationMaster(request);
> if (response.getIsUnregistered()) {
>   break;
> }
> LOG.info("Waiting for application to be successfully unregistered.");
> Thread.sleep(100);
>   }
> {code}
>  # The ApplicationMasterService finishApplicationMaster interface sends an 
> unregister event on every status update.
> We should send the unregister event only once and cache that it was sent, 
> ignoring later requests and returning a not-unregistered response to the AM 
> instead of overloading the event queue.
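
A hedged sketch of the once-only dispatch described above (names are 
assumptions; this is not the committed change):
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: remember which attempts already had an unregister event
// dispatched, so the AM's 100ms polling cannot flood the event queue.
private final Set<ApplicationAttemptId> unregistering =
    ConcurrentHashMap.newKeySet();

void maybeDispatchUnregister(ApplicationAttemptId attemptId) {
  if (unregistering.add(attemptId)) {   // true only for the first poll
    rmContext.getDispatcher().getEventHandler().handle(
        new RMAppAttemptEvent(attemptId, RMAppAttemptEventType.UNREGISTERED));
  }
  // later polls fall through; the AM keeps receiving isUnregistered=false
  // until the attempt actually reaches its final state
}
{code}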



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9627) DelegationTokenRenewer could block transitionToStandby

2019-06-23 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9627:
---
Attachment: YARN-9627.003.patch

> DelegationTokenRenewer could block transitionToStandby
> -
>
> Key: YARN-9627
> URL: https://issues.apache.org/jira/browse/YARN-9627
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9627.001.patch, YARN-9627.002.patch, 
> YARN-9627.003.patch
>
>
> Cluster size: 5K
> Running containers: 55K
> *Scenario*: Large number of pending applications (around 50K) while 
> performing an RM switchover
> Exception below:
> {noformat}
> 2019-06-13 17:39:27,594 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renew Kind: HDFS_DELEGATION_TOKEN, Service: X:1616, Ident: (token 
> for root: HDFS_DELEGATION_TOKEN owner=root/had...@hadoop.com, renewer=yarn, 
> realUser=, issueDate=1560361265181, maxDate=1560966065181, 
> sequenceNumber=104708, masterKeyId=3);exp=1560533965360; 
> apps=[application_1560346941775_20702] in 86397766 ms, appId = 
> [application_1560346941775_20702]
> 2019-06-13 17:39:27,609 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:522)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>  
> 2019-06-13 17:58:20,878 ERROR org.apache.zookeeper.ClientCnxn: Time out error 
> occurred for the packet 'clientPath:null serverPath:null finished:false 
> header:: 27,4  replyHeader:: 27,4295687588,0  request:: 
> '/rmstore1/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot/DelegationKey_49,F
>   response:: 
> #31ff8a16b74ffe129768ffdbffe949ff8dffd517ffcafffa,s{4295423577,4295423577,1560342837789,1560342837789,0,0,0,0,17,0,4295423577}
>  '.
> 2019-06-13 17:58:20,877 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: 
> X:1616, Ident: (token for root: HDFS_DELEGATION_TOKEN 
> owner=root/had...@hadoop.com, renewer=yarn, realUser=, 
> issueDate=1560366110990, maxDate=1560970910990, sequenceNumber=111891, 
> masterKeyId=3);exp=1560534896413; apps=[application_1560346941775_28115]]
> 2019-06-13 17:58:20,924 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.IllegalStateException: Timer already cancelled.
> at java.util.Timer.sched(Timer.java:397)
> at java.util.Timer.schedule(Timer.java:208)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.setTimerForTokenRenewal(DelegationTokenRenewer.java:612)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:523)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748
> {noformat}
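
A minimal sketch of guarding the renewal scheduling against the "Timer already 
cancelled" race seen above (the flag and method signature are assumptions, not 
the attached patch):
{code:java}
import java.util.Timer;
import java.util.TimerTask;

// Sketch only: once the renewer is stopped during transitionToStandby,
// skip further scheduling instead of letting Timer.schedule() throw
// IllegalStateException.
private final Timer renewalTimer = new Timer(true);
private volatile boolean renewerStopped = false;   // set on serviceStop()

synchronized void scheduleRenewal(TimerTask task, long renewInMs) {
  if (renewerStopped) {
    return;   // renewer shutting down; skip scheduling
  }
  renewalTimer.schedule(task, Math.max(renewInMs, 0));
}
{code}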



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, 

[jira] [Commented] (YARN-9639) DecommissioningNodesWatcher causes memory leak

2019-06-23 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870560#comment-16870560
 ] 

Bibin A Chundatt commented on YARN-9639:


Thank you [~BilwaST] for uploading the patch.

Looks good to me. Will wait for a day before getting this in.

> DecommissioningNodesWatcher causes memory leak
> -
>
> Key: YARN-9639
> URL: https://issues.apache.org/jira/browse/YARN-9639
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-9639-001.patch
>
>
> The missing cancel() of the Timer task in DecommissioningNodesWatcher could 
> lead to a memory leak.
> PollTimerTask holds a reference to the RMContext.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9644) First RMContext always leaked during switch over

2019-06-23 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9644:
---
Attachment: YARN-9644.003.patch

> First RMContext always leaked during switch over
> 
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644.001.patch, YARN-9644.002.patch, 
> YARN-9644.003.patch
>
>
> As per my understanding, the following two issues cause the leak.
> * WebApp holds a reference to the first ApplicationMasterService instance, 
> which has an RMContext with an ActiveServiceContext (holding the RMApps and 
> nodes maps). WebApp remains alive for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> is never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext is created, but 
> the above two issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-5867) DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir

2019-06-23 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt reassigned YARN-5867:
--

Assignee: (was: Bibin A Chundatt)

> DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir
> ---
>
> Key: YARN-5867
> URL: https://issues.apache.org/jira/browse/YARN-5867
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Priority: Major
>
> Steps to reproduce
> ===
> # Set the umask to 077 for the user
> # Start the NodeManager with the NM local dir configured
> The NM local dir permission is *755*, set in 
> {{LocalDirsHandlerService#serviceInit}}:
> {code} 
> FsPermission perm = new FsPermission((short)0755);
> boolean createSucceeded = localDirs.createNonExistentDirs(localFs, perm);
> createSucceeded &= logDirs.createNonExistentDirs(localFs, perm);
> {code}
> # After startup, delete the NM local dir and wait for {{MonitoringTimerTask}} 
> to run (simulated here by the delete)
> # Now check the permission of the {{NM local dir}}: it will be *700*
> *Root Cause*
> {{DirectoryCollection#testDirs}} checks as follows:
> {code}
> // create a random dir to make sure fs isn't in read-only mode
> verifyDirUsingMkdir(testDir);
> {code}
> which causes a new random directory to be created in {{localdir}} via 
> {{DiskChecker.checkDir(dir)}} -> {{!mkdirsWithExistsCheck(dir)}}, causing the 
> NM local dir to be created with the wrong permission, *700*.
> A few applications then fail container launch due to permission denied.
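
The umask interaction is easy to reproduce in isolation; a hedged sketch (the 
path and method are placeholders) of re-applying the intended mode after the 
raw mkdir, which is one way to make the result independent of the process 
umask:
{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.attribute.PosixFilePermissions;

// Sketch only: with umask 077, File.mkdirs() creates the directory with
// mode 0700 (0777 & ~077). Explicitly re-applying 0755 afterwards keeps
// container launches from failing with permission denied.
static void ensureLocalDir(File localDir) throws IOException {
  if (localDir.mkdirs() || localDir.isDirectory()) {
    Files.setPosixFilePermissions(localDir.toPath(),
        PosixFilePermissions.fromString("rwxr-xr-x"));
  }
}
{code}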



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9644) First RMContext always leaked during switch over

2019-06-23 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870476#comment-16870476
 ] 

Bibin A Chundatt commented on YARN-9644:


cc: [~sunilg] / [~cheersyang] / [~wangda] Could you please review the patch?

> First RMContext always leaked during switch over
> 
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644.001.patch, YARN-9644.002.patch
>
>
> As per my understanding, the following two issues cause the leak.
> * WebApp holds a reference to the first ApplicationMasterService instance, 
> which has an RMContext with an ActiveServiceContext (holding the RMApps and 
> nodes maps). WebApp remains alive for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> is never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext is created, but 
> the above two issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9644) First RMContext always leaked during switch over

2019-06-23 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9644:
---
   Priority: Critical  (was: Major)
Description: 
As per my understanding, the following two issues cause the leak.

* WebApp holds a reference to the first ApplicationMasterService instance, which 
has an RMContext with an ActiveServiceContext (holding the RMApps and nodes 
maps). WebApp remains alive for the lifetime of the RM process.
* On transition to active, the RMNMInfo object is registered as an MBean and is 
never unregistered on transitionToStandby.

On transition to standby and back to active, a new RMContext is created, but 
the above two issues cause the first RMContext to persist until RM shutdown.



  was:
On transition to active, the RMNMInfo object is registered as an MBean and 
never unregistered on transitionToStandby.

This keeps an RMContext reference alive, since it is never unregistered.

Summary: First RMContext always leaked during switch over  (was: 
RMNMInfo holds one RMContext causes memory leak)

> First RMContext always leaked during switch over
> 
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644.001.patch, YARN-9644.002.patch
>
>
> As per my understanding, the following two issues cause the leak.
> * WebApp holds a reference to the first ApplicationMasterService instance, 
> which has an RMContext with an ActiveServiceContext (holding the RMApps and 
> nodes maps). WebApp remains alive for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> is never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext is created, but 
> the above two issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9644) First RMContext always leaked during switch over

2019-06-23 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9644:
---
Attachment: YARN-9644.002.patch

> First RMContext always leaked during switch over
> 
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644.001.patch, YARN-9644.002.patch
>
>
> As per my understanding, the following two issues cause the leak.
> * WebApp holds a reference to the first ApplicationMasterService instance, 
> which has an RMContext with an ActiveServiceContext (holding the RMApps and 
> nodes maps). WebApp remains alive for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> is never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext is created, but 
> the above two issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


