[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update

2017-07-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-5197:
--
Priority: Critical  (was: Major)

> RM leaks containers if running container disappears from node update
> 
>
> Key: YARN-5197
> URL: https://issues.apache.org/jira/browse/YARN-5197
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.2, 2.6.4
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Fix For: 2.8.0, 2.6.5, 2.7.4
>
> Attachments: YARN-5197.001.patch, YARN-5197.002.patch, 
> YARN-5197.003.patch, YARN-5197-branch-2.7.003.patch, 
> YARN-5197-branch-2.8.003.patch
>
>
> Once a node reports a container running in a status update, the corresponding 
> RMNodeImpl will track the container in its launchedContainers map.  If the 
> node somehow misses sending the completed container status to the RM and the 
> container simply disappears from subsequent heartbeats, the container will 
> leak in launchedContainers forever and the container completion event will 
> not be sent to the scheduler.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update

2016-12-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-5197:
-
Fix Version/s: 2.8.0







[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update

2016-06-20 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-5197:
-
Attachment: YARN-5197-branch-2.7.003.patch
            YARN-5197-branch-2.8.003.patch

Thanks for the review and commit, Rohith!  Here are patches for branch-2.8 and 
branch-2.7.  I believe the 2.7 patch will work on 2.6 as well.








[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update

2016-06-10 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-5197:
-
Attachment: YARN-5197.003.patch

Thanks for the review, Rohith!  I updated the patch to add the GUARANTEED check 
in findLostContainers.
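
For readers unfamiliar with the check, here is a rough sketch of what filtering on GUARANTEED could look like.  It is an illustration under assumptions, not the attached patch: the Status class, the enums, and the method signature are hypothetical stand-ins, and it assumes the tracked set only ever holds GUARANTEED containers, so a running OPPORTUNISTIC container must not be counted as a match.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of an execution-type filter; not the code from the patch. */
final class GuaranteedCheckSketch {
  enum ExecutionType { GUARANTEED, OPPORTUNISTIC }
  enum ContainerState { RUNNING, COMPLETE }

  /** Minimal stand-in for a container status carried in a heartbeat. */
  static final class Status {
    final String id;
    final ContainerState state;
    final ExecutionType type;

    Status(String id, ContainerState state, ExecutionType type) {
      this.id = id;
      this.state = state;
      this.type = type;
    }
  }

  /**
   * Returns the tracked container IDs that the node did not report as
   * running.  Assumption: 'launched' holds only GUARANTEED containers,
   * so only running GUARANTEED statuses are considered when matching.
   */
  static List<String> findLostContainers(Set<String> launched,
                                         List<Status> reported) {
    Set<String> runningGuaranteed = new HashSet<>();
    for (Status s : reported) {
      if (s.state == ContainerState.RUNNING
          && s.type == ExecutionType.GUARANTEED) {
        runningGuaranteed.add(s.id);
      }
    }
    List<String> lost = new ArrayList<>();
    for (String id : launched) {
      if (!runningGuaranteed.contains(id)) {
        lost.add(id);
      }
    }
    return lost;
  }
}
```

Under that assumption, a heartbeat that lists one tracked container plus one OPPORTUNISTIC container still reveals a second tracked container as lost, which a plain count of reported containers would hide.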








[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update

2016-06-06 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-5197:
-
Attachment: YARN-5197.002.patch

Updated the patch for the checkstyle issue.  The test failures are tracked by 
HADOOP-12687.







[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update

2016-06-02 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-5197:
-
Attachment: YARN-5197.001.patch

RMNodeImpl checks the list of running containers on the node against 
launchedContainers but not vice-versa, so containers that disappear on the node 
are not detected.  Here's a patch that detects when the RM thinks there are 
more containers running on the node than were reported and finds the containers 
that are lost.  Each lost container generates a corresponding aborted 
completion event for the scheduler.  The search for lost containers is only 
performed when that count mismatch shows at least one is missing, so it's low 
cost for the normal case.
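
The detection described above can be sketched roughly as follows.  This is a simplified illustration, not the attached patch: the class name, the use of plain String IDs, and the processHeartbeat method are hypothetical, and a real caller would turn each returned ID into an aborted completion event for the scheduler.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the RM-side bookkeeping of one node. */
final class LostContainerSketch {
  /** Containers the RM believes are running (the launchedContainers role). */
  private final Set<String> launchedContainers = new LinkedHashSet<>();

  /**
   * Processes the set of container IDs a node reports as running and
   * returns the IDs that were tracked but are no longer reported.
   */
  List<String> processHeartbeat(Set<String> reportedRunning) {
    // Forward direction: start tracking anything the node reports.
    launchedContainers.addAll(reportedRunning);

    // Cheap guard for the normal case.  After the addAll above, the
    // tracked set contains every reported ID, so equal sizes mean
    // nothing is missing and the search can be skipped.
    if (launchedContainers.size() <= reportedRunning.size()) {
      return Collections.emptyList();
    }

    // Reverse direction: anything tracked but not reported is lost.
    List<String> lost = new ArrayList<>();
    Iterator<String> it = launchedContainers.iterator();
    while (it.hasNext()) {
      String id = it.next();
      if (!reportedRunning.contains(id)) {
        lost.add(id);
        it.remove(); // stop tracking so the entry does not leak
      }
    }
    return lost;
  }

  int trackedCount() {
    return launchedContainers.size();
  }
}
```

The size comparison is what keeps the common path cheap: the full scan only runs on a heartbeat where the RM is tracking more containers than the node reported.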

I updated MockNM as part of this patch since lots of tests were getting away 
with lazy mocking of a real NM.  They were only specifying container state 
deltas in the heartbeat and sending empty heartbeats in-between those state 
changes.  With this patch, the RM interprets those empty heartbeats as a loss 
of all actively running containers, which broke those tests.  The patch 
therefore also updates MockNM to track containers and continue reporting them 
until they have been marked completed, just like a real node should.  That was 
simpler to do than updating all the users of MockNM to maintain their list of 
active container statuses explicitly.
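
As a rough illustration of that tracking behavior, here is a minimal sketch of a mock node that remembers container states between heartbeats.  It is not the MockNM code from the patch: the class, its methods, and the "id:STATE" string format are all hypothetical, chosen only to show the idea.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical mock node that keeps reporting containers until they complete. */
final class TrackingMockNode {
  enum State { RUNNING, COMPLETE }

  /** Last known state of every container that has not yet been dropped. */
  private final Map<String, State> containers = new LinkedHashMap<>();

  /** Records a state change; tests only need to supply these deltas. */
  void setContainerState(String containerId, State state) {
    containers.put(containerId, state);
  }

  /**
   * Builds the status list for one heartbeat.  Running containers are
   * reported on every heartbeat; a completed container is reported once
   * and then forgotten.
   */
  List<String> heartbeat() {
    List<String> statuses = new ArrayList<>();
    Iterator<Map.Entry<String, State>> it = containers.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, State> entry = it.next();
      statuses.add(entry.getKey() + ":" + entry.getValue());
      if (entry.getValue() == State.COMPLETE) {
        it.remove();
      }
    }
    return statuses;
  }
}
```

With this shape, a test that sends a heartbeat without specifying any new state still reports its running containers, so the RM has no reason to treat them as lost.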



