[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14719292#comment-14719292 ]

Junping Du commented on YARN-3933:
--

Hi [~guoshiwei], we should just update the description and title of this JIRA instead of creating a new one. No worries. I will mark YARN-4089 as a duplicate of this JIRA and assign this JIRA to you, given that you would like to work on it and already have a patch to fix it.

> Resources(both core and memory) are being negative
> --
>
> Key: YARN-3933
> URL: https://issues.apache.org/jira/browse/YARN-3933
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.5.2
> Reporter: Lavkesh Lahngir
> Assignee: Lavkesh Lahngir
> Labels: patch
> Attachments: patch.BUGFIX-JIRA-YARN-3933.txt
>
>
> In our cluster we are seeing available memory and cores being negative.
> Initial inspection:
> Scenario no. 1:
> In the capacity scheduler, the method allocateContainersToNode() checks whether
> there are excess reservations of containers for an application, and if they are
> no longer needed it calls queue.completedContainer(), which causes
> resources to become negative, even though they were never assigned in the first place.
> I am still looking through the code. Can somebody suggest how to simulate
> excess container assignments?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716506#comment-14716506 ]

Shiwei Guo commented on YARN-3933:
--

I created a new issue, [YARN-4089|https://issues.apache.org/jira/browse/YARN-4089], to describe the race condition bug in FairScheduler. I'm a newbie to the Hadoop community; I hope I didn't do anything wrong. Thanks.
[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711132#comment-14711132 ]

Shiwei Guo commented on YARN-3933:
--

I think so, and so is [YARN-4045|https://issues.apache.org/jira/browse/YARN-4045]. The negative value in the root queue is caused by calls to updateRootQueueMetrics on the same containerId. Our cluster has the capacity to run 13000+ containers, but the web UI says:
- Containers Running: -26546
- Memory Used: -82.38 TB
- VCores Used: -26451

Luckily, it hasn't affected scheduling yet.
[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711056#comment-14711056 ]

Lavkesh Lahngir commented on YARN-3933:
---

Is it related to this? https://issues.apache.org/jira/browse/YARN-4067
[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711051#comment-14711051 ]

Shiwei Guo commented on YARN-3933:
--

So should I open a new issue instead?
[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709572#comment-14709572 ]

Hadoop QA commented on YARN-3933:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | patch | 0m 1s | The patch file was not named according to hadoop's naming conventions. Please see https://wiki.apache.org/hadoop/HowToContribute for instructions. |
| {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12752004/patch.BUGFIX-JIRA-YARN-3933.txt |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / feaf034 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8896/console |

This message was automatically generated.
[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709251#comment-14709251 ]

Junping Du commented on YARN-3933:
--

I think the title here is a bit misleading. Available resources being negative shouldn't be a problem in itself (e.g. when the NM resource configuration feature, YARN-291, is enabled); it means resources are over-committed, although we shouldn't see it in most cases. This is actually a race condition bug in FairScheduler. Please mention that explicitly, or developers and users may get the impression that resources should never be negative, which is an assumption we have never made.
[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709146#comment-14709146 ]

Shiwei Guo commented on YARN-3933:
--

We are also seeing this problem, and it may cause the RM to never allocate resources for a queue whose used resources are negative. I did some research and found that this is mainly caused by a race condition when calling AbstractYarnScheduler.completedContainer. Let's take FairScheduler as an example:

{code:title=FairScheduler.java}
protected synchronized void completedContainer(RMContainer rmContainer,
    ContainerStatus containerStatus, RMContainerEventType event) {
  if (rmContainer == null) {
    LOG.info("Null container completed...");
    return;
  }

  Container container = rmContainer.getContainer();

  // Get the application for the finished container
  FSAppAttempt application =
      getCurrentAttemptForContainer(container.getId());
  ApplicationId appId =
      container.getId().getApplicationAttemptId().getApplicationId();
  if (application == null) {
    LOG.info("Container " + container + " of" +
        " unknown application attempt " + appId +
        " completed with event " + event);
    return;
  }

  if (!application.getLiveContainersMap().containsKey(container.getId())) {
    LOG.info("Container " + container + " of application attempt " + appId
        + " is not alive, skip do completedContainer operation on event "
        + event);
    return;
  }

  // Get the node on which the container was allocated
  FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());

  if (rmContainer.getState() == RMContainerState.RESERVED) {
    application.unreserve(rmContainer.getReservedPriority(), node);
  } else {
    application.containerCompleted(rmContainer, containerStatus, event);
    node.releaseContainer(container);
    updateRootQueueMetrics();
  }

  LOG.info("Application attempt " + application.getApplicationAttemptId()
      + " released container " + container.getId() + " on node: " + node
      + " with event: " + event);
}
{code}

The completedContainer method calls application.containerCompleted, which subtracts the resources used by this container from the usedResource counter of the application. So, if completedContainer is called twice with the same container, too much is subtracted from the counter. The same goes for the updateRootQueueMetrics call, which is why we can see a negative allocatedMemory on the root queue.

The solution is to check whether the container being supplied is still live *inside* completedContainer (as shown in the patch). There are some checks before completedContainer is called, but they are not enough.

To go deeper, completedContainer may be called from two places:

1. Triggered by the RMContainerEventType.FINISHED event:
{code:title=FairScheduler.nodeUpdate}
// Process completed containers
for (ContainerStatus completedContainer : completedContainers) {
  ContainerId containerId = completedContainer.getContainerId();
  LOG.debug("Container FINISHED: " + containerId);
  completedContainer(getRMContainer(containerId),
      completedContainer, RMContainerEventType.FINISHED);
}
{code}

2. Triggered by RMContainerEventType.RELEASED:
{code:title=AbstractYarnScheduler.releaseContainers}
completedContainer(rmContainer,
    SchedulerUtils.createAbnormalContainerStatus(containerId,
        SchedulerUtils.RELEASED_CONTAINER),
    RMContainerEventType.RELEASED);
{code}

RMContainerEventType.RELEASED is not triggered by the MapReduce ApplicationMaster, so we won't see this problem with MR jobs. But TEZ triggers it when it no longer needs a container, while the NodeManager also reports a container-complete message to the RM, which in turn triggers the RMContainerEventType.FINISHED event. If the RMContainerEventType.FINISHED event reaches the RM earlier than the release from the TEZ AM, the problem happens.
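
The double-subtraction described above can be reproduced outside of YARN with a small stand-alone sketch. Everything in it (the class CompletionGuardSketch, the UsageTracker type, the string container id and the 2048 MB figure) is hypothetical and only mimics the accounting pattern; it is not the scheduler code or the attached patch. It delivers two completion events for the same container, once without and once with a liveness check performed inside the synchronized completion method.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for scheduler accounting; not YARN code. */
public class CompletionGuardSketch {

    /** Tracks used memory for live containers, keyed by a container id string. */
    static final class UsageTracker {
        private final Map<String, Integer> liveContainers = new HashMap<>();
        private final boolean guardEnabled;
        private int usedMemoryMb = 0;

        UsageTracker(boolean guardEnabled) {
            this.guardEnabled = guardEnabled;
        }

        synchronized void allocate(String containerId, int memoryMb) {
            liveContainers.put(containerId, memoryMb);
            usedMemoryMb += memoryMb;
        }

        /** Returns true if this completion event changed the counters. */
        synchronized boolean completed(String containerId, int memoryMb) {
            // The liveness check happens under the same lock as the update,
            // so a second event for the same container is ignored.
            if (guardEnabled && !liveContainers.containsKey(containerId)) {
                return false;
            }
            liveContainers.remove(containerId);
            usedMemoryMb -= memoryMb;
            return true;
        }

        synchronized int usedMemoryMb() {
            return usedMemoryMb;
        }
    }

    public static void main(String[] args) {
        for (boolean guard : new boolean[] {false, true}) {
            UsageTracker tracker = new UsageTracker(guard);
            tracker.allocate("container_01", 2048);
            tracker.completed("container_01", 2048); // e.g. FINISHED via node heartbeat
            tracker.completed("container_01", 2048); // e.g. RELEASED by the AM
            System.out.println("guard=" + guard
                + " usedMemoryMb=" + tracker.usedMemoryMb());
        }
    }
}
```

Without the guard the counter ends at -2048; with the guard it ends at 0, whichever of the two events arrives first.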