[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative

2015-08-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14719292#comment-14719292
 ] 

Junping Du commented on YARN-3933:
--

Hi [~guoshiwei], we should just update the description and title of this JIRA 
instead of creating a new one. No worries. I will mark YARN-4089 as a duplicate 
of this JIRA and assign this JIRA to you, given that you would like to work on 
it and already have a patch to fix it.

> Resources(both core and memory) are being negative
> --
>
> Key: YARN-3933
> URL: https://issues.apache.org/jira/browse/YARN-3933
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.2
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
>  Labels: patch
> Attachments: patch.BUGFIX-JIRA-YARN-3933.txt
>
>
> In our cluster we are seeing available memory and cores go negative. 
> Initial inspection:
> Scenario no. 1: 
> In the capacity scheduler, the method allocateContainersToNode() checks whether 
> there are excess container reservations for an application; if they are 
> no longer needed, it calls queue.completedContainer(), which causes the 
> resources to go negative, even though they were never assigned in the first place. 
> I am still looking through the code. Can somebody suggest how to simulate 
> excess container assignments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative

2015-08-27 Thread Shiwei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716506#comment-14716506
 ] 

Shiwei Guo commented on YARN-3933:
--

I created a new issue, [YARN-4089|https://issues.apache.org/jira/browse/YARN-4089], to 
describe the race condition bug in FairScheduler. I'm new to the Hadoop 
community, so I hope I didn't do anything wrong. Thanks.


[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative

2015-08-25 Thread Shiwei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711132#comment-14711132
 ] 

Shiwei Guo commented on YARN-3933:
--

I think so, and so is 
[YARN-4045|https://issues.apache.org/jira/browse/YARN-4045]. The negative value 
in the root queue is caused by calls to updateRootQueueMetrics on the same containerId. 
Our cluster has the capacity to run 13000+ containers, but the web UI says:

- Containers Running: -26546
- Memory Used: -82.38 TB
- VCores Used: -26451

Luckily, it hasn't affected scheduling yet.


[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative

2015-08-25 Thread Lavkesh Lahngir (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711056#comment-14711056
 ] 

Lavkesh Lahngir commented on YARN-3933:
---

Is it related to this?
https://issues.apache.org/jira/browse/YARN-4067


[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative

2015-08-25 Thread Shiwei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711051#comment-14711051
 ] 

Shiwei Guo commented on YARN-3933:
--

So should I open a new issue instead?


[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative

2015-08-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709572#comment-14709572
 ] 

Hadoop QA commented on YARN-3933:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | patch | 0m 1s | The patch file was not named according to hadoop's naming conventions. Please see https://wiki.apache.org/hadoop/HowToContribute for instructions. |
| {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12752004/patch.BUGFIX-JIRA-YARN-3933.txt |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / feaf034 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8896/console |


This message was automatically generated.


[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative

2015-08-24 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709251#comment-14709251
 ] 

Junping Du commented on YARN-3933:
--

I think the title here is a bit misleading. Available resources being negative 
is not necessarily a problem (e.g. when the NM resource configuration feature, 
YARN-291, is enabled); it means resources are over-committed, although we 
shouldn't see this in most cases. This is actually a race condition bug in 
FairScheduler. Please mention that explicitly, or developers and users may get 
the impression that resources should never be negative, which is an assumption 
we have never made.


[jira] [Commented] (YARN-3933) Resources(both core and memory) are being negative

2015-08-24 Thread Shiwei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709146#comment-14709146
 ] 

Shiwei Guo commented on YARN-3933:
--

We are also seeing this problem, and it may cause the RM to never allocate 
resources for a queue whose used resources are negative.

I did some research and found that this is mainly caused by a race condition in 
calls to AbstractYarnScheduler.completedContainer. Let's take FairScheduler as 
an example:
{code:title=FairScheduler.java}
protected synchronized void completedContainer(RMContainer rmContainer,
    ContainerStatus containerStatus, RMContainerEventType event) {
  if (rmContainer == null) {
    LOG.info("Null container completed...");
    return;
  }

  Container container = rmContainer.getContainer();

  // Get the application for the finished container
  FSAppAttempt application =
      getCurrentAttemptForContainer(container.getId());
  ApplicationId appId =
      container.getId().getApplicationAttemptId().getApplicationId();
  if (application == null) {
    LOG.info("Container " + container + " of" +
        " unknown application attempt " + appId +
        " completed with event " + event);
    return;
  }
  // Liveness check: skip a container that is no longer live, e.g. when a
  // second completion event arrives for the same container
  if (!application.getLiveContainersMap().containsKey(container.getId())) {
    LOG.info("Container " + container + " of application attempt " + appId
        + " is not alive, skip do completedContainer operation on event "
        + event);
    return;
  }

  // Get the node on which the container was allocated
  FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());

  if (rmContainer.getState() == RMContainerState.RESERVED) {
    application.unreserve(rmContainer.getReservedPriority(), node);
  } else {
    application.containerCompleted(rmContainer, containerStatus, event);
    node.releaseContainer(container);
    updateRootQueueMetrics();
  }

  LOG.info("Application attempt " + application.getApplicationAttemptId()
      + " released container " + container.getId() + " on node: " + node
      + " with event: " + event);
}
{code}

The completedContainer method calls application.containerCompleted, which 
subtracts the resources used by this container from the application's 
usedResource counter. So if completedContainer is called twice with the same 
container, the counter is decremented twice. The same applies to the 
updateRootQueueMetrics call, which is why we see a negative allocatedMemory on 
the root queue.

The solution is to check whether the supplied container is still live 
*inside* completedContainer (as shown in the patch). There are some checks 
before completedContainer is called, but they are not enough.
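
To make the double subtraction concrete, here is a minimal, self-contained sketch. It is not YARN code: AppUsageSketch, LivenessGuardDemo, and their methods are hypothetical names invented for illustration, and the memory figure is arbitrary. It only shows how an unguarded decrement goes negative on a repeated completion, while a liveness check inside the completion method turns the second call into a no-op.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for per-application bookkeeping; not the YARN classes.
final class AppUsageSketch {
    private final Map<String, Integer> liveContainers = new HashMap<>();
    private int usedMemoryMb = 0;

    synchronized void allocate(String containerId, int memoryMb) {
        liveContainers.put(containerId, memoryMb);
        usedMemoryMb += memoryMb;
    }

    // Unguarded completion: subtracts on every call, even a repeated one.
    synchronized void completeUnguarded(String containerId, int memoryMb) {
        liveContainers.remove(containerId);
        usedMemoryMb -= memoryMb;
    }

    // Guarded completion: a container that is no longer live is skipped.
    synchronized boolean completeGuarded(String containerId) {
        Integer memoryMb = liveContainers.remove(containerId);
        if (memoryMb == null) {
            return false; // already completed (or never allocated): no-op
        }
        usedMemoryMb -= memoryMb;
        return true;
    }

    synchronized int usedMemoryMb() {
        return usedMemoryMb;
    }
}

final class LivenessGuardDemo {
    public static void main(String[] args) {
        AppUsageSketch unguarded = new AppUsageSketch();
        unguarded.allocate("container_1", 2048);
        unguarded.completeUnguarded("container_1", 2048); // first completion
        unguarded.completeUnguarded("container_1", 2048); // duplicate completion
        System.out.println("unguarded used MB: " + unguarded.usedMemoryMb());

        AppUsageSketch guarded = new AppUsageSketch();
        guarded.allocate("container_1", 2048);
        guarded.completeGuarded("container_1"); // first completion
        guarded.completeGuarded("container_1"); // duplicate is ignored
        System.out.println("guarded used MB: " + guarded.usedMemoryMb());
    }
}
```

The check has to live inside the same synchronized method as the decrement; a check made by the caller beforehand can pass for both callers before either one updates the counter.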

For a deeper discussion, completedContainer may be called from two places:

1. Triggered by the RMContainerEventType.FINISHED event:
{code:title=FairScheduler.nodeUpdate}
// Process completed containers
for (ContainerStatus completedContainer : completedContainers) {
  ContainerId containerId = completedContainer.getContainerId();
  LOG.debug("Container FINISHED: " + containerId);
  completedContainer(getRMContainer(containerId),
  completedContainer, RMContainerEventType.FINISHED);
}
{code}

2. Triggered by the RMContainerEventType.RELEASED event:
{code:title=AbstractYarnScheduler.releaseContainers}
completedContainer(rmContainer,
SchedulerUtils.createAbnormalContainerStatus(containerId,
  SchedulerUtils.RELEASED_CONTAINER), RMContainerEventType.RELEASED);
{code}

RMContainerEventType.RELEASED is not triggered by the MapReduce 
ApplicationMaster, so we won't see this problem with MR jobs. But Tez triggers 
it when it no longer needs a container, while the NodeManager also reports a 
container-completed message to the RM, which in turn triggers the 
RMContainerEventType.FINISHED event. If the FINISHED event reaches the RM 
earlier than the release request from the Tez AM, the problem happens.
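
The two delivery paths can be simulated with a small sketch. Everything in it is an assumption made for illustration: the class DuplicateCompletionRace, its methods, and the container id are hypothetical, and it uses a concurrent set instead of the scheduler's synchronized method. It starts one thread per event type for the same container and shows that, with the liveness check inside the completion method, exactly one of the two events is applied regardless of arrival order.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of two completion events racing for one container.
final class DuplicateCompletionRace {
    enum EventType { FINISHED, RELEASED }

    private final Set<String> live = ConcurrentHashMap.newKeySet();
    private final AtomicInteger runningContainers = new AtomicInteger();

    void allocate(String containerId) {
        if (live.add(containerId)) {
            runningContainers.incrementAndGet();
        }
    }

    // Set.remove returns true for exactly one caller per container id,
    // so the counter is decremented at most once per container.
    boolean completed(String containerId, EventType event) {
        if (!live.remove(containerId)) {
            return false; // not live any more: skip this event
        }
        runningContainers.decrementAndGet();
        return true;
    }

    int running() {
        return runningContainers.get();
    }

    // Returns {number of events applied, running containers afterwards}.
    static int[] simulate() {
        DuplicateCompletionRace scheduler = new DuplicateCompletionRace();
        scheduler.allocate("container_01");
        CountDownLatch start = new CountDownLatch(1);
        AtomicInteger applied = new AtomicInteger();
        Thread[] threads = new Thread[2];
        EventType[] events = EventType.values();
        for (int i = 0; i < threads.length; i++) {
            final EventType event = events[i];
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // release both threads together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (scheduler.completed("container_01", event)) {
                    applied.incrementAndGet();
                }
            });
            threads[i].start();
        }
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { applied.get(), scheduler.running() };
    }

    public static void main(String[] args) {
        int[] result = simulate();
        System.out.println("events applied: " + result[0]
            + ", running containers: " + result[1]);
    }
}
```

Which of the two events wins is not deterministic, but the counter ends at zero either way, which is the property the liveness check is meant to guarantee.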

--
This messag