[jira] [Commented] (YARN-5221) Expose UpdateResourceRequest API to allow AM to request for change in container properties

2016-06-10 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324560#comment-15324560
 ] 

MENG DING commented on YARN-5221:
-

Hi, [~leftnoteasy]

I have not been following this for a while, but it does make sense to merge any 
change to a container into one unified API. Just to be sure, this won't cause 
any compatibility issues since the container resize feature has not been 
officially released yet, right?

Also, are users allowed to increase AND decrease different resource indices in 
one update call, since we call it "update" now?

[~asuresh], when this ticket is completed, will you be able to update YARN-4175 
with an update example? This may be useful for people who are already 
prototyping with the container resize feature.

Thanks,
Meng

> Expose UpdateResourceRequest API to allow AM to request for change in 
> container properties
> --
>
> Key: YARN-5221
> URL: https://issues.apache.org/jira/browse/YARN-5221
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: YARN-5221.001.patch, YARN-5221.002.patch
>
>
> YARN-1197 introduced APIs to allow an AM to request for Increase and Decrease 
> of Container Resources after initial allocation.
> YARN-5085 proposes to allow an AM to request for a change of Container 
> ExecutionType.
> This JIRA proposes to unify both of the above into an Update Container API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4671) There is no need to acquire CS lock when completing a container

2016-02-26 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4671:

Attachment: YARN-4671.2.patch

Rebased against trunk

> There is no need to acquire CS lock when completing a container
> ---
>
> Key: YARN-4671
> URL: https://issues.apache.org/jira/browse/YARN-4671
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4671.1.patch, YARN-4671.2.patch
>
>
> In YARN-4519, we discovered that there is no need to acquire CS lock in 
> CS#completedContainerInternal, because:
> * Access to critical section are already guarded by queue lock.
> * It is not essential to guard {{schedulerHealth}} with cs lock in 
> completedContainerInternal. All maps in schedulerHealth are concurrent maps. 
> Even if schedulerHealth is not consistent at the moment, it will be 
> eventually consistent.
> With this fix, we can truly claim that CS#allocate doesn't require CS lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-08 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4138:

Attachment: YARN-4138.7.patch

Fix the failed unit test case 
{{testDecreaseAfterIncreaseWithAllocationExpiration}}.

Attaching the latest patch.

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, 
> YARN-4138.5.patch, YARN-4138.6.patch, YARN-4138.7.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-08 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15137639#comment-15137639
 ] 

MENG DING commented on YARN-4138:
-

The checkstyle warnings are not fixable.
The failed tests are not related to this issue.

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, 
> YARN-4138.5.patch, YARN-4138.6.patch, YARN-4138.7.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-05 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4138:

Attachment: YARN-4138.6.patch

Attaching a new patch that updates {{lastConfirmedResource}} based on NM-reported 
increased containers.

Since the {{containerIncreasedOnNode}} function updates 
{{lastConfirmedResource}}, we will guard its content with a queue lock, but 
will drop the cs lock (this is consistent with other functions like 
{{rollbackContainerResource}}, {{updateIncreaseRequests}} and 
{{decreaseContainer}}).

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, 
> YARN-4138.5.patch, YARN-4138.6.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-05 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134878#comment-15134878
 ] 

MENG DING commented on YARN-4138:
-

The difference between the two AllocationExpirationInfo objects is that the second 
one resets the start time for the timeout. But honestly, you are right that there 
is really no harm in setting lastConfirmedResource to nmContainerResource when 
rmContainerResource > nmContainerResource.

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, 
> YARN-4138.5.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-05 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134926#comment-15134926
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~jianhe] and [~sandflee]

After more thought, I think we should be able to update the last confirmed 
resource every time we get a report of increased containers from the NM, like the 
following:
{code}
if (rmContainerResource == nmContainerResource) {
  lastConfirmedResource = nmContainerResource
  containerAllocationExpirer.unregister
} else if (rmContainerResource < nmContainerResource) { // sandflee's use case
  lastConfirmedResource = rmContainerResource
  containerAllocationExpirer.unregister
  handle(RMNodeDecreaseContainerEvent)
} else if (nmContainerResource < rmContainerResource) { // consecutive increase use case
  lastConfirmedResource = max(nmContainerResource, lastConfirmedResource)
}
{code}

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, 
> YARN-4138.5.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4671) There is no need to acquire CS lock when completing a container

2016-02-04 Thread MENG DING (JIRA)
MENG DING created YARN-4671:
---

 Summary: There is no need to acquire CS lock when completing a 
container
 Key: YARN-4671
 URL: https://issues.apache.org/jira/browse/YARN-4671
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: MENG DING
Assignee: MENG DING


In YARN-4519, we discovered that there is no need to acquire the CS lock in 
CS#completedContainerInternal, because:

* Access to the critical section is already guaranteed by the queue lock.
* It is not essential to guard {{schedulerHealth}} with the cs lock. All maps in 
schedulerHealth are concurrent maps.

With this fix, we can truly claim that CS#allocate doesn't require the CS lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132466#comment-15132466
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~jianhe]

I think with a bit of explanation this won't cause confusion. The key message 
here is that if a user issues two increase requests in a row, but does not 
eventually use the latest token, we consider this a user/app error, because we 
don't really know what the user wants. From the ResourceManager's 
perspective, at any time, a {{ContainerAllocationExpirer}} can only track one 
increase allocation. If two consecutive increase allocations are made, the 
second allocation expiration info overwrites the first, which effectively 
*cancels* the first allocation.
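
To make the overwrite behavior concrete, here is a toy sketch in plain Java (not 
the actual {{ContainerAllocationExpirer}}) of an expirer keyed by container id, 
where a second register() for the same container simply replaces the earlier 
pending expiration:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy expirer: one pending expiration per container id. Registering the same
// container again overwrites (and thereby cancels) the earlier expiration.
public class ToyAllocationExpirer {
  private final Map<Long, Long> expiryByContainer = new ConcurrentHashMap<>();

  public void register(long containerId, long expiryTimeMs) {
    expiryByContainer.put(containerId, expiryTimeMs);   // overwrites any old entry
  }

  public void unregister(long containerId) {
    expiryByContainer.remove(containerId);
  }

  public boolean isExpired(long containerId, long nowMs) {
    Long expiry = expiryByContainer.get(containerId);
    return expiry != null && nowMs > expiry;
  }
}
{code}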

Let me know if you have further concerns.


> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, 
> YARN-4138.5.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4671) There is no need to acquire CS lock when completing a container

2016-02-04 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4671:

Description: 
In YARN-4519, we discovered that there is no need to acquire the CS lock in 
CS#completedContainerInternal, because:

* Access to the critical section is already guarded by the queue lock.
* It is not essential to guard {{schedulerHealth}} with the cs lock in 
completedContainerInternal. All maps in schedulerHealth are concurrent maps. 
Even if schedulerHealth is not consistent at the moment, it will be eventually 
consistent.

With this fix, we can truly claim that CS#allocate doesn't require the CS lock.

  was:
In YARN-4519, we discovered that there is no need to acquire CS lock in 
CS#completedContainerInternal, because:

* Access to critical section are already guaranteed by queue lock.
* It is not essential to guard {{schedulerHealth}} with cs lock. All maps in 
schedulerHealth are concurrent maps.

With this fix, we can truly claim that CS#allocate doesn't require CS lock.


> There is no need to acquire CS lock when completing a container
> ---
>
> Key: YARN-4671
> URL: https://issues.apache.org/jira/browse/YARN-4671
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: MENG DING
>Assignee: MENG DING
>
> In YARN-4519, we discovered that there is no need to acquire CS lock in 
> CS#completedContainerInternal, because:
> * Access to critical section are already guarded by queue lock.
> * It is not essential to guard {{schedulerHealth}} with cs lock in 
> completedContainerInternal. All maps in schedulerHealth are concurrent maps. 
> Even if schedulerHealth is not consistent at the moment, it will be 
> eventually consistent.
> With this fix, we can truly claim that CS#allocate doesn't require CS lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4671) There is no need to acquire CS lock when completing a container

2016-02-04 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4671:

Attachment: YARN-4671.1.patch

Attaching the initial patch for review.

* I need to make {{lastNodeUpdateTime}} volatile, otherwise findbugs will 
complain that accesses to {{lastNodeUpdateTime}} are not always synchronized 
(as a result of removing the CS lock from {{completedContainerInternal}}).

* Update the test case {{testAllocateDoesNotBlockOnSchedulerLock}}. The test 
sequence is:
** Submit an application, and wait for AM to be launched
** AM registers with RM
** AM allocates a new container
** Wait until the container is acquired and launched
** Grab the CS scheduler lock from another thread
** AM allocates with a release request
** Without this fix, the allocate call would block at 
{{CapacityScheduler.completedContainerInternal}}. With this fix, the allocate 
call will not block (a generic sketch of this pattern follows below).
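
For illustration only, here is a generic plain-Java sketch of that non-blocking 
assertion (a stand-in lock and latch, not the actual MockRM-based test): one 
thread holds the "scheduler" lock while another thread performs the call that 
must complete without it.
{code}
import java.util.concurrent.*;

public class NonBlockingCallSketch {
  public static void main(String[] args) throws Exception {
    final Object schedulerLock = new Object();        // stands in for the CS lock
    final CountDownLatch lockHeld = new CountDownLatch(1);

    // Thread 1: grab the scheduler lock and hold it for a while.
    new Thread(() -> {
      synchronized (schedulerLock) {
        lockHeld.countDown();
        try { Thread.sleep(3000); } catch (InterruptedException ignored) { }
      }
    }).start();
    lockHeld.await();

    // Thread 2: the "allocate with a release request" call; with the fix it no
    // longer needs the scheduler lock, so it must finish before the timeout.
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Future<?> allocate = pool.submit(() -> {
      // ... completedContainerInternal-style work that avoids the CS lock ...
    });
    allocate.get(1, TimeUnit.SECONDS);                 // would time out if it blocked
    System.out.println("allocate completed while the CS lock was held elsewhere");
    pool.shutdown();
  }
}
{code}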

> There is no need to acquire CS lock when completing a container
> ---
>
> Key: YARN-4671
> URL: https://issues.apache.org/jira/browse/YARN-4671
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4671.1.patch
>
>
> In YARN-4519, we discovered that there is no need to acquire CS lock in 
> CS#completedContainerInternal, because:
> * Access to critical section are already guarded by queue lock.
> * It is not essential to guard {{schedulerHealth}} with cs lock in 
> completedContainerInternal. All maps in schedulerHealth are concurrent maps. 
> Even if schedulerHealth is not consistent at the moment, it will be 
> eventually consistent.
> With this fix, we can truly claim that CS#allocate doesn't require CS lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131201#comment-15131201
 ] 

MENG DING commented on YARN-4138:
-

The failed tests are not related.

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, 
> YARN-4138.5.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-03 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4138:

Attachment: YARN-4138.5.patch

Hi, [~jianhe]

bq. After step 6, rmContainer.getLastConfirmedResource() will return 3G, when 
the expire event gets triggered, won't it reset it back to 3G?

No, it won't reset it back to 3G. rmContainer.getLastConfirmedResource() will 
not return 3G after step 6; it is still 1G. We only confirm a resource when the 
NM-reported resource is the same as the RM resource. In this test case, the 
NM-reported resource is 3G, but the RM-allocated resource is 6G, so 3G is NOT 
confirmed. This issue was discussed in this thread a while ago: 
https://issues.apache.org/jira/browse/YARN-4138?focusedCommentId=14737229=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14737229

bq. I think RMContainerImpl will not receive EXPIRE event at RUNNING state 
after this patch ? if so, we can remove this.

You are right, we can remove this. Attaching the latest patch that removes this.


> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, 
> YARN-4138.5.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-02 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4138:

Attachment: YARN-4138.4.patch

Attaching new patch now that YARN-4519 is completed.

In the {{rollbackContainerResource}} function, we will grab the queue lock first, 
calculate the delta resource, and then call {{decreaseContainer}}. There is no 
need to grab the cs lock.

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2016-02-02 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129300#comment-15129300
 ] 

MENG DING commented on YARN-4599:
-

I have a question regarding when the OOM controller will kick in. Is it true that:

* If memory.swappiness is set to 0, then the OOM controller will kick in when 
{{memory.limit_in_bytes}} (hard limit) is reached.
* If memory.swappiness is not set to 0, then the OOM controller will kick in only 
when all available swap in the system is used up, as 
{{memory.memsw.limit_in_bytes}} is not a configurable parameter in YARN right 
now?

Looking at this link, I am a little bit confused: 
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

bq. Consider the following example: setting memory.limit_in_bytes = 2G and 
memory.memsw.limit_in_bytes = 4G for a certain cgroup will allow processes in 
that cgroup to allocate 2 GB of memory and, once exhausted, allocate another 2 
GB of swap only. The memory.memsw.limit_in_bytes parameter represents the sum 
of memory and swap. Processes in a cgroup that does not have the 
memory.memsw.limit_in_bytes parameter set can potentially use up all the 
available swap (after exhausting the set memory limitation) and trigger an Out 
Of Memory situation caused by the lack of available swap.

bq. *memory.swappiness* Note that a value of 0 does not prevent process memory 
being swapped out; swap out might still happen when there is a shortage of 
system memory because the global virtual memory management logic does not read 
the cgroup value.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2016-02-02 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129309#comment-15129309
 ] 

MENG DING commented on YARN-4599:
-

[~aw]:

bq. We need this as something that can be set and I'd propose that the default 
be off given that administrators are going to be very confused when they see 
container usage go above the limit.

I thought the container process would be paused by the OOM controller when the 
limit is reached, so why would the memory usage go above the limit?

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2016-01-28 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121857#comment-15121857
 ] 

MENG DING commented on YARN-4519:
-

Thanks [~leftnoteasy]. I will log a separate ticket for 
{{completedContainerInternal}} once the fix for this issue is approved.

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>Assignee: MENG DING
> Attachments: YARN-4519.1.patch, YARN-4519.2.patch, YARN-4519.3.patch
>
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2016-01-26 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15117468#comment-15117468
 ] 

MENG DING commented on YARN-4519:
-

Hi, [~leftnoteasy]

bq. IIUC, after this patch, increase/decrease container logic needs to acquire 
LeafQueue's lock. Since container allocation/release acquires Leafqueue's lock 
too, race condition of container/resource will be avoided.
Yes, exactly.

bq. One question not related to the patch, it looks safe to remove synchronized 
lock of CS#completedContainerInternal, correct?
I think we don't need to synchronize the entire function with cs lock, only the 
part that updates the {{schedulerHealth}}. If you think this is worth fixing, I 
will log a separate ticket.

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>Assignee: MENG DING
> Attachments: YARN-4519.1.patch, YARN-4519.2.patch, YARN-4519.3.patch
>
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2016-01-25 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4519:

Attachment: YARN-4519.3.patch

[~jianhe] and I had a discussion offline. Instead of grabbing the CS lock for 
all three actions (i.e., update increase requests, decrease, and rollback), we 
only need to grab the queue lock for those actions. 

Attaching a new patch that implements this. It is a little bit more complicated 
than I had thought:
* {{AbstractYarnScheduler.checkAndNormalizeContainerChangeRequest}} is changed 
to {{AbstractYarnScheduler.createSchedContainerChangeRequests}}, and it will 
NOT perform the {{RMServerUtils.checkAndNormalizeContainerChangeRequest}}, as 
the check will access the container resource. It should not be done without a 
queue lock. 
* The {{RMServerUtils.checkAndNormalizeContainerChangeRequest}} is changed to 
{{RMServerUtils.checkSchedContainerChangeRequest}}. It is called in 
{{LeafQueue.decreaseContainer}} and 
{{CapacityScheduler.updateIncreaseRequests}}, and is synchronized with the queue 
lock.
* The {{CapacityScheduler.updateIncreaseRequests}} and 
{{CapacityScheduler.decreaseContainer}} are not synchronized with CS lock 
anymore.
* The normalization of the target resource is moved to 
{{RMServerUtils.validateIncreaseDecreaseRequest}}.
* The bulk of the decrease resource logic is moved to 
{{LeafQueue.decreaseContainer}} (guarded by the queue lock), which does the 
following (see the sketch after this list):
** Make sure target resource <= original resource
** If there is an existing increase request for the same container, remove it
** If target resource == original resource, don't do anything, otherwise, 
release the delta resource, and notify the app, node, and then the parent queue.
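
A minimal plain-Java sketch of that flow, with toy fields standing in for the 
real queue and container state (this is not the patch code), could look like this:
{code}
// Toy sketch of the decrease flow above, all under a single queue lock:
// reject targets above the current size, drop any pending increase request for
// the same container, and release the delta only when the target is smaller.
public class DecreaseFlowSketch {
  private final Object queueLock = new Object();
  private int allocatedMb = 4096;            // current container allocation
  private Integer pendingIncreaseMb = 6144;  // outstanding increase request, if any

  public void decreaseTo(int targetMb) {
    synchronized (queueLock) {
      if (targetMb > allocatedMb) {
        throw new IllegalArgumentException("target must be <= current allocation");
      }
      pendingIncreaseMb = null;              // an existing increase request is removed
      if (targetMb == allocatedMb) {
        return;                              // nothing to release
      }
      int releasedMb = allocatedMb - targetMb;
      allocatedMb = targetMb;
      // the real code would now notify the app, the node, and the parent queue
      System.out.println("released " + releasedMb + " MB back to the queue");
    }
  }
}
{code}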

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>Assignee: MENG DING
> Attachments: YARN-4519.1.patch, YARN-4519.2.patch, YARN-4519.3.patch
>
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2016-01-14 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098974#comment-15098974
 ] 

MENG DING commented on YARN-4108:
-

Hi, [~leftnoteasy], will there be a separate ticket to track the issue of 
selecting to-be-preempted containers based on pending new/increase resource 
request?

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4108-design-doc-V3.pdf, 
> YARN-4108-design-doc-v1.pdf, YARN-4108-design-doc-v2.pdf, 
> YARN-4108.poc.1.patch, YARN-4108.poc.2-WIP.patch
>
>
> This is sibling JIRA for YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*:
> 1) Can handle case of user-limit preemption
> 2) Can handle case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross user preemption (YARN-2113), 
> cross applicaiton preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2016-01-11 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092054#comment-15092054
 ] 

MENG DING commented on YARN-4519:
-

The JavaDoc and unit test errors are not related to this issue.

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>Assignee: MENG DING
> Attachments: YARN-4519.1.patch, YARN-4519.2.patch
>
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2016-01-08 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4519:

Attachment: YARN-4519.2.patch

Attaching a new patch that addresses the failed test case.

We only need to grab the cs lock when the decrease/increase requests are not empty.


> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>Assignee: MENG DING
> Attachments: YARN-4519.1.patch, YARN-4519.2.patch
>
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2016-01-07 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088361#comment-15088361
 ] 

MENG DING commented on YARN-4519:
-

Please ignore the previous patch. I think there is room for improvement.

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>Assignee: MENG DING
> Attachments: YARN-4519.1.patch
>
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2016-01-07 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4519:

Attachment: YARN-4519.1.patch

Attaching the latest patch that addresses this issue:

bq. We need to make sure following operations are under same CS synchronization 
lock:
1. Compute delta resource for increase request and insert to application
2. Compute delta resource for decrease request and call CS.decreaseContainer
3. Rollback action

1 and 2 are addressed in this patch. 3 will be addressed in YARN-4138.

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>Assignee: MENG DING
> Attachments: YARN-4519.1.patch
>
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081550#comment-15081550
 ] 

MENG DING commented on YARN-4528:
-

Hi, [~sandflee]

With the current logic, I think the RM won't know whether a container decrease msg 
has really been persisted in the NM state store, even if you decrease the resource 
synchronously in the NM. For example, suppose we now synchronously decrease the 
resource in the NM, and something goes wrong when writing the NM state store: an 
exception will be thrown, and it will be caught by the following statement during 
status update in the NM:

{code}
catch (Throwable e) {
  // TODO Better error handling. Thread can die with the rest of the
  // NM still running.
  LOG.error("Caught exception in status-updater", e);
}
{code}

So to me, there is really no benefit to decreasing container resource 
synchronously in the NM, is there?

> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082083#comment-15082083
 ] 

MENG DING commented on YARN-4528:
-

Honestly, I don't think the design needs to be changed, unless other people 
think differently. As you said, this RARELY, if ever, happens. Also, we 
acknowledged that the AM only issues a decrease request when it knows that a 
container doesn't need the original amount of resource, and a failed decrease 
message in the NM is not at all fatal (unlike a failed increase message, which may 
cause the container to be killed by the resource enforcement). 

> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081267#comment-15081267
 ] 

MENG DING commented on YARN-4528:
-

Hi, [~sandflee]

I am not quite sure about the benefit of directly decreasing resource in NM 
(point #2 in your comment). The targetResource is already being persisted in NM 
state store for NM recovery, and RM does not need to check the status of the NM 
decrease anyway. 
{code}
// Persist container resource change for recovery
this.context.getNMStateStore().storeContainerResourceChanged(
    containerId, targetResource);
{code}


> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-29 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074082#comment-15074082
 ] 

MENG DING commented on YARN-4495:
-

I guess my point is that if you are not going to take any automated action upon the 
exception, it might be sufficient to just look at the AM log to see why a resource 
request has failed via the exception message. The key point we are addressing 
here is to not let AMRMClientAsync stop when an invalid resource request 
exception occurs, since stopping it is fatal to the user logic.

If you have a use case for taking automated actions based on the various failure 
reasons, then the protocol may need to be enhanced, which I believe the community 
tends not to do unless absolutely necessary.

> add a way to tell AM container increase/decrease request is invalid
> ---
>
> Key: YARN-4495
> URL: https://issues.apache.org/jira/browse/YARN-4495
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
> Attachments: YARN-4495.01.patch
>
>
> now RM may pass InvalidResourceRequestException to AM or just ignore the 
> change request, the former will cause AMRMClientAsync down. and the latter 
> will leave AM waiting for the relay.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-29 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073991#comment-15073991
 ] 

MENG DING commented on YARN-4519:
-

Hi, [~sandflee]

The delta resource is the difference between the currently allocated resource and 
the target resource (or, in the case of rollback, the difference between the 
currently allocated resource and the last confirmed resource). For a decrease 
request, for example, we need to put a CS lock around computing the delta resource 
and calling CS.decreaseContainer; otherwise the currently allocated resource might 
be changed in the middle by the scheduling thread, causing the delta resource to 
be outdated.

decreaseContainer updates core scheduler statistics, so it must be locked.
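
A plain-Java illustration of that race (toy fields, not the YARN classes): if the 
delta is computed before the lock is taken, the scheduling thread can change the 
allocation in between, and the stale delta releases the wrong amount.
{code}
public class StaleDeltaSketch {
  private final Object csLock = new Object();
  private int allocatedMb = 6144;

  // Wrong: the delta is read outside the lock and may be stale when applied.
  public void decreaseUnsafe(int targetMb) {
    int delta = allocatedMb - targetMb;   // another thread can change allocatedMb here
    synchronized (csLock) {
      allocatedMb -= delta;               // applies a possibly outdated delta
    }
  }

  // Right: read, compute, and apply under the same lock.
  public void decreaseSafe(int targetMb) {
    synchronized (csLock) {
      int delta = allocatedMb - targetMb;
      allocatedMb -= delta;
    }
  }
}
{code}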

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>Assignee: MENG DING
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-29 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074028#comment-15074028
 ] 

MENG DING commented on YARN-4495:
-

Hi, [~sandflee]

What is your specific use case? Do you plan to catch detailed information 
regarding which container change request is causing the exception, and do 
something about it? If so, what will the action be? 

> add a way to tell AM container increase/decrease request is invalid
> ---
>
> Key: YARN-4495
> URL: https://issues.apache.org/jira/browse/YARN-4495
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
> Attachments: YARN-4495.01.patch
>
>
> now RM may pass InvalidResourceRequestException to AM or just ignore the 
> change request, the former will cause AMRMClientAsync down. and the latter 
> will leave AM waiting for the relay.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-28 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073355#comment-15073355
 ] 

MENG DING commented on YARN-4495:
-

Hi, [~sandflee]

I was just thinking: would it be simpler to modify the AMRMClientAsync to catch 
the InvalidResourceRequestException, and then to NOT stop the 
heartbeat/callback handler threads? I feel that it is unnecessary to stop the 
AMRMClientAsync just because an invalid resource request has been submitted. The 
AM can still be notified through onError(), and handle the exception accordingly. 
This should cover your main use case, right? My 2 cents.
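
For illustration only, here is a rough fragment of what the AM's callback handler 
might do if AMRMClientAsync were changed as suggested (this is an assumption, not 
today's behavior; {{LOG}} and {{amRMClient}} are assumed fields of the handler):
{code}
// Hypothetical AM-side handling, assuming AMRMClientAsync routed an
// InvalidResourceRequestException to onError() instead of stopping itself.
@Override
public void onError(Throwable e) {
  if (e instanceof InvalidResourceRequestException) {
    // log and drop the rejected change request; heartbeat threads keep running
    LOG.warn("Container change request rejected by the RM", e);
    return;
  }
  // anything else is still treated as fatal by this (illustrative) handler
  amRMClient.stop();
}
{code}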

> add a way to tell AM container increase/decrease request is invalid
> ---
>
> Key: YARN-4495
> URL: https://issues.apache.org/jira/browse/YARN-4495
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
> Attachments: YARN-4495.01.patch
>
>
> now RM may pass InvalidResourceRequestException to AM or just ignore the 
> change request, the former will cause AMRMClientAsync down. and the latter 
> will leave AM waiting for the relay.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-28 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072873#comment-15072873
 ] 

MENG DING commented on YARN-4519:
-

I feel that the correct solution would be to simply put all decrease requests into 
a pendingDecrease list in the allocate() call (after some initial sanity 
checks, of course), and in the allocateContainersToNode() call, process all the 
pendingDecrease requests first before allocating new/increased resources. This 
would make it easy for the resource rollback too.
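
A toy sketch of that idea in plain Java (a stand-in queue, not the scheduler 
classes): allocate() only enqueues the decrease, and the scheduling thread drains 
the queue before assigning new or increased resources, so both code paths take 
locks in the same order.
{code}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PendingDecreaseSketch {
  // decrease requests parked by allocate(), drained by the scheduler thread
  private final Queue<String> pendingDecrease = new ConcurrentLinkedQueue<>();

  // Called from allocate(): no scheduler lock needed, just enqueue.
  public void allocate(String decreaseRequest) {
    pendingDecrease.add(decreaseRequest);
  }

  // Called from the scheduling thread, i.e. allocateContainersToNode().
  public synchronized void allocateContainersToNode() {
    String request;
    while ((request = pendingDecrease.poll()) != null) {
      // apply the decrease under the scheduler lock
    }
    // ... then assign new / increased containers ...
  }
}
{code}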

Also, the following code may have issues?
{code:title=CapacityScheduler.allocate}
// Pre-process increase requests
List<SchedContainerChangeRequest> normalizedIncreaseRequests =
    checkAndNormalizeContainerChangeRequests(increaseRequests, true);

// Pre-process decrease requests
List<SchedContainerChangeRequest> normalizedDecreaseRequests =
    checkAndNormalizeContainerChangeRequests(decreaseRequests, false);
{code}
There could be race conditions when calculating the delta resource for the 
SchedContainerChangeRequest, since the above code is not synchronized with the 
scheduler?

Thoughts, [~leftnoteasy]?

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-28 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072880#comment-15072880
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~sandflee]

I think this issue depends on YARN-4519. Will suspend this ticket until 
YARN-4519 is resolved.

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-28 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073067#comment-15073067
 ] 

MENG DING commented on YARN-4519:
-

This approach would be simpler, at the expense of acquiring a CS lock in the 
allocate call (though no worse than existing logic). 

I also think that it is necessary to move the logic of creating 
normalizedDecreaseRequests (i.e., SchedContainerChangeRequest) into the 
decreaseContainer() call (under the scope of the CS lock); otherwise there would 
be a race condition when creating the delta resources. What do you think?

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-28 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073316#comment-15073316
 ] 

MENG DING commented on YARN-4519:
-

Yes, the race only happens when computing the delta resource, and yes, it also 
happens for increase requests.
bq. If so, can we set delta resource of SchedContainerChangeRequest when we 
enter decreaseContainer?
I guess whichever way we take, we need to make sure the delta resource is 
computed in the scope of a CS lock (i.e., the delta resource for a decrease 
request, an increase request, and the rollback action).

I can take a shot at working out a patch if nobody is working on that yet.

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-28 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073321#comment-15073321
 ] 

MENG DING commented on YARN-4519:
-

Agreed :-)

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-28 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING reassigned YARN-4519:
---

Assignee: MENG DING

> potential deadlock of CapacityScheduler between decrease container and assign 
> containers
> 
>
> Key: YARN-4519
> URL: https://issues.apache.org/jira/browse/YARN-4519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: sandflee
>Assignee: MENG DING
>
> In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and 
> may be get CapacityScheduler's sync lock in decreaseContainer()
> In scheduler thread,  first get CapacityScheduler's sync lock in 
> allocateContainersToNode(), and may get FiCaSchedulerApp sync lock in 
> FicaSchedulerApp.assignContainers(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-27 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072426#comment-15072426
 ] 

MENG DING commented on YARN-4138:
-

Releasing containers may have the same issue too. Strange that there have been no 
reports from the field so far? Looks like we need to implement a pending 
release/decrease list in the scheduler ...

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-27 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072480#comment-15072480
 ] 

MENG DING commented on YARN-4138:
-

You are right, I remembered that wrong.

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-25 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071682#comment-15071682
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~sandflee]

In your step 3, the container will NOT be removed from the allocation expirer. 
Please see the following code in the patch:

{code}
+// Only unregister from the containerAllocationExpirer when target
+// resource is less than or equal to the last confirmed resource.
+if (Resources.fitsIn(targetResource, lastConfirmedResource)) {
+  container.lastConfirmedResource = targetResource;
+  container.containerAllocationExpirer.unregister(
+  new AllocationExpirationInfo(event.getContainerId()));
+}
{code}

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-25 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071684#comment-15071684
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~sandflee]

Can you provide a test case if you believe there is a deadlock?

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-24 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071080#comment-15071080
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~jianhe]

I just tried to apply the patch on the latest trunk, and it seemed ok:

{code}
vagrant@mdinglin02:~/workspace/hadoop-test$ git status
On branch trunk
Your branch is up-to-date with 'origin/trunk'.

nothing to commit, working directory clean
vagrant@mdinglin02:~/workspace/hadoop-test$ git apply -p0 
/vagrant/YARN-4138.3.patch
vagrant@mdinglin02:~/workspace/hadoop-test$ git status
On branch trunk
Your branch is up-to-date with 'origin/trunk'.

Changes not staged for commit:
  (use "git add ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/ContainerAllocationExpirer.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainer.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/ContainerExpiredSchedulerEvent.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerResizing.java
modified:   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestUtils.java

Untracked files:
  (use "git add ..." to include in what will be committed)


hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/AllocationExpirationInfo.java

hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestIncreaseAllocationExpirer.java

no changes added to commit (use "git add" and/or "git commit -a")
{code}

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-23 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070534#comment-15070534
 ] 

MENG DING commented on YARN-4138:
-

Hi [~jianhe], which file(s) are you referring to in particular?

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-18 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15064134#comment-15064134
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~sandflee]

1. Yes, this is the expected behavior. If you take a look at the discussion 
from the beginning of this thread, we have decided that if multiple increase 
tokens are granted by RM in a row for a container before AM uses any of the 
tokens, the last token will take effect, and any previous tokens will be 
effectively cancelled. If RM sees a difference between its own number and the 
number reported by NM, it will treat that as an *unconfirmed* state, and 
won't set the lastConfirmed value. Besides, if AM issues multiple increase 
requests but doesn't use the last token, it is considered a user error.

2. If I understand your question correctly, then you are right that you should 
pass container B. In fact, the container B you are talking about is technically 
still container A, as uniquely identified by the container ID. When the resource 
increase request of container A is granted by RM, RM still sends back container 
A, but with an updated resource and token. As an Application Master developer, you 
are expected to track all live containers in the AM, and in the 
onContainersResourceChanged(List<Container> changedContainers) callback 
function, you need to replace the original container A with the updated 
container A.
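
A rough sketch of that bookkeeping (not code from any patch in this thread; the 
LiveContainerTracker helper and its method names are made up for illustration):

{code}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Hypothetical AM-side bookkeeping: the "new" container returned by the RM
// after an approved increase carries the same ContainerId, so the AM simply
// replaces its cached entry with the updated Container object.
public class LiveContainerTracker {

  private final Map<ContainerId, Container> liveContainers =
      new ConcurrentHashMap<ContainerId, Container>();

  public void onContainerAllocated(Container container) {
    liveContainers.put(container.getId(), container);
  }

  // Invoke this from the callback that reports resource changes
  // (onContainersResourceChanged in the API discussed in this thread).
  public void onContainersResourceChanged(List<Container> changedContainers) {
    for (Container updated : changedContainers) {
      // Same ContainerId as the original allocation; only resource/token differ.
      liveContainers.put(updated.getId(), updated);
    }
  }

  public Container getLiveContainer(ContainerId id) {
    return liveContainers.get(id);
  }
}
{code}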

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, YARN-4138-YARN-1197.2.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-18 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4138:

Attachment: YARN-4138.3.patch

Attaching the latest patch that addresses [~jianhe]'s and [~sandflee]'s comments.

I think the issue brought up by [~jianhe] is about race conditions between a 
normal resource decrease and a resource rollback. The proposed fix is to guard 
the resource rollback with the same sequence of locks as the normal resource 
decrease, i.e., lock on the application first, then on the scheduler.
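
To make the ordering concrete, here is a toy sketch of that locking rule (the 
classes below are stand-ins made up for illustration, not the real YARN 
scheduler code):

{code}
// Toy illustration only: both the normal decrease path and the rollback path
// take the application lock first and the scheduler lock second, so the two
// paths can never hold the locks in opposite orders and deadlock each other.
public class LockOrderingSketch {

  static final class SchedulerApp { }
  static final class Scheduler { }

  private final SchedulerApp app = new SchedulerApp();
  private final Scheduler scheduler = new Scheduler();

  void decreaseContainer() {
    synchronized (app) {
      synchronized (scheduler) {
        // update the allocated / lastConfirmed resource here
      }
    }
  }

  void rollbackContainer() {
    synchronized (app) {
      synchronized (scheduler) {
        // roll the allocation back to lastConfirmedResource here
      }
    }
  }
}
{code}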

So with the proposed fix, we can walk through the original example:
1. AM asks increase 2G -> 8G, and is approved by RM
2. AM does not increase the container but instead asks to decrease it to 1G, and 
at the same time the increase expiration logic is triggered:
* If the normal decrease is processed first: RM decrease 8G -> 1G (allocated 
and lastConfirmed are now set to 1G), and then rollback is processed: RM 
rollback 1G -> 1G (skip)
* If rollback is processed first: RM rollback 8G -> 2G (allocated and 
lastConfirmed are now set to 2G), and then normal decrease is processed: RM 
decrease 2G -> 1G


> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-17 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063066#comment-15063066
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~jianhe]

Thanks for reviewing the code. 

* I think you are right that there could be a race condition where 
rmContainer.getLastConfirmedResource() (called first) is 2G, and the 
rmContainer.getAllocatedResource() (called next) becomes 1G, causing the 
resource delta to become positive.

I think the solution is to synchronize the following block, such that both 
rmContainer.getLastConfirmedResource() and rmContainer.getAllocatedResource() 
will be 1G, so the resource delta is 0, and the decreaseContainer call will be 
skipped.
{code}
SchedContainerChangeRequest decreaseRequest =
new SchedContainerChangeRequest(
schedulerNode, rmContainer,
rmContainer.getLastConfirmedResource());
decreaseContainer(decreaseRequest,
getCurrentAttemptForContainer(containerId));
{code}

* I don't quite understand the concern about the API semantics. If the above is 
fixed, is the API semantics still a concern to you?

* bq. revert format only changes in RMContainerChangeResourceEvent
Will do


> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, YARN-4138-YARN-1197.2.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-17 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063209#comment-15063209
 ] 

MENG DING commented on YARN-4138:
-

Thanks [~sandflee] for the review.

bq. use Resources.fitsin(targetResource, lastConfirmedResource)?
Will do

bq. update lastConfirmedResource in RMContainer? and log debug to log info?
We should not update lastConfirmedResource in this scenario. This is the exact 
case we want to cover in this ticket, where the resource increase token may 
expire, and we need to roll back to the old resource. The only time we want to 
update lastConfirmedResource during resource increase is when 
Resources.equals(nmContainerResource, rmContainerResource).

bq. If am increase a containerA 1G -> 2G, and recieved a new container B, and 
have not told NM if am wants to decrease it to 500M, when using 
requestContainerResourceChange(Container container, Resource capability) , 
seems we should use container B?
Sorry I don't understand the question. Can you elaborate?

Thanks,
Meng


> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, YARN-4138-YARN-1197.2.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-16 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061393#comment-15061393
 ] 

MENG DING commented on YARN-1197:
-

[~sandflee], for now you can achieve the goal of increasing and decreasing 
different resource indices by sending separate resource change requests, with 
each request only changing one index.
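
Roughly like the following sketch, using the AMRMClient#requestContainerResourceChange 
API (an untested illustration only; the helper class is made up, and doubling the 
memory is just an example):

{code}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Sketch of the workaround: change one resource index per request, keeping
// the other index at its current value.
public class SeparateChangeRequests {

  public static void increaseMemoryOnly(
      AMRMClient<?> amRmClient, Container container) {
    Resource current = container.getResource();
    // Grow memory only; vcores stay at their current value.
    Resource target = Resource.newInstance(
        current.getMemory() * 2, current.getVirtualCores());
    amRmClient.requestContainerResourceChange(container, target);
    // A second request that only changes vcores would be sent later, once this
    // one has been confirmed and the cached Container object has been replaced
    // with the updated one returned by the RM.
  }
}
{code}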

> Support changing resources of an allocated container
> 
>
> Key: YARN-1197
> URL: https://issues.apache.org/jira/browse/YARN-1197
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: api, graceful, nodemanager, resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Wangda Tan
> Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
> YARN-1197_Design.2015.06.24.pdf, YARN-1197_Design.2015.07.07.pdf, 
> YARN-1197_Design.2015.08.21.pdf, YARN-1197_Design.pdf
>
>
> The current YARN resource management logic assumes resource allocated to a 
> container is fixed during the lifetime of it. When users want to change a 
> resource 
> of an allocated container the only way is releasing it and allocating a new 
> container with expected size.
> Allowing run-time changing resources of an allocated container will give us 
> better control of resource usage in application side



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-14 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056056#comment-15056056
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~sandflee]

The proposed implementation of the token expiration and resource allocation 
rollback is effectively the same as resource allocation decrease. When the 
resource allocation of a container is decreased in RM, the AM will be notified 
in the next AM-RM heartbeat response. So AM should have a consistent view of 
the resource allocation eventually.

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, YARN-4138-YARN-1197.2.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-11 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052890#comment-15052890
 ] 

MENG DING commented on YARN-4138:
-

Hi, [~sandflee]

Not sure if I fully understand your question. If the resource is successfully 
increased on the NM, the NM will report the increase to RM in the next heartbeat, 
so there will be no token expiration.

For token expiration to occur, the AM needs to acquire the increase token but 
NOT call the NMClient.increaseContainerResource API. When RM rolls back the 
resource allocation (implemented in this patch), it follows the same logic as a 
normal resource allocation decrease. When that is done, the AM should get a 
notification of the resource decrease in the heartbeat response.
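
For reference, the AM-side call that actually uses the token looks roughly like 
this (a sketch based on the API discussed in this thread, not code from the 
patch; the helper class is made up):

{code}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch: after the RM grants an increase, the AM must use the new token by
// calling increaseContainerResource on the NM; if this call is skipped, the
// increase eventually expires and the RM rolls the allocation back, which the
// AM then sees as a decrease in a later heartbeat response.
public class UseIncreaseToken {

  public static void applyIncrease(NMClient nmClient, Container updatedContainer)
      throws YarnException, IOException {
    // updatedContainer is the container returned by the RM with the larger
    // resource and a fresh token.
    nmClient.increaseContainerResource(updatedContainer);
  }
}
{code}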

> Roll back container resource allocation after resource increase token expires
> -
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, YARN-4138-YARN-1197.2.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989637#comment-14989637
 ] 

MENG DING commented on YARN-1510:
-

I just ran these tests locally with latest trunk and YARN-1510 applied, and 
they all passed:

{code}
---
 T E S T S
---

---
 T E S T S
---
Running org.apache.hadoop.yarn.client.TestGetGroups
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.886 sec - in 
org.apache.hadoop.yarn.client.TestGetGroups
Running org.apache.hadoop.yarn.client.api.impl.TestYarnClient
Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.187 sec - 
in org.apache.hadoop.yarn.client.api.impl.TestYarnClient

Results :

Tests run: 28, Failures: 0, Errors: 0, Skipped: 0
{code}

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989648#comment-14989648
 ] 

MENG DING commented on YARN-1510:
-

Also ran the following tests, they passed:

{code}
---
 T E S T S
---
Running org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 51.842 sec - 
in org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
Running org.apache.hadoop.yarn.client.api.impl.TestNMClient
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 73.733 sec - in 
org.apache.hadoop.yarn.client.api.impl.TestNMClient

Results :

Tests run: 12, Failures: 0, Errors: 0, Skipped: 0

{code}


> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987876#comment-14987876
 ] 

MENG DING commented on YARN-1510:
-

The mvn install test is flawed. It goes directly into the 
{{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell}}
 directory and does a {{mvn install}}, which still picks up the old yarn client 
artifacts from the local maven repo. This causes the build to fail. The mvn 
install test should be run from the root hadoop directory.

{code}
Mon Nov  2 17:08:05 UTC 2015
cd 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
mvn -Dmaven.repo.local=/home/jenkins/yetus-m2/hadoop-trunk-1 -DskipTests -fae 
clean install -DskipTests=true -Dmaven.javadoc.skip=true
[INFO] Scanning for projects...
...
...
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
hadoop-yarn-applications-distributedshell ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 4 source files to 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/classes
[INFO] -
[ERROR] COMPILATION ERROR : 
[INFO] -
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[861,55]
 cannot find symbol
  symbol:   class AbstractCallbackHandler
  location: class org.apache.hadoop.yarn.client.api.async.NMClientAsync
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[565,21]
 no suitable constructor found for 
NMClientAsyncImpl(org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.NMCallbackHandler)
constructor 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.NMClientAsyncImpl(java.lang.String,org.apache.hadoop.yarn.client.api.NMClient,org.apache.hadoop.yarn.client.api.async.NMClientAsync.CallbackHandler)
 is not applicable
  (actual and formal argument lists differ in length)
constructor 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.NMClientAsyncImpl(java.lang.String,org.apache.hadoop.yarn.client.api.async.NMClientAsync.CallbackHandler)
 is not applicable
  (actual and formal argument lists differ in length)
constructor 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.NMClientAsyncImpl(org.apache.hadoop.yarn.client.api.async.NMClientAsync.CallbackHandler)
 is not applicable
  (actual argument 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.NMCallbackHandler
 cannot be converted to 
org.apache.hadoop.yarn.client.api.async.NMClientAsync.CallbackHandler by method 
invocation conversion)
...
...
{code}

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-11-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987868#comment-14987868
 ] 

MENG DING commented on YARN-1509:
-

The test failure should be related to YARN-4326.

In addition, the mvn install test script is flawed. It goes directly into the 
{{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell}}
 directory and does a {{mvn install}}, which still picks up the old yarn client 
artifacts from the local maven repo. This causes the build to fail. The mvn 
install test should be run from the root hadoop directory.

{code}
Tue Nov  3 16:18:38 UTC 2015
cd 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
mvn -Dmaven.repo.local=/home/jenkins/yetus-m2/hadoop-trunk-0 -DskipTests -fae 
clean install -DskipTests=true -Dmaven.javadoc.skip=true
[INFO] Scanning for projects...
...
...
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
hadoop-yarn-applications-distributedshell ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 4 source files to 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/classes
[INFO] -
[ERROR] COMPILATION ERROR : 
[INFO] -
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[735,50]
 cannot find symbol
  symbol:   class AbstractCallbackHandler
  location: class org.apache.hadoop.yarn.client.api.async.AMRMClientAsync
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[559,20]
 cannot find symbol
  symbol:   class AbstractCallbackHandler
  location: class org.apache.hadoop.yarn.client.api.async.AMRMClientAsync
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[737,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[805,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[838,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[841,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[846,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[849,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[857,5]
 method does not override or implement a method from a supertype
[INFO] 9 errors 
[INFO] -
[INFO] 
[INFO] BUILD FAILURE
{code}


> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.10.patch, 
> YARN-1509.2.patch, 

[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988091#comment-14988091
 ] 

MENG DING commented on YARN-1510:
-

I logged YETUS-159 for this issue.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987966#comment-14987966
 ] 

MENG DING commented on YARN-1510:
-

[~leftnoteasy], should I log this in the Apache Yetus community?

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-11-03 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.9.patch

Thanks [~jianhe] for reviewing the patch and giving feedback offline. To 
summarize, in the following function:

{code: title=AMRMClientImpl.java}
+  protected void removePendingChangeRequests(
+  List<Container> changedContainers, boolean isIncrease) {
+for (Container changedContainer : changedContainers) {
+  ContainerId containerId = changedContainer.getId();
+  if (pendingChange.get(containerId) == null) {
+continue;
+  }
+  Resource target = pendingChange.get(containerId).getValue();
+  if (target == null) {
+continue;
+  }
+  Resource changed = changedContainer.getResource();
+  if (isIncrease) {
+if (Resources.fitsIn(target, changed)) {
+  if (LOG.isDebugEnabled()) {
+LOG.debug("RM has confirmed increased resource allocation for "
++ "container " + containerId + ". Current resource allocation:"
++ changed + ". Remove pending change request:"
++ target);
+  }
+  pendingChange.remove(containerId);
+}
+  } else {
+if (Resources.fitsIn(changed, target)) {
+  if (LOG.isDebugEnabled()) {
+LOG.debug("RM has confirmed decreased resource allocation for "
++ "container " + containerId + ". Current resource allocation:"
++ changed + ". Remove pending change request:"
++ target);
+  }
+  pendingChange.remove(containerId);
+}
+  }
+}
+  }
{code}
* There is no need to check {{target}} for null, as under no circumstance will 
it become null.
* Better yet, there is no need to compare {{changed}} with {{target}} at all, 
because {{Resources.fitsIn(target, changed)}} will always be true for a confirmed 
increase request, and likewise {{Resources.fitsIn(changed, target)}} for a 
confirmed decrease request. I added these checks originally to be defensive, 
but in the end there is really no need for them.

Attaching latest patch that addresses the above.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch, YARN-1509.6.patch, YARN-1509.7.patch, 
> YARN-1509.8.patch, YARN-1509.9.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-11-03 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.10.patch

Please ignore the previous patch, and see the latest one.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.10.patch, 
> YARN-1509.2.patch, YARN-1509.3.patch, YARN-1509.4.patch, YARN-1509.5.patch, 
> YARN-1509.6.patch, YARN-1509.7.patch, YARN-1509.8.patch, YARN-1509.9.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1510) Make NMClient support change container resources

2015-11-02 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1510:

Attachment: YARN-1510.7.patch

Thank you so much for catching this, [~jianhe]!

Attaching the latest patch, which ignores the INCREASE_CONTAINER_RESOURCE event 
when the container is in the DONE or FAILED state.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-02 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986162#comment-14986162
 ] 

MENG DING commented on YARN-1510:
-

The mvninstall error and failed tests should not be related to this ticket.

It seems that Hadoop QA now uses a different test framework? Not sure why 
mvninstall failed; it runs OK in my environment, see below. Maybe the 
mvninstall test should be run last instead of first?
{code}
Total Elapsed time:  47m  0s



+1 overall

 __
< Success! >
 --
 \ /\  ___  /\
  \   // \/   \/ \\
 ((O O))
  \\ / \ //
   \/  | |  \/
|  | |  |
|  | |  |
|   o   |
| |   | |
|m|   |m|


| Vote |   Subsystem |  Runtime   | Comment

|   0  |  pre-patch  |  20m 34s   | Pre-patch trunk compilation is
|  | || healthy.
|  +1  |@author  |  0m 0s | The patch does not contain any
|  | || @author tags.
|  +1  | tests included  |  0m 0s | The patch appears to include 2 new
|  | || or modified test files.
|  +1  |  javac  |  9m 39s| There were no new javac warning
|  | || messages.
|  +1  |javadoc  |  10m 43s   | There were no new javadoc warning
|  | || messages.
|  +1  |  release audit  |  0m 32s| The applied patch does not increase
|  | || the total number of release audit
|  | || warnings.
|  +1  | checkstyle  |  1m 5s | There were no new checkstyle
|  | || issues.
|  +1  | whitespace  |  0m 5s | The patch has no lines that end in
|  | || whitespace.
|  +1  |install  |  1m 54s| mvn install still works.
|  +1  |eclipse:eclipse  |  0m 43s| The patch built with
|  | || eclipse:eclipse.
|  +1  |   findbugs  |  1m 43s| The patch does not introduce any
|  | || new Findbugs (version 3.0.0)
|  | || warnings.
|  | |  47m 0s|
{code}

In addition, the {{TestDistributedShell}} timeout may have the same root cause 
as YARN-4320, since the ApplicationMaster reports the following error in my test 
environment:
{code}
2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
(TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
http://mdinglin02:0/ws/v1/timeline/
2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
(TimelineClientImpl.java:logException(213)) - Exception caught by 
TimelineClientConnectionRetry, will try 30 more time(s).
...
...
java.lang.RuntimeException: Failed to connect to timeline server. Connection 
retries limit exceeded. The posted timeline event may be missing
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at 
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
at 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
at 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
at 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
{code}

> Make NMClient support change container resources
> 

[jira] [Created] (YARN-4326) TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-02 Thread MENG DING (JIRA)
MENG DING created YARN-4326:
---

 Summary: TestDistributedShell timeout as AHS in MiniYarnCluster no 
longer binds to default port 8188
 Key: YARN-4326
 URL: https://issues.apache.org/jira/browse/YARN-4326
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: MENG DING
Assignee: MENG DING


The timeout originates in ApplicationMaster, where it fails to connect to 
timeline server, and retry exceeds limits:

{code}
2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
(TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
http://mdinglin02:0/ws/v1/timeline/
2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
(TimelineClientImpl.java:logException(213)) - Exception caught by 
TimelineClientConnectionRetry, will try 30 more time(s).
...
...
java.lang.RuntimeException: Failed to connect to timeline server. Connection 
retries limit exceeded. The posted timeline event may be missing
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at 
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
at 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
at 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
at 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-02 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986336#comment-14986336
 ] 

MENG DING commented on YARN-1510:
-

Logged YARN-4326 for the TestDistributedShell timeout issue.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4326) TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-02 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4326:

Attachment: YARN-4326.patch

Fix the problem by setting the {{TIMELINE_SERVICE_WEBAPP_ADDRESS}} after 
MiniYARNCluster is started.
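
The idea, roughly (a sketch of the approach only, not the attached patch; the 
helper class and method names are made up):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

// Sketch: once the mini cluster is running, its configuration holds the
// timeline server address it actually bound to, so copy that into the conf
// used by the test client/AM instead of relying on the default port 8188.
public class TimelineAddressFixSketch {

  public static void fixTimelineAddress(
      MiniYARNCluster yarnCluster, Configuration testConf) {
    Configuration clusterConf = yarnCluster.getConfig();
    testConf.set(YarnConfiguration.TIMELINE_SERVICE_WEBAPP_ADDRESS,
        clusterConf.get(YarnConfiguration.TIMELINE_SERVICE_WEBAPP_ADDRESS));
  }
}
{code}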

The TestDistributedShell tests pass now:
{code}
---
 T E S T S
---
Running 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 364.886 sec - 
in org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
Running 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShellWithNodeLabels
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 37.699 sec - in 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShellWithNodeLabels

Results :

Tests run: 12, Failures: 0, Errors: 0, Skipped: 0
{code}

> TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4326.patch
>
>
> The timeout originates in ApplicationMaster, where it fails to connect to 
> timeline server, and retry exceeds limits:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4287) Capacity Scheduler: Rack Locality improvement

2015-10-29 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980886#comment-14980886
 ] 

MENG DING commented on YARN-4287:
-

Looking at this issue, I have to admit that I had been frustrated with the 
existing {{getLocalityWaitFactor}}, and had the same question as [~nroberts]:
bq. This made no sense to me - Accept OFF-SWITCH without delay, yet don't 
accept RACK-LOCAL??

IMHO, although it makes sense to introduce a configurable rack-locality delay, 
it doesn't help when the cluster is really busy, as described in YARN-4189 and 
YARN-3309. As an interim solution, I am in favor of 
YARN-4287-minimal.patch, but I think the default value of 
DEFAULT_RACK_LOCALITY_FULL_RESET should be true to stay backward 
compatible. 

> Capacity Scheduler: Rack Locality improvement
> -
>
> Key: YARN-4287
> URL: https://issues.apache.org/jira/browse/YARN-4287
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: YARN-4287-minimal.patch, YARN-4287-v2.patch, 
> YARN-4287-v3.patch, YARN-4287-v4.patch, YARN-4287.patch
>
>
> YARN-4189 does an excellent job describing the issues with the current delay 
> scheduling algorithms within the capacity scheduler. The design proposal also 
> seems like a good direction.
> This jira proposes a simple interim solution to the key issue we've been 
> experiencing on a regular basis:
>  - rackLocal assignments trickle out due to nodeLocalityDelay. This can have 
> significant impact on things like CombineFileInputFormat which targets very 
> specific nodes in its split calculations.
> I'm not sure when YARN-4189 will become reality so I thought a simple interim 
> patch might make sense. The basic idea is simple: 
> 1) Separate delays for rackLocal, and OffSwitch (today there is only 1)
> 2) When we're getting rackLocal assignments, subsequent rackLocal assignments 
> should not be delayed
> Patch will be uploaded shortly. No big deal if the consensus is to go 
> straight to YARN-4189. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4287) Capacity Scheduler: Rack Locality improvement

2015-10-29 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981232#comment-14981232
 ] 

MENG DING commented on YARN-4287:
-

Agreed.

> Capacity Scheduler: Rack Locality improvement
> -
>
> Key: YARN-4287
> URL: https://issues.apache.org/jira/browse/YARN-4287
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: YARN-4287-minimal.patch, YARN-4287-v2.patch, 
> YARN-4287-v3.patch, YARN-4287-v4.patch, YARN-4287.patch
>
>
> YARN-4189 does an excellent job describing the issues with the current delay 
> scheduling algorithms within the capacity scheduler. The design proposal also 
> seems like a good direction.
> This jira proposes a simple interim solution to the key issue we've been 
> experiencing on a regular basis:
>  - rackLocal assignments trickle out due to nodeLocalityDelay. This can have 
> significant impact on things like CombineFileInputFormat which targets very 
> specific nodes in its split calculations.
> I'm not sure when YARN-4189 will become reality so I thought a simple interim 
> patch might make sense. The basic idea is simple: 
> 1) Separate delays for rackLocal, and OffSwitch (today there is only 1)
> 2) When we're getting rackLocal assignments, subsequent rackLocal assignments 
> should not be delayed
> Patch will be uploaded shortly. No big deal if the consensus is to go 
> straight to YARN-4189. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4175) Example of use YARN-1197

2015-10-28 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978444#comment-14978444
 ] 

MENG DING commented on YARN-4175:
-

Correct a typo in the previous post. It should be {{app_id}} instead of 
{{application_id}}

\\
* once the application has started, the user can start a new client and specify 
the *appmaster* option to put the client in appmaster mode. In this mode, 
the client talks directly to the appmaster, and the user can specify the 
*app_id*, *container_id*, *action*, *container_memory*, and *container_vcores* 
options to request container resizing. For example, to change a container's 
resource, the user can do:
{code}
hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -appmaster 
-app_id=<app_id> -container_id=<container_id> -action=CHANGE_CONTAINER 
-container_memory=2048 -container_vcores=1
{code}

> Example of use YARN-1197
> 
>
> Key: YARN-4175
> URL: https://issues.apache.org/jira/browse/YARN-4175
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: Wangda Tan
>Assignee: MENG DING
> Attachments: YARN-4175.1.patch, YARN-4175.2.patch
>
>
> Like YARN-2609, we need a example program to demonstrate how to use YARN-1197 
> from end-to-end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-28 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978430#comment-14978430
 ] 

MENG DING commented on YARN-1509:
-

The failed tests are not related.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch, YARN-1509.6.patch, YARN-1509.7.patch, 
> YARN-1509.8.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-27 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.8.patch

Thanks [~leftnoteasy] for the comments. Your concern is valid. I have updated 
the patch to use {{AbstractMap.SimpleEntry}} instead.
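
For reference, a minimal illustration of this kind of use of {{AbstractMap.SimpleEntry}} (hypothetical variable names, not the exact patch code):
{code}
// JDK-provided pair type; avoids a custom Pair class for pairing a container
// with the target capability of its pending change request.
Map.Entry<Container, Resource> pendingChange =
    new AbstractMap.SimpleEntry<>(container, targetCapability);
{code}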

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch, YARN-1509.6.patch, YARN-1509.7.patch, 
> YARN-1509.8.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-26 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.7.patch

Update the patch to include comments on deprecated interface/methods to refer 
to the new class/methods

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch, YARN-1509.6.patch, YARN-1509.7.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-10-26 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975294#comment-14975294
 ] 

MENG DING commented on YARN-1510:
-

Failed tests are not related.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1510) Make NMClient support change container resources

2015-10-26 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1510:

Attachment: YARN-1510.6.patch

Added comments to refer to the new class/methods for the deprecated interface 
and methods.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1510) Make NMClient support change container resources

2015-10-19 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1510:

Attachment: YARN-1510.5.patch

Attaching latest patch that deprecated the {{NMClientAsync.CallbackHandler}}.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4175) Example of use YARN-1197

2015-10-19 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4175:

Attachment: YARN-4175.2.patch

Update the example program, as the client APIs have been changed in YARN-1509 and 
YARN-1510. Now the client program does not need to distinguish between Increase 
and Decrease actions; it just needs to pass in the container ID and the target 
resource capability.

How to use the example program:
* To enable IPC service in the application master, user needs to specify the 
*enable_ipc* option. For example:
{code}
hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -jar /usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.0.0.jar -shell_command "sleep 10" -num_containers 10 -enable_ipc
{code}
* Once the application has started, the user can start a *new* client and specify 
the *appmaster* option to set the client to the appmaster mode. Under this 
mode, the client will talk directly with the appmaster, and the user can specify 
*application_id*, *container_id*, *action*, *container_memory*, 
*container_vcores* options to request container resizing. For example, to 
change a container resource, the user can do:
{code}
hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -appmaster -application_id=<application ID> -container_id=<container ID> -action=CHANGE_CONTAINER -container_memory=2048 -container_vcores=1
{code}

> Example of use YARN-1197
> 
>
> Key: YARN-4175
> URL: https://issues.apache.org/jira/browse/YARN-4175
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: Wangda Tan
>Assignee: MENG DING
> Attachments: YARN-4175.1.patch, YARN-4175.2.patch
>
>
> Like YARN-2609, we need a example program to demonstrate how to use YARN-1197 
> from end-to-end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-16 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960927#comment-14960927
 ] 

MENG DING commented on YARN-1509:
-

Hi, [~bikassaha]

Thanks for the comments.

I probably didn't make myself clear. We are on the SAME page that, for the sake 
of point 2 alone, it already makes sense to combine the increase/decrease API into 
one change API:
{code}
public abstract void requestContainerResourceChange(Container container, Resource capability);
{code}
What I was trying to say in the previous post is that supporting a mix of 
increase and decrease in one change request (point 1) doesn't seem very 
feasible (even at a later date). But I don't think we need to worry about that 
for now.

Since we are combining the increase/decrease API, we should also combine the 
callback methods into one: onContainersResourceChanged(). At this point, 
I am inclined to simply do the following, which doesn't incur many code changes. 
I will discuss further with [~leftnoteasy] on this.
{code}
public abstract void onContainersResourceChanged(List<Container> containers);
{code}
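
For illustration only, a small sketch (hypothetical helper and map names, not part of the patch) of how an AM could tell increases from decreases inside such a combined callback, by remembering the last accepted size of each container:
{code}
// knownSizes is AM-side bookkeeping: the last accepted Resource per container.
void handleChangedContainers(List<Container> containers,
    Map<ContainerId, Resource> knownSizes) {
  for (Container c : containers) {
    Resource old = knownSizes.put(c.getId(), c.getResource());
    if (old == null) {
      continue; // a container we were not tracking
    }
    boolean increased = c.getResource().getMemory() > old.getMemory()
        || c.getResource().getVirtualCores() > old.getVirtualCores();
    // react to the increase/decrease, e.g. resize the work assigned to it
  }
}
{code}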

bq. Or are you saying that invalid container resource change requests are 
immediately rejected by the RM synchronously in the allocate RPC?
Yes, the ApplicationMasterService will perform a series of sanity checks (e.g., 
requested resource <= maximum allocation, etc), and reject invalid requests 
immediately. This is the same for other requests too.

bq. Having a simple cancel request regardless of increase or decrease is 
preferable since then we are not leaking the current state of the 
implementation to the user. It is future safe
Makes sense to me. We can probably have something like 
{{cancelContainerResourceChange(Container container)}}, which applies to a 
container that has an outstanding pending increase sitting in the pendingChange 
map. There is no explicit protocol to support cancellation of a resource change 
yet; for now we can achieve it by issuing a backend decrease request with the 
target resource set to the current resource, which effectively cancels any 
outstanding increase request.
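
For illustration, a minimal sketch of how such a cancel could be layered on top of the existing decrease path (hypothetical method, assuming the combined requestContainerResourceChange API discussed above):
{code}
// Hypothetical convenience method, not part of the patch: cancel a pending
// increase by "changing" the container back to its current size.
public void cancelContainerResourceChange(Container container) {
  // A change request whose target equals the current size goes down the
  // decrease path on the RM, which drops any outstanding increase request.
  requestContainerResourceChange(container, container.getResource());
}
{code}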

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-16 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.6.patch

Attaching new patch that address the following issues:
* Combine increase/decrease requests into one method
* Combine increase/decrease callback methods into one method
* Deprecate the CallbackHandler interface and other related methods
* Remove pending change requests of a container when that container is 
released, or is completed
* Update related test cases
* Add a test case to test recovery resource change requests on RM restart

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch, YARN-1509.6.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-15 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959077#comment-14959077
 ] 

MENG DING commented on YARN-1509:
-

Hi, [~bikassaha]

Apologize for the late response, I was out traveling and just came back.

bq. A change container request (maybe not supported now) can be increase cpu + 
decrease memory. Hence a built in concept of increase and decrease in the API 
is something I am wary off
From the design stage of this project, I believe the semantics of "changing 
container resource" was meant to be either "increase" or "decrease"; this was 
reinforced by the design choice that successful increases and decreases of 
resource go through different paths. I have some concerns about extending the 
semantics to something like "increase cpu + decrease memory" inside one change 
request:
* Decrease resource happens immediately, while increase resource involves 
handing out a token together with a user action to increase on NM. If we extend 
the semantics, we need to educate the user that once a change request is 
approved, it means that the decrease part of the request is effective 
immediately, while the increase part of the request is still pending on user 
action. Could it be too confusing?
* To make matters worse, if the increase token expires and the RM rolls back 
the allocation of the increase part of the request, we end up with a partially 
fulfilled request, as we are not able to rollback the decrease part of the 
request.

IMHO, it is much cleaner to clearly separate increase and decrease requests at 
the user API level. If a user wants to increase cpu and decrease memory, he 
should send out two separate requests. Thoughts?

bq. So how about {code}public abstract void 
onContainersResourceChanged(Map oldToNewContainers); 
{code} OR {code}public abstract void 
onContainersResourceChanged(List 
updatedContainerInfo);{code}

I thought about providing the old containers in the callback method. Right now 
{{AMRMClientImpl}} remembers old containers in the {{pendingChange}} map, but 
the problem is, in the {{AMRMClientImpl.allocate}} call, once an 
increase/decrease approval is received, the old containers are immediately 
removed from the pending map. So by the time the {{AMRMClientAsyncImpl}} 
callback handler thread starts to process the response, the old containers 
won't be there any more:
{code}
+if (!pendingIncrease.isEmpty()
+    && !allocateResponse.getIncreasedContainers().isEmpty()) {
+  removePendingChangeRequests(allocateResponse.getIncreasedContainers(), true);
+}
+if (!pendingDecrease.isEmpty()
+    && !allocateResponse.getDecreasedContainers().isEmpty()) {
+  removePendingChangeRequests(allocateResponse.getDecreasedContainers(), false);
+}
{code}
My thought is, since we already ask the user to provide the old container when 
he sends out the change request, he should have the old container already, so 
we don't necessarily have to provide the old container info in the callback 
method. Thoughts?

bq. Would there be a case (maybe not currently) when a change container request 
can fail on the RM? Should the callback allow notifying about a failure to 
change the container?
The {{AbstractCallbackHandler.onError}} will be called when the change 
container request throws exception on the RM side.

bq. What is the RM notifies AMRMClient about a container completed. That 
container happens to have a pending change request? What should happen in this 
case? Should the AMRM client clear that pending request? Should it also notify 
the user that pending container change request has failed or just rely on 
onContainerCompleted() to let the AM get that information.
I think in this case AMRMClient should clear all pending requests that belong 
to this container. I will add that logic in. Thanks!

bq. I would be wary of overloading cancel with a second container change 
request. To be clear, here we are discussing user facing semantics and API. 
Having clear semantics is important vs implicit or overloaded behavior.
I am not against providing a separate cancel API. But I think the API needs to 
be clear that the cancel is only for increase request, NOT decrease request 
(just like we don't have something like cancel release container). For example, 
 we can have something like the following. Thoughts?
{code}
  public abstract void cancelContainerResourceIncrease(Container container)
{code} 

bq. Is there an existing test for that code path that could be augmented to 
make sure that the new changes are tested?

I didn't find existing tests that test the pending list on RM restart, I will 
try to add a test case for that. Thanks.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
>   

[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-15 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959502#comment-14959502
 ] 

MENG DING commented on YARN-1509:
-

bq. I didn't find existing tests that test the pending list on RM restart, I 
will try to add a test case for that

Correction, I found an existing test (i.e. 
testAMRMClientResendsRequestsOnRMRestart) that test the pending list on RM 
restart. Will augment that test to test pendingChange map.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-08 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948709#comment-14948709
 ] 

MENG DING commented on YARN-1509:
-

Hi, [~bikassaha]

Thanks a lot for the valuable comments!

bq. Why are there separate methods for increase and decrease instead of a 
single method to change the container resource size? By comparing the existing 
resource allocation to a container and the new requested resource allocation, 
it should be clear whether an increase or decrease is being requested.

As discussed in the design stage, and also described in the design doc, the 
reason to separate the increase/decrease requests in the APIs and the AMRM protocol 
is to make sure that users make a conscious decision when issuing these 
requests. It also makes it much easier to catch potential mistakes. For example, 
if a user intends to increase the resource of a container, but for whatever 
reason mistakenly specifies a target resource that is smaller than the current 
resource, the RM can catch that and throw an exception.
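
As a rough illustration of the kind of sanity check meant here (a simplified sketch, not the actual ApplicationMasterService code):
{code}
// Simplified idea only: an explicit "increase" request whose target is not
// larger than the current allocation is rejected up front.
void validateIncreaseRequest(Resource current, Resource target)
    throws InvalidResourceRequestException {
  if (target.getMemory() <= current.getMemory()
      && target.getVirtualCores() <= current.getVirtualCores()) {
    throw new InvalidResourceRequestException(
        "Target resource " + target + " is not an increase over " + current);
  }
}
{code}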

bq. Also, for completeness, is there a need for a 
cancelContainerResourceChange()? After a container resource change request has 
been submitted, what are my options as a user other than to wait for the 
request to be satisfied by the RM?

For a container resource decrease request, there is practically no chance (and 
probably no need) to cancel the request, as it happens immediately when the 
scheduler processes the request (this is similar to the release container 
request). For a container resource increase, the user can cancel any pending 
increase request still sitting in the RM by sending a decrease request with the 
same size as the current container size. I will improve the Javadoc description 
to make this clear.

bq. If I release the container, then does it mean all pending change requests 
for that container should be removed? From a quick look at the patch, it does 
not look like that is being covered, unless I am missing something.

You are right that releasing a container should cancel all pending change 
requests for that container. This is missing in the current implementation, I 
will add that.

bq. What will happen if the AM restarts after submitting a change request. Does 
the AM-RM re-register protocol need an update to handle the case of 
re-synchronizing on the change requests? Whats happens if the RM restarts? If 
these are explained in a document, then please point me to the document. The 
patch did not seem to have anything around this area. So I thought I would ask

The current implementation handles RM restarts by maintaining a pendingIncrease 
and pendingDecrease map, just like the pendingRelease list. This is covered in 
the design doc.
For AM restarts, I am not sure what we need to do here. Does the AM-RM 
re-register protocol currently handle re-synchronization of outstanding new 
container requests after the AM is restarted? Could you elaborate a little bit 
on this?

bq. Also, why have the callback interface methods been made non-public? Would 
that be an incompatible change?

All interface methods are implicitly public and abstract. The existing public 
modifiers on these methods are redundant, so I removed them.
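
A schematic illustration of that Java rule (not the full interface, just the point about modifiers):
{code}
public interface CallbackHandler {
  void onContainersAllocated(List<Container> containers); // implicitly public abstract
  public abstract void onShutdownRequest();               // same meaning, redundant modifiers
}
{code}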

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-08 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949639#comment-14949639
 ] 

MENG DING commented on YARN-1509:
-

Had an offline discussion with [~leftnoteasy] and [~bikassaha]. Overall we 
agreed that we can combine the separate increase/decrease requests into one API 
in the client:

* Combine {{requestContainerResourceIncrease}} and 
{{requestContainerResourceDecrease}} into one API. For example:
{code}
  /**
   * Request container resource change before calling allocate.
   * Any previous pending resource change request of the same container will be
   * cancelled.
   *
   * @param container The container returned from the last successful resource
   *  allocation or resource change
   * @param capability  The target resource capability of the container
   */
  public abstract void requestContainerResourceChange(
  Container container, Resource capability);
{code}
User must pass in a container object (instead of just a container ID), and the 
target resource capability. Because the container object contains the existing 
container Resource, the AMRMClient can use that information to compare against 
the target resource to figure out if this is an increase or decrease request.

* There is *NO* need to change the AMRM protocol. 

* For the CallbackHandler methods, we can also combine 
{{onContainersResourceDecreased}} and {{onContainersResourceIncreased}} into 
one API:
{code}
public abstract void onContainersResourceChanged(
    List<Container> containers);
{code}
The user can compare the passed-in containers with the containers they have 
remembered to determine whether this is an increase or a decrease. Or maybe we 
can make it even simpler by doing something like the following (see the usage 
sketch below)? Thoughts?
{code}
public abstract void onContainersResourceChanged(
    List<Container> increasedContainers, List<Container> decreasedContainers);
{code}

* We can *deprecate* the existing CallbackHandler interface and use the 
AbstractCallbackHandler instead.
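
To make the proposal concrete, a hedged usage sketch from the AM side (hypothetical variable names; it assumes the combined change API and the two-list callback variant above):
{code}
// Ask for more memory on a running container; the client works out from the
// current vs. target capability whether this is an increase or a decrease.
Resource target = Resource.newInstance(2048, 1);
amrmClient.requestContainerResourceChange(runningContainer, target);

// Two-list callback variant, invoked once the RM approves the changes:
public void onContainersResourceChanged(List<Container> increasedContainers,
    List<Container> decreasedContainers) {
  for (Container c : increasedContainers) {
    // an increase still has to be applied on the NM (via NMClient) before the
    // increase token expires
  }
  for (Container c : decreasedContainers) {
    // decreases take effect on the RM side immediately
  }
}
{code}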

[~bikassaha], [~leftnoteasy], any comments?

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3026) Move application-specific container allocation logic from LeafQueue to FiCaSchedulerApp

2015-10-07 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947680#comment-14947680
 ] 

MENG DING commented on YARN-3026:
-

Hi, [~leftnoteasy]

I've got another question while studying this patch. Regarding the following 
code change:

{code}
@@ -1106,6 +958,11 @@ private Resource computeUserLimit(FiCaSchedulerApp 
application,
 queueCapacities.getAbsoluteCapacity(nodePartition),
 minimumAllocation);
 
+// Assume we have required resource equals to minimumAllocation, this can
+// make sure user limit can continuously increase till queueMaxResource
+// reached.
+Resource required = minimumAllocation;
+
{code}

Before this patch, the required resource is passed into the 
{{computeUserLimit}} function as a parameter, indicating the actual resource 
requirement. Now it is always set to minimumAllocation. I understand that the 
leaf queue won't know the required resource any more since the patch moves the 
application specific logic out of the {{LeafQueue.java}}, but the fact that the 
required resource can be set to some arbitrary value seems quite odd.

More specifically, it seems that the *required* resource only matters when the 
queue is over capacity while calculating the userLimit, but I am not sure how 
useful this userLimit is in that situation (i.e., over capacity). I know it is 
used to calculate the application headroom, but this headroom is not checked 
during resource allocation (which only checks the {{ResourceLimits.headroom}} 
set in {{AbstractCSQueue.canAssignToThisQueue}}).

Not sure if I've made myself clear... Thanks in advance for shedding some light 
on this :-)

> Move application-specific container allocation logic from LeafQueue to 
> FiCaSchedulerApp
> ---
>
> Key: YARN-3026
> URL: https://issues.apache.org/jira/browse/YARN-3026
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: capacityscheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.8.0
>
> Attachments: YARN-3026.1.patch, YARN-3026.2.patch, YARN-3026.3.patch, 
> YARN-3026.4.patch, YARN-3026.5.patch, YARN-3026.6.patch
>
>
> Have a discussion with [~vinodkv] and [~jianhe]: 
> In existing Capacity Scheduler, all allocation logics of and under LeafQueue 
> are located in LeafQueue.java in implementation. To make a cleaner scope of 
> LeafQueue, we'd better move some of them to FiCaSchedulerApp.
> Ideal scope of LeafQueue should be: when a LeafQueue receives some resources 
> from ParentQueue (like 15% of cluster resource), and it distributes resources 
> to children apps, and it should be agnostic to internal logic of children 
> apps (like delayed-scheduling, etc.). IAW, LeafQueue shouldn't decide how 
> application allocating container from given resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-07 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.5.patch

Thanks [~leftnoteasy]. Attaching the patch that addresses the comments.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-07 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947056#comment-14947056
 ] 

MENG DING commented on YARN-1509:
-

Test failure is not related to this patch.
Checkstyle warning is the same as before.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4175) Example of use YARN-1197

2015-10-06 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945988#comment-14945988
 ] 

MENG DING commented on YARN-4175:
-

Update on my testing results.

Based on my tests of this feature against a 4-node cluster using the modified 
distributed shell app, the only critical issue I found is an NPE in the 
ResourceManager when there is not enough headroom. The issue has been logged in 
YARN-4230. The only other minor issue I can think of is that some logging 
information can be improved, for which I will log a separate (low priority) 
issue.

The tests I performed so far include:
* Verify container resource increase/decrease when there are resources 
available, and no limits are exceeded. Verify container sizes are reported 
correctly on Web UI.
* Verify container resource increase reservation when host doesn't have enough 
resource for the additional allocation. Verify resource reservation information 
on Web UI (Memory Reserved, Lasts Reservation, etc)
* Verify that while an increase reservation is in place on a host, regular and 
increase allocation requests from other application will be skipped on this 
host.
* Verify that an increase reservation will be fulfilled when enough resource is 
freed up on the host.
* Verify that while increase reservation is in place for a container, a 
decrease request to the same container (with target resource <= original 
resource) will cancel the reservation.
* Verify that pending resource increase request will not be processed when 
there is no headroom left (after applying patch from YARN-4230).
* Verify that invalid resource increase/decrease request will throw exception 
in AMRMClient and distributed shell application master onError callback handler 
will be called.
* Verify that resource monitoring is changed on NM after container 
increase/decrease is completed.
* Verify that killing and restarting NM will recover increased/decreased 
containers if NM work preserving restart is enabled.
* All tests are verified using both DefaultResourceCalculator and 
DominantResourceCalculator.

Let me know if you have any comments or suggestions.

> Example of use YARN-1197
> 
>
> Key: YARN-4175
> URL: https://issues.apache.org/jira/browse/YARN-4175
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: Wangda Tan
>Assignee: MENG DING
> Attachments: YARN-4175.1.patch
>
>
> Like YARN-2609, we need a example program to demonstrate how to use YARN-1197 
> from end-to-end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4230) Increasing container resource while there is no headroom left will cause ResourceManager to crash

2015-10-06 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4230:

Attachment: YARN-4230.1.patch

The fix is simple. Attaching the patch with an added test case.

> Increasing container resource while there is no headroom left will cause 
> ResourceManager to crash
> -
>
> Key: YARN-4230
> URL: https://issues.apache.org/jira/browse/YARN-4230
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: MENG DING
>Assignee: MENG DING
>Priority: Critical
> Attachments: YARN-4230.1.patch
>
>
> This issue was found while doing end-to-end test of YARN-1197 in YARN-4175.
> When increasing resource of a container, if there is no headroom left for the 
> user, the ResourceManager crashes with NPE.
> The following is the stack trace:
> {code}
> 15/10/05 20:35:21 INFO capacity.ParentQueue: assignedContainer queue=root 
> usedCapacity=0.9375 absoluteUsedCapacity=0.9375 used= 
> cluster=
> 15/10/05 20:35:49 FATAL resourcemanager.ResourceManager: Error in handling 
> event type NODE_UPDATE to the scheduler
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.IncreaseContainerAllocator.assignContainers(IncreaseContainerAllocator.java:327)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:66)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:819)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:572)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1274)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:134)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:691)
> at java.lang.Thread.run(Thread.java:745)
> 15/10/05 20:35:49 INFO resourcemanager.ResourceManager: Exiting, bbye..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4230) Increasing container resource while there is no headroom left will cause ResourceManager to crash

2015-10-06 Thread MENG DING (JIRA)
MENG DING created YARN-4230:
---

 Summary: Increasing container resource while there is no headroom 
left will cause ResourceManager to crash
 Key: YARN-4230
 URL: https://issues.apache.org/jira/browse/YARN-4230
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: MENG DING
Assignee: MENG DING
Priority: Critical


This issue was found while doing end-to-end test of YARN-1197 in YARN-4175.

When increasing resource of a container, if there is no headroom left for the 
user, the ResourceManager crashes with NPE.

The following is the stack trace:

{code}
15/10/05 20:35:21 INFO capacity.ParentQueue: assignedContainer queue=root 
usedCapacity=0.9375 absoluteUsedCapacity=0.9375 used= 
cluster=
15/10/05 20:35:49 FATAL resourcemanager.ResourceManager: Error in handling 
event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.IncreaseContainerAllocator.assignContainers(IncreaseContainerAllocator.java:327)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:66)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:474)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:819)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:572)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:423)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1177)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1274)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:134)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:691)
at java.lang.Thread.run(Thread.java:745)
15/10/05 20:35:49 INFO resourcemanager.ResourceManager: Exiting, bbye..
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4175) Example of use YARN-1197

2015-10-05 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944039#comment-14944039
 ] 

MENG DING commented on YARN-4175:
-

I am using the example application to test the container increase/decrease 
function against a 4 node cluster. Will collect and report all problems when 
the tests are completed.

Just a quick note in case someone also wants to do the test:
* The application master IPC server now listens on a fixed port, 8686. If 
multiple app masters are started on the same host with the *-enable_ipc* option 
specified, there will be port conflicts, but YARN should be able to start new 
app attempts and try to launch the app master on a different host.
* If there is an invalid container resource change request (e.g., the target 
resource is smaller than the original resource for an increase), the AMRMClient 
will throw an exception (i.e. InvalidResourceRequestException) at the allocate 
call, and the current implementation of the distributed shell appmaster will 
exit, causing the entire application to exit (see the sketch below).
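
A minimal sketch (hypothetical, not the current distributed shell code) of how an AM could tolerate such a rejection instead of exiting:
{code}
// Hypothetical onError handling in the AM's RM callback handler; today the
// distributed shell AM treats any error as fatal and shuts down.
@Override
public void onError(Throwable e) {
  if (e instanceof InvalidResourceRequestException) {
    LOG.warn("Container resource change request rejected by the RM", e);
    return; // drop this request, keep the application running
  }
  // fall back to the existing fatal-error path for anything else
  shutdownApplication();
}
{code}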

> Example of use YARN-1197
> 
>
> Key: YARN-4175
> URL: https://issues.apache.org/jira/browse/YARN-4175
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: Wangda Tan
>Assignee: MENG DING
> Attachments: YARN-4175.1.patch
>
>
> Like YARN-2609, we need a example program to demonstrate how to use YARN-1197 
> from end-to-end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-05 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.4.patch

Submit the new patch that fixes the whitespace issue

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-10-05 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943406#comment-14943406
 ] 

MENG DING commented on YARN-1510:
-

* The release audit is not related.
* The failed test passed in my own environment after applying the patch, so it 
is not related.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-05 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943587#comment-14943587
 ] 

MENG DING commented on YARN-1509:
-

* release audit is not related
* will ask for a checkstyle exception:
** relaxed visibility is for testing purposes.
** the function length exceeding the limit is caused by long comments.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-10-02 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.3.patch

Attaching latest patch:
* Added AbstractCallbackHandler in AMRMClientAsync
* Added more debug logs

[~leftnoteasy], do you have any question/concern regarding my previous response?

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1510) Make NMClient support change container resources

2015-10-02 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1510:

Attachment: YARN-1510.4.patch

Attaching the latest patch, which adds an {{NMClientAsync.AbstractCallbackHandler}} 
class that implements the original {{NMClientAsync.CallbackHandler}} 
interface. The new methods are all defined in the abstract class. This makes sure 
that the build will not break when old applications are compiled against the new 
client API.

The same will be done for {{AMRMClientAsync}}.
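
The shape of the compatibility trick, as a simplified sketch (not the exact patch code): the old interface is left untouched so existing implementers still compile, and the new callbacks only exist on the abstract class that new code extends.
{code}
public abstract static class AbstractCallbackHandler implements CallbackHandler {
  // new callbacks introduced by container resizing live only here
  public abstract void onContainerResourceIncreased(
      ContainerId containerId, Resource resource);
  public abstract void onIncreaseContainerResourceError(
      ContainerId containerId, Throwable t);
}
{code}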

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4175) Example of use YARN-1197

2015-10-02 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4175:

Attachment: YARN-4175.1.patch

Submit the initial patch for review. Will not kick jenkins as it depends on 
YARN-1509 and YARN-1510.

As mentioned in the previous post, this ticket enhances the DistributedShell 
application to enable the option to start an IPC service in the application 
master. Once started, user can use the Client program to talk to IPC service to 
issue the increase/decrease container resource requests.

More specifically:
* To enable the IPC service in the application master, the user needs to specify 
the *enable_ipc* option. For example:
{code}
hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -jar /usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.0.0.jar -shell_command "sleep 10" -num_containers 10 -enable_ipc
{code}

* Once the application has started, the user can start a new client and specify the 
*appmaster* option to set the client to the appmaster mode. Under this mode, 
the user can specify *application_id*, *container_id*, *action*, 
*container_memory*, *container_vcores* options to request container resizing. 
For example, to increase a container resource, the user can do:
{code}
hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -appmaster -application_id=<application ID> -container_id=<container ID> -action=INCREASE_CONTAINER -container_memory=2048 -container_vcores=1
{code}

If you want to try this patch, you need to apply YARN-1510 and YARN-1509 first, 
or wait until these two patches are committed.

> Example of use YARN-1197
> 
>
> Key: YARN-4175
> URL: https://issues.apache.org/jira/browse/YARN-4175
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: Wangda Tan
>Assignee: MENG DING
> Attachments: YARN-4175.1.patch
>
>
> Like YARN-2609, we need a example program to demonstrate how to use YARN-1197 
> from end-to-end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4175) Example of use YARN-1197

2015-09-30 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938919#comment-14938919
 ] 

MENG DING commented on YARN-4175:
-

Update on the progress of this ticket:

The example will be based on the existing DistributedShell application. The 
idea is to add an RPC service to the DistributedShell application master, and 
also a client to issue requests to this service to increase/decrease container 
resources after the application is started.

The patch is almost ready and under testing. Will post it for review soon.

> Example of use YARN-1197
> 
>
> Key: YARN-4175
> URL: https://issues.apache.org/jira/browse/YARN-4175
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: Wangda Tan
>Assignee: MENG DING
>
> Like YARN-2609, we need a example program to demonstrate how to use YARN-1197 
> from end-to-end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-09-29 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935681#comment-14935681
 ] 

MENG DING commented on YARN-1509:
-

Thanks for the review [~leftnoteasy]!

bq. I think we can simply add decreaseList to decrease and increaseList to 
increase.

In most cases, the current logic effectively adds decreaseList to the decrease map, 
and increaseList to the increase map. But since the allocate call 
{{allocateResponse = allocate(progressIndicator)}} is not synchronized, new 
increase/decrease requests may have been added to the increase/decrease tables 
during the allocation, and IMO these should not be overwritten by the old 
requests cached in increaseList and decreaseList. This is similar to the logic 
for new container requests when allocation fails. Let me know if you think 
otherwise.
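
For illustration, a sketch of the idea (hypothetical map and variable names; the real patch keys its maps differently): when putting back requests cached before a failed allocate, a newer entry for the same container wins.
{code}
// On allocate failure, restore the cached change requests, but never clobber
// a request the user submitted while the allocate call was in flight.
for (Map.Entry<ContainerId, Resource> cached : cachedIncreaseRequests.entrySet()) {
  if (!pendingIncrease.containsKey(cached.getKey())) {
    pendingIncrease.put(cached.getKey(), cached.getValue());
  }
}
{code}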

bq. if request matches, we can print some logs to show this

Will do.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-09-28 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.2.patch

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-09-28 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.1.patch

Submit the first patch for review.

[~leftnoteasy], recall that during the design stage we discussed the requirement for an 
AMRMClient API to get the latest approved increase request. I think at that 
time the reason was that we wanted to get the latest approved increase request 
and use it to poll the NM to see whether the increase had completed. But 
since we have changed the increase action on the NM to be blocking, I can't think 
of any real use case for this API anymore. What do you think?

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch
>
>
> As described in YARN-1197, we need add API in AMRMClient to support
> 1) Add increase request
> 2) Can get successfully increased/decreased containers from RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4175) Example of use YARN-1197

2015-09-25 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908669#comment-14908669
 ] 

MENG DING commented on YARN-4175:
-

Hi, [~leftnoteasy], if you are not working on this right now, I will be happy 
to take this one after YARN-1509 and YARN-1510 are done. I understand the 
urgency of the end-to-end test, and will treat this as the highest priority.

> Example of use YARN-1197
> 
>
> Key: YARN-4175
> URL: https://issues.apache.org/jira/browse/YARN-4175
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> Like YARN-2609, we need a example program to demonstrate how to use YARN-1197 
> from end-to-end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-09-25 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908436#comment-14908436
 ] 

MENG DING commented on YARN-1510:
-

I forgot to say that the failed tests are not related, as they all passed in my 
local environment.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-09-25 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908456#comment-14908456
 ] 

MENG DING commented on YARN-1510:
-

One thing I want to bring up for discussion is that this ticket adds two 
methods to the public callback interface {{NMClientAsync.CallbackHandler}}. 
This means that if users want to rebuild their ApplicationMaster against the new 
client library, they will have to modify their code to implement these two 
methods. I don't see a way around this unless we decide not to add these two 
methods.

{code}
+/**
+ * The API is called when NodeManager responds to indicate
+ * the container resource has been successfully increased.
+ * @param containerId the Id of the container
+ * @param resource the target resource of the container
+ */
+void onContainerResourceIncreased(
+ContainerId containerId, Resource resource);


+/**
+ * The API is called when an exception is raised in the process of
+ * increasing container resource.
+ * @param containerId the Id of the container
+ * @param t the raised exception
+ */
+void onIncreaseContainerResourceError(
+ContainerId containerId, Throwable t);
{code}

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1510) Make NMClient support change container resources

2015-09-24 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1510:

Attachment: YARN-1510.3.patch

YARN-1197 has been merged into trunk. Attaching new patch based on trunk.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch
>
>
> As described in YARN-1197, YARN-1449, we need add API in NMClient to support
> 1) sending request of increase/decrease container resource limits
> 2) get succeeded/failed changed containers response from NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

