[jira] [Commented] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk

2018-03-07 Thread Tao Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390857#comment-16390857
 ] 

Tao Yang commented on YARN-8011:


Thanks Weiwei Yang for your suggestions.

Yes, we can fix this problem in a more efficient way like {{rm.drainEvents()}}. I 
replaced several unnecessary {{Thread.sleep}} calls with {{rm.drainEvents()}}, and 
the running time of {{TestOpportunisticContainerAllocatorAMService}} decreased from 
18s to 13s in my local test.

There are still several places using Thread.sleep because they need to wait for 
the periodic node sorting in NodeQueueLoadMonitor.

The v2 patch is uploaded; please help review it again. Thanks.
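
For clarity, a minimal sketch of this kind of change (assuming {{rm}} is the MockRM 
instance the test already uses; nothing below is copied from the actual patch):

{code:java}
// Before: a fixed sleep, which is both slow and still racy.
// Thread.sleep(1000);

// After: drain all pending events in the RM dispatcher so that the asynchronous
// resource deduction has been applied before the metrics assertion runs.
rm.drainEvents();

// Waits that depend on the periodic node sorting in NodeQueueLoadMonitor are
// timer-driven, so those places still need a real delay (or a polling wait).
{code}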

> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  fails sometimes in trunk
> ---
>
> Key: YARN-8011
> URL: https://issues.apache.org/jira/browse/YARN-8011
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: YARN-8011.001.patch, YARN-8011.002.patch
>
>
> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  often passes, but the following error sometimes occurs:
> {noformat}
> java.lang.AssertionError: 
> Expected :15360
> Actual :14336
> 
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}
>  
> This problem is caused by the resource deduction happening slightly later than 
> the assertion. To solve this problem, we can sleep for a while before this 
> assertion, as below.
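
(The "as below" snippet was cut from this mail; illustratively, the originally 
proposed workaround was just a short fixed sleep before the metrics assertion, 
roughly like this:)

{code:java}
// Give the asynchronous scheduler events time to apply the resource deduction
// before the metrics assertion runs (later superseded by rm.drainEvents()).
Thread.sleep(1000);
{code}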



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk

2018-03-07 Thread Tao Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8011:
---
Attachment: YARN-8011.002.patch

> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  fails sometimes in trunk
> ---
>
> Key: YARN-8011
> URL: https://issues.apache.org/jira/browse/YARN-8011
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: YARN-8011.001.patch, YARN-8011.002.patch
>
>
> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  often passes, but the following error sometimes occurs:
> {noformat}
> java.lang.AssertionError: 
> Expected :15360
> Actual :14336
> 
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}
>  
> This problem is caused by the resource deduction happening slightly later than 
> the assertion. To solve this problem, we can sleep for a while before this 
> assertion, as below.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8002) Support NOT_SELF and ALL namespace types for allocation tag

2018-03-07 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390843#comment-16390843
 ] 

Weiwei Yang commented on YARN-8002:
---

Hi [~asuresh], [~leftnoteasy], [~kkaranasos], any chance you could help review this 
patch? 

> Support NOT_SELF and ALL namespace types for allocation tag
> ---
>
> Key: YARN-8002
> URL: https://issues.apache.org/jira/browse/YARN-8002
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8002.001.patch, YARN-8002.002.patch
>
>
> This is a continuation task after YARN-7972. YARN-7972 adds support for 
> specifying tags with the namespaces SELF and APP_ID, like the following:
>  * self/
>  * app-id//
> This task is to track the work to support 2 of the remaining namespace types, 
> *NOT_SELF* & *ALL* (we'll support app-label later):
>  * not-self/
>  * all/
> This will require a bit of refactoring in {{AllocationTagsManager}}, as it needs 
> to do some proper aggregation of tags across multiple apps.
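
For context, allocation tags are consumed through placement constraints; a minimal 
anti-affinity sketch against a tag (today resolved in the SELF namespace, which this 
JIRA extends with not-self/ and all/) could look roughly like the following. The tag 
name is made up and nothing here is taken from the attached patches.

{code:java}
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetNotIn;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

public class TagAntiAffinitySketch {
  public static void main(String[] args) {
    // "Do not place this allocation on a node that already holds an allocation
    // tagged 'hbase-master'". With YARN-8002 the tag could instead be resolved
    // against the not-self/ or all/ namespaces.
    PlacementConstraint noHBaseMasterOnNode =
        targetNotIn(NODE, allocationTag("hbase-master")).build();
    System.out.println(noHBaseMasterOnNode);
  }
}
{code}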



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-8012:

Description: 
An *unmanaged container / leaked container* is a container which is no longer 
managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
leaked from YARN as well.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 * NM service is disabled or removed on the node.
 * NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 * NM has bugs, such as wrongly marking a live container as complete.

Note: these cases arise, or get worse, when work-preserving NM restart is 
enabled; see YARN-1336.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN on the node:
 ** Causes a YARN resource leak on the node.
 ** The container cannot be killed to release its YARN resources on the node and 
free them up for other urgent computations on the node.
 # Container and App killing is not eventually consistent for the App user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.

  was:
An *unmanaged container / leaked container* is a container which is no longer 
managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
leaked from YARN as well.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 * NM service is disabled or removed on the node.
 * NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 * NM has bugs, such as wrongly marking a live container as complete.

Note: these cases arise, or get worse, when work-preserving NM restart is 
enabled; see YARN-1336.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN on the node:
 ** Causes a YARN resource leak on the node.
 ** The container cannot be killed to release its YARN resources on the node and 
free them up for other native processes on the node.
 # Container and App killing is not eventually consistent for the App user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.


> Support Unmanaged Container Cleanup
> ---
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
> leaked from YARN as well.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  * NM service is disabled or removed on the node.
>  * NM is unable to start up again on the node, e.g. because a required 
> configuration or resource is not ready.
>  * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  * NM has bugs, such as wrongly marking a live container as complete.
> Note: these cases arise, or get worse, when work-preserving NM restart is 
> enabled; see YARN-1336.
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN on the node:
>  ** Causes a YARN resource leak on the node.
>  ** The container cannot be killed to release its YARN resources on the node and 
> free them up for other urgent computations on the node.
>  # Container and App killing is not eventually consistent for the App user:
>  ** A buggy App can still produce bad external impacts even long after the App 
> has been killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-8012:

Description: 
An *unmanaged container / leaked container* is a container which is no longer 
managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
leaked from YARN as well.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 * NM service is disabled or removed on the node.
 * NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 * NM has bugs, such as wrongly marking a live container as complete.

Note: these cases arise, or get worse, when work-preserving NM restart is 
enabled; see YARN-1336.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN on the node:
 ** Causes a YARN resource leak on the node.
 ** The container cannot be killed to release its YARN resources on the node and 
free them up for other native processes on the node.
 # Container and App killing is not eventually consistent for the App user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.

  was:
An *unmanaged container / leaked container* is a container which is no longer 
managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
leaked from YARN as well.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 * NM service is disabled or removed on the node.
 * NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 * NM has bugs, such as wrongly marking a live container as complete.

Note: these cases arise, or get worse, when work-preserving NM restart is 
enabled; see YARN-1336.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN on the node:
 ** Causes a YARN resource leak on the node.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the App user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.


> Support Unmanaged Container Cleanup
> ---
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
> leaked from YARN as well.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  * NM service is disabled or removed on the node.
>  * NM is unable to start up again on the node, e.g. because a required 
> configuration or resource is not ready.
>  * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  * NM has bugs, such as wrongly marking a live container as complete.
> Note: these cases arise, or get worse, when work-preserving NM restart is 
> enabled; see YARN-1336.
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN on the node:
>  ** Causes a YARN resource leak on the node.
>  ** The container cannot be killed to release its YARN resources on the node and 
> free them up for other native processes on the node.
>  # Container and App killing is not eventually consistent for the App user:
>  ** A buggy App can still produce bad external impacts even long after the App 
> has been killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-8012:

Description: 
An *unmanaged container / leaked container* is a container which is no longer 
managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
leaked from YARN as well.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 * NM service is disabled or removed on the node.
 * NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 * NM has bugs, such as wrongly marking a live container as complete.

Note: these cases arise, or get worse, when work-preserving NM restart is 
enabled; see YARN-1336.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN on the node:
 ** Causes a YARN resource leak on the node.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the App user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.

  was:
An *unmanaged container / leaked container* is a container which is no longer 
managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
leaked from YARN as well.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 * NM service is disabled or removed on the node.
 * NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 * NM has bugs, such as wrongly marking a live container as complete.

Things get worse if work-preserving NM restart is enabled; see YARN-1336.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN on the node:
 ** Causes a YARN resource leak on the node.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the App user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.


> Support Unmanaged Container Cleanup
> ---
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
> leaked from YARN as well.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  * NM service is disabled or removed on the node.
>  * NM is unable to start up again on the node, e.g. because a required 
> configuration or resource is not ready.
>  * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  * NM has bugs, such as wrongly marking a live container as complete.
> Note: these cases arise, or get worse, when work-preserving NM restart is 
> enabled; see YARN-1336.
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN on the node:
>  ** Causes a YARN resource leak on the node.
>  ** The container cannot be killed to release its YARN resources on the node.
>  # Container and App killing is not eventually consistent for the App user:
>  ** A buggy App can still produce bad external impacts even long after the App 
> has been killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390782#comment-16390782
 ] 

Yuqi Wang edited comment on YARN-8012 at 3/8/18 6:14 AM:
-

*Initial patch for review:*

For the initial patch, the unmanaged container cleanup feature on Windows can 
only clean up the container job object of the unmanaged container.

{color:#f79232}The UT will be added once the design is agreed on.{color}

The current container will be considered unmanaged when:
 # NM is dead:
 ** Failed to check whether the container is managed by the NM within the timeout.
 # NM is alive but the container is 
 org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found:
 ** The container is org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE 
or not found in the NM container list.


was (Author: yqwang):
*Initial patch for review:*

For the initial patch, the unmanaged container cleanup feature on Windows can 
only clean up the container job object of the unmanaged container. 
{color:#f79232}{color:#59afe1}Cleanup of more container resources will be 
supported later, and the UT will be added once the design is agreed on.{color}{color}

The current container will be considered unmanaged when:
 # NM is dead:
 ** Failed to check whether the container is managed by the NM within the timeout.
 # NM is alive but the container is 
 org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found:
 ** The container is org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE 
or not found in the NM container list.

> Support Unmanaged Container Cleanup
> ---
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
> leaked from YARN as well.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  * NM service is disabled or removed on the node.
>  * NM is unable to start up again on the node, e.g. because a required 
> configuration or resource is not ready.
>  * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  * NM has bugs, such as wrongly marking a live container as complete.
> Things get worse if work-preserving NM restart is enabled; see YARN-1336.
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN on the node:
>  ** Causes a YARN resource leak on the node.
>  ** The container cannot be killed to release its YARN resources on the node.
>  # Container and App killing is not eventually consistent for the App user:
>  ** A buggy App can still produce bad external impacts even long after the App 
> has been killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390782#comment-16390782
 ] 

Yuqi Wang edited comment on YARN-8012 at 3/8/18 6:13 AM:
-

*Initial patch for review:*

For the initial patch, the unmanaged container cleanup feature on Windows can 
only clean up the container job object of the unmanaged container. 
{color:#f79232}{color:#59afe1}Cleanup of more container resources will be 
supported later, and the UT will be added once the design is agreed on.{color}{color}

The current container will be considered unmanaged when:
 # NM is dead:
 ** Failed to check whether the container is managed by the NM within the timeout.
 # NM is alive but the container is 
 org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found:
 ** The container is org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE 
or not found in the NM container list.


was (Author: yqwang):
*Initial patch for review:*

For the initial patch, the unmanaged container cleanup feature on Windows can 
only clean up the container job object of the unmanaged container. 
{color:#59afe1}Cleanup of more container resources will be supported later, and 
the UT will be added once the design is agreed on.{color}

The current container will be considered unmanaged when:
 # NM is dead:
 ** Failed to check whether the container is managed by the NM within the timeout.
 # NM is alive but the container is 
org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found:
 ** The container is org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE 
or not found in the NM container list.

> Support Unmanaged Container Cleanup
> ---
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
> leaked from YARN as well.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  * NM service is disabled or removed on the node.
>  * NM is unable to start up again on the node, e.g. because a required 
> configuration or resource is not ready.
>  * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  * NM has bugs, such as wrongly marking a live container as complete.
> Things get worse if work-preserving NM restart is enabled; see YARN-1336.
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN on the node:
>  ** Causes a YARN resource leak on the node.
>  ** The container cannot be killed to release its YARN resources on the node.
>  # Container and App killing is not eventually consistent for the App user:
>  ** A buggy App can still produce bad external impacts even long after the App 
> has been killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk

2018-03-07 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390796#comment-16390796
 ] 

Weiwei Yang commented on YARN-8011:
---

Hi [~Tao Yang]

Thanks for the patch, just one comment: can we replace the sleep with waitFor, 
which should be more robust for testing?

Thank you.
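
For illustration, a waitFor-based version of such a check might look like the 
following (a sketch only; the condition shown, {{metrics.getAvailableMB()}} reaching 
the expected value, is just an example stand-in for whatever the test actually 
asserts):

{code:java}
// org.apache.hadoop.test.GenericTestUtils
// Poll until the RM has applied the resource deduction, instead of relying on a
// fixed Thread.sleep(); the test fails if the condition never becomes true.
GenericTestUtils.waitFor(
    () -> metrics.getAvailableMB() == 15360,  // example condition only
    100,      // re-check every 100 ms
    10000);   // time out after 10 s
{code}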

> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  fails sometimes in trunk
> ---
>
> Key: YARN-8011
> URL: https://issues.apache.org/jira/browse/YARN-8011
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: YARN-8011.001.patch
>
>
> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  often passes, but the following error sometimes occurs:
> {noformat}
> java.lang.AssertionError: 
> Expected :15360
> Actual :14336
> 
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}
>  
> This problem is caused by the resource deduction happening slightly later than 
> the assertion. To solve this problem, we can sleep for a while before this 
> assertion, as below.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-8012:

Description: 
An *unmanaged container / leaked container* is a container which is no longer 
managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
leaked from YARN as well.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 * NM service is disabled or removed on the node.
 * NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 * NM has bugs, such as wrongly marking a live container as complete.

Things get worse if work-preserving NM restart is enabled; see YARN-1336.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN on the node:
 ** Causes a YARN resource leak on the node.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the App user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.

  was:
An *unmanaged container / leaked container* is a container which is no longer 
managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
leaked from YARN as well.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 * NM service is disabled or removed on the node.
 * NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 * NM has bugs, such as wrongly marking a live container as complete.

Things get worse if work-preserving NM restart is enabled; see 
[YARN-1336|https://issues.apache.org/jira/browse/YARN-1336].

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN and the node:
 ** Causes a YARN and node resource leak.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.


> Support Unmanaged Container Cleanup
> ---
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
> leaked from YARN as well.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  * NM service is disabled or removed on the node.
>  * NM is unable to start up again on the node, e.g. because a required 
> configuration or resource is not ready.
>  * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  * NM has bugs, such as wrongly marking a live container as complete.
> Things get worse if work-preserving NM restart is enabled; see YARN-1336.
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN on the node:
>  ** Causes a YARN resource leak on the node.
>  ** The container cannot be killed to release its YARN resources on the node.
>  # Container and App killing is not eventually consistent for the App user:
>  ** A buggy App can still produce bad external impacts even long after the App 
> has been killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-8012:

Description: 
An *unmanaged container / leaked container* is a container which is no longer 
managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
leaked from YARN as well.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 * NM service is disabled or removed on the node.
 * NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 * NM has bugs, such as wrongly marking a live container as complete.

Things get worse if work-preserving NM restart is enabled; see 
[YARN-1336|https://issues.apache.org/jira/browse/YARN-1336].

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN and the node:
 ** Causes a YARN and node resource leak.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.

  was:
An *unmanaged container* is a container which is no longer managed by the NM. 
Thus, it can no longer be managed by YARN either.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 # For container resources managed by YARN, such as the container job object and 
disk data:
 ** NM service is disabled or removed on the node.
 ** NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 ** NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 ** NM has bugs, such as wrongly marking a live container as complete.
 # For container resources unmanaged by YARN:
 ** User breaks processes away from the container job object.
 ** User creates VMs from the container job object.
 ** User acquires other resources on the machine which are unmanaged by YARN, 
such as producing data outside the container folder.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN and the node:
 ** Causes a YARN and node resource leak.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.


> Support Unmanaged Container Cleanup
> ---
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it is 
> leaked from YARN as well.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  * NM service is disabled or removed on the node.
>  * NM is unable to start up again on the node, e.g. because a required 
> configuration or resource is not ready.
>  * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  * NM has bugs, such as wrongly marking a live container as complete.
> Things get worse if work-preserving NM restart is enabled; see 
> [YARN-1336|https://issues.apache.org/jira/browse/YARN-1336].
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN and the node:
>  ** Causes a YARN and node resource leak.
>  ** The container cannot be killed to release its YARN resources on the node.
>  # Container and App killing is not eventually consistent for the user:
>  ** A buggy App can still produce bad external impacts even long after the App 
> has been killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-03-07 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390792#comment-16390792
 ] 

Yufei Gu commented on YARN-5028:


It is a nice-to-have one. Trunk is good enough. Thanks [~leftnoteasy].

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch, YARN-5028.007-addendum.patch, 
> YARN-5028.007-addendum.patch, YARN-5028.007.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-8012:

Description: 
An *unmanaged container* is a container which is no longer managed by the NM. 
Thus, it can no longer be managed by YARN either.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 # For container resources managed by YARN, such as the container job object and 
disk data:
 ** NM service is disabled or removed on the node.
 ** NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 ** NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 ** NM has bugs, such as wrongly marking a live container as complete.
 # For container resources unmanaged by YARN:
 ** User breaks processes away from the container job object.
 ** User creates VMs from the container job object.
 ** User acquires other resources on the machine which are unmanaged by YARN, 
such as producing data outside the container folder.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN and the node:
 ** Causes a YARN and node resource leak.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.

  was:
An *unmanaged container* is a container which is no longer managed by the NM. 
Thus, it can no longer be managed by YARN either.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 # For container resources managed by YARN, such as the container job object and 
disk data:
 ** NM service is disabled or removed on the node.
 ** NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 ** NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 ** NM has bugs, such as wrongly marking a live container as complete.
 # For container resources unmanaged by YARN:
 ** User breaks processes away from the container job object.
 ** User creates VMs from the container job object.
 ** User acquires other resources on the machine which are unmanaged by YARN, 
such as producing data outside the container folder.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN and the node:
 ** Causes a YARN and node resource leak.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.

*Initial patch for review:*

For the initial patch, the unmanaged container cleanup feature on Windows can 
only clean up the container job object of the unmanaged container. Cleanup of 
more container resources will be supported later, and the UT will be added once 
the design is agreed on.

The current container will be considered unmanaged when:
 # NM is dead:
 ** Failed to check whether the container is managed by the NM within the timeout.
 # NM is alive but the container is 
 org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found:
 ** The container is org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE 
or not found in the NM container list.


> Support Unmanaged Container Cleanup
> ---
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container* is a container which is no longer managed by the NM. 
> Thus, it can no longer be managed by YARN either.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  # For container resources managed by YARN, such as the container job object
>  and disk data:
>  ** NM service is disabled or removed on the node.
>  ** NM is unable to start up again on the node, e.g. because a required 
> configuration or resource is not ready.
>  ** NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  ** NM has bugs, such as wrongly marking a live container as complete.
>  # For container resources unmanaged by YARN:
>  ** User breaks processes away from the container job object.
>  ** User creates VMs from the container job object.
>  ** User acquires other resources on the machine which are unmanaged by
>  YARN, such as producing data outside the container folder.
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN and the node:
>  ** Causes a YARN and node resource leak.
>  ** The container cannot be killed to release its YARN resources on the node.
>  # Container and App 

[jira] [Updated] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-8012:

Attachment: YARN-8012-branch-2.7.1.001.patch

> Support Unmanaged Container Cleanup
> ---
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container* is a container which is no longer managed by the NM. 
> Thus, it can no longer be managed by YARN either.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  # For container resources managed by YARN, such as the container job object
>  and disk data:
>  ** NM service is disabled or removed on the node.
>  ** NM is unable to start up again on the node, e.g. because a required 
> configuration or resource is not ready.
>  ** NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  ** NM has bugs, such as wrongly marking a live container as complete.
>  # For container resources unmanaged by YARN:
>  ** User breaks processes away from the container job object.
>  ** User creates VMs from the container job object.
>  ** User acquires other resources on the machine which are unmanaged by
>  YARN, such as producing data outside the container folder.
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN and the node:
>  ** Causes a YARN and node resource leak.
>  ** The container cannot be killed to release its YARN resources on the node.
>  # Container and App killing is not eventually consistent for the user:
>  ** A buggy App can still produce bad external impacts even long after the App 
> has been killed.
> *Initial patch for review:*
> For the initial patch, the unmanaged container cleanup feature on Windows can 
> only clean up the container job object of the unmanaged container. Cleanup of 
> more container resources will be supported later, and the UT will be added once 
> the design is agreed on.
> The current container will be considered unmanaged when:
>  # NM is dead:
>  ** Failed to check whether the container is managed by the NM within the timeout.
>  # NM is alive but the container is 
>  org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found:
>  ** The container is 
> org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or
>  not found in the NM container list.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)
Yuqi Wang created YARN-8012:
---

 Summary: Support Unmanaged Container Cleanup
 Key: YARN-8012
 URL: https://issues.apache.org/jira/browse/YARN-8012
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Yuqi Wang
Assignee: Yuqi Wang
 Fix For: 2.7.1


An *unmanaged container* is a container which is no longer managed by the NM. 
Thus, it can no longer be managed by YARN either.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 # For container resources managed by YARN, such as the container job object
 and disk data:
 ** NM service is disabled or removed on the node.
 ** NM is unable to start up again on the node, e.g. because a required 
configuration or resource is not ready.
 ** NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 ** NM has bugs, such as wrongly marking a live container as complete.
 # For container resources unmanaged by YARN:
 ** User breaks processes away from the container job object.
 ** User creates VMs from the container job object.
 ** User acquires other resources on the machine which are unmanaged by
 YARN, such as producing data outside the container folder.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed for YARN and the node:
 ** Causes a YARN and node resource leak.
 ** The container cannot be killed to release its YARN resources on the node.
 # Container and App killing is not eventually consistent for the user:
 ** A buggy App can still produce bad external impacts even long after the App 
has been killed.

*Initial patch for review:*

For the initial patch, the unmanaged container cleanup feature on Windows can 
only clean up the container job object of the unmanaged container. Cleanup of 
more container resources will be supported later, and the UT will be added once 
the design is agreed on.

The current container will be considered unmanaged when:
 # NM is dead:
 ** Failed to check whether the container is managed by the NM within the timeout.
 # NM is alive but the container is 
 org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found:
 ** The container is org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE 
or not found in the NM container list.
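
A rough sketch of that check using the public client API (method and variable names 
here are illustrative and not taken from the attached patch):

{code:java}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class UnmanagedCheckSketch {
  /** Returns true if the NM is alive and reports the container as COMPLETE or unknown. */
  static boolean isUnmanagedByAliveNM(NMClient nmClient, ContainerId id, NodeId node) {
    try {
      ContainerStatus status = nmClient.getContainerStatus(id, node);
      return status.getState() == ContainerState.COMPLETE;
    } catch (YarnException e) {
      // NM is alive but does not know the container -> treat it as unmanaged.
      return true;
    } catch (IOException e) {
      // NM unreachable: the caller keeps retrying; only after the configured
      // timeout ("NM is dead") would the container be declared unmanaged.
      return false;
    }
  }
}
{code}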



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7872) labeled node cannot be used to satisfy locality specified request

2018-03-07 Thread Yuqi Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-7872:

Target Version/s: 3.0.0, 2.7.2  (was: 2.7.2)

> labeled node cannot be used to satisfy locality specified request
> -
>
> Key: YARN-7872
> URL: https://issues.apache.org/jira/browse/YARN-7872
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Blocker
> Fix For: 2.7.2
>
> Attachments: YARN-7872-branch-2.7.2.001.patch
>
>
> *Issue summary:*
> A labeled node (i.e. a node with a non-empty node label) cannot be used to 
> satisfy a locality-specified request (i.e. a container request with a resource 
> name other than ANY and relaxLocality set to false).
>  
> *For example:*
> The node with available resources:
> [Resource: [MemoryMB: [100] CpuNumber: [12]] {color:#14892c}NodeLabel: 
> [persistent]{color} {color:#f79232}HostName: \{SRG}{color} RackName: 
> \{/default-rack}]
> The container request:
>  [Priority: [1] Resource: [MemoryMB: [1] CpuNumber: [1]] 
> {color:#14892c}NodeLabel: [null]{color} {color:#f79232}HostNames: 
> \{SRG}{color} RackNames: {} {color:#59afe1}RelaxLocality: [false]{color}]
> The current RM capacity scheduler behavior (at least in versions 2.7 and 2.8) 
> is that the node cannot allocate a container for the request, because the node 
> label does not match when the leaf queue assigns containers.
>  
> *Possible solution:*
> However, node locality and node label should be two orthogonal dimensions for 
> selecting candidate nodes for a container request. Node label matching should 
> only be performed for container requests with the ANY resource name, since only 
> that kind of container request is allowed to carry a non-empty node label.
> So, for a container request with a resource name other than ANY (which we know 
> should not carry a node label), we should match the node against the requested 
> resource name instead of the requested node label. This resource name matching 
> should be safe, since a node whose node label is not accessible to the queue 
> will not be offered to the leaf queue.
>  
> *Discussion:*
> The attachment is a fix following this principle; please help review it.
> Without it, we cannot use locality to request containers on these labeled 
> nodes.
> If the fix is acceptable, we should also recheck whether the same issue exists 
> in trunk and other Hadoop versions.
> If it is not acceptable (i.e. the current behavior is by design), then how can 
> we use locality to request containers on these labeled nodes?
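
For reference, the container request in the example above can be expressed with the 
public {{AMRMClient}} API roughly as follows (a sketch only; values are taken from 
the example and nothing here comes from the attached patch):

{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LabeledNodeLocalitySketch {
  public static void main(String[] args) {
    // A locality-specified request: explicit host ("not ANY" resource name),
    // relaxLocality = false, and therefore no node label on the request itself.
    // The question in this JIRA is whether the 'persistent'-labeled node SRG
    // (accessible to the queue) can be used to satisfy it.
    ContainerRequest request = new ContainerRequest(
        Resource.newInstance(1, 1),   // MemoryMB: 1, CpuNumber: 1
        new String[] {"SRG"},         // requested host
        null,                         // no rack names
        Priority.newInstance(1),
        false);                       // relaxLocality = false
    System.out.println(request);
  }
}
{code}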



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8010) add config in FederationRMFailoverProxy to not bypass facade cache when failing over

2018-03-07 Thread Botong Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8010:
---
Attachment: YARN-8010.v1.patch

> add config in FederationRMFailoverProxy to not bypass facade cache when 
> failing over
> 
>
> Key: YARN-8010
> URL: https://issues.apache.org/jira/browse/YARN-8010
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Attachments: YARN-8010.v1.patch
>
>
> Today, when the YarnRM is failing over, the FederationRMFailoverProxy running 
> in AMRMProxy performs failover, tries to get the latest subcluster info from 
> FederationStateStore, and then retries connecting to the latest YarnRM master. 
> When calling getSubCluster() on FederationStateStoreFacade, it bypasses the 
> cache with a flush flag. While the YarnRM is failing over, every AM heartbeat 
> thread creates a different thread inside FederationInterceptor, each of which 
> keeps performing failover several times. This leads to a big spike of 
> getSubCluster calls to FederationStateStore. 
> Depending on the cluster setup (e.g. putting a VIP in front of all YarnRMs), a 
> YarnRM master/slave change might not result in an RM address change. In other 
> cases, a small delay in getting the latest subcluster information may be 
> acceptable. This patch therefore adds a config option, so that it is possible 
> to ask the FederationRMFailoverProxy not to flush the cache when calling 
> getSubCluster(). 
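
As a sketch of how such an option would be consumed (the property name below is 
hypothetical; the real key and default are whatever the attached patch defines):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederationFailoverCacheSketch {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Hypothetical key, for illustration only -- see YARN-8010.v1.patch for the
    // actual configuration name and default value.
    conf.setBoolean("yarn.federation.failover.flush-subcluster-cache", false);
    // With the flag off, FederationRMFailoverProxy would keep serving the cached
    // subcluster info during RM failover instead of flushing the facade cache.
    System.out.println(
        conf.getBoolean("yarn.federation.failover.flush-subcluster-cache", true));
  }
}
{code}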



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8010) add config in FederationRMFailoverProxy to not bypass facade cache when failing over

2018-03-07 Thread Botong Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8010:
---
Issue Type: Sub-task  (was: Task)
Parent: YARN-5597

> add config in FederationRMFailoverProxy to not bypass facade cache when 
> failing over
> 
>
> Key: YARN-8010
> URL: https://issues.apache.org/jira/browse/YARN-8010
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Attachments: YARN-8010.v1.patch
>
>
> Today, when the YarnRM is failing over, the FederationRMFailoverProxy running 
> in AMRMProxy performs failover, tries to get the latest subcluster info from 
> FederationStateStore, and then retries connecting to the latest YarnRM master. 
> When calling getSubCluster() on FederationStateStoreFacade, it bypasses the 
> cache with a flush flag. While the YarnRM is failing over, every AM heartbeat 
> thread creates a different thread inside FederationInterceptor, each of which 
> keeps performing failover several times. This leads to a big spike of 
> getSubCluster calls to FederationStateStore. 
> Depending on the cluster setup (e.g. putting a VIP in front of all YarnRMs), a 
> YarnRM master/slave change might not result in an RM address change. In other 
> cases, a small delay in getting the latest subcluster information may be 
> acceptable. This patch therefore adds a config option, so that it is possible 
> to ask the FederationRMFailoverProxy not to flush the cache when calling 
> getSubCluster(). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390762#comment-16390762
 ] 

genericqa commented on YARN-8011:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m  9s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 16s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 17s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
21s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}114m 19s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | YARN-8011 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12913506/YARN-8011.001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux bd4d02494d3e 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 
11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 583f459 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/19918/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/19918/testReport/ |
| Max. process+thread count | 799 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 

[jira] [Commented] (YARN-8006) Make Hbase-2 profile as default for YARN-7055 branch

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390756#comment-16390756
 ] 

genericqa commented on YARN-8006:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
33s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} YARN-7055 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  5m 
17s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 
40s{color} | {color:green} YARN-7055 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m  
2s{color} | {color:green} YARN-7055 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
18s{color} | {color:green} YARN-7055 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
46m 19s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} YARN-7055 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
10s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m  
1s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 12m  1s{color} 
| {color:red} root generated 201 new + 1233 unchanged - 0 fixed = 1434 total 
(was 1233) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
4s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}  
9m 20s{color} | {color:green} patch has no errors when building and testing our 
client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
21s{color} | {color:green} hadoop-project in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
30s{color} | {color:green} hadoop-yarn-server-timelineservice-hbase-server in 
the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  2m 53s{color} 
| {color:red} hadoop-yarn-server-timelineservice-hbase-tests in the patch 
failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
37s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 79m 50s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRun |
|   | 
hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowActivity |
|   | hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineStorageSchema 
|
|   | 
hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineStorageEntities |
|   | hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineStorageApps |
|   | 
hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage
 |
|   | 
hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRunCompaction
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:c8176b7 |
| JIRA Issue | YARN-8006 |
| JIRA Patch URL | 

[jira] [Commented] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390735#comment-16390735
 ] 

Chandni Singh commented on YARN-5015:
-

@[~leftnoteasy] You are correct that this logic is only needed in the NM. I 
re-examined the changes in YARN-611 and realized it is not needed outside of NM.

I will move it.

 

> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support a sliding window retry policy for AM restarts (introduced in 
> YARN-611). A similar sliding window retry policy is needed for container 
> restarts.
> With this change, we can introduce a common class, SlidingWindowRetryPolicy 
> (suggested by [~vvasudev] in the comments), and integrate it into container 
> restart. 
> In a subsequent JIRA, we can modify the AM code to use 
> SlidingWindowRetryPolicy, which will unify the AM and container restart code.






[jira] [Updated] (YARN-7944) Remove master node link from headers of application pages

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7944:
-
Fix Version/s: (was: 3.1.0)

> Remove master node link from headers of application pages
> -
>
> Key: YARN-7944
> URL: https://issues.apache.org/jira/browse/YARN-7944
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Yesha Vora
>Priority: Major
> Attachments: YARN-7944.001.patch, YARN-7944.002.patch, 
> YARN-7944.003.patch
>
>
> RM UI2 has links for the master container log and the master node. 
> These links are published on the application and service pages. They are not 
> required on all pages because the AM container node link and the container 
> log link are already present in the Application view. 






[jira] [Commented] (YARN-7944) Remove master node link from headers of application pages

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390710#comment-16390710
 ] 

Wangda Tan commented on YARN-7944:
--

And [~yeshavora], I just removed the fix version; it should only be set by the 
committer when the patch gets committed.

> Remove master node link from headers of application pages
> -
>
> Key: YARN-7944
> URL: https://issues.apache.org/jira/browse/YARN-7944
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Yesha Vora
>Priority: Major
> Attachments: YARN-7944.001.patch, YARN-7944.002.patch, 
> YARN-7944.003.patch
>
>
> RM UI2 has links for the master container log and the master node. 
> These links are published on the application and service pages. They are not 
> required on all pages because the AM container node link and the container 
> log link are already present in the Application view. 






[jira] [Commented] (YARN-7944) Remove master node link from headers of application pages

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390709#comment-16390709
 ] 

Wangda Tan commented on YARN-7944:
--

[~sunilg], did you commit this one? I saw this is still in Patch Available 
state while doing a scan.

> Remove master node link from headers of application pages
> -
>
> Key: YARN-7944
> URL: https://issues.apache.org/jira/browse/YARN-7944
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Yesha Vora
>Priority: Major
> Attachments: YARN-7944.001.patch, YARN-7944.002.patch, 
> YARN-7944.003.patch
>
>
> RM UI2 has links for the master container log and the master node. 
> These links are published on the application and service pages. They are not 
> required on all pages because the AM container node link and the container 
> log link are already present in the Application view. 






[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390707#comment-16390707
 ] 

Wangda Tan commented on YARN-5028:
--

[~yufeigu], since this isn't committed to branch-3.1, I just adjusted the fix 
version. Do we need this for 3.1.0? How bad is the issue?

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch, YARN-5028.007-addendum.patch, 
> YARN-5028.007-addendum.patch, YARN-5028.007.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 
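
For illustration only, a minimal, self-contained Java sketch of the trimming 
idea described above; it is not the actual RMStateStore code, and all class and 
field names are hypothetical stand-ins.

{code}
import java.util.Objects;

/**
 * Sketch of trimming the persisted state of a *completed* application: before
 * storing it, drop the fields that are only needed to recover a running app,
 * such as the (bulky) submission context.
 */
public class TrimmedAppStateSketch {

  /** Hypothetical stand-in for the persisted application state. */
  static final class AppState {
    final String appId;
    final String finalStatus;
    final byte[] submissionContext;   // large; only needed while the app can still be recovered

    AppState(String appId, String finalStatus, byte[] submissionContext) {
      this.appId = Objects.requireNonNull(appId);
      this.finalStatus = finalStatus;
      this.submissionContext = submissionContext;
    }
  }

  /** Returns a copy of the state with the recovery-only fields removed. */
  static AppState trimForCompletedApp(AppState full) {
    return new AppState(full.appId, full.finalStatus, /* submissionContext= */ null);
  }

  public static void main(String[] args) {
    AppState full = new AppState("application_123456_0001", "SUCCEEDED", new byte[4096]);
    AppState trimmed = trimForCompletedApp(full);
    System.out.println("trimmed submission context present? "
        + (trimmed.submissionContext != null));   // prints: false
  }
}
{code}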






[jira] [Updated] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-5028:
-
Target Version/s:   (was: 3.1.0)

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch, YARN-5028.007-addendum.patch, 
> YARN-5028.007-addendum.patch, YARN-5028.007.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 






[jira] [Updated] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-5028:
-
Fix Version/s: (was: 3.1.0)
   3.2.0

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch, YARN-5028.007-addendum.patch, 
> YARN-5028.007-addendum.patch, YARN-5028.007.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 






[jira] [Commented] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390685#comment-16390685
 ] 

genericqa commented on YARN-7952:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
29s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
59s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 38s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
11s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api in 
trunk has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
40s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  7m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m  
9s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m  8s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 5 new + 397 unchanged - 0 fixed = 402 total (was 397) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 18s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
17s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
41s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
17s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
17s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 21m 56s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 67m 
59s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}179m  2s{color} 

[jira] [Updated] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk

2018-03-07 Thread Tao Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8011:
---
Attachment: YARN-8011.001.patch

> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  fails sometimes in trunk
> ---
>
> Key: YARN-8011
> URL: https://issues.apache.org/jira/browse/YARN-8011
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-8011.001.patch
>
>
> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  often passes, but the following errors sometimes occur:
> {noformat}
> java.lang.AssertionError: 
> Expected :15360
> Actual :14336
> 
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}
>  
> This problem is caused by the resource deduction lagging slightly behind the 
> assertion. To solve it, the test can sleep for a while before this assertion, 
> as shown below.






[jira] [Updated] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk

2018-03-07 Thread Tao Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8011:
---
Description: 
TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
 often passes, but the following errors sometimes occur:
{noformat}
java.lang.AssertionError: 
Expected :15360
Actual :14336



at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{noformat}
 

This problem is caused by the resource deduction lagging slightly behind the 
assertion. To solve it, the test can sleep for a while before this assertion, 
as shown below.

  was:
TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
 often pass, but the following errors sometimes occur:

 
{noformat}
java.lang.AssertionError: 
Expected :15360
Actual :14336



at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{noformat}
 

This problem is caused by that deducting resource is a little behind the 
assertion. To solve this problem, It can sleep a while before this assertion as 
below.


> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  fails sometimes in trunk
> ---
>
> Key: YARN-8011
> URL: https://issues.apache.org/jira/browse/YARN-8011
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
>
> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  often passes, but the following errors sometimes occur:
> {noformat}
> java.lang.AssertionError: 
> Expected :15360
> Actual :14336
> 
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
> at 
> 

[jira] [Created] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk

2018-03-07 Thread Tao Yang (JIRA)
Tao Yang created YARN-8011:
--

 Summary: 
TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
 fails sometimes in trunk
 Key: YARN-8011
 URL: https://issues.apache.org/jira/browse/YARN-8011
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tao Yang
Assignee: Tao Yang


TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
 often passes, but the following errors sometimes occur:

 
{noformat}
java.lang.AssertionError: 
Expected :15360
Actual :14336



at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{noformat}
 

This problem is caused by the resource deduction lagging slightly behind the 
assertion. To solve it, the test can sleep for a while before this assertion, 
as shown below.
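
For illustration only, a self-contained Java sketch of the race and of waiting 
before the assertion; it is not the real test, and the background thread below 
merely stands in for the asynchronous resource deduction.

{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of why an immediate assertion races with an asynchronous metrics
 * update, and how waiting (a fixed sleep, or better, polling with a deadline)
 * before asserting closes the race.
 */
public class AsyncAssertSketch {

  public static void main(String[] args) throws Exception {
    AtomicLong availableMB = new AtomicLong(15360);

    // Simulated asynchronous deduction that lands slightly behind the caller.
    Thread scheduler = new Thread(() -> {
      try {
        TimeUnit.MILLISECONDS.sleep(200);      // models event-queue latency
      } catch (InterruptedException ignored) {
        Thread.currentThread().interrupt();
      }
      availableMB.addAndGet(-1024);
    });
    scheduler.start();

    // Asserting immediately here could observe 15360 instead of 14336.
    long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
    while (availableMB.get() != 14336 && System.nanoTime() < deadline) {
      TimeUnit.MILLISECONDS.sleep(50);
    }

    if (availableMB.get() != 14336) {
      throw new AssertionError("Expected 14336 but was " + availableMB.get());
    }
    System.out.println("metrics settled at " + availableMB.get());
    scheduler.join();
  }
}
{code}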






[jira] [Commented] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390650#comment-16390650
 ] 

Wangda Tan commented on YARN-7952:
--

Thanks [~xgong] for the update, mostly looks good, a few nits:

1) ContainerManagerImpl: the changes are unnecessary.
2) NMLogAggregationStatusTracker: 
- trackers => maybe recoveryStatuses? 
- trackers should be a ConcurrentMap, since it will be written to while only 
the read lock is held (see the sketch below).
- rollLogAggregationStatus: can be private, and the {{@Private}} annotation is 
unnecessary.
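
For illustration only, a minimal sketch of the ConcurrentMap point; it is not 
the actual NMLogAggregationStatusTracker code, and the class, field and method 
names below are hypothetical.

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * A read lock can be held by many threads at once, so any map that is
 * *written* while only the read lock is held must itself be thread-safe.
 */
public class ReadLockWriteSketch {

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // ConcurrentMap: safe even when several readers call recordStatus() at once.
  private final ConcurrentMap<String, String> recoveryStatuses =
      new ConcurrentHashMap<>();

  /** Called concurrently; only takes the read lock, yet mutates the map. */
  public void recordStatus(String appId, String status) {
    lock.readLock().lock();
    try {
      // putIfAbsent is atomic on a ConcurrentMap; a plain HashMap here could
      // be corrupted by concurrent writers because the read lock is shared.
      recoveryStatuses.putIfAbsent(appId, status);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Exclusive operations (e.g. rolling old entries off) take the write lock. */
  public void purge() {
    lock.writeLock().lock();
    try {
      recoveryStatuses.clear();
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    ReadLockWriteSketch tracker = new ReadLockWriteSketch();
    tracker.recordStatus("application_123456_0001", "RUNNING");
    System.out.println("tracked apps: " + tracker.recoveryStatuses.keySet());
  }
}
{code}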

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch, YARN-7952.5.patch
>
>
> Right now, the NM periodically sends its own log aggregation status to the 
> RM, and the RM aggregates the status for each application, but it will not 
> generate the final status until a client call (from the web UI or CLI) 
> triggers it. However, the RM never persists the log aggregation status. So, 
> when the RM restarts/fails over, the log aggregation status becomes 
> “NOT_STARTED”. This is confusing; maybe we should change it to 
> “NOT_AVAILABLE” (will create a separate ticket for this). In any case, we 
> need to persist the log aggregation status for future use.
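
For illustration only, a self-contained sketch of the persist-and-recover idea 
described above; it is not the real RM/state-store code, the in-memory map 
stands in for the persistent store, and the enum values are illustrative.

{code}
import java.util.HashMap;
import java.util.Map;

/**
 * Persist each application's log aggregation status so it can be recovered
 * after an RM restart instead of silently resetting.
 */
public class LogAggregationStatusRecoverySketch {

  enum Status { NOT_AVAILABLE, NOT_STARTED, RUNNING, SUCCEEDED, FAILED }

  /** Stand-in for the persistent RM state store. */
  private final Map<String, Status> stateStore;

  LogAggregationStatusRecoverySketch(Map<String, Status> stateStore) {
    this.stateStore = stateStore;
  }

  /** Persist the aggregated status for an application whenever it changes. */
  void onStatusUpdate(String appId, Status status) {
    stateStore.put(appId, status);
  }

  /** After a restart, recover the last persisted status (or NOT_AVAILABLE). */
  Status recover(String appId) {
    return stateStore.getOrDefault(appId, Status.NOT_AVAILABLE);
  }

  public static void main(String[] args) {
    Map<String, Status> store = new HashMap<>();   // survives the "restart" below
    LogAggregationStatusRecoverySketch rm1 = new LogAggregationStatusRecoverySketch(store);
    rm1.onStatusUpdate("application_123456_0001", Status.SUCCEEDED);

    // Simulated fail-over: a new RM instance backed by the same store reports
    // the persisted status instead of a misleading NOT_STARTED.
    LogAggregationStatusRecoverySketch rm2 = new LogAggregationStatusRecoverySketch(store);
    System.out.println(rm2.recover("application_123456_0001"));   // SUCCEEDED
    System.out.println(rm2.recover("application_123456_0002"));   // NOT_AVAILABLE
  }
}
{code}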






[jira] [Commented] (YARN-7999) Docker launch fails when user private filecache directory is missing

2018-03-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390634#comment-16390634
 ] 

Eric Yang commented on YARN-7999:
-

+1. This patch helps to get past the error message, but the docker container 
still doesn't launch properly on trunk. Container-executor hangs on "Launching 
Container", but docker is not called. I haven't encountered this issue using 
the 3.1 branch.

> Docker launch fails when user private filecache directory is missing
> 
>
> Key: YARN-7999
> URL: https://issues.apache.org/jira/browse/YARN-7999
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Jason Lowe
>Priority: Major
> Attachments: YARN-7999.001.patch, YARN-7999.002.patch
>
>
> Docker container is failing to launch in trunk.  The root cause is:
> {code}
> [COMPINSTANCE sleeper-1 : container_1520032931921_0001_01_20]: 
> [2018-03-02 23:26:09.196]Exception from container-launch.
> Container id: container_1520032931921_0001_01_20
> Exit code: 29
> Exception message: image: hadoop/centos:latest is trusted in hadoop registry.
> Could not determine real path of mount 
> '/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache'
> Could not determine real path of mount 
> '/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache'
> Invalid docker mount 
> '/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache:/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache',
>  realpath=/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache
> Error constructing docker command, docker error code=12, error 
> message='Invalid docker mount'
> Shell output: main : command provided 4
> main : run as user is hbase
> main : requested yarn user is hbase
> Creating script paths...
> Creating local dirs...
> [2018-03-02 23:26:09.240]Diagnostic message from attempt 0 : [2018-03-02 
> 23:26:09.240]
> [2018-03-02 23:26:09.240]Container exited with a non-zero exit code 29.
> [2018-03-02 23:26:39.278]Could not find 
> nmPrivate/application_1520032931921_0001/container_1520032931921_0001_01_20//container_1520032931921_0001_01_20.pid
>  in any of the directories
> [COMPONENT sleeper]: Failed 11 times, exceeded the limit - 10. Shutting down 
> now...
> {code}
> The filecache cannot be mounted because it doesn't exist.
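
For illustration only, a minimal Java sketch of ensuring the directory exists 
before it is handed to Docker as a bind mount; it is not the actual 
NM/container-executor change, and the local-dir layout is just the example path 
from the log above.

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Make sure the user's private filecache directory exists before it is used
 * as a Docker bind mount, so realpath() on it cannot fail.
 */
public class EnsureFilecacheDirSketch {

  /** Returns the filecache path for a user, creating it if it is missing. */
  static Path ensureUserFilecache(Path nmLocalDir, String user) throws IOException {
    Path filecache = nmLocalDir.resolve("usercache").resolve(user).resolve("filecache");
    // createDirectories is a no-op if the directory already exists.
    return Files.createDirectories(filecache);
  }

  public static void main(String[] args) throws IOException {
    Path nmLocalDir = Paths.get("/tmp/hadoop-yarn/nm-local-dir");
    Path mountSource = ensureUserFilecache(nmLocalDir, "hbase");
    // Only now is it safe to add "<mountSource>:<mountSource>" to the docker
    // run command's mount list.
    System.out.println("bind mount source: " + mountSource);
  }
}
{code}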






[jira] [Commented] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390626#comment-16390626
 ] 

Wangda Tan commented on YARN-5015:
--

[~csingh], could you explain a bit about how this logic will be shared by the 
RM and the AM? Per my understanding, restarting the AM container should be 
handled by the NM, correct? Did you mean the AM needs to implement similar 
logic to restart its container? If so, why not directly leverage the NM logic 
to handle container auto-restart?

bq. The default value of remainingRetries is -1, that is, when it is not set, 
it is -1.
How about setting the initial remainingRetries directly to maxRetries? That 
would avoid such a check.

> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support a sliding window retry policy for AM restarts (introduced in 
> YARN-611). A similar sliding window retry policy is needed for container 
> restarts.
> With this change, we can introduce a common class, SlidingWindowRetryPolicy 
> (suggested by [~vvasudev] in the comments), and integrate it into container 
> restart. 
> In a subsequent JIRA, we can modify the AM code to use 
> SlidingWindowRetryPolicy, which will unify the AM and container restart code.






[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390534#comment-16390534
 ] 

genericqa commented on YARN-5764:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 15m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
53s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 53s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
14s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api in 
trunk has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
51s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
19s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
29s{color} | {color:red} hadoop-yarn-api in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
35s{color} | {color:red} hadoop-yarn-common in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
20s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  6m 
40s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 13s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 2 new + 219 unchanged - 0 fixed = 221 total (was 219) {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
32s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 17s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
45s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
20s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 32s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
37s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 93m 44s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce 

[jira] [Created] (YARN-8010) add config in FederationRMFailoverProxy to not bypass facade cache when failing over

2018-03-07 Thread Botong Huang (JIRA)
Botong Huang created YARN-8010:
--

 Summary: add config in FederationRMFailoverProxy to not bypass 
facade cache when failing over
 Key: YARN-8010
 URL: https://issues.apache.org/jira/browse/YARN-8010
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Today when YarnRM is failing over, the FederationRMFailoverProxy running in 
AMRMProxy performs failover, tries to get the latest subcluster info from 
FederationStateStore and then retries connecting to the latest YarnRM master. 
When calling getSubCluster() on FederationStateStoreFacade, it bypasses the 
cache with a flush flag. While YarnRM is failing over, every AM heartbeat 
thread creates a different thread inside FederationInterceptor, each of which 
keeps performing failover several times. This leads to a big spike of 
getSubCluster() calls to FederationStateStore. 

Depending on the cluster setup (e.g. putting a VIP in front of all YarnRMs), a 
YarnRM master/slave change might not result in an RM address change. In other 
cases, a small delay in getting the latest subcluster information may be 
acceptable. This patch therefore adds a config option that makes it possible to 
ask the FederationRMFailoverProxy not to flush the cache when calling 
getSubCluster(). 
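
For illustration only, a minimal, self-contained Java sketch of the proposed 
behaviour. It is not the actual FederationRMFailoverProxy code; the 
configuration key and the facade interface are hypothetical stand-ins for the 
real FederationStateStoreFacade getSubCluster call with its flush flag.

{code}
import java.util.Properties;

/** Sketch: only bypass the subcluster cache on failover if configured to. */
public class FailoverCacheFlushSketch {

  /** Hypothetical configuration key; not a real YARN property name. */
  static final String FLUSH_CACHE_ON_FAILOVER =
      "yarn.federation.failover.flush-subcluster-cache";

  /** Stand-in for the facade's getSubCluster(id, flushCache) call. */
  interface SubClusterFacade {
    String getSubCluster(String subClusterId, boolean flushCache);
  }

  private final SubClusterFacade facade;
  private final boolean flushCacheOnFailover;

  FailoverCacheFlushSketch(SubClusterFacade facade, Properties conf) {
    this.facade = facade;
    // Default to today's behaviour (flush) unless explicitly disabled.
    this.flushCacheOnFailover =
        Boolean.parseBoolean(conf.getProperty(FLUSH_CACHE_ON_FAILOVER, "true"));
  }

  /** Called on failover: only bypass the facade cache if configured to. */
  String resolveActiveRmAddress(String subClusterId) {
    return facade.getSubCluster(subClusterId, flushCacheOnFailover);
  }

  public static void main(String[] args) {
    SubClusterFacade facade = (id, flush) ->
        (flush ? "fresh" : "cached") + " info for " + id;

    Properties conf = new Properties();
    conf.setProperty(FLUSH_CACHE_ON_FAILOVER, "false");

    FailoverCacheFlushSketch proxy = new FailoverCacheFlushSketch(facade, conf);
    System.out.println(proxy.resolveActiveRmAddress("SC-1"));  // "cached info for SC-1"
  }
}
{code}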






[jira] [Commented] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390501#comment-16390501
 ] 

Xuan Gong commented on YARN-7952:
-

Rebased the patch and kicked off Jenkins.

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch, YARN-7952.5.patch
>
>
> Right now, the NM periodically sends its own log aggregation status to the 
> RM, and the RM aggregates the status for each application, but it will not 
> generate the final status until a client call (from the web UI or CLI) 
> triggers it. However, the RM never persists the log aggregation status. So, 
> when the RM restarts/fails over, the log aggregation status becomes 
> “NOT_STARTED”. This is confusing; maybe we should change it to 
> “NOT_AVAILABLE” (will create a separate ticket for this). In any case, we 
> need to persist the log aggregation status for future use.






[jira] [Updated] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-7952:

Attachment: YARN-7952.5.patch

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch, YARN-7952.5.patch
>
>
> Right now, the NM periodically sends its own log aggregation status to the 
> RM, and the RM aggregates the status for each application, but it will not 
> generate the final status until a client call (from the web UI or CLI) 
> triggers it. However, the RM never persists the log aggregation status. So, 
> when the RM restarts/fails over, the log aggregation status becomes 
> “NOT_STARTED”. This is confusing; maybe we should change it to 
> “NOT_AVAILABLE” (will create a separate ticket for this). In any case, we 
> need to persist the log aggregation status for future use.






[jira] [Comment Edited] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390490#comment-16390490
 ] 

Chandni Singh edited comment on YARN-5015 at 3/8/18 12:17 AM:
--

 [~leftnoteasy] Please find my answers below to some of the questions:
{quote}2) mv org.apache.hadoop.yarn.server.retry.SlidingWindowRetryPolicy to 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container: Why it is 
in server-common?
{quote}
It is in server common so that later we can use it for AM restart. Eventually 
we have to unify the code for AM and container restart, so this class needs to 
be accessible to RM as well.
{quote}4) calculatePendingRetries

return retryContext.getRemainingRetries() == -1 ? retryContext.getMaxRetries() 
: retryContext.getRemainingRetries();

 Why check {{retryContext.getRemainingRetries() == -1}}? Should this be 
getMaxRetries() == -1?
{quote}
The default value of {{remainingRetries}} is -1, that is, when it is not set, 
it is -1.

If remainingRetries is not set then pending retries = {{maxRetries}}. 
Otherwise, pendingRetries = {{remainingRetries}}.
 Just after this we update the {{remainingRetries}} = {{pendingRetries}} - 1.
{quote}1) Instead of adding getRestartTimes/getRemainingRetries to 
{{ContainerRetryContext}}, I suggest to have a separate class like 
NMContainerRetryContext which includes:
{quote}
Similar to 2, should I create a {{SlidingContainerRetryContext}} in 
server-common, since it too needs to be accessible to the RM later when we 
change the AM retry code to use this common class?

 

 


was (Author: csingh):
 [~leftnoteasy] Please find my answers below to some of the questions:
{quote}2) mv org.apache.hadoop.yarn.server.retry.SlidingWindowRetryPolicy to 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container: Why it is 
in server-common?
{quote}
It is in server common so that later we can use it for AM restart. Eventually 
we have to unify the code for AM and container restart, so this class needs to 
be accessible to RM as well.
{quote}4) calculatePendingRetries

return retryContext.getRemainingRetries() == -1 ? retryContext.getMaxRetries() 
: retryContext.getRemainingRetries();

 Why check {{retryContext.getRemainingRetries() == -1}}? Should this be 
getMaxRetries() == -1?
{quote}
The default value of {{remainingRetries}} is -1, that is, when it is not set, 
it is -1.

If remainingRetries is not set then pending retries = {{maxRetries}}. 
Otherwise, pendingRetries = {{remainingRetries}}.
 Just after this we update the {{remainingRetries}} = {{pendingRetries}} - 1.

> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support a sliding window retry policy for AM restarts (introduced in 
> YARN-611). A similar sliding window retry policy is needed for container 
> restarts.
> With this change, we can introduce a common class, SlidingWindowRetryPolicy 
> (suggested by [~vvasudev] in the comments), and integrate it into container 
> restart. 
> In a subsequent JIRA, we can modify the AM code to use 
> SlidingWindowRetryPolicy, which will unify the AM and container restart code.






[jira] [Commented] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390490#comment-16390490
 ] 

Chandni Singh commented on YARN-5015:
-

 [~leftnoteasy] Please find my answers below to some of the questions:
{quote}2) mv org.apache.hadoop.yarn.server.retry.SlidingWindowRetryPolicy to 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container: Why it is 
in server-common?
{quote}
It is in server common so that later we can use it for AM restart. Eventually 
we have to unify the code for AM and container restart, so this class needs to 
be accessible to RM as well.
{quote}4) calculatePendingRetries

return retryContext.getRemainingRetries() == -1 ? retryContext.getMaxRetries() 
: retryContext.getRemainingRetries();

 Why check {{retryContext.getRemainingRetries() == -1}}? Should this be 
getMaxRetries() == -1?
{quote}
The default value of {{remainingRetries}} is -1, that is, when it is not set, 
it is -1.

If remainingRetries is not set then pending retries = {{maxRetries}}. 
Otherwise, pendingRetries = {{remainingRetries}}.
 Just after this we update the {{remainingRetries}} = {{pendingRetries}} - 1.
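
For illustration only, a self-contained sketch of the bookkeeping described 
above; it mirrors this comment rather than the actual SlidingWindowRetryPolicy 
patch, and all names are stand-ins.

{code}
/** Sketch of the calculatePendingRetries / remainingRetries bookkeeping. */
public class PendingRetriesSketch {

  static final int UNSET = -1;

  private final int maxRetries;
  // -1 ("unset") until the first failure is handled; initialising it to
  // maxRetries up front would remove the UNSET check below.
  private int remainingRetries = UNSET;

  PendingRetriesSketch(int maxRetries) {
    this.maxRetries = maxRetries;
  }

  /** remainingRetries == -1 means "not set yet", so fall back to maxRetries. */
  int calculatePendingRetries() {
    return remainingRetries == UNSET ? maxRetries : remainingRetries;
  }

  /** Called on each container failure: compute the budget, then consume one retry. */
  boolean shouldRetry() {
    int pendingRetries = calculatePendingRetries();
    if (pendingRetries <= 0) {
      return false;                          // retry budget exhausted
    }
    remainingRetries = pendingRetries - 1;   // remainingRetries = pendingRetries - 1
    return true;
  }

  public static void main(String[] args) {
    PendingRetriesSketch policy = new PendingRetriesSketch(2);
    System.out.println(policy.shouldRetry());   // true  (2 -> 1 remaining)
    System.out.println(policy.shouldRetry());   // true  (1 -> 0 remaining)
    System.out.println(policy.shouldRetry());   // false (no retries left)
  }
}
{code}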

> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support a sliding window retry policy for AM restarts (introduced in 
> YARN-611). A similar sliding window retry policy is needed for container 
> restarts.
> With this change, we can introduce a common class, SlidingWindowRetryPolicy 
> (suggested by [~vvasudev] in the comments), and integrate it into container 
> restart. 
> In a subsequent JIRA, we can modify the AM code to use 
> SlidingWindowRetryPolicy, which will unify the AM and container restart code.






[jira] [Commented] (YARN-7891) LogAggregationIndexedFileController should support read from HAR file

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390476#comment-16390476
 ] 

Hudson commented on YARN-7891:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13793 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/13793/])
Revert "YARN-7891. LogAggregationIndexedFileController should support (wangda: 
rev e718ac597f2225cb4946e1ac4b3986c336645643)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/TestLogAggregationIndexFileController.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java
* (delete) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/part-0
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto
* (delete) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/tracker/NMLogAggregationStatusTracker.java
* (delete) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/_SUCCESS
* (delete) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/_index
* (delete) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/tracker/TestNMLogAggregationStatusTracker.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/Context.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/BaseAMRMProxyTest.java
* (delete) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/_masterindex
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java
YARN-7891. LogAggregationIndexedFileController should support read from 
(wangda: rev 583f4594314b3db25b57b1e46ea8026eab21f932)
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/_SUCCESS
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/part-0
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/_index
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/TestLogAggregationIndexFileController.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/_masterindex



[jira] [Commented] (YARN-7891) LogAggregationIndexedFileController should support read from HAR file

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390457#comment-16390457
 ] 

Wangda Tan commented on YARN-7891:
--

Oops, I committed the wrong patch; I just reverted it from trunk and pushed the correct one.

> LogAggregationIndexedFileController should support read from HAR file
> -
>
> Key: YARN-7891
> URL: https://issues.apache.org/jira/browse/YARN-7891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7891.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7952:
-
Attachment: (was: YARN-7952.4.patch)

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch
>
>
> Right now, the NM would send its own log aggregation status to RM 
> periodically to RM. And RM would aggregate the status for each application, 
> but it will not generate the final status until a client call(from web ui or 
> cli) trigger it. But RM never persists the log aggregation status. So, when 
> RM restarts/fails over, the log aggregation status will become “NOT_STARTED”. 
> This is confusing, maybe we should change it to “NOT_AVAILABLE” (will create 
> a separate ticket for this). Anyway, we need to persist the log aggregation 
> status for the future use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7952:
-
Attachment: (was: YARN-7952.4.patch)

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch
>
>
> Right now, the NM would send its own log aggregation status to RM 
> periodically to RM. And RM would aggregate the status for each application, 
> but it will not generate the final status until a client call(from web ui or 
> cli) trigger it. But RM never persists the log aggregation status. So, when 
> RM restarts/fails over, the log aggregation status will become “NOT_STARTED”. 
> This is confusing, maybe we should change it to “NOT_AVAILABLE” (will create 
> a separate ticket for this). Anyway, we need to persist the log aggregation 
> status for the future use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7952:
-
Attachment: YARN-7952.4.patch

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch, YARN-7952.4.patch, 
> YARN-7952.4.patch
>
>
> Right now, the NM would send its own log aggregation status to RM 
> periodically to RM. And RM would aggregate the status for each application, 
> but it will not generate the final status until a client call(from web ui or 
> cli) trigger it. But RM never persists the log aggregation status. So, when 
> RM restarts/fails over, the log aggregation status will become “NOT_STARTED”. 
> This is confusing, maybe we should change it to “NOT_AVAILABLE” (will create 
> a separate ticket for this). Anyway, we need to persist the log aggregation 
> status for the future use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390428#comment-16390428
 ] 

Wangda Tan commented on YARN-5015:
--

Thanks [~csingh], see my comments below:

1) Instead of adding getRestartTimes/getRemainingRetries to 
{{ContainerRetryContext}}, I suggest having a separate class like 
NMContainerRetryContext which includes:
- ContainerRetryContext
- getRestartTimes/getRemainingRetries

since we should not add runtime information to protocol/API classes.

2) mv org.apache.hadoop.yarn.server.retry.SlidingWindowRetryPolicy to 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container: Why is it 
in server-common? 

3) {{shouldRetry}}:
- It's better to return true at the beginning of the method when 
{{getMaxRetries() == ContainerRetryContext.RETRY_FOREVER}}, which avoids lots of 
checks in the functions that follow, like calculatePendingRetries (see the sketch 
after this list).

4) {{calculatePendingRetries}}
{code}
  return retryContext.getRemainingRetries() == -1 ?
  retryContext.getMaxRetries() :
  retryContext.getRemainingRetries();
{code} 
Why check {{retryContext.getRemainingRetries() == -1}}? Should this be 
getMaxRetries() == -1? 

5) {{updateRetryContext}}:
{code}
retryContext.setRemainingRetries(pendingRetries -1);
{code} 

6) In ContainerImpl: 
{code}
  int n = container.containerRetryContext.getMaxRetries()
  - container.containerRetryContext.getRemainingRetries();
  container.addDiagnostics("Diagnostic message from attempt "
  + n + " : ", "\n");
{code} 
Under the context of SlidingWindowRetry, this n may keep changing. To avoid 
introducing more logic, I suggest removing {{n}} from the diagnostics. 
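
A minimal sketch of the early return suggested in 3), combined with the 
pending-retries check from 4). The class, parameter names, and RETRY_FOREVER 
value here are placeholders for whatever the patch actually defines, not the 
real classes:
{code}
import java.util.Set;

class ShouldRetrySketch {
  static final int RETRY_FOREVER = -1;   // stand-in for ContainerRetryContext.RETRY_FOREVER
  static final int UNSET = -1;           // remainingRetries not initialized yet

  static boolean shouldRetry(int maxRetries, int remainingRetries,
      Set<Integer> retryableExitCodes, int exitCode) {
    if (maxRetries == RETRY_FOREVER) {
      return true;                       // unlimited retries: skip all further checks
    }
    if (!retryableExitCodes.contains(exitCode)) {
      return false;                      // this exit code never triggers a restart
    }
    int pending = remainingRetries == UNSET ? maxRetries : remainingRetries;
    return pending > 0;
  }
}
{code}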


> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support sliding window retry policy for AM restarts (Introduced in 
> YARN-611). Similar sliding window retry policy is needed for container 
> restarts.
> With this change, we can introduce a common class for 
> SlidingWindowRetryPolicy ( suggested by [~vvasudev] in the comments) and 
> integrate it to container restart. 
> In a subsequent jira, we can modify the AM code to use 
> SlidingWindowRetryPolicy which will unify the AM and container restart code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers

2018-03-07 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated YARN-5764:

Attachment: YARN-5764-v8.patch

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
>Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance 
> Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, 
> YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, 
> YARN-5764-v6.patch, YARN-5764-v7.patch, YARN-5764-v8.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non SMP systems. Yarn containers, on launch, 
> will be pinned to a specific NUMA node and all subsequent memory allocations 
> will be served by the same node, reducing remote memory accesses. The current 
> default behavior is to spread memory across all NUMA nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390353#comment-16390353
 ] 

genericqa commented on YARN-7952:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  9m 
11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  2s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
24s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api in 
trunk has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
49s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
20s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
23s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  1m 
23s{color} | {color:red} hadoop-yarn in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  1m 23s{color} 
| {color:red} hadoop-yarn in the patch failed. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 17s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 5 new + 300 unchanged - 0 fixed = 305 total (was 300) {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
21s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  3m 
36s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
21s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
20s{color} | {color:red} 
hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager
 generated 3 new + 9 unchanged - 0 fixed = 12 total (was 9) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
36s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  
5s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 22s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 68m 41s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | YARN-7952 |

[jira] [Updated] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7952:
-
Attachment: YARN-7952.4.patch

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch, YARN-7952.4.patch
>
>
> Right now, the NM would send its own log aggregation status to RM 
> periodically to RM. And RM would aggregate the status for each application, 
> but it will not generate the final status until a client call(from web ui or 
> cli) trigger it. But RM never persists the log aggregation status. So, when 
> RM restarts/fails over, the log aggregation status will become “NOT_STARTED”. 
> This is confusing, maybe we should change it to “NOT_AVAILABLE” (will create 
> a separate ticket for this). Anyway, we need to persist the log aggregation 
> status for the future use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390237#comment-16390237
 ] 

Wangda Tan commented on YARN-7952:
--

Thanks [~xgong], I just updated title/description based on your previous 
comment, and rebased patch (004) to latest trunk to get a Jenkins report.

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch, YARN-7952.4.patch
>
>
> Right now, the NM would send its own log aggregation status to RM 
> periodically to RM. And RM would aggregate the status for each application, 
> but it will not generate the final status until a client call(from web ui or 
> cli) trigger it. But RM never persists the log aggregation status. So, when 
> RM restarts/fails over, the log aggregation status will become “NOT_STARTED”. 
> This is confusing, maybe we should change it to “NOT_AVAILABLE” (will create 
> a separate ticket for this). Anyway, we need to persist the log aggregation 
> status for the future use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7952:
-
Summary: RM should be able to recover log aggregation status after 
restart/fail-over  (was: Find a way to persist the log aggregation status)

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch
>
>
> In MAPREDUCE-6415, we have created a CLI to har the aggregated logs, and In 
> YARN-4946: RM should write out Aggregated Log Completion file flag next to 
> logs, we have a discussion on how we can get the log aggregation status: make 
> a client call to RM or get it directly from the Distributed file system(HDFS).
> No matter which approach we would like to choose, we need to figure out a way 
> to persist the log aggregation status first. This ticket is used to track the 
> working progress for this purpose.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7952) Find a way to persist the log aggregation status

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390233#comment-16390233
 ] 

genericqa commented on YARN-7952:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} 
| {color:red} YARN-7952 does not apply to trunk. Rebase required? Wrong Branch? 
See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-7952 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12913331/YARN-7952.3.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/19914/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Find a way to persist the log aggregation status
> 
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch
>
>
> In MAPREDUCE-6415, we have created a CLI to har the aggregated logs, and In 
> YARN-4946: RM should write out Aggregated Log Completion file flag next to 
> logs, we have a discussion on how we can get the log aggregation status: make 
> a client call to RM or get it directly from the Distributed file system(HDFS).
> No matter which approach we would like to choose, we need to figure out a way 
> to persist the log aggregation status first. This ticket is used to track the 
> working progress for this purpose.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7952) RM should be able to recover log aggregation status after restart/fail-over

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7952:
-
Description: Right now, the NM would send its own log aggregation status to 
RM periodically to RM. And RM would aggregate the status for each application, 
but it will not generate the final status until a client call(from web ui or 
cli) trigger it. But RM never persists the log aggregation status. So, when RM 
restarts/fails over, the log aggregation status will become “NOT_STARTED”. This 
is confusing, maybe we should change it to “NOT_AVAILABLE” (will create a 
separate ticket for this). Anyway, we need to persist the log aggregation 
status for the future use.  (was: In MAPREDUCE-6415, we have created a CLI to 
har the aggregated logs, and In YARN-4946: RM should write out Aggregated Log 
Completion file flag next to logs, we have a discussion on how we can get the 
log aggregation status: make a client call to RM or get it directly from the 
Distributed file system(HDFS).
No matter which approach we would like to choose, we need to figure out a way 
to persist the log aggregation status first. This ticket is used to track the 
working progress for this purpose.)

> RM should be able to recover log aggregation status after restart/fail-over
> ---
>
> Key: YARN-7952
> URL: https://issues.apache.org/jira/browse/YARN-7952
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7952-poc.patch, YARN-7952.1.patch, 
> YARN-7952.2.patch, YARN-7952.3.patch, YARN-7952.3.patch
>
>
> Right now, the NM would send its own log aggregation status to RM 
> periodically to RM. And RM would aggregate the status for each application, 
> but it will not generate the final status until a client call(from web ui or 
> cli) trigger it. But RM never persists the log aggregation status. So, when 
> RM restarts/fails over, the log aggregation status will become “NOT_STARTED”. 
> This is confusing, maybe we should change it to “NOT_AVAILABLE” (will create 
> a separate ticket for this). Anyway, we need to persist the log aggregation 
> status for the future use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5015:

Description: 
We support sliding window retry policy for AM restarts (Introduced in 
YARN-611). Similar sliding window retry policy is needed for container restarts.

With this change, we can introduce a common class for SlidingWindowRetryPolicy 
( suggested by [~vvasudev] in the comments) and integrate it to container 
restart. 

In a subsequent jira, we can modify the AM code to use SlidingWindowRetryPolicy 
which will unify the AM and container restart code.

  was:
We support sliding window retry policy for AM restarts (Refer to YARN-3669). 
Similar sliding window retry policy is needed for container restarts.

With this change, we can introduce a common class for SlidingWindowRetryPolicy 
( suggested by [~vvasudev] in the comments) and integrate it to container 
restart. 

In a subsequent jira, we can modify the AM code to use SlidingWindowRetryPolicy 
which will unify the AM and container restart code.


> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support sliding window retry policy for AM restarts (Introduced in 
> YARN-611). Similar sliding window retry policy is needed for container 
> restarts.
> With this change, we can introduce a common class for 
> SlidingWindowRetryPolicy ( suggested by [~vvasudev] in the comments) and 
> integrate it to container restart. 
> In a subsequent jira, we can modify the AM code to use 
> SlidingWindowRetryPolicy which will unify the AM and container restart code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5015:

Description: 
We support sliding window retry policy for AM restarts (Refer to YARN-3669). 
Similar sliding window retry policy is needed for container restarts.

With this change, we can introduce a common class for SlidingWindowRetryPolicy 
( suggested by [~vvasudev] in the comments) and integrate it to container 
restart. 

In a subsequent jira, we can modify the AM code to use SlidingWindowRetryPolicy 
which will unify the AM and container restart code.

  was:
We support sliding window retry policy for AM restarts. Similar sliding window 
retry policy is needed for container restarts.

With this change, we can introduce a common class for SlidingWindowRetryPolicy 
( suggested by [~vvasudev] in the comments) and integrate it to container 
restart. 

In a subsequent jira, we can modify the AM code to use SlidingWindowRetryPolicy 
which will unify the AM and container restart code.


> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support sliding window retry policy for AM restarts (Refer to YARN-3669). 
> Similar sliding window retry policy is needed for container restarts.
> With this change, we can introduce a common class for 
> SlidingWindowRetryPolicy ( suggested by [~vvasudev] in the comments) and 
> integrate it to container restart. 
> In a subsequent jira, we can modify the AM code to use 
> SlidingWindowRetryPolicy which will unify the AM and container restart code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7626) Allow regular expression matching in container-executor.cfg for devices and named docker volumes mount

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390123#comment-16390123
 ] 

Hudson commented on YARN-7626:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13789 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/13789/])
YARN-7626. Allow regular expression matching in container-executor.cfg (wangda: 
rev 037d7834833df2d1e60f5015b60d42550b1ddce6)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/utils/test_docker_util.cc
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/utils/docker-util.c


> Allow regular expression matching in container-executor.cfg for devices and 
> named docker volumes mount
> --
>
> Key: YARN-7626
> URL: https://issues.apache.org/jira/browse/YARN-7626
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7626.001.patch, YARN-7626.002.patch, 
> YARN-7626.003.patch, YARN-7626.004.patch, YARN-7626.005.patch, 
> YARN-7626.006.patch, YARN-7626.007.patch, YARN-7626.008.patch, 
> YARN-7626.009.patch, YARN-7626.010.patch, YARN-7626.011.patch
>
>
> Currently when we config some of the GPU devices related fields (like ) in 
> container-executor.cfg, these fields are generated based on different driver 
> versions or GPU device names. We want to enable regular expression matching 
> so that user don't need to manually set up these fields when config 
> container-executor.cfg,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7891) LogAggregationIndexedFileController should support read from HAR file

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390122#comment-16390122
 ] 

Hudson commented on YARN-7891:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13789 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/13789/])
YARN-7891. LogAggregationIndexedFileController should support read from 
(wangda: rev 4d53ef7eefb14661d824924e503a910de1ae997f)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/Context.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/tracker/TestNMLogAggregationStatusTracker.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/part-0
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/TestLogAggregationIndexFileController.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/_SUCCESS
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/BaseAMRMProxyTest.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/tracker/NMLogAggregationStatusTracker.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/_index
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/resources/application_123456_0001.har/_masterindex
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java


> LogAggregationIndexedFileController should support read from HAR file
> -
>
> Key: YARN-7891
> URL: https://issues.apache.org/jira/browse/YARN-7891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7891.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8009) YARN limit number of simultaneously running containers in the application level

2018-03-07 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390116#comment-16390116
 ] 

Miklos Szegedi commented on YARN-8009:
--

Thank you for raising this [~sachinjose2...@gmail.com]. Normally the 
Application Master can decide how many containers it requests, so it can expose 
such a limit as an option to the user. See the distributed shell example for 
details (a minimal AM-side sketch of the idea follows the link):

https://github.com/apache/hadoop/blob/037d7834833df2d1e60f5015b60d42550b1ddce6/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java#L459
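
A rough sketch, under the assumption that the AM tracks how many of its 
containers are currently running and only asks the RM for more while it is below 
a user-supplied cap; the class name and the cap option are made up for 
illustration and are not an existing YARN feature:
{code}
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

class ThrottledContainerRequester {
  private final AMRMClientAsync<ContainerRequest> amRMClient;
  private final int maxConcurrent;                 // user-supplied cap, e.g. from a CLI option
  private final AtomicInteger running = new AtomicInteger();

  ThrottledContainerRequester(AMRMClientAsync<ContainerRequest> amRMClient,
      int maxConcurrent) {
    this.amRMClient = amRMClient;
    this.maxConcurrent = maxConcurrent;
  }

  /** Ask the RM for one more container only while we are under the cap. */
  void maybeRequestContainer() {
    if (running.get() < maxConcurrent) {
      Resource capability = Resource.newInstance(1024, 1);
      amRMClient.addContainerRequest(
          new ContainerRequest(capability, null, null, Priority.newInstance(0)));
    }
  }

  // Wire these into the AMRMClientAsync callback handler.
  void onContainerStarted()  { running.incrementAndGet(); }
  void onContainerFinished() { running.decrementAndGet(); maybeRequestContainer(); }
}
{code}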

> YARN limit number of simultaneously running containers in the application 
> level
> ---
>
> Key: YARN-8009
> URL: https://issues.apache.org/jira/browse/YARN-8009
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Sachin Jose
>Priority: Minor
>  Labels: features
>
> It would be really useful if the user can specify maximum containers can be 
> running simultaneously in the application level. Most of the long running 
> YARN application can be benefited out of it. At this moment, the only 
> available option to restrict resource over usage of long running is in the 
> YARN resource manager queue level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390106#comment-16390106
 ] 

Chandni Singh commented on YARN-5015:
-

[~leftnoteasy] I have updated the description. I have followed [~vvasudev]'s 
suggestion (a rough field-level sketch of the suggested context follows the quote):
{quote}I think you probably need to change your approach if we want to unify 
the AM and container restart policies. I think what's required is a common 
class - something like SlidingWindowContainerRetryPolicy or something like that 
which takes a SlidingWindowContainerRetryContext consisting of the restart 
timestamps, the validity interval, the exit codes, the exit codes to ignore, 
and the remaining retry attempts. The SlidingWindowContainerRetryPolicy can 
then look at the various parameters and tell you whether to retry the container 
or not.
{quote}
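For reference, a field-level sketch of such a context; the names are illustrative 
and not the ones used in the patch:
{code}
import java.util.List;
import java.util.Set;

class SlidingWindowRetryContextSketch {
  List<Long> restartTimestamps;    // when previous restarts happened
  long validityIntervalMs;         // sliding-window length; older restarts fall out of the window
  Set<Integer> errorCodes;         // exit codes that should trigger a retry
  Set<Integer> ignoredErrorCodes;  // exit codes that should not count against the window
  int remainingRetries;            // attempts left within the current window
}
{code}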
 

> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support sliding window retry policy for AM restarts. Similar sliding 
> window retry policy is needed for container restarts.
> With this change, we can introduce a common class for 
> SlidingWindowRetryPolicy ( suggested by [~vvasudev] in the comments) and 
> integrate it to container restart. 
> In a subsequent jira, we can modify the AM code to use 
> SlidingWindowRetryPolicy which will unify the AM and container restart code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7626) Allow regular expression matching in container-executor.cfg for devices and named docker volumes mount

2018-03-07 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390095#comment-16390095
 ] 

Miklos Szegedi commented on YARN-7626:
--

Thank you for the contribution [~Zian Chen] and for the commit [~leftnoteasy].

> Allow regular expression matching in container-executor.cfg for devices and 
> named docker volumes mount
> --
>
> Key: YARN-7626
> URL: https://issues.apache.org/jira/browse/YARN-7626
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7626.001.patch, YARN-7626.002.patch, 
> YARN-7626.003.patch, YARN-7626.004.patch, YARN-7626.005.patch, 
> YARN-7626.006.patch, YARN-7626.007.patch, YARN-7626.008.patch, 
> YARN-7626.009.patch, YARN-7626.010.patch, YARN-7626.011.patch
>
>
> Currently when we config some of the GPU devices related fields (like ) in 
> container-executor.cfg, these fields are generated based on different driver 
> versions or GPU device names. We want to enable regular expression matching 
> so that user don't need to manually set up these fields when config 
> container-executor.cfg,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5015:

Description: 
We support sliding window retry policy for AM restarts. Similar sliding window 
retry policy is needed for container restarts.

With this change, we can introduce a common class for SlidingWindowRetryPolicy 
( suggested by [~vvasudev] in the comments) and integrate it to container 
restart. 

In a subsequent jira, we can modify the AM code to use SlidingWindowRetryPolicy 
which will unify the AM and container restart code.

  was:
We support sliding window retry policy for AM restart. Similar sliding window 
retry policy can be applied to Container restarts. With this change, we can 
introduce a common class for SlidingWindowRetryPolicy ( suggested by 
[~vvasudev] in the comments) and integrate to container restart. 

In a subsequent jira, we can use SlidingWindowRetryPolicy for am restart as 
well to unify the code.


> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support sliding window retry policy for AM restarts. Similar sliding 
> window retry policy is needed for container restarts.
> With this change, we can introduce a common class for 
> SlidingWindowRetryPolicy ( suggested by [~vvasudev] in the comments) and 
> integrate it to container restart. 
> In a subsequent jira, we can modify the AM code to use 
> SlidingWindowRetryPolicy which will unify the AM and container restart code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5015:

Description: 
We support sliding window retry policy for AM restart. Similar sliding window 
retry policy can be applied to Container restarts. With this change, we can 
introduce a common class for SlidingWindowRetryPolicy ( suggested by 
[~vvasudev] in the comments) and integrate to container restart. 

In a subsequent jira, we can use SlidingWindowRetryPolicy for am restart as 
well to unify the code.

  was:
We support sliding window retry policy for AM restart. Similar sliding window 
retry policy can be applied to Container restarts. With this change, we can 
introduce a common class for SlidingWindowRetryPolicy ( suggested by 
[~vvasudev] in the comments) and integrate to container restart. 

In a subsequent jira, we can use SlidingWindowRetryPolicy for am restart as 
well to unify the code


> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support sliding window retry policy for AM restart. Similar sliding window 
> retry policy can be applied to Container restarts. With this change, we can 
> introduce a common class for SlidingWindowRetryPolicy ( suggested by 
> [~vvasudev] in the comments) and integrate to container restart. 
> In a subsequent jira, we can use SlidingWindowRetryPolicy for am restart as 
> well to unify the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5015:

Description: 
We support sliding window retry policy for AM restart. Similar sliding window 
retry policy can be applied to Container restarts. With this change, we can 
introduce a common class for SlidingWindowRetryPolicy ( suggested by 
[~vvasudev] in the comments) and integrate to container restart. 

In a subsequent jira, we can use SlidingWindowRetryPolicy for am restart as 
well to unify the code

  was:We support sliding window retry policy for AM restart. Similar sliding 
window retry policy can be applied to Container restarts. With this change, we 
can introduce a common class as suggested by [~bikassaha] in the comments for a 
SlidingWindowRetrywill cr and container restarts - however the two have 
slightly different capabilities. We should unify them. There's no reason for 
them to be different.


> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support sliding window retry policy for AM restart. Similar sliding window 
> retry policy can be applied to Container restarts. With this change, we can 
> introduce a common class for SlidingWindowRetryPolicy ( suggested by 
> [~vvasudev] in the comments) and integrate to container restart. 
> In a subsequent jira, we can use SlidingWindowRetryPolicy for am restart as 
> well to unify the code



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5015:

Description: We support sliding window retry policy for AM restart. Similar 
sliding window retry policy can be applied to Container restarts. With this 
change, we can introduce a common class as suggested by [~bikassaha] in the 
comments for a SlidingWindowRetrywill cr and container restarts - however the 
two have slightly different capabilities. We should unify them. There's no 
reason for them to be different.  (was: We support AM restart and container 
restarts - however the two have slightly different capabilities. We should 
unify them. There's no reason for them to be different.)

> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support sliding window retry policy for AM restart. Similar sliding window 
> retry policy can be applied to Container restarts. With this change, we can 
> introduce a common class as suggested by [~bikassaha] in the comments for a 
> SlidingWindowRetrywill cr and container restarts - however the two have 
> slightly different capabilities. We should unify them. There's no reason for 
> them to be different.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5015) Support sliding window retry capability for container restart

2018-03-07 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5015:

Summary: Support sliding window retry capability for container restart   
(was: Unify restart policies across AM and container restarts)

> Support sliding window retry capability for container restart 
> --
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support AM restart and container restarts - however the two have slightly 
> different capabilities. We should unify them. There's no reason for them to 
> be different.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7626) Allow regular expression matching in container-executor.cfg for devices and named docker volumes mount

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390081#comment-16390081
 ] 

Wangda Tan edited comment on YARN-7626 at 3/7/18 7:32 PM:
--

Committed to trunk, thanks [~Zian Chen]. And thanks for the thorough reviews from 
[~miklos.szeg...@cloudera.com].


was (Author: leftnoteasy):
Committed to trunk, thanks [~Zian Chen]

> Allow regular expression matching in container-executor.cfg for devices and 
> named docker volumes mount
> --
>
> Key: YARN-7626
> URL: https://issues.apache.org/jira/browse/YARN-7626
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7626.001.patch, YARN-7626.002.patch, 
> YARN-7626.003.patch, YARN-7626.004.patch, YARN-7626.005.patch, 
> YARN-7626.006.patch, YARN-7626.007.patch, YARN-7626.008.patch, 
> YARN-7626.009.patch, YARN-7626.010.patch, YARN-7626.011.patch
>
>
> Currently when we config some of the GPU devices related fields (like ) in 
> container-executor.cfg, these fields are generated based on different driver 
> versions or GPU device names. We want to enable regular expression matching 
> so that user don't need to manually set up these fields when config 
> container-executor.cfg,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5015) Unify restart policies across AM and container restarts

2018-03-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390068#comment-16390068
 ] 

Wangda Tan commented on YARN-5015:
--

[~csingh], could you please elaborate on the approach and scope of this patch a 
bit? There is limited information in the description.

> Unify restart policies across AM and container restarts
> ---
>
> Key: YARN-5015
> URL: https://issues.apache.org/jira/browse/YARN-5015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Chandni Singh
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5015.01.patch, YARN-5015.02.patch, 
> YARN-5015.03.patch
>
>
> We support AM restart and container restarts - however the two have slightly 
> different capabilities. We should unify them. There's no reason for them to 
> be different.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8009) YARN limit number of simultaneously running containers in the application level

2018-03-07 Thread Sachin Jose (JIRA)
Sachin Jose created YARN-8009:
-

 Summary: YARN limit number of simultaneously running containers in 
the application level
 Key: YARN-8009
 URL: https://issues.apache.org/jira/browse/YARN-8009
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.0.0
Reporter: Sachin Jose


It would be really useful if the user could specify the maximum number of 
containers that can run simultaneously at the application level. Most 
long-running YARN applications would benefit from it. At the moment, the only 
available option to restrict resource over-usage of long-running applications is 
at the YARN ResourceManager queue level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7891) LogAggregationIndexedFileController should support read from HAR file

2018-03-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7891:
-
Summary: LogAggregationIndexedFileController should support read from HAR 
file  (was: LogAggregationIndexedFileController should support HAR file)

> LogAggregationIndexedFileController should support read from HAR file
> -
>
> Key: YARN-7891
> URL: https://issues.apache.org/jira/browse/YARN-7891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-7891.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7654) Support ENTRY_POINT for docker container

2018-03-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389883#comment-16389883
 ] 

Eric Yang edited comment on YARN-7654 at 3/7/18 6:01 PM:
-

Hi [~ebadger] [~billie.rinaldi] [~shaneku...@gmail.com] Here is an early draft 
of the patch.  The current patch allows running with ENTRY_POINT either enabled 
or disabled, with the limitation that user environment variables are not 
forwarded to an ENTRY_POINT-enabled container.  Given the complexity of the code 
base and having to support two different modes, I think it is best to get some 
early feedback.

For patch 001 to work, you will need to apply YARN-7221 patch 6 and YARN-7677 
patch 7.

I did some review of the environment variables, and I am only comfortable 
exposing user-defined environment variables in an ENTRY_POINT-enabled container.  
The global Hadoop environment variables and the node-manager-constructed 
environment variables will be available in prelaunch, but not forwarded to the 
container.  Let me know your thoughts on this approach.  Thanks

Example job that I used to submit:
{code}
{
  "name": "sleeper-service",
  "kerberos_principal" : {
"principal_name" : "hbase/_h...@example.com",
"keytab" : "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components" :
  [
{
  "name": "sleeper",
  "number_of_containers": 2,
  "artifact": {
"id": "hadoop/centos:latest",
"type": "DOCKER"
  },
  "privileged": true,
  "launch_command": "sleep 90",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL":"true",
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
  "USER_DEFINE":"1",
  "USER_DEFINE":"2"
},
"properties": {
  "docker.network": "host"
}
  }
}
  ]
}
{code}
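
Purely as an illustration of the proposed behaviour (this is not output from the 
patch): with ENTRY_POINT enabled, only the user-defined variables from the 
{{env}} section above would be forwarded to the container, roughly like:
{code}
docker run --net host \
  -e USER_DEFINE=1 \
  -e USER_DEFINE_2=2 \
  hadoop/centos:latest
{code}
while the Hadoop- and node-manager-provided variables remain available only in 
the prelaunch environment.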


was (Author: eyang):
Hi [~ebadger] [~billie.rinaldi] [~shaneku...@gmail.com] Here is an early draft 
of the patch.  The current patch allows running with ENTRY_POINT either enabled 
or disabled, with the limitation that user environment variables are not 
forwarded to an ENTRY_POINT-enabled container.  Given the complexity of the code 
base and having to support two different modes, I think it is best to get some 
early feedback.

For patch 001 to work, you will need to apply YARN-7221 patch 6 and YARN-7677 
patch 7.

I did some review of the environment variables, and I am only comfortable 
exposing user-defined environment variables.  The global Hadoop environment 
variables and the node-manager-constructed environment variables will be 
available in prelaunch, but not forwarded to the container.  Let me know your 
thoughts on this approach.  Thanks

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-7654.001.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-03-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389883#comment-16389883
 ] 

Eric Yang commented on YARN-7654:
-

Hi [~ebadger] [~billie.rinaldi] [~shaneku...@gmail.com] Here is an early draft 
of the patch.  The current patch allows running with ENTRY_POINT either enabled 
or disabled, with the limitation that user environment variables are not 
forwarded to an ENTRY_POINT-enabled container.  Given the complexity of the code 
base and having to support two different modes, I think it is best to get some 
early feedback.

For patch 001 to work, you will need to apply YARN-7221 patch 6 and YARN-7677 
patch 7.

I did some review of the environment variables, and I am only comfortable 
exposing user-defined environment variables.  The global Hadoop environment 
variables and the node-manager-constructed environment variables will be 
available in prelaunch, but not forwarded to the container.  Let me know your 
thoughts on this approach.  Thanks

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-7654.001.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8001) Newly created Yarn application ID lost after RM failover

2018-03-07 Thread shanyu zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389881#comment-16389881
 ] 

shanyu zhao commented on YARN-8001:
---

The failure is from a Pig job. After the client submitted an application 
successfully, it later failed when trying to query the status of the app. Who is 
going to re-submit the application? Are you saying that when using the YARN API 
to get the application, it will automatically resubmit the application?

> Newly created Yarn application ID lost after RM failover
> 
>
> Key: YARN-8001
> URL: https://issues.apache.org/jira/browse/YARN-8001
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 2.7.3, 2.9.0
>Reporter: shanyu zhao
>Priority: Major
>
> I’ve seen a problem in Hadoop 2.7.3 where a newly submitted YARN 
> application was lost after an RM failover. It looks like when handling 
> application submission, the RM does not write it to the state store (we are 
> using the ZooKeeper-based state store) before responding to the client. It 
> later failed over to another RM and all write calls to the state store 
> failed. The new RM recovers state from the state store, and this app is lost. 
>  
> The symptom is an error message at the client side claiming that a previously 
> submitted application ID does not exist:
> 2018-02-22 14:54:50,258 [JobControl] WARN  
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider - 
> Invocation returned exception on [rm1] : 
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1519310222933_0160' doesn't exist in RM. Please check 
> that the job submission was successful.
>  
> This is a timeline excerpted from the resource manager logs:
> 2018-02-22 14:54:06.7685260    headnode1    Storing application with id 
> application_1519310222933_0160
> 2018-02-22 14:54:06.7685660    headnode1  
> application_1519310222933_0160 State change from NEW to NEW_SAVING
> 2018-02-22 14:54:17.8924760    headnode1    Transitioning to standby state
> 2018-02-22 14:54:30.3951160    headnode0    Transitioning to active state



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container

2018-03-07 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-7654:

Attachment: YARN-7654.001.patch

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-7654.001.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7968) Reset the queue name in submission context while recovering an application

2018-03-07 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389794#comment-16389794
 ] 

Wilfred Spiegelenburg commented on YARN-7968:
-

During recovery we only have the queue name in the submission context. That was 
why YARN-7139 was logged.
We currently restore the submission context from the state store, which already 
has the correct queue set. We should not overwrite that with the one from the 
application object.
Do you mean that we should pull the queue out of the submission context and put 
it in the application object on restore?

> Reset the queue name in submission context while recovering an application
> --
>
> Key: YARN-7968
> URL: https://issues.apache.org/jira/browse/YARN-7968
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Major
>
> After YARN-7139, a new application can get the correct queue name in its 
> submission context. We need to do the same thing when recovering an application. 
> {code}
>   if (isAppRecovering) {
> if (LOG.isDebugEnabled()) {
>   LOG.debug(applicationId
>   + " is recovering. Skip notifying APP_ACCEPTED");
> }
>   } else {
> // During tests we do not always have an application object, handle
> // it here but we probably should fix the tests
> if (rmApp != null && rmApp.getApplicationSubmissionContext() != null) 
> {
>   // Before we send out the event that the app is accepted is
>   // to set the queue in the submissionContext (needed on restore etc)
>   rmApp.getApplicationSubmissionContext().setQueue(queue.getName());
> }
> rmContext.getDispatcher().getEventHandler().handle(
> new RMAppEvent(applicationId, RMAppEventType.APP_ACCEPTED));
>   }
> {code}
> We can do it by moving the 
> {{rmApp.getApplicationSubmissionContext().setQueue}} block out of the if-else 
> block. cc [~wilfreds].
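
A minimal sketch of the restructuring the description seems to propose, derived 
only from the snippet above (this is not the committed fix): reset the queue in 
the submission context unconditionally, and keep only the event dispatch inside 
the recovery check.
{code}
// Always sync the queue back into the submission context so it is correct on
// restore, whether or not this application is recovering.
if (rmApp != null && rmApp.getApplicationSubmissionContext() != null) {
  rmApp.getApplicationSubmissionContext().setQueue(queue.getName());
}
if (isAppRecovering) {
  if (LOG.isDebugEnabled()) {
    LOG.debug(applicationId + " is recovering. Skip notifying APP_ACCEPTED");
  }
} else {
  rmContext.getDispatcher().getEventHandler().handle(
      new RMAppEvent(applicationId, RMAppEventType.APP_ACCEPTED));
}
{code}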



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-1495) Allow moving apps between queues

2018-03-07 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-1495.
-
Resolution: Fixed

Closing as the last open JIRA, YARN-1558, was done in 2.9 as part of YARN-5932.

> Allow moving apps between queues
> 
>
> Key: YARN-1495
> URL: https://issues.apache.org/jira/browse/YARN-1495
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Affects Versions: 2.2.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Major
>
> This is an umbrella JIRA for work needed to allow moving YARN applications 
> from one queue to another.  The work will consist of additions in the command 
> line options, additions in the client RM protocol, and changes in the 
> schedulers to support this.
> I have a picture of how this should function in the Fair Scheduler, but I'm 
> not familiar enough with the Capacity Scheduler to say the same there.  
> Ultimately, the decision as to whether an application can be moved should go 
> down to the scheduler - some schedulers may wish not to support this at all.  
> However, schedulers that do support it should share some common semantics 
> around ACLs and what happens to running containers.
> Here is how I see the general semantics working out:
> * A move request is issued by the client.  After it gets past ACLs, the 
> scheduler checks whether executing the move will violate any constraints. For 
> the Fair Scheduler, these would be queue maxRunningApps and queue 
> maxResources constraints
> * All running containers are transferred from the old queue to the new queue
> * All outstanding requests are transferred from the old queue to the new queue
> Here is how I see the ACLs of this working out:
> * To move an app from a queue a user must have modify access on the app or 
> administer access on the queue
> * To move an app to a queue a user must have submit access on the queue or 
> administer access on the queue 
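
As a concrete illustration of the client side of this umbrella, the move is 
exposed through the application CLI; the flags below are given from memory, so 
treat the exact syntax as indicative rather than authoritative:
{code}
# Ask the RM to move a running application to another queue; the target
# scheduler still applies its own ACL and constraint checks.
yarn application -movetoqueue <application_id> -queue <target_queue>
{code}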



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-1558) After apps are moved across queues, store new queue info in the RM state store

2018-03-07 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-1558.
-
Resolution: Duplicate

Closing as YARN-5932 is in 2.9 and later

> After apps are moved across queues, store new queue info in the RM state store
> --
>
> Key: YARN-1558
> URL: https://issues.apache.org/jira/browse/YARN-1558
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Sandy Ryza
>Assignee: Varun Saxena
>Priority: Major
>
> The result of moving an app to a new queue should persist across RM restarts. 
>  This will require updating the ApplicationSubmissionContext, the single 
> source of truth upon state recovery, with the new queue info.
> There will be a brief window after the move completes before the move is 
> stored.  If the RM dies during this window, the recovered RM will include the 
> old queue info.  Schedulers should be resilient to this situation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389707#comment-16389707
 ] 

Hudson commented on YARN-7677:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13786 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/13786/])
YARN-7677. Docker image cannot set HADOOP_CONF_DIR. Contributed by Jim (jlowe: 
rev d69b31f7f70f296ddd180e004fa0f827c2f737f2)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Apps.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DelegatingLinuxContainerRuntime.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AuxiliaryServiceHelper.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DefaultLinuxContainerRuntime.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/runtime/ContainerRuntime.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java


> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7998) RM crashes with NPE during recovering if ACL configuration was changed

2018-03-07 Thread Oleksandr Shevchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389655#comment-16389655
 ] 

Oleksandr Shevchenko commented on YARN-7998:


The failed tests are not related to the patch. I have rerun all failed test 
classes locally and all tests passed successfully.

Could someone review the last patch?

> RM crashes with NPE during recovering if ACL configuration was changed
> --
>
> Key: YARN-7998
> URL: https://issues.apache.org/jira/browse/YARN-7998
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 3.0.0
>Reporter: Oleksandr Shevchenko
>Priority: Major
> Attachments: YARN-7998.000.patch, YARN-7998.001.patch, 
> YARN-7998.002.patch, YARN-7998.003.patch
>
>
> RM crashes with an NPE during failover because the ACL configuration was 
> changed; as a result, we no longer have the rights to submit an application to 
> a queue.
> Scenario:
>  # Submit an application
>  # Change the ACL configuration for the queue that accepted the application so 
> that the owner of the application no longer has the rights to submit this 
> application.
>  # Restart RM.
> As a result, we get NPE:
> 2018-02-27 18:14:00,968 INFO org.apache.hadoop.service.AbstractService: 
> Service ResourceManager failed in state STARTED; cause: 
> java.lang.NullPointerException
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:738)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1286)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:116)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1044)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385
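
Purely as a hypothetical sketch of the kind of guard involved (this is not the 
attached patch, and the field and type names are assumptions about the 
FairScheduler internals): the recovery path has to tolerate an attempt whose 
application was rejected at placement time.
{code}
// Inside addApplicationAttempt(): the application may be absent if its queue
// placement was rejected (e.g. by a changed ACL) during recovery.
SchedulerApplication<FSAppAttempt> application =
    applications.get(applicationAttemptId.getApplicationId());
if (application == null) {
  LOG.warn("Application " + applicationAttemptId.getApplicationId()
      + " is missing during recovery, skipping attempt " + applicationAttemptId);
  return;
}
{code}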



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4781) Support intra-queue preemption for fairness ordering policy.

2018-03-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389649#comment-16389649
 ] 

Eric Payne commented on YARN-4781:
--

Thanks a lot for the review [~sunilg].
{quote} my concern is over case where we need to worry about priority with 
FairOrdering. Both are kind of mutually exclusive feature to me.

we need to ensure that FairOrdering + USERLIMIT_FIRST is the combination
{quote}
It appears that the capacity scheduler's FairOrderingPolicy ignores priority. 
That is why I did not consider it in the {{TAFairOrderingComparator}}. Having 
said that, I am fine with adding an additional check to ensure that 
USERLIMIT_FIRST is set before setting the {{TAFairOrderingComparator}}.
{quote}could we use Fair plugin instead of FifoIntraQueuePreemptionPlugin, and 
make an abstract class to move all common code. Then Fair plugin could be auto 
chosen based on ordering policy selected
{quote}
I feel that this would add unnecessary complexity. I think restricting the 
changes to the comparator is a cleaner approach.

 

> Support intra-queue preemption for fairness ordering policy.
> 
>
> Key: YARN-4781
> URL: https://issues.apache.org/jira/browse/YARN-4781
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-4781.001.patch
>
>
> We introduced the fairness ordering policy in YARN-3319, which lets large 
> applications make progress without starving small applications. However, if a 
> large application takes the queue’s resources, and the containers of the large 
> app have long lifespans, small applications could still wait for resources for 
> a long time and SLAs cannot be guaranteed.
> Instead of waiting for applications to release resources on their own, we need 
> to preempt resources in queues with the fairness policy enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6136) YARN registry service should avoid scanning whole ZK tree for every container/application finish

2018-03-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389632#comment-16389632
 ] 

Steve Loughran commented on YARN-6136:
--

It's just trying to do a cleanup at the end, no matter how things exit.

This could trivially be made optional

> YARN registry service should avoid scanning whole ZK tree for every 
> container/application finish
> 
>
> Key: YARN-6136
> URL: https://issues.apache.org/jira/browse/YARN-6136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
>
> In the existing registry service implementation, a purge operation is 
> triggered by every container finish event:
> {code}
>   public void onContainerFinished(ContainerId id) throws IOException {
> LOG.info("Container {} finished, purging container-level records",
> id);
> purgeRecordsAsync("/",
> id.toString(),
> PersistencePolicies.CONTAINER);
>   }
> {code} 
> Since this happens on every container finish, it essentially scans all (or 
> almost all) ZK nodes from the root. 
> We have a cluster which has hundreds of ZK nodes for the service registry and 
> 20K+ ZK nodes for other purposes. The existing implementation can generate 
> massive numbers of ZK operations and internal Java objects (RegistryPathStatus) 
> as well. The RM becomes very unstable when there are batches of container 
> finish events, because of full GC pauses and ZK connection failures.
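
As a purely hypothetical sketch of the "make it optional" idea from the comment 
above (not an attached patch; the property name is invented for illustration), 
the root-level purge could be gated behind a configuration switch:
{code}
public void onContainerFinished(ContainerId id) throws IOException {
  // Skip the full-tree scan on clusters whose ZK registry tree is large;
  // "yarn.registry.purge-on-container-finish" is an assumed property name.
  if (!getConfig().getBoolean("yarn.registry.purge-on-container-finish", true)) {
    return;
  }
  LOG.info("Container {} finished, purging container-level records", id);
  purgeRecordsAsync("/",
      id.toString(),
      PersistencePolicies.CONTAINER);
}
{code}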



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7996) Allow user supplied Docker client configurations with YARN native services

2018-03-07 Thread Shane Kumpf (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389603#comment-16389603
 ] 

Shane Kumpf commented on YARN-7996:
---

I believe this is ready for review, [~billie.rinaldi] [~gsaha]

> Allow user supplied Docker client configurations with YARN native services
> --
>
> Key: YARN-7996
> URL: https://issues.apache.org/jira/browse/YARN-7996
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Shane Kumpf
>Assignee: Shane Kumpf
>Priority: Major
> Attachments: YARN-7996.001.patch, YARN-7996.002.patch
>
>
> YARN-5428 added support to distributed shell for supplying a Docker client 
> configuration at application submission time. The auth tokens within the 
> client configuration are then used to pull images from private Docker 
> repositories/registries. Add the same support to the YARN Native Services 
> framework.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2018-03-07 Thread Shane Kumpf (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389562#comment-16389562
 ] 

Shane Kumpf commented on YARN-7677:
---

+1 (non-binding) from me as well. I ran my usual suite of tests with and 
without Docker and did not see any issues. Thanks for driving this 
[~Jim_Brennan]!

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8007) Support specifying placement constraint for task containers in SLS

2018-03-07 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389557#comment-16389557
 ] 

Weiwei Yang edited comment on YARN-8007 at 3/7/18 1:37 PM:
---

Hi [~yangjiandan]

Thanks for initiating this; it makes a lot of sense to help test scheduler perf 
with placement constraints.

There are a few general comments on your patch:
 # Instead of using the {{SourceTags}} abstraction, let's follow the public API 
in the scheduling request, which is to provide a Set of source tags and a 
PlacementConstraint. I think you will need 2 configs,
 allocation_tags: foo,bar
 placement_constraint: notin,node,foo
 this way we can honor the number of tasks in the config.
 # I don't think we need to add a placement constraint field in 
{{ContainerSimulator}}; a placement constraint is request level, not container 
level.
 # Have you tried running this, and how does it work out?

Thanks


was (Author: cheersyang):
Hi [~yangjiandan]

Thanks for initiating this; it makes a lot of sense to help test scheduler perf 
with placement constraints.

There are a few general comments on your patch:
 # Instead of using the {{SourceTags}} abstraction, let's follow the public API 
in the scheduling request, which is to provide a Set of source tags and a 
PlacementConstraint. I think you will need 2 configs,
 allocation_tags: foo,bar
 placement_constraint: notin,node,foo
 this way we can honor the number of tasks in the config.
 # I don't think we need to add a placement constraint field in 
ContainerSimulator; a placement constraint is request level, not container 
level.
 # Have you tried running this, and how does it work out?

Thanks

> Support specifying placement constraint for task containers in SLS
> --
>
> Key: YARN-8007
> URL: https://issues.apache.org/jira/browse/YARN-8007
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8007.001.patch
>
>
> YARN-6592 introduces placement constraints. Currently SLS does not support 
> specifying placement constraints. 
> In order to help with better perf testing, we should be able to support 
> specifying placement constraints for containers in the SLS configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8007) Support specifying placement constraint for task containers in SLS

2018-03-07 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389557#comment-16389557
 ] 

Weiwei Yang commented on YARN-8007:
---

Hi [~yangjiandan]

Thanks for initiating this; it makes a lot of sense to help test scheduler perf 
with placement constraints.

There are a few general comments on your patch:
 # Instead of using the {{SourceTags}} abstraction, let's follow the public API 
in the scheduling request, which is to provide a Set of source tags and a 
PlacementConstraint. I think you will need 2 configs,
 allocation_tags: foo,bar
 placement_constraint: notin,node,foo
 this way we can honor the number of tasks in the config.
 # I don't think we need to add a placement constraint field in 
ContainerSimulator; a placement constraint is request level, not container 
level.
 # Have you tried running this, and how does it work out?

Thanks

> Support specifying placement constraint for task containers in SLS
> --
>
> Key: YARN-8007
> URL: https://issues.apache.org/jira/browse/YARN-8007
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8007.001.patch
>
>
> YARN-6592 introduces placement constraints. Currently SLS does not support 
> specifying placement constraints. 
> In order to help with better perf testing, we should be able to support 
> specifying placement constraints for containers in the SLS configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7905) Parent directory permission incorrect during public localization

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389535#comment-16389535
 ] 

genericqa commented on YARN-7905:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 18m 
36s{color} | {color:red} Docker failed to build yetus/hadoop:d4cc50f. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-7905 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12913383/YARN-7905-003.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/19913/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Parent directory permission incorrect during public localization 
> -
>
> Key: YARN-7905
> URL: https://issues.apache.org/jira/browse/YARN-7905
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-7905-001.patch, YARN-7905-002.patch, 
> YARN-7905-003.patch
>
>
> Similar to YARN-6708, during public localization we also have to take care of 
> the parent directory if the umask is 027 during node manager start-up.
> /filecache/0/200
> The directory permission of /filecache/0 is 750, which causes 
> application failure.
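
As a hypothetical sketch of the kind of fix implied here (not one of the attached 
patches; the local variable names are assumptions), the public cache parent 
directories could be created with an explicit permission instead of inheriting 
the process umask, mirroring what YARN-6708 did for private localization:
{code}
// Create <local-dir>/filecache/<subdir> with 0755 regardless of the NM umask,
// then set it explicitly in case the directory already existed.
FsPermission publicCachePerm = new FsPermission((short) 0755);
lfs.mkdir(publicCacheParentDir, publicCachePerm, true);
lfs.setPermission(publicCacheParentDir, publicCachePerm);
{code}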



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7905) Parent directory permission incorrect during public localization

2018-03-07 Thread Bilwa S T (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389508#comment-16389508
 ] 

Bilwa S T edited comment on YARN-7905 at 3/7/18 12:56 PM:
--

[~bibinchundatt] 

I have taken care of the comments given. Please review it.


was (Author: bilwast):
[~bibinchundatt] 

I have taken care of the above comments given. Please review it.

> Parent directory permission incorrect during public localization 
> -
>
> Key: YARN-7905
> URL: https://issues.apache.org/jira/browse/YARN-7905
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-7905-001.patch, YARN-7905-002.patch, 
> YARN-7905-003.patch
>
>
> Similar to YARN-6708, during public localization we also have to take care of 
> the parent directory if the umask is 027 during node manager start-up.
> /filecache/0/200
> The directory permission of /filecache/0 is 750, which causes 
> application failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7905) Parent directory permission incorrect during public localization

2018-03-07 Thread Bilwa S T (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389508#comment-16389508
 ] 

Bilwa S T commented on YARN-7905:
-

[~bibinchundatt] 

I have taken care of the above comments given. Please review it.

> Parent directory permission incorrect during public localization 
> -
>
> Key: YARN-7905
> URL: https://issues.apache.org/jira/browse/YARN-7905
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-7905-001.patch, YARN-7905-002.patch, 
> YARN-7905-003.patch
>
>
> Similar to YARN-6708, during public localization we also have to take care of 
> the parent directory if the umask is 027 during node manager start-up.
> /filecache/0/200
> The directory permission of /filecache/0 is 750, which causes 
> application failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7905) Parent directory permission incorrect during public localization

2018-03-07 Thread Bilwa S T (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bilwa S T updated YARN-7905:

Attachment: YARN-7905-003.patch

> Parent directory permission incorrect during public localization 
> -
>
> Key: YARN-7905
> URL: https://issues.apache.org/jira/browse/YARN-7905
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-7905-001.patch, YARN-7905-002.patch, 
> YARN-7905-003.patch
>
>
> Similar to YARN-6708, during public localization we also have to take care of 
> the parent directory if the umask is 027 during node manager start-up.
> /filecache/0/200
> The directory permission of /filecache/0 is 750, which causes 
> application failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7905) Parent directory permission incorrect during public localization

2018-03-07 Thread Bilwa S T (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bilwa S T updated YARN-7905:

Attachment: (was: YARN-7905-003.patch)

> Parent directory permission incorrect during public localization 
> -
>
> Key: YARN-7905
> URL: https://issues.apache.org/jira/browse/YARN-7905
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-7905-001.patch, YARN-7905-002.patch, 
> YARN-7905-003.patch
>
>
> Similar to YARN-6708, during public localization we also have to take care of 
> the parent directory if the umask is 027 during node manager start-up.
> /filecache/0/200
> The directory permission of /filecache/0 is 750, which causes 
> application failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7905) Parent directory permission incorrect during public localization

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389495#comment-16389495
 ] 

genericqa commented on YARN-7905:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} 
| {color:red} YARN-7905 does not apply to trunk. Rebase required? Wrong Branch? 
See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-7905 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12913378/YARN-7905-003.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/19912/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Parent directory permission incorrect during public localization 
> -
>
> Key: YARN-7905
> URL: https://issues.apache.org/jira/browse/YARN-7905
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-7905-001.patch, YARN-7905-002.patch, 
> YARN-7905-003.patch
>
>
> Similar to YARN-6708, during public localization we also have to take care of 
> the parent directory if the umask is 027 during node manager start-up.
> /filecache/0/200
> The directory permission of /filecache/0 is 750, which causes 
> application failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7905) Parent directory permission incorrect during public localization

2018-03-07 Thread Bilwa S T (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bilwa S T updated YARN-7905:

Attachment: YARN-7905-003.patch

> Parent directory permission incorrect during public localization 
> -
>
> Key: YARN-7905
> URL: https://issues.apache.org/jira/browse/YARN-7905
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-7905-001.patch, YARN-7905-002.patch, 
> YARN-7905-003.patch
>
>
> Similar to YARN-6708, during public localization we also have to take care of 
> the parent directory if the umask is 027 during node manager start-up.
> /filecache/0/200
> The directory permission of /filecache/0 is 750, which causes 
> application failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8007) Support specifying placement constraint for task containers in SLS

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389451#comment-16389451
 ] 

genericqa commented on YARN-8007:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
36s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}  
9m 10s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
17s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
17s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 10s{color} | {color:orange} hadoop-tools/hadoop-sls: The patch generated 1 
new + 40 unchanged - 0 fixed = 41 total (was 40) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m  7s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  9m 19s{color} 
| {color:red} hadoop-sls in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
18s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 49m 16s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.sls.TestSLSStreamAMSynth |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | YARN-8007 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12913366/YARN-8007.001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 3f579a853aa7 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 
11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 58ea2d7 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/19911/artifact/out/diff-checkstyle-hadoop-tools_hadoop-sls.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/19911/artifact/out/patch-unit-hadoop-tools_hadoop-sls.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/19911/testReport/ |
| Max. process+thread count | 467 (vs. ulimit of 1) |
| modules | C: hadoop-tools/hadoop-sls U: 

[jira] [Created] (YARN-8008) Admin command to manage global placement constraints

2018-03-07 Thread Weiwei Yang (JIRA)
Weiwei Yang created YARN-8008:
-

 Summary: Admin command to manage global placement constraints
 Key: YARN-8008
 URL: https://issues.apache.org/jira/browse/YARN-8008
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Weiwei Yang
Assignee: Weiwei Yang


Add a command for admins to manage global placement constraints, such as add, 
remove and list. This will be exposed via, for example
{code}
yarn rmadmin -placementConstraint [ -add -t  -c  | -remove -t  
| -list ]
{code}
Propose to use this JIRA for the API/proto changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8007) Support specifying placement constraint for task containers in SLS

2018-03-07 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8007:

Attachment: YARN-8007.001.patch

> Support specifying placement constraint for task containers in SLS
> --
>
> Key: YARN-8007
> URL: https://issues.apache.org/jira/browse/YARN-8007
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8007.001.patch
>
>
> YARN-6592 introduces placement constraints. Currently SLS does not support 
> specifying placement constraints. 
> In order to help with better perf testing, we should be able to support 
> specifying placement constraints for containers in the SLS configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8006) Make Hbase-2 profile as default for YARN-7055 branch

2018-03-07 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389361#comment-16389361
 ] 

Rohith Sharma K S commented on YARN-8006:
-

[~haibochen] could you review the patch? It appears that the test failures and 
other failures are because of switching to the hbase-2 profile by default. 

> Make Hbase-2 profile as default for YARN-7055 branch
> 
>
> Key: YARN-8006
> URL: https://issues.apache.org/jira/browse/YARN-8006
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8006-YARN-7055.001.patch
>
>
> In the last weekly call, folks discussed that we should have a separate branch 
> with hbase-2 as the default profile. Trunk's default profile is hbase-1, which 
> runs all the tests under the hbase-1 profile, but tests for the hbase-2 profile 
> are not running.
> As per the discussion, let's keep the YARN-7055 branch with the hbase-2 profile 
> as default. Any server-side patches can be given to this branch as well, which 
> runs the tests for the hbase-2 profile. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8006) Make Hbase-2 profile as default for YARN-7055 branch

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389348#comment-16389348
 ] 

genericqa commented on YARN-8006:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
28s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} YARN-7055 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  5m 
28s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 
43s{color} | {color:green} YARN-7055 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m 
56s{color} | {color:green} YARN-7055 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
29s{color} | {color:green} YARN-7055 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
45m  7s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
14s{color} | {color:green} YARN-7055 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
10s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m  
1s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 12m  1s{color} 
| {color:red} root generated 201 new + 1233 unchanged - 0 fixed = 1434 total 
(was 1233) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
4s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}  
8m 23s{color} | {color:green} patch has no errors when building and testing our 
client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
15s{color} | {color:green} hadoop-project in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
23s{color} | {color:green} hadoop-yarn-server-timelineservice-hbase-server in 
the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  2m 44s{color} 
| {color:red} hadoop-yarn-server-timelineservice-hbase-tests in the patch 
failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 76m 28s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRun |
|   | 
hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowActivity |
|   | hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineStorageSchema 
|
|   | 
hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineStorageEntities |
|   | hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineStorageApps |
|   | 
hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage
 |
|   | 
hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRunCompaction
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:c8176b7 |
| JIRA Issue | YARN-8006 |
| JIRA Patch URL | 

[jira] [Commented] (YARN-7975) Add an optional arg to yarn cluster -list-node-labels to list nodes collection partitioned by labels

2018-03-07 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389327#comment-16389327
 ] 

Sunil G commented on YARN-7975:
---

[~shenyinjie], could you please write some UT cases to verify the case which
you added? Thank you.

> Add an optional arg to yarn cluster -list-node-labels to list nodes 
> collection partitioned by labels
> 
>
> Key: YARN-7975
> URL: https://issues.apache.org/jira/browse/YARN-7975
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Shen Yinjie
>Assignee: Shen Yinjie
>Priority: Major
> Attachments: YARN-7975.patch
>
>
> We already have "yarn cluster -lnl" to print all node labels info, but it's 
> not enough; we should also be able to list the collection of nodes 
> partitioned by labels, especially in a large cluster.
> So I propose to add an optional argument "-nodes" for "yarn cluster -lnl" 
> to achieve this.
> e.g.
> [yarn@docker1 ~]$ yarn cluster -lnl -nodes
> Node Labels Num: 3
>               Labels                                               Nodes
>  
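
To make the requested UT concrete, here is a minimal sketch of the kind of test that could cover the new option. It assumes the {{setClient}}/{{setSysOutPrintStream}} setters used by the existing YARN CLI tests, and any extra client calls introduced by the patch (for example a labels-to-nodes lookup) would need additional stubbing, so treat it as an outline rather than a drop-in test:

{code:java}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Arrays;

import org.apache.hadoop.yarn.api.records.NodeLabel;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.cli.ClusterCLI;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.junit.Assert;
import org.junit.Test;

public class TestClusterCLIListNodesByLabel {

  @Test
  public void testListNodesPartitionedByLabels() throws Exception {
    // Mocked client reporting a single cluster node label.
    YarnClient client = mock(YarnClient.class);
    when(client.getClusterNodeLabels())
        .thenReturn(Arrays.asList(NodeLabel.newInstance("gpu")));

    ClusterCLI cli = new ClusterCLI();
    cli.setConf(new YarnConfiguration());
    // Assumption: these setters match the ones used by existing CLI tests.
    cli.setClient(client);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    cli.setSysOutPrintStream(new PrintStream(out));

    // Exercise the proposed "-lnl -nodes" combination.
    int rc = cli.run(new String[] {"-lnl", "-nodes"});

    Assert.assertEquals(0, rc);
    // The exact layout (labels followed by their nodes) depends on the patch,
    // so assert only on the stable pieces of the output.
    Assert.assertTrue(out.toString().contains("gpu"));
  }
}
{code}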

[jira] [Commented] (YARN-8000) Yarn Service: component instance name shows up as component name in container record

2018-03-07 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389324#comment-16389324
 ] 

Sunil G commented on YARN-8000:
---

Yes. As mentioned by [~gsaha], if the data is published to ATS correctly, the 
UI won't have any problem. Kindly check the service component page after this 
patch is applied.

> Yarn Service: component instance name shows up as component name in container 
> record 
> -
>
> Key: YARN-8000
> URL: https://issues.apache.org/jira/browse/YARN-8000
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8000.001.patch, YARN-8000.002.patch, 
> YARN-8000.003.patch, YARN-8000.004.patch
>
>
> Yarn Service: component instance name shows up as component name in container 
> record 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2442) ResourceManager JMX UI does not give HA State

2018-03-07 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389322#comment-16389322
 ] 

Bibin A Chundatt commented on YARN-2442:


Thank you [~rohithsharma] for the quick patch.
Overall the patch looks good to me. Could you handle the checkstyle errors too?


> ResourceManager JMX UI does not give HA State
> -
>
> Key: YARN-2442
> URL: https://issues.apache.org/jira/browse/YARN-2442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.0, 2.7.0
>Reporter: Nishan Shetty
>Assignee: Rohith Sharma K S
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-2442.patch, YARN-2442.02.patch
>
>
> ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
> STOPPED)
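
For context, exposing an attribute like haState over JMX generally looks like the generic sketch below. This is not the actual patch (the real change hooks into the RM's existing JMX/metrics beans); {{RMHAStateMXBean}}, the object name and the hard-coded state are all illustrative:

{code:java}
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical MXBean interface; the real patch extends the RM's own beans
// rather than registering a brand-new one like this.
interface RMHAStateMXBean {
  String getHAState();
}

public class RMHAStateJMXSketch implements RMHAStateMXBean {

  // In the RM this would delegate to the elector/transition state
  // (INITIALIZING, ACTIVE, STANDBY, STOPPED); hard-coded for the sketch.
  private volatile String haState = "STANDBY";

  @Override
  public String getHAState() {
    return haState;
  }

  public static void main(String[] args) throws Exception {
    RMHAStateJMXSketch bean = new RMHAStateJMXSketch();
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    // Illustrative object name; the real domain/key follow Hadoop conventions.
    ObjectName name =
        new ObjectName("Hadoop:service=ResourceManager,name=RMHAState");
    server.registerMBean(bean, name);
    // A JMX client such as jconsole would now show the HAState attribute.
    System.out.println("Registered " + name + ", HAState=" + bean.getHAState());
  }
}
{code}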



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4781) Support intra-queue preemption for fairness ordering policy.

2018-03-07 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389289#comment-16389289
 ] 

Sunil G commented on YARN-4781:
---

Thanks [~eepayne] for bringing this thread back to life.

bq.Then, while all 4 users have running apps, User5 comes along and can't get 
any resources, they see that User1 is using 62% more resources than everyone 
else, and wonders why they can't get any resources.

Ideally, when app3 finishes, the freed quota will go to the apps which started 
first or which have consumed fewer resources than the others. However, my 
concern is over the case where we need to worry about priority together with 
FairOrdering; the two feel like mutually exclusive features to me.

 

Coming to the approach, I guess we need to ensure that FairOrdering + 
USERLIMIT_FIRST is the combination. With this combination we will be able to 
preempt on behalf of apps which are under-allocated as per the FairOrdering 
policy.

 

The patch is simple enough to cover the comparator change. Thinking out loud, 
could we use a Fair plugin instead of FifoIntraQueuePreemptionPlugin, and make 
an abstract class to hold all the common code? The Fair plugin could then be 
chosen automatically based on the ordering policy selected.
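
To illustrate the shape of that refactoring: FifoIntraQueuePreemptionPlugin and FairOrderingPolicy are existing classes, but the abstract base, the Fair variant and the selection helper below are hypothetical stand-ins with trimmed-down signatures, shown only to make the auto-selection idea concrete:

{code:java}
// Stand-in types purely for illustration; the real classes live in the
// capacity-scheduler preemption code and have much richer signatures.
interface OrderingPolicyInfo {
  boolean isFairOrdering();
}

// Hypothetical abstract base holding the code shared by both plugins.
abstract class AbstractIntraQueuePreemptionPlugin {
  // Common bookkeeping (user limits, selected candidates, ...) would live here.
  abstract String name();
}

class FifoIntraQueuePreemptionSketch extends AbstractIntraQueuePreemptionPlugin {
  @Override String name() { return "fifo"; }
}

class FairIntraQueuePreemptionSketch extends AbstractIntraQueuePreemptionPlugin {
  // Only the comparator / candidate-ordering logic would differ.
  @Override String name() { return "fair"; }
}

public class IntraQueuePluginSelection {
  // Auto-choose the plugin from the queue's configured ordering policy.
  static AbstractIntraQueuePreemptionPlugin selectPlugin(OrderingPolicyInfo p) {
    return p.isFairOrdering()
        ? new FairIntraQueuePreemptionSketch()
        : new FifoIntraQueuePreemptionSketch();
  }

  public static void main(String[] args) {
    System.out.println(selectPlugin(() -> true).name());   // fair
    System.out.println(selectPlugin(() -> false).name());  // fifo
  }
}
{code}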

> Support intra-queue preemption for fairness ordering policy.
> 
>
> Key: YARN-4781
> URL: https://issues.apache.org/jira/browse/YARN-4781
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-4781.001.patch
>
>
> We introduced the fairness queue policy in YARN-3319, which lets large 
> applications make progress without starving small applications. However, if a 
> large application takes the queue’s resources and its containers have a long 
> lifespan, small applications could still wait for resources for a long time 
> and SLAs cannot be guaranteed.
> Instead of waiting for applications to release resources on their own, we 
> need to preempt resources in queues with the fairness policy enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


