[jira] [Commented] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900246#comment-14900246
 ] 

Hudson commented on YARN-4167:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #426 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/426/])
YARN-4167. NPE on RMActiveServices#serviceStop when store is null. (Bibin A 
Chundatt via rohithsharmaks) (rohithsharmaks: rev 
c9cb6a5960ad335a3ee93a6ee219eae5aad372f9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* hadoop-yarn-project/CHANGES.txt


> NPE on RMActiveServices#serviceStop when store is null
> --
>
> Key: YARN-4167
> URL: https://issues.apache.org/jira/browse/YARN-4167
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, 
> 0002-YARN-4167.patch
>
>
> Configure 
> {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} 
> so that it mismatches {{yarn.nm.liveness-monitor.expiry-interval-ms}}.
> On startup, an NPE is thrown in {{RMActiveServices#serviceStop}}:
> {noformat}
> 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED; cause: 
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.<init>(RMContainerTokenSecretManager.java:82)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.<init>(RMSecretManagerService.java:57)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> 2015-09-16 12:23:29,507 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing 
> store.
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> {noformat}
> *Impact Area*: RM failover with wrong configuration
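
For readers following the fix: the failure happens because serviceInit aborts before the state store is created, and serviceStop then dereferences the never-initialized store. A minimal sketch of that kind of null guard, using a simplified placeholder type and field name rather than the actual Hadoop source, looks like this:
{code}
// Sketch only: the StateStore interface and the rmStore field are simplified
// placeholders, not the real ResourceManager members.
public class ActiveServicesSketch {

  interface StateStore {
    void close() throws Exception;
  }

  // May still be null if serviceInit failed before the store was created.
  private StateStore rmStore;

  protected void serviceStop() throws Exception {
    // Guard against the NPE: only close the store if it was ever created.
    if (rmStore != null) {
      rmStore.close();
    }
  }
}
{code}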



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2015-09-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900285#comment-14900285
 ] 

zhihai xu commented on YARN-4095:
-

Hi [~Jason Lowe], could you help review the patch? Thanks.

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between 
> ShuffleHandler and LocalDirsHandlerService.
> -
>
> Key: YARN-4095
> URL: https://issues.apache.org/jira/browse/YARN-4095
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share the 
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration 
> {{NM_LOCAL_DIRS}}, because {{AllocatorPerContext}} objects are stored in a static 
> TreeMap with the configuration name as key:
> {code}
>   private static Map<String, AllocatorPerContext> contexts = 
>  new TreeMap<String, AllocatorPerContext>();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even though they don't use the same 
> {{Configuration}} object, they will use the same {{AllocatorPerContext}} 
> object. Also, while {{LocalDirsHandlerService}} may change the {{NM_LOCAL_DIRS}} value 
> in its {{Configuration}} object to exclude full and bad local dirs, 
> {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its 
> {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} 
> is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, 
> {{AllocatorPerContext}} needs to be reinitialized because the {{NM_LOCAL_DIRS}} value 
> has changed. This causes some overhead.
> {code}
>   String newLocalDirs = conf.get(contextCfgItemName);
>   if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it will be a good improvement to not share the same 
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and 
> {{LocalDirsHandlerService}}.
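
To make the sharing concrete, here is a self-contained sketch (class names and directory values are illustrative, not the Hadoop source) of how keying the per-context object on the configuration name alone makes two callers with diverging values keep invalidating each other's saved state:
{code}
import java.util.Map;
import java.util.TreeMap;

public class SharedContextSketch {

  static class AllocatorPerContext {
    String savedLocalDirs;

    void confChanged(String newLocalDirs) {
      if (!newLocalDirs.equals(savedLocalDirs)) {
        // In the real code this re-walks and re-validates every directory.
        System.out.println("reinitializing for: " + newLocalDirs);
        savedLocalDirs = newLocalDirs;
      }
    }
  }

  // Keyed by configuration name: exactly one context per key, process-wide.
  private static final Map<String, AllocatorPerContext> contexts = new TreeMap<>();

  static AllocatorPerContext obtain(String contextCfgItemName) {
    return contexts.computeIfAbsent(contextCfgItemName, k -> new AllocatorPerContext());
  }

  public static void main(String[] args) {
    AllocatorPerContext dirsHandler = obtain("yarn.nodemanager.local-dirs");
    AllocatorPerContext shuffle = obtain("yarn.nodemanager.local-dirs");
    // Same instance, so the two callers keep forcing re-scans on each other.
    dirsHandler.confChanged("/data1,/data2");    // dirs handler excluded a bad dir
    shuffle.confChanged("/data1,/data2,/data3"); // shuffle still has the original list
    dirsHandler.confChanged("/data1,/data2");    // reinitialized yet again
  }
}
{code}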



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side

2015-09-21 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900290#comment-14900290
 ] 

Devaraj K commented on YARN-3964:
-

[~leftnoteasy], Sure, Thanks for your interest.

> Support NodeLabelsProvider at Resource Manager side
> ---
>
> Key: YARN-3964
> URL: https://issues.apache.org/jira/browse/YARN-3964
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Dian Fu
>Assignee: Dian Fu
> Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, 
> YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, 
> YARN-3964.006.patch, YARN-3964.1.patch
>
>
> Currently, CLI/REST API is provided in Resource Manager to allow users to 
> specify labels for nodes. For labels which may change over time, users will 
> have to start a cron job to update the labels. This has the following 
> limitations:
> - The cron job needs to be run as the YARN admin user.
> - This makes it a little complicated to maintain, as users will have to make 
> sure this service/daemon is alive.
> Adding a Node Labels Provider in Resource Manager will give users more 
> flexibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900347#comment-14900347
 ] 

Hudson commented on YARN-4167:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2337 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2337/])
YARN-4167. NPE on RMActiveServices#serviceStop when store is null. (Bibin A 
Chundatt via rohithsharmaks) (rohithsharmaks: rev 
c9cb6a5960ad335a3ee93a6ee219eae5aad372f9)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java


> NPE on RMActiveServices#serviceStop when store is null
> --
>
> Key: YARN-4167
> URL: https://issues.apache.org/jira/browse/YARN-4167
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, 
> 0002-YARN-4167.patch
>
>
> Configure 
> {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} 
> so that it mismatches {{yarn.nm.liveness-monitor.expiry-interval-ms}}.
> On startup, an NPE is thrown in {{RMActiveServices#serviceStop}}:
> {noformat}
> 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED; cause: 
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.<init>(RMContainerTokenSecretManager.java:82)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.<init>(RMSecretManagerService.java:57)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> 2015-09-16 12:23:29,507 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing 
> store.
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> {noformat}
> *Impact Area*: RM failover with wrong configuration



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900353#comment-14900353
 ] 

Hudson commented on YARN-4167:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #399 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/399/])
YARN-4167. NPE on RMActiveServices#serviceStop when store is null. (Bibin A 
Chundatt via rohithsharmaks) (rohithsharmaks: rev 
c9cb6a5960ad335a3ee93a6ee219eae5aad372f9)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java


> NPE on RMActiveServices#serviceStop when store is null
> --
>
> Key: YARN-4167
> URL: https://issues.apache.org/jira/browse/YARN-4167
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, 
> 0002-YARN-4167.patch
>
>
> Configure 
> {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} 
> so that it mismatches {{yarn.nm.liveness-monitor.expiry-interval-ms}}.
> On startup, an NPE is thrown in {{RMActiveServices#serviceStop}}:
> {noformat}
> 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED; cause: 
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.<init>(RMContainerTokenSecretManager.java:82)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.<init>(RMSecretManagerService.java:57)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> 2015-09-16 12:23:29,507 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing 
> store.
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> {noformat}
> *Impact Area*: RM failover with wrong configuration



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900355#comment-14900355
 ] 

Hudson commented on YARN-4167:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2364 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2364/])
YARN-4167. NPE on RMActiveServices#serviceStop when store is null. (Bibin A 
Chundatt via rohithsharmaks) (rohithsharmaks: rev 
c9cb6a5960ad335a3ee93a6ee219eae5aad372f9)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java


> NPE on RMActiveServices#serviceStop when store is null
> --
>
> Key: YARN-4167
> URL: https://issues.apache.org/jira/browse/YARN-4167
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, 
> 0002-YARN-4167.patch
>
>
> Configure 
> {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} 
> so that it mismatches {{yarn.nm.liveness-monitor.expiry-interval-ms}}.
> On startup, an NPE is thrown in {{RMActiveServices#serviceStop}}:
> {noformat}
> 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED; cause: 
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.<init>(RMContainerTokenSecretManager.java:82)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.<init>(RMSecretManagerService.java:57)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> 2015-09-16 12:23:29,507 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing 
> store.
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> {noformat}
> *Impact Area*: RM failover with wrong configuration



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3964) Support NodeLabelsProvider at Resource Manager side

2015-09-21 Thread Dian Fu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dian Fu updated YARN-3964:
--
Attachment: YARN-3964.006.patch

Thanks [~devaraj.k] for taking a look at the patch. Attaching a rebased patch.

> Support NodeLabelsProvider at Resource Manager side
> ---
>
> Key: YARN-3964
> URL: https://issues.apache.org/jira/browse/YARN-3964
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Dian Fu
>Assignee: Dian Fu
> Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, 
> YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, 
> YARN-3964.006.patch, YARN-3964.1.patch
>
>
> Currently, CLI/REST API is provided in Resource Manager to allow users to 
> specify labels for nodes. For labels which may change over time, users will 
> have to start a cron job to update the labels. This has the following 
> limitations:
> - The cron job needs to be run as the YARN admin user.
> - This makes it a little complicated to maintain, as users will have to make 
> sure this service/daemon is alive.
> Adding a Node Labels Provider in Resource Manager will give users more 
> flexibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900255#comment-14900255
 ] 

Hudson commented on YARN-4167:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #1158 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1158/])
YARN-4167. NPE on RMActiveServices#serviceStop when store is null. (Bibin A 
Chundatt via rohithsharmaks) (rohithsharmaks: rev 
c9cb6a5960ad335a3ee93a6ee219eae5aad372f9)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java


> NPE on RMActiveServices#serviceStop when store is null
> --
>
> Key: YARN-4167
> URL: https://issues.apache.org/jira/browse/YARN-4167
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, 
> 0002-YARN-4167.patch
>
>
> Configure 
> {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} 
> so that it mismatches {{yarn.nm.liveness-monitor.expiry-interval-ms}}.
> On startup, an NPE is thrown in {{RMActiveServices#serviceStop}}:
> {noformat}
> 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED; cause: 
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.<init>(RMContainerTokenSecretManager.java:82)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.<init>(RMSecretManagerService.java:57)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> 2015-09-16 12:23:29,507 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing 
> store.
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> {noformat}
> *Impact Area*: RM failover with wrong configuration



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side

2015-09-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900261#comment-14900261
 ] 

Wangda Tan commented on YARN-3964:
--

[~dian.fu], [~devaraj.k], I plan to take a look at this patch tomorrow; could 
you wait for my review before committing it?

Thanks,

> Support NodeLabelsProvider at Resource Manager side
> ---
>
> Key: YARN-3964
> URL: https://issues.apache.org/jira/browse/YARN-3964
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Dian Fu
>Assignee: Dian Fu
> Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, 
> YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, 
> YARN-3964.006.patch, YARN-3964.1.patch
>
>
> Currently, CLI/REST API is provided in Resource Manager to allow users to 
> specify labels for nodes. For labels which may change over time, users will 
> have to start a cron job to update the labels. This has the following 
> limitations:
> - The cron job needs to be run as the YARN admin user.
> - This makes it a little complicated to maintain, as users will have to make 
> sure this service/daemon is alive.
> Adding a Node Labels Provider in Resource Manager will give users more 
> flexibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2015-09-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900286#comment-14900286
 ] 

zhihai xu commented on YARN-4095:
-

Hi [~jlowe], could you help review the patch? Thanks.

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between 
> ShuffleHandler and LocalDirsHandlerService.
> -
>
> Key: YARN-4095
> URL: https://issues.apache.org/jira/browse/YARN-4095
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share the 
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration 
> {{NM_LOCAL_DIRS}}, because {{AllocatorPerContext}} objects are stored in a static 
> TreeMap with the configuration name as key:
> {code}
>   private static Map<String, AllocatorPerContext> contexts = 
>  new TreeMap<String, AllocatorPerContext>();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even though they don't use the same 
> {{Configuration}} object, they will use the same {{AllocatorPerContext}} 
> object. Also, while {{LocalDirsHandlerService}} may change the {{NM_LOCAL_DIRS}} value 
> in its {{Configuration}} object to exclude full and bad local dirs, 
> {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its 
> {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} 
> is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, 
> {{AllocatorPerContext}} needs to be reinitialized because the {{NM_LOCAL_DIRS}} value 
> has changed. This causes some overhead.
> {code}
>   String newLocalDirs = conf.get(contextCfgItemName);
>   if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it will be a good improvement to not share the same 
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and 
> {{LocalDirsHandlerService}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901850#comment-14901850
 ] 

Hudson commented on YARN-4188:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1160 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1160/])
YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static 
(cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java
* hadoop-yarn-project/CHANGES.txt


> MoveApplicationAcrossQueuesResponse should be an abstract class
> ---
>
> Key: YARN-4188
> URL: https://issues.apache.org/jira/browse/YARN-4188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-4188.v0.patch
>
>
> MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally, 
> its {{newInstance}} method should be static. Currently we are not facing 
> any issues because the response is an empty object on success.
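
For readers unfamiliar with the pattern, a hedged sketch of the requested shape — an abstract record with a static factory — is below; the class name and the anonymous-subclass factory body are stand-ins for YARN's record-factory machinery, not the committed code:
{code}
// Sketch of "abstract class + static newInstance": callers never construct
// the record directly. The anonymous subclass keeps the sketch runnable on
// its own; the real record would come from YARN's record factory.
public abstract class MoveResponseSketch {

  public static MoveResponseSketch newInstance() {
    return new MoveResponseSketch() { };
  }
}
{code}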



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities

2015-09-21 Thread Shiwei Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiwei Guo updated YARN-4199:
-
Description: 
In the current implementation, LeveldbTimelineStore.discardOldEntities holds a 
writeLock on deleteLock, which blocks other put operations and eventually 
blocks the execution of YARN jobs (e.g. TEZ). When there are lots of history jobs 
in the timeline store, the blocking time can be very long. In our observation, it 
blocked all the TEZ jobs for several hours or longer. 

The possible solutions are:
- Optimize the leveldb configuration, so a full scan won't take a long time.
- Take a snapshot of leveldb and scan the snapshot, so we only need to hold 
the lock while calling getSnapshot. One open question is whether taking the snapshot 
will take a long time, since I have no experience with leveldb.

  was:
I current implementation, LeveldbTimelineStore.discardOldEntities holds a 
writeLock on deleteLock, which will block other put operation, which eventually 
block the execution of YARN jobs(e.g. TEZ). When there is lots of history jobs 
in timelinestore, the block time will be very long. In our observation, it 
block all the TEZ jobs for several hours or longer. 

The possible solutions are:
- Optimize leveldb configuration,  so a full scan won't take long time.
- Take a snapshot of leveldb, and scan the snapshot, so we only need to hold 
lock while getSnapshot. One question is that whether snapshot will take long 
time or not, cause I have no experience with leveldb.


> Minimize lock time in LeveldbTimelineStore.discardOldEntities
> -
>
> Key: YARN-4199
> URL: https://issues.apache.org/jira/browse/YARN-4199
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver, yarn
>Reporter: Shiwei Guo
>
> In the current implementation, LeveldbTimelineStore.discardOldEntities holds a 
> writeLock on deleteLock, which blocks other put operations and 
> eventually blocks the execution of YARN jobs (e.g. TEZ). When there are lots of 
> history jobs in the timeline store, the blocking time can be very long. In our 
> observation, it blocked all the TEZ jobs for several hours or longer. 
> The possible solutions are:
> - Optimize the leveldb configuration, so a full scan won't take a long time.
> - Take a snapshot of leveldb and scan the snapshot, so we only need to hold 
> the lock while calling getSnapshot. One open question is whether taking the snapshot 
> will take a long time, since I have no experience with leveldb.
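
A minimal sketch of the snapshot option above, written against the org.iq80.leveldb API that the leveldb timeline store builds on; the lock handling, key layout and the actual deletion are placeholders for illustration, not the store's real code:
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.ReadOptions;
import org.iq80.leveldb.Snapshot;

public class DiscardWithSnapshotSketch {
  private final DB db;
  private final ReentrantReadWriteLock deleteLock = new ReentrantReadWriteLock();

  public DiscardWithSnapshotSketch(DB db) {
    this.db = db;
  }

  public void discardOldEntities(long retainMillis) throws Exception {
    Snapshot snapshot;
    deleteLock.writeLock().lock();
    try {
      // Only the snapshot acquisition happens under the lock.
      snapshot = db.getSnapshot();
    } finally {
      deleteLock.writeLock().unlock();
    }
    // The long scan runs against the snapshot with the lock released.
    try (DBIterator it = db.iterator(new ReadOptions().snapshot(snapshot))) {
      for (it.seekToFirst(); it.hasNext(); it.next()) {
        byte[] key = it.peekNext().getKey();
        // Placeholder: decide from the key whether the entity is older than
        // retainMillis, and if so issue db.delete(key).
      }
    } finally {
      snapshot.close();
    }
  }
}
{code}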



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities

2015-09-21 Thread Shiwei Guo (JIRA)
Shiwei Guo created YARN-4199:


 Summary: Minimize lock time in 
LeveldbTimelineStore.discardOldEntities
 Key: YARN-4199
 URL: https://issues.apache.org/jira/browse/YARN-4199
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver, yarn
Reporter: Shiwei Guo


In the current implementation, LeveldbTimelineStore.discardOldEntities holds a 
writeLock on deleteLock, which blocks other put operations and eventually 
blocks the execution of YARN jobs (e.g. TEZ). When there are lots of history jobs 
in the timeline store, the blocking time can be very long. In our observation, it 
blocked all the TEZ jobs for several hours or longer. 

The possible solutions are:
- Optimize the leveldb configuration, so a full scan won't take a long time.
- Take a snapshot of leveldb and scan the snapshot, so we only need to hold 
the lock while calling getSnapshot. One open question is whether taking the snapshot 
will take a long time, since I have no experience with leveldb.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4011) Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk

2015-09-21 Thread Maysam Yabandeh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901946#comment-14901946
 ] 

Maysam Yabandeh commented on YARN-4011:
---

We face this problem quite often in our ad hoc cluster and are thinking of 
implementing some basic checkers to make such misbehaved jobs fail fast.

Until we have a proper solution in YARN, could we have a mapreduce-specific 
solution in place to protect the cluster from rogue mapreduce tasks? The 
mapreduce task can check the BYTES_WRITTEN counter and fail fast if it is above 
the configured limit. It is true that the bytes written are larger than the actual 
disk space used, but to detect a rogue task the exact value is not required, and 
a very large number of bytes written to local disk is a good indication that 
the task is misbehaving.

Thoughts?
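
To illustrate the idea (not an existing MapReduce feature): a rough sketch of a mapper-side guard that periodically compares the local file system's bytes-written statistics — the numbers that feed the FILE bytes-written counter — against a configured ceiling. The property name mapreduce.task.local-bytes-written.limit is a made-up example.
{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DiskGuardMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private long maxLocalBytesWritten;
  private long recordsSinceCheck;

  @Override
  protected void setup(Context context) {
    // Hypothetical property; a real guard would need an agreed config key.
    maxLocalBytesWritten = context.getConfiguration()
        .getLong("mapreduce.task.local-bytes-written.limit", Long.MAX_VALUE);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Check only every 10,000 records to keep the overhead negligible.
    if (++recordsSinceCheck % 10_000 == 0) {
      checkLocalDiskUsage();
    }
    context.write(value, key);
  }

  private void checkLocalDiskUsage() throws IOException {
    List<FileSystem.Statistics> all = FileSystem.getAllStatistics();
    for (FileSystem.Statistics stats : all) {
      if ("file".equals(stats.getScheme())
          && stats.getBytesWritten() > maxLocalBytesWritten) {
        throw new IOException("Wrote " + stats.getBytesWritten()
            + " bytes to local disk, above limit " + maxLocalBytesWritten
            + "; failing fast as a suspected rogue task");
      }
    }
  }
}
{code}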

> Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk
> 
>
> Key: YARN-4011
> URL: https://issues.apache.org/jira/browse/YARN-4011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.4.0
>Reporter: Ashwin Shankar
>
> We observed jobs failing since tasks couldn't launch on nodes due to 
> "java.io.IOException No space left on device". 
> On digging in further, we found a rogue job which filled up the disk.
> Specifically, it wrote a lot of map spills (like 
> attempt_1432082376223_461647_m_000421_0_spill_1.out) to nm-local-dir, 
> causing the disk to fill up, and it failed/got killed but didn't clean up these 
> files in nm-local-dir.
> So the disk remained full, causing subsequent jobs to fail.
> This jira is created to address why files under nm-local-dir don't get 
> cleaned up when a job fails after filling up the disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart

2015-09-21 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901897#comment-14901897
 ] 

Jian He commented on YARN-4000:
---

bq.  I think this shouldn't be a problem.
Actually, I think this will be a problem in the regular case: the application is being 
killed by the user right at RM restart. This is an existing problem though. Do you 
think so?

> RM crashes with NPE if leaf queue becomes parent queue during restart
> -
>
> Key: YARN-4000
> URL: https://issues.apache.org/jira/browse/YARN-4000
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-4000.01.patch, YARN-4000.02.patch, 
> YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch
>
>
> This is a similar situation to YARN-2308.  If an application is active in 
> queue A and then the RM restarts with a changed capacity scheduler 
> configuration where queue A becomes a parent queue to other subqueues then 
> the RM will crash with a NullPointerException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4165) An outstanding container request makes all nodes to be reserved causing all jobs pending

2015-09-21 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901869#comment-14901869
 ] 

Weiwei Yang commented on YARN-4165:
---

Hi Jason 

We are using the capacity scheduler, and the problem can be described as follows: we have 2 
nodes. If there is an outstanding container request for APP1, 
both nodes are reserved for the application. The RM log looks like:

2015-09-21 20:39:07,990 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:allocateContainersToNode(1240)) - Skipping scheduling 
since node :45454 is reserved by application 
appattempt_1442889801665_0001_01
2015-09-21 20:40:10,990 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:allocateContainersToNode(1240)) - Skipping scheduling 
since node :45454 is reserved by application 
appattempt_1442889801665_0001_01

Then, when I submit a new job APP2, its app master cannot be allocated because 
all nodes are reserved.

> An outstanding container request makes all nodes to be reserved causing all 
> jobs pending
> 
>
> Key: YARN-4165
> URL: https://issues.apache.org/jira/browse/YARN-4165
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, scheduler
>Affects Versions: 2.7.1
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>
> We have a long-running service in YARN with an outstanding container 
> request that YARN cannot satisfy (it requires more memory than the nodemanager can 
> supply). YARN then reserves all nodes for this application; when I submit 
> other jobs (which require relatively little memory, which the nodemanager can supply), all 
> jobs are pending because YARN skips scheduling containers on the nodes that 
> have been reserved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901932#comment-14901932
 ] 

Hudson commented on YARN-4188:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2339 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2339/])
YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static 
(cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java


> MoveApplicationAcrossQueuesResponse should be an abstract class
> ---
>
> Key: YARN-4188
> URL: https://issues.apache.org/jira/browse/YARN-4188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-4188.v0.patch
>
>
> MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally, 
> its {{newInstance}} method should be static. Currently we are not facing 
> any issues because the response is an empty object on success.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901930#comment-14901930
 ] 

Hudson commented on YARN-4188:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #401 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/401/])
YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static 
(cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java


> MoveApplicationAcrossQueuesResponse should be an abstract class
> ---
>
> Key: YARN-4188
> URL: https://issues.apache.org/jira/browse/YARN-4188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-4188.v0.patch
>
>
> MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally, 
> its {{newInstance}} method should be static. Currently we are not facing 
> any issues because the response is an empty object on success.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901880#comment-14901880
 ] 

Hudson commented on YARN-4188:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2366 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2366/])
YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static 
(cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java


> MoveApplicationAcrossQueuesResponse should be an abstract class
> ---
>
> Key: YARN-4188
> URL: https://issues.apache.org/jira/browse/YARN-4188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-4188.v0.patch
>
>
> MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally 
> the new instance should have a static modifier. Currently we are not facing 
> any issues because the response is empty object on success. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4152) NM crash with NPE when LogAggregationService#stopContainer called for absent container

2015-09-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900485#comment-14900485
 ] 

Bibin A Chundatt commented on YARN-4152:


Looks like the issue exists only for {{LogAggregationService}}. 
In {{ContainerEventDispatcher}} it is handled.

> NM crash with NPE when LogAggregationService#stopContainer called for absent 
> container
> --
>
> Key: YARN-4152
> URL: https://issues.apache.org/jira/browse/YARN-4152
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4152.patch, 0002-YARN-4152.patch, 
> 0003-YARN-4152.patch
>
>
> NM crash during log aggregation.
> Ran a Pi job with 500 containers and killed the application in between.
> *Logs*
> {code}
> 2015-09-12 18:44:25,597 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_e51_1442063466801_0001_01_99 is : 143
> 2015-09-12 18:44:25,670 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101
> 2015-09-12 18:44:25,670 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_e51_1442063466801_0001_01_000101 from application 
> application_1442063466801_0001
> 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_STOP for appId application_1442063466801_0001
> 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Exiting, bbye..
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
> OPERATION=Container Finished - SucceededTARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1442063466801_0001
> CONTAINERID=container_e51_1442063466801_0001_01_000100
> {code}
> *Analysis*
> Looks like {{stopContainer}} is called even for an absent container:
> {code}
>   case CONTAINER_FINISHED:
> LogHandlerContainerFinishedEvent containerFinishEvent =
> (LogHandlerContainerFinishedEvent) event;
> stopContainer(containerFinishEvent.getContainerId(),
> containerFinishEvent.getExitCode());
> break;
> {code}
> *Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101*
> Should skip when {{null==context.getContainers().get(containerId)}} 
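
A minimal sketch of the skip suggested above; the container map stands in for the NM context and the types are simplified, so this is illustrative rather than the actual patch:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StopContainerGuardSketch {

  private final Map<String, Object> runningContainers = new ConcurrentHashMap<>();

  void stopContainer(String containerId, int exitCode) {
    // Guard: the finished event can arrive for a container the NM no longer
    // tracks (e.g. KILL_CONTAINER sent to an absent container); skip instead
    // of dereferencing a null entry.
    Object container = runningContainers.get(containerId);
    if (container == null) {
      System.out.println("Ignoring CONTAINER_FINISHED for absent container "
          + containerId);
      return;
    }
    // ... proceed with per-container log aggregation bookkeeping ...
  }
}
{code}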



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900524#comment-14900524
 ] 

Bibin A Chundatt commented on YARN-4140:


They are related. Will check how to update the testcases for that.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5–10 min for 500 containers. 
> Of 3 machines in total, 2 machines were in the same partition and the app was submitted to the same.
> After enabling debug I was able to find the below:
> # From the AM the container ask is for OFF-SWITCH.
> # The RM is allocating all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, the time taken was about 6 minutes 
> to allocate the 1st map after AM allocation.
> # Tested with about 1K maps using the PI job; it took 17 minutes to allocate the next 
> container after AM allocation.
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done on OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
> "root.b.b1" | wc -l
> 500

[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900705#comment-14900705
 ] 

Bibin A Chundatt commented on YARN-4176:


The checkstyle warning is due to the number of lines:
{noformat}
File length is 2,146 lines (max allowed is 2,000).
{noformat}
I feel it can be skipped, as the number of lines was already greater than 2K.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch, 0004-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues:
> # For distributed nodelabels, after the NM has registered with the RM, if cluster nodelabels 
> are removed and added, then the NM doesn't resend labels in the heartbeat again until 
> there is any change in labels.
> # If NM registration with nodelabels failed, the NM should resend the labels to the RM again.
> The above cases can be handled by resyncing nodelabels with the RM every x interval:
> # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}}; 
> the NM will resend nodelabels to the RM based on this config regardless of whether 
> registration fails or succeeds.
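
As an illustration of the proposal (not the attached patch), a sketch of a periodic resync on the NM side driven by the new property; everything except the property name and Configuration is a simplified placeholder, and the default interval is an assumption for the sketch:
{code}
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class NodeLabelsResyncSketch {

  // Placeholder for the NM -> RM label report path.
  interface LabelReporter {
    void sendLabels(Set<String> labels);
  }

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start(Configuration conf, LabelReporter reporter,
      Set<String> currentLabels) {
    long interval = conf.getLong(
        "yarn.nodemanager.node-labels.provider.resync-interval-ms",
        2 * 60 * 1000L);   // default chosen only for the sketch
    // Resend the current labels every interval, regardless of whether the
    // previous registration or heartbeat report was accepted.
    scheduler.scheduleWithFixedDelay(
        () -> reporter.sendLabels(currentLabels),
        interval, interval, TimeUnit.MILLISECONDS);
  }
}
{code}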



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType

2015-09-21 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900613#comment-14900613
 ] 

Sunil G commented on YARN-4143:
---

Yes [~adhoot]. I am also not finding any way other than this event 
handling, because RMAppAttempt needs to pass such information to the schedulers 
(common code), and either an event or an API is the only clean way here.

I do not have any objection to the existing approach in the patch. 
I just thought of bringing up all possible options here and weighing the best.

> Optimize the check for AMContainer allocation needed by blacklisting and 
> ContainerType
> --
>
> Key: YARN-4143
> URL: https://issues.apache.org/jira/browse/YARN-4143
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4143.001.patch
>
>
> In YARN-2005 there are checks made to determine if the allocation is for an 
> AM container. This happens in every allocate call and should be optimized 
> away since it changes only once per SchedulerApplicationAttempt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900498#comment-14900498
 ] 

Bibin A Chundatt commented on YARN-4140:


Will recheck the {{TestNodeLabelContainerAllocation}} failures.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5–10 min for 500 containers. 
> Of 3 machines in total, 2 machines were in the same partition and the app was submitted to the same.
> After enabling debug I was able to find the below:
> # From the AM the container ask is for OFF-SWITCH.
> # The RM is allocating all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, the time taken was about 6 minutes 
> to allocate the 1st map after AM allocation.
> # Tested with about 1K maps using the PI job; it took 17 minutes to allocate the next 
> container after AM allocation.
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done on OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
> "root.b.b1" | wc -l
> 500
> {code}

[jira] [Commented] (YARN-4191) Expose ApplicationMaster RPC port in ResourceManager REST endpoint

2015-09-21 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900519#comment-14900519
 ] 

Steve Loughran commented on YARN-4191:
--

Do you mean the REST API serializing the Application Report isn't including the 
RPC URL?

Or that if an app chooses to register a REST endpoint as the port in the 
application report, the RM isn't redirecting to it?

The RM has bigger issues with REST, namely that it assumes there's a user and a 
browser at the far end (YARN-2084), not an application sending PUT requests 
and expecting machine-parseable status codes and error text.

> Expose ApplicationMaster RPC port in ResourceManager REST endpoint
> --
>
> Key: YARN-4191
> URL: https://issues.apache.org/jira/browse/YARN-4191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Richard Lee
>Priority: Minor
>
> Currently, the ResourceManager REST endpoint returns only the trackingUrl for 
> the ApplicationMaster.  Some AMs, however, have their REST endpoints on the 
> RPC port, yet the RM does not expose the AM RPC port via REST for some 
> reason.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs

2015-09-21 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901537#comment-14901537
 ] 

Li Lu commented on YARN-4075:
-

Hi [~varun_saxena]! Thanks for the work and sorry for the delayed reply. I 
looked at your POC.2 patch and here are some comments:

- getFlows (/flows/{clusterId}): Maybe we'd like to return the "default" 
cluster, or the cluster the reader runs on (or that a reader farm is associated 
with), if the given clusterId is empty?
- In TestTimelineReaderWebServicesFlowRun#testGetFlowRun, why do we check 
equality by calling toString and comparing the two strings? I think we need a "deep 
comparison" method for timeline metrics for this case, so maybe you'd like to 
add this method and use it in testGetFlowRun? 
- The following logic:
{code}
+  callerUGI != null && (userId == null || userId.isEmpty()) ?
+  callerUGI.getUserName().trim() : parseStr(userId)
{code}
is common enough in TimelineReaderWebServices. Since the logic is not quite 
trivial, maybe we'd like to put it in a standalone private method (see the 
sketch after this list)? 
- I just noticed that we're returning Set<TimelineEntity> rather than 
TimelineEntities in the timeline reader. This is not consistent with the timeline 
writer (which uses TimelineEntities). It doesn't hurt much to have one more 
level of indirection, so maybe we'd like to change the readers to return 
TimelineEntities? In this way the reader and the writer will have the same 
behavior on this. 
- Any special reasons to refactor TestHBaseTimelineStorage? 
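
As an illustration of the standalone-helper suggestion above, here is a minimal 
sketch; the class and method names are hypothetical, and the parseStr stub only 
stands in for the existing helper:
{code}
import org.apache.hadoop.security.UserGroupInformation;

// Sketch only, not the actual TimelineReaderWebServices code.
public class UserIdResolver {
  // Stand-in for the existing string-trimming helper.
  static String parseStr(String s) {
    return s == null ? null : s.trim();
  }

  static String resolveUserId(UserGroupInformation callerUGI, String userId) {
    // No explicit userId requested: fall back to the caller's own user name.
    if (callerUGI != null && (userId == null || userId.isEmpty())) {
      return callerUGI.getUserName().trim();
    }
    return parseStr(userId);
  }
}
{code}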

Since we're merging YARN-4074 soon, I have not checked whether this patch applies 
to the latest YARN-2928 branch. We need to make sure of that after you refresh 
your patch. 

> [reader REST API] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4075
> URL: https://issues.apache.org/jira/browse/YARN-4075
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
> Attachments: YARN-4075-YARN-2928.POC.1.patch, 
> YARN-4075-YARN-2928.POC.2.patch
>
>
> We need to be able to query for flows and flow runs via REST.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4189) Capacity Scheduler : Improve location preference waiting mechanism

2015-09-21 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901736#comment-14901736
 ] 

Xianyin Xin commented on YARN-4189:
---

[~leftnoteasy], convincing analysis. It's fine if X << Y and X is close to the 
heartbeat interval; so, should we limit X to prevent users from setting it freely?

> Capacity Scheduler : Improve location preference waiting mechanism
> --
>
> Key: YARN-4189
> URL: https://issues.apache.org/jira/browse/YARN-4189
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4189 design v1.pdf
>
>
> There're some issues with current Capacity Scheduler implementation of delay 
> scheduling:
> *1) Waiting time to allocate each container highly depends on cluster 
> availability*
> Currently, app can only increase missed-opportunity when a node has available 
> resource AND it gets traversed by a scheduler. There’re lots of possibilities 
> that an app doesn’t get traversed by a scheduler, for example:
> A cluster has 2 racks (rack1/2), each rack has 40 nodes. 
> Node-locality-delay=40. An application prefers rack1. 
> Node-heartbeat-interval=1s.
> Assume there are 2 nodes available on rack1, delay to allocate one container 
> = 40 sec.
> If there are 20 nodes available on rack1, delay of allocating one container = 
> 2 sec.
> *2) It could violate scheduling policies (Fifo/Priority/Fair)*
> Assume a cluster is highly utilized; an app (app1) has higher priority and 
> wants locality, and another app (app2) has lower priority but doesn't care 
> about locality. When a node heartbeats with available resource, app1 decides 
> to wait, so app2 gets the available slot. This should be considered a bug 
> that we need to fix.
> The same problem could happen when we use FIFO/Fair queue policies.
> Another similar problem is related to preemption: the preemption policy 
> preempts some resources from queue-A for queue-B (queue-A is over-satisfied 
> and queue-B is under-satisfied), but queue-B is waiting for the 
> node-locality-delay, so queue-A gets the resources back. In the next round, 
> the preemption policy could preempt these resources from queue-A again.
> This JIRA targets solving these problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901828#comment-14901828
 ] 

Hudson commented on YARN-4188:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #428 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/428/])
YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static 
(cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java
* hadoop-yarn-project/CHANGES.txt


> MoveApplicationAcrossQueuesResponse should be an abstract class
> ---
>
> Key: YARN-4188
> URL: https://issues.apache.org/jira/browse/YARN-4188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-4188.v0.patch
>
>
> MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally, 
> the newInstance method should have a static modifier. Currently we are not facing 
> any issues because the response is an empty object on success. 
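
For reference, a minimal sketch of the pattern being requested (an abstract 
response record with a static newInstance factory); this is an illustration, 
not necessarily the exact committed code:
{code}
import org.apache.hadoop.yarn.util.Records;

public abstract class MoveApplicationAcrossQueuesResponse {
  public static MoveApplicationAcrossQueuesResponse newInstance() {
    // Records.newRecord returns the PB-backed implementation of the record.
    return Records.newRecord(MoveApplicationAcrossQueuesResponse.class);
  }
}
{code}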



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901706#comment-14901706
 ] 

Hudson commented on YARN-4188:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8496 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8496/])
YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static 
(cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java
* hadoop-yarn-project/CHANGES.txt


> MoveApplicationAcrossQueuesResponse should be an abstract class
> ---
>
> Key: YARN-4188
> URL: https://issues.apache.org/jira/browse/YARN-4188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-4188.v0.patch
>
>
> MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally, 
> the newInstance method should have a static modifier. Currently we are not facing 
> any issues because the response is an empty object on success. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901749#comment-14901749
 ] 

Hudson commented on YARN-4188:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #420 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/420/])
YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static 
(cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java
* hadoop-yarn-project/CHANGES.txt


> MoveApplicationAcrossQueuesResponse should be an abstract class
> ---
>
> Key: YARN-4188
> URL: https://issues.apache.org/jira/browse/YARN-4188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-4188.v0.patch
>
>
> MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally, 
> the newInstance method should have a static modifier. Currently we are not facing 
> any issues because the response is an empty object on success. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side

2015-09-21 Thread Dian Fu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901718#comment-14901718
 ] 

Dian Fu commented on YARN-3964:
---

Thanks [~leftnoteasy] for your detailed review. Makes sense to me; I will 
update the patch to incorporate your comments ASAP. 

> Support NodeLabelsProvider at Resource Manager side
> ---
>
> Key: YARN-3964
> URL: https://issues.apache.org/jira/browse/YARN-3964
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Dian Fu
>Assignee: Dian Fu
> Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, 
> YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, 
> YARN-3964.006.patch, YARN-3964.1.patch
>
>
> Currently, a CLI/REST API is provided in the Resource Manager to allow users to 
> specify labels for nodes. For labels which may change over time, users will 
> have to start a cron job to update the labels. This has the following 
> limitations:
> - The cron job needs to be run as the YARN admin user.
> - This makes it a little complicated to maintain, as users will have to make 
> sure this service/daemon is alive.
> Adding a Node Labels Provider in the Resource Manager will give users more 
> flexibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901501#comment-14901501
 ] 

Hudson commented on YARN-4113:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2338 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2338/])
YARN-4113. RM should respect retry-interval when uses 
RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev 
b00392dd9cbb6778f2f3e669e96cf7133590dfe7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt


> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4113.patch
>
>
> Found one issue in how RMProxy initializes the RetryPolicy, in 
> RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), 
> it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
> properly set up localhost name, it wrote 14G of DEBUG exception messages to the 
> system before it died. This will be very bad if we do the same thing in a 
> production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take the retry-interval as a constructor parameter.
> - Respect the retry-interval when we use the RETRY_FOREVER policy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart

2015-09-21 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901892#comment-14901892
 ] 

Jian He commented on YARN-4000:
---

bq. In recoverContainersOnNode, we check if application is present in the 
scheduler or not, which will not be there.
Ah, right, I missed this part. Thanks for pointing this out.
bq. we consider them as orphan containers and in the next HB from NM, report 
these containers as the ones to be cleaned up by NM.
Is this the case? I think in the current code, the RM is still ignoring these 
orphan containers?

> RM crashes with NPE if leaf queue becomes parent queue during restart
> -
>
> Key: YARN-4000
> URL: https://issues.apache.org/jira/browse/YARN-4000
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-4000.01.patch, YARN-4000.02.patch, 
> YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch
>
>
> This is a similar situation to YARN-2308.  If an application is active in 
> queue A and then the RM restarts with a changed capacity scheduler 
> configuration where queue A becomes a parent queue to other subqueues then 
> the RM will crash with a NullPointerException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4191) Expose ApplicationMaster RPC port in ResourceManager REST endpoint

2015-09-21 Thread Richard Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900749#comment-14900749
 ] 

Richard Lee commented on YARN-4191:
---

AFAICT, looking at the YARN source code, the RM doesn't actually serialize the 
ApplicationReport. It gathers a lot of similar information about the 
ApplicationMaster and returns it on the REST /apps endpoint. One thing that 
seems to be missing is the RPC port, though.  In particular, I'm interested in 
working with the Samza Application Master. It has both a trackingUrl port and 
an RPC port.  The REST interface is on the RPC port at / (with, oddly, no version 
path or anything, which seems like not the best practice).  Compare this to the 
MapReduce ApplicationMaster, where the REST API is on the same port as the 
trackingUrl at /ws/v1/mapreduce. 

I was not aware of the other RM REST issues. However, at present, I've only 
been doing GET requests to retrieve information about the running cluster, and 
not yet trying to control it.

> Expose ApplicationMaster RPC port in ResourceManager REST endpoint
> --
>
> Key: YARN-4191
> URL: https://issues.apache.org/jira/browse/YARN-4191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Richard Lee
>Priority: Minor
>
> Currently, the ResourceManager REST endpoint returns only the trackingUrl for 
> the ApplicationMaster.  Some AMs, however, have their REST endpoints on the 
> RPC port, yet the RM does not expose the AM RPC port via REST for some 
> reason.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3224) Notify AM with containers (on decommissioning node) could be preempted after timeout.

2015-09-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900902#comment-14900902
 ] 

Hadoop QA commented on YARN-3224:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 51s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  1s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m 53s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  11m 28s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 29s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 52s | The applied patch generated  3 
new checkstyle issues (total was 188, now 191). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 47s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 39s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 42s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  62m  0s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | | 104m 46s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12761436/0002-YARN-3224.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c9cb6a5 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9229/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9229/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9229/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9229/console |


This message was automatically generated.

> Notify AM with containers (on decommissioning node) could be preempted after 
> timeout.
> -
>
> Key: YARN-3224
> URL: https://issues.apache.org/jira/browse/YARN-3224
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Junping Du
>Assignee: Sunil G
> Attachments: 0001-YARN-3224.patch, 0002-YARN-3224.patch
>
>
> We should leverage YARN preemption framework to notify AM that some 
> containers will be preempted after a timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs

2015-09-21 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901019#comment-14901019
 ] 

Vrushali C commented on YARN-4074:
--

Thanks everyone for the review, I will commit this patch today. 

> [timeline reader] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4074
> URL: https://issues.apache.org/jira/browse/YARN-4074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4074-YARN-2928.007.patch, 
> YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, 
> YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, 
> YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, 
> YARN-4074-YARN-2928.POC.006.patch
>
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as 
> implementation of the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class

2015-09-21 Thread Giovanni Matteo Fumarola (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola updated YARN-4188:
---
Attachment: YARN-4188.v0.patch

No test needed

> MoveApplicationAcrossQueuesResponse should be an abstract class
> ---
>
> Key: YARN-4188
> URL: https://issues.apache.org/jira/browse/YARN-4188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Attachments: YARN-4188.v0.patch
>
>
> MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally, 
> the newInstance method should have a static modifier. Currently we are not facing 
> any issues because the response is an empty object on success. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs

2015-09-21 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901067#comment-14901067
 ] 

Li Lu commented on YARN-4074:
-

Hi [~sjlee0] [~vrushalic], thanks for the work and sorry I could not get back 
earlier. Overall the patch LGTM. I like the refactor here and it's almost a 
must to put it in soon. One nit on naming and code organization: we're 
putting all derived readers in the storage package, but inevitably associating 
them with our (specific) HBase storage. If it's quick and easy, maybe we can 
put them in a package inside storage? If I'm missing anything here and it's 
hard, let's proceed with this patch. Your call. 

> [timeline reader] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4074
> URL: https://issues.apache.org/jira/browse/YARN-4074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4074-YARN-2928.007.patch, 
> YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, 
> YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, 
> YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, 
> YARN-4074-YARN-2928.POC.006.patch
>
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as 
> implementation of the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4189) Capacity Scheduler : Improve location preference waiting mechanism

2015-09-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901075#comment-14901075
 ] 

Wangda Tan commented on YARN-4189:
--

[~xinxianyin],

Thanks for looking at the doc; however, I think the approach in the doc 
shouldn't reduce utilization:

Assume we limit the maximum waiting time for each container to X sec, and the 
average container execution time is Y sec. It will be fine if X << Y.

In my mind, X is a value close to the node heartbeat interval and Y is from 
minutes to hours.

I don't have any data to prove whether my assumption is true; we need to do some 
benchmark tests before using it in practice.
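
For intuition, a rough back-of-the-envelope bound; the numbers below are purely 
illustrative assumptions, not values from the design doc:
{code}
// Rough bound on per-slot utilization loss from waiting for locality.
public class DelayBound {
  public static void main(String[] args) {
    double x = 3.0;    // X: max waiting time per container (sec), ~heartbeat interval
    double y = 600.0;  // Y: average container execution time (sec)
    double idleFraction = x / (x + y);
    // With X << Y the worst-case idle fraction stays tiny (~0.5% here).
    System.out.printf("worst-case idle fraction ~ %.2f%%%n", idleFraction * 100);
  }
}
{code}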

> Capacity Scheduler : Improve location preference waiting mechanism
> --
>
> Key: YARN-4189
> URL: https://issues.apache.org/jira/browse/YARN-4189
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4189 design v1.pdf
>
>
> There're some issues with current Capacity Scheduler implementation of delay 
> scheduling:
> *1) Waiting time to allocate each container highly depends on cluster 
> availability*
> Currently, app can only increase missed-opportunity when a node has available 
> resource AND it gets traversed by a scheduler. There’re lots of possibilities 
> that an app doesn’t get traversed by a scheduler, for example:
> A cluster has 2 racks (rack1/2), each rack has 40 nodes. 
> Node-locality-delay=40. An application prefers rack1. 
> Node-heartbeat-interval=1s.
> Assume there are 2 nodes available on rack1, delay to allocate one container 
> = 40 sec.
> If there are 20 nodes available on rack1, delay of allocating one container = 
> 2 sec.
> *2) It could violate scheduling policies (Fifo/Priority/Fair)*
> Assume a cluster is highly utilized; an app (app1) has higher priority and 
> wants locality, and another app (app2) has lower priority but doesn't care 
> about locality. When a node heartbeats with available resource, app1 decides 
> to wait, so app2 gets the available slot. This should be considered a bug 
> that we need to fix.
> The same problem could happen when we use FIFO/Fair queue policies.
> Another similar problem is related to preemption: the preemption policy 
> preempts some resources from queue-A for queue-B (queue-A is over-satisfied 
> and queue-B is under-satisfied), but queue-B is waiting for the 
> node-locality-delay, so queue-A gets the resources back. In the next round, 
> the preemption policy could preempt these resources from queue-A again.
> This JIRA targets solving these problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4178) [storage implementation] app id as string can cause incorrect ordering

2015-09-21 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901057#comment-14901057
 ] 

Varun Saxena commented on YARN-4178:


[~jrottinghuis],
No, I did not mean that we can use ApplicationId#toString to create a string 
which can be stored in the row key, if that is what you meant. The app id is 
already in that format.

What I was suggesting was that on the write path, we can store only the cluster 
timestamp and sequence number (12 bytes - one long and one int) in the row key 
and skip storing the "application_" part. Storing them as a long and an int, or 
as 2 longs, would ensure correct ordering (although ascending). So, as you said 
above, Long.MAX_VALUE - X should be used for ensuring descending order.
I was talking about ApplicationId#toString in the context of the read path. On 
the read path we can read these 12 bytes from the row key and call 
ApplicationId#newInstance and ApplicationId#toString to turn the timestamp and 
id back into the "application_"-prefixed app id in string format, which can then 
be sent back to the client. And if the prefix changes, ApplicationId will be 
changed as well (as it is used all over YARN).

However, your comment about storing the application_ part at the end to make the 
row key future proof makes sense. We can go with it.
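
For illustration, a rough sketch of the write/read-path idea described above. 
The exact row-key layout is an assumption (in particular, inverting the int part 
the same way as the long part), not the final format:
{code}
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Sketch only: 12-byte app id part (inverted long + inverted int) for
// descending order, rebuilt into "application_..." form on the read path.
public class AppIdRowKeyPart {
  public static byte[] encode(ApplicationId appId) {
    byte[] ts = Bytes.toBytes(Long.MAX_VALUE - appId.getClusterTimestamp());
    byte[] seq = Bytes.toBytes(Integer.MAX_VALUE - appId.getId());
    return Bytes.add(ts, seq);
  }

  public static String decode(byte[] bytes) {
    long ts = Long.MAX_VALUE - Bytes.toLong(bytes, 0);
    int id = Integer.MAX_VALUE - Bytes.toInt(bytes, Bytes.SIZEOF_LONG);
    return ApplicationId.newInstance(ts, id).toString();
  }
}
{code}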

> [storage implementation] app id as string can cause incorrect ordering
> --
>
> Key: YARN-4178
> URL: https://issues.apache.org/jira/browse/YARN-4178
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
>
> Currently the app id is used in various places as part of row keys and in 
> column names. However, they are treated as strings for the most part. This 
> will cause a problem with ordering when the id portion of the app id rolls 
> over to the next digit.
> For example, "app_1234567890_100" will be considered *earlier* than 
> "app_1234567890_99". We should correct this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled

2015-09-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901047#comment-14901047
 ] 

Hadoop QA commented on YARN-3975:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 56s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m 19s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  11m 17s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 26s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 52s | The applied patch generated  2 
new checkstyle issues (total was 16, now 18). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 44s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 38s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 52s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   6m 59s | Tests failed in 
hadoop-yarn-client. |
| {color:red}-1{color} | yarn tests |   0m 24s | Tests failed in 
hadoop-yarn-server-web-proxy. |
| | |  49m 32s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.client.TestRMFailover |
|   | hadoop.yarn.server.webproxy.TestWebAppProxyServlet |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12761457/YARN-3975.8.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c9cb6a5 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9230/artifact/patchprocess/diffcheckstylehadoop-yarn-server-web-proxy.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9230/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-server-web-proxy test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9230/artifact/patchprocess/testrun_hadoop-yarn-server-web-proxy.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9230/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9230/console |


This message was automatically generated.

> WebAppProxyServlet should not redirect to RM page if AHS is enabled
> ---
>
> Key: YARN-3975
> URL: https://issues.apache.org/jira/browse/YARN-3975
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, 
> YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, 
> YARN-3975.8.patch
>
>
> WebAppProxyServlet should be updated to handle the case when the app report 
> doesn't have a tracking URL and the Application History Server is enabled.
> As we would have already tried the RM and got the 
> ApplicationNotFoundException, we should not direct the user to the RM app page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4059) Preemption should delay assignments back to the preempted queue

2015-09-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901078#comment-14901078
 ] 

Wangda Tan commented on YARN-4059:
--

Finished design doc of improving delay scheduling mechanism and uploaded it to 
YARN-4189.

> Preemption should delay assignments back to the preempted queue
> ---
>
> Key: YARN-4059
> URL: https://issues.apache.org/jira/browse/YARN-4059
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> The worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per heartbeat, so we need to wait for the entire cluster to 
> heartbeat in, multiplied by the number of containers that could run on a single 
> node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2015-09-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901099#comment-14901099
 ] 

zhihai xu commented on YARN-4095:
-

The first patch put {{NM_GOOD_LOCAL_DIRS}} and {{NM_GOOD_LOG_DIRS}} in 
YarnConfiguration.java, the second patch moved them to 
LocalDirsHandlerService.java, since they are only used inside 
{{LocalDirsHandlerService}}.

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between 
> ShuffleHandler and LocalDirsHandlerService.
> -
>
> Key: YARN-4095
> URL: https://issues.apache.org/jira/browse/YARN-4095
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share the 
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for the configuration 
> {{NM_LOCAL_DIRS}}, because {{AllocatorPerContext}} objects are stored in a static 
> TreeMap with the configuration name as the key:
> {code}
>   private static Map<String, AllocatorPerContext> contexts = 
>  new TreeMap<String, AllocatorPerContext>();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even though they don't use the same 
> {{Configuration}} object, they will use the same {{AllocatorPerContext}} 
> object. Also, {{LocalDirsHandlerService}} may change the {{NM_LOCAL_DIRS}} value 
> in its {{Configuration}} object to exclude full and bad local dirs, while 
> {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its 
> {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} 
> is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, 
> {{AllocatorPerContext}} needs to be reinitialized because the {{NM_LOCAL_DIRS}} 
> value has changed. This will cause some overhead.
> {code}
>   String newLocalDirs = conf.get(contextCfgItemName);
>   if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it will be a good improvement to not share the same 
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and 
> {{LocalDirsHandlerService}}.
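
For illustration, a small self-contained sketch of the sharing described above 
(the directory paths and file names are made up; the reinitialization happens 
inside {{AllocatorPerContext#confChanged}} and is not visible in the output):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.LocalDirAllocator;

// Two allocators keyed by the same config name end up sharing one
// AllocatorPerContext, even though their Configuration objects differ.
public class SharedContextSketch {
  public static void main(String[] args) throws Exception {
    String key = "yarn.nodemanager.local-dirs";              // NM_LOCAL_DIRS
    Configuration dirsHandlerConf = new Configuration();
    dirsHandlerConf.set(key, "/tmp/nm-local-a");             // bad dirs excluded
    Configuration shuffleConf = new Configuration();
    shuffleConf.set(key, "/tmp/nm-local-a,/tmp/nm-local-b"); // original dirs

    LocalDirAllocator dirsHandlerAlloc = new LocalDirAllocator(key);
    LocalDirAllocator shuffleAlloc = new LocalDirAllocator(key);
    // Alternating calls with different NM_LOCAL_DIRS values force the shared
    // context to detect a "changed" configuration and reinitialize each time.
    dirsHandlerAlloc.getLocalPathForWrite("spill-1", dirsHandlerConf);
    shuffleAlloc.getLocalPathForWrite("shuffle-1", shuffleConf);
  }
}
{code}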



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4178) [storage implementation] app id as string can cause incorrect ordering

2015-09-21 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900927#comment-14900927
 ] 

Joep Rottinghuis commented on YARN-4178:


[~varun_saxena] if you mean o.a.h.yarn.api.records.ApplicationId then no, that 
will _not_ do.
Its toString is defined as
{code}
return appIdStrPrefix + this.getClusterTimestamp() + "_"
+ appIdFormat.get().format(getId());
{code}
The appIdFormat uses a minimum of 4 digits: fmt.setMinimumIntegerDigits(4);
When the counter part wraps over to 10K or 100K or 1M (our clusters regularly 
run several million apps before the RM gets restarted) the sort order gets all 
wrong as per my comment in YARN-4074, which is why [~sangjin.park]

For example, lexically application_1442351767756_1 < 
application_1442351767756_
We need the applications to be ordered correctly, even at those boundaries.
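
For illustration, a tiny example of the lexical ordering going wrong (the IDs 
below are made-up values):
{code}
// Made-up app IDs; the later application sorts lexically before the earlier one.
public class LexicalOrderDemo {
  public static void main(String[] args) {
    String earlier = "application_1442351767756_9999";
    String later   = "application_1442351767756_10000";
    // '1' < '9' at the first differing character, so "later" compares as smaller.
    System.out.println(earlier.compareTo(later) > 0);  // prints true
  }
}
{code}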

In fact, I think we may have to store Long.MAX_VALUE - X for the timestamp and 
counter parts so that these will properly order in descending order for both 
the counter and the RM restart epoch part.

The fact that all application IDs are hardcoded with application_ in yarn seems 
a bit silly to me. It makes much more sense to me that applications should be 
able to indicate an application type and that those would have a different 
prefix. That way one can quickly distinguish between mapreduce apps, Tez, 
Spark, Impala, Presto, what-have-you.
This may not matter much on smaller clusters with less usage, but to make this 
an option for larger clusters with several tens of thousands of jobs per day 
this would be really handy. Hence my suggestion to keep the application_ 
part at the end of the sort, to make the key layout future proof (maybe wishful 
thinking on my part).


> [storage implementation] app id as string can cause incorrect ordering
> --
>
> Key: YARN-4178
> URL: https://issues.apache.org/jira/browse/YARN-4178
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
>
> Currently the app id is used in various places as part of row keys and in 
> column names. However, they are treated as strings for the most part. This 
> will cause a problem with ordering when the id portion of the app id rolls 
> over to the next digit.
> For example, "app_1234567890_100" will be considered *earlier* than 
> "app_1234567890_99". We should correct this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled

2015-09-21 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-3975:

Attachment: YARN-3975.8.patch

> WebAppProxyServlet should not redirect to RM page if AHS is enabled
> ---
>
> Key: YARN-3975
> URL: https://issues.apache.org/jira/browse/YARN-3975
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, 
> YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, 
> YARN-3975.8.patch
>
>
> WebAppProxyServlet should be updated to handle the case when the app report 
> doesn't have a tracking URL and the Application History Server is enabled.
> As we would have already tried the RM and got the 
> ApplicationNotFoundException, we should not direct the user to the RM app page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-21 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900752#comment-14900752
 ] 

Sunil G commented on YARN-4113:
---

Hi [~leftnoteasy],
I feel a test case is not needed as it's already covered in HADOOP-12386. Will 
this be fine?

> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4113.patch
>
>
> Found one issue in how RMProxy initializes the RetryPolicy, in 
> RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), 
> it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
> properly set up localhost name, it wrote 14G of DEBUG exception messages to the 
> system before it died. This will be very bad if we do the same thing in a 
> production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take the retry-interval as a constructor parameter.
> - Respect the retry-interval when we use the RETRY_FOREVER policy.
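
As a rough illustration of the requested behaviour, assuming the fixed-sleep 
retry-forever policy added by HADOOP-12386 (the actual wiring inside 
RMProxy#createRetryPolicy may differ):
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RetryForeverWithInterval {
  public static void main(String[] args) {
    // Illustrative value for yarn.resourcemanager.connect.retry-interval.ms.
    long retryIntervalMs = 30 * 1000L;
    // Retry forever, but sleep retryIntervalMs between attempts instead of 0 ms.
    RetryPolicy policy =
        RetryPolicies.retryForeverWithFixedSleep(retryIntervalMs, TimeUnit.MILLISECONDS);
    System.out.println(policy);
  }
}
{code}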



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4177) yarn.util.Clock should not be used to time a duration or time interval

2015-09-21 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900450#comment-14900450
 ] 

Xianyin Xin commented on YARN-4177:
---

Hi [~ste...@apache.org], thanks for your comment. I've read your post and did 
some investigations into this.
{quote}
1.Inconsistent across cores, hence non-monotonic on reads, especially reads 
likely to trigger thread suspend/resume (anything with sleep(), wait(), IO, 
accessing synchronized data under load).
{quote}
This was once a bug on some old OSs, but it seems not to be a problem on Linux newer 
than 2.6 or Windows newer than XP SP2, if I understand your comment correctly. 
See 
http://stackoverflow.com/questions/510462/is-system-nanotime-completely-useless,
 and the referenced 
https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks.
{quote}
2.Not actually monotonic.
{quote}
Can you explain in detail? As a reference, there is some discussion of 
clock_gettime, which nanoTime depends on, in 
http://stackoverflow.com/questions/4943733/is-clock-monotonic-process-or-thread-specific?rq=1,
 especially in the second answer, which has 4 upvotes.
{quote}
3.Achieving a consistency by querying heavyweight counters with possible longer 
function execution time and lower granularity than the wall clock.
That is: modern NUMA, multi-socket servers are essentially multiple computers 
wired together, and we have a term for that: distributed system
{quote}
You mean achieving a consistent time across nodes in a cluster? I think the 
monotonic time we plan to offer should be limited to node-local. It's hard to 
make it cluster wide. 
{quote}
I've known for a long time that CPU frequency could change its rate
{quote}
I remember that Linux newer than 2.6.18 takes some measures to overcome this 
problem; 
http://stackoverflow.com/questions/510462/is-system-nanotime-completely-useless#comment40382219_510940
 has a little discussion of it.
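
As a concrete sketch of the MonotonicClock idea proposed in this JIRA, assuming 
the existing yarn.util {{Clock}} interface and the monotonic time source in 
hadoop-common (this is an illustration, not an actual patch):
{code}
import org.apache.hadoop.util.Time;
import org.apache.hadoop.yarn.util.Clock;

// Minimal monotonic clock for measuring durations; not meant for wall-clock
// timestamps.
public class MonotonicClock implements Clock {
  @Override
  public long getTime() {
    // Backed by System.nanoTime() via Time.monotonicNow(), so it is not
    // affected by system clock adjustments.
    return Time.monotonicNow();
  }
}
{code}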

> yarn.util.Clock should not be used to time a duration or time interval
> --
>
> Key: YARN-4177
> URL: https://issues.apache.org/jira/browse/YARN-4177
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
> Attachments: YARN-4177.001.patch, YARN-4177.002.patch
>
>
> There are many places that use Clock to time intervals, which is dangerous as 
> commented by [~ste...@apache.org] in HADOOP-12409. Instead, we should use 
> hadoop.util.Timer#monotonicNow() to get monotonic time, or we could provide a 
> MonotonicClock in yarn.util for consistency of the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-21 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4176:
---
Attachment: 0004-YARN-4176.patch

Hi [~Naganarasimha],
Thanks for the review comments.

Attaching a patch that handles them.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch, 0004-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues:
> # For distributed nodelabels, after the NM has registered with the RM, if cluster 
> nodelabels are removed and added, the NM doesn't resend labels in the heartbeat 
> again until there is a change in labels
> # If NM registration with nodelabels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing nodeLabels with the RM every x interval
> # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} 
> so that the NM resends nodelabels to the RM based on this config, regardless of 
> whether registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901376#comment-14901376
 ] 

Hudson commented on YARN-4113:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2365 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2365/])
YARN-4113. RM should respect retry-interval when uses 
RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev 
b00392dd9cbb6778f2f3e669e96cf7133590dfe7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt


> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4113.patch
>
>
> Found one issue in how RMProxy initializes the RetryPolicy, in 
> RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), 
> it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
> properly set up localhost name, it wrote 14G of DEBUG exception messages to the 
> system before it died. This will be very bad if we do the same thing in a 
> production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take the retry-interval as a constructor parameter.
> - Respect the retry-interval when we use the RETRY_FOREVER policy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901263#comment-14901263
 ] 

Hudson commented on YARN-4113:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #419 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/419/])
YARN-4113. RM should respect retry-interval when uses 
RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev 
b00392dd9cbb6778f2f3e669e96cf7133590dfe7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java


> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4113.patch
>
>
> Found one issue in how RMProxy initializes the RetryPolicy, in 
> RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), 
> it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
> properly set up localhost name, it wrote 14G of DEBUG exception messages to the 
> system before it died. This will be very bad if we do the same thing in a 
> production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take the retry-interval as a constructor parameter.
> - Respect the retry-interval when we use the RETRY_FOREVER policy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900504#comment-14900504
 ] 

Hadoop QA commented on YARN-4176:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  19m  9s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 51s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  8s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 50s | The applied patch generated  1 
new checkstyle issues (total was 211, now 211). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 23s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 23s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   1m 58s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   7m 48s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  56m 40s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12761395/0004-YARN-4176.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c9cb6a5 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9228/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9228/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9228/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9228/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9228/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9228/console |


This message was automatically generated.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch, 0004-YARN-4176.patch
>
>
> This JIRA handles the following set of issues:
> # Distributed nodelabels: if the cluster nodelabels are removed and re-added 
> after the NM has registered with the RM, the NM doesn't resend its labels in 
> the heartbeat again until the labels change.
> # If NM registration with nodelabels fails, the NM should resend the labels 
> to the RM again.
> The above cases can be handled by resyncing the nodelabels with the RM every 
> x interval:
> # Add a property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}}; 
> the NM will resend its nodelabels to the RM at this interval regardless of 
> whether registration fails or succeeds (a minimal sketch follows below).
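
Below is a minimal plain-Java sketch of the resync idea, assuming hypothetical 
names (NodeLabelsResyncTask, getCurrentLabels, sendLabelsToRM); the real change 
would hook into the NM's NodeStatusUpdater/heartbeat path rather than a 
standalone class.

{code}
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the actual NM code: resend node labels to the RM at
// a fixed interval, regardless of whether earlier attempts failed or succeeded.
public abstract class NodeLabelsResyncTask {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  /** Read the current labels from the configured provider (script/config). */
  protected abstract Set<String> getCurrentLabels();

  /** Report the labels to the RM, e.g. as part of the next heartbeat. */
  protected abstract void sendLabelsToRM(Set<String> labels);

  /** resyncIntervalMs would come from
   *  yarn.nodemanager.node-labels.provider.resync-interval-ms. */
  public void start(final long resyncIntervalMs) {
    scheduler.scheduleAtFixedRate(new Runnable() {
      @Override
      public void run() {
        try {
          sendLabelsToRM(getCurrentLabels());
        } catch (Exception e) {
          // Swallow and try again on the next tick; a failed resync must not
          // stop future resyncs.
        }
      }
    }, resyncIntervalMs, resyncIntervalMs, TimeUnit.MILLISECONDS);
  }

  public void stop() {
    scheduler.shutdownNow();
  }
}
{code}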



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901254#comment-14901254
 ] 

Hudson commented on YARN-4113:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1159 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1159/])
YARN-4113. RM should respect retry-interval when uses 
RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev 
b00392dd9cbb6778f2f3e669e96cf7133590dfe7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java
* hadoop-yarn-project/CHANGES.txt


> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4113.patch
>
>
> Found one issue in how RMProxy initializes its RetryPolicy, in 
> RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), 
> it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
> properly set up localhost name, it wrote 14G of DEBUG exception messages to 
> the system before it died. This would be very bad if the same thing happened 
> in a production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take a retry interval as a constructor parameter.
> - Respect the retry interval when the RETRY_FOREVER policy is used.
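
For illustration only, a minimal plain-Java sketch of the second point 
(retrying forever while honoring a retry interval); RetryForever and its 
Callable-based signature are invented for this example and are not the 
committed RMProxy/ServerProxy change.

{code}
import java.util.concurrent.Callable;

// Illustrative only: retry a call forever, sleeping a fixed interval between
// attempts instead of retrying in a tight loop with a 0 ms interval.
public final class RetryForever {

  public static <T> T retryForever(Callable<T> call, long retryIntervalMs)
      throws InterruptedException {
    while (true) {
      try {
        return call.call();
      } catch (InterruptedException ie) {
        throw ie;                      // let callers stop the retry loop
      } catch (Exception e) {
        // With a 0 ms interval this loop would spin and flood the log with
        // stack traces; the configured interval keeps it throttled.
        Thread.sleep(retryIntervalMs);
      }
    }
  }

  private RetryForever() {
  }
}
{code}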



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart

2015-09-21 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901272#comment-14901272
 ] 

Varun Saxena commented on YARN-4000:


[~jianhe], I think this shouldn't be a problem. In recoverContainersOnNode, we 
check whether the application is present in the scheduler, and in this case it 
will not be there.
If so, we treat its containers as orphan containers and, in the next heartbeat 
from the NM, report them as containers to be cleaned up by the NM.
The NM then cleans them up (kills them) if they are running.
Correct me if I am wrong.
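
A rough, self-contained sketch of the flow described above; the names 
(recoverContainer, schedulerApplications, containersToCleanUp) are stand-ins 
for illustration and not the actual scheduler code.

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the "unknown application => orphan container =>
// ask the NM to kill it on the next heartbeat" flow described above.
public class ContainerRecoverySketch {
  // Applications currently known to the scheduler, keyed by application id.
  private final Map<String, Object> schedulerApplications =
      new HashMap<String, Object>();
  // Containers the RM will ask the NM to clean up on its next heartbeat.
  private final Set<String> containersToCleanUp = new HashSet<String>();

  /** Called while recovering a container reported by a node after RM restart. */
  public void recoverContainer(String applicationId, String containerId) {
    if (!schedulerApplications.containsKey(applicationId)) {
      // The application is no longer known to the scheduler (e.g. its queue
      // disappeared), so the container is treated as an orphan.
      containersToCleanUp.add(containerId);
      return;
    }
    // ... normal recovery path for containers of known applications ...
  }

  /** Containers returned to the NM in the next heartbeat response for cleanup. */
  public Set<String> getContainersToCleanUp() {
    return containersToCleanUp;
  }
}
{code}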

> RM crashes with NPE if leaf queue becomes parent queue during restart
> -
>
> Key: YARN-4000
> URL: https://issues.apache.org/jira/browse/YARN-4000
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-4000.01.patch, YARN-4000.02.patch, 
> YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch
>
>
> This is a similar situation to YARN-2308.  If an application is active in 
> queue A and the RM then restarts with a changed capacity scheduler 
> configuration where queue A becomes a parent queue to other subqueues, the RM 
> will crash with a NullPointerException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs

2015-09-21 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901299#comment-14901299
 ] 

Vrushali C commented on YARN-4074:
--

Hi [~gtCarrera9]

To confirm my understanding, did you mean putting all reader classes into a 
package like org.apache.hadoop.yarn.server.timelineservice.storage.reader?

There is already an org.apache.hadoop.yarn.server.timelineservice.reader 
package, but that is for the web-services-related code.
Thanks,
Vrushali

> [timeline reader] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4074
> URL: https://issues.apache.org/jira/browse/YARN-4074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4074-YARN-2928.007.patch, 
> YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, 
> YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, 
> YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, 
> YARN-4074-YARN-2928.POC.006.patch
>
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as 
> implementation of the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class

2015-09-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901133#comment-14901133
 ] 

Hadoop QA commented on YARN-4188:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  19m 13s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 54s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  11m  2s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 31s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 13s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 49s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 40s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 46s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 26s | Tests passed in 
hadoop-yarn-api. |
| | |  44m 37s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12761462/YARN-4188.v0.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c9cb6a5 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9231/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9231/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9231/console |


This message was automatically generated.

> MoveApplicationAcrossQueuesResponse should be an abstract class
> ---
>
> Key: YARN-4188
> URL: https://issues.apache.org/jira/browse/YARN-4188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Attachments: YARN-4188.v0.patch
>
>
> MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally, 
> its newInstance factory should have a static modifier. Currently we are not 
> facing any issues because the response is an empty object on success (a sketch 
> of the pattern follows below).
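
A minimal sketch of the requested pattern, assuming the comment refers to the 
usual YARN abstract-record-plus-static-newInstance convention; the concrete 
subclass here stands in for the protobuf-backed implementation and is not the 
actual patch.

{code}
// Hypothetical sketch of the abstract-class-with-static-factory pattern used
// by YARN API records; the anonymous subclass stands in for the PB-backed impl.
public abstract class MoveApplicationAcrossQueuesResponseSketch {

  // The factory is static so callers never need an instance to create one.
  public static MoveApplicationAcrossQueuesResponseSketch newInstance() {
    // In YARN this would typically be created via Records.newRecord(...).
    return new MoveApplicationAcrossQueuesResponseSketch() { };
  }

  // The response carries no fields today; it is an empty object on success.
}
{code}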



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4191) Expose ApplicationMaster RPC port in ResourceManager REST endpoint

2015-09-21 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901135#comment-14901135
 ] 

Steve Loughran commented on YARN-4191:
--

If there's no RPC port in the REST status reports then yes, it's a bug. Samza's 
use of a REST API without going via the proxy is a security risk, but it's not 
something YARN can do anything to stop.

> Expose ApplicationMaster RPC port in ResourceManager REST endpoint
> --
>
> Key: YARN-4191
> URL: https://issues.apache.org/jira/browse/YARN-4191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Richard Lee
>Priority: Minor
>
> Currently, the ResourceManager REST endpoint returns only the trackingUrl for 
> the ApplicationMaster.  Some AMs, however, have their REST endpoints on the 
> RPC port, yet the RM does not expose the AM RPC port via REST for some 
> reason.
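
To make the gap concrete, a small sketch of reading an application report from 
the RM REST endpoint; it relies only on the /ws/v1/cluster/apps/{appid} path 
and the trackingUrl field mentioned above, and the RM address and application 
id are placeholders.

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: fetch an application report from the RM REST API. The JSON contains
// trackingUrl, but today no field for the AM's RPC port.
public class RmRestClientSketch {
  public static String fetchAppReport(String rmHttpAddress, String appId)
      throws IOException {
    URL url = new URL("http://" + rmHttpAddress + "/ws/v1/cluster/apps/" + appId);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");
    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
    try {
      StringBuilder body = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line);
      }
      return body.toString();   // parse trackingUrl etc. from this JSON
    } finally {
      in.close();
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws IOException {
    // Placeholder RM address and application id.
    System.out.println(
        fetchAppReport("rm-host:8088", "application_1442063466801_0001"));
  }
}
{code}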



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901149#comment-14901149
 ] 

Hudson commented on YARN-4113:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8495 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8495/])
YARN-4113. RM should respect retry-interval when uses 
RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev 
b00392dd9cbb6778f2f3e669e96cf7133590dfe7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt


> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4113.patch
>
>
> Found one issue in how RMProxy initializes its RetryPolicy, in 
> RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), 
> it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
> properly set up localhost name, it wrote 14G of DEBUG exception messages to 
> the system before it died. This would be very bad if the same thing happened 
> in a production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take a retry interval as a constructor parameter.
> - Respect the retry interval when the RETRY_FOREVER policy is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4009) CORS support for ResourceManager REST API

2015-09-21 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901169#comment-14901169
 ] 

Hitesh Shah commented on YARN-4009:
---

Thinking more on this, a global config might be okay to start with (we already 
have a huge proliferation of configs which users do not set). If concerns are 
raised down the line, it should be easy enough to add YARN- and HDFS-specific 
configs that override the global one in a compatible manner. [~jeagles], 
comments?



> CORS support for ResourceManager REST API
> -
>
> Key: YARN-4009
> URL: https://issues.apache.org/jira/browse/YARN-4009
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prakash Ramachandran
>Assignee: Varun Vasudev
> Attachments: YARN-4009.001.patch, YARN-4009.002.patch, 
> YARN-4009.003.patch, YARN-4009.004.patch
>
>
> Currently the REST APIs do not have CORS support. This means any UI running in 
> a browser cannot consume the REST APIs. For example, the Tez UI would like to 
> use the REST API to get the application and application-attempt information 
> exposed by the APIs.
> It would be very useful if CORS were enabled for the REST APIs (a minimal 
> filter sketch follows below).
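
For context, a bare-bones sketch of what CORS enablement means at the HTTP 
layer: a servlet filter that adds the Access-Control-* headers. Hadoop ships 
its own cross-origin filter and configuration; this standalone filter is only 
an illustration, and the wildcard origin is a placeholder.

{code}
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Illustrative CORS filter: adds the headers a browser needs before it will
// let a UI served from another origin read the REST responses.
public class SimpleCorsFilter implements Filter {

  @Override
  public void init(FilterConfig filterConfig) {
  }

  @Override
  public void doFilter(ServletRequest request, ServletResponse response,
      FilterChain chain) throws IOException, ServletException {
    HttpServletResponse httpResponse = (HttpServletResponse) response;
    // "*" is a placeholder; a real deployment would restrict the origin list.
    httpResponse.setHeader("Access-Control-Allow-Origin", "*");
    httpResponse.setHeader("Access-Control-Allow-Methods", "GET, OPTIONS, HEAD");
    httpResponse.setHeader("Access-Control-Allow-Headers", "Content-Type, Accept");
    chain.doFilter(request, response);
  }

  @Override
  public void destroy() {
  }
}
{code}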



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901190#comment-14901190
 ] 

Hudson commented on YARN-4113:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #427 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/427/])
YARN-4113. RM should respect retry-interval when uses 
RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev 
b00392dd9cbb6778f2f3e669e96cf7133590dfe7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java


> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4113.patch
>
>
> Found one issue in how RMProxy initializes its RetryPolicy, in 
> RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), 
> it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
> properly set up localhost name, it wrote 14G of DEBUG exception messages to 
> the system before it died. This would be very bad if the same thing happened 
> in a production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take a retry interval as a constructor parameter.
> - Respect the retry interval when the RETRY_FOREVER policy is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3942) Timeline store to read events from HDFS

2015-09-21 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-3942:

Attachment: YARN-3942-leveldb.001.patch

Thanks [~jlowe] for working on this! On top of the existing patch I built a new 
storage implementation that moves the in-memory hash map storage to a LevelDB 
database. The original in-memory timeline store is not meant to be used in 
production environments. The price for the new LevelDB-backed storage is 
latency: it generally takes more time to fully load the entities into LevelDB. 
I had an offline discussion with [~xgong], and it seems we need to reduce the 
granularity of caching to improve latency. We may want to address this problem 
in a separate JIRA (a minimal sketch of the LevelDB access follows below).
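
A tiny sketch of the kind of key/value access a LevelDB-backed cache involves, 
using the leveldbjni/iq80 API that other YARN stores use; the path, key layout 
("entityType!entityId") and byte[] values are invented for illustration and are 
not the layout of the attached patch.

{code}
import java.io.File;
import java.io.IOException;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

// Minimal illustration of swapping an in-memory map for a LevelDB-backed one.
// The key scheme ("entityType!entityId") and value encoding are placeholders.
public class LeveldbCacheSketch {
  private final DB db;

  public LeveldbCacheSketch(File dbPath) throws IOException {
    Options options = new Options();
    options.createIfMissing(true);
    db = JniDBFactory.factory.open(dbPath, options);
  }

  public void put(String entityType, String entityId, byte[] serializedEntity) {
    db.put(JniDBFactory.bytes(entityType + "!" + entityId), serializedEntity);
  }

  public byte[] get(String entityType, String entityId) {
    return db.get(JniDBFactory.bytes(entityType + "!" + entityId));
  }

  public void close() throws IOException {
    db.close();
  }
}
{code}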

> Timeline store to read events from HDFS
> ---
>
> Key: YARN-3942
> URL: https://issues.apache.org/jira/browse/YARN-3942
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-3942-leveldb.001.patch, YARN-3942.001.patch
>
>
> This adds a new timeline store plugin that is intended as a stop-gap measure 
> to mitigate some of the issues we've seen with ATS v1 while waiting for ATS 
> v2.  The intent of this plugin is to provide a workable solution for running 
> the Tez UI against the timeline server on large-scale clusters running many 
> thousands of jobs per day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS

2015-09-21 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901388#comment-14901388
 ] 

Li Lu commented on YARN-3942:
-

BTW, the patch applies on top of the existing YARN-3942.001.patch.

> Timeline store to read events from HDFS
> ---
>
> Key: YARN-3942
> URL: https://issues.apache.org/jira/browse/YARN-3942
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-3942-leveldb.001.patch, YARN-3942.001.patch
>
>
> This adds a new timeline store plugin that is intended as a stop-gap measure 
> to mitigate some of the issues we've seen with ATS v1 while waiting for ATS 
> v2.  The intent of this plugin is to provide a workable solution for running 
> the Tez UI against the timeline server on large-scale clusters running many 
> thousands of jobs per day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side

2015-09-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901422#comment-14901422
 ] 

Wangda Tan commented on YARN-3964:
--

[~dian.fu],

Some comments:
1) I suggest making this an explicit node label configuration type under 
{{yarn.node-labels.configuration-type}}. Currently it supports 
"centralized/distributed"; I think you could add a "delegated-centralized" (or 
a better name).
The other configurations in your patch look fine to me.

2) Some comments on the organization of Updater/Provider (see the sketch after 
this comment):
- Updater is a subclass of AbstractService, but it does not need to be 
abstract. I'm not sure what the purpose of adding an AbstractNodeLabelsUpdater 
is. The Provider will be initialized by the Updater, and the Updater will call 
the Provider's method periodically and notify RMNodeLabelsManager.
- Provider is an interface; minor comments on your patch:
** Why is a Configuration needed in the getNodeLabels method?
** Returns Set instead of Set

3) There are some methods / comments that mention "Fetcher"; could you rename 
them to "Provider"?

4) Instead of adding a new checkAndThrowIfNodeLabelsFetcherConfigured, I 
suggest reusing checkAndThrowIfDistributedNodeLabelConfEnabled: you can rename 
it to something like checkAndThrowIfNodeLabelCannotBeUpdatedManually, which 
will check {{yarn.node-labels.configuration-type}}; we only allow manually 
updating labels when type=centralized is configured.

5) You can add a method to get the RMNodeLabelsUpdater from RMContext, and 
remove it from the ResourceTrackerService constructor.

6) Could you add a test of RMNodeLabelsUpdater? It seems it can only update 
labels-on-node once for every node.

7) I think we need to make sure labels are updated *synchronously* when a node 
is registering; this avoids a node's labels being initialized only a while 
after it has registered.

8) If you agree with #7, I think the wait/notify implementation of the Updater 
could be removed; you can use a synchronized lock instead. Code using 
wait/notify has poor readability and will likely introduce bugs.

Thanks,
Wangda
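
A rough sketch of the Updater/Provider shape suggested in comment 2: a concrete 
Updater service that owns the Provider, polls it on a timer, and pushes labels 
to the manager. The interfaces and the configuration key are placeholders, not 
the actual patch.

{code}
import java.util.Set;
import java.util.Timer;
import java.util.TimerTask;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

// Hypothetical interfaces standing in for the Provider and the labels manager.
interface RMNodeLabelsProviderSketch {
  Set<String> getNodeLabels(String nodeId);
}

interface NodeLabelsSink {
  void replaceLabelsOnNode(String nodeId, Set<String> labels);
}

// Concrete (non-abstract) Updater: owns the Provider, polls it periodically,
// and notifies the labels manager with the result.
public class RMNodeLabelsUpdaterSketch extends AbstractService {
  private final RMNodeLabelsProviderSketch provider;
  private final NodeLabelsSink labelsManager;
  private final Iterable<String> nodeIds;
  private Timer timer;
  private long intervalMs;

  public RMNodeLabelsUpdaterSketch(RMNodeLabelsProviderSketch provider,
      NodeLabelsSink labelsManager, Iterable<String> nodeIds) {
    super("RMNodeLabelsUpdaterSketch");
    this.provider = provider;
    this.labelsManager = labelsManager;
    this.nodeIds = nodeIds;
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Placeholder key; the real configuration name is up to the patch.
    intervalMs = conf.getLong(
        "yarn.resourcemanager.node-labels.update-interval-ms", 60000L);
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    timer = new Timer("rm-node-labels-updater", true);
    timer.scheduleAtFixedRate(new TimerTask() {
      @Override
      public void run() {
        for (String nodeId : nodeIds) {
          labelsManager.replaceLabelsOnNode(nodeId, provider.getNodeLabels(nodeId));
        }
      }
    }, intervalMs, intervalMs);
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    if (timer != null) {
      timer.cancel();
    }
    super.serviceStop();
  }
}
{code}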

> Support NodeLabelsProvider at Resource Manager side
> ---
>
> Key: YARN-3964
> URL: https://issues.apache.org/jira/browse/YARN-3964
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Dian Fu
>Assignee: Dian Fu
> Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, 
> YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, 
> YARN-3964.006.patch, YARN-3964.1.patch
>
>
> Currently, a CLI/REST API is provided in the Resource Manager to allow users 
> to specify labels for nodes. For labels which may change over time, users will 
> have to run a cron job to update the labels. This has the following 
> limitations:
> - The cron job needs to be run as the YARN admin user.
> - It is a little complicated to maintain, as users have to make sure this 
> service/daemon stays alive.
> Adding a Node Labels Provider in the Resource Manager will give users more 
> flexibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4191) Expose ApplicationMaster RPC port in ResourceManager REST endpoint

2015-09-21 Thread Richard Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901415#comment-14901415
 ] 

Richard Lee commented on YARN-4191:
---

Is the RPC port necessarily HTTP, though? That seems like something YARN cannot 
count on and proxy for.

I still think the Samza people should put their REST endpoint under the 
trackingUrl, as MapReduce does. That way, it would require no changes to YARN 
and would go through the ResourceManager proxy.

> Expose ApplicationMaster RPC port in ResourceManager REST endpoint
> --
>
> Key: YARN-4191
> URL: https://issues.apache.org/jira/browse/YARN-4191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Richard Lee
>Priority: Minor
>
> Currently, the ResourceManager REST endpoint returns only the trackingUrl for 
> the ApplicationMaster.  Some AMs, however, have their REST endpoints on the 
> RPC port, yet the RM does not expose the AM RPC port via REST for some 
> reason.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901486#comment-14901486
 ] 

Hudson commented on YARN-4113:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #400 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/400/])
YARN-4113. RM should respect retry-interval when uses 
RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev 
b00392dd9cbb6778f2f3e669e96cf7133590dfe7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java


> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4113.patch
>
>
> Found one issue in how RMProxy initializes its RetryPolicy, in 
> RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), 
> it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
> properly set up localhost name, it wrote 14G of DEBUG exception messages to 
> the system before it died. This would be very bad if the same thing happened 
> in a production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take a retry interval as a constructor parameter.
> - Respect the retry interval when the RETRY_FOREVER policy is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4152) NM crash with NPE when LogAggregationService#stopContainer called for absent container

2015-09-21 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900399#comment-14900399
 ] 

Sunil G commented on YARN-4152:
---

Thanks [~bibinchundatt].

Yes, it seems the container was not present in the context. Since this happened 
on a CONTAINER_FINISHED event, the absent-container scenario can be handled 
with this check. It looks like this case is also handled in other events; maybe 
you could double-check and make sure similar situations are handled for the 
other events as well.
Otherwise the patch looks good to me.

> NM crash with NPE when LogAggregationService#stopContainer called for absent 
> container
> --
>
> Key: YARN-4152
> URL: https://issues.apache.org/jira/browse/YARN-4152
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4152.patch, 0002-YARN-4152.patch, 
> 0003-YARN-4152.patch
>
>
> The NM crashed during log aggregation.
> Ran a Pi job with 500 containers and killed the application in between.
> *Logs*
> {code}
> 2015-09-12 18:44:25,597 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_e51_1442063466801_0001_01_99 is : 143
> 2015-09-12 18:44:25,670 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101
> 2015-09-12 18:44:25,670 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_e51_1442063466801_0001_01_000101 from application 
> application_1442063466801_0001
> 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_STOP for appId application_1442063466801_0001
> 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Exiting, bbye..
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
> OPERATION=Container Finished - SucceededTARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1442063466801_0001
> CONTAINERID=container_e51_1442063466801_0001_01_000100
> {code}
> *Analysis*
> Looks like {{stopContainer}} is called even for an absent container:
> {code}
>   case CONTAINER_FINISHED:
> LogHandlerContainerFinishedEvent containerFinishEvent =
> (LogHandlerContainerFinishedEvent) event;
> stopContainer(containerFinishEvent.getContainerId(),
> containerFinishEvent.getExitCode());
> break;
> {code}
> *Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101*
> We should skip when {{null == context.getContainers().get(containerId)}} (a 
> minimal sketch of the guard follows below).
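
A self-contained sketch of the suggested guard; the map below stands in for 
context.getContainers(), and this only illustrates the null check, not the 
committed patch.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of the suggested guard: skip log-aggregation stop handling when
// the container id is not present in the NM context.
public class StopContainerGuardSketch {
  // Stands in for context.getContainers().
  private final Map<String, Object> containers =
      new ConcurrentHashMap<String, Object>();

  public void stopContainer(String containerId, int exitCode) {
    if (containers.get(containerId) == null) {
      // Absent container (e.g. KILL_CONTAINER delivered after removal):
      // nothing to aggregate, so return instead of dereferencing null.
      System.err.println("Skipping log aggregation stop for absent container "
          + containerId);
      return;
    }
    // ... aggregate and close logs for the known container ...
  }
}
{code}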



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)