[jira] [Updated] (YARN-3763) Support fuzzy search in ATS

2015-06-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated YARN-3763:
-
Description: Currently ATS only supports exact match. Sometimes fuzzy match 
may be helpful when the entities in ATS have some common prefix or suffix. 
Link with TEZ-2531  (was: Currently ATS only supports exact match. Sometimes 
fuzzy match may be helpful when the entities in ATS have some common prefix 
or suffix.)

> Support fuzzy search in ATS
> ---
>
> Key: YARN-3763
> URL: https://issues.apache.org/jira/browse/YARN-3763
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 2.7.0
>Reporter: Jeff Zhang
>
> Currently ATS only supports exact match. Sometimes fuzzy match may be helpful 
> when the entities in ATS have some common prefix or suffix. Link with 
> TEZ-2531





[jira] [Created] (YARN-3763) Support for fuzzy search in ATS

2015-06-02 Thread Jeff Zhang (JIRA)
Jeff Zhang created YARN-3763:


 Summary: Support for fuzzy search in ATS
 Key: YARN-3763
 URL: https://issues.apache.org/jira/browse/YARN-3763
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.7.0
Reporter: Jeff Zhang


Currently ATS only supports exact match. Sometimes fuzzy match may be helpful 
when the entities in ATS have some common prefix or suffix.
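
To make the intent concrete, here is a minimal, purely illustrative sketch of 
prefix-based matching over entity IDs; the class and method names are 
hypothetical and this is not an existing ATS API:

{code}
// Hypothetical illustration of "fuzzy" (prefix) matching over timeline entity
// IDs; not an existing ATS API.
import java.util.ArrayList;
import java.util.List;

public class PrefixEntityMatcher {
  /** Return the entity IDs that start with the given prefix. */
  public static List<String> matchByPrefix(List<String> entityIds, String prefix) {
    List<String> matches = new ArrayList<String>();
    for (String id : entityIds) {
      if (id.startsWith(prefix)) {
        matches.add(id);
      }
    }
    return matches;
  }
}
{code}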





[jira] [Updated] (YARN-3763) Support fuzzy search in ATS

2015-06-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated YARN-3763:
-
Summary: Support fuzzy search in ATS  (was: Support for fuzzy search in ATS)

> Support fuzzy search in ATS
> ---
>
> Key: YARN-3763
> URL: https://issues.apache.org/jira/browse/YARN-3763
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 2.7.0
>Reporter: Jeff Zhang
>
> Currently ATS only supports exact match. Sometimes fuzzy match may be helpful 
> when the entities in ATS have some common prefix or suffix.





[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570284#comment-14570284
 ] 

Sidharta Seethana commented on YARN-2194:
-

[~mjacobs], yes, that is what I am proposing. If we handle the path 
separation correctly, we should be able to continue using the current 
(deprecated, but still workable) mechanism for using cgroups.

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.





[jira] [Commented] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570276#comment-14570276
 ] 

Jeff Zhang commented on YARN-3755:
--

Closing it as won't fix.

> Log the command of launching containers
> ---
>
> Key: YARN-3755
> URL: https://issues.apache.org/jira/browse/YARN-3755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.7.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: YARN-3755-1.patch, YARN-3755-2.patch
>
>
> In the ResourceManager log, YARN logs the command for launching the AM, which 
> is very useful. But there is no such log in the NM log for launching 
> containers, so it is difficult to diagnose when containers fail to launch 
> due to some issue in the command. Users can look at the command in the 
> container launch script file, but that is an internal detail of YARN that 
> users usually don't know about. From the user's perspective, they only know 
> the commands they specify when building the YARN application.
> {code}
> 2015-06-01 16:06:42,245 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
> to launch container container_1433145984561_0001_01_01 : 
> $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
> -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA 
> -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
> 1>/stdout 2>/stderr
> {code}





[jira] [Commented] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570275#comment-14570275
 ] 

Jeff Zhang commented on YARN-3755:
--

bq. How about we let individual frameworks like MapReduce/Tez log them as 
needed? That seems like the right place for debugging too - app developers 
don't always get access to the daemon logs.
Makes sense.
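
As a rough sketch of that suggestion (assuming an AM that has already built a 
{{ContainerLaunchContext}}), the framework-side logging could look something 
like the following; this is illustrative, not the actual MapReduce/Tez change:

{code}
// Sketch: an AM logging the container launch command it is about to submit.
// Purely illustrative; assumes a ContainerLaunchContext built elsewhere.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

public class LaunchCommandLogger {
  private static final Log LOG = LogFactory.getLog(LaunchCommandLogger.class);

  public static void logLaunchCommand(ContainerLaunchContext ctx) {
    // getCommands() returns the command list that the NM writes into the
    // container launch script.
    StringBuilder sb = new StringBuilder();
    for (String cmd : ctx.getCommands()) {
      sb.append(cmd).append(' ');
    }
    LOG.info("Launching container with command: " + sb.toString().trim());
  }
}
{code}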

> Log the command of launching containers
> ---
>
> Key: YARN-3755
> URL: https://issues.apache.org/jira/browse/YARN-3755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.7.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: YARN-3755-1.patch, YARN-3755-2.patch
>
>
> In the ResourceManager log, YARN logs the command for launching the AM, which 
> is very useful. But there is no such log in the NM log for launching 
> containers, so it is difficult to diagnose when containers fail to launch 
> due to some issue in the command. Users can look at the command in the 
> container launch script file, but that is an internal detail of YARN that 
> users usually don't know about. From the user's perspective, they only know 
> the commands they specify when building the YARN application.
> {code}
> 2015-06-01 16:06:42,245 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
> to launch container container_1433145984561_0001_01_01 : 
> $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
> -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA 
> -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
> 1>/stdout 2>/stderr
> {code}





[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-02 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570202#comment-14570202
 ] 

Chun Chen commented on YARN-3749:
-

Thanks for reviewing the patch, [~zxu] ! 

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
> YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found the DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> during RM failover, even though I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 gets 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>     YarnConfiguration.RM_ADDRESS,
>     YarnConfiguration.DEFAULT_RM_ADDRESS,
>     server.getListenerAddress());
> {code}
> Since we use the same configuration instance for rm1 and rm2, and we init both 
> RMs before we start them, yarn.resourcemanager.ha.id is changed to rm2 during 
> the init of rm2 and is therefore still rm2 while rm1 is starting.
> So I think it is safe to make a copy of the configuration when we init each 
> RM.





[jira] [Assigned] (YARN-3558) Additional containers getting reserved from RM in case of Fair scheduler

2015-06-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-3558:
-

Assignee: Sunil G

> Additional containers getting reserved from RM in case of Fair scheduler
> 
>
> Key: YARN-3558
> URL: https://issues.apache.org/jira/browse/YARN-3558
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.7.0
> Environment: OS :Suse 11 Sp3
> Setup : 2 RM 2 NM
> Scheduler : Fair scheduler
>Reporter: Bibin A Chundatt
>Assignee: Sunil G
> Attachments: Amlog.txt, rm.log
>
>
> Submit PI job with 16 maps
> Total container expected : 16 MAPS + 1 Reduce  + 1 AM
> Total containers reserved by RM is 21
> Below set of containers are not being used for execution
> container_1430213948957_0001_01_20
> container_1430213948957_0001_01_19
> RM Containers reservation and states
> {code}
>  Processing container_1430213948957_0001_01_01 of type START
>  Processing container_1430213948957_0001_01_01 of type ACQUIRED
>  Processing container_1430213948957_0001_01_01 of type LAUNCHED
>  Processing container_1430213948957_0001_01_02 of type START
>  Processing container_1430213948957_0001_01_03 of type START
>  Processing container_1430213948957_0001_01_02 of type ACQUIRED
>  Processing container_1430213948957_0001_01_03 of type ACQUIRED
>  Processing container_1430213948957_0001_01_04 of type START
>  Processing container_1430213948957_0001_01_05 of type START
>  Processing container_1430213948957_0001_01_04 of type ACQUIRED
>  Processing container_1430213948957_0001_01_05 of type ACQUIRED
>  Processing container_1430213948957_0001_01_02 of type LAUNCHED
>  Processing container_1430213948957_0001_01_04 of type LAUNCHED
>  Processing container_1430213948957_0001_01_06 of type RESERVED
>  Processing container_1430213948957_0001_01_03 of type LAUNCHED
>  Processing container_1430213948957_0001_01_05 of type LAUNCHED
>  Processing container_1430213948957_0001_01_07 of type START
>  Processing container_1430213948957_0001_01_07 of type ACQUIRED
>  Processing container_1430213948957_0001_01_07 of type LAUNCHED
>  Processing container_1430213948957_0001_01_08 of type RESERVED
>  Processing container_1430213948957_0001_01_02 of type FINISHED
>  Processing container_1430213948957_0001_01_06 of type START
>  Processing container_1430213948957_0001_01_06 of type ACQUIRED
>  Processing container_1430213948957_0001_01_06 of type LAUNCHED
>  Processing container_1430213948957_0001_01_04 of type FINISHED
>  Processing container_1430213948957_0001_01_09 of type START
>  Processing container_1430213948957_0001_01_09 of type ACQUIRED
>  Processing container_1430213948957_0001_01_09 of type LAUNCHED
>  Processing container_1430213948957_0001_01_10 of type RESERVED
>  Processing container_1430213948957_0001_01_03 of type FINISHED
>  Processing container_1430213948957_0001_01_08 of type START
>  Processing container_1430213948957_0001_01_08 of type ACQUIRED
>  Processing container_1430213948957_0001_01_08 of type LAUNCHED
>  Processing container_1430213948957_0001_01_05 of type FINISHED
>  Processing container_1430213948957_0001_01_11 of type START
>  Processing container_1430213948957_0001_01_11 of type ACQUIRED
>  Processing container_1430213948957_0001_01_11 of type LAUNCHED
>  Processing container_1430213948957_0001_01_07 of type FINISHED
>  Processing container_1430213948957_0001_01_12 of type START
>  Processing container_1430213948957_0001_01_12 of type ACQUIRED
>  Processing container_1430213948957_0001_01_12 of type LAUNCHED
>  Processing container_1430213948957_0001_01_13 of type RESERVED
>  Processing container_1430213948957_0001_01_06 of type FINISHED
>  Processing container_1430213948957_0001_01_10 of type START
>  Processing container_1430213948957_0001_01_10 of type ACQUIRED
>  Processing container_1430213948957_0001_01_10 of type LAUNCHED
>  Processing container_1430213948957_0001_01_09 of type FINISHED
>  Processing container_1430213948957_0001_01_14 of type START
>  Processing container_1430213948957_0001_01_14 of type ACQUIRED
>  Processing container_1430213948957_0001_01_14 of type LAUNCHED
>  Processing container_1430213948957_0001_01_15 of type RESERVED
>  Processing container_1430213948957_0001_01_08 of type FINISHED
>  Processing container_1430213948957_0001_01_13 of type START
>  Processing container_1430213948957_0001_01_16 of type RESERVED
>  Processing container_1430213948957_0001_01_13 of type ACQUIRED
>  Processing container_1430213948957_0001_01_13 of type 

[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS

2015-06-02 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570171#comment-14570171
 ] 

Zhijie Shen commented on YARN-3044:
---

[~Naganarasimha], I'm fine with the last patch. Will do some local test. 
However, the patch doesn't apply because of YARN-1462. I think we need to add 
tag info for v2 publisher too. Would you mind taking care of it?

> [Event producers] Implement RM writing app lifecycle events to ATS
> --
>
> Key: YARN-3044
> URL: https://issues.apache.org/jira/browse/YARN-3044
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3044-YARN-2928.004.patch, 
> YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, 
> YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, 
> YARN-3044-YARN-2928.009.patch, YARN-3044.20150325-1.patch, 
> YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch
>
>
> Per design in YARN-2928, implement RM writing app lifecycle events to ATS.





[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-02 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570131#comment-14570131
 ] 

zhihai xu commented on YARN-3749:
-

Hi [~chenchun], thanks for updating the patch quickly.
bq. only make a copy of the configuration in initResourceManager when there are 
multiple RMs.
It is a nice optimization.

The latest patch YARN-3749.7.patch LGTM.
Also, the test failure (TestNodeLabelContainerAllocation) is not related to the 
patch.
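
For reference, a minimal sketch of that idea (make a defensive copy of the 
configuration per RM only when more than one RM is configured); the helper 
below is illustrative and not the actual MiniYARNCluster code:

{code}
// Sketch: give each RM its own Configuration copy so per-RM settings such as
// yarn.resourcemanager.ha.id cannot leak between rm1 and rm2. Illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmConfHelper {
  static Configuration confForRm(Configuration base, int numResourceManagers) {
    // Only pay for the copy when there are multiple RMs.
    return numResourceManagers > 1 ? new YarnConfiguration(base) : base;
  }
}
{code}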

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
> YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found the DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> during RM failover, even though I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 gets 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>     YarnConfiguration.RM_ADDRESS,
>     YarnConfiguration.DEFAULT_RM_ADDRESS,
>     server.getListenerAddress());
> {code}
> Since we use the same configuration instance for rm1 and rm2, and we init both 
> RMs before we start them, yarn.resourcemanager.ha.id is changed to rm2 during 
> the init of rm2 and is therefore still rm2 while rm1 is starting.
> So I think it is safe to make a copy of the configuration when we init each 
> RM.





[jira] [Commented] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570082#comment-14570082
 ] 

Hadoop QA commented on YARN-3762:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m 28s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 53s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 51s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 32s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 28s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 25s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m 17s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12737043/yarn-3762-1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c1d50a9 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8170/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8170/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8170/console |


This message was automatically generated.

> FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
> ---
>
> Key: YARN-3762
> URL: https://issues.apache.org/jira/browse/YARN-3762
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: yarn-3762-1.patch, yarn-3762-1.patch
>
>
> In our testing, we ran into the following ConcurrentModificationException:
> {noformat}
> halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
> 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
> queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
> queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
> 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
> java.util.ConcurrentModificationException: 
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
> {noformat}





[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Matthew Jacobs (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570040#comment-14570040
 ] 

Matthew Jacobs commented on YARN-2194:
--

Thanks, [~sidharta-s]. So the change would be in how the container-executor 
accepts lists of paths, not attempting to re-mount the controllers, right? If I 
understand it correctly, that sounds like a good plan to me.

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.





[jira] [Commented] (YARN-3510) Create an extension of ProportionalCapacityPreemptionPolicy which preempts a number of containers from each application in a way which respects fairness

2015-06-02 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570039#comment-14570039
 ] 

Craig Welch commented on YARN-3510:
---

[~leftnoteasy] and I had some offline discussion. The patch currently here is 
simply meant to keep from unbalancing whatever allocation process is active by, 
generally, keeping relative usage between applications the same. It doesn't 
attempt to actively re-allocate in a way which achieves the overall allocation 
policy, i.e., "as if all the applications had started at once" (which is 
obviously a more complex proposition). There's a desire to have that because, 
among other things, sometime down the road we may do preemption just among 
users/applications in a queue, and for that the preemption will need to 
actively work toward the allocation goals rather than just maintain current 
levels. This will add a medium level of complexity to the current patch. The 
deltas from the current approach are:
* Since the effect of preemption on ordering for fairness doesn't occur until 
the container is released, and we want to consider it right away, we will need 
to retain info about "pending preemption" for comparison on the app resources. 
It will be a deduction from usage for ordering purposes, as if the preemption 
had already happened (see the sketch after this comment).
* The preemptEvenly loop will need to reorder the app which was preempted after 
each preemption and then restart the iteration over apps (not necessarily over 
all apps, again, just until the first preemption).
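
A rough sketch of the "deduct pending preemption from usage for ordering" idea; 
{{getPendingPreemption()}} is a hypothetical bookkeeping accessor used only for 
illustration, not an existing scheduler API:

{code}
// Sketch: order apps by (current usage - resources already marked for
// preemption) so repeated preemption does not keep targeting the same app.
// getPendingPreemption() is hypothetical.
import java.util.Comparator;
import org.apache.hadoop.yarn.api.records.Resource;

interface PreemptableApp {
  Resource getCurrentConsumption();
  Resource getPendingPreemption(); // hypothetical bookkeeping, see comment above
}

class EffectiveUsageComparator implements Comparator<PreemptableApp> {
  @Override
  public int compare(PreemptableApp a, PreemptableApp b) {
    long ea = a.getCurrentConsumption().getMemory() - a.getPendingPreemption().getMemory();
    long eb = b.getCurrentConsumption().getMemory() - b.getPendingPreemption().getMemory();
    // Larger effective usage sorts first, i.e. is preempted from next.
    return Long.compare(eb, ea);
  }
}
{code}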


> Create an extension of ProportionalCapacityPreemptionPolicy which preempts a 
> number of containers from each application in a way which respects fairness
> 
>
> Key: YARN-3510
> URL: https://issues.apache.org/jira/browse/YARN-3510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Craig Welch
>Assignee: Craig Welch
> Attachments: YARN-3510.2.patch, YARN-3510.3.patch, YARN-3510.5.patch, 
> YARN-3510.6.patch
>
>
> The ProportionalCapacityPreemptionPolicy preempts as many containers from 
> applications as it can during its preemption run. For FIFO ordering this makes 
> sense, as it is preempting in reverse order and therefore maintaining the 
> primacy of the "oldest". For fair ordering this does not have the desired 
> effect - instead, it should preempt a number of containers from each 
> application such that a fair balance, or something close to a fair balance, is 
> maintained between them.





[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570037#comment-14570037
 ] 

Sidharta Seethana commented on YARN-2194:
-

There are two different issues here : 

* container-executor binary invocation uses ‘,’ as a separator when supplying a 
list of paths - which breaks when the path contains ‘,’
* cpu,cpuacct are mounted together by default on RHEL7 

Now, for the latter issue : In {{CgroupsLCEResourcesHandler}}, the following 
steps occur : 

* If the {{yarn.nodemanager.linux-container-executor.cgroups.mount}} switch is 
enabled , the ‘cpu’ controller is explicitly mounted at the specified path. 
* (irrespective of the state of the switch) The {{/proc/mounts}} file (possibly 
updated by the previous step) is subsequently parsed to determine the mount 
locations for the various cgroup controllers - this parsing code seems to be 
correct even if cpu and cpuacct are mounted in one location.

So, the thing we need to fix is the separator issue and we should be good.  The 
important thing to remember is that there are *two* cgroups implementation 
classes ( {{CgroupsLCEResourcesHandler}} and {{CGroupsHandlerImpl}} ). 
Hopefully, this will be addressed soon ( YARN-3542 ) - or we risk divergence. 


> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.





[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570025#comment-14570025
 ] 

Wangda Tan commented on YARN-3733:
--

Took a look at the patch and discussion. Thanks for working on this 
[~rohithsharma].

I think what [~sunilg] mentioned in 
https://issues.apache.org/jira/browse/YARN-3733?focusedCommentId=14568880&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14568880
makes sense to me. If the clusterResource is 0, we can compare the individual 
resource types. It could be:

{code}
Returns >: when l.mem > r.mem || l.cpu > r.cpu
Returns =: when (l.mem <= r.mem && l.cpu >= r.cpu) || (l.mem >= r.mem && l.cpu <= r.cpu)
Returns <: when l.mem < r.mem || l.cpu < r.cpu
{code}

This produces the same result as the INF approach in the patch, but it can also 
compare when both l and r have > 0 values. The reason I prefer this is that I'm 
sure the patch solves the am-resource-percent problem, but with the suggested 
approach we also get a more reasonable result if we need to compare non-zero 
resources when clusterResource is zero (for example, sorting applications by 
their requirements when clusterResource is zero).

And to avoid future regression, could you add a test to verify the 
am-resource-limit problem is solved?
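
One way to read the rule above as Java, treating mixed dominance (each side 
larger in one dimension) as equal; a sketch only, not the committed fix:

{code}
// Sketch: compare two Resources without normalizing by the cluster resource
// (for the case where the cluster resource is zero). Mixed dominance - each
// side larger in one dimension - is treated as equal.
import org.apache.hadoop.yarn.api.records.Resource;

public class ZeroClusterResourceComparator {
  static int compare(Resource l, Resource r) {
    int memCmp = Integer.compare(l.getMemory(), r.getMemory());
    int cpuCmp = Integer.compare(l.getVirtualCores(), r.getVirtualCores());
    if ((memCmp > 0 && cpuCmp >= 0) || (memCmp >= 0 && cpuCmp > 0)) {
      return 1;   // l dominates r on at least one dimension, ties elsewhere
    }
    if ((memCmp < 0 && cpuCmp <= 0) || (memCmp <= 0 && cpuCmp < 0)) {
      return -1;  // r dominates l
    }
    return 0;     // equal, or each side larger in a different dimension
  }
}
{code}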

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: 0001-YARN-3733.patch, YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)





[jira] [Updated] (YARN-3534) Collect memory/cpu usage on the node

2015-06-02 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3534:
--
Attachment: YARN-3534-10.patch

Addressed the review comments.

> Collect memory/cpu usage on the node
> 
>
> Key: YARN-3534
> URL: https://issues.apache.org/jira/browse/YARN-3534
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Inigo Goiri
>Assignee: Inigo Goiri
> Attachments: YARN-3534-1.patch, YARN-3534-10.patch, 
> YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch, YARN-3534-4.patch, 
> YARN-3534-5.patch, YARN-3534-6.patch, YARN-3534-7.patch, YARN-3534-8.patch, 
> YARN-3534-9.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> YARN should be aware of the resource utilization of the nodes when scheduling 
> containers. For this, this task will implement the collection of memory/cpu 
> usage on the node.
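
As a rough illustration of the data being collected (using standard JMX beans 
rather than the patch's actual plumbing):

{code}
// Sketch: sample node-level memory and CPU utilization with standard JMX
// beans. Illustration only; the patch itself wires this into the NM.
import java.lang.management.ManagementFactory;

public class NodeUtilizationSampler {
  public static void main(String[] args) {
    com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean)
            ManagementFactory.getOperatingSystemMXBean();
    long totalMemBytes = os.getTotalPhysicalMemorySize();
    long freeMemBytes = os.getFreePhysicalMemorySize();
    double cpuLoad = os.getSystemCpuLoad(); // 0.0-1.0, or negative if unavailable
    System.out.printf("memory used: %d MB, cpu: %.1f%%%n",
        (totalMemBytes - freeMemBytes) / (1024 * 1024), cpuLoad * 100);
  }
}
{code}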





[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-02 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570007#comment-14570007
 ] 

Zhijie Shen commented on YARN-3725:
---

bq. is there a JIRA for the longer term fix?

Yeah, I've filed YARN-3761 previously.

> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.7.1
>
> Attachments: YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from the Timeline 
> DT to renew the DT instead of the configured address. This breaks the 
> procedure of submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in Java code. The REST API response is an encoded token String, so it 
> is inconvenient to deserialize it, set the service address, and serialize it 
> again.
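
For context, the client-side workaround being described (decode the returned 
token string, fill in the service address, re-encode it) roughly looks like the 
following sketch, which uses the generic Hadoop token API and a placeholder 
address:

{code}
// Sketch: what a REST client has to do today - decode the returned token
// string, set the service address itself, then re-encode the token.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class TimelineDtServiceFixer {
  static String fillServiceAddress(String encodedToken, String timelineAddress)
      throws IOException {
    Token<TokenIdentifier> token = new Token<TokenIdentifier>();
    token.decodeFromUrlString(encodedToken);
    token.setService(new Text(timelineAddress)); // e.g. "timelinehost:8188"
    return token.encodeToUrlString();
  }
}
{code}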





[jira] [Commented] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569992#comment-14569992
 ] 

Karthik Kambatla commented on YARN-3762:


Changed it to critical and targeting 2.8.0, as it only fails the application 
and not the RM.

> FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
> ---
>
> Key: YARN-3762
> URL: https://issues.apache.org/jira/browse/YARN-3762
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: yarn-3762-1.patch, yarn-3762-1.patch
>
>
> In our testing, we ran into the following ConcurrentModificationException:
> {noformat}
> halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
> 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
> queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
> queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
> 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
> java.util.ConcurrentModificationException: 
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
> {noformat}





[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3762:
---
Priority: Critical  (was: Blocker)
Target Version/s: 2.8.0  (was: 2.7.1)

> FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
> ---
>
> Key: YARN-3762
> URL: https://issues.apache.org/jira/browse/YARN-3762
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: yarn-3762-1.patch, yarn-3762-1.patch
>
>
> In our testing, we ran into the following ConcurrentModificationException:
> {noformat}
> halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
> 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
> queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
> queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
> 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
> java.util.ConcurrentModificationException: 
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
> {noformat}





[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3762:
---
Attachment: yarn-3762-1.patch

Sorry, I forgot to rebase and included some HDFS change as well.

> FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
> ---
>
> Key: YARN-3762
> URL: https://issues.apache.org/jira/browse/YARN-3762
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Attachments: yarn-3762-1.patch, yarn-3762-1.patch
>
>
> In our testing, we ran into the following ConcurrentModificationException:
> {noformat}
> halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
> 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
> queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
> queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
> 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
> java.util.ConcurrentModificationException: 
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
> {noformat}





[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3762:
---
Attachment: yarn-3762-1.patch

Here is a patch that protects FSParentQueue members with read-write locks. 

> FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
> ---
>
> Key: YARN-3762
> URL: https://issues.apache.org/jira/browse/YARN-3762
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Attachments: yarn-3762-1.patch
>
>
> In our testing, we ran into the following ConcurrentModificationException:
> {noformat}
> halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
> 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
> queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
> queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
> 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
> java.util.ConcurrentModificationException: 
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
> {noformat}





[jira] [Created] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-3762:
--

 Summary: FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
 Key: YARN-3762
 URL: https://issues.apache.org/jira/browse/YARN-3762
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker


In our testing, we ran into the following ConcurrentModificationException:

{noformat}
halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
queueName=root.testyarnpool3, queueCurrentCapacity=0.0, queueMaxCapacity=-1.0, 
queueApplicationCount=0, queueChildQueueCount=0
15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
java.util.ConcurrentModificationException: 
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
{noformat}






[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569918#comment-14569918
 ] 

Vinod Kumar Vavilapalli commented on YARN-3725:
---

[~zjshen], is there a JIRA for the longer term fix?

> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.7.1
>
> Attachments: YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from the Timeline 
> DT to renew the DT instead of the configured address. This breaks the 
> procedure of submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in Java code. The REST API response is an encoded token String, so it 
> is inconvenient to deserialize it, set the service address, and serialize it 
> again.





[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Philip Langdale (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569899#comment-14569899
 ] 

Philip Langdale commented on YARN-2194:
---

You can remount controllers if you retain the same combination as the existing 
mount point, so I guess you could replace the ',' with something your parsing 
code can handle (or you could fix the parsing code). In general, life is a lot 
easier if you can avoid remounting as you then don't have to worry about 
managing their lifecycle.

I'd argue the most robust thing to do is discover the existing mount point from 
/proc/mounts and then use it (assuming the comma parsing can be fixed) if it's 
present (and don't forget to respect the NodeManager's cgroup paths from 
/proc/self/mounts)

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.





[jira] [Commented] (YARN-2392) add more diags about app retry limits on AM failures

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569869#comment-14569869
 ] 

Hadoop QA commented on YARN-2392:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 23s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   9m 25s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  11m 23s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 56s | The applied patch generated  2 
new checkstyle issues (total was 244, now 245). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 46s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 42s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  52m  1s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  94m 38s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12737003/YARN-2392-002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 03fb5c6 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8169/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8169/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8169/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8169/console |


This message was automatically generated.

> add more diags about app retry limits on AM failures
> 
>
> Key: YARN-2392
> URL: https://issues.apache.org/jira/browse/YARN-2392
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: YARN-2392-001.patch, YARN-2392-002.patch, 
> YARN-2392-002.patch
>
>
> # when an app fails the failure count is shown, but not what the global + 
> local limits are. If the two are different, they should both be printed. 
> # the YARN-2242 strings don't have enough whitespace between text and the URL





[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Matthew Jacobs (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569778#comment-14569778
 ] 

Matthew Jacobs commented on YARN-2194:
--

I'm confused, does this mean that you'll re-mount the cpu and cpuacct 
controllers? Do we know that other components in the RHEL7 world don't expect 
them to be in the default place?

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.





[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569773#comment-14569773
 ] 

Jason Lowe commented on YARN-3585:
--

+1 latest patch lgtm.  Will commit this tomorrow if there are no objections.

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: 0001-YARN-3585.patch, YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}





[jira] [Updated] (YARN-2392) add more diags about app retry limits on AM failures

2015-06-02 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-2392:
-
Priority: Minor  (was: Major)

> add more diags about app retry limits on AM failures
> 
>
> Key: YARN-2392
> URL: https://issues.apache.org/jira/browse/YARN-2392
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: YARN-2392-001.patch, YARN-2392-002.patch, 
> YARN-2392-002.patch
>
>
> # when an app fails the failure count is shown, but not what the global + 
> local limits are. If the two are different, they should both be printed. 
> # the YARN-2242 strings don't have enough whitespace between text and the URL





[jira] [Updated] (YARN-2392) add more diags about app retry limits on AM failures

2015-06-02 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-2392:
-
Attachment: YARN-2392-002.patch

Patch 002
* in sync with trunk
* uses String.format for a more readable format of the response
* includes sliding window details in the message

There's no test here, for which I apologise. To test this I'd need a test to 
trigger failures and look for the final error message, which seems excessive 
for a log tuning. If there's a test for the sliding-window retry that could be 
patched, I'll do it there.
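
For illustration only, the kind of diagnostics line being described (names and 
wording here are made up, not the patch's):

{code}
// Sketch only: a failure diagnostic that reports both the global and the
// per-application attempt limits plus the sliding-window interval.
public class RetryDiagnostics {
  static String buildDiagnostics(String appId, int failedAttempts,
      int globalLimit, int appLimit, long windowMs) {
    return String.format(
        "Application %s failed %d times in the last %d ms "
            + "(global limit %d, application limit %d). Failing the application.",
        appId, failedAttempts, windowMs, globalLimit, appLimit);
  }
}
{code}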

> add more diags about app retry limits on AM failures
> 
>
> Key: YARN-2392
> URL: https://issues.apache.org/jira/browse/YARN-2392
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2392-001.patch, YARN-2392-002.patch, 
> YARN-2392-002.patch
>
>
> # when an app fails the failure count is shown, but not what the global + 
> local limits are. If the two are different, they should both be printed. 
> # the YARN-2242 strings don't have enough whitespace between text and the URL





[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-06-02 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569727#comment-14569727
 ] 

zhihai xu commented on YARN-3591:
-

Hi [~lavkesh], I think we can create a separate JIRA for storing local error 
directories in the NM state store, which will be a good enhancement.
Thanks [~sunilg]! Adding a new API to get local error directories is also a 
good suggestion. But I think it will be enough to just check newErrorDirs 
instead of all errorDirs.

To better support NM recovery and keep the DirsChangeListener interface simple, 
I propose the following changes:

1. In DirectoryCollection, notify the listeners when any of the sets of dirs 
(localDirs, errorDirs and fullDirs) changes.
The code change in {{DirectoryCollection#checkDirs}} looks like the following:
{code}
boolean needNotifyListener = setChanged;
for (String dir : preCheckFullDirs) {
  if (postCheckOtherDirs.contains(dir)) {
    needNotifyListener = true;
    LOG.warn("Directory " + dir + " error "
        + dirsFailedCheck.get(dir).message);
  }
}
for (String dir : preCheckOtherErrorDirs) {
  if (postCheckFullDirs.contains(dir)) {
    needNotifyListener = true;
    LOG.warn("Directory " + dir + " error "
        + dirsFailedCheck.get(dir).message);
  }
}
if (needNotifyListener) {
  for (DirsChangeListener listener : dirsChangeListeners) {
    listener.onDirsChanged();
  }
}
{code}

2. Add an API to get local error directories.
As [~sunilg] suggested, we can add an API {{synchronized List<String> 
getErrorDirs()}} in DirectoryCollection.java.
We also need to add an API {{public List<String> getLocalErrorDirs()}} in 
LocalDirsHandlerService.java, which will call 
{{DirectoryCollection#getErrorDirs}}.

3. Add a field {{Set<String> preLocalErrorDirs}} in 
ResourceLocalizationService.java to store the previous local error directories.
{{ResourceLocalizationService#preLocalErrorDirs}} should be loaded from the 
state store at the beginning if we support storing local error directories in 
the NM state store.

4.The following is pseudo code for {{localDirsChangeListener#onDirsChanged}}:
{code}
Set<String> curLocalErrorDirs =
    new HashSet<String>(dirsHandler.getLocalErrorDirs());
List<String> newErrorDirs = new ArrayList<String>();
List<String> newRepairedDirs = new ArrayList<String>();
for (String dir : curLocalErrorDirs) {
  if (!preLocalErrorDirs.contains(dir)) {
    newErrorDirs.add(dir);
  }
}
for (String dir : preLocalErrorDirs) {
  if (!curLocalErrorDirs.contains(dir)) {
    newRepairedDirs.add(dir);
  }
}
for (String localDir : newRepairedDirs) {
  cleanUpLocalDir(lfs, delService, localDir);
}
if (!newErrorDirs.isEmpty()) {
  // As Sunil suggested, checkLocalizedResources will call removeResource on
  // those localized resources whose parent is present in newErrorDirs.
  publicRsrc.checkLocalizedResources(newErrorDirs);
  for (LocalResourcesTracker tracker : privateRsrc.values()) {
    tracker.checkLocalizedResources(newErrorDirs);
  }
}
if (!newErrorDirs.isEmpty() || !newRepairedDirs.isEmpty()) {
  preLocalErrorDirs = curLocalErrorDirs;
  stateStore.storeLocalErrorDirs(
      StringUtils.arrayToString(curLocalErrorDirs.toArray(new String[0])));
}
checkAndInitializeLocalDirs();
{code}

5. It would be better to move {{verifyDirUsingMkdir(testDir)}} right after 
{{DiskChecker.checkDir(testDir)}} in {{DirectoryCollection#testDirs}}, so we 
can detect an error directory before detecting a full one.
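
A self-contained sketch of the ordering proposed in point 5; every helper below 
is a stand-in for the real DirectoryCollection code, and only the order of the 
checks (permission check, then mkdir-based write check, then fullness check) is 
the point.

{code}
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class TestDirsOrderingSketch {

  // Stand-in for DiskChecker.checkDir: basic existence/permission checks.
  static void checkDir(File dir) throws IOException {
    if (!dir.isDirectory() || !dir.canRead() || !dir.canWrite()) {
      throw new IOException("Bad directory: " + dir);
    }
  }

  // Stand-in for verifyDirUsingMkdir: prove the disk can actually take writes.
  static void verifyDirUsingMkdir(File dir) throws IOException {
    File probe = new File(dir, "probe-" + System.nanoTime());
    if (!probe.mkdir() || !probe.delete()) {
      throw new IOException("Cannot create/delete a test dir under " + dir);
    }
  }

  // Placeholder fullness check.
  static boolean isFull(File dir) {
    return dir.getUsableSpace() == 0;
  }

  public static void main(String[] args) {
    List<String> dirs = Arrays.asList("/tmp");
    for (String d : dirs) {
      File testDir = new File(d);
      try {
        checkDir(testDir);
        verifyDirUsingMkdir(testDir);  // moved up, before the fullness check
        System.out.println(d + " -> " + (isFull(testDir) ? "full" : "good"));
      } catch (IOException e) {
        System.out.println(d + " -> error: " + e.getMessage());
      }
    }
  }
}
{code}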

Please feel free to change or add more to my proposal.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch
>
>
> It happens when a resource is localised on a disk and, after localisation, 
> that disk has gone bad. The NM keeps paths for localised resources in memory. 
> At the time of a resource request, isResourcePresent(rsrc) will be called, 
> which calls file.exists() on the localised path.
> In some cases when a disk has gone bad, inodes are still cached and 
> file.exists() returns true, but at the time of reading the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it was able to find the inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good, it should return an array of 
> paths with length at least 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569648#comment-14569648
 ] 

Siddharth Seth commented on YARN-1462:
--

ApplicationReport.newInstance is used by MapReduce and Tez, and potentially by 
other applications modeled along the same lines as those AMs. It would be 
useful to make the API change here compatible. This is similar to the 
newInstance methods used for various constructs like ContainerId, AppId, etc.
With the change, I don't believe MR 2.6 will work with a 2.8 cluster, depending 
on how the classpath is set up.
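
The usual way to keep such a change compatible is to retain the old factory 
overload and have it delegate to the new one. The sketch below shows only the 
pattern with a made-up class; it is not ApplicationReport's actual parameter 
list.

{code}
// Hypothetical sketch of the compatibility pattern, not the real
// ApplicationReport API: the pre-existing overload is kept so older callers
// (e.g. MR 2.6) still link against it.
public class ReportFactoryExample {

  // old overload: delegate with a default for the newly added field
  public static ReportFactoryExample newInstance(String appId, String queue) {
    return newInstance(appId, queue, /* tags */ null);
  }

  // new overload that carries the added field
  public static ReportFactoryExample newInstance(String appId, String queue,
      java.util.Set<String> tags) {
    ReportFactoryExample r = new ReportFactoryExample();
    r.appId = appId;
    r.queue = queue;
    r.tags = tags;
    return r;
  }

  private String appId;
  private String queue;
  private java.util.Set<String> tags;
}
{code}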

> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569646#comment-14569646
 ] 

Hadoop QA commented on YARN-3069:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  19m 46s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 33s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 39s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 58s | Site still builds. |
| {color:green}+1{color} | checkstyle |   1m 36s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   3m 22s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  23m 34s | Tests passed in 
hadoop-common. |
| {color:green}+1{color} | yarn tests |   1m 55s | Tests passed in 
hadoop-yarn-common. |
| | |  72m 56s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736976/YARN-3069.011.patch |
| Optional Tests | site javadoc javac unit findbugs checkstyle |
| git revision | trunk / a2bd621 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/whitespace.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/console |


This message was automatically generated.

> Document missing properties in yarn-default.xml
> ---
>
> Key: YARN-3069
> URL: https://issues.apache.org/jira/browse/YARN-3069
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: BB2015-05-TBR, supportability
> Attachments: YARN-3069.001.patch, YARN-3069.002.patch, 
> YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, 
> YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, 
> YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch
>
>
> The following properties are currently not defined in yarn-default.xml.  
> These properties should either be
>   A) documented in yarn-default.xml OR
>   B)  listed as an exception (with comments, e.g. for internal use) in the 
> TestYarnConfigurationFields unit test
> Any comments for any of the properties below are welcome.
>   org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
>   org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
>   security.applicationhistory.protocol.acl
>   yarn.app.container.log.backups
>   yarn.app.container.log.dir
>   yarn.app.container.log.filesize
>   yarn.client.app-submission.poll-interval
>   yarn.client.application-client-protocol.poll-timeout-ms
>   yarn.is.minicluster
>   yarn.log.server.url
>   yarn.minicluster.control-resource-monitoring
>   yarn.minicluster.fixed.ports
>   yarn.minicluster.use-rpc
>   yarn.node-labels.fs-store.retry-policy-spec
>   yarn.node-labels.fs-store.root-dir
>   yarn.node-labels.manager-class
>   yarn.nodemanager.container-executor.os.sched.priority.adjustment
>   yarn.nodemanager.container-monitor.process-tree.class
>   yarn.nodemanager.disk-health-checker.enable
>   yarn.nodemanager.docker-container-executor.image-name
>   yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
>   yarn.nodemanager.linux-container-executor.group
>   yarn.nodemanager.log.deletion-threads-count
>   yarn.nodemanager.user-home-dir
>   yarn.nodemanager.webapp.https.address
>   yarn.nodemanager.webapp.spnego-keytab-fil

[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569548#comment-14569548
 ] 

Sergey Shelukhin commented on YARN-1462:


[~sseth] can you please comment on the above (use of Private API)?

> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3069) Document missing properties in yarn-default.xml

2015-06-02 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated YARN-3069:
-
Attachment: YARN-3069.011.patch

Thanks Akira!  New patch with the following changes:

- Fix description for yarn.node-labels.fs-store.retry-policy-spec
- Remove YARN registry entries from yarn-default.xml
- Remove one outdated entry yarn.application.classpath.prepend.distcache
- Add entry for yarn.intermediate-data-encryption.enable

I'll also go through the yarn-default.xml file once more to make sure no 
default values will change.

> Document missing properties in yarn-default.xml
> ---
>
> Key: YARN-3069
> URL: https://issues.apache.org/jira/browse/YARN-3069
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: BB2015-05-TBR, supportability
> Attachments: YARN-3069.001.patch, YARN-3069.002.patch, 
> YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, 
> YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, 
> YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch
>
>
> The following properties are currently not defined in yarn-default.xml.  
> These properties should either be
>   A) documented in yarn-default.xml OR
>   B)  listed as an exception (with comments, e.g. for internal use) in the 
> TestYarnConfigurationFields unit test
> Any comments for any of the properties below are welcome.
>   org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
>   org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
>   security.applicationhistory.protocol.acl
>   yarn.app.container.log.backups
>   yarn.app.container.log.dir
>   yarn.app.container.log.filesize
>   yarn.client.app-submission.poll-interval
>   yarn.client.application-client-protocol.poll-timeout-ms
>   yarn.is.minicluster
>   yarn.log.server.url
>   yarn.minicluster.control-resource-monitoring
>   yarn.minicluster.fixed.ports
>   yarn.minicluster.use-rpc
>   yarn.node-labels.fs-store.retry-policy-spec
>   yarn.node-labels.fs-store.root-dir
>   yarn.node-labels.manager-class
>   yarn.nodemanager.container-executor.os.sched.priority.adjustment
>   yarn.nodemanager.container-monitor.process-tree.class
>   yarn.nodemanager.disk-health-checker.enable
>   yarn.nodemanager.docker-container-executor.image-name
>   yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
>   yarn.nodemanager.linux-container-executor.group
>   yarn.nodemanager.log.deletion-threads-count
>   yarn.nodemanager.user-home-dir
>   yarn.nodemanager.webapp.https.address
>   yarn.nodemanager.webapp.spnego-keytab-file
>   yarn.nodemanager.webapp.spnego-principal
>   yarn.nodemanager.windows-secure-container-executor.group
>   yarn.resourcemanager.configuration.file-system-based-store
>   yarn.resourcemanager.delegation-token-renewer.thread-count
>   yarn.resourcemanager.delegation.key.update-interval
>   yarn.resourcemanager.delegation.token.max-lifetime
>   yarn.resourcemanager.delegation.token.renew-interval
>   yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size
>   yarn.resourcemanager.metrics.runtime.buckets
>   yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs
>   yarn.resourcemanager.reservation-system.class
>   yarn.resourcemanager.reservation-system.enable
>   yarn.resourcemanager.reservation-system.plan.follower
>   yarn.resourcemanager.reservation-system.planfollower.time-step
>   yarn.resourcemanager.rm.container-allocation.expiry-interval-ms
>   yarn.resourcemanager.webapp.spnego-keytab-file
>   yarn.resourcemanager.webapp.spnego-principal
>   yarn.scheduler.include-port-in-node-name
>   yarn.timeline-service.delegation.key.update-interval
>   yarn.timeline-service.delegation.token.max-lifetime
>   yarn.timeline-service.delegation.token.renew-interval
>   yarn.timeline-service.generic-application-history.enabled
>   
> yarn.timeline-service.generic-application-history.fs-history-store.compression-type
>   yarn.timeline-service.generic-application-history.fs-history-store.uri
>   yarn.timeline-service.generic-application-history.store-class
>   yarn.timeline-service.http-cross-origin.enabled
>   yarn.tracking.url.generator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3761) Set delegation token service address at the server side

2015-06-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena reassigned YARN-3761:
--

Assignee: Varun Saxena

> Set delegation token service address at the server side
> ---
>
> Key: YARN-3761
> URL: https://issues.apache.org/jira/browse/YARN-3761
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: security
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
>
> Nowadays, YARN components generate the delegation token without the service 
> address set, and leave it to the client to set. With our java client library, 
> it is usually fine. However, if users are using REST API, it's going to be a 
> problem: The delegation token is returned as a url string. It's so unfriendly 
> for the thin client to deserialize the url string, set the token service 
> address and serialize it again for further usage. If we move the task of 
> setting the service address to the server side, the client can get rid of 
> this trouble.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-06-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569480#comment-14569480
 ] 

Sunil G commented on YARN-3591:
---

If we have a new API which returns the present set of error dirs alone (without 
the full dirs),
{code}
synchronized List<String> getErrorDirs()
{code}
then could we modify LocalResourcesTrackerImpl#checkLocalizedResources in such 
a way that we call *removeResource* on those localized resources whose parent 
is present in the error dirs?
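
A rough sketch of what that check could look like, assuming each localized 
resource can report its local path; this is a simplified stand-in, not the real 
LocalResourcesTrackerImpl.

{code}
import java.io.File;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: simplified stand-in for the proposed
// LocalResourcesTrackerImpl#checkLocalizedResources behaviour.
public class CheckLocalizedResourcesSketch {

  static class LocalizedResource {       // stand-in for the real class
    final String localPath;
    LocalizedResource(String localPath) { this.localPath = localPath; }
  }

  private final List<LocalizedResource> resources =
      new ArrayList<LocalizedResource>();

  /** Drop every localized resource whose path lives under an error dir. */
  public void checkLocalizedResources(List<String> errorDirs) {
    Set<String> bad = new HashSet<String>(errorDirs);
    List<LocalizedResource> toRemove = new ArrayList<LocalizedResource>();
    for (LocalizedResource rsrc : resources) {
      File parent = new File(rsrc.localPath).getParentFile();
      while (parent != null) {
        if (bad.contains(parent.getPath())) {  // resource sits on a bad disk
          toRemove.add(rsrc);
          break;
        }
        parent = parent.getParentFile();
      }
    }
    resources.removeAll(toRemove);             // stands in for removeResource()
  }
}
{code}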



> Resource Localisation on a bad disk causes subsequent containers failure 
> -
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch
>
>
> It happens when a resource is localised on a disk and, after localisation, 
> that disk has gone bad. The NM keeps paths for localised resources in memory. 
> At the time of a resource request, isResourcePresent(rsrc) will be called, 
> which calls file.exists() on the localised path.
> In some cases when a disk has gone bad, inodes are still cached and 
> file.exists() returns true, but at the time of reading the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it was able to find the inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good, it should return an array of 
> paths with length at least 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-02 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569465#comment-14569465
 ] 

Xuan Gong commented on YARN-3753:
-

Committed into branch-2.7. Thanks, Jian

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch
>
>
> RM failed to come up with the following error while submitting a MapReduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMac

[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-02 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569456#comment-14569456
 ] 

Xuan Gong commented on YARN-3753:
-

+1, LGTM. Check this in

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch
>
>
> RM failed to come up with the following error while submitting a MapReduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.h

[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569453#comment-14569453
 ] 

Hadoop QA commented on YARN-2618:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12723515/YARN-2618-7.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a2bd621 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8167/console |


This message was automatically generated.

> Avoid over-allocation of disk resources
> ---
>
> Key: YARN-2618
> URL: https://issues.apache.org/jira/browse/YARN-2618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wei Yan
>Assignee: Wei Yan
>  Labels: BB2015-05-TBR
> Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, 
> YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch
>
>
> Subtask of YARN-2139. 
> This should include
> - Add API support for introducing disk I/O as the 3rd type resource.
> - NM should report this information to the RM
> - RM should consider this to avoid over-allocation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources

2015-06-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569452#comment-14569452
 ] 

Karthik Kambatla commented on YARN-2618:


[~vvasudev] - thanks for the ping. I haven't had the time to do a thorough 
review of the remaining tasks here, and hence have held off on committing this. 
Do you have the cycles to help shepherd this work into the branch?

And yes, we should bring YARN-2139 up to date with trunk and commit this.

> Avoid over-allocation of disk resources
> ---
>
> Key: YARN-2618
> URL: https://issues.apache.org/jira/browse/YARN-2618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wei Yan
>Assignee: Wei Yan
>  Labels: BB2015-05-TBR
> Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, 
> YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch
>
>
> Subtask of YARN-2139. 
> This should include
> - Add API support for introducing disk I/O as the 3rd type resource.
> - NM should report this information to the RM
> - RM should consider this to avoid over-allocation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569445#comment-14569445
 ] 

Karthik Kambatla commented on YARN-2194:


I haven't looked at it closely, but I think YARN doesn't pick the separator. If 
we could easily change the separator from within YARN, that is, without 
requiring any other environment changes by the admin, I'll be +1 for that 
change. By the way, Linux allows anything but '/' and '%' for filenames. So, 
picking ':' or '|' only makes issues less likely in the future. Who would have 
thought they would use ',' in a filename?

If we continue with the patch posted here, I think [~mjacobs]' suggestion makes 
sense.  
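
To make the separator problem concrete, here is a purely illustrative example; 
the paths and the list format are made up for the example, and only the 
"cpu,cpuacct" directory name comes from RHEL7.

{code}
import java.util.Arrays;

// Hypothetical illustration of why the separator matters when a list of
// cgroup paths is joined into a single string.
public class CgroupSeparatorExample {
  public static void main(String[] args) {
    String rhel6Style = "/sys/fs/cgroup/cpu/yarn/c_01/tasks";
    String rhel7Style = "/sys/fs/cgroup/cpu,cpuacct/yarn/c_01/tasks";

    // Comma as the separator: the RHEL7 path is split in the middle,
    // producing three pieces instead of two.
    String commaList = rhel6Style + "," + rhel7Style;
    System.out.println(Arrays.toString(commaList.split(",")));

    // A separator that cannot appear in the paths keeps them intact.
    String percentList = rhel6Style + "%" + rhel7Style;
    System.out.println(Arrays.toString(percentList.split("%")));
  }
}
{code}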

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the user of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources

2015-06-02 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569437#comment-14569437
 ] 

Varun Vasudev commented on YARN-2618:
-

[~kasha] - should we commit this to the YARN-2139 branch? Should we get the 
branch up to date with trunk first?

> Avoid over-allocation of disk resources
> ---
>
> Key: YARN-2618
> URL: https://issues.apache.org/jira/browse/YARN-2618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wei Yan
>Assignee: Wei Yan
>  Labels: BB2015-05-TBR
> Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, 
> YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch
>
>
> Subtask of YARN-2139. 
> This should include
> - Add API support for introducing disk I/O as the 3rd type resource.
> - NM should report this information to the RM
> - RM should consider this to avoid over-allocation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569421#comment-14569421
 ] 

Wei Yan commented on YARN-2194:
---

[~sidharta-s], thanks for the advice. Using a different separator LGTM. That 
way, we can keep trusting the "cpu" controller, and it also helps us avoid 
OS-specific changes.
Comments? [~kasha], [~vinodkv], [~mjacobs].

And for the new CGroupsHandlerImpl, I didn't find any problem when I checked 
the patch. [~vvasudev], please correct me if I missed anything. 


> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the user of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3761) Set delegation token service address at the server side

2015-06-02 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-3761:
-

 Summary: Set delegation token service address at the server side
 Key: YARN-3761
 URL: https://issues.apache.org/jira/browse/YARN-3761
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: security
Reporter: Zhijie Shen


Nowadays, YARN components generate the delegation token without the service 
address set, and leave it to the client to set. With our java client library, 
it is usually fine. However, if users are using REST API, it's going to be a 
problem: The delegation token is returned as a url string. It's so unfriendly 
for the thin client to deserialize the url string, set the token service 
address and serialize it again for further usage. If we move the task of 
setting the service address to the server side, the client can get rid of this 
trouble.
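
For illustration, the snippet below is roughly the round trip a client has to 
perform today before the returned token is usable; the RM address is a 
placeholder and error handling is omitted.

{code}
import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

// Sketch of the extra work a REST client currently has to replicate before it
// can use the delegation token it received as a URL-safe string.
public class SetTokenServiceExample {
  public static Token<TokenIdentifier> prepare(String urlEncodedToken)
      throws IOException {
    Token<TokenIdentifier> token = new Token<TokenIdentifier>();
    token.decodeFromUrlString(urlEncodedToken);        // deserialize
    InetSocketAddress rmAddress =
        new InetSocketAddress("rm.example.com", 8032); // placeholder address
    SecurityUtil.setTokenService(token, rmAddress);    // set the service field
    // re-serialize with token.encodeToUrlString() if it needs to travel again
    return token;
  }
}
{code}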



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569381#comment-14569381
 ] 

Sunil G commented on YARN-3754:
---

[~bibinchundatt] Could you also please attach the NM logs here?

> Race condition when the NodeManager is shutting down and container is launched
> --
>
> Key: YARN-3754
> URL: https://issues.apache.org/jira/browse/YARN-3754
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Sunil G
>Priority: Critical
>
> The container is launched and returned to ContainerImpl.
> The NodeManager closed the DB connection, which results in 
> {{org.iq80.leveldb.DBException: Closed}}. 
> *Attaching the exception trace*
> {code}
> 2015-05-30 02:11:49,122 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Unable to update state store diagnostics for 
> container_e310_1432817693365_3338_01_02
> java.io.IOException: org.iq80.leveldb.DBException: Closed
> at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.iq80.leveldb.DBException: Closed
> at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
> at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
> at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
> ... 15 more
> {code}
> We can add a check for whether the DB is closed while we move the container 
> out of the ACQUIRED state.
> As per the discussion in YARN-3585, the same has been added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569347#comment-14569347
 ] 

Vinod Kumar Vavilapalli commented on YARN-3755:
---

We had this long ago in YARN, but removed it because the log files were getting 
inundated on large, high-throughput clusters. If you combine the command line 
with the environment (classpath etc.), this can get very long.

How about we let individual frameworks like MapReduce/Tez log them as needed? 
That seems like the right place for debugging too - app developers don't always 
get access to the daemon logs.
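
As a sketch of the framework-side option (not actual MapReduce/Tez code), an AM 
can log the commands straight from the ContainerLaunchContext it builds before 
handing it to the NM.

{code}
import java.util.Arrays;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

// Sketch only: how an application master could log its own container launch
// command instead of relying on the NM daemon logs.
public class LaunchCommandLogging {
  private static final Log LOG = LogFactory.getLog(LaunchCommandLogging.class);

  static ContainerLaunchContext buildContext(List<String> commands) {
    // only the commands are set in this sketch; the other fields are null
    ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
        null, null, commands, null, null, null);
    LOG.info("Launching container with command: " + ctx.getCommands());
    return ctx;
  }

  public static void main(String[] args) {
    buildContext(Arrays.asList("$JAVA_HOME/bin/java", "-Xmx1024m",
        "org.example.MyTask", "1><LOG_DIR>/stdout", "2><LOG_DIR>/stderr"));
  }
}
{code}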

> Log the command of launching containers
> ---
>
> Key: YARN-3755
> URL: https://issues.apache.org/jira/browse/YARN-3755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.7.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: YARN-3755-1.patch, YARN-3755-2.patch
>
>
> In the ResourceManager log, YARN logs the command for launching the AM, which 
> is very useful. But there's no such log in the NM log for launching 
> containers. That makes it difficult to diagnose when containers fail to launch 
> due to some issue in the commands. Although users can look at the commands in 
> the container launch script file, that is an internal detail of YARN which 
> users usually don't know about. From the user's perspective, they only know 
> what commands they specified when building the YARN application. 
> {code}
> 2015-06-01 16:06:42,245 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
> to launch container container_1433145984561_0001_01_01 : 
> $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
> -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA 
> -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
> 1>/stdout 2>/stderr
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569342#comment-14569342
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2162 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2162/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569329#comment-14569329
 ] 

Sidharta Seethana commented on YARN-2194:
-

To clarify, my comment was with respect to this line in the description: {{The 
comma in the controller name leads to container launch failure.}} I believe 
switching separators or encoding arguments in some way is a better approach 
than requiring symlinks or transforming "cpu,cpuacct" to "cpu" as the 
controller name. 

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the user of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569322#comment-14569322
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #214 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/214/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3760) Log aggregation failures

2015-06-02 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569289#comment-14569289
 ] 

Daryn Sharp commented on YARN-3760:
---

Cancelled tokens trigger the retry proxy bug.

> Log aggregation failures 
> -
>
> Key: YARN-3760
> URL: https://issues.apache.org/jira/browse/YARN-3760
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> The aggregated log file does not appear to be properly closed when writes 
> fail.  This leaves a lease renewer active in the NM that spams the NN with 
> lease renewals.  If the token is marked not to be cancelled, the renewals 
> appear to continue until the token expires.  If the token is cancelled, the 
> periodic renew spam turns into a flood of failed connections until the lease 
> renewer gives up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-06-02 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569278#comment-14569278
 ] 

Devaraj K commented on YARN-41:
---

Thanks a lot [~djp] for your review and comments, I really appreciate your help 
on reviewing the patch.

> The RM should handle the graceful shutdown of the NM.
> -
>
> Key: YARN-41
> URL: https://issues.apache.org/jira/browse/YARN-41
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager
>Reporter: Ravi Teja Ch N V
>Assignee: Devaraj K
> Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
> MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
> YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, 
> YARN-41-8.patch, YARN-41.patch
>
>
> Instead of waiting for the NM expiry, the RM should remove and handle an NM 
> that is shut down gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3760) Log aggregation failures

2015-06-02 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-3760:
-

 Summary: Log aggregation failures 
 Key: YARN-3760
 URL: https://issues.apache.org/jira/browse/YARN-3760
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Daryn Sharp
Priority: Critical


The aggregated log file does not appear to be properly closed when writes fail. 
 This leaves a lease renewer active in the NM that spams the NN with lease 
renewals.  If the token is marked not to be cancelled, the renewals appear to 
continue until the token expires.  If the token is cancelled, the periodic 
renew spam turns into a flood of failed connections until the lease renewer 
gives up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569267#comment-14569267
 ] 

Karthik Kambatla commented on YARN-3753:


Fix looks reasonable to me.

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch
>
>
> RM failed to come up with the following error while submitting a MapReduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   

[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode

2015-06-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569262#comment-14569262
 ] 

Karthik Kambatla commented on YARN-2962:


YARN-3643 should help alleviate most of the issues users face. This JIRA could 
be targeted only at trunk, without worrying about rolling upgrades.

> ZKRMStateStore: Limit the number of znodes under a znode
> 
>
> Key: YARN-2962
> URL: https://issues.apache.org/jira/browse/YARN-2962
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch
>
>
> We ran into this issue where we were hitting the default ZK server message 
> size configs, primarily because the message had too many znodes even though 
> individually they were all small.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3754:
---
Target Version/s: 2.8.0  (was: 2.7.1)

> Race condition when the NodeManager is shutting down and container is launched
> --
>
> Key: YARN-3754
> URL: https://issues.apache.org/jira/browse/YARN-3754
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Sunil G
>Priority: Critical
>
> The container is launched and returned to ContainerImpl.
> The NodeManager closed the DB connection, which results in 
> {{org.iq80.leveldb.DBException: Closed}}. 
> *Attaching the exception trace*
> {code}
> 2015-05-30 02:11:49,122 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Unable to update state store diagnostics for 
> container_e310_1432817693365_3338_01_02
> java.io.IOException: org.iq80.leveldb.DBException: Closed
> at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.iq80.leveldb.DBException: Closed
> at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
> at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
> at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
> ... 15 more
> {code}
> We can add a check for whether the DB is closed while we move the container 
> out of the ACQUIRED state.
> As per the discussion in YARN-3585, the same has been added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3754:
---
Priority: Critical  (was: Major)
Target Version/s: 2.7.1

> Race condition when the NodeManager is shutting down and container is launched
> --
>
> Key: YARN-3754
> URL: https://issues.apache.org/jira/browse/YARN-3754
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Sunil G
>Priority: Critical
>
> Container is launched and returned to ContainerImpl
> NodeManager closed the DB connection which resulting in 
> {{org.iq80.leveldb.DBException: Closed}}. 
> *Attaching the exception trace*
> {code}
> 2015-05-30 02:11:49,122 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Unable to update state store diagnostics for 
> container_e310_1432817693365_3338_01_02
> java.io.IOException: org.iq80.leveldb.DBException: Closed
> at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.iq80.leveldb.DBException: Closed
> at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
> at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
> at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
> ... 15 more
> {code}
> we can add a check whether DB is closed while we move container from ACQUIRED 
> state.
> As per the discussion in YARN-3585 have add the same



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569237#comment-14569237
 ] 

Jason Lowe commented on YARN-3758:
--

First off, one should never set the heap size and the container size to the 
same value.  The container size needs to be big enough to hold the entire 
process, not just the heap, so it also needs to account for the overhead of the 
JVM itself and any off-heap usage (e.g.: JVM code, data, thread stacks, shared 
libs, off-heap allocations, etc.).  If you set the heap size to the same value 
as the container size, then when the heap fills up the overall process will be 
bigger than the container size and YARN will kill the container.
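
As a concrete illustration of that sizing rule (the 80% headroom figure is only 
a common rule of thumb, not a project requirement):
{code}
// Leave roughly 20% of the container for JVM overhead and off-heap usage, and
// derive -Xmx from the container request instead of making the two equal.
public class HeapSizing {
  public static void main(String[] args) {
    int containerMb = 1024;                      // mapreduce.map.memory.mb
    int heapMb = (int) (containerMb * 0.8);      // value to use for -Xmx
    System.out.println("mapreduce.map.memory.mb=" + containerMb
        + " -> mapreduce.map.java.opts=-Xmx" + heapMb + "m");
  }
}
{code}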

A couple of things to check:
- Does the job configuration show that it is indeed asking for only 256 MB 
containers for tasks? Check the job configuration link for the job on the job 
history server, or the configuration link in the AM's UI while the job is 
running.
- Check the RM logs to verify what minimum allocation size it is loading from 
the configs and what request size it is allocating per task.

> The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
> working in container
> 
>
> Key: YARN-3758
> URL: https://issues.apache.org/jira/browse/YARN-3758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters
> First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G 
> Physical memory each node
> Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G 
> Physical memory each node
> Wherever a mapreduce job is running, I want resourcemanager is to set the 
> minimum memory  256m to container
> So I was changing configuration in yarn-site.xml & mapred-site.xml
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> In First cluster  whenever a mapreduce job is running , I can see used memory 
> 256m in web console( http://installedIP:8088/cluster/nodes )
> But In Second cluster whenever a mapreduce job is running , I can see used 
> memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 
> I know default memory value is 1024m, so if there is not changing memory 
> setting, the default value is working.
> I have been testing for two weeks, but I don't know why mimimum memory 
> setting is not working in second cluster
> Why this difference is happened? 
> Am I wrong setting configuration?
> or Is there bug?
> Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3603) Application Attempts page confusing

2015-06-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3603:
--
Attachment: ahs1.png

> Application Attempts page confusing
> ---
>
> Key: YARN-3603
> URL: https://issues.apache.org/jira/browse/YARN-3603
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.8.0
>Reporter: Thomas Graves
>Assignee: Sunil G
> Attachments: 0001-YARN-3603.patch, 0002-YARN-3603.patch, ahs1.png
>
>
> The application attempts page 
> (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01)
> is a bit confusing on what is going on.  I think the table of containers 
> there is for only Running containers and when the app is completed or killed 
> its empty.  The table should have a label on it stating so.  
> Also the "AM Container" field is a link when running but not when its killed. 
>  That might be confusing.
> There is no link to the logs in this page but there is in the app attempt 
> table when looking at http://
> rm:8088/cluster/app/application_1431101480046_0003



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3603) Application Attempts page confusing

2015-06-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3603:
--
Attachment: 0002-YARN-3603.patch

Attaching an updated version of the patch, along with screenshots of the UI. 
[~tgraves] Could you please take a look? Thank you.

> Application Attempts page confusing
> ---
>
> Key: YARN-3603
> URL: https://issues.apache.org/jira/browse/YARN-3603
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.8.0
>Reporter: Thomas Graves
>Assignee: Sunil G
> Attachments: 0001-YARN-3603.patch, 0002-YARN-3603.patch, ahs1.png
>
>
> The application attempts page 
> (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01)
> is a bit confusing on what is going on.  I think the table of containers 
> there is for only Running containers and when the app is completed or killed 
> its empty.  The table should have a label on it stating so.  
> Also the "AM Container" field is a link when running but not when its killed. 
>  That might be confusing.
> There is no link to the logs in this page but there is in the app attempt 
> table when looking at http://
> rm:8088/cluster/app/application_1431101480046_0003



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-06-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569182#comment-14569182
 ] 

Junping Du commented on YARN-41:


Thanks [~devaraj.k] for updating the patch and addressing the previous comments! 
The latest patch LGTM. +1. I will commit it tomorrow if there are no further 
comments from other reviewers.
In addition, the patch introduces a new SHUTDOWN category on NodeState, the UI, 
and Cluster Metrics. Although it doesn't break any public APIs, we should mark 
this JIRA as incompatible because its behavior differs from previous releases in 
the UI, CLI, and Metrics (to notify users and third-party management and 
monitoring software). In general, I think it is fine to keep the plan of 
including this patch in the 2.x releases; however, please comment here if you 
have any concerns.
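
To make the compatibility note concrete, a small illustration (the only 
assumption is that the new value is NodeState.SHUTDOWN, as stated above):
{code}
import org.apache.hadoop.yarn.api.records.NodeState;

// Illustration: external tools that enumerate or switch on NodeState will see
// an additional SHUTDOWN value after this patch, even though no existing API
// signature changes.
public class ListNodeStates {
  public static void main(String[] args) {
    for (NodeState state : NodeState.values()) {
      System.out.println(state);   // now also prints SHUTDOWN
    }
  }
}
{code}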

> The RM should handle the graceful shutdown of the NM.
> -
>
> Key: YARN-41
> URL: https://issues.apache.org/jira/browse/YARN-41
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager
>Reporter: Ravi Teja Ch N V
>Assignee: Devaraj K
> Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
> MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
> YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, 
> YARN-41-8.patch, YARN-41.patch
>
>
> Instead of waiting for the NM expiry, RM should remove and handle the NM, 
> which is shutdown gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569191#comment-14569191
 ] 

Hudson commented on YARN-1462:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #205 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/205/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569171#comment-14569171
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2144 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2144/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-06-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568947#comment-14568947
 ] 

Junping Du commented on YARN-41:


bq. Junping Du I have updated the patch with review comments. Can you have a 
look into this?
Sorry for being late on this; I was traveling last week. I will review your 
latest patch today.

> The RM should handle the graceful shutdown of the NM.
> -
>
> Key: YARN-41
> URL: https://issues.apache.org/jira/browse/YARN-41
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager
>Reporter: Ravi Teja Ch N V
>Assignee: Devaraj K
> Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
> MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
> YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, 
> YARN-41-8.patch, YARN-41.patch
>
>
> Instead of waiting for the NM expiry, RM should remove and handle the NM, 
> which is shutdown gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568926#comment-14568926
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #946 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/946/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568920#comment-14568920
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #216 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/216/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568913#comment-14568913
 ] 

Hadoop QA commented on YARN-3733:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  6s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 33s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 54s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 33s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| | |  40m 10s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736802/0001-YARN-3733.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8166/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8166/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8166/console |


This message was automatically generated.

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: 0001-YARN-3733.patch, YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3759) Include command line, localization info and env vars on AM launch failure

2015-06-02 Thread Steve Loughran (JIRA)
Steve Loughran created YARN-3759:


 Summary: Include command line, localization info and env vars on 
AM launch failure
 Key: YARN-3759
 URL: https://issues.apache.org/jira/browse/YARN-3759
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Steve Loughran
Priority: Minor


While trying to diagnose AM launch failures, it's important to be able to get 
at the final, expanded {{CLASSPATH}} and other env variables. We don't get that 
today: you can log the unexpanded values on the client, and tweak NM 
ContainerExecutor log levels to DEBUG to get some of this, but you don't get it 
in the task logs, and tuning the NM log level isn't viable on a large, busy 
cluster.

Launch failures should include some env specifics:
# list of env vars (ideally, the full getenv values), with some stripping of 
"sensitive" entries (I'm thinking of AWS env vars here)
# command line
# path localisations

These can go in the task logs; we don't need to include them in the application 
report.
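
A minimal sketch of the env-var dump with "sensitive" values stripped, as 
described in item 1 above (the redaction pattern and method name are 
assumptions, not from this issue):
{code}
import java.util.Map;
import java.util.TreeMap;

// Sketch: render the container's environment for the task log, redacting
// anything that looks like a credential (the pattern is illustrative only).
static String dumpEnv(Map<String, String> env) {
  StringBuilder sb = new StringBuilder();
  for (Map.Entry<String, String> e : new TreeMap<String, String>(env).entrySet()) {
    boolean sensitive =
        e.getKey().matches("(?i).*(SECRET|TOKEN|PASSWORD|CREDENTIAL).*")
        || e.getKey().startsWith("AWS_");
    sb.append(e.getKey()).append('=')
      .append(sensitive ? "<redacted>" : e.getValue()).append('\n');
  }
  return sb.toString();
}
{code}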



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568880#comment-14568880
 ] 

Sunil G commented on YARN-3733:
---

Hi [~rohithsharma]
Thanks for the detailed scenarios.

Scenario 4 is also possible, correct? clusterResource <0,0> : lhs <2,2> and rhs 
<3,2>.

Currently getResourceAsValue returns the larger of the mem/vcores ratios when 
the dominant share is requested, and the smaller ratio otherwise.
If clusterResource is 0, could we instead compare the max of mem/vcores directly 
in the dominant case, and the min otherwise? This will need a better algorithm 
once more resource types come in, and it is not completely precise since it 
treats memory and vcores leniently. Please share your thoughts.
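
To make the suggestion concrete, a minimal sketch of that fallback as a 
standalone helper (illustration only, not the attached patch):
{code}
import org.apache.hadoop.yarn.api.records.Resource;

// Sketch: when clusterResource is <0,0> the per-resource ratios are undefined,
// so fall back to comparing the raw dominant (max) component first and the
// minor (min) component as a tie breaker.
static int compareWhenClusterEmpty(Resource lhs, Resource rhs) {
  int lhsMax = Math.max(lhs.getMemory(), lhs.getVirtualCores());
  int rhsMax = Math.max(rhs.getMemory(), rhs.getVirtualCores());
  if (lhsMax != rhsMax) {
    return Integer.compare(lhsMax, rhsMax);
  }
  int lhsMin = Math.min(lhs.getMemory(), lhs.getVirtualCores());
  int rhsMin = Math.min(rhs.getMemory(), rhs.getVirtualCores());
  return Integer.compare(lhsMin, rhsMin);
}
{code}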

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: 0001-YARN-3733.patch, YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: 0001-YARN-3733.patch

The updated patch fixes the 2nd and 3rd scenarios in the table above (the 
scenarios reported in this issue) and refactors the test code.

For an overall solution that also covers input combinations like the 4th and 
5th rows of the table, we need to explore further how to define the fraction 
and how to decide which resource is dominant. Any suggestions on this?



> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: 0001-YARN-3733.patch, YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-02 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568853#comment-14568853
 ] 

Brahma Reddy Battula commented on YARN-3170:


Updated the patch. Kindly review!

> YARN architecture document needs updating
> -
>
> Key: YARN-3170
> URL: https://issues.apache.org/jira/browse/YARN-3170
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
> YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
> YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, 
> YARN-3170-010.patch, YARN-3170.patch
>
>
> The marketing paragraph at the top, "NextGen MapReduce", etc are all 
> marketing rather than actual descriptions. It also needs some general 
> updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-06-02 Thread Lavkesh Lahngir (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568834#comment-14568834
 ] 

Lavkesh Lahngir commented on YARN-3591:
---

[~zxu]: Can we get away without storing this in the NM state store? The other 
changes seem to be okay.
It's not a big change in terms of code, but adding it to the NM state store 
could be debatable.
[~vvasudev]: Thoughts?

> Resource Localisation on a bad disk causes subsequent containers failure 
> -
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch
>
>
> It happens when a resource is localised on the disk, after localising that 
> disk has gone bad. NM keeps paths for localised resources in memory.  At the 
> time of resource request isResourcePresent(rsrc) will be called which calls 
> file.exists() on the localised path.
> In some cases when disk has gone bad, inodes are stilled cached and 
> file.exists() returns true. But at the time of reading, file will not open.
> Note: file.exists() actually calls stat64 natively which returns true because 
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good it should return an array of 
> paths with length at-least 1.
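
For illustration, a minimal sketch of the check proposed in the quoted 
description (an assumed standalone helper, not the attached patch):
{code}
import java.io.File;

// Sketch: file.exists() can return true from cached inode data on a bad disk,
// so list() the parent directory, which performs a native open() and fails if
// the disk is really gone.
static boolean isResourcePresent(File localizedPath) {
  File parent = localizedPath.getParentFile();
  if (parent == null) {
    return false;
  }
  String[] children = parent.list();   // forces an open() on the directory
  return children != null && children.length >= 1;
}
{code}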



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568801#comment-14568801
 ] 

Hadoop QA commented on YARN-3755:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 45s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 38s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 33s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 36s | The applied patch generated  1 
new checkstyle issues (total was 58, now 58). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 13s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   6m  7s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  43m 31s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736781/YARN-3755-2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8165/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8165/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8165/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8165/console |


This message was automatically generated.

> Log the command of launching containers
> ---
>
> Key: YARN-3755
> URL: https://issues.apache.org/jira/browse/YARN-3755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.7.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: YARN-3755-1.patch, YARN-3755-2.patch
>
>
> In the resource manager log, yarn would log the command for launching AM, 
> this is very useful. But there's no such log in the NN log for launching 
> containers. It would be difficult to diagnose when containers fails to launch 
> due to some issue in the commands. Although user can look at the commands in 
> the container launch script file, this is an internal things of yarn, usually 
> user don't know that. In user's perspective, they only know what commands 
> they specify when building yarn application. 
> {code}
> 2015-06-01 16:06:42,245 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
> to launch container container_1433145984561_0001_01_01 : 
> $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
> -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA 
> -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
> 1>/stdout 2>/stderr
> {code}
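
A minimal sketch of the kind of NodeManager-side log line the description asks 
for (the class placement and variable names are assumptions):
{code}
import org.apache.commons.logging.Log;
import org.apache.hadoop.util.StringUtils;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

// Sketch: mirror on the NodeManager what the RM's AMLauncher already logs for
// the AM, so failed container launches can be diagnosed from the NM log.
static void logLaunchCommand(Log log, ContainerId containerId,
    ContainerLaunchContext ctx) {
  log.info("Command to launch container " + containerId + " : "
      + StringUtils.join(" ", ctx.getCommands()));
}
{code}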



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568797#comment-14568797
 ] 

Hadoop QA commented on YARN-3753:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 53s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 33s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 48s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 27s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m  6s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m  7s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736776/YARN-3753.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8164/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8164/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8164/console |


This message was automatically generated.

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch
>
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.ha

[jira] [Resolved] (YARN-3757) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

skrho resolved YARN-3757.
-
Resolution: Duplicate

> The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
> working in container
> 
>
> Key: YARN-3757
> URL: https://issues.apache.org/jira/browse/YARN-3757
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: Hadoop 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters
> First cluster is 5 node , default 1 application queue, 8G Physical memory 
> each node
> Second cluster is 10 node, 2 application queuey, 230G Physical memory each 
> node
> Wherever a mapreduce job is running, I want resourcemanager is to set the 
> minimum memory  256m to container
> So I was changing configuration in yarn-site.xml
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> In First cluster  whenever a mapreduce job is running , I can see used memory 
> 256m in web console( http://installedIP:8088/cluster/nodes )
> But In Second cluster whenever a mapreduce job is running , I can see used 
> memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 
> I know default memory value is 1024m, so if there is not changing memory 
> setting, the default value is working.
> I have been testing for two weeks, but I don't know why mimimum memory 
> setting is not working in second cluster
> Why this difference is happened? 
> Am I wrong setting configuration?
> or Is there bug?
> Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3756) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

skrho resolved YARN-3756.
-
Resolution: Duplicate

> The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
> working in container
> 
>
> Key: YARN-3756
> URL: https://issues.apache.org/jira/browse/YARN-3756
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: hadoop 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters
> First cluster is 5 node , default 1 application queue, 8G Physical memory 
> each node
> Second cluster is 10 node, 2 application queuey, 230G Physical memory each 
> node
> Wherever a mapreduce job is running, I want resourcemanager is to set the 
> minimum memory  256m to container
> So I was changing configuration in yarn-site.xml & mapred-site.xml
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> In First cluster  whenever a mapreduce job is running , I can see used memory 
> 256m in web console( http://installedIP:8088/cluster/nodes )
> But In Second cluster whenever a mapreduce job is running , I can see used 
> memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 
> I know default memory value is 1024m, so if there is not changing memory 
> setting, the default value is working.
> I have been testing for two weeks, but I don't know why mimimum memory 
> setting is not working in second cluster
> Why this difference is happened? 
> Am I wrong setting configuration?
> or Is there bug?
> Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568790#comment-14568790
 ] 

Naganarasimha G R commented on YARN-3758:
-

YARN-3756 and YARN-3757 are the same as this issue! Can you close them?

> The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
> working in container
> 
>
> Key: YARN-3758
> URL: https://issues.apache.org/jira/browse/YARN-3758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters
> First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G 
> Physical memory each node
> Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G 
> Physical memory each node
> Wherever a mapreduce job is running, I want resourcemanager is to set the 
> minimum memory  256m to container
> So I was changing configuration in yarn-site.xml & mapred-site.xml
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> In First cluster  whenever a mapreduce job is running , I can see used memory 
> 256m in web console( http://installedIP:8088/cluster/nodes )
> But In Second cluster whenever a mapreduce job is running , I can see used 
> memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 
> I know default memory value is 1024m, so if there is not changing memory 
> setting, the default value is working.
> I have been testing for two weeks, but I don't know why mimimum memory 
> setting is not working in second cluster
> Why this difference is happened? 
> Am I wrong setting configuration?
> or Is there bug?
> Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode

2015-06-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568789#comment-14568789
 ] 

Varun Saxena commented on YARN-2962:


I was waiting for input from [~vinodkv] and [~asuresh] so that we reach a 
common understanding on how to handle the backward-compatibility part.

In any case, I plan to upload a patch in the coming week implementing one of 
the approaches discussed.

> ZKRMStateStore: Limit the number of znodes under a znode
> 
>
> Key: YARN-2962
> URL: https://issues.apache.org/jira/browse/YARN-2962
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch
>
>
> We ran into this issue where we were hitting the default ZK server message 
> size configs, primarily because the message had too many znodes even though 
> they individually they were all small.
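
One generic way to bound the number of children per znode, shown purely for 
illustration (the bucket naming and count are not from this JIRA and may not 
match the approaches actually discussed here):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Sketch: spread application znodes across a fixed number of bucket znodes so
// that no single parent accumulates an unbounded child list.
static String appZNodePath(String appRoot, ApplicationId appId, int numBuckets) {
  int bucket = (appId.getId() & Integer.MAX_VALUE) % numBuckets;
  return appRoot + "/bucket-" + bucket + "/" + appId;
}
{code}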



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3706) Generalize native HBase writer for additional tables

2015-06-02 Thread Joep Rottinghuis (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joep Rottinghuis updated YARN-3706:
---
Attachment: YARN-3726-YARN-2928.004.patch

YARN-3726-YARN-2928.004.patch :
- Fixed a bug in cleanse (found thanks to the unit test).
- Fixed the value separator (was ! instead of ?).
- Added readResult and readResults to EntityColumnPrefix (still need to add the 
signature to the interface).
- Added an initial unit test for TimeLineWriterUtils.
- Added relationship checking to TestTimelineWriterImpl.

> Generalize native HBase writer for additional tables
> 
>
> Key: YARN-3706
> URL: https://issues.apache.org/jira/browse/YARN-3706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Joep Rottinghuis
>Assignee: Joep Rottinghuis
>Priority: Minor
> Attachments: YARN-3706-YARN-2928.001.patch, 
> YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, 
> YARN-3726-YARN-2928.004.patch
>
>
> When reviewing YARN-3411 we noticed that we could change the class hierarchy 
> a little in order to accommodate additional tables easily.
> In order to get ready for benchmark testing we left the original layout in 
> place, as performance would not be impacted by the code hierarchy.
> Here is a separate jira to address the hierarchy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

skrho updated YARN-3758:

Description: 
Hello there~~

I have 2 clusters


First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G 
Physical memory each node
Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G Physical 
memory each node

Wherever a mapreduce job is running, I want resourcemanager is to set the 
minimum memory  256m to container

So I was changing configuration in yarn-site.xml & mapred-site.xml

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 


In First cluster  whenever a mapreduce job is running , I can see used memory 
256m in web console( http://installedIP:8088/cluster/nodes )
But In Second cluster whenever a mapreduce job is running , I can see used 
memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 

I know default memory value is 1024m, so if there is not changing memory 
setting, the default value is working.

I have been testing for two weeks, but I don't know why mimimum memory setting 
is not working in second cluster

Why this difference is happened? 

Am I wrong setting configuration?
or Is there bug?

Thank you for reading~~

  was:
Hello there~~

I have 2 clusters


First cluster is 5 node , default 1 application queue, 8G Physical memory each 
node
Second cluster is 10 node, 2 application queuey, 230G Physical memory each node

Wherever a mapreduce job is running, I want resourcemanager is to set the 
minimum memory  256m to container

So I was changing configuration in yarn-site.xml & mapred-site.xml

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 


In First cluster  whenever a mapreduce job is running , I can see used memory 
256m in web console( http://installedIP:8088/cluster/nodes )
But In Second cluster whenever a mapreduce job is running , I can see used 
memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 

I know default memory value is 1024m, so if there is not changing memory 
setting, the default value is working.

I have been testing for two weeks, but I don't know why mimimum memory setting 
is not working in second cluster

Why this difference is happened? 

Am I wrong setting configuration?
or Is there bug?

Thank you for reading~~


> The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
> working in container
> 
>
> Key: YARN-3758
> URL: https://issues.apache.org/jira/browse/YARN-3758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters
> First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G 
> Physical memory each node
> Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G 
> Physical memory each node
> Wherever a mapreduce job is running, I want resourcemanager is to set the 
> minimum memory  256m to container
> So I was changing configuration in yarn-site.xml & mapred-site.xml
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> In First cluster  whenever a mapreduce job is running , I can see used memory 
> 256m in web console( http://installedIP:8088/cluster/nodes )
> But In Second cluster whenever a mapreduce job is running , I can see used 
> memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 
> I know default memory value is 1024m, so if there is not changing memory 
> setting, the default value is working.
> I have been testing for two weeks, but I don't know why mimimum memory 
> setting is not working in second cluster
> Why this difference is happened? 
> Am I wrong setting configuration?
> or Is there bug?
> Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated YARN-3755:
-
Attachment: YARN-3755-2.patch

Uploaded a new patch to address the checkstyle issue.

> Log the command of launching containers
> ---
>
> Key: YARN-3755
> URL: https://issues.apache.org/jira/browse/YARN-3755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.7.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: YARN-3755-1.patch, YARN-3755-2.patch
>
>
> In the resource manager log, yarn would log the command for launching AM, 
> this is very useful. But there's no such log in the NN log for launching 
> containers. It would be difficult to diagnose when containers fails to launch 
> due to some issue in the commands. Although user can look at the commands in 
> the container launch script file, this is an internal things of yarn, usually 
> user don't know that. In user's perspective, they only know what commands 
> they specify when building yarn application. 
> {code}
> 2015-06-01 16:06:42,245 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
> to launch container container_1433145984561_0001_01_01 : 
> $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
> -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA 
> -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
> 1>/stdout 2>/stderr
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568735#comment-14568735
 ] 

Hadoop QA commented on YARN-3749:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  20m  3s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 8 new or modified test files. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 42s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m 17s | The applied patch generated  1 
new checkstyle issues (total was 212, now 213). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   6m  5s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 22s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   6m 58s | Tests passed in 
hadoop-yarn-client. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests |  60m 34s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests |   1m 51s | Tests passed in 
hadoop-yarn-server-tests. |
| | | 121m  2s | |
\\
\\
|| Reason || Tests ||
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736753/YARN-3749.7.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-tests test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/console |


This message was automatically generated.

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
> YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddr

[jira] [Updated] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

skrho updated YARN-3758:

Description: 
Hello there~~

I have 2 clusters.

The first cluster has 5 nodes, 1 default application queue, and 8 GB of physical memory 
per node.
The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per 
node.

Whenever a mapreduce job runs, I want the resourcemanager to give each container a 
minimum memory of 256 MB.

So I changed the configuration in yarn-site.xml & mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 

On the first cluster, whenever a mapreduce job is running, I can see 256 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).
But on the second cluster, whenever a mapreduce job is running, I can see 1024 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, 
the default value takes effect.

I have been testing for two weeks, but I don't know why the minimum memory setting is not 
working on the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~

  was:
Hello there~~

I have 2 clusters.

The first cluster has 5 nodes, 1 default application queue, and 8 GB of physical memory 
per node.
The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per 
node.

Whenever a mapreduce job runs, I want the resourcemanager to give each container a 
minimum memory of 256 MB.

So I changed the configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 

On the first cluster, whenever a mapreduce job is running, I can see 256 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).
But on the second cluster, whenever a mapreduce job is running, I can see 1024 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, 
the default value takes effect.

I have been testing for two weeks, but I don't know why the minimum memory setting is not 
working on the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~


> The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not 
> working in container
> 
>
> Key: YARN-3758
> URL: https://issues.apache.org/jira/browse/YARN-3758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters.
> The first cluster has 5 nodes, 1 default application queue, and 8 GB of physical 
> memory per node.
> The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory 
> per node.
> Whenever a mapreduce job runs, I want the resourcemanager to give each container a 
> minimum memory of 256 MB.
> So I changed the configuration in yarn-site.xml & mapred-site.xml:
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> On the first cluster, whenever a mapreduce job is running, I can see 256 MB of used 
> memory in the web console ( http://installedIP:8088/cluster/nodes ).
> But on the second cluster, whenever a mapreduce job is running, I can see 1024 MB of 
> used memory in the web console ( http://installedIP:8088/cluster/nodes ).
> I know the default memory value is 1024 MB, so if the memory settings are not changed, 
> the default value takes effect.
> I have been testing for two weeks, but I don't know why the minimum memory setting is 
> not working on the second cluster.
> Why does this difference happen?
> Is my configuration wrong, or is there a bug?
> Thank you for reading~~
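One way to narrow this down is to read the effective value back from the Configuration on a 
node of the second cluster; a minimal sketch, assuming the stock YarnConfiguration constants 
(the class name here is illustrative):

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MinAllocCheckSketch {
  public static void main(String[] args) {
    // Loads yarn-default.xml and yarn-site.xml from the classpath, the same way the RM does.
    YarnConfiguration conf = new YarnConfiguration();

    int minAllocMb = conf.getInt(
        YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_MB,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_MINIMUM_ALLOCATION_MB); // default is 1024

    // If this prints 1024 on the second cluster, the RM there is not picking up the
    // yarn.scheduler.minimum-allocation-mb = 256 override (for example, a different
    // yarn-site.xml is on its classpath) rather than ignoring a value it has loaded.
    System.out.println("Effective yarn.scheduler.minimum-allocation-mb = " + minAllocMb);
  }
}
{code}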



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3758) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)
skrho created YARN-3758:
---

 Summary: The minimum memory 
setting (yarn.scheduler.minimum-allocation-mb) is not working in container
 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho


Hello there~~

I have 2 clusters.

The first cluster has 5 nodes, 1 default application queue, and 8 GB of physical memory 
per node.
The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per 
node.

Whenever a mapreduce job runs, I want the resourcemanager to give each container a 
minimum memory of 256 MB.

So I changed the configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 

On the first cluster, whenever a mapreduce job is running, I can see 256 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).
But on the second cluster, whenever a mapreduce job is running, I can see 1024 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, 
the default value takes effect.

I have been testing for two weeks, but I don't know why the minimum memory setting is not 
working on the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3757) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)
skrho created YARN-3757:
---

 Summary: The minimum memory 
setting (yarn.scheduler.minimum-allocation-mb) is not working in container
 Key: YARN-3757
 URL: https://issues.apache.org/jira/browse/YARN-3757
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
 Environment: Hadoop 2.4.0
Reporter: skrho


Hello there~~

I have 2 clusters.

The first cluster has 5 nodes, 1 default application queue, and 8 GB of physical memory 
per node.
The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per 
node.

Whenever a mapreduce job runs, I want the resourcemanager to give each container a 
minimum memory of 256 MB.

So I changed the configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 

On the first cluster, whenever a mapreduce job is running, I can see 256 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).
But on the second cluster, whenever a mapreduce job is running, I can see 1024 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, 
the default value takes effect.

I have been testing for two weeks, but I don't know why the minimum memory setting is not 
working on the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3756) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)
skrho created YARN-3756:
---

 Summary: The minimum memory 
setting (yarn.scheduler.minimum-allocation-mb) is not working in container
 Key: YARN-3756
 URL: https://issues.apache.org/jira/browse/YARN-3756
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
 Environment: hadoop 2.4.0
Reporter: skrho


Hello there~~

I have 2 clusters.

The first cluster has 5 nodes, 1 default application queue, and 8 GB of physical memory 
per node.
The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per 
node.

Whenever a mapreduce job runs, I want the resourcemanager to give each container a 
minimum memory of 256 MB.

So I changed the configuration in yarn-site.xml & mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 

On the first cluster, whenever a mapreduce job is running, I can see 256 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).

But on the second cluster, whenever a mapreduce job is running, I can see 1024 MB of used 
memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, 
the default value takes effect.

I have been testing for two weeks, but I don't know why the minimum memory setting is not 
working on the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode

2015-06-02 Thread Jun Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568700#comment-14568700
 ] 

Jun Xu commented on YARN-2962:
--

We suffered from this problem too. It seems this issue has been open for nearly half a 
year; is there any new progress, guys?

> ZKRMStateStore: Limit the number of znodes under a znode
> 
>
> Key: YARN-2962
> URL: https://issues.apache.org/jira/browse/YARN-2962
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch
>
>
> We ran into this issue where we were hitting the default ZK server message 
> size configs, primarily because the message had too many znodes even though 
> individually they were all small.
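A minimal sketch of one way to cap the fan-out under a single parent znode, assuming 
application znodes are bucketed by the trailing digits of the app id; an illustration of 
the idea only, not the attached patches:

{code}
public class ZnodeBucketSketch {

  // Derive a small, fixed set of parent buckets from the last digits of the app id so that
  // no single parent znode accumulates an unbounded number of children.
  static String bucketedPath(String rootPath, String appId) {
    String bucket = appId.substring(appId.length() - 2); // e.g. "06" -> at most 100 buckets
    return rootPath + "/" + bucket + "/" + appId;
  }

  public static void main(String[] args) {
    String root = "/rmstore/ZKRMStateRoot/RMAppRoot";
    System.out.println(bucketedPath(root, "application_1432956515242_0006"));
    // -> /rmstore/ZKRMStateRoot/RMAppRoot/06/application_1432956515242_0006
  }
}
{code}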



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-02 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3753:
--
Attachment: YARN-3753.2.patch

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch
>
>
> RM failed to come up with the following error while submitting a mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMa
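The trace above comes from the state store's bounded wait for a live ZooKeeper client; a 
minimal sketch of that wait-with-timeout pattern, with illustrative names and timeout 
rather than the actual ZKRMStateStore code:

{code}
import java.io.IOException;

public class ZkClientWaitSketch {
  private Object zkClient;                 // set by a connection watcher when ZK connects
  private final Object lock = new Object();

  // Wait a bounded time for the ZK client to be (re)created; if it never appears, give up
  // with the same message seen in the log, which the RM then treats as a fatal store error.
  Object waitForZkClient(long timeoutMs) throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    synchronized (lock) {
      while (zkClient == null) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          throw new IOException("Wait for ZKClient creation timed out");
        }
        lock.wait(remaining);
      }
      return zkClient;
    }
  }
}
{code}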

[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568682#comment-14568682
 ] 

Rohith commented on YARN-3733:
--

Updated the summary as per the defect.

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent tasks
> 5. Switch RM
> Actual
> =
> For 12 jobs the AM gets allocated and all 12 start running
> No other YARN child is initiated; *all 12 jobs stay in the Running state forever*
> Expected
> ===
> Only 6 should be running at a time since the max AM limit is .5 (3072 MB)
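DominantResourceCalculator compares requests by their dominant share of the cluster 
resource; a minimal sketch of how that comparison degenerates when the cluster resource is 
empty (illustrative only, not the exact calculator code):

{code}
public class DominantShareSketch {

  // Dominant share of a request, as a fraction of the cluster's total resources.
  static float dominantShare(int memMb, int vcores, int clusterMemMb, int clusterVcores) {
    return Math.max((float) memMb / clusterMemMb, (float) vcores / clusterVcores);
  }

  public static void main(String[] args) {
    // With an empty cluster resource (e.g. right after failover, before NMs re-register),
    // every non-zero request degenerates to Infinity, so the comparison sees them all as
    // equal and limits derived from those shares, such as the AM limit, lose their effect.
    float a = dominantShare(512, 1, 0, 0);
    float b = dominantShare(3072, 2, 0, 0);
    System.out.println(a + " vs " + b + " -> " + Float.compare(a, b)); // Infinity vs Infinity -> 0
  }
}
{code}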



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)