[jira] [Commented] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106555#comment-14106555 ] Hong Zhiguo commented on YARN-1801: --- I think YARN-1575 already fixed this NPE. We could mark it as a duplicate. NPE in public localizer --- Key: YARN-1801 URL: https://issues.apache.org/jira/browse/YARN-1801 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Hong Zhiguo Priority: Critical Attachments: YARN-1801.patch While investigating YARN-1800 I found this in the NM logs, which caused the public localizer to shut down: {noformat} 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null } 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
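For context, one common way a completion loop like PublicLocalizer.run throws an NPE is dereferencing the result of a pending-map lookup without a null check. The sketch below illustrates that pattern generically; it is not a claim about the actual root cause or the YARN-1575 change, and all names in it are hypothetical.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;

// Generic illustration only: a completion loop that looks up bookkeeping for
// a finished download must not dereference the lookup result blindly.
public class PendingDownloads<R> {
  private final Map<Future<?>, R> pending = new ConcurrentHashMap<>();

  public void register(Future<?> download, R resource) {
    pending.put(download, resource);
  }

  public void onCompleted(Future<?> download) {
    R resource = pending.remove(download);
    if (resource == null) {
      // Without a guard like this, an entry removed or never recorded by a
      // racing path yields null, and the dereference below would throw the
      // NullPointerException that shuts the thread down.
      return;
    }
    // ... notify trackers that 'resource' finished localizing ...
  }
}
{code}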
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106583#comment-14106583 ] mai shurong commented on YARN-1458: --- George Wong, you can try our YARN-1458.patch; it is easy to understand, but the issue is still unresolved. You can consult the corresponding code in later Hadoop versions such as 2.2.1, 2.3.x, 2.4.x. zhihai xu, it seems your thinking is more rigorous than our patch. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at 
java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
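The failure mode is easiest to see in a simplified model of the share computation. The sketch below is hypothetical (names and shapes are mine, not the ComputeFairShares code): with size-based weight, idle applications can end up with weight 0, and if every weight is 0 the ratio search can never cover the queue's resources, so the loop spins while the update thread holds the scheduler lock. The patches attached here aim to handle that zero-weight case (see the testFairShareWithZeroWeight comment further down).
{code}
// Simplified, hypothetical model of the fair-share ratio search (names and
// shapes are mine, not the ComputeFairShares code).
public final class FairShareSearch {

  // Resource that would be consumed if every schedulable got weight * ratio,
  // capped at its maximum share.
  static long resourceUsedWithRatio(double ratio, double[] weights, long cap) {
    long total = 0;
    for (double w : weights) {
      total += Math.min((long) (w * ratio), cap);
    }
    return total;
  }

  static double findRatio(double[] weights, long totalResource, long cap) {
    double weightSum = 0;
    for (double w : weights) {
      weightSum += w;
    }
    if (weightSum <= 0) {
      // One possible guard: with size-based weight, idle apps can all end up
      // with weight 0; without this early exit the loop below never
      // terminates because the weighted usage stays at 0 forever, and the
      // caller (the update thread) spins while holding the scheduler lock.
      return 0;
    }
    double ratio = 1.0;
    while (resourceUsedWithRatio(ratio, weights, cap) < totalResource) {
      ratio *= 2.0; // grow until the weighted shares cover the resource
    }
    // ... binary search between ratio / 2 and ratio for the exact value ...
    return ratio;
  }
}
{code}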
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.002.patch In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2439) Move winutils task related functionality under a yarn-servers-nodemanager project
Remus Rusanu created YARN-2439: -- Summary: Move winutils task related functionality under a yarn-servers-nodemanager project Key: YARN-2439 URL: https://issues.apache.org/jira/browse/YARN-2439 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Remus Rusanu Priority: Minor Currently winutils is built as part of hadoop-common. But winutils has features that relate strictly to the nodemanager, namely `winutils task`. Being built under hadoop-common means that any mvn/pom compile configuration has to be done in the hadoop-common project. For example, I wanted to add a configuration file similar to the container-executor cfg, which gets the .cfg location from the ${container-executor.conf.dir} in its parent pom. But for winutils I would have to add the config to the hadoop-common pom, despite it being specific to nodemanager use. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106605#comment-14106605 ] Remus Rusanu commented on YARN-2198: I have created YARN-2439 to track the separation of winutils task functionality into a nodemanager-related project, away from hadoop-common. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.separation.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM running as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low-privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC, etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106606#comment-14106606 ] zhihai xu commented on YARN-1458: - I added a test case testFairShareWithZeroWeight in new patch YARN-1458.002.patch to verify the patch can work with zero weight. Without the patch, testFairShareWithZeroWeight will run forever. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by 
Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106630#comment-14106630 ] Hao Gao commented on YARN-2345: --- yarn node -status nodeid will list the status of a single node. I reused the code there to get the status of all nodes. {code:xml}
Nodes Report :
Node-Id : 192.168.1.6:53239
Rack : /default-rack
Node-State : RUNNING
Node-Http-Address : 192.168.1.6:8042
Last-Health-Update : Fri 22/Aug/14 12:53:38:312PDT
Health-Report :
Containers : 0
Memory-Used : 0MB
Memory-Capacity : 8192MB
CPU-Used : 0 vcores
CPU-Capacity : 8 vcores
{code} Do we need more information? Also, do we need options like -live / -dead? yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
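A minimal client-side sketch of how the same per-node information can be gathered, assuming the Hadoop 2.x YarnClient API; the class name and the output layout are mine, not the attached patch:
{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodesReport {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      // No filter lists every node; pass NodeState values to narrow it down.
      List<NodeReport> nodes = client.getNodeReports(NodeState.RUNNING);
      for (NodeReport node : nodes) {
        System.out.println("Node-Id : " + node.getNodeId());
        System.out.println("  Node-State : " + node.getNodeState());
        System.out.println("  Containers : " + node.getNumContainers());
        Resource used = node.getUsed();            // may be null if unreported
        Resource capability = node.getCapability();
        if (used != null && capability != null) {
          System.out.println("  Memory-Used : " + used.getMemory() + "MB");
          System.out.println("  Memory-Capacity : " + capability.getMemory() + "MB");
          System.out.println("  CPU-Used : " + used.getVirtualCores() + " vcores");
          System.out.println("  CPU-Capacity : " + capability.getVirtualCores() + " vcores");
        }
      }
    } finally {
      client.stop();
    }
  }
}
{code}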
[jira] [Updated] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Gao updated YARN-2345: -- Attachment: YARN-2345.1.patch Attached the patch. yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2437) [post-HADOOP-9902] start-yarn.sh/stop-yarn needs to give info
[ https://issues.apache.org/jira/browse/YARN-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Gao updated YARN-2437: -- Assignee: Hao Gao [post-HADOOP-9902] start-yarn.sh/stop-yarn needs to give info - Key: YARN-2437 URL: https://issues.apache.org/jira/browse/YARN-2437 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie With the merger and cleanup of the daemon launch code, yarn-daemons.sh no longer prints Starting information. This should be made more of an analog of start-dfs.sh/stop-dfs.sh. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
Varun Vasudev created YARN-2440: --- Summary: Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106669#comment-14106669 ] Wangda Tan commented on YARN-2345: -- Hi Hao, I think we already have a NodeCLI, which is yarn node -status nodeid as you said. We don't need to add such a method to the RM admin CLI. The RM admin CLI should only implement methods contained in ResourceManagerAdministrationProtocol. I would suggest adding more information when executing yarn node -all -list, like memory-used, CPU-used, etc., just like the RM web UI's nodes page. Thanks, Wangda yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2440: Attachment: screenshot-current-implementation.jpg Screenshot with the CPU usage in the current implementation. In my yarn-site.xml, I had set yarn.nodemanager.resource.cpu-vcores to 2. The python script is taking up as many cores as it can. The quota for the yarn group was set to -1. varun@ubuntu:/var/hadoop/hadoop-3.0.0-SNAPSHOT$ cat /cgroup/cpu/yarn/cpu.cfs_quota_us -1 Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2440: Attachment: apache-yarn-2440.0.patch Attached patch with fix. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106671#comment-14106671 ] Varun Vasudev commented on YARN-2440: - After applying the patch, the quota is set correctly. {noformat} varun@ubuntu:/var/hadoop/hadoop-3.0.0-SNAPSHOT$ cat /cgroup/cpu/yarn/cpu.cfs_quota_us 20 {noformat} Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
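For context, a rough sketch of how a ceiling like this could be derived. This is my own simplification under stated assumptions, not the attached apache-yarn-2440.0.patch: cpu.cfs_quota_us grants the yarn cgroup a slice of CPU time per cpu.cfs_period_us, so confining YARN to its configured vcores amounts to writing roughly period × vcores, or -1 when no cap is needed.
{code}
// Hypothetical helper (not the attached patch): derive a cpu.cfs_quota_us
// value that confines the "yarn" cgroup to the vcores granted to YARN.
public final class CfsLimits {

  /**
   * @param yarnCores cores YARN may use, e.g. the value of
   *                  yarn.nodemanager.resource.cpu-vcores capped at the
   *                  number of physical cores (assumption for illustration)
   * @param nodeCores physical cores on the node
   * @param periodUs  value written to cpu.cfs_period_us
   * @return quota in microseconds, or -1 ("unlimited") when no cap is needed
   */
  static long quotaFor(int yarnCores, int nodeCores, long periodUs) {
    if (yarnCores >= nodeCores) {
      return -1L; // YARN may use every core; leave the cgroup uncapped
    }
    // Grant yarnCores CPUs worth of runtime in every scheduling period.
    return periodUs * yarnCores;
  }
}
{code}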
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106687#comment-14106687 ] Varun Vasudev commented on YARN-810: [~sandyr] [~ywskycn] are you still working on this? If not, I'd like to pick it up. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... 
{noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a Python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit.
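The manual test above boils down to a couple of file writes. The sketch below is illustrative only; the cgroup path and numbers are placeholders, not a proposed YARN API:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Rough Java equivalent of the manual "echo ... > cpu.cfs_quota_us" test
// above; the cgroup path and numbers are placeholders, not a proposed API.
public class HardCpuCap {

  public static void capContainer(String cgroupDir, long quotaUs, long periodUs)
      throws IOException {
    Path dir = Paths.get(cgroupDir);
    // The ceiling is quotaUs of CPU time per periodUs across the group.
    Files.write(dir.resolve("cpu.cfs_period_us"),
        Long.toString(periodUs).getBytes(StandardCharsets.UTF_8));
    Files.write(dir.resolve("cpu.cfs_quota_us"),
        Long.toString(quotaUs).getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    // e.g. cap a container at 10% of one core (10,000 us every 100,000 us)
    capContainer("/cgroup/cpu/hadoop-yarn/container_example", 10_000L, 100_000L);
  }
}
{code}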
[jira] [Commented] (YARN-2436) [post-HADOOP-9902] yarn application help doesn't work
[ https://issues.apache.org/jira/browse/YARN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106720#comment-14106720 ] Hudson commented on YARN-2436: -- FAILURE: Integrated in Hadoop-Yarn-trunk #654 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/654/]) YARN-2436. [post-HADOOP-9902] yarn application help doesn't work (aw: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1619603) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn [post-HADOOP-9902] yarn application help doesn't work - Key: YARN-2436 URL: https://issues.apache.org/jira/browse/YARN-2436 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Allen Wittenauer Labels: newbie Fix For: 3.0.0 Attachments: YARN-2436.patch The previous version of the yarn command plays games with the command stack for some commands. The new code needs to duplicate this wackiness. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106717#comment-14106717 ] Hudson commented on YARN-2434: -- FAILURE: Integrated in Hadoop-Yarn-trunk #654 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/654/]) YARN-2434. RM should not recover containers from previously failed attempt when AM restart is not enabled. Contributed by Jian He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619614) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java RM should not recover containers from previously failed attempt when AM restart is not enabled -- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Fix For: 3.0.0, 2.6.0 Attachments: YARN-2434.1.patch If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2441) NPE in nodemanager after restart
Nishan Shetty created YARN-2441: --- Summary: NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty Priority: Minor {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket 
Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2442) ResourceManager JMX UI does not give HA State
Nishan Shetty created YARN-2442: --- Summary: ResourceManager JMX UI does not give HA State Key: YARN-2442 URL: https://issues.apache.org/jira/browse/YARN-2442 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty Priority: Trivial ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, STOPPED) -- This message was sent by Atlassian JIRA (v6.2#6252)
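A minimal sketch of what exposing the HA state over JMX could look like, using only the standard platform MBean server; the bean name, attribute, and wiring are assumptions for illustration, not the eventual YARN change.
{code}
import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

// Illustrative only: a tiny MXBean whose single attribute mirrors the RM's
// current HA state (e.g. INITIALIZING, ACTIVE, STANDBY, STOPPED).
public class HAStateJmx {

  public interface RMHAStateMXBean {
    String getHAState();
  }

  public static class RMHAState implements RMHAStateMXBean {
    private volatile String state = "INITIALIZING";

    public void setState(String newState) {
      state = newState; // the RM would call this on every HA transition
    }

    @Override
    public String getHAState() {
      return state;
    }
  }

  public static RMHAState register() throws Exception {
    RMHAState bean = new RMHAState();
    // The ObjectName below is a made-up example, not an existing RM bean.
    ManagementFactory.getPlatformMBeanServer().registerMBean(bean,
        new ObjectName("Hadoop:service=ResourceManager,name=RMHAState"));
    return bean;
  }
}
{code}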
[jira] [Comment Edited] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106816#comment-14106816 ] Allen Wittenauer edited comment on YARN-2345 at 8/22/14 1:26 PM: - [~wangda], this is to bring consistency between HDFS and YARN.hdfs dfsadmin -report has existed for a very long time while YARN doesn't have one. From a user perspective, it's irrelevant what is happening on the inside, just that YARN is weird if the equivalent is yarn node -all -list. was (Author: aw): [~wangda], this is to bring consistency between HDFS and YARN.hdfs dfsadmin -report has existed for a very long time while the RM doesn't have one. From a user perspective, it's irrelevant what is happening on the inside, just that YARN is weird if the equivalent is yarn node -all -list. yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106816#comment-14106816 ] Allen Wittenauer commented on YARN-2345: [~wangda], this is to bring consistency between HDFS and YARN. hdfs dfsadmin -report has existed for a very long time while the RM doesn't have one. From a user perspective, it's irrelevant what is happening on the inside; YARN just looks odd if the equivalent is yarn node -all -list. yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106816#comment-14106816 ] Allen Wittenauer edited comment on YARN-2345 at 8/22/14 1:28 PM: - [~leftnoteasy]], this is to bring consistency between HDFS and YARN.hdfs dfsadmin -report has existed for a very long time while YARN doesn't have one. From a user perspective, it's irrelevant what is happening on the inside, just that YARN is weird if the equivalent is yarn node -all -list. was (Author: aw): [~wangda], this is to bring consistency between HDFS and YARN.hdfs dfsadmin -report has existed for a very long time while YARN doesn't have one. From a user perspective, it's irrelevant what is happening on the inside, just that YARN is weird if the equivalent is yarn node -all -list. yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2436) [post-HADOOP-9902] yarn application help doesn't work
[ https://issues.apache.org/jira/browse/YARN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106863#comment-14106863 ] Hudson commented on YARN-2436: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1845 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1845/]) YARN-2436. [post-HADOOP-9902] yarn application help doesn't work (aw: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619603) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn [post-HADOOP-9902] yarn application help doesn't work - Key: YARN-2436 URL: https://issues.apache.org/jira/browse/YARN-2436 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Allen Wittenauer Labels: newbie Fix For: 3.0.0 Attachments: YARN-2436.patch The previous version of the yarn command plays games with the command stack for some commands. The new code needs duplicate this wackiness. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106860#comment-14106860 ] Hudson commented on YARN-2434: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1845 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1845/]) YARN-2434. RM should not recover containers from previously failed attempt when AM restart is not enabled. Contributed by Jian He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619614) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java RM should not recover containers from previously failed attempt when AM restart is not enabled -- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Fix For: 3.0.0, 2.6.0 Attachments: YARN-2434.1.patch If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106889#comment-14106889 ] Nathan Roberts commented on YARN-2440: -- Thanks Varun for the patch. I'm wondering if it would be possible to make this configurable at the system level and per-app. For example, I'd like an application to be able to specify that it wants to run with strict container limits (to verify SLA's for example), but in general I don't want these limits in place (why not let a container use additional CPU if it's available?). Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106893#comment-14106893 ] Varun Vasudev commented on YARN-2440: - [~nroberts] there's already a ticket for your request - YARN-810. That's next on my todo list. I've left a comment there asking if I can take it over. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2441) NPE in nodemanager after restart
[ https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106902#comment-14106902 ] Jason Lowe commented on YARN-2441: -- Was this truly running trunk as the Affected Versions field indicates or was this some other version of Hadoop? Also was this a work-preserving NM restart scenario (i.e.: yarn.nodemanager.recovery.enabled=true) or a typical NM startup? NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty Priority: Minor {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at 
org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106932#comment-14106932 ] Hudson commented on YARN-2434: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1871 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1871/]) YARN-2434. RM should not recover containers from previously failed attempt when AM restart is not enabled. Contributed by Jian He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619614) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java RM should not recover containers from previously failed attempt when AM restart is not enabled -- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Fix For: 3.0.0, 2.6.0 Attachments: YARN-2434.1.patch If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2436) [post-HADOOP-9902] yarn application help doesn't work
[ https://issues.apache.org/jira/browse/YARN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106935#comment-14106935 ] Hudson commented on YARN-2436: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1871 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1871/]) YARN-2436. [post-HADOOP-9902] yarn application help doesn't work (aw: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619603) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn [post-HADOOP-9902] yarn application help doesn't work - Key: YARN-2436 URL: https://issues.apache.org/jira/browse/YARN-2436 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Allen Wittenauer Labels: newbie Fix For: 3.0.0 Attachments: YARN-2436.patch The previous version of the yarn command plays games with the command stack for some commands. The new code needs duplicate this wackiness. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2393) FairScheduler: Implement steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2393: --- Summary: FairScheduler: Implement steady fair share (was: Fair Scheduler : Implement steady fair share) FairScheduler: Implement steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty updated YARN-2442: Affects Version/s: (was: 3.0.0) 2.5.0 ResourceManager JMX UI does not give HA State - Key: YARN-2442 URL: https://issues.apache.org/jira/browse/YARN-2442 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty Priority: Trivial ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, STOPPED) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2441) NPE in nodemanager after restart
[ https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106987#comment-14106987 ] Nishan Shetty commented on YARN-2441: - [~jlowe] Sorry i mentioned the wrong Affected Version. Its branch 2. Work-preserving NM is not enabled, its just plain restart NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty Priority: Minor {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at 
org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2393) FairScheduler: Implement steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106990#comment-14106990 ] Karthik Kambatla commented on YARN-2393: One of the reasons we (Sandy and I) wanted to make the fairshare being used for scheduling instantaneous was to address the case where the maxAMResource becomes so small when there are multiple queues that we can't run any applications at all. I think it is better to leave it as is. In case any one runs into (in testing) issues with maxAMResource, we can consider preempting AMs as an alternative. FairScheduler: Implement steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2393) FairScheduler: Implement steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106991#comment-14106991 ] Karthik Kambatla commented on YARN-2393: Committing this. FairScheduler: Implement steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2393) FairScheduler: Add the notion of steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2393: --- Summary: FairScheduler: Add the notion of steady fair share (was: FairScheduler: Implement steady fair share) FairScheduler: Add the notion of steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2393) FairScheduler: Add the notion of steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2393: --- Issue Type: New Feature (was: Improvement) FairScheduler: Add the notion of steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106995#comment-14106995 ] Varun Vasudev commented on YARN-160: [~djp] {quote} Both physical id and core id are not guaranteed to have in /proc/cpuinfo (please see below for my local VM's info). We may use processor number instead in case these ids are 0 (like we did in Windows). Again, this weak my confidence that this automatic way of getting CPU/memory resources should happen by default (not sure if any cross-platform issues). May be a safer way here is to keep previous default behavior (with some static setting) with an extra config to enable this. We can wait this feature to be more stable later to change the default behavior. {noformat} processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 70 model name : Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz stepping: 1 cpu MHz : 2295.265 cache size : 6144 KB fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc up arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi ept vpid fsgsbase smep bogomips: 4590.53 clflush size: 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: {noformat} {quote} In the example you gave, where we have processors listed but no physical id or core id entries, the numProcessors will be set to the number of entries and numCores will be set to 1. From the diff - {noformat} + numCores = 1; {noformat} There is also a test case to ensure this behaviour. In addition, cluster administrators can decide whether the NodeManager should report numProcessors or numCores by toggling yarn.nodemanager.resource.count-logical-processors-as-vcores which by default is true. In the vm example, by default the NodeManager will report vcores as the number of processor entries in /proc/cpuinfo. If yarn.nodemanager.resource.count-logical-processors-as-vcores is set to false, the NodeManager will report vcores as 1(if there are no physical id or core id entries). nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch, apache-yarn-160.1.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
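To make the fallback concrete, here is a minimal sketch of the counting logic Varun describes: count the processor entries, count distinct (physical id, core id) pairs, fall back to a single core when those fields are absent, and pick the vcore count based on the count-logical-processors-as-vcores switch. This is an illustration only, not the code from the apache-yarn-160 patches; the class, field, and method names are invented.
{code}
// Illustration only -- not the actual YARN-160 patch. Names are invented.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class CpuInfoSketch {
  int numProcessors = 0;                      // "processor" entries seen
  Set<String> physicalCores = new HashSet<String>();

  void parse(String path) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(path));
    String line, physicalId = "0";
    while ((line = in.readLine()) != null) {
      if (line.startsWith("processor")) {
        numProcessors++;
      } else if (line.startsWith("physical id")) {
        physicalId = line.substring(line.indexOf(':') + 1).trim();
      } else if (line.startsWith("core id")) {
        physicalCores.add(physicalId + "/"
            + line.substring(line.indexOf(':') + 1).trim());
      }
    }
    in.close();
  }

  // The VM /proc/cpuinfo quoted above has no physical id / core id lines,
  // so the core count falls back to 1, as described in the comment.
  int numCores() {
    return physicalCores.isEmpty() ? 1 : physicalCores.size();
  }

  // Mirrors yarn.nodemanager.resource.count-logical-processors-as-vcores:
  // report logical processors when true, physical cores when false.
  int vcores(boolean countLogicalProcessorsAsVcores) {
    return countLogicalProcessorsAsVcores ? numProcessors : numCores();
  }
}
{code}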
[jira] [Commented] (YARN-2441) NPE in nodemanager after restart
[ https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106999#comment-14106999 ] Jason Lowe commented on YARN-2441: -- Ah, then this seems like a case where a client (likely an AM) is connecting to the NM before the NM has finished registering with the RM to get the secret keys. Trying to block new container requests at the app level probably isn't going to work in practice because the SASL layer in RPC doesn't let the connection get to the point where the app can try to reject the request. IMHO we should remove the blocking client requests code and instead do a delayed server start, sorta like the delay added by YARN-1337 when NM recovery is enabled. Ideally the RPC layer would support the ability to bind to a server socket but not start accepting requests until later. That would allow us to register with the RM knowing what our client port is but without trying to let clients through that port until we're really ready. Shorter term fix might be to have the secret manager throw an exception that can be retried by clients if the master key isn't set yet. NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty Priority: Minor {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at 
org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22
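A rough sketch of the shorter-term idea Jason mentions above (surface an error the client can retry instead of hitting the NPE), written as a fragment inside NMTokenSecretManagerInNM. This only illustrates the approach, not the committed fix; the currentMasterKey field name and the exact throws clause are assumptions.
{code}
// Sketch only: "throw a retriable exception if the master key isn't set yet".
// Field name and throws clause are assumed for illustration.
@Override
public synchronized byte[] retriableRetrievePassword(NMTokenIdentifier identifier)
    throws InvalidToken, StandbyException, RetriableException, IOException {
  if (currentMasterKey == null) {
    // Before the NM registers with the RM there is no master key; failing
    // with a retriable error lets the AM's RPC client back off and retry
    // instead of the reader thread logging an NPE.
    throw new RetriableException(
        "NMToken master key not yet received from the RM, please retry");
  }
  return retrievePassword(identifier);
}
{code}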
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107000#comment-14107000 ] Wei Yan commented on YARN-810: -- [~vvasudev], thanks for the offer. I'm still working on this. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... 
{noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit. Implementation: What do you guys think about
[jira] [Commented] (YARN-2393) FairScheduler: Add the notion of steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107007#comment-14107007 ] Hudson commented on YARN-2393: -- FAILURE: Integrated in Hadoop-trunk-Commit #6097 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6097/]) YARN-2393. FairScheduler: Add the notion of steady fair share. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619845) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSParentQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueueMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/ComputeFairShares.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FakeSchedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerFairShare.java FairScheduler: Add the notion of steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107005#comment-14107005 ] Wei Yan commented on YARN-2440: --- [~vvasudev], for general cases, we shouldn't strictly limit the cfs_quota_us. We always want to let co-located containers share the cpu resource in a proportional way rather than strictly following the container_vcores/NM_vcores ratio. We have one runnable patch in YARN-810. I'll check with Sandy about the review. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107012#comment-14107012 ] Varun Vasudev commented on YARN-2440: - [~ywskycn] this patch doesn't limit containers to the container_vcores/NM_vcores ratio. What it does do is limit the overall YARN usage to yarn.nodemanager.resource.cpu-vcores. If you have 4 cores on a machine and set yarn.nodemanager.resource.cpu-vcores to 2, we currently don't restrict the YARN containers to 2 cores. The containers can create threads and use up as many cores as they want, which defeats the purpose of setting yarn.nodemanager.resource.cpu-vcores. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2431) NM restart: cgroup is not removed for reacquired containers
[ https://issues.apache.org/jira/browse/YARN-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107008#comment-14107008 ] Jason Lowe commented on YARN-2431: -- Release audit problems are unrelated, see HDFS-6905. NM restart: cgroup is not removed for reacquired containers --- Key: YARN-2431 URL: https://issues.apache.org/jira/browse/YARN-2431 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2431.patch The cgroup for a reacquired container is not being removed when the container exits. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107025#comment-14107025 ] Wei Yan commented on YARN-2440: --- [~vvasudev], I misunderstood this jira. Will post comment later. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2393) FairScheduler: Add the notion of steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107050#comment-14107050 ] Wei Yan commented on YARN-2393: --- Thanks, [~kasha], [~ashwinshankar77]. Will post a patch for the YARN-2360 for the UI. FairScheduler: Add the notion of steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2441) NPE in nodemanager after restart
[ https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty updated YARN-2441: Priority: Major (was: Minor) NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at 
org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107057#comment-14107057 ] Jason Lowe commented on YARN-2440: -- I think cfs_quota_us has a maximum value of 1000000, so we may have an issue if vcores > 10. I don't see how this takes into account the mapping of vcores to actual CPUs. It's not safe to assume 1 vcore == 1 physical CPU, as some sites will map multiple vcores to a physical core to allow fractions of a physical CPU to be allocated or to account for varying CPU performance across a heterogeneous cluster. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
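Taking the numbers in the comment above at face value (a 1,000,000 us quota ceiling and the default 100,000 us period), here is a back-of-the-envelope sketch of why a naive quota = vcores * period mapping breaks past 10 vcores, and how a vcore-to-physical-core ratio changes the result. The values are hypothetical, not taken from any attached patch.
{code}
// Back-of-the-envelope sketch, not YARN code. Constants taken from the
// discussion above; the vcore:pcore ratio is a hypothetical example.
public class CfsQuotaSketch {
  static final long MAX_QUOTA_US = 1000000L;  // ceiling mentioned above
  static final long PERIOD_US = 100000L;      // default cfs_period_us

  static long naiveQuota(int vcores) {
    // quota = vcores * period: hits the ceiling once vcores > 10
    return Math.min(vcores * PERIOD_US, MAX_QUOTA_US);
  }

  public static void main(String[] args) {
    int vcores = 16;                          // yarn.nodemanager.resource.cpu-vcores
    int physicalCores = 8;                    // what the node actually has
    double vcoreToPcore = (double) physicalCores / vcores; // 0.5 pcores per vcore

    System.out.println(naiveQuota(vcores));   // 1000000 -> silently capped at 10 CPUs
    System.out.println((long) (vcores * vcoreToPcore * PERIOD_US)); // 800000 -> 8 CPUs
  }
}
{code}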
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107066#comment-14107066 ] zhihai xu commented on YARN-1458: - [~shurong.mai], YARN-1458.patch will cause a regression. It won't work if all the weights and MinShares in the active queues are less than 1, because the type conversion from double to int in computeShare loses precision. {code} private static int computeShare(Schedulable sched, double w2rRatio, ResourceType type) { double share = sched.getWeights().getWeight(type) * w2rRatio; share = Math.max(share, getResourceValue(sched.getMinShare(), type)); share = Math.min(share, getResourceValue(sched.getMaxShare(), type)); return (int) share; } {code} In the above code, the initial value of w2rRatio is 1.0. If the weight and MinShare are less than 1, computeShare will return 0. resourceUsedWithWeightToResourceRatio returns the sum of these computeShare return values (after the precision loss), so it will be zero if all the weights and MinShares in the active queues are less than 1. YARN-1458.patch will then exit the loop early with an rMax value of 1.0, the right variable will be less than rMax (1.0), and all queues' fair shares will be set to 0 in the following code. {code} for (Schedulable sched : schedulables) { setResourceValue(computeShare(sched, right, type), sched.getFairShare(), type); } {code} This is why TestFairScheduler fails at line 1049. testIsStarvedForFairShare configures queueA with weight 0.25, queueB with weight 0.75, and a total node resource of 4 * 1024. It creates two applications: one is assigned to queueA and the other to queueB. After FairScheduler.update() calculates the fair shares, queueA's fair share should be 1 * 1024 and queueB's fair share should be 3 * 1024, but with YARN-1458.patch both queueA's and queueB's fair shares are set to 0. This is because the test has two active queues, queueA and queueB, both weights are less than 1 (0.25 and 0.75), and MinShare (minResources) is not configured for either queue, so both use the default value (0). In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it.
The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at
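A standalone toy version of the truncation zhihai describes, using the weights from testIsStarvedForFairShare (0.25 and 0.75, MinShare 0, initial w2rRatio 1.0). This is a simplified stand-in for ComputeFairShares, only meant to show why the sum comes out as zero.
{code}
// Toy stand-in for the computeShare() snippet quoted above; not the real
// ComputeFairShares class.
public class ShareTruncationDemo {
  static int computeShare(double weight, double minShare, double w2rRatio) {
    double share = Math.max(weight * w2rRatio, minShare);
    return (int) share;                 // (int) 0.25 == 0, (int) 0.75 == 0
  }

  public static void main(String[] args) {
    // queueA weight 0.25, queueB weight 0.75, default MinShare 0, w2rRatio 1.0
    int used = computeShare(0.25, 0, 1.0) + computeShare(0.75, 0, 1.0);
    System.out.println(used);           // 0 -> the patched loop exits early and
                                        // both queues end up with fair share 0
  }
}
{code}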
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107068#comment-14107068 ] Varun Vasudev commented on YARN-2440: - [~jlowe] does it make sense to get the number of physical cores on the machine and derive the vcore to physical cpu ratio? Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107069#comment-14107069 ] Varun Vasudev commented on YARN-2440: - I'll update the patch to limit cfs_quota_us. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107093#comment-14107093 ] Jason Lowe commented on YARN-2440: -- bq. does it make sense to get the number of physical cores on the machine and derive the vcore to physical cpu ratio? Only if the user can specify the multiplier between a vcore and a physical CPU. Not all physical CPUs are created equal, and as I mentioned earlier, some sites will want to allow fractions of a physical CPU to be allocated. Otherwise we're limiting the number of containers to the number of physical cores, and not all tasks require a full core. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2443) Log Handling for Long Running Service
Xuan Gong created YARN-2443: --- Summary: Log Handling for Long Running Service Key: YARN-2443 URL: https://issues.apache.org/jira/browse/YARN-2443 Project: Hadoop YARN Issue Type: Task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1104) NMs to support rolling logs of stdout stderr
[ https://issues.apache.org/jira/browse/YARN-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1104: Parent Issue: YARN-2443 (was: YARN-896) NMs to support rolling logs of stdout stderr -- Key: YARN-1104 URL: https://issues.apache.org/jira/browse/YARN-1104 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.1.0-beta Reporter: Steve Loughran Assignee: Xuan Gong Currently NMs stream the stdout and stderr streams of a container to a file. For longer lived processes those files need to be rotated so that the log doesn't overflow -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107172#comment-14107172 ] Varun Vasudev commented on YARN-810: [~ywskycn] thanks for letting me know! Some comments on your patch - 1. In CgroupsLCEResourcesHandler.java, you set cfs_period_us to nmShares and cfs_quota_us to cpuShares. From the RedHat documentation, cfs_period_us and cfs_quota_us operate on a per-CPU basis: {quote} Note that the quota and period parameters operate on a CPU basis. To allow a process to fully utilize two CPUs, for example, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 100000. {quote} With your current implementation, on a machine with 4 cores (and 4 vcores), a container which requests 2 vcores will have cfs_period_us set to 4096 and cfs_quota_us set to 2048, which will end up limiting it to 50% of one CPU. Is my understanding wrong? 2. This is just nitpicking, but is it possible to change CpuEnforceCeilingEnabled (and its variants) to just CpuCeilingEnabled or CpuCeilingEnforced? Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN.
First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quote: {noformat} echo 1000 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box
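To spell out the per-CPU semantics raised in the review comment above: the number of CPUs a cgroup may use is the ratio cfs_quota_us / cfs_period_us, so the quota has to scale with the period if multiple cores are intended. A tiny illustration of that arithmetic (plain Java, not YARN code):
{code}
// Plain arithmetic illustrating the quota/period ratio discussed above.
public class QuotaPeriodRatio {
  public static void main(String[] args) {
    // RedHat doc example: quota 200000 over period 100000 -> 2.0 CPUs.
    System.out.println(200000.0 / 100000.0);
    // Values from the review comment (period 4096, quota 2048) -> 0.5,
    // i.e. half of ONE CPU no matter how many vcores were requested.
    System.out.println(2048.0 / 4096.0);
  }
}
{code}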
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: (was: Screen_Shot_v3.png) Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, YARN-2360-v1.txt, YARN-2360-v2.txt Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v3.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, YARN-2360-v1.txt, YARN-2360-v2.txt Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: YARN-2360-v3.patch Update a patch after YARN-2393. The Screen_Shot_v3.png is the fair scheduler web page. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v3.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107221#comment-14107221 ] Wei Yan commented on YARN-810: -- bq. With your current implementation, on a machine with 4 cores(and 4 vcores), a container which requests 2 vcores will have cfs_period_us set to 4096 and cfs_quota_us set to 2048 which will end up limiting it to 50% of one CPU. Is my understanding wrong? Thanks, [~vvasudev]. I mentioned this problem after reading your YARN-2420 patch. I'll double check this problem, and will update the patch. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. 
I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quote: {noformat} echo 1000 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107244#comment-14107244 ] Karthik Kambatla commented on YARN-2360: I would rename the legend to Steady fairshare and Instantaneous fairshare. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107252#comment-14107252 ] Wei Yan commented on YARN-2360: --- Thanks, Karthik. Will update the patch with these changes, and also fix another problem in FairSchedulerQueueInfo. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2444) Primary filters added after first submission not indexed, cause exceptions in logs.
[ https://issues.apache.org/jira/browse/YARN-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated YARN-2444: - Attachment: ats.java Primary filters added after first submission not indexed, cause exceptions in logs. --- Key: YARN-2444 URL: https://issues.apache.org/jira/browse/YARN-2444 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.5.0 Reporter: Marcelo Vanzin Attachments: ats.java See attached code for an example. The code creates an entity with a primary filter, submits it to the ATS. After that, a new primary filter value is added and the entity is resubmitted. At that point two things can be seen: - Searching for the new primary filter value does not return the entity - The following exception shows up in the logs: {noformat} 14/08/22 11:33:42 ERROR webapp.TimelineWebServices: Error when verifying access for user dr.who (auth:SIMPLE) on the events of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } org.apache.hadoop.yarn.exceptions.YarnException: Owner information of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } is corrupted. at org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:67) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2444) Primary filters added after first submission not indexed, cause exceptions in logs.
[ https://issues.apache.org/jira/browse/YARN-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107269#comment-14107269 ] Marcelo Vanzin commented on YARN-2444: -- The following search causes the problem described above: {noformat}/ws/v1/timeline/test?primaryFilter=prop2:val2{noformat} The following one works as expected: {noformat}/ws/v1/timeline/test?primaryFilter=prop1:val1{noformat} Primary filters added after first submission not indexed, cause exceptions in logs. --- Key: YARN-2444 URL: https://issues.apache.org/jira/browse/YARN-2444 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.5.0 Reporter: Marcelo Vanzin Attachments: ats.java See attached code for an example. The code creates an entity with a primary filter, submits it to the ATS. After that, a new primary filter value is added and the entity is resubmitted. At that point two things can be seen: - Searching for the new primary filter value does not return the entity - The following exception shows up in the logs: {noformat} 14/08/22 11:33:42 ERROR webapp.TimelineWebServices: Error when verifying access for user dr.who (auth:SIMPLE) on the events of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } org.apache.hadoop.yarn.exceptions.YarnException: Owner information of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } is corrupted. at org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:67) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
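For reference, a rough approximation of the reproduction steps described in this issue, written against the TimelineClient API. This is not the attached ats.java; the entity id and filter names here are placeholders.
{code}
// Rough approximation of the repro described above; not the attached ats.java.
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PrimaryFilterRepro {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();

    TimelineEntity entity = new TimelineEntity();
    entity.setEntityType("test");
    entity.setEntityId("testid-placeholder");        // unique id in the real repro
    entity.setStartTime(System.currentTimeMillis());
    entity.addPrimaryFilter("prop1", "val1");
    client.putEntities(entity);                      // first put: prop1 is indexed

    entity.addPrimaryFilter("prop2", "val2");        // filter added after first put
    client.putEntities(entity);                      // second put: querying by
                                                     // prop2:val2 finds nothing and
                                                     // the ACL check logs the error
    client.stop();
  }
}
{code}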
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107275#comment-14107275 ] Varun Vasudev commented on YARN-2440: - It might make things easier to go with [~sandyr]'s idea to add a config which expresses a % of the node's CPU that is used by YARN. [~jlowe] would that address your concerns? Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
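For illustration, a minimal yarn-site.xml sketch of the setting being discussed; the property name below is hypothetical, since no name had been settled on at this point in the discussion:
{code:xml}
<!-- Hypothetical property name: cap YARN containers (enforced via cgroups)
     at 80% of the node's physical CPU. -->
<property>
  <name>yarn.nodemanager.resource.cpu-limit-percent</name>
  <value>80</value>
</property>
{code}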
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v4.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: YARN-2360-v4.patch Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107331#comment-14107331 ] Jason Lowe commented on YARN-2440: -- Sure for this JIRA we can go with a percent of total CPU to limit YARN. For something like YARN-160 we'd need the user to specify some kind of relationship between vcores and physical cores. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
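To make the distinction concrete with made-up numbers: if a node has 16 physical cores and yarn.nodemanager.resource.cpu-vcores is set to 32, the percentage-based limit discussed here only caps YARN's aggregate CPU usage on the node, whereas YARN-160-style per-container enforcement would also need the implied ratio of 2 vcores per physical core so that, for example, a container requesting 4 vcores is confined to roughly 2 cores' worth of CPU time.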
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107351#comment-14107351 ] Ashwin Shankar commented on YARN-2360: -- [~ywskycn], patch looks good. Should we explain what Instantaneous and Steady fair share mean in the fair scheduler doc, i.e. the apt.vm file, so that users know what these terms mean? I'm also torn on whether we should define these terms on the UI as part of the legend tooltip or in some other way? Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
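As a concrete illustration of the two terms (the numbers are made up): in a cluster with 90 vcores and three equally weighted queues, each queue's steady fair share is 30 vcores regardless of whether it has any applications, while the instantaneous fair share is computed only over queues with running or pending applications, so it rises to 45 vcores per queue when only two of the three queues are active.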
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107379#comment-14107379 ] Hadoop QA commented on YARN-1458: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663617/YARN-1458.002.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4695//console This message is automatically generated. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle updated YARN-2408: - Attachment: (was: YARN-2408-2.patch) Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features Attachments: YARN-2408-3.patch I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API:
{code:xml}
<resourceRequests>
  <MB>96256</MB>
  <VCores>94</VCores>
  <appMaster>
    <applicationId>application_</applicationId>
    <applicationAttemptId>appattempt_</applicationAttemptId>
    <queueName>default</queueName>
    <totalPendingMB>96256</totalPendingMB>
    <totalPendingVCores>94</totalPendingVCores>
    <numResourceRequests>3</numResourceRequests>
    <resourceRequests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>/default-rack</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>*</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>master</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
    </resourceRequests>
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle updated YARN-2408: - Attachment: YARN-2408-3.patch Bug fix Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features Attachments: YARN-2408-3.patch I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API:
{code:xml}
<resourceRequests>
  <MB>96256</MB>
  <VCores>94</VCores>
  <appMaster>
    <applicationId>application_</applicationId>
    <applicationAttemptId>appattempt_</applicationAttemptId>
    <queueName>default</queueName>
    <totalPendingMB>96256</totalPendingMB>
    <totalPendingVCores>94</totalPendingVCores>
    <numResourceRequests>3</numResourceRequests>
    <resourceRequests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>/default-rack</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>*</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>master</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
    </resourceRequests>
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
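For orientation, a hypothetical request against such an endpoint; the path below is made up for illustration, and the real path is whatever the patch registers under the RM web services:
{noformat}
GET http://<rm-host>:8088/ws/v1/cluster/resourcerequests
{noformat}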
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107429#comment-14107429 ] Hadoop QA commented on YARN-2440: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663704/apache-yarn-2440.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4694//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4694//console This message is automatically generated. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.003.patch In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107471#comment-14107471 ] zhihai xu commented on YARN-1458: - I uploaded a new patch YARN-1458.003.patch to resolve merge conflict after rebase to latest code. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107479#comment-14107479 ] Hadoop QA commented on YARN-2360: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663715/YARN-2360-v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4696//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4696//console This message is automatically generated. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v5.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: (was: Screen_Shot_v5.png) Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: YARN-2360-v5.patch A new patch that adds the description to the fair scheduler .apt.vm file and also shows it in the web UI when the mouse hovers over the steady fair share or instantaneous fair share label. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107563#comment-14107563 ] Hadoop QA commented on YARN-2408: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663726/YARN-2408-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4697//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4697//console This message is automatically generated. Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features Attachments: YARN-2408-3.patch I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API:
{code:xml}
<resourceRequests>
  <MB>96256</MB>
  <VCores>94</VCores>
  <appMaster>
    <applicationId>application_</applicationId>
    <applicationAttemptId>appattempt_</applicationAttemptId>
    <queueName>default</queueName>
    <totalPendingMB>96256</totalPendingMB>
    <totalPendingVCores>94</totalPendingVCores>
    <numResourceRequests>3</numResourceRequests>
    <resourceRequests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>/default-rack</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>*</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>master</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
    </resourceRequests>
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: (was: YARN-2360-v5.patch) Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v5.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: YARN-2360-v5.patch Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity
Marcelo Vanzin created YARN-2445: Summary: ATS does not reflect changes to uploaded TimelineEntity Key: YARN-2445 URL: https://issues.apache.org/jira/browse/YARN-2445 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Marcelo Vanzin Priority: Minor Attachments: ats2.java If you make a change to the TimelineEntity and send it to the ATS, that change is not reflected in the stored data. For example, in the attached code, an existing primary filter is removed and a new one is added. When you retrieve the entity from the ATS, it only contains the old value: {noformat} {entities:[{events:[],entitytype:test,entity:testid-ad5380c0-090e-4982-8da8-21676fe4e9f4,starttime:1408746026958,relatedentities:{},primaryfilters:{oldprop:[val]},otherinfo:{}}]} {noformat} Perhaps this is what the design wanted, but from an API user standpoint, it's really confusing, since to upload events I have to upload the entity itself, and the changes are not reflected. -- This message was sent by Atlassian JIRA (v6.2#6252)
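For context, a minimal sketch approximating the update attempt described above (this is not the attached ats2.java; the ids and filter names are illustrative and mirror the oldprop value shown in the query output), using the Hadoop 2.5 TimelineClient API:
{code:java}
import java.util.UUID;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AtsUpdateRepro {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();

    TimelineEntity entity = new TimelineEntity();
    entity.setEntityType("test");
    entity.setEntityId("testid-" + UUID.randomUUID());
    entity.setStartTime(System.currentTimeMillis());
    entity.addPrimaryFilter("oldprop", "val");
    client.putEntities(entity);

    // Attempted update: drop the old primary filter, add a new one, and re-put.
    // The stored entity still shows only oldprop, as described above.
    entity.getPrimaryFilters().remove("oldprop");
    entity.addPrimaryFilter("newprop", "val");
    client.putEntities(entity);

    client.stop();
  }
}
{code}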
[jira] [Updated] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity
[ https://issues.apache.org/jira/browse/YARN-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated YARN-2445: - Attachment: ats2.java ATS does not reflect changes to uploaded TimelineEntity --- Key: YARN-2445 URL: https://issues.apache.org/jira/browse/YARN-2445 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Marcelo Vanzin Priority: Minor Attachments: ats2.java If you make a change to the TimelineEntity and send it to the ATS, that change is not reflected in the stored data. For example, in the attached code, an existing primary filter is removed and a new one is added. When you retrieve the entity from the ATS, it only contains the old value: {noformat} {entities:[{events:[],entitytype:test,entity:testid-ad5380c0-090e-4982-8da8-21676fe4e9f4,starttime:1408746026958,relatedentities:{},primaryfilters:{oldprop:[val]},otherinfo:{}}]} {noformat} Perhaps this is what the design wanted, but from an API user standpoint, it's really confusing, since to upload events I have to upload the entity itself, and the changes are not reflected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Gao reassigned YARN-321: --- Assignee: Yu Gao Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Yu Gao Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107721#comment-14107721 ] Hadoop QA commented on YARN-1458: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663743/YARN-1458.003.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerFairShare {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4698//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4698//console This message is automatically generated. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. 
The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at
[jira] [Updated] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2395: -- Attachment: YARN-2395-2.patch Uploaded a new patch which addresses Karthik's latest comments, and also adds per-job preemption timeout configuration for min share. FairScheduler: Preemption timeout should be configurable per queue -- Key: YARN-2395 URL: https://issues.apache.org/jira/browse/YARN-2395 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2395-1.patch, YARN-2395-2.patch Currently in fair scheduler, the preemption logic considers fair share starvation only at leaf queue level. This jira is created to implement it at the parent queue as well. It involves: 1. Making the check for fair share starvation and the amount of resource to preempt recursive, such that they traverse the queue hierarchy from root to leaf. 2. Currently fairSharePreemptionTimeout is a global config. We could make it configurable on a per-queue basis, so that we can specify different timeouts for parent queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
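For illustration, a sketch of what a per-queue setting in the fair scheduler allocation file could look like under this proposal; the element name and placement are assumptions here, and the actual syntax is whatever the patch defines:
{code:xml}
<allocations>
  <queue name="parentA">
    <!-- Assumed per-queue override of the fair-share preemption timeout, in seconds. -->
    <fairSharePreemptionTimeout>60</fairSharePreemptionTimeout>
    <queue name="childA1"/>
  </queue>
</allocations>
{code}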
[jira] [Updated] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-321: - Assignee: (was: Yu Gao) Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107754#comment-14107754 ] Hadoop QA commented on YARN-2360: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663761/YARN-2360-v5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4699//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4699//console This message is automatically generated. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1326: - Attachment: YARN-1326.4.patch Fixed failures of TestRMWebServices. RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch, YARN-1326.2.patch, YARN-1326.3.patch, YARN-1326.4.patch, demo.png Original Estimate: 3h Remaining Estimate: 3h Currently there is no way to know which RMStore the RM uses. It's useful to log this information at the RM's startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
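For illustration, the kind of startup log line this asks for might look like the following sketch (not the actual patch; the helper method and its placement are made up for the example):
{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore;

public class RMStoreStartupLogSketch {
  private static final Log LOG = LogFactory.getLog(RMStoreStartupLogSketch.class);

  // Illustrative only: log which RMStateStore implementation the RM ended up using.
  static void logStoreClass(RMStateStore store) {
    LOG.info("Using RMStateStore implementation: " + store.getClass().getName());
  }
}
{code}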
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107785#comment-14107785 ] Hadoop QA commented on YARN-2395: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663799/YARN-2395-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4700//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4700//console This message is automatically generated. FairScheduler: Preemption timeout should be configurable per queue -- Key: YARN-2395 URL: https://issues.apache.org/jira/browse/YARN-2395 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2395-1.patch, YARN-2395-2.patch Currently in fair scheduler, the preemption logic considers fair share starvation only at leaf queue level. This jira is created to implement it at the parent queue as well. It involves : 1. Making check for fair share starvation and amount of resource to preempt recursive such that they traverse the queue hierarchy from root to leaf. 2. Currently fairSharePreemptionTimeout is a global config. We could make it configurable on a per queue basis,so that we can specify different timeouts for parent queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.004.patch In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107807#comment-14107807 ] zhihai xu commented on YARN-1458: - I uploaded a new patch, YARN-1458.004.patch, to fix the test failure. The test failure is the following: parent queue root.parentB has a steady fair share of one Vcore, but root.parentB has two child queues, root.parentB.childB1 and root.parentB.childB2, and we can't split one Vcore between two child queues. The new patch calculates conservatively and assigns 0 Vcores to each child queue. The old code assigned 1 Vcore to each child queue, which exceeds the total resource limit. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when clients submitted lots of jobs; the issue is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the ResourceManager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
	- waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
	at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
	at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
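The share calculation zhihai xu describes above comes down to how a parent queue's steady fair share is divided among its children when it is not evenly divisible. Below is a minimal, self-contained Java sketch of that arithmetic (not the actual ComputeFairShares code from the patch; the class and method names are made up for illustration): rounding up over-allocates beyond the parent's share, while rounding down, the conservative calculation, stays within it.
{code}
// Sketch of the rounding behaviour discussed above; not taken from the YARN-1458 patch.
public class ConservativeShareSketch {

  // Split parentShare Vcores evenly across numChildren child queues.
  static int perChildShare(int parentShare, int numChildren, boolean roundUp) {
    double exact = (double) parentShare / numChildren;
    return roundUp ? (int) Math.ceil(exact) : (int) Math.floor(exact);
  }

  public static void main(String[] args) {
    // root.parentB has a steady fair share of 1 Vcore and two child queues.
    // Rounding up gives each child 1 Vcore, 2 in total, exceeding the parent's share.
    System.out.println(perChildShare(1, 2, true));  // prints 1
    // Rounding down (the conservative calculation) gives each child 0 Vcores.
    System.out.println(perChildShare(1, 2, false)); // prints 0
  }
}
{code}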
[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107806#comment-14107806 ] Hadoop QA commented on YARN-1326: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663809/YARN-1326.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4701//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4701//console This message is automatically generated. RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch, YARN-1326.2.patch, YARN-1326.3.patch, YARN-1326.4.patch, demo.png Original Estimate: 3h Remaining Estimate: 3h Currently there is no way to know which RMStore the RM uses. It would be useful to log this information at RM startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
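The request in YARN-1326 is simply to surface which state-store implementation the RM ended up with at startup. A minimal sketch of that kind of log line is below (not the actual YARN-1326 patch; the class and method names are illustrative, and java.util.logging stands in for the RM's real logger):
{code}
import java.util.logging.Logger;

// Illustrative sketch, not the YARN-1326 patch: log the concrete class of the
// state-store object the ResourceManager constructed, so operators can tell
// from the startup log which store is in use.
public class StoreStartupLogSketch {
  private static final Logger LOG = Logger.getLogger(StoreStartupLogSketch.class.getName());

  // In the real RM this would receive the RMStateStore instance built from
  // yarn.resourcemanager.store.class; any object works for the demonstration.
  static void logStoreClass(Object stateStore) {
    LOG.info("Using state store implementation: " + stateStore.getClass().getName());
  }

  public static void main(String[] args) {
    logStoreClass(new java.util.Properties()); // placeholder for the real store object
  }
}
{code}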
[jira] [Commented] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity
[ https://issues.apache.org/jira/browse/YARN-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107808#comment-14107808 ] Billie Rinaldi commented on YARN-2445: -- ATS is only designed to support aggregation. In other words, each new primary filter or related entity is added to what is already there for the entity. You cannot remove previously put information. In this example, I would expect oldprop and newprop both to appear. ATS does not reflect changes to uploaded TimelineEntity --- Key: YARN-2445 URL: https://issues.apache.org/jira/browse/YARN-2445 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Marcelo Vanzin Priority: Minor Attachments: ats2.java If you make a change to the TimelineEntity and send it to the ATS, that change is not reflected in the stored data. For example, in the attached code, an existing primary filter is removed and a new one is added. When you retrieve the entity from the ATS, it only contains the old value: {noformat} {entities:[{events:[],entitytype:test,entity:testid-ad5380c0-090e-4982-8da8-21676fe4e9f4,starttime:1408746026958,relatedentities:{},primaryfilters:{oldprop:[val]},otherinfo:{}}]} {noformat} Perhaps this is what the design wanted, but from an API user standpoint, it's really confusing, since to upload events I have to upload the entity itself, and the changes are not reflected. -- This message was sent by Atlassian JIRA (v6.2#6252)
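Billie Rinaldi's point above is that a second put of the same entity merges into what the server already stores rather than replacing it. A rough sketch of the client-side flow is below (assuming a reachable timeline server and the Hadoop 2.x timeline client API; the entity id and filter names are illustrative): after the second put, a query for the entity is expected to show both oldprop and newprop.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;

// Sketch of the aggregation behaviour described above; assumes a running
// timeline server reachable via the default Configuration.
public class AtsAggregationSketch {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new Configuration());
    client.start();
    try {
      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("test");
      entity.setEntityId("testid-example");             // illustrative id
      entity.setStartTime(System.currentTimeMillis());
      entity.addPrimaryFilter("oldprop", "val");
      client.putEntities(entity);                       // first put stores oldprop

      // Dropping oldprop client-side and putting again does NOT remove it on the
      // server; the second put only adds newprop alongside the stored oldprop.
      entity.getPrimaryFilters().clear();
      entity.addPrimaryFilter("newprop", "val");
      client.putEntities(entity);
    } finally {
      client.stop();
    }
  }
}
{code}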
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107822#comment-14107822 ] Hadoop QA commented on YARN-1458: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663814/YARN-1458.004.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4702//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4702//console This message is automatically generated. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when clients submitted lots of jobs; the issue is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the ResourceManager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
	- waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
	at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
	at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107823#comment-14107823 ] Tsuyoshi OZAWA commented on YARN-1326: -- A patch is ready for review. [~kkambatl], could you check it? RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch, YARN-1326.2.patch, YARN-1326.3.patch, YARN-1326.4.patch, demo.png Original Estimate: 3h Remaining Estimate: 3h Currently there is no way to know which RMStore the RM uses. It would be useful to log this information at RM startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-2035: -- Attachment: YARN-2035-v3.patch Addressed failing tests with last patch. FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035-v2.patch, YARN-2035-v3.patch, YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)
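For context on the YARN-2035 failure mode above: while the NameNode is in safemode, HDFS write operations issued during service init throw, which can keep the RM and AHS from coming up. Below is a rough sketch of one way to tolerate that at startup (not necessarily what the YARN-2035 patch does; the path handling and retry policy here are assumptions): retry the history-store root-directory creation with a bounded backoff instead of failing on the first exception.
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: retry directory creation while the NameNode may still be in safemode.
public class SafemodeTolerantInit {

  static void ensureRootDir(Configuration conf, Path rootDir)
      throws IOException, InterruptedException {
    FileSystem fs = rootDir.getFileSystem(conf);
    int attempts = 0;
    while (true) {
      try {
        fs.mkdirs(rootDir);   // no-op if the directory already exists
        return;
      } catch (IOException e) {
        // Typically a SafeModeException (wrapped in a RemoteException) while the NN is in safemode.
        if (++attempts >= 10) {
          throw e;            // give up after a bounded number of retries
        }
        Thread.sleep(5000L);  // back off before retrying
      }
    }
  }
}
{code}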
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107851#comment-14107851 ] zhihai xu commented on YARN-1458: - The test failure is not related to my change. TestAMRestart passes in my local build: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 89.639 sec - in org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Results: Tests run: 5, Failures: 0, Errors: 0, Skipped: 0 In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when clients submitted lots of jobs; the issue is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the ResourceManager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
	- waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
	at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
	at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107852#comment-14107852 ] Hadoop QA commented on YARN-2035: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663820/YARN-2035-v3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4703//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4703//console This message is automatically generated. FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035-v2.patch, YARN-2035-v3.patch, YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)