[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-08-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103546#comment-14103546
 ] 

Hadoop QA commented on YARN-1492:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12662918/YARN-1492-all-trunk-v2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4674//console

This message is automatically generated.

 truly shared cache for jars (jobjar/libjar)
 ---

 Key: YARN-1492
 URL: https://issues.apache.org/jira/browse/YARN-1492
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.0.4-alpha
Reporter: Sangjin Lee
Assignee: Chris Trezzo
 Attachments: YARN-1492-all-trunk-v1.patch, 
 YARN-1492-all-trunk-v2.patch, shared_cache_design.pdf, 
 shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, 
 shared_cache_design_v4.pdf, shared_cache_design_v5.pdf


 Currently there is the distributed cache that enables you to cache jars and 
 files so that attempts from the same job can reuse them. However, sharing is 
 limited with the distributed cache because it is normally on a per-job basis. 
 On a large cluster, sometimes copying of jobjars and libjars becomes so 
 prevalent that it consumes a large portion of the network bandwidth, not to 
 mention defeating the purpose of bringing compute to where the data is. This 
 is wasteful because in most cases code doesn't change much across many jobs.
 I'd like to propose and discuss the feasibility of introducing a truly shared 
 cache so that multiple jobs from multiple users can share and cache jars. 
 This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2432) RMStateStore should process the pending events before close

2014-08-20 Thread Varun Saxena (JIRA)
Varun Saxena created YARN-2432:
--

 Summary: RMStateStore should process the pending events before 
close
 Key: YARN-2432
 URL: https://issues.apache.org/jira/browse/YARN-2432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Varun Saxena
Assignee: Varun Saxena


Refer to discussion on YARN-2136 
(https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266).
 

As pointed out by [~jianhe], we should process the dispatcher event queue 
before closing the state store by flipping over the following statements in 
code.

{code:title=RMStateStore.java|borderStyle=solid}
 protected void serviceStop() throws Exception {
   closeInternal();
   dispatcher.stop();
 }
{code}

Currently, if the state store is being closed on events such as switching to 
standby, it will first close the state store (in the case of ZKRMStateStore, close 
the connection with ZK) and then process the pending events. Instead, we should 
first process the pending events and then call close.
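A minimal sketch of the proposed ordering (just the two calls swapped; any other cleanup in serviceStop is omitted here):

{code:title=RMStateStore.java (proposed order, sketch)|borderStyle=solid}
 protected void serviceStop() throws Exception {
   // Stop the dispatcher first so that pending store/update events are
   // handled while the store (e.g. the ZK connection in ZKRMStateStore)
   // is still open.
   dispatcher.stop();
   // Only then release the underlying store resources.
   closeInternal();
 }
{code}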




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2136) RMStateStore can explicitly handle store/update events when fenced

2014-08-20 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103618#comment-14103618
 ] 

Varun Saxena commented on YARN-2136:


Hi [~jianhe], for flipping over these statements, I will raise a separate JIRA

 RMStateStore can explicitly handle store/update events when fenced
 --

 Key: YARN-2136
 URL: https://issues.apache.org/jira/browse/YARN-2136
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Varun Saxena

 RMStateStore can choose to handle/ignore store/update events upfront instead 
 of invoking more ZK operations if the state store is in the fenced state. 
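A minimal sketch of that idea (the check is illustrative and not the committed patch): short-circuit store/update events while the store is fenced instead of issuing ZK operations that are bound to fail.

{code}
// Sketch only, at the top of the state store's event handler;
// isFencedState() is a hypothetical check for the fenced state.
if (isFencedState()) {
  LOG.info("State store is fenced; ignoring event " + event.getType());
  return;   // skip the ZK operation entirely
}
{code}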



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2432) RMStateStore should process the pending events before close

2014-08-20 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2432:
---

Description: 
Refer to discussion on YARN-2136 
(https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266).
 

As pointed out by [~jianhe], we should process the dispatcher event queue 
before closing the state store by flipping over the following statements in 
code.

{code:title=RMStateStore.java|borderStyle=solid}
 protected void serviceStop() throws Exception {
closeInternal();
dispatcher.stop();
  }
{code}

Currently, if the state store is being stopped on events such as switching to 
standby, it will first close the state store(in case of ZKRMStateStore, close 
connection with ZK) and then process the pending events. Instead, we should 
first process the pending events and then call close.


  was:
Refer to discussion on YARN-2136 
(https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266).
 

As pointed out by [~jianhe], we should process the dispatcher event queue 
before closing the state store by flipping over the following statements in 
code.

{code:title=RMStateStore.java|borderStyle=solid}
 protected void serviceStop() throws Exception {
closeInternal();
dispatcher.stop();
  }
{code}

Currently, if the state store is being closed on events such as switching to 
standby, it will first close the state store(in case of ZKRMStateStore, close 
connection with ZK) and then process the pending events. Instead, we should 
first process the pending events and then call close.



 RMStateStore should process the pending events before close
 ---

 Key: YARN-2432
 URL: https://issues.apache.org/jira/browse/YARN-2432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Varun Saxena
Assignee: Varun Saxena

 Refer to discussion on YARN-2136 
 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266).
  
 As pointed out by [~jianhe], we should process the dispatcher event queue 
 before closing the state store by flipping over the following statements in 
 code.
 {code:title=RMStateStore.java|borderStyle=solid}
  protected void serviceStop() throws Exception {
 closeInternal();
 dispatcher.stop();
   }
 {code}
 Currently, if the state store is being stopped on events such as switching to 
 standby, it will first close the state store(in case of ZKRMStateStore, close 
 connection with ZK) and then process the pending events. Instead, we should 
 first process the pending events and then call close.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1801) NPE in public localizer

2014-08-20 Thread Beckham007 (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103649#comment-14103649
 ] 

Beckham007 commented on YARN-1801:
--

When something goes wrong with HDFS, this error can happen.
This NPE makes the NM crash, so I think we should fix this in YARN.

2014-08-20 10:21:04,004 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Failed to download rsrc { { 
hdfs://...:54310/tmp/temp-793434835/tmp-707424512/CosAgent.jar, 1408501159584, 
FILE, null 
},pending,[(container_1407229860715_13071531_01_87)],18021755091999344,DOWNLOADING}
java.io.FileNotFoundException: File does not exist: 
hdfs://...:54310/tmp/temp-793434835/tmp-707424512/CosAgent.jar
2014-08-20 10:21:04,032 FATAL 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Error: Shutting down
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
2014-08-20 10:21:04,032 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Public cache exiting
2014-08-20 10:21:04,052 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
Error in dispatcher thread
java.util.concurrent.RejectedExecutionException

 NPE in public localizer
 ---

 Key: YARN-1801
 URL: https://issues.apache.org/jira/browse/YARN-1801
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
Assignee: Hong Zhiguo
Priority: Critical
 Attachments: YARN-1801.patch


 While investigating YARN-1800 found this in the NM logs that caused the 
 public localizer to shutdown:
 {noformat}
 2014-01-23 01:26:38,655 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:addResource(651)) - Downloading public 
 rsrc:{ 
 hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
  1390440382009, FILE, null }
 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(726)) - Error: Shutting down
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
 2014-01-23 01:26:38,656 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(728)) - Public cache exiting
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-20 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu reassigned YARN-1458:
---

Assignee: zhihai xu

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103678#comment-14103678
 ] 

zhihai xu commented on YARN-1458:
-

The patch didn't consider that the type conversion from double to integer in 
computeShare loses precision. So breaking when the value is zero will cause every 
Schedulable's fair share to be zero if all the Schedulables' weights and min shares 
are less than 1. In the unit test, the queues' weights are 0.25 and 0.75 and the 
queues' min shares are Resources.none().
I will create a new patch.
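A toy illustration of the truncation described above (the numbers are made up; this is not the ComputeFairShares code): casting a double share below 1.0 to int truncates it to zero, which is why weights and min shares under 1 all end up with a zero fair share when the loop breaks on zero.

{code}
double weight = 0.25;             // a queue weight below 1
double rawShare = weight * 3.9;   // 0.975, still below 1.0
int share = (int) rawShare;       // the int conversion truncates to 0
System.out.println(share);        // prints 0
{code}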

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1800) YARN NodeManager with java.util.concurrent.RejectedExecutionException

2014-08-20 Thread Beckham007 (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103700#comment-14103700
 ] 

Beckham007 commented on YARN-1800:
--

[~vinodkv] [~jlowe] [~vvasudev] I think we shouldn't catch this exception. As 
[~jlowe] mentioned, the NM will be running in a damaged state where every public 
localization will fail the container. Most of those containers will fail, but since 
their CPU/memory appears free, other containers will be assigned to the NM and those 
new containers will also fail. This would decrease the throughput of the whole cluster. 
Letting the NM crash may be the better choice.

 YARN NodeManager with java.util.concurrent.RejectedExecutionException
 -

 Key: YARN-1800
 URL: https://issues.apache.org/jira/browse/YARN-1800
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Paul Isaychuk
Assignee: Varun Vasudev
Priority: Critical
 Fix For: 2.4.0

 Attachments: apache-yarn-1800.0.patch, apache-yarn-1800.1.patch, 
 yarn-yarn-nodemanager-host-2.log.zip


 Noticed this on tests running on Apache Hadoop 2.2 cluster
 {code}
 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
 (LocalizedResource.java:handle(196)) - Resource 
 hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar
  transitioned from INIT to DOWNLOADING
 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
 (LocalizedResource.java:handle(196)) - Resource 
 hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.splitmetainfo
  transitioned from INIT to DOWNLOADING
 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
 (LocalizedResource.java:handle(196)) - Resource 
 hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.split 
 transitioned from INIT to DOWNLOADING
 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
 (LocalizedResource.java:handle(196)) - Resource 
 hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.xml 
 transitioned from INIT to DOWNLOADING
 2014-01-23 01:30:28,576 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:addResource(651)) - Downloading public 
 rsrc:{ 
 hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
  1390440627435, FILE, null }
 2014-01-23 01:30:28,576 FATAL event.AsyncDispatcher 
 (AsyncDispatcher.java:dispatch(141)) - Error in dispatcher thread
 java.util.concurrent.RejectedExecutionException
 at 
 java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
 at 
 java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
 at 
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
 at 
 java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:678)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:583)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:525)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
 at java.lang.Thread.run(Thread.java:662)
 2014-01-23 01:30:28,577 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:dispatch(144)) - Exiting, bbye..
 2014-01-23 01:30:28,596 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped 
 SelectChannelConnector@0.0.0.0:50060
 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl 
 (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(328)) - 
 Applications still running : [application_1389742077466_0396]
 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl 
 (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(336)) - Wa
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2426) ResourceManger is not able renew WebHDFS token when application submitted by Yarn WebService

2014-08-20 Thread Karam Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karam Singh updated YARN-2426:
--

Description: 
Encountered this issue while using YARN's new RM web service (WS) for application 
submission, on a single-node cluster while submitting a Distributed Shell 
application through the RM WS.
For this we need to pass the custom script and the AppMaster jar along with a WebHDFS token.

The application was failing because the ResourceManager was failing to renew the token 
for the user (appOwner), so the RM was rejecting the application with the following 
exception trace in the RM log:
{code}
2014-08-19 03:12:54,733 WARN  security.DelegationTokenRenewer 
(DelegationTokenRenewer.java:handleDTRenewerAppSubmitEvent(661)) - Unable to 
add the application to the delegation token renewer.
java.io.IOException: Failed to renew token: Kind: WEBHDFS delegation, Service: 
NNHOST:FSPORT, Ident: (WEBHDFS delegation token  for hrt_qa)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:394)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$5(DelegationTokenRenewer.java:357)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:657)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:638)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unexpected HTTP response: code=-1 != 200, 
op=RENEWDELEGATIONTOKEN, message=null
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:331)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:90)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:598)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:448)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:477)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:473)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.renewDelegationToken(WebHdfsFileSystem.java:1318)
at 
org.apache.hadoop.hdfs.web.TokenAspect$TokenManager.renew(TokenAspect.java:73)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:477)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:1)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.renewToken(DelegationTokenRenewer.java:473)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:392)
... 6 more
Caused by: java.io.IOException: The error stream is null.
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.jsonParse(WebHdfsFileSystem.java:304)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:329)
... 24 more
2014-08-19 03:12:54,735 DEBUG event.AsyncDispatcher 
(AsyncDispatcher.java:dispatch(164)) - Dispatching the event 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRejectedEvent.EventType:
 APP_REJECTED
{code}

From the exception trace it is clear that the RM is trying to contact the NameNode on 
the FS port instead of the HTTP port and is failing to renew the token.
It looks like this is because the WebHDFS token carries the NameNode's IP and FS port 
in the delegation token instead of the HTTP address, causing the RM to contact WebHDFS 
on the FS port and fail to renew the token.
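A small diagnostic sketch of the point above (not part of any patch here): the renewer simply connects to whatever address is stored in the token's service field, so a WEBHDFS token carrying NNHOST:FSPORT there is renewed against the wrong port.

{code}
// Print the kind and service of every token held by the current user; in the
// failing case the "WEBHDFS delegation" token shows the RPC/FS port.
for (Token<?> t : UserGroupInformation.getCurrentUser().getTokens()) {
  System.out.println(t.getKind() + " -> " + t.getService());
}
{code}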



  was:
Encountered this issue while using YARN's new RM web service (WS) for application 
submission, on a single-node cluster while submitting a Distributed Shell 
application through the RM WS.
For this we need to pass the custom script and the AppMaster jar along with a WebHDFS 
token to the NodeManager for localization.

Distributed Shell Application was failing as 

[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-08-20 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-796:


Attachment: YARN-796.node-label.demo.patch.1

Hi guys,
Thanks for your input over the past several weeks. Over the past two weeks I implemented 
a patch based on the design doc: 
https://issues.apache.org/jira/secure/attachment/12662291/Node-labels-Requirements-Design-doc-V2.pdf
I'd really appreciate it if you could take a look. The patch 
is YARN-796.node-label.demo.patch.1 (I gave it a longer name so it is not confused with 
other patches).

*Already included in this patch:*
* Protocol changes for ResourceRequest and ApplicationSubmissionContext (leveraged 
contribution from Yuliya's patch, thanks); also updated AMRMClient
* RMAdmin changes to dynamically update the labels of a node (add/set/remove); also 
updated the RMAdmin CLI
* Capacity scheduler related changes, including:
** Headroom calculation, preemption, and container allocation respect labels.
** Allow users to set the list of labels a queue can access in capacity-scheduler.xml
* A centralized node label manager that can be updated dynamically to add/set/remove 
labels and can store labels to the file system. It works with the RM restart/HA 
scenario (similar to RMStateStore).
* Support for a {{--labels}} option in distributed shell, so we can use distributed 
shell to test this feature
* Related unit tests

*Will include later:*
* RM REST APIs for node label
* Distributed configuration (set labels in yarn-site.xml of NMs)
* Support labels in FairScheduler

*Try this patch*
1. Create a capacity-scheduler.xml with labels accessible on queues
{code}
   root
   /  \
  a    b
  |    |
  a1   b1

a.capacity = 50, b.capacity = 50 
a1.capacity = 100, b1.capacity = 100

And a.label = red,blue; b.label = blue,green
<property>
  <name>yarn.scheduler.capacity.root.a.labels</name>
  <value>red, blue</value>
</property>

<property>
  <name>yarn.scheduler.capacity.root.b.labels</name>
  <value>blue, green</value>
</property>
{code}
This means queue a (and its sub-queues) CAN access labels red and blue; queue b 
(and its sub-queues) CAN access labels blue and green

2. Create a node-labels.json file locally; this holds the initial labels on nodes (you 
can change them dynamically using the rmadmin CLI while the RM is running, so you don't 
have to do this). Then set {{yarn.resourcemanager.labels.node-to-label-json.path}} to 
{{file:///path/to/node-labels.xml}}
{code}
{
   "host1": {
      "labels": ["red", "blue"]
   },
   "host2": {
      "labels": ["blue", "green"]
   }
}
{code}
This sets red/blue labels on host1, and sets blue/green labels on host2

3. Start the YARN cluster (if you have several nodes in the cluster, you need to 
launch HDFS to use distributed shell)
* Submit a distributed shell:
{code}
hadoop jar path/to/*distributedshell*.jar \
  org.apache.hadoop.yarn.applications.distributedshell.Client \
  -shell_command hostname -jar path/to/*distributedshell*.jar \
  -num_containers 10 -labels "red && blue" -queue a1
{code}
This will run a distributed shell job that launches 10 containers running the 
{{hostname}} command; the requested label expression is red && blue, so all 
containers will be allocated on host1.

Some other examples:
* {{-queue a1 -labels "red && green"}}: this will be rejected, because queue a1 
cannot access label green
* {{-queue a1 -labels blue}}: some containers will be allocated on host1 and 
some others on host2, because both host1 and host2 carry the 
blue label
* {{-queue b1 -labels green}}: all containers will be allocated on host2

4. Dynamically update labels using rmadmin CLI
{code}
// dynamically add labels x, y to label manager
yarn rmadmin -addLabels x,y

// dynamically set label x on node1, and labels x,y on node2
yarn rmadmin -setNodeToLabels node1:x;node2:x,y

// remove labels from label manager, and also remove labels on nodes
yarn rmadmin -removeLabels x
{code}

*Two more examples for node label*
1. Labels as constraints:
{code}
Queue structure:
root
   / | \
  a  b  c

a has label: WINDOWS, LINUX, GPU
b has label: WINDOWS, LINUX, LARGE_MEM
c doesn't have label

25 nodes in the cluster:
h1-h5:   LINUX, GPU
h6-h10:  LINUX,
h11-h15: LARGE_MEM, LINUX
h16-h20: LARGE_MEM, WINDOWS
h21-h25: empty
{code}
If you want LINUX && GPU resources, you should submit to queue-a and set the 
label expression in the ResourceRequest to LINUX && GPU.
If you want LARGE_MEM resources and don't mind the OS, you can submit to 
queue-b and set the label in the ResourceRequest to LARGE_MEM.
If you want to allocate on nodes that don't have labels (h21-h25), you can submit 
to any queue and leave the label in the ResourceRequest empty.

2. Labels to hard partition cluster
{code}
Queue structure:
root
   / | \
  a  b  c

a has label: MARKETING
b has label: HR
c has label: RD

15 nodes in the cluster:
h1-h5:   MARKETING
h6-h10:  HR
h11-h15: RD
{code}
Now the cluster is hard-partitioned into 3 small clusters: h1-h5 are for marketing and 
only queue-a can use them (set the label in the ResourceRequest to MARKETING). Similarly 
for the HR and RD clusters. 

I appreciate your 

[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-08-20 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103885#comment-14103885
 ] 

Allen Wittenauer commented on YARN-796:
---

I might have missed it, but I don't see dynamic labels generated from an admin-provided 
script or class on the NM listed above.  That's a must-have feature to 
make this viable for any large installation.

 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: LabelBasedScheduling.pdf, 
 Node-labels-Requirements-Design-doc-V1.pdf, 
 Node-labels-Requirements-Design-doc-V2.pdf, YARN-796.node-label.demo.patch.1, 
 YARN-796.patch, YARN-796.patch4


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1801) NPE in public localizer

2014-08-20 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-1801:
-

 Target Version/s: 2.6.0  (was: 2.4.1)
Affects Version/s: 2.2.0

[~beckham007] what Hadoop version corresponds to the log messages above?  The 
logs imply it might be something close to Hadoop 2.2, since the NPE is on the 
same line number as originally reported.

The core problem with the original NPE is that assoc should never be null 
unless there's a code bug, and we closed a race condition that could cause that 
in YARN-1575.  It would be good to know whether you're already running on a version 
that includes the fix from YARN-1575, and if not, whether you can reproduce the 
problem after including that fix.
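For reference, a guard along these lines (purely a sketch with illustrative names, not the attached patch) is what handling the condition would look like, turning the NPE into a logged failure for that one resource instead of killing the public localizer thread:

{code}
// Inside the PublicLocalizer run() loop, after a download future completes:
LocalizerResourceRequestEvent assoc = pending.remove(completed);
if (assoc == null) {
  // Should never happen unless there is a code bug (see YARN-1575);
  // log it and keep the localizer loop alive instead of throwing an NPE.
  LOG.error("Localized unknown resource " + completed + "; ignoring");
  continue;
}
{code}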

 NPE in public localizer
 ---

 Key: YARN-1801
 URL: https://issues.apache.org/jira/browse/YARN-1801
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Jason Lowe
Assignee: Hong Zhiguo
Priority: Critical
 Attachments: YARN-1801.patch


 While investigating YARN-1800 found this in the NM logs that caused the 
 public localizer to shutdown:
 {noformat}
 2014-01-23 01:26:38,655 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:addResource(651)) - Downloading public 
 rsrc:{ 
 hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
  1390440382009, FILE, null }
 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(726)) - Error: Shutting down
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
 2014-01-23 01:26:38,656 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(728)) - Public cache exiting
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2432) RMStateStore should process the pending events before clo

2014-08-20 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2432:
---

Summary: RMStateStore should process the pending events before clo  (was: 
RMStateStore should process the pending events before close)

 RMStateStore should process the pending events before clo
 -

 Key: YARN-2432
 URL: https://issues.apache.org/jira/browse/YARN-2432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Varun Saxena
Assignee: Varun Saxena

 Refer to discussion on YARN-2136 
 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266).
  
 As pointed out by [~jianhe], we should process the dispatcher event queue 
 before closing the state store by flipping over the following statements in 
 code.
 {code:title=RMStateStore.java|borderStyle=solid}
  protected void serviceStop() throws Exception {
 closeInternal();
 dispatcher.stop();
   }
 {code}
 Currently, if the state store is being stopped on events such as switching to 
 standby, it will first close the state store(in case of ZKRMStateStore, close 
 connection with ZK) and then process the pending events. Instead, we should 
 first process the pending events and then call close.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2432) RMStateStore should process the pending events before close

2014-08-20 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2432:
---

Summary: RMStateStore should process the pending events before close  (was: 
RMStateStore should process the pending events before clo)

 RMStateStore should process the pending events before close
 ---

 Key: YARN-2432
 URL: https://issues.apache.org/jira/browse/YARN-2432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Varun Saxena
Assignee: Varun Saxena

 Refer to discussion on YARN-2136 
 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266).
  
 As pointed out by [~jianhe], we should process the dispatcher event queue 
 before closing the state store by flipping over the following statements in 
 code.
 {code:title=RMStateStore.java|borderStyle=solid}
  protected void serviceStop() throws Exception {
 closeInternal();
 dispatcher.stop();
   }
 {code}
 Currently, if the state store is being stopped on events such as switching to 
 standby, it will first close the state store(in case of ZKRMStateStore, close 
 connection with ZK) and then process the pending events. Instead, we should 
 first process the pending events and then call close.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-314) Schedulers should allow resource requests of different sizes at the same priority and location

2014-08-20 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla reassigned YARN-314:
-

Assignee: Karthik Kambatla  (was: Sandy Ryza)

I would like to take a stab at this. 

 Schedulers should allow resource requests of different sizes at the same 
 priority and location
 --

 Key: YARN-314
 URL: https://issues.apache.org/jira/browse/YARN-314
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Karthik Kambatla
 Fix For: 2.6.0


 Currently, resource requests for the same container and locality are expected 
 to all be the same size.
 While it doesn't look like it's needed for apps currently, and it can be 
 circumvented by specifying different priorities if absolutely necessary, it 
 seems to me that the ability to request containers with different resource 
 requirements at the same priority level should be there for the future and 
 for completeness' sake.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2409) Active to StandBy transition does not stop rmDispatcher that causes 1 AsyncDispatcher thread leak.

2014-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103914#comment-14103914
 ] 

Hudson commented on YARN-2409:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #652 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/652/])
YARN-2409. RM ActiveToStandBy transition missing stoping previous rmDispatcher. 
Contributed by Rohith (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1618915)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java


 Active to StandBy transition does not stop rmDispatcher that causes 1 
 AsyncDispatcher thread leak. 
 ---

 Key: YARN-2409
 URL: https://issues.apache.org/jira/browse/YARN-2409
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0
Reporter: Nishan Shetty
Assignee: Rohith
Priority: Critical
 Fix For: 2.6.0

 Attachments: YARN-2409.patch


 {code}
   at java.lang.Thread.run(Thread.java:662)
 2014-08-12 07:03:00,839 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 STATUS_UPDATE at LAUNCHED
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:697)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:779)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:760)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
   at java.lang.Thread.run(Thread.java:662)
 2014-08-12 07:03:00,839 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 CONTAINER_ALLOCATED at LAUNCHED
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:697)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:779)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:760)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
   at java.lang.Thread.run(Thread.java:662)
 2014-08-12 07:03:00,839 ERROR org.apache.hadoop.ya
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2249) AM release request may be lost on RM restart

2014-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103912#comment-14103912
 ] 

Hudson commented on YARN-2249:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #652 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/652/])
YARN-2249. Avoided AM release requests being lost on work preserving RM 
restart. Contributed by Jian He. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1618972)
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java


 AM release request may be lost on RM restart
 

 Key: YARN-2249
 URL: https://issues.apache.org/jira/browse/YARN-2249
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Fix For: 2.6.0

 Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, 
 YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch, YARN-2249.5.patch


 AM resync on RM restart will send outstanding container release requests back 
 to the new RM. In the meantime, NMs report the container statuses back to the RM 
 to recover the containers. If the RM receives a container release request 
 before the container is actually recovered in the scheduler, the container won't 
 be released and the release request will be lost.
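A rough sketch of one way to avoid losing the request (illustrative names only, not necessarily the committed change): remember release requests that arrive before the container has been recovered and apply them once recovery happens.

{code}
private final Set<ContainerId> pendingRelease = new HashSet<ContainerId>();

void handleReleaseRequest(ContainerId id) {
  RMContainer container = getRMContainer(id);
  if (container == null) {
    pendingRelease.add(id);          // not recovered yet; remember the release
  } else {
    completeContainer(container);    // normal release path
  }
}

void onContainerRecovered(RMContainer container) {
  if (pendingRelease.remove(container.getContainerId())) {
    completeContainer(container);    // apply the deferred release
  }
}
{code}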



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-08-20 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103919#comment-14103919
 ] 

Allen Wittenauer commented on YARN-796:
---

bq. set labels on yarn-site.xml in each NM, and NM will report such labels to RM

This breaks configuration management; changing the yarn-site.xml on a per-node 
basis means ops folks will lose the ability to use system tools to verify the 
file's integrity (e.g., rpm -V).  

bq. If it's not, could you please give me more details about what is dynamic 
labels generated from an admin on the NM in your thinking

As I've said before, I basically want something similar to the health check 
code: I provide something executable that the NM can run at runtime that will 
provide the list of labels. If we need to add labels, it's updating the script 
which is a much smaller footprint than redeploying HADOOP_CONF_DIR everywhere.
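Roughly the kind of hook being asked for, as a sketch (the method and parsing are hypothetical, not part of the posted patch): the NM runs an admin-provided executable and reports whatever labels it prints on one line.

{code}
// Hypothetical NM-side helper: run the configured script and parse a single
// whitespace/comma-separated line of labels from its stdout.
Set<String> readLabelsFromScript(String scriptPath) throws Exception {
  Process p = new ProcessBuilder(scriptPath).start();
  BufferedReader r =
      new BufferedReader(new InputStreamReader(p.getInputStream()));
  String line = r.readLine();
  p.waitFor();
  Set<String> labels = new HashSet<String>();
  if (line != null) {
    for (String label : line.split("[,\\s]+")) {
      if (!label.isEmpty()) {
        labels.add(label);
      }
    }
  }
  return labels;
}
{code}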

 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: LabelBasedScheduling.pdf, 
 Node-labels-Requirements-Design-doc-V1.pdf, 
 Node-labels-Requirements-Design-doc-V2.pdf, YARN-796.node-label.demo.patch.1, 
 YARN-796.patch, YARN-796.patch4


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2345) yarn rmadmin -report

2014-08-20 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2345:
---

Assignee: Hao Gao

 yarn rmadmin -report
 

 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie

 It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2345) yarn rmadmin -report

2014-08-20 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2345:
---

Component/s: resourcemanager
 nodemanager

 yarn rmadmin -report
 

 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie

 It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2345) yarn rmadmin -report

2014-08-20 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104002#comment-14104002
 ] 

Allen Wittenauer commented on YARN-2345:


I've made [~haogao] a contributor and assigned this jira. 

 yarn rmadmin -report
 

 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie

 It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect

2014-08-20 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104067#comment-14104067
 ] 

Jason Lowe commented on YARN-2034:
--

+1, committing this.

 Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
 

 Key: YARN-2034
 URL: https://issues.apache.org/jira/browse/YARN-2034
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Chen He
Priority: Minor
  Labels: documentation
 Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch


 The description in yarn-default.xml for 
 yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per 
 local directory, but according to the code it's a setting for the entire node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104155#comment-14104155
 ] 

Owen O'Malley commented on YARN-2424:
-

This is a pretty clear case of trying to fix the breakage from YARN-1253. Yahoo 
ran clusters for a year with LCE before security was turned on and got 
significant value from that. The largest benefit is that it prevents "killall -9 
java"-type mistakes on the part of users. (Yes, that did actually happen.)

 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104179#comment-14104179
 ] 

Alejandro Abdelnur commented on YARN-2424:
--

I disagree on YARN-1253 being a breakage. 

Personally, I would never recommend using this in production. Given that, I'm 
OK with the patch if:

* the NM logs print a WARN at startup stating the setting.
* the container stdout/stderr also prints a WARN to alert the user of the 
setting.



 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104196#comment-14104196
 ] 

Owen O'Malley commented on YARN-2424:
-

Alejandro, after I told you that users have run in production with that 
setting, it is very rude to say that removing the feature is not breakage. It 
is *obviously* breakage.

A warning makes sense, but it should only be once when the ResourceManager 
boots. It is a system level configuration and warning more than once is wrong.

 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2409) Active to StandBy transition does not stop rmDispatcher that causes 1 AsyncDispatcher thread leak.

2014-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104224#comment-14104224
 ] 

Hudson commented on YARN-2409:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1843 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1843/])
YARN-2409. RM ActiveToStandBy transition missing stoping previous rmDispatcher. 
Contributed by Rohith (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1618915)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java


 Active to StandBy transition does not stop rmDispatcher that causes 1 
 AsyncDispatcher thread leak. 
 ---

 Key: YARN-2409
 URL: https://issues.apache.org/jira/browse/YARN-2409
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0
Reporter: Nishan Shetty
Assignee: Rohith
Priority: Critical
 Fix For: 2.6.0

 Attachments: YARN-2409.patch


 {code}
   at java.lang.Thread.run(Thread.java:662)
 2014-08-12 07:03:00,839 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 STATUS_UPDATE at LAUNCHED
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:697)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:779)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:760)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
   at java.lang.Thread.run(Thread.java:662)
 2014-08-12 07:03:00,839 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 CONTAINER_ALLOCATED at LAUNCHED
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:697)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:779)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:760)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
   at java.lang.Thread.run(Thread.java:662)
 2014-08-12 07:03:00,839 ERROR org.apache.hadoop.ya
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2249) AM release request may be lost on RM restart

2014-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104222#comment-14104222
 ] 

Hudson commented on YARN-2249:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1843 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1843/])
YARN-2249. Avoided AM release requests being lost on work preserving RM 
restart. Contributed by Jian He. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618972)
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java


 AM release request may be lost on RM restart
 

 Key: YARN-2249
 URL: https://issues.apache.org/jira/browse/YARN-2249
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Fix For: 2.6.0

 Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, 
 YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch, YARN-2249.5.patch


 AM resync on RM restart will send outstanding container release requests back 
 to the new RM. In the meantime, NMs report the container statuses back to RM 
 to recover the containers. If RM receives the container release request  
 before the container is actually recovered in scheduler, the container won't 
 be released and the release request will be lost.
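
A minimal sketch of the buffering idea (illustrative only; the class, field, and
method names here are hypothetical and not the actual YARN-2249 patch):

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: remember AM release requests that arrive before the
// corresponding container has been recovered, and honor them once the NM
// reports the container back after RM restart.
class ReleaseBuffer {
  private final Set<String> pendingRelease = new HashSet<String>();
  private final Map<String, Object> liveContainers =
      new HashMap<String, Object>();

  synchronized void onAmRelease(String containerId) {
    if (liveContainers.remove(containerId) == null) {
      // Not recovered yet; keep the request instead of dropping it.
      pendingRelease.add(containerId);
    }
  }

  synchronized void onContainerRecovered(String containerId, Object container) {
    if (pendingRelease.remove(containerId)) {
      return;  // the AM already asked for this container to be released
    }
    liveContainers.put(containerId, container);
  }
}
{code}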



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104257#comment-14104257
 ] 

Alejandro Abdelnur edited comment on YARN-2424 at 8/20/14 6:10 PM:
---

I disagree that I'm being rude (or very rude) just for disagreeing with something. 
IMO security fixes trump backwards compatibility.

Anyway, I'm -0 on the patch if the WARNs are printed in the RM at startup 
as Owen suggests. I insist that the WARN should be in the stderr/stdout of 
every container. Otherwise this will go completely unnoticed by users running 
apps. It should be obvious to them that they are exposed.



was (Author: tucu00):
I disagree in me being rude (or very rude) just for disagreeing with something. 
IMO security fixes trump backwards compatibility.

Anyway, I'm -0 with the patch if the WARNs are printed in in the RM at startup 
as Owen suggests. I insists that the WARN should be in the stderr/stdout of 
every container. Otherwise this will go completely unnoticed to users running 
apps. It should be obvious to them that they are exposed.


 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104257#comment-14104257
 ] 

Alejandro Abdelnur commented on YARN-2424:
--

I disagree that I'm being rude (or very rude) just for disagreeing with something. 
IMO security fixes trump backwards compatibility.

Anyway, I'm -0 on the patch if the WARNs are printed in the RM at startup 
as Owen suggests. I insist that the WARN should be in the stderr/stdout of 
every container. Otherwise this will go completely unnoticed by users running 
apps. It should be obvious to them that they are exposed.


 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104303#comment-14104303
 ] 

Allen Wittenauer commented on YARN-2424:


bq. It should be obvious to them that they are exposed.

Then we should return a WARN whenever isSecurityEnabled returns false since 
that's the only way they are secure.
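
For context, the kind of one-time check being discussed could look roughly like
this (a sketch only; the class name and message are hypothetical, while
UserGroupInformation.isSecurityEnabled() is the existing Hadoop API):

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical sketch of a startup warning emitted when security is off.
class InsecureModeWarning {
  private static final Log LOG = LogFactory.getLog(InsecureModeWarning.class);

  static void warnIfInsecure() {
    if (!UserGroupInformation.isSecurityEnabled()) {
      LOG.warn("Kerberos security is not enabled; containers are not "
          + "isolated the way they would be on a secure cluster.");
    }
  }
}
{code}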

 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104314#comment-14104314
 ] 

Alejandro Abdelnur commented on YARN-2424:
--

If you don't have to kinit, it is obvious security is OFF, no?

 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104322#comment-14104322
 ] 

Allen Wittenauer commented on YARN-2424:


Apparently not, given:

bq. Otherwise this will go completely unnoticed by users running apps.



 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2174) Enabling HTTPs for the writer REST API of TimelineServer

2014-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104331#comment-14104331
 ] 

Hudson commented on YARN-2174:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6089 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6089/])
YARN-2174. Enable HTTPs for the writer REST API of TimelineServer. Contributed 
by Zhijie Shen (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619160)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServicesWithSSL.java


 Enabling HTTPs for the writer REST API of TimelineServer
 

 Key: YARN-2174
 URL: https://issues.apache.org/jira/browse/YARN-2174
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Fix For: 2.6.0

 Attachments: YARN-2174.1.patch, YARN-2174.2.patch, YARN-2174.3.patch


 Since we'd like to allow the application to put the timeline data at the 
 client, the AM and even the containers, we need to provide a way to 
 distribute the keystore.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect

2014-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104332#comment-14104332
 ] 

Hudson commented on YARN-2034:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6089 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6089/])
YARN-2034. Description for yarn.nodemanager.localizer.cache.target-size-mb is 
incorrect. Contributed by Chen He (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619176)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


 Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
 

 Key: YARN-2034
 URL: https://issues.apache.org/jira/browse/YARN-2034
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Chen He
Priority: Minor
  Labels: documentation
 Fix For: 3.0.0, 2.6.0

 Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch


 The description in yarn-default.xml for 
 yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per 
 local directory, but according to the code it's a setting for the entire node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode

2014-08-20 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104391#comment-14104391
 ] 

Jonathan Eagles commented on YARN-2035:
---

[~zjshen], can you take a quick look at this? This has been a little bit of a 
pain for testing since it can't come up when the namenode is in safemode.

 FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
 ---

 Key: YARN-2035
 URL: https://issues.apache.org/jira/browse/YARN-2035
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2035.patch


 Small bug that prevents ResourceManager and ApplicationHistoryService from 
 coming up while Namenode is in safemode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2174) Enabling HTTPs for the writer REST API of TimelineServer

2014-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104426#comment-14104426
 ] 

Hudson commented on YARN-2174:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1869 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1869/])
YARN-2174. Enable HTTPs for the writer REST API of TimelineServer. Contributed 
by Zhijie Shen (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619160)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServicesWithSSL.java


 Enabling HTTPs for the writer REST API of TimelineServer
 

 Key: YARN-2174
 URL: https://issues.apache.org/jira/browse/YARN-2174
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Fix For: 2.6.0

 Attachments: YARN-2174.1.patch, YARN-2174.2.patch, YARN-2174.3.patch


 Since we'd like to allow the application to put the timeline data at the 
 client, the AM and even the containers, we need to provide a way to 
 distribute the keystore.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect

2014-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104427#comment-14104427
 ] 

Hudson commented on YARN-2034:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1869 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1869/])
YARN-2034. Description for yarn.nodemanager.localizer.cache.target-size-mb is 
incorrect. Contributed by Chen He (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619176)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


 Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
 

 Key: YARN-2034
 URL: https://issues.apache.org/jira/browse/YARN-2034
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Chen He
Priority: Minor
  Labels: documentation
 Fix For: 3.0.0, 2.6.0

 Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch


 The description in yarn-default.xml for 
 yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per 
 local directory, but according to the code it's a setting for the entire node.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE

2014-08-20 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104464#comment-14104464
 ] 

Tsuyoshi OZAWA commented on YARN-1919:
--

[~kkambatl], could you take a look, please?

 Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
 --

 Key: YARN-1919
 URL: https://issues.apache.org/jira/browse/YARN-1919
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Devaraj K
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Attachments: YARN-1919.1.patch, YARN-1919.2.patch


 {code:xml}
 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When 
 stopping the service 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2432) RMStateStore should process the pending events before close

2014-08-20 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2432:
---

Attachment: YARN-2432.patch

 RMStateStore should process the pending events before close
 ---

 Key: YARN-2432
 URL: https://issues.apache.org/jira/browse/YARN-2432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-2432.patch


 Refer to discussion on YARN-2136 
 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266).
  
 As pointed out by [~jianhe], we should process the dispatcher event queue 
 before closing the state store by flipping over the following statements in 
 code.
 {code:title=RMStateStore.java|borderStyle=solid}
  protected void serviceStop() throws Exception {
 closeInternal();
 dispatcher.stop();
   }
 {code}
 Currently, if the state store is being stopped on events such as switching to 
 standby, it will first close the state store(in case of ZKRMStateStore, close 
 connection with ZK) and then process the pending events. Instead, we should 
 first process the pending events and then call close.
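
 For clarity, the flipped ordering described above would look roughly like this 
 (a sketch of the proposed change, reusing the names from the snippet above):
 {code}
  protected void serviceStop() throws Exception {
    dispatcher.stop();   // drain and process pending store/update events first
    closeInternal();     // then close the store (e.g. ZK connection for ZKRMStateStore)
  }
 {code}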



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode

2014-08-20 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104587#comment-14104587
 ] 

Zhijie Shen commented on YARN-2035:
---

[~jeagles], is the problematic scenario that the NN and the TimelineServer (TS) 
start around the same time? Then, while the NN is still in safe mode, the TS 
tries to create a directory on it, resulting in a SafeModeException.

In the patch, checking whether the dir exists seems necessary. Moreover, shall 
we do something similar to what we did for the MR job history server? See 
HistoryFileManager#serviceInit.
{code}
long maxFSWaitTime = conf.getLong(
JHAdminConfig.MR_HISTORY_MAX_START_WAIT_TIME,
JHAdminConfig.DEFAULT_MR_HISTORY_MAX_START_WAIT_TIME);
createHistoryDirs(new SystemClock(), 10 * 1000, maxFSWaitTime);
{code}
createHistoryDirs keeps retrying directory creation until the waiting time is used up.
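
A rough sketch of applying the same wait-and-retry idea to the history store's 
working directory (illustrative only; the class name, method name, and 
parameters are assumptions, not part of the attached patch):

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: keep trying to create the working directory until it
// succeeds or the wait budget is exhausted (e.g. while the NN is in safe mode).
class WorkingDirInit {
  static void createWithRetry(FileSystem fs, Path workingDir,
      long maxWaitMs, long retryIntervalMs)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + maxWaitMs;
    while (true) {
      try {
        if (!fs.exists(workingDir)) {
          fs.mkdirs(workingDir);
        }
        return;                                   // directory is usable
      } catch (IOException e) {                   // e.g. SafeModeException
        if (System.currentTimeMillis() >= deadline) {
          throw e;                                // give up after maxWaitMs
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
{code}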

 FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
 ---

 Key: YARN-2035
 URL: https://issues.apache.org/jira/browse/YARN-2035
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2035.patch


 Small bug that prevents ResourceManager and ApplicationHistoryService from 
 coming up while Namenode is in safemode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode

2014-08-20 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104608#comment-14104608
 ] 

Jonathan Eagles commented on YARN-2035:
---

In my scenario, the dir already exists, so I don't want to crash trying to 
create an existing dir. The code you mentioned could be helpful for first-time 
startup, but it's a slightly different scenario from the one I care about. Let 
me know if you think we should handle that as part of this jira or separately.

 FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
 ---

 Key: YARN-2035
 URL: https://issues.apache.org/jira/browse/YARN-2035
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2035.patch


 Small bug that prevents ResourceManager and ApplicationHistoryService from 
 coming up while Namenode is in safemode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2395) Fair Scheduler : implement fair share preemption at parent queue based on fairSharePreemptionTimeout

2014-08-20 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2395:
--

Attachment: YARN-2395-1.patch

Discussed with Karthik offline. We agreed on the solution that each queue can 
specify its own fairSharePreemptionTimeout. If not specified, the queue 
inherits the value from its parent queue.

Another issue here: I removed the old defaultFairSharePreemptionTimeout and 
added a new one, rootFairSharePreemptionTimeout, which configures the timeout 
value for the root queue. I didn't use the name 
defaultFairSharePreemptionTimeout because it may confuse users into thinking 
that a queue will use this value when not configured, which is not true; the 
queue takes the value from its parent queue instead.
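
A small sketch of that inheritance rule (illustrative only, not the attached 
patch): a queue uses its own timeout if configured, otherwise the nearest 
configured ancestor's, falling back to the root value.

{code}
// Hypothetical model of per-queue fairSharePreemptionTimeout inheritance.
class PreemptionTimeouts {
  static class Queue {
    Queue parent;                        // null for the root queue
    long fairSharePreemptionTimeout;     // <= 0 means "not configured"
  }

  static long effectiveTimeout(Queue queue, long rootFairSharePreemptionTimeout) {
    for (Queue q = queue; q != null; q = q.parent) {
      if (q.fairSharePreemptionTimeout > 0) {
        return q.fairSharePreemptionTimeout;
      }
    }
    return rootFairSharePreemptionTimeout;  // root queue's configured value
  }
}
{code}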

 Fair Scheduler : implement fair share preemption at parent queue based on 
 fairSharePreemptionTimeout
 

 Key: YARN-2395
 URL: https://issues.apache.org/jira/browse/YARN-2395
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2395-1.patch


 Currently in fair scheduler, the preemption logic considers fair share 
 starvation only at leaf queue level. This jira is created to implement it at 
 the parent queue as well.
 It involves :
 1. Making check for fair share starvation and amount of resource to 
 preempt  recursive such that they traverse the queue hierarchy from root to 
 leaf.
 2. Currently fairSharePreemptionTimeout is a global config. We could make it 
 configurable on a per queue basis,so that we can specify different timeouts 
 for parent queues.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode

2014-08-20 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104661#comment-14104661
 ] 

Tsuyoshi OZAWA commented on YARN-2035:
--

Hi [~jeagles], how about adding tests like the following to cover the scenario, 
along with a helper method like {{initRootPath(fs, path)}} that makes the 
FileSystem object injectable? 

{code}
  @Test
  public void testInitExistingWorkingDirectoryInSafeMode() throws IOException {
    LOG.info("Starting testInitExistingWorkingDirectoryInSafeMode");
    store.stop();
    doThrow(new IOException("emulating safe mode exception")).when(fs)
        .mkdirs(any(Path.class));

    FileSystemApplicationHistoryStore store =
        new FileSystemApplicationHistoryStore();
    try {
      store.initRootPath(fs, fsWorkingPath);
    } catch (Exception e) {
      Assert.fail("Exception should not be thrown: " + e);
    }
  }

  @Test
  public void testInitNonExistingWorkingDirectoryInSafeMode()
      throws IOException {
    LOG.info("Starting testInitNonExistingWorkingDirectoryInSafeMode");
    store.stop();
    fs.delete(fsWorkingPath, true);
    doThrow(new IOException("emulating safe mode exception")).when(fs)
        .mkdirs(any(Path.class));

    FileSystemApplicationHistoryStore store =
        new FileSystemApplicationHistoryStore();
    try {
      store.initRootPath(fs, fsWorkingPath);
      Assert.fail("Exception should be thrown");
    } catch (Exception e) {
      // expected behavior.
    }
  }
{code}


 FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
 ---

 Key: YARN-2035
 URL: https://issues.apache.org/jira/browse/YARN-2035
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2035.patch


 Small bug that prevents ResourceManager and ApplicationHistoryService from 
 coming up while Namenode is in safemode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1919) Potential NPE in EmbeddedElectorService#stop

2014-08-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104681#comment-14104681
 ] 

Karthik Kambatla commented on YARN-1919:


+1. Committing this. 

 Potential NPE in EmbeddedElectorService#stop
 

 Key: YARN-1919
 URL: https://issues.apache.org/jira/browse/YARN-1919
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Devaraj K
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Attachments: YARN-1919.1.patch, YARN-1919.2.patch


 {code:xml}
 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When 
 stopping the service 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1919) Potential NPE in EmbeddedElectorService#stop

2014-08-20 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1919:
---

Summary: Potential NPE in EmbeddedElectorService#stop  (was: Log 
yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE)

 Potential NPE in EmbeddedElectorService#stop
 

 Key: YARN-1919
 URL: https://issues.apache.org/jira/browse/YARN-1919
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Devaraj K
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Attachments: YARN-1919.1.patch, YARN-1919.2.patch


 {code:xml}
 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When 
 stopping the service 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1919) Potential NPE in EmbeddedElectorService#stop

2014-08-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104695#comment-14104695
 ] 

Karthik Kambatla commented on YARN-1919:


Thanks [~ozawa] for this fix. Just committed this to trunk and branch-2. 

 Potential NPE in EmbeddedElectorService#stop
 

 Key: YARN-1919
 URL: https://issues.apache.org/jira/browse/YARN-1919
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Devaraj K
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Attachments: YARN-1919.1.patch, YARN-1919.2.patch


 {code:xml}
 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When 
 stopping the service 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1919) Potential NPE in EmbeddedElectorService#stop

2014-08-20 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104696#comment-14104696
 ] 

Tsuyoshi OZAWA commented on YARN-1919:
--

Thanks Jian and Karthik for your review.

 Potential NPE in EmbeddedElectorService#stop
 

 Key: YARN-1919
 URL: https://issues.apache.org/jira/browse/YARN-1919
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Devaraj K
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Fix For: 2.6.0

 Attachments: YARN-1919.1.patch, YARN-1919.2.patch


 {code:xml}
 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When 
 stopping the service 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-08-20 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-415:


Attachment: YARN-415.201408181938.txt

Reattaching the latest patch in order to trigger Hadoop QA.

[~jianhe], thank you for all of your help and input. This patch will charge 
container usage to the current attempt, whether the container is running or 
completed. Will you please take a look at it again?

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
 YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
 YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
 YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
 YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
 YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
 YARN-415.201408181938.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.
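
 To make the proposed unit concrete, here is a toy computation of the 
 MB-seconds metric described above (purely illustrative; the data layout and 
 class name are hypothetical):
 {code}
// Hypothetical example: memory chargeback = sum of reservedMB * lifetimeSeconds.
public class MemorySecondsExample {
  // each row: { reservedMemoryMb, startTimeMs, finishTimeMs }
  static long memoryMbSeconds(long[][] containers) {
    long total = 0;
    for (long[] c : containers) {
      total += c[0] * ((c[2] - c[1]) / 1000);
    }
    return total;
  }

  public static void main(String[] args) {
    long[][] containers = {
        { 2048, 0, 600000 },   // 2048 MB reserved for 600 seconds
        { 1024, 0, 120000 }    // 1024 MB reserved for 120 seconds
    };
    // 2048*600 + 1024*120 = 1351680 MB-seconds
    System.out.println(memoryMbSeconds(containers));
  }
}
 {code}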



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-20 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-1458:


Attachment: YARN-1458.001.patch

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when 
 clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
 cluster for days to reproduce it. The output of the jstack command on the 
 resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104744#comment-14104744
 ] 

zhihai xu commented on YARN-1458:
-

I uploaded a new patch, YARN-1458.001.patch, which avoids losing precision in 
the type conversion from double to integer.
[~sandyr], could you review it? Thanks.
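
As a generic illustration of the kind of issue a double-to-int conversion can 
cause (this is only an example of the pitfall, not the contents of the 
attached patch):

{code}
// Hypothetical demo: a plain (int) cast silently saturates for large values,
// while widening to long preserves the magnitude.
public class DoubleToIntDemo {
  public static void main(String[] args) {
    double product = 3.0e10;              // e.g. a large weight-to-resource product
    System.out.println((int) product);    // 2147483647 (saturated at Integer.MAX_VALUE)
    System.out.println((long) product);   // 30000000000
  }
}
{code}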

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when 
 clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
 cluster for days to reproduce it. The output of the jstack command on the 
 resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2394) Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue

2014-08-20 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2394:
--

Attachment: YARN-2394-2.patch

Updated the patch following the same approach as YARN-2395. Each queue inherits 
fairSharePreemptionThreshold from its parent queue if it isn't configured in 
the allocation file. Will rebase the patch once YARN-2395 is in.

 Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue
 

 Key: YARN-2394
 URL: https://issues.apache.org/jira/browse/YARN-2394
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2394-1.patch, YARN-2394-2.patch


 Preemption based on fair share starvation happens when usage of a queue is 
 less than 50% of its fair share. This 50% is hardcoded. We'd like to make 
 this configurable on a per queue basis, so that we can choose the threshold 
 at which we want to preempt. Calling this config 
 fairSharePreemptionThreshold. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2179) Initial cache manager structure and context

2014-08-20 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2179:
---

Attachment: YARN-2179-trunk-v4.patch

Rebase again for shell changes.

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
 YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2433) Stale token used by restarted AM (with previous containers retained) to request new container

2014-08-20 Thread Yingda Chen (JIRA)
Yingda Chen created YARN-2433:
-

 Summary: Stale token used by restarted AM (with previous 
containers retained) to request new container
 Key: YARN-2433
 URL: https://issues.apache.org/jira/browse/YARN-2433
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.1, 2.4.0
Reporter: Yingda Chen


With Hadoop 2.4, container retention is supported across AM crash-and-restart. 
However, after an AM is restarted with containers retained, it appears to use 
the stale token to start new containers. This leads to the error below. To 
truly support container retention, the AM should be able to communicate with 
previous container(s) using the old token and ask for new containers with the 
new token. 

This could be similar to YARN-1321 which was reported and fixed earlier.

ERROR: 
Unauthorized request to start container. \nNMToken for application attempt : 
appattempt_1408130608672_0065_01 was used for starting container with 
container token issued for application attempt : 
appattempt_1408130608672_0065_02

STACK trace:

hadoop.ipc.ProtobufRpcEngine$Invoker.invoke 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0 | 103: 
Response - YINGDAC1.redmond.corp.microsoft.com/10.121.136.231:45454: 
startContainers {services_meta_data { key: mapreduce_shuffle value: 
\000\0004\372 } failed_requests { container_id { app_attempt_id { 
application_id { id: 65 cluster_timestamp: 1408130608672 } attemptId: 2 } id: 2 
} exception { message: Unauthorized request to start container. \nNMToken for 
application attempt : appattempt_1408130608672_0065_01 was used for 
starting container with container token issued for application attempt : 
appattempt_1408130608672_0065_02 trace: 
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start 
container. \nNMToken for application attempt : 
appattempt_1408130608672_0065_01 was used for starting container with 
container token issued for application attempt : 
appattempt_1408130608672_0065_02\r\n\tat 
org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:48)\r\n\tat 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeStartRequest(ContainerManagerImpl.java:508)\r\n\tat
 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainerInternal(ContainerManagerImpl.java:571)\r\n\tat
 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:538)\r\n\tat
 
org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)\r\n\tat
 
org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)\r\n\tat
 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)\r\n\tat
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)\r\n\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)\r\n\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)\r\n\tat 
java.security.AccessController.doPrivileged(Native Method)\r\n\tat 
javax.security.auth.Subject.doAs(Subject.java:415)\r\n\tat 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)\r\n\tat
 org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)\r\n class_name: 
org.apache.hadoop.yarn.exceptions.YarnException } }}








--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2433) Stale token used by restarted AM (with previous containers retained) to request new container

2014-08-20 Thread Yingda Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingda Chen updated YARN-2433:
--

Description: 
With Hadoop 2.4, container retention is supported across AM crash-and-restart. 
However, after an AM is restarted with containers retained, it appears to use 
the stale token to start new containers. This leads to the error below. To 
truly support container retention, the AM should be able to communicate with 
previous container(s) using the old token and ask for new containers with the 
new token. 

This could be similar to YARN-1321 which was reported and fixed earlier.

ERROR: 
Unauthorized request to start container. \nNMToken for application attempt : 
appattempt_1408130608672_0065_01 was used for starting container with 
container token issued for application attempt : 
appattempt_1408130608672_0065_02

STACK trace:
{code}
hadoop.ipc.ProtobufRpcEngine$Invoker.invoke 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0 | 103: 
Response - YINGDAC1.redmond.corp.microsoft.com/10.121.136.231:45454: 
startContainers {services_meta_data { key: mapreduce_shuffle value: 
\000\0004\372 } failed_requests { container_id { app_attempt_id { 
application_id { id: 65 cluster_timestamp: 1408130608672 } attemptId: 2 } id: 2 
} exception { message: Unauthorized request to start container. \nNMToken for 
application attempt : appattempt_1408130608672_0065_01 was used for 
starting container with container token issued for application attempt : 
appattempt_1408130608672_0065_02 trace: 
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start 
container. \nNMToken for application attempt : 
appattempt_1408130608672_0065_01 was used for starting container with 
container token issued for application attempt : 
appattempt_1408130608672_0065_02\r\n\tat 
org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:48)\r\n\tat 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeStartRequest(ContainerManagerImpl.java:508)\r\n\tat
 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainerInternal(ContainerManagerImpl.java:571)\r\n\tat
 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:538)\r\n\tat
 
org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)\r\n\tat
 
org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)\r\n\tat
 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)\r\n\tat
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)\r\n\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)\r\n\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)\r\n\tat 
java.security.AccessController.doPrivileged(Native Method)\r\n\tat 
javax.security.auth.Subject.doAs(Subject.java:415)\r\n\tat 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)\r\n\tat
 org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)\r\n class_name: 
org.apache.hadoop.yarn.exceptions.YarnException } }}
{code}






  was:
With Hadoop 2.4, container retention is supported across AM crash-and-restart. 
However, after an AM is restarted with containers retained, it appears to be 
using the stale token to start new container. This leads to the error below. To 
truly support container retention, AM should be able to communicate with 
previous container(s) with the old token and ask for new container with new 
token. 

This could be similar to YARN-1321 which was reported and fixed earlier.

ERROR: 
Unauthorized request to start container. \nNMToken for application attempt : 
appattempt_1408130608672_0065_01 was used for starting container with 
container token issued for application attempt : 
appattempt_1408130608672_0065_02

STACK trace:

hadoop.ipc.ProtobufRpcEngine$Invoker.invoke 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0 | 103: 
Response - YINGDAC1.redmond.corp.microsoft.com/10.121.136.231:45454: 
startContainers {services_meta_data { key: mapreduce_shuffle value: 
\000\0004\372 } failed_requests { container_id { app_attempt_id { 
application_id { id: 65 cluster_timestamp: 1408130608672 } attemptId: 2 } id: 2 
} exception { message: Unauthorized request to start container. \nNMToken for 
application attempt : appattempt_1408130608672_0065_01 was used for 
starting container with container token issued for application attempt : 
appattempt_1408130608672_0065_02 trace: 
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start 
container. \nNMToken for application attempt : 

[jira] [Updated] (YARN-2189) Admin service for cache manager

2014-08-20 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2189:
---

Attachment: YARN-2189-trunk-v3.patch

Rebase again for shell changes.

 Admin service for cache manager
 ---

 Key: YARN-2189
 URL: https://issues.apache.org/jira/browse/YARN-2189
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2189-trunk-v1.patch, YARN-2189-trunk-v2.patch, 
 YARN-2189-trunk-v3.patch


 Implement the admin service for the shared cache manager. This service is 
 responsible for handling administrative commands such as manually running a 
 cleaner task.
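 As a hedged illustration only (the interface and method names below are 
 assumptions for discussion, not the API in the attached patch), such an admin 
 service might expose an entry point like:
 {code:title=hypothetical admin interface (sketch)|borderStyle=solid}
 // Illustrative sketch of an admin-side hook for the shared cache manager.
 // Both the interface name and the method are assumptions, not the patch's API.
 public interface SharedCacheAdminOps {
   /** Manually trigger one run of the cleaner task; returns whether it was started. */
   boolean runCleanerTask();
 }
 {code}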



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-08-20 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-1492:
---

Attachment: YARN-1492-all-trunk-v3.patch

Rebase again.

 truly shared cache for jars (jobjar/libjar)
 ---

 Key: YARN-1492
 URL: https://issues.apache.org/jira/browse/YARN-1492
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.0.4-alpha
Reporter: Sangjin Lee
Assignee: Chris Trezzo
 Attachments: YARN-1492-all-trunk-v1.patch, 
 YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
 shared_cache_design.pdf, shared_cache_design_v2.pdf, 
 shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
 shared_cache_design_v5.pdf


 Currently there is the distributed cache that enables you to cache jars and 
 files so that attempts from the same job can reuse them. However, sharing is 
 limited with the distributed cache because it is normally on a per-job basis. 
 On a large cluster, sometimes copying of jobjars and libjars becomes so 
 prevalent that it consumes a large portion of the network bandwidth, not to 
 speak of defeating the purpose of bringing compute to where data is. This 
 is wasteful because in most cases code doesn't change much across many jobs.
 I'd like to propose and discuss feasibility of introducing a truly shared 
 cache so that multiple jobs from multiple users can share and cache jars. 
 This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1707) Making the CapacityScheduler more dynamic

2014-08-20 Thread Carlo Curino (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlo Curino updated YARN-1707:
---

Attachment: YARN-1707.2.patch

 Making the CapacityScheduler more dynamic
 -

 Key: YARN-1707
 URL: https://issues.apache.org/jira/browse/YARN-1707
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
  Labels: capacity-scheduler
 Attachments: YARN-1707.2.patch, YARN-1707.patch


 The CapacityScheduler is rather static at the moment, and refreshqueue 
 provides a rather heavy-handed way to reconfigure it. Moving towards 
 long-running services (tracked in YARN-896) and to enable more advanced 
 admission control and resource parcelling we need to make the 
 CapacityScheduler more dynamic. This is instrumental to the umbrella jira 
 YARN-1051.
 Concretely this requires the following changes:
 * create queues dynamically
 * destroy queues dynamically
 * dynamically change queue parameters (e.g., capacity) 
 * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
 instead of == 100% (see the sketch below)
 We limit this to LeafQueues. 
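 A minimal sketch of the relaxed validation from the last bullet (class and 
 method names are illustrative assumptions, not the actual CapacityScheduler code):
 {code:title=ChildCapacityCheck.java (illustrative sketch)|borderStyle=solid}
 // Hedged sketch: allow child capacities to sum to at most 100% instead of exactly 100%.
 public class ChildCapacityCheck {
   private static final float EPSILON = 1e-4f;   // assumed float tolerance

   static void validate(float[] childCapacities) {
     float sum = 0f;
     for (float c : childCapacities) {
       sum += c;
     }
     if (sum > 100f + EPSILON) {
       throw new IllegalArgumentException("Child queue capacities sum to " + sum
           + "%, which exceeds 100%");
     }
   }

   public static void main(String[] args) {
     validate(new float[] {40f, 30f});   // passes under the relaxed rule (70% <= 100%)
   }
 }
 {code}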



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode

2014-08-20 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104872#comment-14104872
 ] 

Zhijie Shen commented on YARN-2035:
---

bq. In my scenario, the dir already exists and so I don't want to crash trying 
to create an existing dir.

Hm... If so, maybe we can separate the issues, as we will migrate to the 
timeline store soon (YARN-2033).

 FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
 ---

 Key: YARN-2035
 URL: https://issues.apache.org/jira/browse/YARN-2035
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2035.patch


 Small bug that prevents ResourceManager and ApplicationHistoryService from 
 coming up while Namenode is in safemode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic

2014-08-20 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104873#comment-14104873
 ] 

Carlo Curino commented on YARN-1707:


This patch is a more minimal set of changes, rebased on trunk after YARN-2378 
and YARN-2389 were committed to trunk.
We also simplified the code and added more tests. The dynamic behavior is for 
PlanQueue and ReservationQueue.

 Making the CapacityScheduler more dynamic
 -

 Key: YARN-1707
 URL: https://issues.apache.org/jira/browse/YARN-1707
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
  Labels: capacity-scheduler
 Attachments: YARN-1707.2.patch, YARN-1707.patch


 The CapacityScheduler is rather static at the moment, and refreshqueue 
 provides a rather heavy-handed way to reconfigure it. Moving towards 
 long-running services (tracked in YARN-896) and to enable more advanced 
 admission control and resource parcelling we need to make the 
 CapacityScheduler more dynamic. This is instrumental to the umbrella jira 
 YARN-1051.
 Concretely this requires the following changes:
 * create queues dynamically
 * destroy queues dynamically
 * dynamically change queue parameters (e.g., capacity) 
 * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
 instead of == 100%
 We limit this to LeafQueues. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104893#comment-14104893
 ] 

Ravi Prakash commented on YARN-2424:


I reviewed the code and the changes make sense to me. I'm a +1 on the patch as 
is.

 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2056) Disable preemption at Queue level

2014-08-20 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-2056:
-

Attachment: YARN-2056.201408202039.txt

This patch keeps the 
{{yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity}} 
property as a global parameter, and then adds a per-queue property in this 
format: 
{{yarn.resourcemanager.monitor.capacity.preemption.queue-path.max_ignored_over_capacity}}

The preemption code makes two sets of passes through the queues. The first time 
through, it calculates the ideal resource allocation per queue based on 
normalized guaranteed capacity, and the second time through, it selects which 
queues' resources to preempt, taking into consideration the 
{{max_ignored_over_capacity}} setting.

In this patch, the per-queue {{...max_ignored_over_capacity}} is taken into 
consideration in the first pass to help determine which queues have resources 
available for preempting. This is necessary because without it, queues that 
could fulfill the need would otherwise be removed from the list of available 
resources. Then, for the second pass, the global 
{{...max_ignored_over_capacity}} setting is used, as before, to determine which 
of the remaining available resources to use.

This patch still requires an RM restart if the queue properties have changed.
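A minimal sketch of the per-queue lookup with a global fallback, following the 
property names above (the default value and the helper class are assumptions, 
not the attached patch):
{code:title=per-queue threshold lookup (sketch)|borderStyle=solid}
// Hedged sketch: resolve max_ignored_over_capacity for a queue, falling back to
// the global property when no per-queue value is set. The 0.1 default is an assumption.
import org.apache.hadoop.conf.Configuration;

public class PreemptionThreshold {
  private static final String BASE =
      "yarn.resourcemanager.monitor.capacity.preemption.";
  private static final String SUFFIX = ".max_ignored_over_capacity";

  static double maxIgnoredOverCapacity(Configuration conf, String queuePath) {
    double global = conf.getDouble(BASE + "max_ignored_over_capacity", 0.1);
    return conf.getDouble(BASE + queuePath + SUFFIX, global);
  }
}
{code}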

 Disable preemption at Queue level
 -

 Key: YARN-2056
 URL: https://issues.apache.org/jira/browse/YARN-2056
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Mayank Bansal
Assignee: Eric Payne
 Attachments: YARN-2056.201408202039.txt


 We need to be able to disable preemption at individual queue level



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1801) NPE in public localizer

2014-08-20 Thread Beckham007 (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104921#comment-14104921
 ] 

Beckham007 commented on YARN-1801:
--

[~jlowe] we use hadoop 2.2.0. We hit both problems, YARN-1575 and YARN-1801.
When HDFS has problems, the NPE in YARN-1801 happens; otherwise, the problem is 
the one described in YARN-1575.
We will build a version that includes the fix from YARN-1575.
Even if the assoc is null, should we still close the thread pool?

 NPE in public localizer
 ---

 Key: YARN-1801
 URL: https://issues.apache.org/jira/browse/YARN-1801
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Jason Lowe
Assignee: Hong Zhiguo
Priority: Critical
 Attachments: YARN-1801.patch


 While investigating YARN-1800 found this in the NM logs that caused the 
 public localizer to shutdown:
 {noformat}
 2014-01-23 01:26:38,655 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:addResource(651)) - Downloading public 
 rsrc:{ 
 hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
  1390440382009, FILE, null }
 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(726)) - Error: Shutting down
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
 2014-01-23 01:26:38,656 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(728)) - Public cache exiting
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-08-20 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104926#comment-14104926
 ] 

Wangda Tan commented on YARN-796:
-

bq. As I've said before, I basically want something similar to the health check 
code: I provide something executable that the NM can run at runtime that will 
provide the list of labels. If we need to add labels, it's updating the script 
which is a much smaller footprint than redeploying HADOOP_CONF_DIR everywhere.
I understand now; it's meaningful since it's a flexible way for admins to set 
labels on the NM side. Adding a {{NodeLabelCheckerService}} to the NM, similar to 
{{NodeHealthCheckerService}}, should work. I'll create a separate JIRA under this 
ticket for setting labels on the NM side and leave the design/implementation 
discussion here.
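A rough sketch of the script-based idea under discussion (everything here, 
including the class name and the output format, is an assumption for 
illustration, not an existing NM service):
{code:title=NodeLabelScriptRunner.java (sketch)|borderStyle=solid}
// Hedged sketch: run an admin-provided executable and parse its output as node labels,
// similar in spirit to how the node health script is invoked.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class NodeLabelScriptRunner {
  static Set<String> fetchLabels(String scriptPath)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder(scriptPath).redirectErrorStream(true).start();
    Set<String> labels = new HashSet<String>();
    BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8));
    try {
      String line;
      while ((line = r.readLine()) != null) {
        // assume the script prints comma-separated labels, e.g. "linux,x86_64"
        for (String raw : line.split(",")) {
          String label = raw.trim();
          if (!label.isEmpty()) {
            labels.add(label);
          }
        }
      }
    } finally {
      r.close();
    }
    p.waitFor();
    return labels;
  }
}
{code}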

 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: LabelBasedScheduling.pdf, 
 Node-labels-Requirements-Design-doc-V1.pdf, 
 Node-labels-Requirements-Design-doc-V2.pdf, YARN-796.node-label.demo.patch.1, 
 YARN-796.patch, YARN-796.patch4


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2056) Disable preemption at Queue level

2014-08-20 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104931#comment-14104931
 ] 

Wangda Tan commented on YARN-2056:
--

Maybe another way to do this is to add a per-queue config, like 
{{..queue-path.disable_preemption}}. Then, in 
{{ProportionalCapacityPreemptionPolicy#cloneQueues}}, if a queue's used 
capacity is more than its guaranteed resource and it has preemption disabled, we 
will not create a TempQueue for it.
This would not require an RM restart if a queue property changes (queue 
properties will be refreshed and the PreemptionPolicy will pick up such changes).

Does it make sense?

Thanks,
Wangda
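A hedged sketch of the check suggested above (the class, method, and parameter 
names are assumptions, not the actual {{ProportionalCapacityPreemptionPolicy}} code):
{code:title=QueuePreemptionOptOut.java (sketch)|borderStyle=solid}
// Illustrative only: a queue that is over its guarantee but has disable_preemption
// set would contribute nothing to the preemptable set (no TempQueue is built for it).
import org.apache.hadoop.conf.Configuration;

public class QueuePreemptionOptOut {
  static boolean skipQueue(Configuration conf, String queuePath,
      double usedCapacity, double guaranteedCapacity) {
    boolean disabled = conf.getBoolean(
        "yarn.resourcemanager.monitor.capacity.preemption."
            + queuePath + ".disable_preemption", false);
    return disabled && usedCapacity > guaranteedCapacity;
  }
}
{code}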

 Disable preemption at Queue level
 -

 Key: YARN-2056
 URL: https://issues.apache.org/jira/browse/YARN-2056
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Mayank Bansal
Assignee: Eric Payne
 Attachments: YARN-2056.201408202039.txt


 We need to be able to disable preemption at individual queue level



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2432) RMStateStore should process the pending events before close

2014-08-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104963#comment-14104963
 ] 

Hadoop QA commented on YARN-2432:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663208/YARN-2432.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 3 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4676//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4676//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4676//console

This message is automatically generated.

 RMStateStore should process the pending events before close
 ---

 Key: YARN-2432
 URL: https://issues.apache.org/jira/browse/YARN-2432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-2432.patch


 Refer to discussion on YARN-2136 
 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266).
  
 As pointed out by [~jianhe], we should process the dispatcher event queue 
 before closing the state store by flipping over the following statements in 
 code.
 {code:title=RMStateStore.java|borderStyle=solid}
  protected void serviceStop() throws Exception {
 closeInternal();
 dispatcher.stop();
   }
 {code}
 Currently, if the state store is being stopped on events such as switching to 
 standby, it will first close the state store(in case of ZKRMStateStore, close 
 connection with ZK) and then process the pending events. Instead, we should 
 first process the pending events and then call close.
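 For clarity, a sketch of the proposed ordering (only the two calls are swapped 
 relative to the snippet above):
 {code:title=RMStateStore.java (proposed ordering, sketch)|borderStyle=solid}
  protected void serviceStop() throws Exception {
    // drain the dispatcher queue first so pending store/update events are handled
    dispatcher.stop();
    // then close the underlying store (e.g. the ZK connection for ZKRMStateStore)
    closeInternal();
  }
 {code}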



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2434) RM should not recover containers from previously failed attempt

2014-08-20 Thread Jian He (JIRA)
Jian He created YARN-2434:
-

 Summary: RM should not recover containers from previously failed 
attempt
 Key: YARN-2434
 URL: https://issues.apache.org/jira/browse/YARN-2434
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He


If container-preserving AM restart is not enabled and AM failed during RM 
restart, RM on restart should not recover containers from previously failed 
attempt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2434) RM should not recover containers from previously failed attempt

2014-08-20 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2434:
--

Issue Type: Sub-task  (was: Bug)
Parent: YARN-556

 RM should not recover containers from previously failed attempt
 ---

 Key: YARN-2434
 URL: https://issues.apache.org/jira/browse/YARN-2434
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He

 If container-preserving AM restart is not enabled and AM failed during RM 
 restart, RM on restart should not recover containers from previously failed 
 attempt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2434) RM should not recover containers from previously failed attempt

2014-08-20 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2434:
--

Attachment: YARN-2434.1.patch

 RM should not recover containers from previously failed attempt
 ---

 Key: YARN-2434
 URL: https://issues.apache.org/jira/browse/YARN-2434
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2434.1.patch


 If container-preserving AM restart is not enabled and AM failed during RM 
 restart, RM on restart should not recover containers from previously failed 
 attempt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt

2014-08-20 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104974#comment-14104974
 ] 

Jian He commented on YARN-2434:
---

Patch to not recover containers from the previously failed attempt if 
container-preserving AM restart is not enabled.
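A hedged sketch of the intended guard during recovery (the helper and parameter 
names are assumptions for illustration, not necessarily the attached patch):
{code:title=RecoveryGuard.java (sketch)|borderStyle=solid}
// Hedged sketch: containers reported from an earlier (failed) attempt are only
// recovered when work-preserving AM restart is enabled for the application.
public class RecoveryGuard {
  static boolean shouldRecover(boolean keepContainersAcrossAttempts,
      int containerAttemptId, int currentAttemptId) {
    return keepContainersAcrossAttempts || containerAttemptId == currentAttemptId;
  }
}
{code}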

 RM should not recover containers from previously failed attempt
 ---

 Key: YARN-2434
 URL: https://issues.apache.org/jira/browse/YARN-2434
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2434.1.patch


 If container-preserving AM restart is not enabled and AM failed during RM 
 restart, RM on restart should not recover containers from previously failed 
 attempt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-08-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104976#comment-14104976
 ] 

Karthik Kambatla commented on YARN-415:
---

A quick comment before we commit this. 

IIUC, we are tracking the *allocation* and not the *utilization*. Actual 
utilization could be smaller than the amount of resources allocated (or asked 
for). Can we update the title and the corresponding class/field names 
accordingly? Also, the values are accumulated over the duration of the app. Can 
we add *aggregate* to the relevant class/field names?

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
 YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
 YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
 YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
 YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
 YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
 YARN-415.201408181938.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.
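 As a rough, hedged illustration of the proposed metric (the class and field 
 names are assumptions, not the patch's API), aggregate MB-seconds can be 
 accumulated per finished container following the formula above:
 {code:title=AggregateMemorySeconds.java (illustrative sketch)|borderStyle=solid}
 // Hedged sketch: sum of (reserved MB * container lifetime in seconds) over all
 // containers of an application.
 import java.util.Arrays;
 import java.util.List;

 public class AggregateMemorySeconds {
   static final class ContainerRecord {
     final long reservedMemoryMB, startTimeMs, finishTimeMs;
     ContainerRecord(long mb, long start, long finish) {
       this.reservedMemoryMB = mb; this.startTimeMs = start; this.finishTimeMs = finish;
     }
   }

   static long aggregateMemoryMBSeconds(List<ContainerRecord> containers) {
     long total = 0L;
     for (ContainerRecord c : containers) {
       total += c.reservedMemoryMB * ((c.finishTimeMs - c.startTimeMs) / 1000);
     }
     return total;
   }

   public static void main(String[] args) {
     // two 2048 MB containers that each ran for 60 seconds -> 245760 MB-seconds
     List<ContainerRecord> containers = Arrays.asList(
         new ContainerRecord(2048, 0, 60000),
         new ContainerRecord(2048, 0, 60000));
     System.out.println(aggregateMemoryMBSeconds(containers));
   }
 }
 {code}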



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode

2014-08-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104985#comment-14104985
 ] 

Hadoop QA commented on YARN-2035:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644022/YARN-2035.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 3 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4677//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4677//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4677//console

This message is automatically generated.

 FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
 ---

 Key: YARN-2035
 URL: https://issues.apache.org/jira/browse/YARN-2035
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2035.patch


 Small bug that prevents ResourceManager and ApplicationHistoryService from 
 coming up while Namenode is in safemode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated YARN-2424:


Attachment: Y2424-1.patch

Added a version with a log statement that warns on startup. [~tucu00], is this 
sufficient? The config docs are pretty clear about the effect of setting the 
parameter, and this should be noticed if someone is experimenting with LCE.
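Something along these lines (a sketch only; the surrounding class, the condition, 
and the message text are assumptions about what such a startup warning might look 
like):
{code:title=startup warning (sketch)|borderStyle=solid}
// Hedged sketch of a one-time warning when LCE runs without Kerberos security
// and without cgroups configured.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.security.UserGroupInformation;

public class LceStartupWarning {
  private static final Log LOG = LogFactory.getLog(LceStartupWarning.class);

  static void warnIfNonSecureNonCgroups(boolean cgroupsEnabled) {
    if (!UserGroupInformation.isSecurityEnabled() && !cgroupsEnabled) {
      LOG.warn("LinuxContainerExecutor is enabled without Kerberos security and"
          + " without cgroups; containers will run as the configured local user.");
    }
  }
}
{code}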

 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: Y2424-1.patch, YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode

2014-08-20 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104993#comment-14104993
 ] 

Alejandro Abdelnur commented on YARN-2424:
--

sure, fine, enough cycles spent on this, thx.

 LCE should support non-cgroups, non-secure mode
 ---

 Key: YARN-2424
 URL: https://issues.apache.org/jira/browse/YARN-2424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
 Attachments: Y2424-1.patch, YARN-2424.patch


 After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios.  
 This is a fairly serious regression, as turning on LCE prior to turning on 
 full-blown security is a fairly standard procedure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2432) RMStateStore should process the pending events before close

2014-08-20 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105015#comment-14105015
 ] 

Varun Saxena commented on YARN-2432:


1. No new tests are needed; the patch just flips the order of two statements.
2. The release audit warnings are unrelated to the changed code; they point to 
problems in some HDFS file.
3. The core test failure is unrelated to the code change as well.

Will cancel and submit the patch again.

 RMStateStore should process the pending events before close
 ---

 Key: YARN-2432
 URL: https://issues.apache.org/jira/browse/YARN-2432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-2432.patch


 Refer to discussion on YARN-2136 
 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266).
  
 As pointed out by [~jianhe], we should process the dispatcher event queue 
 before closing the state store by flipping over the following statements in 
 code.
 {code:title=RMStateStore.java|borderStyle=solid}
  protected void serviceStop() throws Exception {
 closeInternal();
 dispatcher.stop();
   }
 {code}
 Currently, if the state store is being stopped on events such as switching to 
 standby, it will first close the state store(in case of ZKRMStateStore, close 
 connection with ZK) and then process the pending events. Instead, we should 
 first process the pending events and then call close.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1919) Potential NPE in EmbeddedElectorService#stop

2014-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105060#comment-14105060
 ] 

Hudson commented on YARN-1919:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6091 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6091/])
YARN-1919. Potential NPE in EmbeddedElectorService#stop. (Tsuyoshi Ozawa via 
kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619251)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java


 Potential NPE in EmbeddedElectorService#stop
 

 Key: YARN-1919
 URL: https://issues.apache.org/jira/browse/YARN-1919
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Devaraj K
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Fix For: 2.6.0

 Attachments: YARN-1919.1.patch, YARN-1919.2.patch


 {code:xml}
 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When 
 stopping the service 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038)
 {code}
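 A plausible guard for this kind of NPE (a hedged sketch only; the {{elector}} 
 field name and the exact calls are assumptions about EmbeddedElectorService's 
 internals):
 {code:title=null-safe serviceStop (sketch)|borderStyle=solid}
  protected void serviceStop() throws Exception {
    // only interact with the elector if it was actually created, so that stopping a
    // partially-initialized service does not throw a NullPointerException
    if (elector != null) {
      elector.quitElection(false);
      elector.terminateConnection();
    }
    super.serviceStop();
  }
 {code}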



--
This message was sent by Atlassian JIRA
(v6.2#6252)