[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14103546#comment-14103546 ] Hadoop QA commented on YARN-1492: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662918/YARN-1492-all-trunk-v2.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4674//console This message is automatically generated. truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Chris Trezzo Attachments: YARN-1492-all-trunk-v1.patch, YARN-1492-all-trunk-v2.patch, shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2432) RMStateStore should process the pending events before close
Varun Saxena created YARN-2432: -- Summary: RMStateStore should process the pending events before close Key: YARN-2432 URL: https://issues.apache.org/jira/browse/YARN-2432 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Saxena Assignee: Varun Saxena Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping the order of the following statements in the code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being closed on events such as switching to standby, it will first close the state store (in the case of ZKRMStateStore, close the connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. -- This message was sent by Atlassian JIRA (v6.2#6252)
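For illustration, a minimal sketch of the proposed ordering (an assumption drawn from the description above, not the committed fix; it presumes that dispatcher.stop() drains the queued store/update events before returning, e.g. via AsyncDispatcher's drain-events-on-stop behaviour):
{code:title=RMStateStore.java (proposed ordering, sketch)|borderStyle=solid}
protected void serviceStop() throws Exception {
  // let the dispatcher finish the pending store/update events first
  dispatcher.stop();
  // only then close the underlying store (for ZKRMStateStore, the ZK connection)
  closeInternal();
}
{code}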
[jira] [Commented] (YARN-2136) RMStateStore can explicitly handle store/update events when fenced
[ https://issues.apache.org/jira/browse/YARN-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14103618#comment-14103618 ] Varun Saxena commented on YARN-2136: Hi [~jianhe], for flipping the order of these statements, I will raise a separate JIRA. RMStateStore can explicitly handle store/update events when fenced -- Key: YARN-2136 URL: https://issues.apache.org/jira/browse/YARN-2136 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena RMStateStore can choose to handle/ignore store/update events upfront instead of invoking more ZK operations if the state store is in the fenced state. -- This message was sent by Atlassian JIRA (v6.2#6252)
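As an illustration of that idea, here is a minimal, self-contained sketch (hypothetical class and method names, not the actual RMStateStore API): once the store is marked fenced, store/update events are dropped up front instead of triggering further ZK operations that are bound to fail.
{code}
public class FencedStoreSketch {
  enum StoreState { ACTIVE, FENCED }

  private StoreState state = StoreState.ACTIVE;

  void handleStoreEvent(String event) {
    if (state == StoreState.FENCED) {
      // short-circuit: no ZK write is attempted once the store is fenced
      System.out.println("Ignoring " + event + ": state store is fenced");
      return;
    }
    System.out.println("Storing " + event);
  }

  public static void main(String[] args) {
    FencedStoreSketch store = new FencedStoreSketch();
    store.handleStoreEvent("STORE_APP_ATTEMPT");
    store.state = StoreState.FENCED;  // e.g. after losing ZK-based leadership
    store.handleStoreEvent("UPDATE_APP");
  }
}
{code}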
[jira] [Updated] (YARN-2432) RMStateStore should process the pending events before close
[ https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2432: --- Description: Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping over the following statements in code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being stopped on events such as switching to standby, it will first close the state store(in case of ZKRMStateStore, close connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. was: Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping over the following statements in code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being closed on events such as switching to standby, it will first close the state store(in case of ZKRMStateStore, close connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. RMStateStore should process the pending events before close --- Key: YARN-2432 URL: https://issues.apache.org/jira/browse/YARN-2432 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Saxena Assignee: Varun Saxena Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping over the following statements in code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being stopped on events such as switching to standby, it will first close the state store(in case of ZKRMStateStore, close connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14103649#comment-14103649 ] Beckham007 commented on YARN-1801: -- When something goes wrong with HDFS, this error can occur. This NPE makes the NM crash, so I think we should fix this in YARN. 2014-08-20 10:21:04,004 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://...:54310/tmp/temp-793434835/tmp-707424512/CosAgent.jar, 1408501159584, FILE, null },pending,[(container_1407229860715_13071531_01_87)],18021755091999344,DOWNLOADING} java.io.FileNotFoundException: File does not exist: hdfs://...:54310/tmp/temp-793434835/tmp-707424512/CosAgent.jar 2014-08-20 10:21:04,032 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-08-20 10:21:04,032 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting 2014-08-20 10:21:04,052 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.util.concurrent.RejectedExecutionException NPE in public localizer --- Key: YARN-1801 URL: https://issues.apache.org/jira/browse/YARN-1801 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Assignee: Hong Zhiguo Priority: Critical Attachments: YARN-1801.patch While investigating YARN-1800 found this in the NM logs that caused the public localizer to shutdown: {noformat} 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null } 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reassigned YARN-1458: --- Assignee: zhihai xu In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14103678#comment-14103678 ] zhihai xu commented on YARN-1458: - The patch didn't consider that the type conversion from double to integer in computeShare loses precision. So breaking when the share is zero will cause every Schedulable's fair share to be zero if all the Schedulables' weights and min shares are less than 1. In the unit test, the queues' weights are 0.25 and 0.75 and the queues' min shares are Resources.none(). I will create a new patch. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
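To make the precision concern above concrete, here is a tiny self-contained illustration (hypothetical code, not the actual ComputeFairShares implementation): casting the double share down to an int truncates it, so weights below 1.0, such as the 0.25 and 0.75 used in the unit test, can all end up with a computed share of 0.
{code}
public class ShareTruncationSketch {
  public static void main(String[] args) {
    double[] weights = {0.25, 0.75};        // queue weights from the unit test
    double weightToResourceRatio = 1.0;     // a ratio probed by the share computation
    for (double w : weights) {
      // the double-to-int conversion drops the fractional part
      int share = (int) (w * weightToResourceRatio);
      System.out.println("weight " + w + " -> share " + share);  // prints 0 for both
    }
  }
}
{code}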
[jira] [Commented] (YARN-1800) YARN NodeManager with java.util.concurrent.RejectedExecutionException
[ https://issues.apache.org/jira/browse/YARN-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14103700#comment-14103700 ] Beckham007 commented on YARN-1800: -- [~vinodkv] [~jlowe] [~vvasudev] I think we shouldn't catch this exception. As [~jlowe] mentioned, the NM will be running in a damaged state where every public localization will fail the container. Most of those containers will fail, but since their CPU/memory are freed, other containers will be assigned to this NM and will also fail. This would decrease the throughput of the whole cluster. Maybe letting the NM crash would be a good choice. YARN NodeManager with java.util.concurrent.RejectedExecutionException - Key: YARN-1800 URL: https://issues.apache.org/jira/browse/YARN-1800 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Paul Isaychuk Assignee: Varun Vasudev Priority: Critical Fix For: 2.4.0 Attachments: apache-yarn-1800.0.patch, apache-yarn-1800.1.patch, yarn-yarn-nodemanager-host-2.log.zip Noticed this on tests running on an Apache Hadoop 2.2 cluster {code} 2014-01-23 01:30:28,575 INFO localizer.LocalizedResource (LocalizedResource.java:handle(196)) - Resource hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar transitioned from INIT to DOWNLOADING 2014-01-23 01:30:28,575 INFO localizer.LocalizedResource (LocalizedResource.java:handle(196)) - Resource hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.splitmetainfo transitioned from INIT to DOWNLOADING 2014-01-23 01:30:28,575 INFO localizer.LocalizedResource (LocalizedResource.java:handle(196)) - Resource hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.split transitioned from INIT to DOWNLOADING 2014-01-23 01:30:28,575 INFO localizer.LocalizedResource (LocalizedResource.java:handle(196)) - Resource hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.xml transitioned from INIT to DOWNLOADING 2014-01-23 01:30:28,576 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440627435, FILE, null } 2014-01-23 01:30:28,576 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(141)) - Error in dispatcher thread java.util.concurrent.RejectedExecutionException at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658) at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:678) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:583) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:525) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81) at java.lang.Thread.run(Thread.java:662) 2014-01-23 01:30:28,577
INFO event.AsyncDispatcher (AsyncDispatcher.java:dispatch(144)) - Exiting, bbye.. 2014-01-23 01:30:28,596 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped SelectChannelConnector@0.0.0.0:50060 2014-01-23 01:30:28,597 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(328)) - Applications still running : [application_1389742077466_0396] 2014-01-23 01:30:28,597 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(336)) - Wa {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
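The RejectedExecutionException itself is easy to reproduce outside YARN. A minimal, self-contained sketch (illustrative only, not NodeManager code): once the public localizer's thread pool has been shut down, any further submission through its ExecutorCompletionService is rejected, and if that exception escapes the dispatcher thread the AsyncDispatcher exits, as in the log above.
{code}
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedSubmitSketch {
  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    ExecutorCompletionService<String> queue = new ExecutorCompletionService<>(pool);

    pool.shutdownNow();  // analogous to the localizer's "Error: Shutting down" path
    try {
      // analogous to addResource() submitting one more public download
      queue.submit(() -> "download public resource");
    } catch (RejectedExecutionException e) {
      System.out.println("submit rejected after shutdown: " + e);
    }
  }
}
{code}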
[jira] [Updated] (YARN-2426) ResourceManager is not able to renew WebHDFS token when application submitted by Yarn WebService
[ https://issues.apache.org/jira/browse/YARN-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karam Singh updated YARN-2426: -- Description: Encountered this issue while using the new YARN RM WS (webservice) for application submission, on a single-node cluster while submitting a Distributed Shell application. For this we need to pass a custom script and the AppMaster jar along with a webhdfs token. The application was failing because the ResourceManager was failing to renew the token for the user (appOwner), so the RM was rejecting the application with the following exception trace in the RM log: {code} 2014-08-19 03:12:54,733 WARN security.DelegationTokenRenewer (DelegationTokenRenewer.java:handleDTRenewerAppSubmitEvent(661)) - Unable to add the application to the delegation token renewer. java.io.IOException: Failed to renew token: Kind: WEBHDFS delegation, Service: NNHOST:FSPORT, Ident: (WEBHDFS delegation token for hrt_qa) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:394) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$5(DelegationTokenRenewer.java:357) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:657) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:638) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Unexpected HTTP response: code=-1 != 200, op=RENEWDELEGATIONTOKEN, message=null at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:331) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:90) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:598) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:448) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:477) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:473) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.renewDelegationToken(WebHdfsFileSystem.java:1318) at org.apache.hadoop.hdfs.web.TokenAspect$TokenManager.renew(TokenAspect.java:73) at org.apache.hadoop.security.token.Token.renew(Token.java:377) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:477) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:1) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.renewToken(DelegationTokenRenewer.java:473) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:392)
... 6 more Caused by: java.io.IOException: The error stream is null. at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.jsonParse(WebHdfsFileSystem.java:304) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:329) ... 24 more 2014-08-19 03:12:54,735 DEBUG event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRejectedEvent.EventType: APP_REJECTED {code} From the exception trace it is clear that the RM is trying to contact the Namenode on the FS port instead of the HTTP port and failing to renew the token. It looks like this is because the WEBHDFS delegation token carries the Namenode's IP and FS port instead of the HTTP address, causing the RM to contact WebHDFS on the FS port and fail to renew the token. was: Encountered this issue during using new YARN's RM WS for application submission, on single node cluster while submitting Distributed Shell application using RM WS(webservice). For this we need pass custom script and AppMaster jar along with webhdfs token to NodeManager for localization. Distributed Shell Application was failing as
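For readers unfamiliar with the token plumbing, the mismatch described above can be pictured with a short, self-contained sketch (hypothetical host and port values; this is not RM or WebHDFS code): the token's service field names the NameNode's RPC/FS endpoint, while a webhdfs renewal needs the HTTP endpoint, so the renewer ends up speaking HTTP to the wrong port.
{code}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class TokenServiceSketch {
  public static void main(String[] args) {
    Token<TokenIdentifier> token = new Token<TokenIdentifier>();
    // What the report describes: the service carries the RPC/FS port (e.g. 8020 or 54310)
    token.setService(new Text("nnhost:8020"));
    // A webhdfs renewal actually needs the NameNode HTTP address, e.g. "nnhost:50070"
    System.out.println("renewing against service = " + token.getService());
  }
}
{code}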
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.demo.patch.1 Hi guys, Thanks for your input over the past several weeks. I implemented a patch based on the design doc https://issues.apache.org/jira/secure/attachment/12662291/Node-labels-Requirements-Design-doc-V2.pdf during the past two weeks. I'd really appreciate it if you could take a look. The patch is YARN-796.node-label.demo.patch.1 (I gave it a longer name so it isn't confused with other patches).
*Already included in this patch:*
* Protocol changes for ResourceRequest and ApplicationSubmissionContext (leveraging the contribution from Yuliya's patch, thanks); also updated AMRMClient
* RMAdmin changes to dynamically update the labels of a node (add/set/remove); also updated the RMAdmin CLI
* Capacity scheduler related changes, including:
** headroom calculation, preemption, and container allocation respecting labels
** allowing the user to set the list of labels a queue can access in capacity-scheduler.xml
* A centralized node label manager that can be updated dynamically to add/set/remove labels and can store labels to the file system. It will work with RM restart/HA scenarios (similar to RMStateStore).
* Support for the {{--labels}} option in distributed shell, so we can use distributed shell to test this feature
* Related unit tests
*Will include later:*
* RM REST APIs for node labels
* Distributed configuration (set labels in the yarn-site.xml of NMs)
* Support for labels in FairScheduler
*Try this patch*
1. Create a capacity-scheduler.xml with labels accessible on queues:
{code}
      root
     /    \
    a      b
    |      |
    a1     b1

a.capacity = 50, b.capacity = 50
a1.capacity = 100, b1.capacity = 100
And a.label = red,blue; b.label = blue,green

<property>
  <name>yarn.scheduler.capacity.root.a.labels</name>
  <value>red, blue</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.labels</name>
  <value>blue, green</value>
</property>
{code}
This means queue a (and its sub-queues) CAN access labels red and blue; queue b (and its sub-queues) CAN access labels blue and green.
2. Create a node-labels.json locally; these are the initial labels on nodes (you can dynamically change them using the rmadmin CLI while the RM is running, so you don't have to do this). And set {{yarn.resourcemanager.labels.node-to-label-json.path}} to {{file:///path/to/node-labels.xml}}:
{code}
{ host1:{ labels:[red, blue] }, host2:{ labels:[blue, green] } }
{code}
This sets the red/blue labels on host1 and the blue/green labels on host2.
3. Start the YARN cluster (if you have several nodes in the cluster, you need to launch HDFS to use distributed shell).
* Submit a distributed shell job:
{code}
hadoop jar path/to/*distributedshell*.jar org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command hostname -jar path/to/*distributedshell*.jar -num_containers 10 -labels red blue -queue a1
{code}
This will run a distributed shell job and launch 10 containers; the command run is hostname and the requested label is red blue, so all containers will be allocated on host1. Some other examples:
* {{-queue a1 -labels red green}}: this will be rejected, because queue a1 cannot access label green
* {{-queue a1 -labels blue}}: some containers will be allocated on host1 and some others on host2, because both host1 and host2 carry the blue label
* {{-queue b1 -labels green}}: all containers will be allocated on host2
4. Dynamically update labels using the rmadmin CLI:
{code}
// dynamically add labels x, y to the label manager
yarn rmadmin -addLabels x,y
// dynamically set label x on node1, and labels x,y on node2
yarn rmadmin -setNodeToLabels node1:x;node2:x,y
// remove labels from the label manager, and also remove those labels from nodes
yarn rmadmin -removeLabels x
{code}
*Two more examples for node labels*
1. Labels as constraints:
{code}
Queue structure:
      root
     / |  \
    a  b   c

a has labels: WINDOWS, LINUX, GPU
b has labels: WINDOWS, LINUX, LARGE_MEM
c doesn't have labels

25 nodes in the cluster:
h1-h5:   LINUX, GPU
h6-h10:  LINUX
h11-h15: LARGE_MEM, LINUX
h16-h20: LARGE_MEM, WINDOWS
h21-h25: empty
{code}
If you want LINUX GPU resources, you should submit to queue a and set the label in the ResourceRequest to LINUX GPU. If you want LARGE_MEM resources and don't mind the OS, you can submit to queue b and set the label in the ResourceRequest to LARGE_MEM. If you want to allocate on nodes that don't have labels (h21-h25), you can submit to any queue and leave the label in the ResourceRequest empty.
2. Labels to hard-partition the cluster:
{code}
Queue structure:
      root
     / |  \
    a  b   c

a has label: MARKETING
b has label: HR
c has label: RD

15 nodes in the cluster:
h1-h5:   MARKETING
h6-h10:  HR
h11-h15: RD
{code}
Now the cluster is hard-partitioned into 3 small clusters: h1-h5 are for marketing and only queue a can use them, so you should set the label in the ResourceRequest to MARKETING. The HR/RD clusters are similar. I appreciate your
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14103885#comment-14103885 ] Allen Wittenauer commented on YARN-796: --- I might have missed it, but I don't see dynamic labels generated from an admin provided script or class on the NM listed above. That's a must have feature to make this viable for any large installation. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1801: - Target Version/s: 2.6.0 (was: 2.4.1) Affects Version/s: 2.2.0 [~beckham007] what Hadoop version corresponds to the log messages above? The logs imply it might be something close to Hadoop 2.2 since the NPE is on the same line number as originally reported. The core problem with the original NPE is that assoc should never be null unless there's a code bug, and we closed a race condition that could cause that in YARN-1575. It would be good to know if you're already running on a version that includes the fix from YARN-1575, and if not, whether you can reproduce the problem after including that fix. NPE in public localizer --- Key: YARN-1801 URL: https://issues.apache.org/jira/browse/YARN-1801 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Hong Zhiguo Priority: Critical Attachments: YARN-1801.patch While investigating YARN-1800 found this in the NM logs that caused the public localizer to shutdown: {noformat} 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null } 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2432) RMStateStore should process the pending events before clo
[ https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2432: --- Summary: RMStateStore should process the pending events before clo (was: RMStateStore should process the pending events before close) RMStateStore should process the pending events before clo - Key: YARN-2432 URL: https://issues.apache.org/jira/browse/YARN-2432 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Saxena Assignee: Varun Saxena Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping over the following statements in code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being stopped on events such as switching to standby, it will first close the state store(in case of ZKRMStateStore, close connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2432) RMStateStore should process the pending events before close
[ https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2432: --- Summary: RMStateStore should process the pending events before close (was: RMStateStore should process the pending events before clo) RMStateStore should process the pending events before close --- Key: YARN-2432 URL: https://issues.apache.org/jira/browse/YARN-2432 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Saxena Assignee: Varun Saxena Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping over the following statements in code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being stopped on events such as switching to standby, it will first close the state store(in case of ZKRMStateStore, close connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-314) Schedulers should allow resource requests of different sizes at the same priority and location
[ https://issues.apache.org/jira/browse/YARN-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-314: - Assignee: Karthik Kambatla (was: Sandy Ryza) I would like to take a stab at this. Schedulers should allow resource requests of different sizes at the same priority and location -- Key: YARN-314 URL: https://issues.apache.org/jira/browse/YARN-314 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Karthik Kambatla Fix For: 2.6.0 Currently, resource requests for the same container and locality are expected to all be the same size. While it doesn't look like it's needed for apps currently, and can be circumvented by specifying different priorities if absolutely necessary, it seems to me that the ability to request containers with different resource requirements at the same priority level should be there for the future and for completeness' sake. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2409) Active to StandBy transition does not stop rmDispatcher that causes 1 AsyncDispatcher thread leak.
[ https://issues.apache.org/jira/browse/YARN-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14103914#comment-14103914 ] Hudson commented on YARN-2409: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #652 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/652/]) YARN-2409. RM ActiveToStandBy transition missing stoping previous rmDispatcher. Contributed by Rohith (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618915) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java Active to StandBy transition does not stop rmDispatcher that causes 1 AsyncDispatcher thread leak. --- Key: YARN-2409 URL: https://issues.apache.org/jira/browse/YARN-2409 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty Assignee: Rohith Priority: Critical Fix For: 2.6.0 Attachments: YARN-2409.patch {code} at java.lang.Thread.run(Thread.java:662) 2014-08-12 07:03:00,839 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:697) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:779) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:760) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-12 07:03:00,839 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:697) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:779) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:760) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-12 07:03:00,839 ERROR org.apache.hadoop.ya {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2249) AM release request may be lost on RM restart
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14103912#comment-14103912 ] Hudson commented on YARN-2249: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #652 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/652/]) YARN-2249. Avoided AM release requests being lost on work preserving RM restart. Contributed by Jian He. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618972) * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java AM release request may be lost on RM restart Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch, YARN-2249.5.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14103919#comment-14103919 ] Allen Wittenauer commented on YARN-796: --- bq. set labels on yarn-site.xml in each NM, and NM will report such labels to RM This breaks configuration management; changing the yarn-site.xml on a per-node basis means ops folks will lose the ability to use system tools to verify the file's integrity (e.g., rpm -V). bq. If it's not, could you please give me more details about what is dynamic labels generated from an admin on the NM in your thinking As I've said before, I basically want something similar to the health check code: I provide something executable that the NM can run at runtime that will provide the list of labels. If we need to add labels, it's updating the script which is a much smaller footprint than redeploying HADOOP_CONF_DIR everywhere. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2345: --- Assignee: Hao Gao yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2345: --- Component/s: resourcemanager nodemanager yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104002#comment-14104002 ] Allen Wittenauer commented on YARN-2345: I've made [~haogao] a contributor and assigned this jira. yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104067#comment-14104067 ] Jason Lowe commented on YARN-2034: -- +1, committing this. Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Labels: documentation Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104155#comment-14104155 ] Owen O'Malley commented on YARN-2424: - This is a pretty clear case of trying to fix the breakage from YARN-1253. Yahoo ran clusters for a year with LCE before security was turned on and got significant value from that. The largest being that it prevents killall -9 java type mistakes on the part of users. (Yes that did actually happen.) LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104179#comment-14104179 ] Alejandro Abdelnur commented on YARN-2424: -- I disagree on YARN-1253 being a breakage. Personally, I would never recommend using this in production. Given that, I'm OK with the patch if: * the NM logs print a WARN at startup stating the setting. * the container stdout/stderr also prints a WARN to alert the user of the setting. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104196#comment-14104196 ] Owen O'Malley commented on YARN-2424: - Alejandro, after I told you that users have run in production with that setting, it is very rude to say that removing the feature is not breakage. It is *obviously* breakage. A warning makes sense, but it should only be once when the ResourceManager boots. It is a system level configuration and warning more than once is wrong. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2409) Active to StandBy transition does not stop rmDispatcher that causes 1 AsyncDispatcher thread leak.
[ https://issues.apache.org/jira/browse/YARN-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104224#comment-14104224 ] Hudson commented on YARN-2409: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1843 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1843/]) YARN-2409. RM ActiveToStandBy transition missing stoping previous rmDispatcher. Contributed by Rohith (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618915) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java Active to StandBy transition does not stop rmDispatcher that causes 1 AsyncDispatcher thread leak. --- Key: YARN-2409 URL: https://issues.apache.org/jira/browse/YARN-2409 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty Assignee: Rohith Priority: Critical Fix For: 2.6.0 Attachments: YARN-2409.patch {code} at java.lang.Thread.run(Thread.java:662) 2014-08-12 07:03:00,839 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:697) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:779) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:760) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-12 07:03:00,839 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:697) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:779) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:760) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-12 07:03:00,839 ERROR org.apache.hadoop.ya {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
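For the YARN-2409 leak above, the following is a minimal, self-contained sketch of the pattern: if an Active-to-Standby transition creates a new dispatcher without stopping the previous one, one event-loop thread is leaked per transition. SimpleDispatcher, the field, and the method names are hypothetical stand-ins, not the actual ResourceManager/AsyncDispatcher code.
{code}
// Hypothetical sketch of the thread-leak pattern and its fix.
public class HaTransitionSketch {

  /** Stand-in for AsyncDispatcher: one long-lived event-loop thread. */
  static class SimpleDispatcher {
    private final Thread eventLoop;
    private volatile boolean running = true;

    SimpleDispatcher(String name) {
      eventLoop = new Thread(() -> {
        while (running) {
          try {
            Thread.sleep(100);           // stand-in for waiting on the event queue
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      }, name);
    }

    void start() { eventLoop.start(); }

    void stop() throws InterruptedException {
      running = false;
      eventLoop.interrupt();
      eventLoop.join();                  // wait for the event loop to exit
    }
  }

  private SimpleDispatcher rmDispatcher;

  /** Leaky pattern: the old dispatcher thread is abandoned on every transition. */
  void transitionToStandbyLeaky() {
    rmDispatcher = new SimpleDispatcher("rm-dispatcher");   // previous instance never stopped
    rmDispatcher.start();
  }

  /** Fixed pattern: stop the previous dispatcher before creating the new one. */
  void transitionToStandbyFixed() throws InterruptedException {
    if (rmDispatcher != null) {
      rmDispatcher.stop();
    }
    rmDispatcher = new SimpleDispatcher("rm-dispatcher");
    rmDispatcher.start();
  }
}
{code}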
[jira] [Commented] (YARN-2249) AM release request may be lost on RM restart
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104222#comment-14104222 ] Hudson commented on YARN-2249: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1843 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1843/]) YARN-2249. Avoided AM release requests being lost on work preserving RM restart. Contributed by Jian He. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618972) * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java AM release request may be lost on RM restart Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch, YARN-2249.5.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
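The race described in YARN-2249 (a resent release request arriving before the scheduler has recovered the container from NM reports) can be handled by parking the release until recovery completes. The sketch below is only an illustration under that assumption, with hypothetical class and method names; it is not the actual AbstractYarnScheduler change from the patch.
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: remember release requests for containers the scheduler
// has not recovered yet, and apply them once the NM reports the container back.
public class PendingReleaseSketch {
  // Containers the AM asked to release but that are not yet known to the scheduler.
  private final Set<Long> pendingRelease = ConcurrentHashMap.newKeySet();
  // Containers currently tracked by the scheduler (recovered or newly allocated).
  private final Map<Long, String> liveContainers = new ConcurrentHashMap<>();

  /** Called when the AM resyncs and resends an outstanding release request. */
  public void releaseContainer(long containerId) {
    if (liveContainers.remove(containerId) != null) {
      System.out.println("released " + containerId);
    } else {
      // Not recovered yet: park the request instead of dropping it.
      pendingRelease.add(containerId);
    }
  }

  /** Called when an NM report recovers a container into the scheduler. */
  public void recoverContainer(long containerId, String node) {
    if (pendingRelease.remove(containerId)) {
      // The AM already asked for this container to go away; release it right after recovery.
      System.out.println("released " + containerId + " right after recovery");
      return;
    }
    liveContainers.put(containerId, node);
  }
}
{code}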
[jira] [Comment Edited] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104257#comment-14104257 ] Alejandro Abdelnur edited comment on YARN-2424 at 8/20/14 6:10 PM: --- I disagree on me being rude (or very rude) just for disagreeing with something. IMO security fixes trump backwards compatibility. Anyway, I'm -0 with the patch if the WARNs are printed in the RM at startup as Owen suggests. I insist that the WARN should be in the stderr/stdout of every container. Otherwise this will go completely unnoticed by users running apps. It should be obvious to them that they are exposed. was (Author: tucu00): I disagree in me being rude (or very rude) just for disagreeing with something. IMO security fixes trump backwards compatibility. Anyway, I'm -0 with the patch if the WARNs are printed in in the RM at startup as Owen suggests. I insists that the WARN should be in the stderr/stdout of every container. Otherwise this will go completely unnoticed to users running apps. It should be obvious to them that they are exposed. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104257#comment-14104257 ] Alejandro Abdelnur commented on YARN-2424: -- I disagree that I am being rude (or very rude) just for disagreeing with something. IMO security fixes trump backwards compatibility. Anyway, I'm -0 with the patch if the WARNs are printed in the RM at startup as Owen suggests. I insist that the WARN should be in the stderr/stdout of every container. Otherwise this will go completely unnoticed by users running apps. It should be obvious to them that they are exposed. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104303#comment-14104303 ] Allen Wittenauer commented on YARN-2424: bq. It should be obvious to them that they are exposed. Then we should return a WARN whenever isSecurityEnabled returns false since that's the only way they are secure. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104314#comment-14104314 ] Alejandro Abdelnur commented on YARN-2424: -- if you don't have to kinit it is obvious security is OFF, no? LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104322#comment-14104322 ] Allen Wittenauer commented on YARN-2424: Apparently not, given: bq. Otherwise this will go completely unnoticed to users running apps. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2174) Enabling HTTPs for the writer REST API of TimelineServer
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104331#comment-14104331 ] Hudson commented on YARN-2174: -- FAILURE: Integrated in Hadoop-trunk-Commit #6089 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6089/]) YARN-2174. Enable HTTPs for the writer REST API of TimelineServer. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619160) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServicesWithSSL.java Enabling HTTPs for the writer REST API of TimelineServer Key: YARN-2174 URL: https://issues.apache.org/jira/browse/YARN-2174 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2174.1.patch, YARN-2174.2.patch, YARN-2174.3.patch Since we'd like to allow the application to put the timeline data at the client, the AM and even the containers, we need to provide the way to distribute the keystore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104332#comment-14104332 ] Hudson commented on YARN-2034: -- FAILURE: Integrated in Hadoop-trunk-Commit #6089 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6089/]) YARN-2034. Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect. Contributed by Chen He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619176) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Labels: documentation Fix For: 3.0.0, 2.6.0 Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104391#comment-14104391 ] Jonathan Eagles commented on YARN-2035: --- [~zjshen], can you take a quick look at this? This has been a little bit of a pain for testing since it can't come up when the namenode is in safemode. FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2174) Enabling HTTPs for the writer REST API of TimelineServer
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104426#comment-14104426 ] Hudson commented on YARN-2174: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1869 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1869/]) YARN-2174. Enable HTTPs for the writer REST API of TimelineServer. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619160) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineAuthenticator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServicesWithSSL.java Enabling HTTPs for the writer REST API of TimelineServer Key: YARN-2174 URL: https://issues.apache.org/jira/browse/YARN-2174 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2174.1.patch, YARN-2174.2.patch, YARN-2174.3.patch Since we'd like to allow the application to put the timeline data at the client, the AM and even the containers, we need to provide the way to distribute the keystore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104427#comment-14104427 ] Hudson commented on YARN-2034: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1869 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1869/]) YARN-2034. Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect. Contributed by Chen He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619176) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Labels: documentation Fix For: 3.0.0, 2.6.0 Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104464#comment-14104464 ] Tsuyoshi OZAWA commented on YARN-1919: -- [~kkambatl], could you take a look, please? Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE -- Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2432) RMStateStore should process the pending events before close
[ https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2432: --- Attachment: YARN-2432.patch RMStateStore should process the pending events before close --- Key: YARN-2432 URL: https://issues.apache.org/jira/browse/YARN-2432 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2432.patch Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping over the following statements in code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being stopped on events such as switching to standby, it will first close the state store(in case of ZKRMStateStore, close connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. -- This message was sent by Atlassian JIRA (v6.2#6252)
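For illustration of the ordering proposed in YARN-2432 (drain the dispatcher first, then close the store), here is a minimal sketch. Dispatcher and Store are hypothetical stand-ins rather than the real YARN classes, and this is not the attached patch itself.
{code}
// Hypothetical sketch of the proposed serviceStop ordering for RMStateStore.
public class StateStoreStopOrderSketch {
  interface Dispatcher { void stop() throws Exception; }   // drains queued events, then stops
  interface Store { void close() throws Exception; }       // e.g. closes the ZK connection

  private final Dispatcher dispatcher;
  private final Store store;

  StateStoreStopOrderSketch(Dispatcher dispatcher, Store store) {
    this.dispatcher = dispatcher;
    this.store = store;
  }

  /** Proposed order: process pending events first, then close the backing store. */
  void serviceStop() throws Exception {
    dispatcher.stop();   // pending store/update events still have a live store to write to
    store.close();       // only now tear down the connection
  }
}
{code}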
[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104587#comment-14104587 ] Zhijie Shen commented on YARN-2035: --- [~jeagles], is the problematic scenario that NN and TimelineServer (TS) start around the same time? In that case, while NN is still in safe mode, TS tries to create a directory on it, resulting in a SafeModeException. In the patch, checking whether the dir exists seems to be necessary. Moreover, shall we do something similar to what we did for the MR job history server? See HistoryFileManager#serviceInit.
{code}
long maxFSWaitTime = conf.getLong(
    JHAdminConfig.MR_HISTORY_MAX_START_WAIT_TIME,
    JHAdminConfig.DEFAULT_MR_HISTORY_MAX_START_WAIT_TIME);
createHistoryDirs(new SystemClock(), 10 * 1000, maxFSWaitTime);
{code}
createHistoryDirs retries the dir creation until the waiting time is used up. FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)
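A rough sketch of the retry-until-timeout idea referenced in that comment (mirroring what HistoryFileManager#createHistoryDirs does). The interface, helper name, and wait/interval parameters below are illustrative assumptions, not the actual patch or the real FileSystem API.
{code}
import java.io.IOException;

// Hypothetical sketch: keep retrying directory creation while the NameNode is
// in safe mode, up to a maximum wait time, instead of failing service start.
public class SafeModeRetrySketch {

  interface Fs {                                   // stand-in for a file system client
    boolean exists(String path) throws IOException;
    void mkdirs(String path) throws IOException;   // throws while NN is in safe mode
  }

  static void createDirWithRetries(Fs fs, String path,
      long maxWaitMs, long retryIntervalMs) throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + maxWaitMs;
    while (true) {
      try {
        if (!fs.exists(path)) {
          fs.mkdirs(path);             // no-op once the dir is already there
        }
        return;                        // success: dir exists or was created
      } catch (IOException e) {        // e.g. a safe-mode error surfaced as IOException
        if (System.currentTimeMillis() >= deadline) {
          throw e;                     // give up after the configured wait time
        }
        Thread.sleep(retryIntervalMs); // wait and retry, like createHistoryDirs
      }
    }
  }
}
{code}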
[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104608#comment-14104608 ] Jonathan Eagles commented on YARN-2035: --- In my scenario, the dir already exists and so I don't want to crash trying to create an existing dir. The code you mentioned could be helpful for first-time startup, but it's a slightly different scenario from the one I care about. Let me know whether we should handle that as part of this jira or separately. FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2395) Fair Scheduler : implement fair share preemption at parent queue based on fairSharePreemptionTimeout
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2395: -- Attachment: YARN-2395-1.patch Discussed with Karthik offline. We agree on the solution that each queue can specify its own fairSharePreemptionTimeout. If not specified, the queue inherits the value from its parent queue. Another issue here: I removed the old defaultFairSharePreemptionTimeout and added a new one, rootFairSharePreemptionTimeout, which configures the timeout value for the root queue. I didn't use the name defaultFairSharePreemptionTimeout as it may lead users to think that a queue will use this value if it is not configured, which is not true; the queue takes the value from its parent queue. Fair Scheduler : implement fair share preemption at parent queue based on fairSharePreemptionTimeout Key: YARN-2395 URL: https://issues.apache.org/jira/browse/YARN-2395 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2395-1.patch Currently in fair scheduler, the preemption logic considers fair share starvation only at the leaf queue level. This jira is created to implement it at the parent queue as well. It involves : 1. Making the check for fair share starvation and the amount of resource to preempt recursive, so that they traverse the queue hierarchy from root to leaf. 2. Currently fairSharePreemptionTimeout is a global config. We could make it configurable on a per-queue basis, so that we can specify different timeouts for parent queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
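The inheritance rule described in that comment (a queue uses its own fairSharePreemptionTimeout if set, otherwise its parent's value, with the root value supplied by rootFairSharePreemptionTimeout) can be sketched as below. The class and field names are hypothetical; this is not the FairScheduler implementation.
{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of per-queue fairSharePreemptionTimeout inheritance.
public class PreemptionTimeoutSketch {

  static class Queue {
    final String name;
    final Queue parent;                       // null for the root queue
    Long fairSharePreemptionTimeout;          // null means "not configured"
    final List<Queue> children = new ArrayList<>();

    Queue(String name, Queue parent) {
      this.name = name;
      this.parent = parent;
      if (parent != null) {
        parent.children.add(this);
      }
    }

    /** Effective timeout: own value if configured, otherwise inherited from the parent. */
    long effectiveTimeout(long rootFairSharePreemptionTimeout) {
      if (fairSharePreemptionTimeout != null) {
        return fairSharePreemptionTimeout;
      }
      if (parent == null) {
        return rootFairSharePreemptionTimeout; // root falls back to the root-level config
      }
      return parent.effectiveTimeout(rootFairSharePreemptionTimeout);
    }
  }

  public static void main(String[] args) {
    Queue root = new Queue("root", null);
    Queue parent = new Queue("root.analytics", root);
    Queue leaf = new Queue("root.analytics.adhoc", parent);
    parent.fairSharePreemptionTimeout = 60_000L;            // configured on the parent only
    System.out.println(leaf.effectiveTimeout(120_000L));    // 60000, inherited from the parent
    System.out.println(root.effectiveTimeout(120_000L));    // 120000, the root-level value
  }
}
{code}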
[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104661#comment-14104661 ] Tsuyoshi OZAWA commented on YARN-2035: -- Hi [~jeagles], how about adding tests as follows to cover the scenario, by adding a helper method like {{initRootPath(fs, path)}} to make the FileSystem object injectable?
{code}
@Test
public void testInitExistingWorkingDirectoryInSafeMode() throws IOException {
  LOG.info("Starting testInitExistingWorkingDirectoryInSafeMode");
  store.stop();
  doThrow(new IOException("emulating safe mode exception")).when(fs)
      .mkdirs(any(Path.class));
  FileSystemApplicationHistoryStore store = new FileSystemApplicationHistoryStore();
  try {
    store.initRootPath(fs, fsWorkingPath);
  } catch (Exception e) {
    Assert.fail("Exception should not be thrown: " + e);
  }
}

@Test
public void testInitNonExistingWorkingDirectoryInSafeMode() throws IOException {
  LOG.info("Starting testInitNonExistingWorkingDirectoryInSafeMode");
  store.stop();
  fs.delete(fsWorkingPath, true);
  doThrow(new IOException("emulating safe mode exception")).when(fs)
      .mkdirs(any(Path.class));
  FileSystemApplicationHistoryStore store = new FileSystemApplicationHistoryStore();
  try {
    store.initRootPath(fs, fsWorkingPath);
    Assert.fail("Exception should be thrown");
  } catch (Exception e) {
    // expected behavior.
  }
}
{code}
FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1919) Potential NPE in EmbeddedElectorService#stop
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104681#comment-14104681 ] Karthik Kambatla commented on YARN-1919: +1. Committing this. Potential NPE in EmbeddedElectorService#stop Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1919) Potential NPE in EmbeddedElectorService#stop
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1919: --- Summary: Potential NPE in EmbeddedElectorService#stop (was: Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE) Potential NPE in EmbeddedElectorService#stop Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1919) Potential NPE in EmbeddedElectorService#stop
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104695#comment-14104695 ] Karthik Kambatla commented on YARN-1919: Thanks [~ozawa] for this fix. Just committed this to trunk and branch-2. Potential NPE in EmbeddedElectorService#stop Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1919) Potential NPE in EmbeddedElectorService#stop
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104696#comment-14104696 ] Tsuyoshi OZAWA commented on YARN-1919: -- Thanks Jian and Karthik for your review. Potential NPE in EmbeddedElectorService#stop Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Fix For: 2.6.0 Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201408181938.txt Reattaching latest patch in order to trigger Hadoopqa. [~jianhe], thank you for all of your help and input. This patch will charge container usage to the current attempt, whether the container is running or completed. Will you please take a look at it again? Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
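The chargeback formula in the YARN-415 description, the sum over containers of (reserved memory × container lifetime), can be computed as in the sketch below. The classes and field names are hypothetical helpers for illustration; the actual patch surfaces this metric through the RM UI and REST API rather than a standalone utility.
{code}
import java.util.List;

// Hypothetical sketch of the chargeback metric: MB-seconds of reserved memory
// aggregated over all containers of an application.
public class MemorySecondsSketch {

  static class ContainerUsage {
    final long reservedMb;     // memory reserved for the container, in MB
    final long startMillis;    // container start time
    final long finishMillis;   // container finish time (or "now" if still running)

    ContainerUsage(long reservedMb, long startMillis, long finishMillis) {
      this.reservedMb = reservedMb;
      this.startMillis = startMillis;
      this.finishMillis = finishMillis;
    }
  }

  /** Sum of reservedMb * lifetimeSeconds across all containers of the app. */
  static long memoryMbSeconds(List<ContainerUsage> containers) {
    long total = 0;
    for (ContainerUsage c : containers) {
      long lifetimeSeconds = Math.max(0, (c.finishMillis - c.startMillis) / 1000);
      total += c.reservedMb * lifetimeSeconds;
    }
    return total;
  }

  public static void main(String[] args) {
    // Two containers: 2048 MB for 300 s and 1024 MB for 600 s -> 1,228,800 MB-seconds.
    List<ContainerUsage> usage = List.of(
        new ContainerUsage(2048, 0, 300_000),
        new ContainerUsage(1024, 0, 600_000));
    System.out.println(memoryMbSeconds(usage));
  }
}
{code}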
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.001.patch In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104744#comment-14104744 ] zhihai xu commented on YARN-1458: - I uploaded a new patch YARN-1458.001.patch, which will avoid losing precision for type conversion from double to integer. [~sandyr], Could you review it? thanks In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
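To illustrate the precision issue the YARN-1458 comment refers to, here is a small self-contained example of how truncating each double term to an integer before summing can drift far below the true total, which can stall an iterative share computation that compares the truncated sum against a target. It illustrates the general double-to-int pitfall only; it is not the ComputeFairShares or FairScheduler code, nor the actual patch.
{code}
// Generic illustration of double-to-int precision loss, not YARN code.
public class TruncationSketch {
  public static void main(String[] args) {
    double[] weights = {0.3, 0.4, 0.2, 0.6, 0.9};
    double ratio = 1.0;

    int truncatedPerTerm = 0;
    double exact = 0.0;
    for (double w : weights) {
      truncatedPerTerm += (int) (w * ratio);   // each term truncates to 0 here
      exact += w * ratio;
    }

    System.out.println("sum of truncated terms = " + truncatedPerTerm); // 0
    System.out.println("truncated exact sum    = " + (int) exact);      // 2
    // A loop comparing the per-term-truncated sum against a target can make
    // little or no progress, which (per the comment above) is the kind of
    // precision loss the new patch avoids.
  }
}
{code}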
[jira] [Updated] (YARN-2394) Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue
[ https://issues.apache.org/jira/browse/YARN-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2394: -- Attachment: YARN-2394-2.patch Updated the patch following the same approach as YARN-2395. Each queue inherits fairSharePreemptionThreshold from its parent queue if it isn't configured in the allocation file. Will rebase the patch once YARN-2395 is in. Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue Key: YARN-2394 URL: https://issues.apache.org/jira/browse/YARN-2394 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2394-1.patch, YARN-2394-2.patch Preemption based on fair share starvation happens when usage of a queue is less than 50% of its fair share. This 50% is hardcoded. We'd like to make this configurable on a per-queue basis, so that we can choose the threshold at which we want to preempt. Calling this config fairSharePreemptionThreshold. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2179: --- Attachment: YARN-2179-trunk-v4.patch Rebase again for shell changes. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an scm that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2433) Stale token used by restarted AM (with previous containers retained) to request new container
Yingda Chen created YARN-2433: - Summary: Stale token used by restarted AM (with previous containers retained) to request new container Key: YARN-2433 URL: https://issues.apache.org/jira/browse/YARN-2433 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1, 2.4.0 Reporter: Yingda Chen With Hadoop 2.4, container retention is supported across AM crash-and-restart. However, after an AM is restarted with containers retained, it appears to be using the stale token to start new container. This leads to the error below. To truly support container retention, AM should be able to communicate with previous container(s) with the old token and ask for new container with new token. This could be similar to YARN-1321 which was reported and fixed earlier. ERROR: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02 STACK trace: hadoop.ipc.ProtobufRpcEngine$Invoker.invoke org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0 | 103: Response - YINGDAC1.redmond.corp.microsoft.com/10.121.136.231:45454: startContainers {services_meta_data { key: mapreduce_shuffle value: \000\0004\372 } failed_requests { container_id { app_attempt_id { application_id { id: 65 cluster_timestamp: 1408130608672 } attemptId: 2 } id: 2 } exception { message: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02 trace: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02\r\n\tat org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:48)\r\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeStartRequest(ContainerManagerImpl.java:508)\r\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainerInternal(ContainerManagerImpl.java:571)\r\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:538)\r\n\tat org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)\r\n\tat org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)\r\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)\r\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)\r\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)\r\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)\r\n\tat java.security.AccessController.doPrivileged(Native Method)\r\n\tat javax.security.auth.Subject.doAs(Subject.java:415)\r\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)\r\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)\r\n class_name: org.apache.hadoop.yarn.exceptions.YarnException } }} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2433) Stale token used by restarted AM (with previous containers retained) to request new container
[ https://issues.apache.org/jira/browse/YARN-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yingda Chen updated YARN-2433: -- Description: With Hadoop 2.4, container retention is supported across AM crash-and-restart. However, after an AM is restarted with containers retained, it appears to be using the stale token to start new container. This leads to the error below. To truly support container retention, AM should be able to communicate with previous container(s) with the old token and ask for new container with new token. This could be similar to YARN-1321 which was reported and fixed earlier. ERROR: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02 STACK trace: {code} hadoop.ipc.ProtobufRpcEngine$Invoker.invoke org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0 | 103: Response - YINGDAC1.redmond.corp.microsoft.com/10.121.136.231:45454: startContainers {services_meta_data { key: mapreduce_shuffle value: \000\0004\372 } failed_requests { container_id { app_attempt_id { application_id { id: 65 cluster_timestamp: 1408130608672 } attemptId: 2 } id: 2 } exception { message: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02 trace: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02\r\n\tat org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:48)\r\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeStartRequest(ContainerManagerImpl.java:508)\r\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainerInternal(ContainerManagerImpl.java:571)\r\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:538)\r\n\tat org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)\r\n\tat org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)\r\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)\r\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)\r\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)\r\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)\r\n\tat java.security.AccessController.doPrivileged(Native Method)\r\n\tat javax.security.auth.Subject.doAs(Subject.java:415)\r\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)\r\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)\r\n class_name: org.apache.hadoop.yarn.exceptions.YarnException } }} {code} was: With Hadoop 2.4, container retention is supported across AM crash-and-restart. However, after an AM is restarted with containers retained, it appears to be using the stale token to start new container. This leads to the error below. 
To truly support container retention, AM should be able to communicate with previous container(s) with the old token and ask for new container with new token. This could be similar to YARN-1321 which was reported and fixed earlier. ERROR: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02 STACK trace: hadoop.ipc.ProtobufRpcEngine$Invoker.invoke org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0 | 103: Response - YINGDAC1.redmond.corp.microsoft.com/10.121.136.231:45454: startContainers {services_meta_data { key: mapreduce_shuffle value: \000\0004\372 } failed_requests { container_id { app_attempt_id { application_id { id: 65 cluster_timestamp: 1408130608672 } attemptId: 2 } id: 2 } exception { message: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02 trace: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. \nNMToken for application attempt :
[jira] [Updated] (YARN-2189) Admin service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2189: --- Attachment: YARN-2189-trunk-v3.patch Rebase again for shell changes. Admin service for cache manager --- Key: YARN-2189 URL: https://issues.apache.org/jira/browse/YARN-2189 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2189-trunk-v1.patch, YARN-2189-trunk-v2.patch, YARN-2189-trunk-v3.patch Implement the admin service for the shared cache manager. This service is responsible for handling administrative commands such as manually running a cleaner task. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-1492: --- Attachment: YARN-1492-all-trunk-v3.patch Rebase again. truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Chris Trezzo Attachments: YARN-1492-all-trunk-v1.patch, YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-1707: --- Attachment: YARN-1707.2.patch Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.2.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling, we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
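One item in the list above, the relaxed refreshqueue validation, is easy to illustrate. A minimal, self-contained sketch of that check (not the actual CapacityScheduler code; class and method names are made up) would allow children to sum to at most 100% of the parent instead of exactly 100%:
{code:java}
import java.util.Arrays;
import java.util.List;

/** Toy model of the relaxed capacity check: sum(child capacities) <= 100%. */
public class QueueCapacityCheckSketch {
  private static final float EPSILON = 0.0001f;

  static void validate(List<Float> childCapacities) {
    float sum = 0f;
    for (float c : childCapacities) {
      sum += c;
    }
    // Old behavior: reject unless sum == 100%. New behavior: reject only if sum > 100%,
    // leaving headroom that dynamically created queues can later claim.
    if (sum > 100f + EPSILON) {
      throw new IllegalArgumentException("Children sum to " + sum + "% which exceeds 100%");
    }
  }

  public static void main(String[] args) {
    validate(Arrays.asList(40f, 30f)); // ok: leaves 30% unassigned for dynamic queues
    try {
      validate(Arrays.asList(60f, 50f)); // over-subscribed: rejected
    } catch (IllegalArgumentException expected) {
      System.out.println(expected.getMessage());
    }
  }
}
{code}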
[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104872#comment-14104872 ] Zhijie Shen commented on YARN-2035: --- bq. In my scenario, the dir already exists and so I don't want to crash trying to create an existing dir. Hm... If so, maybe we can separate the issues, as we will migrate to timeline store soon (YARN-2033). FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)
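A minimal sketch of the guard being discussed, assuming the store only needs to create its root directory when it is missing (this is not the actual FileSystemApplicationHistoryStore code):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative guard: skip mkdirs when the directory already exists. */
public class HistoryDirInitSketch {
  static void ensureDir(FileSystem fs, Path root) throws Exception {
    if (fs.exists(root)) {
      // Nothing to create; avoids failing at startup on a write operation
      // that is not actually needed while the NameNode is still in safemode.
      return;
    }
    fs.mkdirs(root);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    ensureDir(fs, new Path("/tmp/ahs-history-sketch")); // hypothetical path for the example
  }
}
{code}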
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104873#comment-14104873 ] Carlo Curino commented on YARN-1707: This patch is a more minimal set of changes, rebased on trunk after YARN-2378 and YARN-2389 were committed. We also simplified the code and added more tests. The dynamic behavior is for PlanQueue and ReservationQueue. Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.2.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling, we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104893#comment-14104893 ] Ravi Prakash commented on YARN-2424: I reviewed the code and the changes make sense to me. I'm a +1 on the patch as is. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2056: - Attachment: YARN-2056.201408202039.txt This patch keeps the {{yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity}} property as a global parameter, and then adds a per-queue property in this format: {{yarn.resourcemanager.monitor.capacity.preemption.queue-path.max_ignored_over_capacity}}. The preemption code makes two sets of passes through the queues. The first time through, it calculates the ideal resource allocation per queue based on normalized guaranteed capacity, and the second time through, it selects which queues' resources to preempt, taking into consideration the {{max_ignored_over_capacity}} value. In this patch, the per-queue {{...max_ignored_over_capacity}} is taken into consideration in the first pass to help determine which queues have resources available for preempting. This is necessary because without it, queues that could fulfill the need would otherwise be removed from the list of available resources. Then, for the second pass, the global {{...max_ignored_over_capacity}} setting is used, as before, to determine which resources out of the remaining available resources to use. This patch still requires an RM restart if the queue properties have changed. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.2#6252)
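A minimal sketch of how a per-queue override with a global fallback could be looked up; the property-name format follows the comment above, but the helper class itself is illustrative, not part of the patch:
{code:java}
import org.apache.hadoop.conf.Configuration;

/** Illustrative lookup: per-queue max_ignored_over_capacity with a global fallback. */
public class PreemptionThresholdSketch {
  private static final String GLOBAL_KEY =
      "yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity";
  private static final float DEFAULT_THRESHOLD = 0.1f; // assumed default for the example

  static float thresholdFor(Configuration conf, String queuePath) {
    float global = conf.getFloat(GLOBAL_KEY, DEFAULT_THRESHOLD);
    // Per-queue key in the format described above, e.g.
    // yarn.resourcemanager.monitor.capacity.preemption.root.etl.max_ignored_over_capacity
    String perQueueKey = "yarn.resourcemanager.monitor.capacity.preemption."
        + queuePath + ".max_ignored_over_capacity";
    return conf.getFloat(perQueueKey, global);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.setFloat(
        "yarn.resourcemanager.monitor.capacity.preemption.root.etl.max_ignored_over_capacity",
        0.3f);
    System.out.println(thresholdFor(conf, "root.etl"));     // 0.3: per-queue override
    System.out.println(thresholdFor(conf, "root.default")); // falls back to the global value
  }
}
{code}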
[jira] [Commented] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104921#comment-14104921 ] Beckham007 commented on YARN-1801: -- [~jlowe] we use hadoop 2.2.0. We hit both the YARN-1575 and YARN-1801 problems. When HDFS has problems, the NPE from YARN-1801 occurs; otherwise the problem is the one covered by YARN-1575. We will build a version that includes the fix from YARN-1575. Even if the assoc is null, should we close the threadpool? NPE in public localizer --- Key: YARN-1801 URL: https://issues.apache.org/jira/browse/YARN-1801 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Hong Zhiguo Priority: Critical Attachments: YARN-1801.patch While investigating YARN-1800, found this in the NM logs that caused the public localizer to shut down: {noformat} 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null } 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
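On the question above, here is a small self-contained model of the defensive pattern being suggested: a localizer-style loop that tolerates a missing association instead of dying on it, and only shuts the thread pool down when the loop itself is stopping. This is a simplified illustration, not the ResourceLocalizationService code.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Toy model: skip null associations rather than letting an NPE kill the whole localizer. */
public class PublicLocalizerSketch {
  private final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<>();
  private final ExecutorService pool = Executors.newFixedThreadPool(4);
  private volatile boolean running = true;

  void run() {
    try {
      while (running) {
        Runnable assoc = pending.poll(100, TimeUnit.MILLISECONDS);
        if (assoc == null) {
          continue; // nothing to localize right now; not treated as fatal
        }
        pool.execute(assoc);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdown(); // tear the pool down only when the localizer itself is stopping
    }
  }

  public static void main(String[] args) throws Exception {
    PublicLocalizerSketch localizer = new PublicLocalizerSketch();
    localizer.pending.add(() -> System.out.println("downloading public resource"));
    Thread t = new Thread(localizer::run);
    t.start();
    Thread.sleep(500);
    localizer.running = false;
    t.join();
  }
}
{code}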
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104926#comment-14104926 ] Wangda Tan commented on YARN-796: - bq. As I've said before, I basically want something similar to the health check code: I provide something executable that the NM can run at runtime that will provide the list of labels. If we need to add labels, it's updating the script which is a much smaller footprint than redeploying HADOOP_CONF_DIR everywhere. I understand now; it makes sense since it gives admins a flexible way to set labels on the NM side. Adding a {{NodeLabelCheckerService}} to the NM, similar to {{NodeHealthCheckerService}}, should work. I'll create a separate JIRA for setting labels on the NM side under this ticket and leave the design/implementation discussion here. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
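A minimal sketch of the script-based approach discussed here, analogous to the health-check script: run a configured executable and treat each whitespace-separated token on its stdout as a node label. The class name and script path below are purely illustrative.
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative node-label script runner, loosely modeled on the health-check script idea. */
public class NodeLabelScriptSketch {
  static Set<String> labelsFrom(String scriptPath) throws Exception {
    Process p = new ProcessBuilder(scriptPath).redirectErrorStream(true).start();
    Set<String> labels = new LinkedHashSet<>();
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        for (String token : line.trim().split("\\s+")) {
          if (!token.isEmpty()) {
            labels.add(token);
          }
        }
      }
    }
    p.waitFor();
    return labels; // e.g. [linux, x86_64, gpu], to be reported via the NM heartbeat
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical script location; updating this script is cheaper than redeploying configs.
    String script = args.length > 0 ? args[0] : "/etc/hadoop/node-labels.sh";
    System.out.println(labelsFrom(script));
  }
}
{code}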
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104931#comment-14104931 ] Wangda Tan commented on YARN-2056: -- Maybe another way to do this is to add a per-queue config, like {{..queue-path.disable_preemption}}. Then in {{ProportionalCapacityPreemptionPolicy#cloneQueues}}, if a queue's used capacity is more than its guaranteed resource and it has preemption disabled, we will not create a TempQueue for it. This would not require an RM restart when a queue property changes (queue properties are refreshed and the PreemptionPolicy will pick up such changes). Does it make sense? Thanks, Wangda Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.2#6252)
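A toy model of this alternative (not ProportionalCapacityPreemptionPolicy itself; all names are illustrative): when assembling preemption candidates, skip any queue whose per-queue disable_preemption flag is set so it is never chosen as a preemption source.
{code:java}
import java.util.ArrayList;
import java.util.List;

/** Toy model: exclude queues flagged with disable_preemption from preemption candidates. */
public class DisablePreemptionSketch {
  static class QueueInfo {
    final String path;
    final boolean disablePreemption;
    QueueInfo(String path, boolean disablePreemption) {
      this.path = path;
      this.disablePreemption = disablePreemption;
    }
  }

  static List<QueueInfo> preemptionCandidates(List<QueueInfo> queues) {
    List<QueueInfo> candidates = new ArrayList<>();
    for (QueueInfo q : queues) {
      if (q.disablePreemption) {
        continue; // never preempt from this queue, even if it is over its guarantee
      }
      candidates.add(q);
    }
    return candidates;
  }

  public static void main(String[] args) {
    List<QueueInfo> queues = new ArrayList<>();
    queues.add(new QueueInfo("root.prod", true));  // e.g. ...root.prod.disable_preemption=true
    queues.add(new QueueInfo("root.adhoc", false));
    preemptionCandidates(queues).forEach(q -> System.out.println(q.path)); // only root.adhoc
  }
}
{code}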
[jira] [Commented] (YARN-2432) RMStateStore should process the pending events before close
[ https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104963#comment-14104963 ] Hadoop QA commented on YARN-2432: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663208/YARN-2432.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 3 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4676//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4676//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4676//console This message is automatically generated. RMStateStore should process the pending events before close --- Key: YARN-2432 URL: https://issues.apache.org/jira/browse/YARN-2432 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2432.patch Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping over the following statements in code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being stopped on events such as switching to standby, it will first close the state store(in case of ZKRMStateStore, close connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. -- This message was sent by Atlassian JIRA (v6.2#6252)
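For reference, here is a small self-contained model of the proposed stop ordering from the description: drain and process whatever the dispatcher still holds first, and only then close the store. The classes below are simplified stand-ins, not RMStateStore or AsyncDispatcher.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy model of the proposed stop order: process pending events, then close the store. */
public class StateStoreStopOrderSketch {
  private final BlockingQueue<String> pendingEvents = new LinkedBlockingQueue<>();
  private boolean storeOpen = true;

  void handle(String event) {
    if (!storeOpen) {
      throw new IllegalStateException("store already closed while handling " + event);
    }
    System.out.println("stored: " + event);
  }

  /** Analogous to dispatcher.stop(): finish whatever is still queued. */
  void stopDispatcher() {
    String event;
    while ((event = pendingEvents.poll()) != null) {
      handle(event);
    }
  }

  /** Analogous to closeInternal(): e.g. closing the ZK connection. */
  void closeStore() {
    storeOpen = false;
  }

  void serviceStop() {
    stopDispatcher(); // flipped order: first process the pending events...
    closeStore();     // ...then close the underlying store
  }

  public static void main(String[] args) {
    StateStoreStopOrderSketch store = new StateStoreStopOrderSketch();
    store.pendingEvents.add("STORE_APP_attempt_1");
    store.serviceStop(); // with the old order (close first), handle() would have thrown
  }
}
{code}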
[jira] [Created] (YARN-2434) RM should not recover containers from previously failed attempt
Jian He created YARN-2434: - Summary: RM should not recover containers from previously failed attempt Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2434) RM should not recover containers from previously failed attempt
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2434: -- Issue Type: Sub-task (was: Bug) Parent: YARN-556 RM should not recover containers from previously failed attempt --- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2434) RM should not recover containers from previously failed attempt
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2434: -- Attachment: YARN-2434.1.patch RM should not recover containers from previously failed attempt --- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2434.1.patch If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104974#comment-14104974 ] Jian He commented on YARN-2434: --- Patch to not recover containers from previously failed attempt if container-preserving AM restart is not enabled. RM should not recover containers from previously failed attempt --- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2434.1.patch If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
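A minimal model of the guard described in this sub-task (not the actual RM recovery code; all names below are illustrative): when container-preserving AM restart is disabled, recovered containers belonging to an attempt other than the current one are dropped.
{code:java}
import java.util.ArrayList;
import java.util.List;

/** Toy model: filter recovered containers by attempt when container retention is disabled. */
public class RecoverContainersSketch {
  static class RecoveredContainer {
    final int attemptId;
    final String containerId;
    RecoveredContainer(int attemptId, String containerId) {
      this.attemptId = attemptId;
      this.containerId = containerId;
    }
  }

  static List<RecoveredContainer> containersToRecover(
      List<RecoveredContainer> recovered, int currentAttemptId, boolean keepContainersAcrossAttempts) {
    List<RecoveredContainer> result = new ArrayList<>();
    for (RecoveredContainer c : recovered) {
      if (!keepContainersAcrossAttempts && c.attemptId != currentAttemptId) {
        continue; // previous attempt failed and retention is off: do not recover this container
      }
      result.add(c);
    }
    return result;
  }

  public static void main(String[] args) {
    List<RecoveredContainer> recovered = new ArrayList<>();
    recovered.add(new RecoveredContainer(1, "container_01"));
    recovered.add(new RecoveredContainer(2, "container_02"));
    // Container-preserving AM restart disabled, current attempt is 2: only container_02 survives.
    containersToRecover(recovered, 2, false).forEach(c -> System.out.println(c.containerId));
  }
}
{code}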
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104976#comment-14104976 ] Karthik Kambatla commented on YARN-415: --- A quick comment before we commit this. IIUC, we are tracking the *allocation* and not *utilization*. Actual utilization could be smaller than the amount of resources allocated (or asked for). Can we update the title and the corresponding class/field names accordingly? Also, the values are accumulated for the duration of the app. Can we add *aggregate* in the required class/field names? Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
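As a concrete illustration of the formula in the description, here is a tiny example that aggregates reserved-memory-seconds over an application's containers (the container sizes and lifetimes are made-up numbers):
{code:java}
/** Aggregate MB-seconds = sum over containers of (reserved MB * container lifetime in seconds). */
public class MemorySecondsSketch {
  static long aggregateMbSeconds(long[][] containers) {
    long total = 0;
    for (long[] c : containers) {
      long reservedMb = c[0];
      long lifetimeSeconds = c[1];
      total += reservedMb * lifetimeSeconds;
    }
    return total;
  }

  public static void main(String[] args) {
    // Two map containers (1024 MB for 120 s each) and one reducer (2048 MB for 300 s):
    long[][] containers = { {1024, 120}, {1024, 120}, {2048, 300} };
    // 1024*120 + 1024*120 + 2048*300 = 860160 MB-seconds
    System.out.println(aggregateMbSeconds(containers));
  }
}
{code}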
[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104985#comment-14104985 ] Hadoop QA commented on YARN-2035: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644022/YARN-2035.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 3 release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4677//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4677//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4677//console This message is automatically generated. FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated YARN-2424: Attachment: Y2424-1.patch Added a version with a log statement that warns on startup. [~tucu00], is this sufficient? The config docs are pretty clear about the effect of setting the parameter, and this should be noticed if someone is experimenting with LCE. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: Y2424-1.patch, YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
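For illustration only, a startup warning of the kind described might look like the sketch below; the class name, flags, and log wording are placeholders and are not taken from Y2424-1.patch.
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative startup warning for LCE running in non-secure mode without cgroups. */
public class LceStartupWarningSketch {
  private static final Logger LOG = LoggerFactory.getLogger(LceStartupWarningSketch.class);

  static void warnIfNonSecure(boolean securityEnabled, boolean cgroupsEnabled) {
    if (!securityEnabled && !cgroupsEnabled) {
      // Assumption for the example: in this mode containers run as a single configured local user.
      LOG.warn("LinuxContainerExecutor is running in non-secure mode without cgroups; "
          + "containers will be launched as the configured local user.");
    }
  }

  public static void main(String[] args) {
    warnIfNonSecure(false, false); // emits the warning once at startup
  }
}
{code}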
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104993#comment-14104993 ] Alejandro Abdelnur commented on YARN-2424: -- sure, fine, enough cycles spent on this, thx. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: Y2424-1.patch, YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2432) RMStateStore should process the pending events before close
[ https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105015#comment-14105015 ] Varun Saxena commented on YARN-2432: 1. No new tests are needed; the patch just flips the order of the two statements. 2. The release audit warnings are unrelated to the changed code; they point at an HDFS file. 3. The core test failure is unrelated to the code change as well. Will cancel and submit the patch again. RMStateStore should process the pending events before close --- Key: YARN-2432 URL: https://issues.apache.org/jira/browse/YARN-2432 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2432.patch Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping over the following statements in code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being stopped on events such as switching to standby, it will first close the state store(in case of ZKRMStateStore, close connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1919) Potential NPE in EmbeddedElectorService#stop
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105060#comment-14105060 ] Hudson commented on YARN-1919: -- FAILURE: Integrated in Hadoop-trunk-Commit #6091 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6091/]) YARN-1919. Potential NPE in EmbeddedElectorService#stop. (Tsuyoshi Ozawa via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619251) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java Potential NPE in EmbeddedElectorService#stop Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Fix For: 2.6.0 Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
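A defensive null check during stop is the kind of change this calls for: only release what serviceInit actually managed to create, so a failure partway through init does not trigger the secondary NPE seen in the stack trace. Below is a simplified, self-contained model, not the actual EmbeddedElectorService patch.
{code:java}
/** Toy model: guard serviceStop against fields that init never got to create. */
public class ElectorStopSketch {
  private AutoCloseable elector; // may still be null if serviceInit failed early

  void serviceInit(boolean failEarly) throws Exception {
    if (failEarly) {
      throw new Exception("init failed before the elector was created");
    }
    elector = () -> System.out.println("quit election and closed ZK handle");
  }

  void serviceStop() throws Exception {
    if (elector != null) { // the null check that prevents the NPE seen above
      elector.close();
    }
  }

  public static void main(String[] args) throws Exception {
    ElectorStopSketch service = new ElectorStopSketch();
    try {
      service.serviceInit(true);
    } catch (Exception e) {
      // AbstractService stops the service quietly after a failed init; this must not NPE.
      service.serviceStop();
    }
  }
}
{code}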