[jira] [Resolved] (YARN-4770) Auto-restart of containers should work across NM restarts.
[ https://issues.apache.org/jira/browse/YARN-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-4770. Resolution: Not A Bug > Auto-restart of containers should work across NM restarts. > -- > > Key: YARN-4770 > URL: https://issues.apache.org/jira/browse/YARN-4770 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Vinod Kumar Vavilapalli > Assignee: Vinod Kumar Vavilapalli > > See my comment > [here|https://issues.apache.org/jira/browse/YARN-3998?focusedCommentId=15133367&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133367] > on YARN-3998. Need to take care of two things: > - The relaunch feature needs to work across NM restarts, so we should save > the retry-context and policy per container into the state-store and reload it > to continue relaunching after NM restart. > - We should also handle restarting of any containers that may have crashed > during the NM reboot.
[jira] [Resolved] (YARN-5372) TestRMWebServicesAppsModification fails in trunk
[ https://issues.apache.org/jira/browse/YARN-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-5372. Resolution: Not A Problem > TestRMWebServicesAppsModification fails in trunk > > > Key: YARN-5372 > URL: https://issues.apache.org/jira/browse/YARN-5372 > Project: Hadoop YARN > Issue Type: Test > Reporter: Jun Gong > > Some test cases in TestRMWebServicesAppsModification fail in trunk: > {code} > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppMove[0] > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testUpdateAppPriority[0] > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppMove[1] > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillUnauthorized[1] > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testUpdateAppPriority[1] > > {code} > The test case errors are at > https://builds.apache.org/job/PreCommit-YARN-Build/12310/testReport/.
[jira] [Created] (YARN-5372) TestRMWebServicesAppsModification fails in trunk
Jun Gong created YARN-5372: -- Summary: TestRMWebServicesAppsModification fails in trunk Key: YARN-5372 URL: https://issues.apache.org/jira/browse/YARN-5372 Project: Hadoop YARN Issue Type: Test Reporter: Jun Gong Some test cases in TestRMWebServicesAppsModification fail in trunk: {code} org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppMove[0] org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testUpdateAppPriority[0] org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppMove[1] org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillUnauthorized[1] org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testUpdateAppPriority[1] {code} The test case errors are at https://builds.apache.org/job/PreCommit-YARN-Build/12310/testReport/.
[jira] [Created] (YARN-5333) apps are rejected during RM HA failover
Jun Gong created YARN-5333: -- Summary: apps are rejected during RM HA failover Key: YARN-5333 URL: https://issues.apache.org/jira/browse/YARN-5333 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Assignee: Jun Gong Enable RM HA and use FairScheduler. Reproduce steps: 1. Start two RMs. 2. After the RMs are running, change both RMs' file {{etc/hadoop/fair-scheduler.xml}} and add some queues. 3. Submit some apps to the newly added queues. 4. Stop the active RM; the standby RM will then transition to active and recover apps. However, the new active RM will reject the recovered apps because it might not have loaded the new {{fair-scheduler.xml}} yet. We need to call {{initScheduler}} before starting active services, or move {{refreshAll()}} in front of {{rm.transitionToActive()}}. *It seems this is also important for other schedulers*. Related logs are as follows: {quote} 2016-07-07 16:55:34,756 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recover ended ... 2016-07-07 16:55:34,824 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /gaia/hadoop/etc/hadoop/fair-scheduler.xml 2016-07-07 16:55:34,826 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application rejected by queue placement policy 2016-07-07 16:55:34,828 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application appattempt_1467803586002_0006_01 is done. finalState=FAILED 2016-07-07 16:55:34,828 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Unknown application appattempt_1467803586002_0006_01 has completed! 2016-07-07 16:55:34,828 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application rejected by queue placement policy 2016-07-07 16:55:34,828 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application appattempt_1467803586002_0004_01 is done. finalState=FAILED 2016-07-07 16:55:34,828 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Unknown application appattempt_1467803586002_0004_01 has completed! 2016-07-07 16:55:34,828 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APP_REJECTED at ACCEPTED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:697) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:88) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:718) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:702) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:191) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:124) at java.lang.Thread.run(Thread.java:745) {quote}
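As a sketch of the ordering fix proposed above (class and method names here are illustrative stand-ins, not the actual ResourceManager code), the idea is simply to reload scheduler configuration before app recovery runs:
{code}
// Illustrative sketch only -- hypothetical names, not the real RM wiring.
class FailoverSequence {
    void transitionToActive() {
        refreshAll();           // reload fair-scheduler.xml etc. first (the proposed ordering)
        startActiveServices();  // then recover apps, which now see the new queues
    }
    void refreshAll() { /* reload queue/allocation configuration */ }
    void startActiveServices() { /* recover apps from the state store */ }
}
{code}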
[jira] [Created] (YARN-5286) Add RPC port info in RM web service's response when getting app status
Jun Gong created YARN-5286: -- Summary: Add RPC port info in RM web service's response when getting app status Key: YARN-5286 URL: https://issues.apache.org/jira/browse/YARN-5286 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Assignee: Jun Gong When getting app status via the RM web service ({{/ws/v1/cluster/apps/\{appid\}}}), there is no RPC port info in the response. The port info is important for communicating with the AM. By contrast, the RPC port info is included when running {{bin/yarn application -status appid}}.
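For comparison, the CLI path gets the port from the client API. A minimal sketch of reading it programmatically; this assumes a Hadoop version where {{ApplicationId.fromString}} is available (older 2.x releases used ConverterUtils instead):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RpcPortLookup {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();
        // Same data the CLI prints for "bin/yarn application -status <appid>"
        ApplicationReport report =
            client.getApplicationReport(ApplicationId.fromString(args[0]));
        System.out.println("AM RPC port: " + report.getRpcPort());
        client.stop();
    }
}
{code}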
[jira] [Created] (YARN-5168) Add port mapping handling when docker container uses bridge network
Jun Gong created YARN-5168: -- Summary: Add port mapping handling when docker container uses bridge network Key: YARN-5168 URL: https://issues.apache.org/jira/browse/YARN-5168 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jun Gong YARN-4007 addresses different network setups when launching the docker container. We need to support port mapping when the docker container uses a bridge network. The following problems are what we faced: 1. Add "-P" to map the docker container's exposed ports automatically. 2. Add "-p" to let users specify specific ports to map. 3. Add service registry support for the bridge network case, so apps can find each other. It could be done outside of YARN; however, it might be more convenient to support it natively in YARN.
[jira] [Created] (YARN-5116) Failed to execute "yarn application"
Jun Gong created YARN-5116: -- Summary: Failed to execute "yarn application" Key: YARN-5116 URL: https://issues.apache.org/jira/browse/YARN-5116 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Assignee: Jun Gong Use the trunk code. {code} $ bin/yarn application -list 16/05/20 11:35:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread "main" org.apache.commons.cli.UnrecognizedOptionException: Unrecognized option: -list at org.apache.commons.cli.Parser.processOption(Parser.java:363) at org.apache.commons.cli.Parser.parse(Parser.java:199) at org.apache.commons.cli.Parser.parse(Parser.java:85) at org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:172) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) at org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:90) {code} It is caused by the subcommand 'application' being removed from the command args. The following command is OK. {code} $ bin/yarn application application -list 16/05/20 11:39:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0 Application-Id Application-NameApplication-Type User Queue State Final-State ProgressTracking-URL {code}
[jira] [Created] (YARN-5063) Fail to launch AM continuously on a lost NM
Jun Gong created YARN-5063: -- Summary: Fail to launch AM continuously on a lost NM Key: YARN-5063 URL: https://issues.apache.org/jira/browse/YARN-5063 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong If a NM node shuts down, RM will not mark it as LOST until the liveness monitor finds it timed out. Before that, however, RM might continuously allocate AMs on that NM. We found this case in our cluster: RM continuously allocated the same AM on a lost NM before RM found it lost, and AMLauncher always failed because it could not connect to the lost NM. To solve the problem, we could add the NM to the AM blacklist if RM fails to launch an AM on it.
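A hedged sketch of the proposed mitigation (all names hypothetical; the real blacklist plumbing lives in the RM's attempt and scheduler code):
{code}
import java.util.HashSet;
import java.util.Set;

// Illustrative only: remember nodes where the AM launch failed and skip them
// when allocating the next AM attempt.
class AmBlacklistTracker {
    private final Set<String> blacklistedNodes = new HashSet<>();

    void onAmLaunchFailed(String nodeId) {
        blacklistedNodes.add(nodeId);  // likely-dead NM: avoid it for future AM attempts
    }

    boolean isUsableForAm(String nodeId) {
        return !blacklistedNodes.contains(nodeId);
    }
}
{code}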
[jira] [Created] (YARN-4910) Fix incomplete log info in ResourceLocalizationService
Jun Gong created YARN-4910: -- Summary: Fix incomplete log info in ResourceLocalizationService Key: YARN-4910 URL: https://issues.apache.org/jira/browse/YARN-4910 Project: Hadoop YARN Issue Type: Improvement Reporter: Jun Gong Assignee: Jun Gong Priority: Trivial When debugging, we found a lot of incomplete log messages from ResourceLocalizationService, which is a little confusing. {quote} 2016-03-30 22:47:29,703 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /data6/yarnenv/local/nmPrivate/container_1456839788316_4159_01_04_37.tokens. Credentials list: {quote} The content of the credentials list is only printed at DEBUG log level.
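A minimal sketch of the kind of fix implied here, assuming an SLF4J-style logger rather than the NM's exact logging setup: only mention the credentials list when it will actually be printed, so the INFO line is not left dangling.
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class CredentialsLogging {
    private static final Logger LOG = LoggerFactory.getLogger(CredentialsLogging.class);

    void logWrite(String tokensPath, String credentialsList) {
        if (LOG.isDebugEnabled()) {
            // Only DEBUG has the full list, so only DEBUG mentions it.
            LOG.debug("Writing credentials to the nmPrivate file {}. Credentials list: {}",
                tokensPath, credentialsList);
        } else {
            LOG.info("Writing credentials to the nmPrivate file {}", tokensPath);
        }
    }
}
{code}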
[jira] [Created] (YARN-4735) Remove stale LogAggregationReport from NM's context
Jun Gong created YARN-4735: -- Summary: Remove stale LogAggregationReport from NM's context Key: YARN-4735 URL: https://issues.apache.org/jira/browse/YARN-4735 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Assignee: Jun Gong {quote} All LogAggregationReports (current and previous) are only added to *context.getLogAggregationStatusForApps*, and never removed. So for long-running services, the LogAggregationReport list that NM sends to RM will grow over time. {quote} Per the discussion in YARN-4720, we need to remove stale LogAggregationReports from NM's context.
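A sketch of the intended cleanup with hypothetical types (the real NM context keys reports by ApplicationId): drop reports whose app the NM no longer tracks, so the heartbeat payload stays bounded.
{code}
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

class LogAggregationReportPruner {
    // Illustrative: prune entries for apps no longer running on this NM.
    static <R> void pruneStale(Map<String, R> reportsByAppId, Set<String> runningAppIds) {
        for (Iterator<Map.Entry<String, R>> it = reportsByAppId.entrySet().iterator(); it.hasNext();) {
            if (!runningAppIds.contains(it.next().getKey())) {
                it.remove();  // stale report: its app has finished
            }
        }
    }
}
{code}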
[jira] [Created] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing
Jun Gong created YARN-4497: -- Summary: RM might fail to restart when recovering apps whose attempts are missing Key: YARN-4497 URL: https://issues.apache.org/jira/browse/YARN-4497 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Assignee: Jun Gong Found the following problem while discussing YARN-3480. If RM fails to store some attempts in RMStateStore, there will be missing attempts in RMStateStore. For example, when storing attempt1, attempt2 and attempt3, RM successfully stored attempt1 and attempt3, but failed to store attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one by one; for this case, we will recover attempt1, then attempt2. When recovering attempt2, we call *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, which first looks up its ApplicationAttemptStateData but cannot find it, so an error is raised at *assert attemptState != null* (*RMAppAttemptImpl#recover*, line 880).
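A hedged sketch of a more tolerant recovery loop (illustrative names and types, not the actual RMAppImpl code): skip the gap instead of asserting, so one missing record does not abort restart.
{code}
import java.util.Map;

class AttemptRecovery {
    // Illustrative only -- the real recovery path works on ApplicationAttemptStateData.
    void recoverAttempts(Map<Integer, Object> storedAttempts, int lastAttemptId) {
        for (int attemptId = 1; attemptId <= lastAttemptId; attemptId++) {
            Object attemptState = storedAttempts.get(attemptId);
            if (attemptState == null) {
                System.err.println("Attempt " + attemptId + " missing from state store; skipping");
                continue;  // instead of `assert attemptState != null`, which aborts recovery
            }
            recoverAttempt(attemptState);
        }
    }
    void recoverAttempt(Object attemptState) { /* replay the stored attempt */ }
}
{code}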
[jira] [Created] (YARN-4494) Recover completed apps asynchronously
Jun Gong created YARN-4494: -- Summary: Recover completed apps asynchronously Key: YARN-4494 URL: https://issues.apache.org/jira/browse/YARN-4494 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong With RM HA enabled, when recovering apps, recover completed apps asynchronously.
[jira] [Created] (YARN-4459) container-executor might kill process wrongly
Jun Gong created YARN-4459: -- Summary: container-executor might kill process wrongly Key: YARN-4459 URL: https://issues.apache.org/jira/browse/YARN-4459 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong When calling 'signal_container_as_user' in container-executor, it first checks whether the process group exists; if not, it will kill the process itself (if the process exists). This is not reasonable, because the process group not existing means the corresponding container has finished, so if we kill the process itself, we just kill a wrong process. We found this happen in our cluster many times. We used the same account for starting the NM and for submitted apps, and container-executor sometimes killed the NM (the wrongly killed process might just be a newly started thread that was the NM's child process).
[jira] [Created] (YARN-4316) Make NM's version information useful for upgrade
Jun Gong created YARN-4316: -- Summary: Make NM's version information useful for upgrade Key: YARN-4316 URL: https://issues.apache.org/jira/browse/YARN-4316 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong Priority: Minor When upgrading all NMs to a new bug-fix version, we often upgrade some NMs first, then upgrade the rest if all looks right. This way we avoid breaking the whole cluster if the new version of NM does not work well. But there is no easy way to tell whether we have missed upgrading some NMs. We can see all NMs' version info in RM's web page as attached, but this version info is too generic, e.g. 2.4.1, 2.6.1, 2.6.2. For a small bug-fix version, the version remains the same. If we could make the version info more detailed (e.g. 2.4.1.12), we could verify whether all NMs have been upgraded to the new bug-fix version. I propose to add a new config (yarn.nodemanager.version) in yarn-site.xml to solve this problem. When upgrading a NM, we configure it to the new version at the same time. The NM will report this version to RM, and then we can see it.
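A minimal sketch of the proposal. The property name "yarn.nodemanager.version" comes from this issue and is not part of shipped YARN; {{YarnVersionInfo.getVersion()}} is the existing compiled-in build version used as the fallback.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.YarnVersionInfo;

public class NodeManagerVersion {
    // Sketch: an operator-set property overrides the compiled-in build
    // version that the NM reports to the RM on registration.
    public static String reportedVersion() {
        Configuration conf = new YarnConfiguration();
        return conf.get("yarn.nodemanager.version", YarnVersionInfo.getVersion());
    }
}
{code}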
[jira] [Resolved] (YARN-4316) Make NM's version information useful for upgrade
[ https://issues.apache.org/jira/browse/YARN-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-4316. Resolution: Implemented > Make NM's version information useful for upgrade > > > Key: YARN-4316 > URL: https://issues.apache.org/jira/browse/YARN-4316 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager > Reporter: Jun Gong > Assignee: Jun Gong > Priority: Minor > Attachments: nodes.png > > > When upgrading all NMs to a new bug-fix version, we often upgrade some NMs > first, then upgrade the rest if all looks right. This way we avoid breaking the > whole cluster if the new version of NM does not work well. But there > is no easy way to tell whether we have missed upgrading some NMs. > We can see all NMs' version info in RM's web page as attached, but this > version info is too generic, e.g. 2.4.1, 2.6.1, 2.6.2. For a small bug-fix > version, the version remains the same. If we could make the version info more > detailed (e.g. 2.4.1.12), we could verify whether all NMs have been upgraded to > the new bug-fix version. > I propose to add a new config (yarn.nodemanager.version) in yarn-site.xml to > solve this problem. When upgrading a NM, we configure it to the new version at > the same time. The NM will report this version to RM, and then we can see it.
[jira] [Created] (YARN-4201) AMBlacklist does not work for minicluster
Jun Gong created YARN-4201: -- Summary: AMBlacklist does not work for minicluster Key: YARN-4201 URL: https://issues.apache.org/jira/browse/YARN-4201 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong For a minicluster (scheduler.include-port-in-node-name is set to TRUE), the AMBlacklist does not work. This is because RM just puts the host into the AMBlacklist whether scheduler.include-port-in-node-name is set or not. In fact RM should put "host + port" into the AMBlacklist when it is set.
[jira] [Created] (YARN-4122) Add support for GPU as a resource
Jun Gong created YARN-4122: -- Summary: Add support for GPU as a resource Key: YARN-4122 URL: https://issues.apache.org/jira/browse/YARN-4122 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong Use [cgroups devices|https://www.kernel.org/doc/Documentation/cgroups/devices.txt] to isolate GPUs for containers. For docker containers, we could use 'docker run --device=...'. Reference: [SLURM Resources isolation through cgroups|http://slurm.schedmd.com/slurm_ug_2011/SLURM_UserGroup2011_cgroups.pdf].
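A hedged illustration of the cgroups (v1) devices mechanism itself, not YARN's implementation: 195 is the NVIDIA character-device major number, and the cgroup path here is an assumption for the example.
{code}
import java.io.FileWriter;
import java.io.IOException;

public class GpuCgroupIsolation {
    public static void isolate(String cgroupPath, int allowedGpuMinor) throws IOException {
        try (FileWriter deny = new FileWriter(cgroupPath + "/devices.deny")) {
            deny.write("c 195:* rwm");   // deny access to all NVIDIA GPUs by default
        }
        try (FileWriter allow = new FileWriter(cgroupPath + "/devices.allow")) {
            allow.write("c 195:" + allowedGpuMinor + " rwm");  // re-allow only the assigned GPU
        }
    }
}
{code}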
[jira] [Resolved] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-3998. Resolution: Won't Fix > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature > Reporter: Jun Gong > Assignee: Jun Gong > > I'd like to add a field (retry-times) in ContainerLaunchContext. When an AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run (e.g. exit code is not 0). > It will save a lot of time: it avoids container localization, RM does not > need to re-schedule the container, and local files in the container's working > directory will be left for re-use (if a container has downloaded some big > files, it does not need to re-download them when running again). > We find it useful in systems like Storm.
[jira] [Created] (YARN-4005) Completed container whose app is finished is not removed from NMStateStore
Jun Gong created YARN-4005: -- Summary: Completed container whose app is finished is not removed from NMStateStore Key: YARN-4005 URL: https://issues.apache.org/jira/browse/YARN-4005 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Assignee: Jun Gong If a container is completed and its corresponding app is finished, NM only removes it from its context and does not remove it from NMStateStore.
[jira] [Created] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
Jun Gong created YARN-3998: -- Summary: Add retry-times to let NM re-launch container when it fails to run Key: YARN-3998 URL: https://issues.apache.org/jira/browse/YARN-3998 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong I'd like to add a field (retry-times) in ContainerLaunchContext. When an AM launches containers, it could specify the value. Then NM will re-launch the container 'retry-times' times when it fails to run (e.g. exit code is not 0). It will save a lot of time: it avoids container localization, RM does not need to re-schedule the container, and local files in the container's working directory will be left for re-use (if a container has downloaded some big files, it does not need to re-download them when running again). We find it useful in systems like Storm.
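A minimal sketch of the proposed semantics. The retry count would come from the new ContainerLaunchContext field described above, and {{launchContainer()}} is a stand-in for the NM's real launch path, not actual YARN API.
{code}
class RelaunchLoop {
    int runWithRetries(int retryTimes) {
        int exitCode;
        int attempt = 0;
        do {
            exitCode = launchContainer();  // reuses the already-localized working dir
            attempt++;
        } while (exitCode != 0 && attempt <= retryTimes);
        return exitCode;
    }
    int launchContainer() { return 0; /* stand-in for the real launch */ }
}
{code}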
[jira] [Created] (YARN-3896) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset
Jun Gong created YARN-3896: -- Summary: RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset Key: YARN-3896 URL: https://issues.apache.org/jira/browse/YARN-3896 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong {noformat} 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 10.208.132.153 to /default-rack 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Reconnect from the node at: 10.208.132.153 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node 10.208.132.153(cmPort: 8041 httpPort: 8080) registered with capability: memory:6144, vCores:60, diskCapacity:213, assigned nodeId 10.208.132.153:8041 2015-07-03 16:49:39,104 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Too far behind rm response id:2506413 nm response id:0 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node 10.208.132.153:8041 as it is now REBOOTED 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 10.208.132.153:8041 Node Transitioned from RUNNING to REBOOTED {noformat}
[jira] [Resolved] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializing it
[ https://issues.apache.org/jira/browse/YARN-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-3831. Resolution: Not A Problem Localization failed when a local disk turns from bad to good without NM initializing it -- Key: YARN-3831 URL: https://issues.apache.org/jira/browse/YARN-3831 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong A local disk turns from bad to good without NM initializing it (creating /path-to-local-dir/usercache and /path-to-local-dir/filecache). When localizing a container, container-executor will try to create directories under /path-to-local-dir/usercache and fail, and then the container's localization will fail. The related log is as follows: {noformat} 2015-06-19 18:00:01,205 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1431957472783_38706012_01_000465 2015-06-19 18:00:01,212 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens. Credentials list: 2015-06-19 18:00:01,216 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1431957472783_38706012_01_000465 startLocalizer is : 20 org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command provided 0 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is tdwadmin 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.IOException: Application application_1431957472783_38706012 initialization failed (exitCode=20) with output: main : command provided 0 main : user is tdwadmin Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) Caused by: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) ... 1 more 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1431957472783_38706012_01_000465 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat}
[jira] [Created] (YARN-3833) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk
Jun Gong created YARN-3833: -- Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk Key: YARN-3833 URL: https://issues.apache.org/jira/browse/YARN-3833 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 282.811 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.445 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat}
[jira] [Created] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializing it
Jun Gong created YARN-3831: -- Summary: Localization failed when a local disk turns from bad to good without NM initializing it Key: YARN-3831 URL: https://issues.apache.org/jira/browse/YARN-3831 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong A local disk turns from bad to good without NM initializing it (creating /path-to-local-dir/usercache and /path-to-local-dir/filecache). When localizing a container, container-executor will try to create directories under /path-to-local-dir/usercache and fail, and then the container's localization will fail. The related log is as follows: {noformat} 2015-06-19 18:00:01,205 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1431957472783_38706012_01_000465 2015-06-19 18:00:01,212 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens. Credentials list: 2015-06-19 18:00:01,216 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1431957472783_38706012_01_000465 startLocalizer is : 20 org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command provided 0 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is tdwadmin 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.IOException: Application application_1431957472783_38706012 initialization failed (exitCode=20) with output: main : command provided 0 main : user is tdwadmin Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) Caused by: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) ... 1 more 2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1431957472783_38706012_01_000465 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat}
[jira] [Resolved] (YARN-3833) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk
[ https://issues.apache.org/jira/browse/YARN-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-3833. Resolution: Duplicate TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk Key: YARN-3833 URL: https://issues.apache.org/jira/browse/YARN-3833 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 282.811 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.445 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat}
[jira] [Created] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
Jun Gong created YARN-3809: -- Summary: Failed to launch new attempts because ApplicationMasterLauncher's threads all hang Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong ApplicationMasterLauncher creates a thread pool whose size is 10 to deal with AMLauncherEventType (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on each, and one NM shut down for some reason. After RM found the NM LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher needed to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and its threads all hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 mins. That means for 15 mins ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout.
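One hedged mitigation sketch (illustrative only; an alternative fix could instead shorten the stopContainers RPC wait): make the pool size configurable so a CLEANUP storm against a dead NM cannot pin every worker for the full retry window.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LauncherPool {
    // The description says the pool is hard-coded to 10; a larger,
    // operator-tunable size leaves headroom for LAUNCH events.
    static ExecutorService create(int poolSize) {
        return Executors.newFixedThreadPool(poolSize);  // e.g. 50 instead of 10
    }
}
{code}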
[jira] [Resolved] (YARN-3474) Add a way to let NM wait for RM to come back, not kill running containers
[ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-3474. Resolution: Invalid Add a way to let NM wait for RM to come back, not kill running containers - Key: YARN-3474 URL: https://issues.apache.org/jira/browse/YARN-3474 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3474.01.patch When RM HA is enabled and the active RM shuts down, the standby RM will become active and recover apps and attempts; apps will not be affected. But if some cases or bugs cause both RMs to fail to start normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340], or RM cannot connect to ZK well), NM will kill the containers running on it when it cannot heartbeat with RM for some time (the max retry time is 15 mins by default), and then all apps will be killed. In a production cluster, we might come across the above cases, and fixing these bugs might take more than 15 mins. In order to keep apps from being affected and killed by NM, the YARN admin could set a flag (the flag is a znode '/wait-rm-to-come-back/cluster-id' in our solution) to tell NM to wait for RM to come back and not kill running containers. After fixing the bugs and starting RM normally, clear the flag.
[jira] [Created] (YARN-3480) Make AM max attempts stored in RMStateStore configurable
Jun Gong created YARN-3480: -- Summary: Make AM max attempts stored in RMStateStore configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts' larger. However, that makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes the RM recovery process much slower. It might be better to make the max number of attempts stored in RMStateStore configurable.
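A sketch of the idea under an assumed shape (the config key and types below are illustrative, not shipped YARN properties): cap the attempts kept in the store independently of the retry limit, evicting the oldest first.
{code}
import java.util.ArrayDeque;
import java.util.Deque;

class StoredAttemptCap {
    static void enforceCap(Deque<String> storedAttemptIds, int maxStored) {
        while (storedAttemptIds.size() > maxStored) {
            String oldest = storedAttemptIds.pollFirst();
            removeFromStateStore(oldest);  // keep only the newest attempts for recovery
        }
    }
    static void removeFromStateStore(String attemptId) { /* delete the znode/file */ }
}
{code}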
[jira] [Created] (YARN-3469) Do not set watch for most cases in ZKRMStateStore
Jun Gong created YARN-3469: -- Summary: Do not set watch for most cases in ZKRMStateStore Key: YARN-3469 URL: https://issues.apache.org/jira/browse/YARN-3469 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Priority: Minor In ZKRMStateStore, most operations (e.g. getDataWithRetries) set watches on znodes. A large number of watches can cause problems such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail](https://issues.apache.org/jira/browse/ZOOKEEPER-706). Although there is a workaround of setting jute.maxbuffer to a larger value, we would need to adjust this value again as more apps and attempts are stored in ZK. And those watches are useless now, so it might be better not to set watches at all.
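The ZooKeeper client API takes the watch flag explicitly, so the change amounts to passing false; a minimal example against the real {{ZooKeeper#getData}} signature:
{code}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class NoWatchRead {
    static byte[] read(ZooKeeper zk, String path) throws Exception {
        Stat stat = new Stat();
        return zk.getData(path, false /* watch=false: no server-side watch registered */, stat);
    }
}
{code}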
[jira] [Created] (YARN-3389) Two attempts might operate on same data structures concurrently
Jun Gong created YARN-3389: -- Summary: Two attempts might operate on same data structures concurrently Key: YARN-3389 URL: https://issues.apache.org/jira/browse/YARN-3389 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong In AttemptFailedTransition, the new attempt gets references to the failed attempt's state ('justFinishedContainers' and 'finishedContainersSentToAM'). The two attempts might then operate on these two variables concurrently, e.g. they might update 'justFinishedContainers' concurrently when both are handling a CONTAINER_FINISHED event.
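A hedged sketch of one possible fix (types illustrative): hand the new attempt its own thread-safe copy rather than an aliased reference to the failed attempt's list.
{code}
import java.util.Collection;
import java.util.concurrent.ConcurrentLinkedQueue;

class AttemptStateTransfer {
    // No shared mutable reference: the new attempt mutates only its own copy.
    static <T> ConcurrentLinkedQueue<T> copyForNewAttempt(Collection<T> fromFailedAttempt) {
        return new ConcurrentLinkedQueue<>(fromFailedAttempt);
    }
}
{code}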
[jira] [Resolved] (YARN-3161) Containers' information is lost in some cases when RM restarts
[ https://issues.apache.org/jira/browse/YARN-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-3161. Resolution: Duplicate Containers' information is lost in some cases when RM restarts -- Key: YARN-3161 URL: https://issues.apache.org/jira/browse/YARN-3161 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong When RM restarts, containers' information will be lost in the following scenarios: 1. NM restarts before it sends containers' information to the new active RM. 2. NM stops, so it could not send containers' information to the new active RM. Without that containers' information, the corresponding AM will never get their status through RM, and the AM would just wait for them forever.
[jira] [Created] (YARN-3161) Containers' information is lost in some cases when RM restarts
Jun Gong created YARN-3161: -- Summary: Containers' information is lost in some cases when RM restarts Key: YARN-3161 URL: https://issues.apache.org/jira/browse/YARN-3161 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong When RM restarts, containers' information will be lost in the following scenarios: 1. NM restarts before it sends containers' information to the new active RM. 2. NM stops, so it could not send containers' information to the new active RM. Without that containers' information, the corresponding AM will never get their status through RM, and the AM would just wait for them forever.
[jira] [Created] (YARN-3094) reset timer for liveness monitors after RM recovery
Jun Gong created YARN-3094: -- Summary: reset timer for liveness monitors after RM recovery Key: YARN-3094 URL: https://issues.apache.org/jira/browse/YARN-3094 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong When RM restarts, it will recover RMAppAttempts and register them with AMLivenessMonitor if they are not in a final state. AMs will time out in RM if the recovery process takes a long time for some reason (e.g. too many apps). In our system, we found the recovery process took about 3 mins, and all AMs timed out.
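An illustrative sketch of the proposed reset (interface and names hypothetical, not the actual AMLivenessMonitor API): re-register each live attempt after recovery completes so its expiry clock starts from "now" rather than from the start of recovery.
{code}
class PostRecoveryReset {
    interface LivenessMonitor {
        void unregister(String attemptId);
        void register(String attemptId);  // (re)starts the attempt's expiry timer
    }
    static void resetTimers(LivenessMonitor monitor, Iterable<String> liveAttemptIds) {
        for (String id : liveAttemptIds) {
            monitor.unregister(id);
            monitor.register(id);  // expiry now measured from the end of recovery
        }
    }
}
{code}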
[jira] [Created] (YARN-3057) Need update apps' runnability when reloading allocation files for FairScheduler
Jun Gong created YARN-3057: -- Summary: Need update apps' runnability when reloading allocation files for FairScheduler Key: YARN-3057 URL: https://issues.apache.org/jira/browse/YARN-3057 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong If we submit an app and the number of running apps in its corresponding leaf queue has reached the queue's max limit, the app will be put into 'nonRunnableApps', and its runnability will only be updated when an app attempt is removed (FairScheduler calls `updateRunnabilityOnAppRemoval` at that time). Suppose only service apps are running: they will not finish, so the submitted app will never be scheduled even if we raise the leaf queue's max limit. I think we need to update apps' runnability when reloading allocation files for FairScheduler.
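A hedged sketch of the proposed hook (all names illustrative): after the allocation file reloads, re-check the queued apps and promote any that now fit their queue's raised limits.
{code}
import java.util.ArrayList;
import java.util.List;

class RunnabilityRefresher {
    interface App { boolean fitsQueueLimitsNow(); }

    // Called after the allocation file is reloaded.
    static List<App> promoteRunnable(List<App> nonRunnableApps) {
        List<App> promoted = new ArrayList<>();
        for (App app : new ArrayList<>(nonRunnableApps)) {  // iterate over a copy while mutating
            if (app.fitsQueueLimitsNow()) {
                nonRunnableApps.remove(app);
                promoted.add(app);  // hand these to the scheduler as runnable
            }
        }
        return promoted;
    }
}
{code}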
[jira] [Created] (YARN-2640) TestDirectoryCollection.testCreateDirectories failed
Jun Gong created YARN-2640: -- Summary: TestDirectoryCollection.testCreateDirectories failed Key: YARN-2640 URL: https://issues.apache.org/jira/browse/YARN-2640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong When running {{mvn test -Dtest=TestDirectoryCollection}}, it failed: {code} Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.538 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.969 sec FAILURE! java.lang.AssertionError: local dir parent not created with proper permissions expected:rwxr-xr-x but was:rwxrwxr-x at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104) {code} I found it was because testDiskSpaceUtilizationLimit ran before testCreateDirectories, so directory dirA had already been created by testDiskSpaceUtilizationLimit. When testCreateDirectories tried to create dirA with the specified permissions, it found dirA was already there and did nothing.
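A minimal sketch of the test-isolation fix; {{FileUtil.fullyDelete}} is real Hadoop API, while the directory name is just the one from the report. Clearing the leftover directory first makes the permission assertion meaningful regardless of test ordering.
{code}
import java.io.File;
import org.apache.hadoop.fs.FileUtil;

class CleanBeforeCreate {
    static void resetDir(File testRoot, String name) {
        // e.g. "dirA" left behind by testDiskSpaceUtilizationLimit
        FileUtil.fullyDelete(new File(testRoot, name));
    }
}
{code}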
[jira] [Resolved] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-2612. Resolution: Duplicate Some completed containers are not reported to NM Key: YARN-2612 URL: https://issues.apache.org/jira/browse/YARN-2612 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Fix For: 2.6.0 We were testing RM work-preserving restart and found the following logs when we ran a simple MapReduce task, PI. Some completed containers that had already been pulled by the AM were never reported back to the NM, so the NM continuously reported the completed containers even after the AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In YARN-1372, NM will report completed containers to RM until it gets an ACK from RM. If the AM does not call allocate, which means the AM does not ack RM, RM will not ack NM. We ([~chenchun]) have observed these two cases when running the MapReduce task 'pi': 1) RM sends completed containers to the AM. After receiving them, the AM thinks it has done its work and does not need more resources, so it does not call allocate. 2) When the AM finishes, it could not ack RM because the AM itself has not finished yet. We think when RMAppAttempt calls BaseFinalTransition, it means the AppAttempt is finished, and then RM could send this AppAttempt's completed containers to NM.
[jira] [Created] (YARN-2617) NM does not need to send finished containers whose app is not running to RM
Jun Gong created YARN-2617: -- Summary: NM does not need to send finished containers whose app is not running to RM Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Fix For: 2.6.0 We ([~chenchun]) were testing RM work-preserving restart and found the following logs when we ran a simple MapReduce task, PI. NM continuously reported completed containers whose application had already finished, even after the AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove the appId from {code}app.context.getApplications(){code} when ApplicationImpl receives the event {code}ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED{code}; however, NM might not receive this event for a long time, or might never receive it. * For NonAggregatingLogHandler, it waits for YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, before it is scheduled to delete application logs and send the event. * For LogAggregationService, it might fail (e.g. if the user does not have HDFS write permission), in which case it will not send the event.
[jira] [Created] (YARN-2170) Fix components' version information in the web page 'About the Cluster'
Jun Gong created YARN-2170: -- Summary: Fix components' version information in the web page 'About the Cluster' Key: YARN-2170 URL: https://issues.apache.org/jira/browse/YARN-2170 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Priority: Minor In the web page 'About the Cluster', each YARN component's build version (e.g. the ResourceManager's) is currently shown as the Hadoop version. It is caused by mistakenly calling getVersion() instead of _getVersion() in VersionInfo.java.
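A self-contained sketch of the described mistake; the class below is an analogy for VersionInfo, not the real code. The static method resolves to Hadoop's own version, while the instance method carries the component-specific one.
{code}
class VersionLookup {
    // analogous to VersionInfo's instance method, backed by the
    // component's own version properties
    String _getVersion() { return "component-specific version"; }

    // analogous to the static method, backed by Hadoop's common properties
    static String getVersion() { return "hadoop version"; }

    String componentBuildVersion() {
        return _getVersion();   // correct: the component's own version
        // return getVersion(); // the bug: always Hadoop's version
    }
}
{code}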
[jira] [Created] (YARN-2164) Add switch 'restart' for yarn-daemon.sh
Jun Gong created YARN-2164: -- Summary: Add switch 'restart' for yarn-daemon.sh Key: YARN-2164 URL: https://issues.apache.org/jira/browse/YARN-2164 Project: Hadoop YARN Issue Type: Improvement Reporter: Jun Gong Priority: Minor For convenience, add a switch 'restart' for yarn-daemon.sh, e.g. we could use yarn-daemon.sh restart nodemanager instead of yarn-daemon.sh stop nodemanager; yarn-daemon.sh start nodemanager.