[jira] [Commented] (YARN-3645) ResourceManager can't start success if attribute value of "aclSubmitApps" is null in fair-scheduler.xml
[ https://issues.apache.org/jira/browse/YARN-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547601#comment-14547601 ] Mohammad Shahid Khan commented on YARN-3645: Loading with an invalid node configuration is not feasible. But instead of throwing the NullPointerException, we can throw an *AllocationConfigurationException* with a proper message so that the reason for the failure can be identified easily.
{code}
if ("aclAdministerApps".equals(field.getTagName())) {
  Text aclText = (Text) field.getFirstChild();
  if (aclText == null) {
    throw new AllocationConfigurationException(
        "Invalid admin ACL configuration in allocation file");
  }
  acls.put(QueueACL.ADMINISTER_QUEUE, new AccessControlList(aclText.getData()));
}
{code}
> ResourceManager can't start success if attribute value of "aclSubmitApps" is > null in fair-scheduler.xml > > > Key: YARN-3645 > URL: https://issues.apache.org/jira/browse/YARN-3645 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.5.2 >Reporter: zhoulinlin > > The "aclSubmitApps" is configured in fair-scheduler.xml like below: > > > > The resourcemanager log: > 2015-05-14 12:59:48,623 INFO org.apache.hadoop.service.AbstractService: > Service ResourceManager failed in state INITED; cause: > org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed > to initialize FairScheduler > org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed > to initialize FairScheduler > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:493) > at > 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:920) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:240) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1159) > Caused by: java.io.IOException: Failed to initialize FairScheduler > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1301) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1318) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > ... 7 more > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:458) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:337) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1299) > ... 
9 more > 2015-05-14 12:59:48,623 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning > to standby state > 2015-05-14 12:59:48,623 INFO > com.zte.zdh.platformplugin.factory.YarnPlatformPluginProxyFactory: plugin > transitionToStandbyIn > 2015-05-14 12:59:48,623 WARN org.apache.hadoop.service.AbstractService: When > stopping the service ResourceManager : java.lang.NullPointerException > java.lang.NullPointerException > at > com.zte.zdh.platformplugin.factory.YarnPlatformPluginProxyFactory.transitionToStandbyIn(YarnPlatformPluginProxyFactory.java:71) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:997) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1058) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) > at > org.apache.hadoop.yarn.server.resourcemana
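The null-check fix proposed in the comment above can be exercised with a minimal, self-contained sketch. The class `AclNullCheckDemo`, its `readAcl` helper, and the local `AllocationConfigurationException` stand-in are hypothetical names for illustration (the real exception class lives in the FairScheduler code); only JDK DOM classes are used:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class AclNullCheckDemo {
    // Hypothetical stand-in for Hadoop's AllocationConfigurationException.
    static class AllocationConfigurationException extends Exception {
        AllocationConfigurationException(String msg) { super(msg); }
    }

    // Return the ACL text of an element, rejecting empty tags with a clear
    // error instead of letting a NullPointerException escape.
    static String readAcl(Element field) throws AllocationConfigurationException {
        Text aclText = (Text) field.getFirstChild();
        if (aclText == null) {
            throw new AllocationConfigurationException(
                "Invalid ACL configuration in allocation file: tag <"
                + field.getTagName() + "> has no value");
        }
        return aclText.getData();
    }

    public static void main(String[] args) throws Exception {
        // An empty <aclAdministerApps/> tag reproduces the reported scenario.
        String xml = "<queue><aclAdministerApps/></queue>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Element acl = (Element) doc.getElementsByTagName("aclAdministerApps").item(0);
        try {
            readAcl(acl);
            System.out.println("no exception");
        } catch (AllocationConfigurationException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

With this check, an empty ACL tag fails allocation-file loading with an actionable message rather than the NullPointerException from the stack trace above.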
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547562#comment-14547562 ] Naganarasimha G R commented on YARN-2729: - Thanks [~vinodkv] for replying, bq. I think once we start marking this script-based provider feature as public, the expected output from the script will automatically become a public interface unless we explicitly say no. We should start thinking about this now to avoid uncertainty in the future? True, it's better to think about it now for both the script-based (YARN-2729) and config-based (YARN-2923) providers if we are making them public. My initial thought is {code} NODE_LABELS=|[,Labels] where Label Type = Partition, Constraint and default if not specified can be Partition. {code} Going further, I think distributed labels will be more suitable for constraints/attributes [YARN-3409], so we can think of having {{Constraint}} as the default too. Also, we need not specify whether a partition is exclusive or non-exclusive; it is not significant from the NM side, as the exclusivity of partition labels is already specified when they are added to the cluster labels set in the RM. One more suggestion: in the NodeLabel object, can we think of having an enum instead of isExclusive? The enum could have ExclusivePartition, NonExclusivePartition, (Constraint in future) and so on. bq. Isn't AbstractNodeLabelsProvider a good place to do these steps? Well, AbstractNodeLabelsProvider is currently applicable only to the whitelisted providers (config and script), and its purpose was to remove duplicate code related to the TimerTask and related configs. So is your suggestion to expose AbstractNodeLabelsProvider as a public interface? Or can we think of having an intermediate manager class with configurations for the timer requirement, and leave the NodeLabelsProvider interface as is? 
> Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup > --- > > Key: YARN-2729 > URL: https://issues.apache.org/jira/browse/YARN-2729 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, > YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, > YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, > YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, > YARN-2729.20150402-1.patch, YARN-2729.20150404-1.patch, > YARN-2729.20150517-1.patch > > > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
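The enum suggestion and the script-output format discussed above can be sketched in plain Java. `NodeLabelType`, `parseLabels`, and the `NODE_LABELS=label[,label]` line shape are assumptions based on the proposal in this comment, not a finalized public interface:

```java
import java.util.ArrayList;
import java.util.List;

public class NodeLabelTypeDemo {
    // Hypothetical enum replacing the boolean isExclusive flag, as suggested
    // in the comment; more types (e.g. CONSTRAINT variants) could follow.
    enum NodeLabelType { EXCLUSIVE_PARTITION, NON_EXCLUSIVE_PARTITION, CONSTRAINT }

    // Sketch of parsing one line of script output such as:
    //   NODE_LABELS=gpu,ssd
    // where the label type, if unspecified, would default to a partition.
    static List<String> parseLabels(String line) {
        String prefix = "NODE_LABELS=";
        if (!line.startsWith(prefix)) {
            throw new IllegalArgumentException("not a NODE_LABELS line: " + line);
        }
        List<String> labels = new ArrayList<>();
        for (String l : line.substring(prefix.length()).split(",")) {
            if (!l.trim().isEmpty()) {
                labels.add(l.trim());
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(parseLabels("NODE_LABELS=gpu,ssd")); // [gpu, ssd]
    }
}
```

An enum keeps the wire/config format open to future label kinds without adding one boolean per distinction.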
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547544#comment-14547544 ] Raju Bairishetti commented on YARN-3644: We can have a new config like NODEMANAGER_ALIVE_ON_RM_CONNECTION_FAILURES. Based on this config value, the NM takes a decision on shutdown; this way we can honour the existing behaviour as well. I will provide a patch shortly. I am not able to assign the issue to myself. Can anyone help me with assigning? > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
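The config-gated shutdown decision proposed above might look like the following sketch. The property key and the `shouldShutDown` helper are hypothetical, chosen only to mirror the suggested NODEMANAGER_ALIVE_ON_RM_CONNECTION_FAILURES config; the default preserves today's behaviour of shutting down:

```java
import java.util.Properties;

public class NmShutdownDecision {
    // Hypothetical config key, not an actual YARN property: when true, the
    // NM stays alive after exhausting RM connection retries instead of
    // shutting itself down.
    static final String KEEP_ALIVE_KEY =
        "yarn.nodemanager.alive-on-rm-connection-failures";

    static boolean shouldShutDown(Properties conf) {
        boolean keepAlive =
            Boolean.parseBoolean(conf.getProperty(KEEP_ALIVE_KEY, "false"));
        // Default false keeps the existing behaviour: shut down on failure.
        return !keepAlive;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(shouldShutDown(conf)); // true: existing behaviour
        conf.setProperty(KEEP_ALIVE_KEY, "true");
        System.out.println(shouldShutDown(conf)); // false: NM stays alive
    }
}
```

Keying the decision off a config means large clusters can opt out of mass NM shutdowns during long RM maintenance windows without changing the default.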
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547537#comment-14547537 ] Raju Bairishetti commented on YARN-3646: [~vinodkv] I will provide a patch shortly. I am not able to assign myself. Can anyone help me in assigning myself? > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. 
> {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 
10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547529#comment-14547529 ] Raju Bairishetti commented on YARN-3646: bq. Setting RetryPolicies.RETRY_FOREVER for exceptionToPolicyMap as default policy is not sufficient, but also RetryPolicies.RetryForever.shouldRetry() should check for Connect exceptions and handle it. Otherwise shouldRetry always return RetryAction.RETRY action. Do we need to catch the exception in shouldRetry if we have a separate exceptionToPolicyMap which contains only the connection-exception entry (like exceptionToPolicyMap.put(connectionException, FOREVER policy))? It seems we do not even require exceptionToPolicyMap for the FOREVER policy if we catch the exception in the shouldRetry method. Thoughts? > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. 
> {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 
10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
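The behaviour discussed in this thread, retrying forever only for connection failures, can be illustrated with a simplified, self-contained sketch. This is not Hadoop's actual `RetryPolicies`/`RMProxy` code; `shouldRetry` here is a stand-in for the proposed exception-aware check:

```java
import java.net.ConnectException;

public class RetryPolicySketch {
    // Simplified sketch of the proposal: retry forever ONLY for connection
    // failures; fail fast for everything else (e.g. an
    // ApplicationNotFoundException, modeled here by a generic exception).
    static boolean shouldRetry(Exception e) {
        return e instanceof ConnectException;
    }

    public static void main(String[] args) {
        // Connection failure: keep retrying (RM may come back).
        System.out.println(shouldRetry(new ConnectException("RM down"))); // true
        // Application-level error: retrying can never succeed, so give up.
        System.out.println(shouldRetry(new IllegalStateException("app not found"))); // false
    }
}
```

The point of the sketch: whether done via an exceptionToPolicyMap entry or inside shouldRetry itself, the FOREVER policy must distinguish transport failures from application-level errors, or clients hang exactly as described in this issue.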
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547506#comment-14547506 ] Srikanth Sundarrajan commented on YARN-3644: [~vinodkv], YARN-3644 is independent of this. In our setup we ran into this before we ran into YARN-3646. The NM gives up trying after about 30-odd minutes (default settings) before *attempting* to shut itself down. Is there an issue if this wait time is much (infinitely) longer (for both HA & non-HA setups)? An orthogonal issue is that when the NM attempts to shut itself down, it doesn't actually go down and lingers around for days without actually accepting any containers, unless restarted (will file another issue for this). > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3651: -- Assignee: (was: Jian He) > Tracking url in ApplicationCLI wrong for running application > > > Key: YARN-3651 > URL: https://issues.apache.org/jira/browse/YARN-3651 > Project: Hadoop YARN > Issue Type: Bug > Components: applications, resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Priority: Minor > > Application URL in Application CLI wrong > Steps to reproduce > == > 1. Start HA setup insecure mode > 2.Configure HTTPS_ONLY > 3.Submit application to cluster > 4.Execute command ./yarn application -list > 5.Observer tracking URL shown > {code} > 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History > server at /:45034 > Total number of applications (application-types: [] and states: [SUBMITTED, > ACCEPTED, RUNNING]):1 > Application-Id --- Tracking-URL > application_1431672734347_0003 *http://host-10-19-92-117:13013* > {code} > *Expected* > https://:64323/proxy/application_1431672734347_0003 / -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-3651: - Assignee: Jian He > Tracking url in ApplicationCLI wrong for running application > > > Key: YARN-3651 > URL: https://issues.apache.org/jira/browse/YARN-3651 > Project: Hadoop YARN > Issue Type: Bug > Components: applications, resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Jian He >Priority: Minor > > Application URL in Application CLI wrong > Steps to reproduce > == > 1. Start HA setup insecure mode > 2.Configure HTTPS_ONLY > 3.Submit application to cluster > 4.Execute command ./yarn application -list > 5.Observer tracking URL shown > {code} > 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History > server at /:45034 > Total number of applications (application-types: [] and states: [SUBMITTED, > ACCEPTED, RUNNING]):1 > Application-Id --- Tracking-URL > application_1431672734347_0003 *http://host-10-19-92-117:13013* > {code} > *Expected* > https://:64323/proxy/application_1431672734347_0003 / -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547496#comment-14547496 ] sandflee commented on YARN-3668: I don't want the service to be terminated if the AM goes down; YARN will also restart the AM until it is launched successfully. Through external means we could detect this situation and replace the AM jar with a new one. > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547494#comment-14547494 ] Vinod Kumar Vavilapalli commented on YARN-3668: --- So you don't want the service to be terminated even if the ApplicationMaster goes down and will never get launched again? > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547492#comment-14547492 ] Vinod Kumar Vavilapalli commented on YARN-3644: --- Actually, for all the above cases, we want NMs to just continue for a while without losing any work and finally give up after some time. The only difference between a HA vs non-HA setup is that in HA setup NMs will just wait many times over trying each of the RMs. Getting into the business of detecting and acting on partitions is best left up to admins/tools. > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547489#comment-14547489 ] Vinod Kumar Vavilapalli commented on YARN-3644: --- bq. In large clusters, if RM is down for maintenance for longer period, all the NMs shuts themselves down, requiring additional work to bring up the NMs. bq. Right now, NM shuts down itself only in case of connection failures. NM ignores all other kinds of exceptions and errors while sending heartbeats. This path usually shouldn't happen at all as the RMProxy layer is supposed to retry _enough_, except perhaps for the bug at YARN-3646. We eventually want to give up if the retry layer itself gives up. Given that, is this JIRA simply a dup of YARN-3646? /cc [~jianhe] [~xgong] > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3646: -- Target Version/s: 2.8.0, 2.7.1 [~raju.bairishetti], would you like to provide a patch? /cc [~xgong], [~jianhe] who wrote most of this code. Targeting 2.7.1/2.8.0, but more likely one is 2.8.0. Can see if we can get it into earlier releases too depending on their schedule. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. 
> {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 
10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547480#comment-14547480 ] Vinod Kumar Vavilapalli commented on YARN-3480: --- bq. I think we need to have a lower limit on the failure-validity interval to avoid situations like this. Filed YARN-3669. > Recovery may get very slow with lots of services with lots of app-attempts > -- > > Key: YARN-3480 > URL: https://issues.apache.org/jira/browse/YARN-3480 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3480.01.patch, YARN-3480.02.patch, > YARN-3480.03.patch, YARN-3480.04.patch > > > When RM HA is enabled and running containers are kept across attempts, apps > are more likely to finish successfully with more retries(attempts), so it > will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However > it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make > RM recover process much slower. It might be better to set max attempts to be > stored in RMStateStore. > BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to > a small value, retried attempts might be very large. So we need to delete > some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3669) Attempt-failures validity interval should have a global admin configurable lower limit
Vinod Kumar Vavilapalli created YARN-3669: - Summary: Attempt-failures validity interval should have a global admin configurable lower limit Key: YARN-3669 URL: https://issues.apache.org/jira/browse/YARN-3669 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Found this while reviewing YARN-3480. bq. When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. I think we need to have a lower limit on the failure-validity interval to avoid situations like this. Having this will avoid pardoning too many failures in too short a duration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
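The proposed lower limit would effectively clamp the per-application validity interval to an admin-configured floor. The sketch below is hypothetical (`effectiveInterval` and its parameters are not actual YARN APIs or configuration keys), but shows the intended semantics:

```java
public class ValidityIntervalClamp {
    // Clamp an application's requested attempt-failures validity interval to
    // a cluster-wide admin floor, so a tiny per-app value cannot pardon
    // too many failures in too short a duration.
    static long effectiveInterval(long appRequestedMs, long adminFloorMs) {
        return Math.max(appRequestedMs, adminFloorMs);
    }

    public static void main(String[] args) {
        // App asks for 1s, admin floor is 60s: the floor wins.
        System.out.println(effectiveInterval(1000L, 60000L)); // 60000
        // App asks for 5min, above the floor: the app's value is kept.
        System.out.println(effectiveInterval(300000L, 60000L)); // 300000
    }
}
```

A global floor also bounds how many failed attempts the RMStateStore can accumulate, which is the recovery-time concern raised in YARN-3480.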
[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547478#comment-14547478 ] Weiwei Yang commented on YARN-3526: --- Thanks [~xgong] > ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster > - > > Key: YARN-3526 > URL: https://issues.apache.org/jira/browse/YARN-3526 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 2.6.0 > Environment: Red Hat Enterprise Linux Server 6.4 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > Labels: BB2015-05-TBR > Fix For: 2.7.1 > > Attachments: YARN-3526.001.patch, YARN-3526.002.patch > > > On a QJM HA cluster, view RM web UI to track job status, it shows > This is standby RM. Redirecting to the current active RM: > http://:8088/proxy/application_1427338037905_0008/mapreduce > it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547477#comment-14547477 ] Weiwei Yang commented on YARN-3526: --- Thanks [~xgong] > ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster > - > > Key: YARN-3526 > URL: https://issues.apache.org/jira/browse/YARN-3526 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 2.6.0 > Environment: Red Hat Enterprise Linux Server 6.4 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > Labels: BB2015-05-TBR > Fix For: 2.7.1 > > Attachments: YARN-3526.001.patch, YARN-3526.002.patch > > > On a QJM HA cluster, view RM web UI to track job status, it shows > This is standby RM. Redirecting to the current active RM: > http://:8088/proxy/application_1427338037905_0008/mapreduce > it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547473#comment-14547473 ] Vinod Kumar Vavilapalli commented on YARN-3480: --- bq. we might need keep failed attempts those are in validity window, so it is the minimum number of attempts that we should keep. So when apps specify how much they want the platform to remember, we need consider it as another minimum number of attempts that we should keep. What I proposed is a global limit on attempts-to-remember that can be overridden to a lower value by individual apps if needed. So, yes, like you are saying, this global limit should usually be such that RM can _at least_ remember attempts that can happen in all apps' one failure-validity-interval. bq. It makes recovery more fast, and does not lose any attempts' history. However it will makes recovery process a little more complicated. The former method(removing attempts) is more concise, and just likes logrotate, if we could accept the absence of some attempts' history information, I would prefer it. Without doing this, we will unnecessarily be forcing apps to lose history simply because the platform cannot recover quickly enough. Thinking more, how about we only have (limits + asynchronous recovery) for services, once YARN-1039 goes in? Non-service apps anyway are not expected to have a lot of app-attempts. 
> Recovery may get very slow with lots of services with lots of app-attempts > -- > > Key: YARN-3480 > URL: https://issues.apache.org/jira/browse/YARN-3480 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3480.01.patch, YARN-3480.02.patch, > YARN-3480.03.patch, YARN-3480.04.patch > > > When RM HA is enabled and running containers are kept across attempts, apps > are more likely to finish successfully with more retries(attempts), so it > will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However > it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make > RM recover process much slower. It might be better to set max attempts to be > stored in RMStateStore. > BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to > a small value, retried attempts might be very large. So we need to delete > some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
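The "logrotate"-style pruning discussed in this thread can be sketched as a bounded history: keep at most N attempt records and evict the oldest when a new one arrives. This is an illustrative model only (the class name is hypothetical); in the real RM the eviction would also have to remove the record from the RMStateStore.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: retain only the newest maxStoredAttempts attempt ids,
// dropping the oldest on overflow, like logrotate for attempt history.
public class AttemptHistory {
    private final int maxStoredAttempts;
    private final Deque<String> attemptIds = new ArrayDeque<>();

    public AttemptHistory(int maxStoredAttempts) {
        this.maxStoredAttempts = maxStoredAttempts;
    }

    /** Records a new attempt; returns the evicted (oldest) attempt id, or null. */
    public String recordAttempt(String attemptId) {
        String evicted = null;
        if (attemptIds.size() == maxStoredAttempts) {
            evicted = attemptIds.removeFirst(); // oldest attempt's history is lost
        }
        attemptIds.addLast(attemptId);
        return evicted;
    }

    public int size() {
        return attemptIds.size();
    }
}
```

The trade-off debated above is visible here: recovery only ever replays at most `maxStoredAttempts` records, at the cost of losing the evicted attempts' history.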
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547468#comment-14547468 ] Vinod Kumar Vavilapalli commented on YARN-2729: --- bq. I think the format expected from the command should be more structured. Specifically as we expect more per-label attributes in line with YARN-3565. bq. So IMHO if there is plan to make this interface public & stable then would be better do these changes now itself if not it would better done after requirement for constraint labels, so that more clarity on structure would be there? Tan, Wangda and you can share your opinion on this, based on it will do the modifications. I think once we start marking this script-based provider feature as public, the expected output from the script will automatically become a public interface unless we explicitly say no. We should start thinking about this now to avoid uncertainty in the future? bq. These needs to be done irrespective of the label provider (system or user's) hence kept it in NodeStatusUpdaterImpl , but if req to be moved out then we need to bring in some intermediate manager(/helper/delegator) class between NodeStatusUpdaterImpl and NodeLabelsProvider. Isn't AbstractNodeLabelsProvider a good place to do these steps? 
> Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup > --- > > Key: YARN-2729 > URL: https://issues.apache.org/jira/browse/YARN-2729 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, > YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, > YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, > YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, > YARN-2729.20150402-1.patch, YARN-2729.20150404-1.patch, > YARN-2729.20150517-1.patch > > > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3547) FairScheduler: Apps that have no resource demand should not participate scheduling
[ https://issues.apache.org/jira/browse/YARN-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547461#comment-14547461 ] Xianyin Xin commented on YARN-3547: --- Agree, [~leftnoteasy]. Now we have YARN-3547.004.patch using {{SchedulerApplicationAttempt.getAppAttemptResourceUsage().getPending()}} and YARN-3547.005.patch using {{getDemand() - getResourceUsage()}}. > FairScheduler: Apps that have no resource demand should not participate > scheduling > -- > > Key: YARN-3547 > URL: https://issues.apache.org/jira/browse/YARN-3547 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Xianyin Xin >Assignee: Xianyin Xin > Attachments: YARN-3547.001.patch, YARN-3547.002.patch, > YARN-3547.003.patch, YARN-3547.004.patch, YARN-3547.005.patch > > > At present, all of the 'running' apps participate the scheduling process, > however, most of them may have no resource demand on a production cluster, as > the app's status is running other than waiting for resource at the most of > the app's lifetime. It's not a wise way we sort all the 'running' apps and > try to fulfill them, especially on a large-scale cluster which has heavy > scheduling load. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
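The two patch variants compared above differ only in where "pending" comes from: a tracked pending counter versus computing demand minus usage. A minimal sketch of the second approach, filtering out no-demand apps before the sort-and-assign pass (Resource is reduced to a single long here for illustration; all names are hypothetical):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: skip apps with zero pending demand before scheduling.
public class PendingDemandFilter {
    static final class App {
        final String id;
        final long demand; // total resource the app wants
        final long usage;  // resource already allocated to it
        App(String id, long demand, long usage) {
            this.id = id; this.demand = demand; this.usage = usage;
        }
        // Mirrors the getDemand() - getResourceUsage() approach in the thread.
        long pending() {
            return Math.max(0, demand - usage);
        }
    }

    /** Keeps only apps that still want resources; the rest need not be sorted. */
    static List<App> appsWithDemand(List<App> runningApps) {
        return runningApps.stream()
            .filter(a -> a.pending() > 0)
            .collect(Collectors.toList());
    }

    static int demoCount() {
        return appsWithDemand(Arrays.asList(
            new App("app1", 8_192, 8_192), // fully satisfied: skipped
            new App("app2", 8_192, 2_048), // still pending 6144: kept
            new App("app3", 0, 0)          // no demand at all: skipped
        )).size();
    }
}
```

On a large cluster where most running apps are idle, this filter shrinks the candidate set before the comparatively expensive sorting step.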
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547456#comment-14547456 ] Vinod Kumar Vavilapalli commented on YARN-3561: --- I see you filed HADOOP-11989. Assuming _that_ is the root-cause, we can close this as a duplicate. > Non-AM Containers continue to run even after AM is stopped > -- > > Key: YARN-3561 > URL: https://issues.apache.org/jira/browse/YARN-3561 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Affects Versions: 2.6.0 > Environment: debian 7 >Reporter: Gour Saha >Priority: Critical > Attachments: app0001.zip, application_1431771946377_0001.zip > > > Non-AM containers continue to run even after application is stopped. This > occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a > Hadoop 2.6 deployment. > Following are the NM logs from 2 different nodes: > *host-07* - where Slider AM was running > *host-03* - where Storm NIMBUS container was running. > *Note:* The logs are partial, starting with the time when the relevant Slider > AM and NIMBUS containers were allocated, till the time when the Slider AM was > stopped. Also, the large number of "Memory usage" log lines were removed > keeping only a few starts and ends of every segment. 
> *NM log from host-07 where Slider AM container was running:* > {noformat} > 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for > container_1428575950531_0020_02_01 > 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - > Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) > 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for > container_1428575950531_0021_01_01 by user yarn > 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new > application reference for app application_1428575950531_0021 > 2015-04-29 00:41:10,323 INFO application.Application > (ApplicationImpl.java:handle(464)) - Application > application_1428575950531_0021 transitioned from NEW to INITING > 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger > (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 > OPERATION=Start Container Request TARGET=ContainerManageImpl > RESULT=SUCCESS APPID=application_1428575950531_0021 > CONTAINERID=container_1428575950531_0021_01_01 > 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService > (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root > Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: > [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple > users. > 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:(182)) - rollingMonitorInterval is set as > -1. The log rolling mornitoring interval is disabled. The logs will be > aggregated after this application is finished. 
> 2015-04-29 00:41:10,351 INFO application.Application > (ApplicationImpl.java:transition(304)) - Adding > container_1428575950531_0021_01_01 to application > application_1428575950531_0021 > 2015-04-29 00:41:10,352 INFO application.Application > (ApplicationImpl.java:handle(464)) - Application > application_1428575950531_0021 transitioned from INITING to RUNNING > 2015-04-29 00:41:10,356 INFO container.Container > (ContainerImpl.java:handle(999)) - Container > container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING > 2015-04-29 00:41:10,357 INFO containermanager.AuxServices > (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId > application_1428575950531_0021 > 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,358 INFO localizer.LocalizedRes
[jira] [Commented] (YARN-3652) A SchedulerMetrics may be needed for evaluating the scheduler's performance
[ https://issues.apache.org/jira/browse/YARN-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547435#comment-14547435 ] Xianyin Xin commented on YARN-3652: --- Thanks [~vinodkv], that's very helpful. > A SchedulerMetrics may be needed for evaluating the scheduler's performance > - > > Key: YARN-3652 > URL: https://issues.apache.org/jira/browse/YARN-3652 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, scheduler >Reporter: Xianyin Xin > > As discussed in YARN-3630, a {{SchedulerMetrics}} may be needed for evaluating > the scheduler's performance. The performance indexes include #events waiting > to be handled by the scheduler, the throughput, the scheduling delay and/or > other indicators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
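A minimal sketch of the indicators the description lists: a pending-event gauge, a handled-event counter (for throughput), and a cumulative handling delay. The class and method names are illustrative, not an actual YARN metrics class.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a SchedulerMetrics: tracks events waiting in the
// dispatcher queue, events handled, and total handling delay, from which a
// mean scheduling delay can be derived.
public class SchedulerMetricsSketch {
    private final AtomicLong pendingEvents = new AtomicLong();
    private final AtomicLong handledEvents = new AtomicLong();
    private final AtomicLong totalDelayMs = new AtomicLong();

    public void eventQueued() {
        pendingEvents.incrementAndGet();
    }

    public void eventHandled(long delayMs) {
        pendingEvents.decrementAndGet();
        handledEvents.incrementAndGet();
        totalDelayMs.addAndGet(delayMs);
    }

    /** #events still waiting to be handled by the scheduler. */
    public long getPendingEvents() {
        return pendingEvents.get();
    }

    /** Mean scheduling delay per handled event, in milliseconds. */
    public double getAvgDelayMs() {
        long n = handledEvents.get();
        return n == 0 ? 0.0 : (double) totalDelayMs.get() / n;
    }
}
```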
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547434#comment-14547434 ] sandflee commented on YARN-3668: Seems not enough; if the AM crashes on launch because of an AM bug, the application will eventually fail. I think that is a problem of the AM, not the application, and YARN should handle it. > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547424#comment-14547424 ] Xuan Gong commented on YARN-3668: - bq. If am crashed and reaches am max fail times, applications are killed. If we set am max fail times to a big one or unlimit am max fail times, RM may have too many AppAttempt to store in memory and RMStateStore, Aren't YARN-611 and YARN-614 enough to cover the cases you described? > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2729: Attachment: YARN-2729.20150517-1.patch Hi [~wangda] # Rebased the patch on top of YARN-3565 # Moved the common code that was earlier here to YARN-2923, as that JIRA will go in first > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup > --- > > Key: YARN-2729 > URL: https://issues.apache.org/jira/browse/YARN-2729 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, > YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, > YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, > YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, > YARN-2729.20150402-1.patch, YARN-2729.20150404-1.patch, > YARN-2729.20150517-1.patch > > > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3565) NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String
[ https://issues.apache.org/jira/browse/YARN-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547411#comment-14547411 ] Naganarasimha G R commented on YARN-3565: - Thanks [~aw] for looking into it. > NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object > instead of String > - > > Key: YARN-3565 > URL: https://issues.apache.org/jira/browse/YARN-3565 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R >Priority: Blocker > Attachments: YARN-3565-20150502-1.patch, YARN-3565.20150515-1.patch, > YARN-3565.20150516-1.patch > > > Now NM HB/Register uses Set, it will be hard to add new fields if we > want to support specifying NodeLabel type such as exclusivity/constraints, > etc. We need to make sure rolling upgrade works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3565) NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String
[ https://issues.apache.org/jira/browse/YARN-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547316#comment-14547316 ] Allen Wittenauer commented on YARN-3565: bq. I think currently white space is getting calculated on the diff output rather just the modified lines only (diff has some lines before and after the modifications). That's not how it works. But I'll look to see if there is an off-by-one error here. > NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object > instead of String > - > > Key: YARN-3565 > URL: https://issues.apache.org/jira/browse/YARN-3565 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R >Priority: Blocker > Attachments: YARN-3565-20150502-1.patch, YARN-3565.20150515-1.patch, > YARN-3565.20150516-1.patch > > > Now NM HB/Register uses Set, it will be hard to add new fields if we > want to support specifying NodeLabel type such as exclusivity/constraints, > etc. We need to make sure rolling upgrade works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547304#comment-14547304 ] Li Lu commented on YARN-3051: - Hi [~varun_saxena], I think the new patch name pattern should be, YARN-3051-YARN-2928.***.patch. Would you please try that again? Thanks! > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, > YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547299#comment-14547299 ] Hadoop QA commented on YARN-3051: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732621/YARN-3051.wip.02.YARN-2928.patch | | Optional Tests | shellcheck javadoc javac unit findbugs checkstyle | | git revision | trunk / cab0dad | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7963/console | This message was automatically generated. > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, > YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3133) Move NodeHealthStatus and associated protobuf to hadoop common
[ https://issues.apache.org/jira/browse/YARN-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3133: --- Description: Move NodeHealthStatus and associated protobuf to hadoop common as HDFS needs to use it. (was: Move NodeHealthStatus and associated protobuf to hadoop common as HDFS needs to use it,) > Move NodeHealthStatus and associated protobuf to hadoop common > -- > > Key: YARN-3133 > URL: https://issues.apache.org/jira/browse/YARN-3133 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Varun Saxena >Assignee: Varun Saxena > > Move NodeHealthStatus and associated protobuf to hadoop common as HDFS needs > to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3339) TestDockerContainerExecutor should pull a single image and not the entire centos repository
[ https://issues.apache.org/jira/browse/YARN-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3339: --- Assignee: Ravindra Kumar Naik > TestDockerContainerExecutor should pull a single image and not the entire > centos repository > --- > > Key: YARN-3339 > URL: https://issues.apache.org/jira/browse/YARN-3339 > Project: Hadoop YARN > Issue Type: Test > Components: test >Affects Versions: 2.6.0 > Environment: Linux >Reporter: Ravindra Kumar Naik >Assignee: Ravindra Kumar Naik >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3339-branch-2.6.0.001.patch, > YARN-3339-trunk.001.patch > > > TestDockerContainerExecutor test pulls the entire centos repository which is > time consuming. > Pulling a specific image (e.g. centos7) will be sufficient to run the test > successfully and will save time -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3560: --- Target Version/s: 2.8.0 Affects Version/s: 2.7.0 > Not able to navigate to the cluster from tracking url (proxy) generated after > submission of job > --- > > Key: YARN-3560 > URL: https://issues.apache.org/jira/browse/YARN-3560 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Anushri >Priority: Minor > Attachments: YARN-3560.patch > > > a standalone web proxy server is enabled in the cluster > when a job is submitted the url generated contains proxy > track this url > in the web page , if we try to navigate to the cluster links [about. > applications, or scheduler] it gets redirected to some default port instead > of actual RM web port configured > as such it throws "webpage not available" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3560: --- Attachment: (was: YARN-3560.patch) > Not able to navigate to the cluster from tracking url (proxy) generated after > submission of job > --- > > Key: YARN-3560 > URL: https://issues.apache.org/jira/browse/YARN-3560 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anushri >Priority: Minor > Attachments: YARN-3560.patch > > > a standalone web proxy server is enabled in the cluster > when a job is submitted the url generated contains proxy > track this url > in the web page , if we try to navigate to the cluster links [about. > applications, or scheduler] it gets redirected to some default port instead > of actual RM web port configured > as such it throws "webpage not available" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3560: --- Attachment: YARN-3560.patch Please review the attached patch. > Not able to navigate to the cluster from tracking url (proxy) generated after > submission of job > --- > > Key: YARN-3560 > URL: https://issues.apache.org/jira/browse/YARN-3560 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anushri >Priority: Minor > Attachments: YARN-3560.patch > > > a standalone web proxy server is enabled in the cluster > when a job is submitted the url generated contains proxy > track this url > in the web page , if we try to navigate to the cluster links [about. > applications, or scheduler] it gets redirected to some default port instead > of actual RM web port configured > as such it throws "webpage not available" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3560: --- Attachment: YARN-3560.patch Please review the attached patch > Not able to navigate to the cluster from tracking url (proxy) generated after > submission of job > --- > > Key: YARN-3560 > URL: https://issues.apache.org/jira/browse/YARN-3560 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anushri >Priority: Minor > Attachments: YARN-3560.patch > > > a standalone web proxy server is enabled in the cluster > when a job is submitted the url generated contains proxy > track this url > in the web page , if we try to navigate to the cluster links [about. > applications, or scheduler] it gets redirected to some default port instead > of actual RM web port configured > as such it throws "webpage not available" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547262#comment-14547262 ] Chackaravarthy commented on YARN-3561: -- Improper kill command construction (specific to this environment) was the issue. I tested by changing the Shell.java class to construct the kill command as follows (including the two hyphens): {noformat} kill -signalNo -- - {noformat} It works fine with this change on debian 7. > Non-AM Containers continue to run even after AM is stopped > -- > > Key: YARN-3561 > URL: https://issues.apache.org/jira/browse/YARN-3561 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Affects Versions: 2.6.0 > Environment: debian 7 >Reporter: Gour Saha >Priority: Critical > Attachments: app0001.zip, application_1431771946377_0001.zip > > > Non-AM containers continue to run even after application is stopped. This > occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a > Hadoop 2.6 deployment. > Following are the NM logs from 2 different nodes: > *host-07* - where Slider AM was running > *host-03* - where Storm NIMBUS container was running. > *Note:* The logs are partial, starting with the time when the relevant Slider > AM and NIMBUS containers were allocated, till the time when the Slider AM was > stopped. Also, the large number of "Memory usage" log lines were removed > keeping only a few starts and ends of every segment. 
> *NM log from host-07 where Slider AM container was running:* > {noformat} > 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for > container_1428575950531_0020_02_01 > 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - > Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) > 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for > container_1428575950531_0021_01_01 by user yarn > 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new > application reference for app application_1428575950531_0021 > 2015-04-29 00:41:10,323 INFO application.Application > (ApplicationImpl.java:handle(464)) - Application > application_1428575950531_0021 transitioned from NEW to INITING > 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger > (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 > OPERATION=Start Container Request TARGET=ContainerManageImpl > RESULT=SUCCESS APPID=application_1428575950531_0021 > CONTAINERID=container_1428575950531_0021_01_01 > 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService > (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root > Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: > [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple > users. > 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:(182)) - rollingMonitorInterval is set as > -1. The log rolling mornitoring interval is disabled. The logs will be > aggregated after this application is finished. 
> 2015-04-29 00:41:10,351 INFO application.Application > (ApplicationImpl.java:transition(304)) - Adding > container_1428575950531_0021_01_01 to application > application_1428575950531_0021 > 2015-04-29 00:41:10,352 INFO application.Application > (ApplicationImpl.java:handle(464)) - Application > application_1428575950531_0021 transitioned from INITING to RUNNING > 2015-04-29 00:41:10,356 INFO container.Container > (ContainerImpl.java:handle(999)) - Container > container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING > 2015-04-29 00:41:10,357 INFO containermanager.AuxServices > (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId > application_1428575950531_0021 > 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tm
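The kill command fix described in the comment above (the two hyphens) can be sketched as follows. The `--` marks the end of options, so a negative argument such as `-<pid>` (a process-group id) is not misparsed as a signal flag. The class and method names are illustrative; this only mirrors what such a change in Shell.java would build, it is not the actual Hadoop code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the corrected kill command construction: the "--"
// end-of-options marker lets the negative process-group argument through.
public class KillCommandBuilder {
    /** Builds e.g. ["kill", "-15", "--", "-12345"] to signal process group 12345. */
    static List<String> buildKillProcessGroup(int signal, long pgid) {
        return Arrays.asList("kill", "-" + signal, "--", "-" + pgid);
    }
}
```

Without the `--`, some shells and kill implementations treat `-12345` as an (invalid) option, so the signal never reaches the container's process group and its non-AM children keep running.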
[jira] [Updated] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2923: Attachment: YARN-2923.20150517-1.patch
Hi [~wangda], attaching a WIP patch (it may need some more static checks after the jenkins run):
# rebased the patch on top of YARN-3565
# moved the common code that was earlier in YARN-2729 here, as this jira will go in first
# corrected most of [~vinodkv]'s comments on YARN-2729; still pending are:
* I think the format expected from the command should be more structured, specifically as we expect more per-label attributes in line with YARN-3565.
* Not caused by your patch but worth fixing here: NodeStatusUpdaterImpl shouldn't worry about invalid label-set, previous-valid-labels, and label validation. You should move all that functionality into NodeLabelsProvider.
* Can you add the documentation for setting this up too?
For these I wanted to discuss with you before working on them.
> Support configuration based NodeLabelsProvider Service in Distributed Node > Label Configuration Setup > - > > Key: YARN-2923 > URL: https://issues.apache.org/jira/browse/YARN-2923 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Fix For: 2.8.0 > > Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, > YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, > YARN-2923.20150517-1.patch > > > As part of Distributed Node Labels configuration we need to support Node > labels to be configured in Yarn-site.xml. And on modification of Node Labels > configuration in yarn-site.xml, NM should be able to get modified Node labels > from this NodeLabelsprovider service without NM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547168#comment-14547168 ] sandflee commented on YARN-3668: If the AM crashes and reaches the AM max fail count, the application is killed. If we set the AM max fail count to a large value or make it unlimited, the RM may have too many AppAttempts to store in memory and in the RMStateStore; YARN-3480 could resolve this problem by storing only a limited number of AppAttempts. > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
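sandflee's point about unbounded AM retries bloating the RMStateStore can be sketched with a simple bounded history: keep only the most recent N attempt records and evict the oldest on overflow. This is an illustration of the idea behind YARN-3480, not its actual implementation; the class and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a fixed-capacity attempt history so that
// unlimited AM retries cannot grow RM state without bound.
class BoundedAttemptHistory {
    private final int maxKept;
    private final Deque<String> attemptIds = new ArrayDeque<>();

    BoundedAttemptHistory(int maxKept) {
        this.maxKept = maxKept;
    }

    /** Record a new attempt, evicting the oldest once over the cap. */
    synchronized void add(String attemptId) {
        attemptIds.addLast(attemptId);
        while (attemptIds.size() > maxKept) {
            // oldest attempt record dropped from the (in-memory) store
            attemptIds.removeFirst();
        }
    }

    synchronized int size() {
        return attemptIds.size();
    }
}
```

With a cap of, say, 3, recording any number of attempts never retains more than the 3 most recent records, which is the memory/state-store bound the comment is after.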
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547165#comment-14547165 ] sandflee commented on YARN-3668: If all RMs crash, all running containers will be killed; YARN-3644 discusses this. > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
sandflee created YARN-3668: -- Summary: Long run service shouldn't be killed even if Yarn crashed Key: YARN-3668 URL: https://issues.apache.org/jira/browse/YARN-3668 Project: Hadoop YARN Issue Type: Wish Reporter: sandflee For long running service, it shouldn't be killed even if all yarn component crashed, with RM work preserving and NM restart, yarn could take over applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547159#comment-14547159 ] sandflee commented on YARN-3644: In our cluster we also face this problem. I'd like to work on this if possible; more comments are welcome!
> Node manager shuts down if unable to connect with RM
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Srikanth Sundarrajan
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
> } catch (ConnectException e) {
>   //catch and throw the exception if tried MAX wait time to connect RM
>   dispatcher.getEventHandler().handle(
>       new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>   throw new YarnRuntimeException(e);
> {code}
> In large clusters, if RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring up the NMs. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy).
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
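The alternative the issue implies, retrying with capped backoff instead of shutting the NM down on the first exhausted connect attempt, can be sketched as below. This is a hedged illustration only, with hypothetical helper names; it is not the RMProxy retry implementation, and the constants are placeholders for what would really come from configuration.

```java
import java.io.IOException;

// Hypothetical sketch: retry a connect with capped exponential backoff
// instead of immediately triggering NodeManagerEventType.SHUTDOWN.
class ConnectRetry {
    interface Connector {
        void connect() throws IOException;
    }

    static void connectWithBackoff(Connector c, int maxAttempts,
                                   long baseMillis) throws Exception {
        long delay = baseMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                c.connect();
                return;                           // connected; resume heartbeats
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e;                      // give up only after the cap
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, 30_000L);  // cap backoff at 30s
            }
        }
    }
}
```

A cap of -1 (infinite) would reproduce the side effect the reporter notes, so a large finite cap plus backoff is the middle ground this sketch assumes.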
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547155#comment-14547155 ] sandflee commented on YARN-3644: [~raju.bairishetti] thanks for your reply. If RM HA is not enabled, we can fix it like this. But with RM HA, there are some conditions to consider:
1. Both RM A and RM B reset the connection: the RMs seem to be in trouble, so the NM keeps containers alive.
2. Both RM A and RM B hit a socket timeout: the NM seems to be network-partitioned from the RMs, or all RM machines crashed (any way to distinguish them?), so the NM kills all containers.
3. One RM resets the connection and the other hits a socket timeout: this is difficult to handle, since we know nothing about the active RM; both RMs may have crashed, or just the active RM may be network-partitioned.
I suggest the backup RM also respond and tell the NM "I'm the backup RM". Case 3 then becomes:
3.1. One RM resets the connection and the other hits a socket timeout: the RMs seem to be in trouble, so just keep containers alive.
3.2. One RM is the backup and the other hits a socket timeout: the NM seems to be network-partitioned from the active RM, so kill all containers.
> Node manager shuts down if unable to connect with RM
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Srikanth Sundarrajan
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
> } catch (ConnectException e) {
>   //catch and throw the exception if tried MAX wait time to connect RM
>   dispatcher.getEventHandler().handle(
>       new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>   throw new YarnRuntimeException(e);
> {code}
> In large clusters, if RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring up the NMs. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy).
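The decision table sandflee describes (connection reset vs. socket timeout from each RM, plus an explicit "I'm the backup RM" response) can be sketched as a small function. This is a hypothetical illustration of the proposed policy only, not NodeManager code; the names are invented for the sketch.

```java
// Hypothetical sketch of the proposed NM policy: probe both RMs and
// decide whether to keep containers alive based on the two outcomes.
class RmProbeDecision {
    enum Probe { CONNECTION_RESET, SOCKET_TIMEOUT, STANDBY_RESPONSE }

    /** true = keep containers running; false = kill them. */
    static boolean keepContainers(Probe rmA, Probe rmB) {
        // Case 2: both timed out -> NM is likely partitioned from the RMs
        // (or all RM machines crashed), so kill all containers.
        if (rmA == Probe.SOCKET_TIMEOUT && rmB == Probe.SOCKET_TIMEOUT) {
            return false;
        }
        // Case 3.2: a standby answered while the other timed out -> the NM
        // can reach the network, so the active RM is unreachable; kill.
        if ((rmA == Probe.STANDBY_RESPONSE && rmB == Probe.SOCKET_TIMEOUT)
                || (rmB == Probe.STANDBY_RESPONSE && rmA == Probe.SOCKET_TIMEOUT)) {
            return false;
        }
        // Cases 1 and 3.1: any connection reset means a reachable host with
        // no RM process -> the RMs themselves are in trouble; keep alive.
        return true;
    }
}
```

The ambiguity in the original case 3 disappears only because of the assumed standby response; without it, reset-plus-timeout stays in the "keep alive" branch, as 3.1 suggests.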
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547152#comment-14547152 ] Hadoop QA commented on YARN-126:
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 36s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 31s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 31s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 5s | The applied patch generated 15 new checkstyle issues (total was 42, now 56). |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 15 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 40s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:red}-1{color} | common tests | 22m 45s | Tests failed in hadoop-common. |
| | | | 59m 40s | |
|| Reason || Tests ||
| Failed unit tests | hadoop.util.TestGenericOptionsParser |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12733377/YARN-126.002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cab0dad |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7962/artifact/patchprocess/diffcheckstylehadoop-common.txt |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7962/artifact/patchprocess/whitespace.txt |
| hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7962/artifact/patchprocess/testrun_hadoop-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7962/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7962/console |
This message was automatically generated.
> yarn rmadmin help message contains reference to hadoop cli and JT > - > > Key: YARN-126 > URL: https://issues.apache.org/jira/browse/YARN-126 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.0.3-alpha >Reporter: Thomas Graves >Assignee: Rémy SAISSY > Labels: usability > Attachments: YARN-126.002.patch, YARN-126.patch > > > has option to specify a job tracker and the last line for general command > line syntax had "bin/hadoop command [genericOptions] [commandOptions]" > ran "yarn rmadmin" to get usage: > RMAdmin > Usage: java RMAdmin >[-refreshQueues] >[-refreshNodes] >[-refreshUserToGroupsMappings] >[-refreshSuperUserGroupsConfiguration] >[-refreshAdminAcls] >[-refreshServiceAcl] >[-help [cmd]] > Generic options supported are > -conf specify an application configuration file > -D use value for given property > -fs specify a namenode > -jt specify a job tracker > -files specify comma separated files to be > copied to the map reduce cluster > -libjars specify comma separated jar files > to include in the classpath. > -archives specify comma separated > archives to be unarchived on the compute machines. > The general command line syntax is > bin/hadoop command [genericOptions] [commandOptions] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rémy SAISSY updated YARN-126: - Attachment: YARN-126.002.patch > yarn rmadmin help message contains reference to hadoop cli and JT > - > > Key: YARN-126 > URL: https://issues.apache.org/jira/browse/YARN-126 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.0.3-alpha >Reporter: Thomas Graves >Assignee: Rémy SAISSY > Labels: usability > Attachments: YARN-126.002.patch, YARN-126.patch > > > has option to specify a job tracker and the last line for general command > line syntax had "bin/hadoop command [genericOptions] [commandOptions]" > ran "yarn rmadmin" to get usage: > RMAdmin > Usage: java RMAdmin >[-refreshQueues] >[-refreshNodes] >[-refreshUserToGroupsMappings] >[-refreshSuperUserGroupsConfiguration] >[-refreshAdminAcls] >[-refreshServiceAcl] >[-help [cmd]] > Generic options supported are > -conf specify an application configuration file > -D use value for given property > -fs specify a namenode > -jt specify a job tracker > -files specify comma separated files to be > copied to the map reduce cluster > -libjars specify comma separated jar files > to include in the classpath. > -archives specify comma separated > archives to be unarchived on the compute machines. > The general command line syntax is > bin/hadoop command [genericOptions] [commandOptions] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rémy SAISSY updated YARN-126: - Attachment: (was: YARN-126.002.patch) > yarn rmadmin help message contains reference to hadoop cli and JT > - > > Key: YARN-126 > URL: https://issues.apache.org/jira/browse/YARN-126 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.0.3-alpha >Reporter: Thomas Graves >Assignee: Rémy SAISSY > Labels: usability > Attachments: YARN-126.patch > > > has option to specify a job tracker and the last line for general command > line syntax had "bin/hadoop command [genericOptions] [commandOptions]" > ran "yarn rmadmin" to get usage: > RMAdmin > Usage: java RMAdmin >[-refreshQueues] >[-refreshNodes] >[-refreshUserToGroupsMappings] >[-refreshSuperUserGroupsConfiguration] >[-refreshAdminAcls] >[-refreshServiceAcl] >[-help [cmd]] > Generic options supported are > -conf specify an application configuration file > -D use value for given property > -fs specify a namenode > -jt specify a job tracker > -files specify comma separated files to be > copied to the map reduce cluster > -libjars specify comma separated jar files > to include in the classpath. > -archives specify comma separated > archives to be unarchived on the compute machines. > The general command line syntax is > bin/hadoop command [genericOptions] [commandOptions] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt resolved YARN-3651. Resolution: Won't Fix
[~devraj.jaiman], thank you for looking into this. Closing the issue as Won't Fix since this behavior is intentional.
> Tracking url in ApplicationCLI wrong for running application
>
> Key: YARN-3651
> URL: https://issues.apache.org/jira/browse/YARN-3651
> Project: Hadoop YARN
> Issue Type: Bug
> Components: applications, resourcemanager
> Affects Versions: 2.7.0
> Environment: Suse 11 Sp3
> Reporter: Bibin A Chundatt
> Priority: Minor
>
> Application URL in Application CLI wrong
> Steps to reproduce
> ==
> 1. Start HA setup in insecure mode
> 2. Configure HTTPS_ONLY
> 3. Submit application to cluster
> 4. Execute command ./yarn application -list
> 5. Observe the tracking URL shown
> {code}
> 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History server at /:45034
> Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
> Application-Id --- Tracking-URL
> application_1431672734347_0003 *http://host-10-19-92-117:13013*
> {code}
> *Expected*
> https://:64323/proxy/application_1431672734347_0003 /
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547113#comment-14547113 ] Hadoop QA commented on YARN-3044:
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 26s | Pre-patch YARN-2928 compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. |
| {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 43s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 52s | The applied patch generated 1 new checkstyle issues (total was 241, now 242). |
| {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 40s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 42s | The patch built with eclipse:eclipse. |
| {color:red}-1{color} | findbugs | 4m 13s | The patch appears to introduce 7 new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-server-common. |
| {color:green}+1{color} | yarn tests | 52m 56s | Tests passed in hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests | 0m 55s | Tests passed in hadoop-yarn-server-timelineservice. |
| | | | 97m 9s | |
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-resourcemanager |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptFinishedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptFinishedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:[line 79] |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptRegisteredEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptRegisteredEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:[line 76] |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationACLsUpdatedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationACLsUpdatedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:[line 73] |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationCreatedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationCreatedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:[line 67] |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationFinishedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationFinishedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent
[jira] [Updated] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3044: Attachment: YARN-3044-YARN-2928.008.patch
Hi [~zjshen], uploading a patch with the following corrections, please review:
1. RMContainerEntity has been removed; instead, a ContainerEntity with a new event is published
2. Removed the duplicated code by having an abstract class for TimelineServicePublisher
3. Removed the code for Application Config (as per Zhijie's suggestion)
4. Renamed yarn.system-metrics-publisher.rm.publish.container-metrics -> yarn.rm.system-metrics-publisher.emit-container-events
5. Corrected ??Methods/inner classes in SystemMetricsPublisher don't need to be changed to "public"??
> [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Labels: BB2015-05-TBR > Attachments: YARN-3044-YARN-2928.004.patch, > YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, > YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, > YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, > YARN-3044.20150416-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)