[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529224#comment-14529224 ] Craig Welch commented on YARN-1680: --- I've been looking over [~airbots] prior patches, the discussion, etc.; this is what I was going to suggest as an approach. As I mentioned before, I think that accuracy will unfortunately require holding on to the blacklist in the scheduler app. I think this is OK because these should be relatively small, but it is still a drawback. We could impose a size limit as a mitigating factor, but that could affect accuracy in some cases as well. In any event, this is the approach I'm suggesting:
- Retain a node/rack blacklist in the scheduler application based on additions/removals from the application master
- Add a last-change timestamp or incrementing counter to track node addition/removal at the cluster level (which is what exists for cluster black/white listing afaict), updated when those events occur
- Add a last-change timestamp/counter to the application to track blacklist changes
- Keep last-updated values on the application tracking the above two last-change values, updated when blacklist values are recalculated
- On headroom calculation, the app checks whether it has any entries in the blacklist or a blacklist deduction value in its ResourceUsage entry (see below) to determine if the blacklist must be taken into account
- If the blacklist must be taken into account, check the last-updated values for both cluster and app blacklist changes; if and only if either is stale (last updated != last change), recalculate the blacklist deduction
- When calculating the blacklist deduction, use [~airbots] basic logic from existing patches, and place the deduction value into a new enumeration index type in ResourceUsage. NodeLabels could be taken into account as well: there is some logic about label(s) of interest on the application, so in addition to a no-label value which is generally applicable, a value for the label(s) of interest could be generated
- Whenever the headroom is handed out by the provider, add a step which applies the proper blacklist deduction if present
Thoughts on the approach? availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Craig Welch Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A running job's reducer tasks occupied 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted it. All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers because the headroom used for the reducer-preemption calculation still counts the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes but returns an availableResources value that counts the whole cluster's free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
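A minimal sketch of the staleness check described above, in Java; the names here ({{BlacklistedHeadroom}}, {{availableByNode}}, the version counters) are illustrative assumptions, not the actual patch. The deduction is recomputed only when either the cluster node set or the app's blacklist has changed since the last calculation.
{code}
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

class BlacklistedHeadroom {
  private long clusterVersionSeen = -1;  // cluster node add/remove counter last applied
  private long appVersionSeen = -1;      // app blacklist add/remove counter last applied
  private Resource deduction = Resources.createResource(0, 0);

  Resource headroomFor(Resource headroom, long clusterVersion, long appVersion,
      Set<String> blacklist, Map<String, Resource> availableByNode) {
    if (blacklist.isEmpty() && deduction.getMemory() == 0) {
      return headroom; // no blacklist in play: nothing to deduct
    }
    if (clusterVersion != clusterVersionSeen || appVersion != appVersionSeen) {
      // Stale: recompute the deduction by summing free resource on blacklisted nodes.
      Resource d = Resources.createResource(0, 0);
      for (String host : blacklist) {
        Resource avail = availableByNode.get(host);
        if (avail != null) {
          Resources.addTo(d, avail);
        }
      }
      deduction = d;
      clusterVersionSeen = clusterVersion;
      appVersionSeen = appVersion;
    }
    return Resources.subtract(headroom, deduction);
  }
}
{code}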
[jira] [Created] (YARN-3579) getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String
Sunil G created YARN-3579: - Summary: getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String Key: YARN-3579 URL: https://issues.apache.org/jira/browse/YARN-3579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Priority: Minor CommonNodeLabelsManager#getLabelsToNodes returns the label name as a String, so it does not pass information such as exclusivity back to the REST interface APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529155#comment-14529155 ] Vinod Kumar Vavilapalli commented on YARN-3385: --- Actually, there is a checkstyle warning and a test-related problem. Please look at them. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2369) Environment variable handling assumes values should be appended
[ https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529194#comment-14529194 ] Dustin Cote commented on YARN-2369: --- [~vinodkv] thanks for the feedback! I would expect (b) for general apps, as well as a specific config for MR jobs. Should I put the config in MRConfig or MRJobConfig instead then? I'll fix the specific comments you raised once I have the test case in too; I'll put them all in the next patch. Thanks again! Environment variable handling assumes values should be appended --- Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Dustin Cote Attachments: YARN-2369-1.patch, YARN-2369-2.patch When processing environment variables for a container context, the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc., but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
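A hedged sketch of what replace-by-default could look like: only a whitelisted set of path-like variables keeps the append behavior. The helper and its whitelist are illustrative assumptions, not the proposed patch.
{code}
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class EnvSetter {
  // Illustrative whitelist: only these keep path-like append semantics.
  private static final Set<String> PATH_LIKE = new HashSet<String>(
      Arrays.asList("PATH", "CLASSPATH", "LD_LIBRARY_PATH"));

  static void setEnv(Map<String, String> env, String key, String value) {
    String existing = env.get(key);
    if (existing != null && PATH_LIKE.contains(key)) {
      // Path-like semantics: append with the platform path separator.
      env.put(key, existing + File.pathSeparator + value);
    } else {
      // Everything else: replace, don't append.
      env.put(key, value);
    }
  }
}
{code}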
[jira] [Updated] (YARN-3579) getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String
[ https://issues.apache.org/jira/browse/YARN-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3579: -- Issue Type: Sub-task (was: Bug) Parent: YARN-2492 getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String Key: YARN-3579 URL: https://issues.apache.org/jira/browse/YARN-3579 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Priority: Minor CommonNodeLabelsManager#getLabelsToNodes returns the label name as a String, so it does not pass information such as exclusivity back to the REST interface APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528908#comment-14528908 ] Wangda Tan commented on YARN-3514: -- [~cnauroth], bq. I've seen a few mentions online that Active Directory is not case-sensitive but is case-preserving. That means it will preserve the case you used in usernames, but the case doesn't matter for comparisons. I've also seen references that DNS has similar behavior with regards to case. Good point! I've found one post about this: https://msdn.microsoft.com/en-us/library/bb726984.aspx: bq. Note: Although Windows 2000 stores user names in the case that you enter, user names aren't case sensitive. For example, you can access the Administrator account with the user name Administrator or administrator. Thus, user names are case-aware but not case-sensitive. So I think it's safe to make this change too. Active directory usernames like domain\login cause YARN failures Key: YARN-3514 URL: https://issues.apache.org/jira/browse/YARN-3514 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Environment: CentOS6 Reporter: john lilley Assignee: Chris Nauroth Priority: Minor Attachments: YARN-3514.001.patch, YARN-3514.002.patch We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is Kerberos-enabled and uses an external AD domain controller for the KDC. We are able to authenticate, browse HDFS, etc. However, YARN fails during localization because it seems to get confused by the presence of a \ character in the local user name. Our AD authentication on the nodes goes through sssd and is configured to map AD users onto the form domain\username. For example, our test user has a Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user domain\hadoopuser. We have no problem validating that user with PAM, logging in as that user, su-ing to that user, etc. However, when we attempt to run a YARN application master, the localization step fails when setting up the local cache directory for the AM. The error that comes out of the RM logs:
2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, diagnostics='Application application_1429295486450_0001 failed 1 times due to AM Container for appattempt_1429295486450_0001_01 exited with exitCode: -1000 due to: Application application_1429295486450_0001 initialization failed (exitCode=255) with output:
main : command provided 0
main : user is DOMAIN\hadoopuser
main : requested yarn user is domain\hadoopuser
org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create directory: /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
.Failing this attempt.. Failing the application.'
However, when we look on the node launching the AM, we see this:
[root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
[root@rpb-cdh-kerb-2 usercache]# ls -l
drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
There appears to be different treatment of the \ character in different places. Something creates the directory as domain\hadoopuser but something else later attempts to use it as domain%5Chadoopuser. I'm not sure where or why the URL escaping converts the \ to %5C or why this is not consistent. I should also mention, for the sake of completeness, our auth_to_local rule is set up to map u...@domain.com to domain\user: RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g -- This message was sent by Atlassian JIRA (v6.3.4#6332)
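For what it's worth, standard URL encoding does turn a backslash into %5C, which is consistent with one code path percent-encoding the username while another uses it raw. This is a demonstration of the encoding only, not a claim about where YARN performs it:
{code}
import java.net.URLEncoder;

public class EscapeDemo {
  public static void main(String[] args) throws Exception {
    // Backslash is not an "unreserved" character, so it gets percent-encoded:
    System.out.println(URLEncoder.encode("domain\\hadoopuser", "UTF-8"));
    // prints: domain%5Chadoopuser
  }
}
{code}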
[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528990#comment-14528990 ] Hudson commented on YARN-3396: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #185 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/185/]) YARN-3396. Handle URISyntaxException in ResourceLocalizationService. (Contributed by Brahma Reddy Battula) (junping_du: rev 38102420621308f5ba91cdeb6a18a63aa5acf640) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Fix For: 2.8.0 Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2918) Don't fail RM if queue's configured labels are not existed in cluster-node-labels
[ https://issues.apache.org/jira/browse/YARN-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528991#comment-14528991 ] Wangda Tan commented on YARN-2918: -- Added more details to the description. I plan to do the following in this patch:
- Stop checking a label's existence while initializing queues
- Continue checking each label's capacity setting ({{Σchild-queue.label.capacity = 100}})
- Reject an application/resource-request if its label does not exist.
Don't fail RM if queue's configured labels are not existed in cluster-node-labels - Key: YARN-2918 URL: https://issues.apache.org/jira/browse/YARN-2918 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Wangda Tan Currently, if an admin sets up labels on queues ({{queue-path.accessible-node-labels = ...}}) and a label is not added to the RM, queue initialization will fail and the RM will fail too:
{noformat}
2014-12-03 20:11:50,126 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
...
Caused by: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkIfLabelInClusterNodeLabels(SchedulerUtils.java:287)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:109)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:324)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
{noformat}
This is not a good user experience; we should stop failing the RM so that the admin can configure queues/labels in the following steps:
- Configure queue (with label)
- Start RM
- Add labels to RM
- Submit applications
Now the admin has to:
- Configure queue (without label)
- Start RM
- Add labels to RM
- Refresh queue's config (with label)
- Submit applications
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
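A rough sketch of the per-label capacity check that would be kept even when label existence is no longer validated at queue init; the helper and its input shape are hypothetical, not the patch itself:
{code}
import java.util.Map;

public class LabelCapacityCheck {
  // childCapacityPct: each child queue's configured capacity (in percent)
  // for one label; their sum must be 100 for the parent to be valid.
  static void checkLabelCapacities(String label, Map<String, Float> childCapacityPct) {
    float sum = 0f;
    for (float pct : childCapacityPct.values()) {
      sum += pct;
    }
    if (Math.abs(sum - 100f) > 1e-4f) {
      throw new IllegalArgumentException("Capacities of children for label '"
          + label + "' sum to " + sum + ", expected 100");
    }
  }
}
{code}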
[jira] [Commented] (YARN-3552) RM Web UI shows -1 running containers for completed apps
[ https://issues.apache.org/jira/browse/YARN-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528987#comment-14528987 ] Hudson commented on YARN-3552: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #185 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/185/]) YARN-3552. RM Web UI shows -1 running containers for completed apps. Contributed by Rohith (jlowe: rev 9356cf8676fd18f78655e8a6f2e6c946997dbd40) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/FairSchedulerAppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppsBlock.java RM Web UI shows -1 running containers for completed apps Key: YARN-3552 URL: https://issues.apache.org/jira/browse/YARN-3552 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.8.0 Reporter: Rohith Assignee: Rohith Priority: Trivial Labels: newbie Fix For: 2.8.0 Attachments: 0001-YARN-3552.patch, 0001-YARN-3552.patch, 0002-YARN-3552.patch, yarn-3352.PNG In RMServerUtils, the default values are negative numbers, which results in the RM web UI also displaying negative numbers.
{code}
public static final ApplicationResourceUsageReport
    DUMMY_APPLICATION_RESOURCE_USAGE_REPORT =
        BuilderUtils.newApplicationResourceUsageReport(-1, -1,
            Resources.createResource(-1, -1), Resources.createResource(-1, -1),
            Resources.createResource(-1, -1), 0, 0);
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
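One plausible guard at render time (not necessarily the committed fix) is to show a placeholder instead of the dummy negative values:
{code}
// Hypothetical display helper: dummy reports carry -1, which should not
// be presented to users as a real container count.
static String displayCount(int runningContainers) {
  return runningContainers < 0 ? "N/A" : String.valueOf(runningContainers);
}
{code}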
[jira] [Commented] (YARN-3343) TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529004#comment-14529004 ] Jian He commented on YARN-3343: --- cool, thanks for testing! Committing this. TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk --- Key: YARN-3343 URL: https://issues.apache.org/jira/browse/YARN-3343 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3343.patch Error Message: test timed out after 3 milliseconds Stacktrace:
java.lang.Exception: test timed out after 3 milliseconds
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
at java.net.InetAddress.getAllByName(InetAddress.java:1162)
at java.net.InetAddress.getAllByName(InetAddress.java:1098)
at java.net.InetAddress.getByName(InetAddress.java:1048)
at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:563)
at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:147)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:367)
at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:178)
at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:136)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:206)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate(TestCapacitySchedulerNodeLabelUpdate.java:157)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reassigned YARN-3526: - Assignee: Yang Weiwei [~cheersyang], assigning this to you. Please consider writing a test case. ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster - Key: YARN-3526 URL: https://issues.apache.org/jira/browse/YARN-3526 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Environment: Red Hat Enterprise Linux Server 6.4 Reporter: Yang Weiwei Assignee: Yang Weiwei Attachments: YARN-3526.patch On a QJM HA cluster, when viewing the RM web UI to track job status, it shows "This is standby RM. Redirecting to the current active RM: http://active-RM:8088/proxy/application_1427338037905_0008/mapreduce". It refreshes every 3 seconds but never goes to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528891#comment-14528891 ] zhihai xu commented on YARN-3385: - By the way, I forgot to mention: if the NoNodeException happened due to this race condition, it means the delete operations were already done. Because {{zkClient.multi}} will either execute all of the Ops or none of them, all of the delete operations must have been done. [This|http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html] is a good article about multi update for ZooKeeper. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
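A minimal sketch of the mitigation this reasoning implies, assuming {{zkClient}} and {{LOG}} fields exist on the store: if the znode is already gone, the delete has effectively succeeded, so swallow the NoNodeException instead of escalating it to a fatal state-store error. (The actual patch works on the multi-op path, so this is illustrative only.)
{code}
import org.apache.zookeeper.KeeperException;

// Assumed context: zkClient is an org.apache.zookeeper.ZooKeeper handle,
// LOG is the store's logger.
void safeDelete(String path) throws Exception {
  try {
    zkClient.delete(path, -1); // -1 matches any znode version
  } catch (KeeperException.NoNodeException e) {
    // Node absent means a previous (retried) attempt already deleted it.
    LOG.info("Node " + path + " already deleted, ignoring");
  }
}
{code}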
[jira] [Commented] (YARN-1878) Yarn standby RM taking long to transition to active
[ https://issues.apache.org/jira/browse/YARN-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528934#comment-14528934 ] Xuan Gong commented on YARN-1878: - Cancel the patch. Looks like we need more discussion on this one. Yarn standby RM taking long to transition to active --- Key: YARN-1878 URL: https://issues.apache.org/jira/browse/YARN-1878 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Xuan Gong Attachments: YARN-1878.1.patch In our HA tests we are noticing that it can sometimes take up to 10s for the standby RM to transition to active. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528952#comment-14528952 ] Xuan Gong commented on YARN-1912: - Cancel the patch since the current patch does not apply anymore. [~iwasakims] Could you rebase the patch, please? ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Attachments: YARN-1912-0.patch, YARN-1912-1.patch In LinuxContainerExecutor.java#startLocalizer, the command does not specify any -Xmx option, which causes the ResourceLocalizer to be started with the default memory setting. On server-class hardware it will use 25% of the system memory as the max heap size, which can cause memory issues in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
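An illustrative sketch of what a bounded localizer launch could look like; the configuration key below is made up, and an actual fix would need an agreed property name and default:
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer;

class LocalizerCommand {
  // "yarn.nodemanager.localizer.java.heap" is a hypothetical key,
  // not a real YARN property.
  static List<String> build(Configuration conf, String javaHome) {
    List<String> cmd = new ArrayList<String>();
    cmd.add(javaHome + "/bin/java");
    cmd.add("-Xmx" + conf.get("yarn.nodemanager.localizer.java.heap", "256m"));
    cmd.add(ContainerLocalizer.class.getName());
    return cmd;
  }
}
{code}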
[jira] [Updated] (YARN-2918) Don't fail RM if queue's configured labels are not existed in cluster-node-labels
[ https://issues.apache.org/jira/browse/YARN-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2918: - Summary: Don't fail RM if queue's configured labels are not existed in cluster-node-labels (was: RM starts up fails if accessible-node-labels are configured to queue without cluster lables) Don't fail RM if queue's configured labels are not existed in cluster-node-labels - Key: YARN-2918 URL: https://issues.apache.org/jira/browse/YARN-2918 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Wangda Tan I configured accessible-node-labels for a queue, but RM startup fails with the below exception. I see the current steps to configure NodeLabels are to first add them via rmadmin and later configure them for queues. But it would be good if both cluster and queue node labels were consistent in how they are configured.
{noformat}
2014-12-03 20:11:50,126 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:556)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:982)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:249)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1203)
Caused by: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkIfLabelInClusterNodeLabels(SchedulerUtils.java:287)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:109)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:324)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3578) HistoryFileManager.scanDirectory() should check if the dateString path exists else it throw FileNotFoundException
Siddhi Mehta created YARN-3578: -- Summary: HistoryFileManager.scanDirectory() should check if the dateString path exists else it throw FileNotFoundException Key: YARN-3578 URL: https://issues.apache.org/jira/browse/YARN-3578 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.3.0 Reporter: Siddhi Mehta This happens when the job client tries to access counters for a recently completed job. Here is what I think is happening:
1. The job in question started and completed on 05/02/2015, so the history file location should be /mapred/history/done/2015/05/02/{02}/
2. But instead HistoryFileManager looks at directory /mapred/history/done/2015/04/02/{02}/ and fails. Looking at the logic in {{org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanOldDirsForJob(JobId)}} for how the idtoDateString cache is created, the key looks independent of the RM start time. So if you had 2 jobs, job_RMstarttime1_0001 and job_RMstarttime2_0001, the idtoDateString cache will have the entry 1 -> {job_RMstarttime1_0001historydir, job_RMstarttime2_0001historyDir}.
3. If job_RMstarttime1_0001 is older than mapreduce.jobhistory.max-age-ms, we delete its history info from HDFS.
4. When we then try to query job_RMstarttime2_0001historyDir, it fails with a FileNotFoundException.
Either the keys should be aware of the RM start time, or HistoryFileManager.scanDirectory should check that the path exists before doing a listStatus, to avoid the FileNotFoundException:
{code}
private static List<FileStatus> scanDirectory(Path path, FileContext fc,
    PathFilter pathFilter) throws IOException {
  path = fc.makeQualified(path);
  List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
  if (!fc.exists(path)) {
    return jhStatusList;
  }
  RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
  while (fileStatusIter.hasNext()) {
    FileStatus fileStatus = fileStatusIter.next();
    Path filePath = fileStatus.getPath();
    if (fileStatus.isFile() && pathFilter.accept(filePath)) {
      jhStatusList.add(fileStatus);
    }
  }
  return jhStatusList;
}
{code}
Complete stack trace: gslog`20150504141445.816``263424`0`0189246-10858515`754671855`/ex/UnhandledException.jsp`JAVA.FileNotFoundException - java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException): java.io.FileNotFoundException: File /mapred/history/done/2015/04/02/02 does not exist.
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:147)
at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(JobHistory.java:217)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:203)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:199)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryClientService.java:199)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getJobReport(HistoryClientService.java:231)
at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122)
at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
Caused by: java.io.FileNotFoundException: File /mapred/history/done/2015/04/02/02 does not exist.
at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:205)
at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:189)
at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:171)
at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:171)
at org.apache.hadoop.fs.FileContext$20.next(FileContext.java:1392)
at
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529030#comment-14529030 ] Vinod Kumar Vavilapalli commented on YARN-3561: --- bq. Could this be OS specific (debian 7)? Possible. Can you post the full NM logs? Non-AM Containers continue to run even after AM is stopped -- Key: YARN-3561 URL: https://issues.apache.org/jira/browse/YARN-3561 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.6.0 Environment: debian 7 Reporter: Gour Saha Priority: Critical Non-AM containers continue to run even after application is stopped. This occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a Hadoop 2.6 deployment. Following are the NM logs from 2 different nodes:
*host-07* - where Slider AM was running
*host-03* - where Storm NIMBUS container was running.
*Note:* The logs are partial, starting with the time when the relevant Slider AM and NIMBUS containers were allocated, till the time when the Slider AM was stopped. Also, the large number of Memory usage log lines were removed keeping only a few starts and ends of every segment.
*NM log from host-07 where Slider AM container was running:*
{noformat}
2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for container_1428575950531_0020_02_01
2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE)
2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for container_1428575950531_0021_01_01 by user yarn
2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new application reference for app application_1428575950531_0021
2015-04-29 00:41:10,323 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from NEW to INITING
2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1428575950531_0021 CONTAINERID=container_1428575950531_0021_01_01
2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users.
2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:init(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2015-04-29 00:41:10,351 INFO application.Application (ApplicationImpl.java:transition(304)) - Adding container_1428575950531_0021_01_01 to application application_1428575950531_0021
2015-04-29 00:41:10,352 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from INITING to RUNNING
2015-04-29 00:41:10,356 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING
2015-04-29 00:41:10,357 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId application_1428575950531_0021
2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar transitioned from INIT to DOWNLOADING
2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar transitioned from INIT to DOWNLOADING
2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar transitioned from INIT to DOWNLOADING
2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/confdir/log4j-server.properties transitioned from INIT to
[jira] [Commented] (YARN-2123) Progress bars in Web UI always at 100% due to non-US locale
[ https://issues.apache.org/jira/browse/YARN-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528855#comment-14528855 ] Xuan Gong commented on YARN-2123: - +1 LGTM. Will commit Progress bars in Web UI always at 100% due to non-US locale --- Key: YARN-2123 URL: https://issues.apache.org/jira/browse/YARN-2123 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.3.0 Reporter: Johannes Simon Assignee: Akira AJISAKA Attachments: NaN_after_launching_RM.png, YARN-2123-001.patch, YARN-2123-002.patch, YARN-2123-003.patch, YARN-2123-004.patch, YARN-2123-branch-2.7.001.patch, fair-scheduler-ajisaka.xml, screenshot-noPatch.png, screenshot-patch.png, screenshot.png, yarn-site-ajisaka.xml In our cluster setup, the YARN web UI always shows progress bars at 100% (see screenshot, progress of the reduce step is roughly at 32.82%). I opened the HTML source code to check (also see screenshot), and it seems the problem is that it uses a comma as decimal mark, where most browsers expect a dot for floating-point numbers. This could possibly be due to localized number formatting being used in the wrong place, which would also explain why this bug is not always visible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1843) LinuxContainerExecutor should always log output
[ https://issues.apache.org/jira/browse/YARN-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528913#comment-14528913 ] Xuan Gong commented on YARN-1843: - [~liangly] Could you rebase the patch, please? The current patch does not apply anymore. LinuxContainerExecutor should always log output --- Key: YARN-1843 URL: https://issues.apache.org/jira/browse/YARN-1843 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Liyin Liang Assignee: Liyin Liang Priority: Trivial Attachments: YARN-1843-1.diff, YARN-1843-2.diff, YARN-1843.diff If debug is enabled, LinuxContainerExecutor should always log output after shExec.execute(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2684) FairScheduler should tolerate queue configuration changes across RM restarts
[ https://issues.apache.org/jira/browse/YARN-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528966#comment-14528966 ] Xuan Gong commented on YARN-2684: - Cancel the patch since it does not apply anymore. [~rohithsharma] Could you rebase the patch, please? FairScheduler should tolerate queue configuration changes across RM restarts Key: YARN-2684 URL: https://issues.apache.org/jira/browse/YARN-2684 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2684.patch YARN-2308 fixes this issue for CS; this JIRA is to fix it for FS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528969#comment-14528969 ] Hadoop QA commented on YARN-3448: -
| (/) *{color:green}+1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 42s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. |
| {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 1m 40s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 50s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | native | 3m 14s | Pre-build of native portion |
| {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 3m 26s | Tests passed in hadoop-yarn-server-applicationhistoryservice. |
| {color:green}+1{color} | hdfs tests | 0m 15s | Tests passed in hadoop-hdfs-client. |
| | | 46m 17s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730542/YARN-3448.16.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 9b01f81 |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7710/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-server-applicationhistoryservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/7710/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt |
| hadoop-hdfs-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/7710/artifact/patchprocess/testrun_hadoop-hdfs-client.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7710/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7710/console |
This message was automatically generated. Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.10.patch, YARN-3448.12.patch, YARN-3448.13.patch, YARN-3448.14.patch, YARN-3448.15.patch, YARN-3448.16.patch, YARN-3448.2.patch, YARN-3448.3.patch, YARN-3448.4.patch, YARN-3448.5.patch, YARN-3448.7.patch, YARN-3448.8.patch, YARN-3448.9.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time.
An exclusive write lock is held during the entire deletion phase, which in practice can be hours. If we are willing to relax some of the consistency constraints, other performance-enhancing techniques can be employed to maximize throughput and minimize locking time. Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize read-cache effectiveness based on its unique usage patterns; with 5 separate databases each lookup is much faster. This can also help with I/O by placing the entity and index databases on separate disks. Use rolling DBs for the entity and index DBs: 99.9% of the data is in these two sections, at roughly a 4:1 ratio (index to entity), at least for Tez. We can replace record-at-a-time DB removal with filesystem removal if we create a rolling set of databases that age out and can be efficiently removed. To do this we must place a constraint to always put an entity's events into its correct rolling DB instance based on start time. This allows us to stitch the data back together while reading, with artificial paging. Relax the synchronous write constraints. If we are willing to accept losing some
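A sketch of the start-time constraint just described; the one-hour rolling period is an assumption for illustration. Keying the rolling instance off the entity's start time keeps all of an entity's events together, so expired instances can be dropped with a single filesystem delete instead of record-at-a-time leveldb deletes:
{code}
import java.util.concurrent.TimeUnit;

class RollingDbSelector {
  // Assumed rolling period; the real plugin would make this configurable.
  private final long periodMillis = TimeUnit.HOURS.toMillis(1);

  // Stable for the entity's lifetime, since start time never changes:
  // every write for the entity lands in the same rolling instance.
  long dbIndexFor(long entityStartTime) {
    return entityStartTime / periodMillis;
  }
}
{code}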
[jira] [Updated] (YARN-3578) Accessing Counters after RM restart results in stale cache (fails with FileNotFoundException)
[ https://issues.apache.org/jira/browse/YARN-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddhi Mehta updated YARN-3578: --- Summary: Accessing Counters after RM restart results in stale cache (fails with FileNotFoundException) (was: Accessing Counters after RM restart fails with FileNotFoundException) Accessing Counters after RM restart results in stale cache (fails with FileNotFoundException) - Key: YARN-3578 URL: https://issues.apache.org/jira/browse/YARN-3578 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.3.0 Reporter: Siddhi Mehta This happens when the job client tries to access counters for a recently completed job. Here is what I think is happening:
1. The job in question started and completed on 05/02/2015, so the history file location should be /mapred/history/done/2015/05/02/{02}/
2. But instead HistoryFileManager looks at directory /mapred/history/done/2015/04/02/{02}/ and fails. Looking at the logic in {{org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanOldDirsForJob(JobId)}} for how the idtoDateString cache is created, the key looks independent of the RM start time. So if you had 2 jobs, job_RMstarttime1_0001 and job_RMstarttime2_0001, the idtoDateString cache will have the entry 1 -> {job_RMstarttime1_0001historydir, job_RMstarttime2_0001historyDir}.
3. If job_RMstarttime1_0001 is older than mapreduce.jobhistory.max-age-ms, we delete its history info from HDFS.
4. When we then try to query job_RMstarttime2_0001historyDir, it fails with a FileNotFoundException.
Either the keys should be aware of the RM start time, or HistoryFileManager.scanDirectory should check that the path exists before doing a listStatus, to avoid the FileNotFoundException:
{code}
private static List<FileStatus> scanDirectory(Path path, FileContext fc,
    PathFilter pathFilter) throws IOException {
  path = fc.makeQualified(path);
  List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
  if (!fc.exists(path)) {
    return jhStatusList;
  }
  RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
  while (fileStatusIter.hasNext()) {
    FileStatus fileStatus = fileStatusIter.next();
    Path filePath = fileStatus.getPath();
    if (fileStatus.isFile() && pathFilter.accept(filePath)) {
      jhStatusList.add(fileStatus);
    }
  }
  return jhStatusList;
}
{code}
Complete stack trace: gslog`20150504141445.816``263424`0`0189246-10858515`754671855`/ex/UnhandledException.jsp`JAVA.FileNotFoundException - java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException): java.io.FileNotFoundException: File /mapred/history/done/2015/04/02/02 does not exist.
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:147)
at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(JobHistory.java:217)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:203)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:199)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryClientService.java:199)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getJobReport(HistoryClientService.java:231)
at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122)
at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
Caused by: java.io.FileNotFoundException: File /mapred/history/done/2015/04/02/02 does
[jira] [Commented] (YARN-3343) TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529037#comment-14529037 ] Hudson commented on YARN-3343: -- FAILURE: Integrated in Hadoop-trunk-Commit #7739 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7739/]) YARN-3343. Increased TestCapacitySchedulerNodeLabelUpdate#testNodeUpdate timeout. Contributed by Rohith Sharmaks (jianhe: rev e4c3b52c896291012f869ebc0a21e85e643fadd1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerNodeLabelUpdate.java * hadoop-yarn-project/CHANGES.txt TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk --- Key: YARN-3343 URL: https://issues.apache.org/jira/browse/YARN-3343 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: Rohith Priority: Minor Fix For: 2.8.0 Attachments: 0001-YARN-3343.patch Error Message: test timed out after 3 milliseconds Stacktrace:
java.lang.Exception: test timed out after 3 milliseconds
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
at java.net.InetAddress.getAllByName(InetAddress.java:1162)
at java.net.InetAddress.getAllByName(InetAddress.java:1098)
at java.net.InetAddress.getByName(InetAddress.java:1048)
at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:563)
at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:147)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:367)
at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:178)
at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:136)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:206)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate(TestCapacitySchedulerNodeLabelUpdate.java:157)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529052#comment-14529052 ] Wangda Tan commented on YARN-3521: -- [~sunilg], Thanks for updating. I had an offline sync with Vinod about using objects or strings in the API; some suggestions:
- addToClusterNodeLabel should use objects (you've done this in your patch).
- getLabelsOnNode, getNodeToLabels, and getLabelsToNodes should use objects; this lets users easily understand the attributes of labels on nodes without calling getClusterNodeLabels. (You have done some of them, but getLabelsToNodes should be updated as well.)
- replace/remove should use a list of label names only; the label name is the unique key of a node label, so using a NodeLabelInfo object here is unnecessary.
- I found in your patch that, when calling getNodeToLabels, it returns NodeLabelInfo with default attributes; we can fix this in a separate patch (we need to make changes to NodeLabelsManager too).
- The RPC API should be consistent with this; it should be addressed in a separate JIRA.
I'm fine with dropping NodeLabelNames as well, if it can keep the REST returned structure clean :). Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch In YARN-3413, the yarn cluster CLI returns NodeLabel instead of String; we should make the same change on the REST API side to keep them consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528194#comment-14528194 ] Rohith commented on YARN-3543: -- It would be good if it can be done in a different JIRA, since it is a different module. I feel it need not be mixed with this. ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Attachments: 0001-YARN-3543.patch Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can only be done at the time the user submits the job. We should have access to this info from the ApplicationReport as well, so that we can check whether an app is AM managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3543: - Attachment: 0001-YARN-3543.patch Updated the patch, fixing test failures. ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can only be done at the time the user submits the job. We should have access to this info from the ApplicationReport as well, so that we can check whether an app is AM managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
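To make the proposal concrete, a minimal sketch of how a client could read the flag once ApplicationReport exposes it; isUnmanagedApp() is a hypothetical accessor standing in for whatever the patch adds, everything else is the existing YarnClient API:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class AmManagedCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();
    ApplicationId appId = ConverterUtils.toApplicationId(args[0]);
    ApplicationReport report = client.getApplicationReport(appId);
    // isUnmanagedApp() is the hypothetical new accessor this JIRA proposes;
    // today the flag is only visible on ApplicationSubmissionContext at
    // submission time.
    System.out.println("unmanaged AM: " + report.isUnmanagedApp());
    client.stop();
  }
}
{code}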
[jira] [Updated] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2442: - Target Version/s: 2.8.0 Affects Version/s: 2.6.0 2.7.0 ResourceManager JMX UI does not give HA State - Key: YARN-2442 URL: https://issues.apache.org/jira/browse/YARN-2442 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0, 2.6.0, 2.7.0 Reporter: Nishan Shetty Assignee: Rohith ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, STOPPED) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528230#comment-14528230 ] Rohith commented on YARN-2442: -- Apologies for coming back to this after a very long time. IMHO, some customers have a use case for monitoring the daemons over JMX. Systems similar to Ambari, i.e. Hadoop cluster operation managers, use JMX to monitor the daemons, e.g. for health-check services, and start the daemons with JMX enabled by default. Currently ClusterMetrics, QueueMetrics and RMNMInfo are registered with JMX, and these metrics can be retrieved through it. Similarly, another MBean, say an RMInfoMBean, registered with basic RM info such as the HA state, securityEnabled and other required RM attributes would be helpful for JMX-dependent users. This is very similar to HDFS's NameNodeStatusMXBean. Kindly give your opinion and thoughts on this. ResourceManager JMX UI does not give HA State - Key: YARN-2442 URL: https://issues.apache.org/jira/browse/YARN-2442 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty Assignee: Rohith ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, STOPPED) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
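A minimal sketch of what such an MBean could look like, modeled on HDFS's NameNodeStatusMXBean; the interface name, attributes, and the RMStateSource handle are assumptions for illustration, not an actual patch:
{code}
import org.apache.hadoop.metrics2.util.MBeans;

// Hypothetical handle to the RM's runtime state.
interface RMStateSource {
  String getHAStateName();
  boolean isSecurityEnabled();
}

// Sketch only: attribute set is illustrative.
public interface RMInfoMXBean {
  String getHAState();            // INITIALIZING / ACTIVE / STANDBY / STOPPED
  boolean isSecurityEnabled();
}

class RMInfo implements RMInfoMXBean {
  private final RMStateSource rm;

  RMInfo(RMStateSource rm) {
    this.rm = rm;
    // Appears as Hadoop:service=ResourceManager,name=RMInfo to JMX clients.
    MBeans.register("ResourceManager", "RMInfo", this);
  }

  @Override public String getHAState() { return rm.getHAStateName(); }
  @Override public boolean isSecurityEnabled() { return rm.isSecurityEnabled(); }
}
{code}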
[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528247#comment-14528247 ] Junping Du commented on YARN-3396: -- Thanks [~brahmareddy] for delivering a patch here. Some quick feedback:
1. Why are we setting the log level to INFO in the first case while setting ERROR for the second? I think we should keep them consistent; probably ERROR is suitable for both cases.
2. This particular exception gets thrown when decoding the path from the URL it contains, so we should log rsrc.getResource() and next.getResource() there instead of the values being logged now.
3. Use more informative words than just "Got exception parsing." - maybe something like "Got exception in parsing URL of LocalResource: " + next.getResource()?
Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Attachments: YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
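A minimal sketch of the shape of fix being discussed, assuming the decode happens via ConverterUtils.getPathFromYarnURL; the helper class and the null-return policy are illustrative only, not the actual patch:
{code}
import java.net.URISyntaxException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.util.ConverterUtils;

// Sketch: resolve the LocalResource's URL to a Path, logging the offending
// resource at ERROR (per the feedback above) instead of leaving the TODO.
class LocalResourcePathHelper {
  private static final Log LOG =
      LogFactory.getLog(LocalResourcePathHelper.class);

  static Path pathOf(LocalResource rsrc) {
    try {
      return ConverterUtils.getPathFromYarnURL(rsrc.getResource());
    } catch (URISyntaxException e) {
      LOG.error("Got exception in parsing URL of LocalResource: "
          + rsrc.getResource(), e);
      return null; // caller skips/cleans up the bad resource
    }
  }
}
{code}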
[jira] [Commented] (YARN-2918) RM startup fails if accessible-node-labels are configured for a queue without cluster labels
[ https://issues.apache.org/jira/browse/YARN-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528202#comment-14528202 ] Rohith commented on YARN-2918: -- Sure, thanks for your interest. RM startup fails if accessible-node-labels are configured for a queue without cluster labels --- Key: YARN-2918 URL: https://issues.apache.org/jira/browse/YARN-2918 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith I configured accessible-node-labels for a queue, but RM startup fails with the exception below. I see the current steps to configure node labels are to first add them via rmadmin and later configure them for queues. It would be good if cluster and queue node labels were consistent in how they are configured.
{noformat}
2014-12-03 20:11:50,126 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
  at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
  at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:556)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:982)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:249)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1203)
Caused by: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkIfLabelInClusterNodeLabels(SchedulerUtils.java:287)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:109)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:324)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
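For reference, a sketch of the working order of operations the description implies, with an illustrative queue name "a" and label "x"; the property path follows the capacity-scheduler convention and should be treated as an assumption for your own queue layout:
{noformat}
# 1. Add the label to the cluster first (otherwise CapacityScheduler init fails):
yarn rmadmin -addToClusterNodeLabels "x"

# 2. Only then reference it from a queue in capacity-scheduler.xml:
<property>
  <name>yarn.scheduler.capacity.root.a.accessible-node-labels</name>
  <value>x</value>
</property>
{noformat}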
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528216#comment-14528216 ] Hadoop QA commented on YARN-3543: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 43s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. |
| {color:red}-1{color} | javac | 3m 27s | The patch appears to cause the build to fail. |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730462/0001-YARN-3543.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 318081c |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7705/console |
This message was automatically generated. ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can only be done at the time the user submits the job. We should have access to this info from the ApplicationReport as well, so that we can check whether an app is AM managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2267) Auxiliary Service support in RM
[ https://issues.apache.org/jira/browse/YARN-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528248#comment-14528248 ] Rohith commented on YARN-2267: -- Thanks [~sunilg] for your views and [~zjshen] for your inputs. As of now we are not working on this, so I prefer to close it. Will make a note that when we reopen the JIRA, we will come up with a better proposal document. Auxiliary Service support in RM --- Key: YARN-2267 URL: https://issues.apache.org/jira/browse/YARN-2267 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Naganarasimha G R Assignee: Rohith Currently RM does not have a provision to run any auxiliary services. For health/monitoring in RM, it's better to make a plugin mechanism in RM itself, similar to NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3385: Attachment: YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
  at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
  at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
  at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529278#comment-14529278 ] zhihai xu commented on YARN-3385: - I attached a new patch, YARN-3385.003.patch, which fixes the checkstyle issue. Also, it is strange that the test report log didn't show any test failure. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
  at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
  at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
  at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
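The usual fix for this class of race, as in YARN-3023, is to make the retried operation idempotent. A minimal sketch under that assumption (class, method, and logging are illustrative, not the actual patch):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.zookeeper.KeeperException.NoNodeException;
import org.apache.zookeeper.ZooKeeper;

// Sketch: tolerate NoNodeException on delete, since a retried delete may
// run after an earlier attempt already removed the znode.
class SafeZkDelete {
  private static final Log LOG = LogFactory.getLog(SafeZkDelete.class);
  private final ZooKeeper zk;

  SafeZkDelete(ZooKeeper zk) { this.zk = zk; }

  void safeDelete(String path) throws Exception {
    try {
      zk.delete(path, -1); // -1 matches any version
    } catch (NoNodeException e) {
      // A previous attempt already removed it; treat the delete as done.
      LOG.info("Node " + path + " already deleted, ignoring NoNodeException");
    }
  }
}
{code}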
[jira] [Commented] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529314#comment-14529314 ] Jonathan Eagles commented on YARN-3448: --- [~zjshen], [~jlowe], can you have another look now that I have gotten my Hadoop QA +1? Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.10.patch, YARN-3448.12.patch, YARN-3448.13.patch, YARN-3448.14.patch, YARN-3448.15.patch, YARN-3448.16.patch, YARN-3448.2.patch, YARN-3448.3.patch, YARN-3448.4.patch, YARN-3448.5.patch, YARN-3448.7.patch, YARN-3448.8.patch, YARN-3448.9.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time. An exclusive write lock is held during the entire deletion phase, which in practice can be hours. If we are willing to relax some of the consistency constraints, other performance-enhancing techniques can be employed to maximize throughput and minimize locking time.
- Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize read-cache effectiveness based on the unique usage patterns of each database. With 5 separate databases each lookup is much faster. This can also help with I/O by placing the entity and index databases on separate disks.
- Rolling DBs for the entity and index DBs. 99.9% of the data is in these two sections, at roughly a 4:1 ratio (index to entity), at least for Tez. We can replace record-at-a-time removal with file-system removal if we create a rolling set of databases that age out and can be removed efficiently. To do this we must place a constraint to always put an entity's events into its correct rolling DB instance based on start time. This allows us to stitch the data back together while reading, with artificial paging.
- Relax the synchronous-write constraints. If we are willing to accept losing some records that were not flushed by the operating system during a crash, we can use async writes, which can be much faster.
- Prefer sequential writes. Sequential writes can be several times faster than random writes. Spend some small effort arranging the writes in a way that trends toward sequential-write performance over random-write performance.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
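To illustrate the rolling-DB constraint in the second bullet, a small sketch of bucketing by entity start time; the one-hour roll period and naming scheme are assumptions, not the patch's actual values:
{code}
import java.util.concurrent.TimeUnit;

// Bucket every entity by its start time so all of its events land in one
// rolling leveldb instance; expired buckets are dropped by deleting whole
// database directories instead of record-at-a-time deletes.
class RollingBuckets {
  private static final long ROLL_PERIOD_MS = TimeUnit.HOURS.toMillis(1);

  static long bucketFor(long entityStartTimeMs) {
    return entityStartTimeMs - (entityStartTimeMs % ROLL_PERIOD_MS);
  }

  static String dbNameFor(long entityStartTimeMs) {
    return "entity-" + bucketFor(entityStartTimeMs); // one leveldb dir per bucket
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Reads iterate buckets in time order to stitch an entity back together.
    System.out.println(dbNameFor(now));
  }
}
{code}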
[jira] [Resolved] (YARN-2267) Auxiliary Service support in RM
[ https://issues.apache.org/jira/browse/YARN-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith resolved YARN-2267. -- Resolution: Won't Fix Auxiliary Service support in RM --- Key: YARN-2267 URL: https://issues.apache.org/jira/browse/YARN-2267 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Naganarasimha G R Assignee: Rohith Currently RM does not have a provision to run any auxiliary services. For health/monitoring in RM, it's better to make a plugin mechanism in RM itself, similar to NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3396: - Affects Version/s: (was: 2.7.0) Labels: newbie (was: ) Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528256#comment-14528256 ] Naganarasimha G R commented on YARN-3543: -
bq. It would be good if it can be done in a different JIRA, since it is a different module. I feel it need not be mixed with this.
Well, as it is only a small change to store it in ATS, and in earlier JIRAs most of the data displayed in the RM web UI was also made available in ATS, I feel it is better to capture the ATS modifications in this JIRA itself... ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can only be done at the time the user submits the job. We should have access to this info from the ApplicationReport as well, so that we can check whether an app is AM managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3423) RM HA setup, Cluster tab links populated with AM hostname instead of RM
[ https://issues.apache.org/jira/browse/YARN-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528268#comment-14528268 ] Junping Du commented on YARN-3423: -- The latest patch LGTM. getResolvedRemoteRMWebAppURLWithoutScheme() makes more sense in the HA case, and it should work in the non-HA case too. [~kasha] and [~xgong], do you think we should replace all uses of getResolvedRMWebAppURLWithoutScheme with getResolvedRemoteRMWebAppURLWithoutScheme for the RM HA case? RM HA setup, Cluster tab links populated with AM hostname instead of RM -- Key: YARN-3423 URL: https://issues.apache.org/jira/browse/YARN-3423 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Environment: centOS-6.x Reporter: Aroop Maliakkal Priority: Minor Attachments: YARN-3423.patch In an RM HA setup (e.g. http://rm-1.vip.abc.com:50030/proxy/application_1427789305393_0002/ ), go to the job details and click on the Cluster tab at the top left. Click on any of the links: About, Applications, Scheduler. You can see that the hyperlink points to http://am-1.vip.abc.com:port/cluster. The default ports for unsecure and secure clusters are given below:
8088 ( DEFAULT_RM_WEBAPP_PORT = 8088 )
8090 ( DEFAULT_RM_WEBAPP_HTTPS_PORT = 8090 )
Ideally, it should point to the ResourceManager hostname instead of the AM hostname. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
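A small sketch of the resolution involved; getResolvedRMWebAppURLWithoutScheme is the existing WebAppUtils helper, and the Remote variant named in the comment is assumed to take the same Configuration argument:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.webapp.util.WebAppUtils;

// Resolve the RM web address from configuration rather than the local host,
// so that proxied "Cluster" links point at the RM and not the AM's host.
public class RmWebUrlDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // In an HA setup, the Remote variant discussed above would resolve the
    // currently active RM's address instead (signature assumed to match).
    String rmWeb = WebAppUtils.getResolvedRMWebAppURLWithoutScheme(conf);
    System.out.println("RM web app address: " + rmWeb);
  }
}
{code}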
[jira] [Commented] (YARN-3460) Test TestSecureRMRegistryOperations failed with IBM_JAVA JVM
[ https://issues.apache.org/jira/browse/YARN-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528277#comment-14528277 ] Junping Du commented on YARN-3460: -- A quick comment: can we use %n instead of \n? The former is platform-independent, while the latter works only on Linux. Test TestSecureRMRegistryOperations failed with IBM_JAVA JVM Key: YARN-3460 URL: https://issues.apache.org/jira/browse/YARN-3460 Project: Hadoop YARN Issue Type: Test Affects Versions: 3.0.0, 2.6.0 Environment:
$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T11:37:52-06:00)
Maven home: /opt/apache-maven-3.2.1
Java version: 1.7.0, vendor: IBM Corporation
Java home: /usr/lib/jvm/ibm-java-ppc64le-71/jre
Default locale: en_US, platform encoding: UTF-8
OS name: linux, version: 3.10.0-229.ael7b.ppc64le, arch: ppc64le, family: unix
Reporter: pascal oliva Attachments: HADOOP-11810-1.patch TestSecureRMRegistryOperations failed with the IBM Java JVM:
mvn test -X -Dtest=org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations
|| Module || Total || Failure || Error || Skipped ||
| hadoop-yarn-registry | 12 | 0 | 12 | 0 |
| Total | 12 | 0 | 12 | 0 |
with javax.security.auth.login.LoginException: Bad JAAS configuration: unrecognized option: isInitiator and Bad JAAS configuration: unrecognized option: storeKey -- This message was sent by Atlassian JIRA (v6.3.4#6332)
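A quick self-contained illustration of the %n vs \n difference (plain JDK, nothing project-specific):
{code}
// %n is resolved by the formatter to the platform line separator,
// while a literal \n is always a bare LF.
public class LineSepDemo {
  public static void main(String[] args) {
    String withSlashN = String.format("option: isInitiator\n");    // always "\n"
    String withPercentN = String.format("option: isInitiator%n");  // "\r\n" on Windows, "\n" on Unix
    System.out.print(withSlashN);
    System.out.print(withPercentN);
  }
}
{code}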
[jira] [Updated] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3396: --- Attachment: YARN-3396-002.patch Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528279#comment-14528279 ] Brahma Reddy Battula commented on YARN-3396: [~djp] thanks for taking a look at this issue. Updated the patch based on your comments. Kindly review! Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)