[jira] [Updated] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator
[ https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2716: --- Issue Type: Improvement (was: Bug) Refactor ZKRMStateStore retry code with Apache Curator -- Key: YARN-2716 URL: https://issues.apache.org/jira/browse/YARN-2716 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Karthik Kambatla Attachments: yarn-2716-1.patch, yarn-2716-prelim.patch, yarn-2716-prelim.patch, yarn-2716-super-prelim.patch Per suggestion by [~kasha] in YARN-2131, it's nice to use curator to simplify the retry logic in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
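For context, the Curator retry style the description refers to looks roughly like the sketch below. This is a minimal illustration, not code from the attached patches; the connect string, retry settings, and ZNode path are made up for the example.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorRetrySketch {
  public static void main(String[] args) throws Exception {
    // Curator wraps every ZooKeeper call with the supplied retry policy, so
    // the state store no longer needs a hand-rolled retry loop per operation.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181",                       // ZK connect string (illustrative)
        new ExponentialBackoffRetry(1000, 3));  // 1s base sleep, up to 3 retries
    client.start();
    client.create().creatingParentsIfNeeded()
        .forPath("/rmstore/app_1", new byte[] {1});
    client.close();
  }
}
{code}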
[jira] [Updated] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator
[ https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2716: --- Attachment: yarn-2716-1.patch Fixed TestZKRMStateStoreZKClientConnections as well. I believe the v1 patch could use some more eyes; appreciate any feedback. Refactor ZKRMStateStore retry code with Apache Curator -- Key: YARN-2716 URL: https://issues.apache.org/jira/browse/YARN-2716 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Karthik Kambatla Attachments: yarn-2716-1.patch, yarn-2716-prelim.patch, yarn-2716-prelim.patch, yarn-2716-super-prelim.patch Per suggestion by [~kasha] in YARN-2131, it's nice to use curator to simplify the retry logic in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566993#comment-14566993 ] Rohith commented on YARN-3585: -- This is a race condition between NodeManager shutdown and container launch. By the time the container is launched and control returns to ContainerImpl, the NodeManager has closed the DB connection, which results in {{org.iq80.leveldb.DBException: Closed}} NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Rohith Priority: Critical Attachments: YARN-3585.patch With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread: {noformat} DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x] leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x] VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition {noformat} and jni leveldb thread stack {noformat} Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8 #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 #3 0x003d830e811d in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID
[ https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566998#comment-14566998 ] zhihai xu commented on YARN-3017: - Hi [~mufeed.usman], thanks for working on this issue. The name for appAttemptIdAndEpochFormat will become confusing after the fix. Could you rename appAttemptIdAndEpochFormat to epochFormat? ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID -- Key: YARN-3017 URL: https://issues.apache.org/jira/browse/YARN-3017 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.8.0 Reporter: MUFEED USMAN Priority: Minor Labels: PatchAvailable Attachments: YARN-3017.patch, YARN-3017_1.patch Not sure if this should be filed as a bug or not. In the ResourceManager log in the events surrounding the creation of a new application attempt, ... ... 2014-11-14 17:45:37,258 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1412150883650_0001_02 ... ... The application attempt has the ID format _1412150883650_0001_02. Whereas the associated ContainerID goes by _1412150883650_0001_02_. ... ... 2014-11-14 17:45:37,260 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1412150883650_0001_02_01, NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: memory:2048, vCores:1, disks:0.0, Priority: 0, Token: Token { kind: ContainerToken, service: 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02 ... ... Curious to know if this is kept like that for a reason. If not, while using filtering tools to, say, grep events surrounding a specific attempt by the numeric ID part, information may slip out during troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
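The mismatch boils down to the zero-padding width used when the attempt number is rendered. A quick illustration; the format strings below are an assumption for demonstration, not the actual toString() implementations:
{code}
public class IdFormatDemo {
  public static void main(String[] args) {
    long ts = 1412150883650L;
    int app = 1, attempt = 2, container = 1;
    // AppAttemptId-style rendering: attempt number zero-padded to 6 digits (assumed).
    System.out.println(String.format("appattempt_%d_%04d_%06d", ts, app, attempt));
    // -> appattempt_1412150883650_0001_000002
    // ContainerId-style rendering: the same attempt number padded to only 2
    // digits (assumed), so grepping for "000002" finds the attempt line but
    // not its container lines.
    System.out.println(String.format("container_%d_%04d_%02d_%06d", ts, app, attempt, container));
    // -> container_1412150883650_0001_02_000001
  }
}
{code}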
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567189#comment-14567189 ] Rohith commented on YARN-3733: -- bq. Verify infinity by calling isInfinite(float v). Since infinity is derived from lhs and rhs, infinity cannot be differentiated for clusterResource=0,0, lhs=1,1 and rhs=2,2: the method {{getResourceAsValue()}} returns infinity for both l and r, so they cannot be compared. On RM restart AM getting more than maximum possible memory when many tasks in queue - Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 , 2 NM , 2 RM one NM - 3 GB 6 v core Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: YARN-3733.patch Steps to reproduce = 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) 2. Configure map and reduce size to 512 MB after changing scheduler minimum size to 512 MB 3. Configure capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured) 4. Submit 30 concurrent task 5. Switch RM Actual = For 12 Jobs AM gets allocated and all 12 starts running No other Yarn child is initiated , *all 12 Jobs in Running state for ever* Expected === Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
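To illustrate the comparison problem described above with plain Java float semantics (this is a standalone demo, not YARN code):
{code}
public class InfinityDemo {
  public static void main(String[] args) {
    float cluster = 0f;
    float l = 1f / cluster;  // Infinity
    float r = 2f / cluster;  // Infinity as well
    System.out.println(Float.isInfinite(l) + " " + Float.isInfinite(r)); // true true
    // Both shares collapse to the same Infinity, so comparing them says
    // nothing about which side is actually dominant:
    System.out.println(Float.compare(l, r)); // 0
    // 0/0 yields NaN, which Float.compare orders above Infinity:
    System.out.println(Float.compare(0f / cluster, l)); // 1
  }
}
{code}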
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567196#comment-14567196 ] Rohith commented on YARN-3585: -- Yes, we can raise a different JIRA. [~bibinchundatt], can you raise one? We can validate the issue there. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Rohith Priority: Critical Attachments: YARN-3585.patch With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread: {noformat} DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x] leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x] VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition {noformat} and jni leveldb thread stack {noformat} Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8 #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 #3 0x003d830e811d in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.2.patch Uploaded a new patch to fix test cases. Lots of the previous test failures ("The HA Configuration has multiple addresses that match local node's address.") happened because I forgot to set YarnConfiguration.RM_HA_ID before starting the NM. The patch also contains two minor fixes: getting the conf value of RM_SCHEDULER_ADDRESS was moved from serviceStart to serviceInit in ApplicationMasterService, and the duplicated setRpcAddressForRM logic in tests was moved to HAUtil. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 on RM failover. But I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032, yarn.resourcemanager.address.rm2=0.0.0.0:28032 After digging, I found it is in ClientRMService where the value of yarn.resourcemanager.address.rm2 was changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same instance of configuration in rm1 and rm2 and init both RMs before we start both RMs, we will change yarn.resourcemanager.ha.id to rm2 during init of rm2, and yarn.resourcemanager.ha.id will already be rm2 during starting of rm1. So I think it is safe to make a copy of the configuration when we init both of the RMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
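A minimal sketch of the proposed fix, assuming a MiniYARNCluster-style init loop; the {{resourceManagers}} array and the outer {{conf}} are assumed to exist, and the loop shape is illustrative rather than the actual patch:
{code}
// Give each RM its own copy of the configuration so that init-time mutations
// (e.g. setting yarn.resourcemanager.ha.id) of one RM cannot leak into the other.
for (int i = 0; i < resourceManagers.length; i++) {
  YarnConfiguration rmConf = new YarnConfiguration(conf); // defensive copy
  rmConf.set(YarnConfiguration.RM_HA_ID, "rm" + (i + 1));
  resourceManagers[i].init(rmConf);
}
{code}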
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567186#comment-14567186 ] Rohith commented on YARN-3733: -- bq. 2. The newly added code is duplicated in two places, can you eliminate the duplicate code? The second-time validation is not required in case of NaN; will remove this in the next patch. On RM restart AM getting more than maximum possible memory when many tasks in queue - Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 , 2 NM , 2 RM one NM - 3 GB 6 v core Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: YARN-3733.patch Steps to reproduce = 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) 2. Configure map and reduce size to 512 MB after changing scheduler minimum size to 512 MB 3. Configure capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured) 4. Submit 30 concurrent task 5. Switch RM Actual = For 12 Jobs AM gets allocated and all 12 starts running No other Yarn child is initiated , *all 12 Jobs in Running state for ever* Expected === Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567184#comment-14567184 ] Rohith commented on YARN-3733: -- Thanks [~devaraj.k] and [~sunilg] for the review. bq. Can we check for lhs/rhs emptiness and compare these before ending up with infinite values? If we check for emptiness, this would affect specific input values like clusterResource=0,0, lhs=1,1 and rhs=2,2. Then which one is considered dominant? The dominant component cannot be retrieved directly by memory or CPU. I listed out the possible combinations of inputs that would occur in YARN. These are ||Sl.no||clusterResource||lhs||rhs||Remark|| |1|0,0|0,0|0,0|Valid input; handled| |2|0,0|positive integer,positive integer|0,0|NaN vs Infinity: patch handles this scenario| |3|0,0|0,0|positive integer,positive integer|NaN vs Infinity: patch handles this scenario| |4|0,0|positive integer,positive integer|positive integer,positive integer|Infinity vs Infinity: can this type occur in YARN?| |5|0,0|positive integer,0|0,positive integer|Is this valid input? Can this type occur in YARN?| On RM restart AM getting more than maximum possible memory when many tasks in queue - Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 , 2 NM , 2 RM one NM - 3 GB 6 v core Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: YARN-3733.patch Steps to reproduce = 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) 2. Configure map and reduce size to 512 MB after changing scheduler minimum size to 512 MB 3. Configure capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured) 4. Submit 30 concurrent task 5. Switch RM Actual = For 12 Jobs AM gets allocated and all 12 starts running No other Yarn child is initiated , *all 12 Jobs in Running state for ever* Expected === Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567201#comment-14567201 ] Rohith commented on YARN-3585: -- The findbugs -1 does not show any error report; not sure why the -1 was given. The test failure is unrelated to this patch. [~jlowe], kindly review the patch. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Rohith Priority: Critical Attachments: YARN-3585.patch With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread: {noformat} DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x] leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x] VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition {noformat} and jni leveldb thread stack {noformat} Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8 #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 #3 0x003d830e811d in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567038#comment-14567038 ] Akira AJISAKA commented on YARN-3069: - Thanks [~rchiang] for updating the patch. # Would you reflect the previous comment for {{yarn.node-labels.fs-store.retry-policy-spec}}? # For the YARN registry, the parameters are written in core-site.xml. Can we remove them from the patch? My review is almost done. @Watchers: I would appreciate it if you could review this patch. It includes a lot of descriptions for parameters, so it should be reviewed by a lot of developers. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval 
yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
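For reference, each documented property would take the usual yarn-default.xml shape shown below. The description text here is illustrative, not the wording from the patch:
{code}
<property>
  <description>Placeholder description; the actual patch wording will differ.
  URL of the log server to which web UIs redirect for aggregated
  container logs.</description>
  <name>yarn.log.server.url</name>
  <value></value>
</property>
{code}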
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567157#comment-14567157 ] Hudson commented on YARN-3725: -- FAILURE: Integrated in Hadoop-Yarn-trunk #945 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/945/]) YARN-3725. App submission via REST API is broken in secure mode due to Timeline DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 5cc3fced957a8471733e0e9490878bd68429fe24) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/CHANGES.txt App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3725.1.patch YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, so it is inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator
[ https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567024#comment-14567024 ] Hadoop QA commented on YARN-2716: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 9s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 3 new checkstyle issues (total was 42, now 8). | | {color:green}+1{color} | whitespace | 0m 5s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 1m 30s | The patch appears to introduce 2 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 23s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 88m 44s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-resourcemanager | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736498/yarn-2716-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 5cc3fce | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8147/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8147/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8147/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8147/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8147/console | This message was automatically generated. Refactor ZKRMStateStore retry code with Apache Curator -- Key: YARN-2716 URL: https://issues.apache.org/jira/browse/YARN-2716 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Karthik Kambatla Attachments: yarn-2716-1.patch, yarn-2716-prelim.patch, yarn-2716-prelim.patch, yarn-2716-super-prelim.patch Per suggestion by [~kasha] in YARN-2131, it's nice to use curator to simplify the retry logic in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567054#comment-14567054 ] Sunil G commented on YARN-3585: --- Hi [~bibinchundatt] and [~rohithsharma], this recent exception trace is different from the focus of this JIRA, and the root cause is given by Rohith. I feel you can separate this into another ticket. For DB close vs container launch, we can add a check for whether the DB is closed while we move the container from the ACQUIRED state. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Rohith Priority: Critical Attachments: YARN-3585.patch With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread: {noformat} DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x] leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x] VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition {noformat} and jni leveldb thread stack {noformat} Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8 #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 #3 0x003d830e811d in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
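A rough sketch of the kind of guard Sunil suggests for the DB close vs container launch race. All names here are hypothetical; the real NM state store class and its fields may differ:
{code}
// Hypothetical state-store guard: serialize close() against writes and skip
// stores that race with shutdown instead of surfacing DBException: Closed.
private final Object lock = new Object();
private boolean closed = false;

public void close() throws IOException {
  synchronized (lock) {
    closed = true;
    db.close(); // org.iq80.leveldb.DB
  }
}

public void storeContainer(ContainerId id, byte[] state) throws IOException {
  synchronized (lock) {
    if (closed) {
      // Shutdown already ran; log and skip rather than letting the launch
      // thread die on the closed DB.
      LOG.warn("State store already closed, skipping store for " + id);
      return;
    }
    db.put(id.toString().getBytes(StandardCharsets.UTF_8), state);
  }
}
{code}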
[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2194: --- Component/s: nodemanager Description: In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. was: In previous versions of RedHat, we can build custom cgroup hierarchies with use of the cgconfig command from the libcgroup package. From RedHat 7, package libcgroup is deprecated and it is not recommended to use it since it can easily create conflicts with the default cgroup hierarchy. systemd is provided and recommended for cgroup management. We need to add support for this. Priority: Critical (was: Major) Target Version/s: 2.8.0 Affects Version/s: 2.7.0 Issue Type: Bug (was: Improvement) Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567158#comment-14567158 ] Hudson commented on YARN-2900: -- FAILURE: Integrated in Hadoop-Yarn-trunk #945 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/945/]) YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: rev 9686261ecb872ad159fac3ca44f1792143c6d7db) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Fix For: 2.7.1 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567029#comment-14567029 ] Karthik Kambatla commented on YARN-2194: +1 otherwise. [~vinodkv], [~tucu00] - is this somewhat hacky approach reasonable? Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS
[ https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567041#comment-14567041 ] Akira AJISAKA commented on YARN-3432: - Thanks [~brahmareddy] for taking this issue. The patch seems to revert YARN-656. I think that's not fine because it will break the FairScheduler. This issue should fix the CapacityScheduler only. Cluster metrics have wrong Total Memory when there is reserved memory on CS --- Key: YARN-3432 URL: https://issues.apache.org/jira/browse/YARN-3432 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Brahma Reddy Battula Attachments: YARN-3432.patch I noticed that when reservations happen when using the Capacity Scheduler, the UI and web services report the wrong total memory. For example, I have 300GB of total memory in my cluster. I allocate 50 and I reserve 10. The cluster metrics for total memory get reported as 290GB. This was broken by https://issues.apache.org/jira/browse/YARN-656 so perhaps there is a difference between fair scheduler and capacity scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2194: --- Summary: Cgroups cease to work in RHEL7 (was: Add Cgroup support for RedHat 7) Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2194-1.patch, YARN-2194-2.patch In previous versions of RedHat, we can build custom cgroup hierarchies with use of the cgconfig command from the libcgroup package. From RedHat 7, package libcgroup is deprecated and it is not recommended to use it since it can easily create conflicts with the default cgroup hierarchy. systemd is provided and recommended for cgroup management. We need to add support for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567025#comment-14567025 ] Karthik Kambatla commented on YARN-2194: Verified the patch works. Can we add more comments to clarify why the patch replaces cpu,cpuacct with cpu? Maybe something along the lines of: In RHEL7, the CPU controller is named 'cpu,cpuacct'. The comma in the controller name leads to container launch failure. Symlinks 'cpu' and 'cpuacct' point to 'cpu,cpuacct'. Using 'cpu' solves the issue. Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
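A sketch of the workaround as described in the comment above; the helper below is illustrative, not the patch itself. The idea: when resolving a controller's mount point, treat a co-mounted 'cpu,cpuacct' entry as serving the 'cpu' controller, then address it through the 'cpu' symlink so no comma reaches the launch path.
{code}
// Illustrative helper: decide whether a /proc/mounts cgroup entry whose
// subsystem list is comma-separated (e.g. "cpu,cpuacct") serves a requested
// controller ("cpu").
static boolean mountServesController(String subsystems, String controller) {
  for (String s : subsystems.split(",")) {
    if (s.equals(controller)) {
      return true;
    }
  }
  return false;
}
// mountServesController("cpu,cpuacct", "cpu") -> true
// The handler can then use the "cpu" symlink (e.g. /sys/fs/cgroup/cpu), so
// no comma ever appears in the container launch arguments.
{code}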
[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567153#comment-14567153 ] Brahma Reddy Battula commented on YARN-3528: Thanks [~ste...@apache.org] and [~rkanter] for your inputs. Going to write one common utility with: 1) one method that uses port 0, so we can set the allocated port back on the same config, and 2) another method for the places where the above is not possible, using a similar way to what [~rkanter] mentioned. Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practise has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
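A common shape for the first utility method: bind a ServerSocket to port 0, let the OS pick a free port, and write it back onto the config. This is a sketch under assumed names, not the final utility:
{code}
import java.io.IOException;
import java.net.ServerSocket;
import org.apache.hadoop.conf.Configuration;

public final class TestPortUtil {
  /**
   * Binds to port 0 so the OS picks a free port, then writes host:port back
   * onto the given config key. Best effort: another process could in
   * principle grab the port between close() and the service's own bind.
   */
  public static int setFreePort(Configuration conf, String addressKey, String host)
      throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      int port = socket.getLocalPort();
      conf.set(addressKey, host + ":" + port);
      return port;
    }
  }
}
{code}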
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567151#comment-14567151 ] Hudson commented on YARN-2900: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #215 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/215/]) YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: rev 9686261ecb872ad159fac3ca44f1792143c6d7db) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/CHANGES.txt Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Fix For: 2.7.1 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567150#comment-14567150 ] Hudson commented on YARN-3725: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #215 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/215/]) YARN-3725. App submission via REST API is broken in secure mode due to Timeline DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 5cc3fced957a8471733e0e9490878bd68429fe24) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/CHANGES.txt App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3725.1.patch YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, so it is inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567417#comment-14567417 ] Hudson commented on YARN-2900: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #213 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/213/]) YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: rev 9686261ecb872ad159fac3ca44f1792143c6d7db) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java * hadoop-yarn-project/CHANGES.txt Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Fix For: 2.7.1 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567448#comment-14567448 ] Hudson commented on YARN-2900: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2161 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2161/]) YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: rev 9686261ecb872ad159fac3ca44f1792143c6d7db) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java * hadoop-yarn-project/CHANGES.txt Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Fix For: 2.7.1 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2194: -- Attachment: YARN-2194-3.patch Thanks, [~kasha]. Updated the patch, adding more comments. Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567416#comment-14567416 ] Hudson commented on YARN-3725: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #213 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/213/]) YARN-3725. App submission via REST API is broken in secure mode due to Timeline DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 5cc3fced957a8471733e0e9490878bd68429fe24) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/CHANGES.txt App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3725.1.patch YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, so it is inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567447#comment-14567447 ] Hudson commented on YARN-3725: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2161 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2161/]) YARN-3725. App submission via REST API is broken in secure mode due to Timeline DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 5cc3fced957a8471733e0e9490878bd68429fe24) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java * hadoop-yarn-project/CHANGES.txt App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3725.1.patch YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, so it is inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3748) Cleanup Findbugs volatile warnings
[ https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Liptak updated YARN-3748: --- Attachment: YARN-3748.4.patch Cleanup Findbugs volatile warnings -- Key: YARN-3748 URL: https://issues.apache.org/jira/browse/YARN-3748 Project: Hadoop YARN Issue Type: Bug Reporter: Gabor Liptak Priority: Minor Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch, YARN-3748.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3748) Cleanup Findbugs volatile warnings
[ https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567506#comment-14567506 ] Hadoop QA commented on YARN-3748: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 58s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:red}-1{color} | javac | 3m 19s | The patch appears to cause the build to fail. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736589/YARN-3748.4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 63e3fee | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8152/console | This message was automatically generated. Cleanup Findbugs volatile warnings -- Key: YARN-3748 URL: https://issues.apache.org/jira/browse/YARN-3748 Project: Hadoop YARN Issue Type: Bug Reporter: Gabor Liptak Priority: Minor Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch, YARN-3748.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3699) Decide if flow version should be part of row key or column
[ https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567573#comment-14567573 ] Junping Du commented on YARN-3699: -- Hi [~jrottinghuis] and [~vrushalic], thanks for your comments, and sorry for replying late on this; I was traveling last week. I fully agree with Joep's comments above that there is no right or wrong schema, just the one that best fits the priority scenarios: - if we mostly query flow_runs under a specific flow (or flows), then making the flow version a column will make this query more efficient. - if we equally (or more often) query flow_runs under specific flow version(s), then our decision here could be different. To me, the tricky/interesting part is that the boundary between different flows and flow versions can be vague in practice: how big a change to a flow should start a new flow rather than a new flow version? Why would we have many active flow versions instead of only one active flow version (and more flows)? These trade-offs in application concepts also affect our trade-offs in schema design, which is pretty common in other apps I have seen as well. I would like to trust your prioritization here, given your experience with hRaven, which has been running well in production for years. So I agree the Phoenix schema should be adjusted slightly to get closer to the HBase one. Maybe we should have a new JIRA for this (Phoenix schema) change? We can either keep this JIRA open for discussion or resolve it as Later so that, in the future, if others from the community bring other solid practical scenarios, we can continue the discussion here and try to make a better trade-off. Thoughts? Decide if flow version should be part of row key or column --- Key: YARN-3699 URL: https://issues.apache.org/jira/browse/YARN-3699 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vrushali C Based on discussions in YARN-3411 with [~djp], filing this JIRA to continue the discussion on putting the flow version in the row key or in a column. Whichever approach (Phoenix/HBase) is taken, the JIRA will be updated with the conclusions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
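To make the two candidate layouts concrete, a minimal sketch; the field order, separator, and example values are assumptions for illustration only, not the actual schema:
{code}
String cluster = "cluster1", user = "alice", flow = "dailyETL";
String version = "v7";
long runId = 1433127000000L;

// Option A: flow version inside the row key; runs sort per version,
// so scanning all runs of a flow crosses version boundaries.
String rowKeyWithVersion =
    cluster + "!" + user + "!" + flow + "!" + version + "!" + runId;

// Option B: flow version stored as a column; the row key stays
// version-free, so all runs of a flow sort together and the version
// is read back per row when needed.
String rowKeyWithoutVersion =
    cluster + "!" + user + "!" + flow + "!" + runId;
{code}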
[jira] [Commented] (YARN-3686) CapacityScheduler should trim default_node_label_expression
[ https://issues.apache.org/jira/browse/YARN-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567542#comment-14567542 ] Sunil G commented on YARN-3686: --- Thank You [~leftnoteasy] for reviewing and committing the same! CapacityScheduler should trim default_node_label_expression --- Key: YARN-3686 URL: https://issues.apache.org/jira/browse/YARN-3686 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Priority: Critical Fix For: 2.7.1 Attachments: 0001-YARN-3686.patch, 0002-YARN-3686.patch, 0003-YARN-3686.patch, 0004-YARN-3686.patch We should trim default_node_label_expression for queue before using it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3647) RMWebServices api's should use updated api from CommonNodeLabelsManager to get NodeLabel object
[ https://issues.apache.org/jira/browse/YARN-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567546#comment-14567546 ] Sunil G commented on YARN-3647: --- Thank You [~leftnoteasy] for committing the patch. RMWebServices api's should use updated api from CommonNodeLabelsManager to get NodeLabel object --- Key: YARN-3647 URL: https://issues.apache.org/jira/browse/YARN-3647 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.8.0 Attachments: 0001-YARN-3647.patch, 0002-YARN-3647.patch After YARN-3579, RMWebServices apis can use the updated version of apis in CommonNodeLabelsManager which gives full NodeLabel object instead of creating NodeLabel object from plain label name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism
[ https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev reassigned YARN-3542: --- Assignee: Varun Vasudev Re-factor support for CPU as a resource using the new ResourceHandler mechanism --- Key: YARN-3542 URL: https://issues.apache.org/jira/browse/YARN-3542 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sidharta Seethana Assignee: Varun Vasudev Priority: Critical In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier addition of new resource types in the nodemanager (this was used for network as a resource - See YARN-2140 ). We should refactor the existing CPU implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using the new ResourceHandler mechanism. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567341#comment-14567341 ] Hadoop QA commented on YARN-3170: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 2m 54s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | release audit | 0m 20s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | site | 2m 55s | Site still builds. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | | | 6m 13s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736559/YARN-3170-009.patch | | Optional Tests | site | | git revision | trunk / 63e3fee | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8150/console | This message was automatically generated. YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567366#comment-14567366 ] Hadoop QA commented on YARN-3170: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 3m 1s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | release audit | 0m 20s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | site | 2m 55s | Site still builds. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | | | 6m 20s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736570/YARN-3170-009.patch | | Optional Tests | site | | git revision | trunk / 63e3fee | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8151/console | This message was automatically generated. YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567380#comment-14567380 ] Hudson commented on YARN-2900: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #204 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/204/]) YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: rev 9686261ecb872ad159fac3ca44f1792143c6d7db) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/CHANGES.txt Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Fix For: 2.7.1 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3170: --- Attachment: YARN-3170-009.patch YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3170: --- Attachment: (was: YARN-3170-009.patch) YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3467) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567327#comment-14567327 ] Anubhav Dhoot commented on YARN-3467: - Did you mean it's very verbose? Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI --- Key: YARN-3467 URL: https://issues.apache.org/jira/browse/YARN-3467 Project: Hadoop YARN Issue Type: Improvement Components: webapp, yarn Affects Versions: 2.5.0 Reporter: Anthony Rojas Assignee: Anubhav Dhoot Priority: Minor Fix For: 2.8.0 Attachments: ApplicationAttemptPage.png, Screen Shot 2015-05-26 at 5.46.54 PM.png, YARN-3467.001.patch, YARN-3467.002.patch, yarn-3467-1.patch The YARN REST API can report on the following properties: *allocatedMB*: The sum of memory in MB allocated to the application's running containers *allocatedVCores*: The sum of virtual cores allocated to the application's running containers *runningContainers*: The number of containers currently running for the application Currently, the RM Web UI does not report on these items (at least I couldn't find any entries within the Web UI). It would be useful for YARN Application and Resource troubleshooting to have these properties and their corresponding values exposed on the RM WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567354#comment-14567354 ] Brahma Reddy Battula commented on YARN-3170: [~aw], thanks a lot for your comments. Updated the patch based on them. Kindly review. YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3747) TestLocalDirsHandlerService.java: test directory logDir2 not deleted
[ https://issues.apache.org/jira/browse/YARN-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567355#comment-14567355 ] Hadoop QA commented on YARN-3747: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 6m 34s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 52s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 20s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 29s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 12s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 6m 29s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 25m 7s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestDockerContainerExecutor | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736555/YARN-3747.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / 63e3fee | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8149/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8149/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8149/console | This message was automatically generated. TestLocalDirsHandlerService.java: test directory logDir2 not deleted Key: YARN-3747 URL: https://issues.apache.org/jira/browse/YARN-3747 Project: Hadoop YARN Issue Type: Bug Components: test, yarn Affects Versions: 2.7.0 Reporter: David Moore Priority: Minor Labels: patch, test, yarn Attachments: YARN-3747.patch During a code review of hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java I noted that logDir2 is never deleted while logDir1 is deleted twice. This is not in keeping with the rest of the function and appears to be a bug. I will be submitting a patch shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567358#comment-14567358 ] Hudson commented on YARN-2900: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2143 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2143/]) YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: rev 9686261ecb872ad159fac3ca44f1792143c6d7db) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Fix For: 2.7.1 Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1042) add ability to specify affinity/anti-affinity in container requests
[ https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567291#comment-14567291 ] Steve Loughran commented on YARN-1042: -- I like this, though I'd also like PREFERRED to have two Rs in the middle :). Thinking about how I'd use this in slider, I'd probably want to keep the escalation logic (deciding when to accept shared-node placement) in my own code. That way the AM can choose to wait 1 minute or more for an anti-affine placement before giving up and accepting a node already in use. We already do that when asking for a container back on the host where an instance ran previously. add ability to specify affinity/anti-affinity in container requests --- Key: YARN-1042 URL: https://issues.apache.org/jira/browse/YARN-1042 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 3.0.0 Reporter: Steve Loughran Assignee: Arun C Murthy Attachments: YARN-1042-demo.patch container requests to the AM should be able to request anti-affinity to ensure that things like Region Servers don't come up in the same failure zones. Similarly, you may want to specify affinity to the same host or rack without specifying which specific host/rack. Example: bringing up a small Giraph cluster in a large YARN cluster would benefit from having the processes in the same rack purely for bandwidth reasons. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567340#comment-14567340 ] Chang Li commented on YARN-2556: [~jeagles] could you please help review the latest patch? Thanks! Tool to measure the performance of the timeline server -- Key: YARN-2556 URL: https://issues.apache.org/jira/browse/YARN-2556 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Chang Li Labels: BB2015-05-TBR Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity. I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second, I/O for both read and write would be a good start. This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567357#comment-14567357 ] Hudson commented on YARN-3725: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2143 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2143/]) YARN-3725. App submission via REST API is broken in secure mode due to Timeline DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 5cc3fced957a8471733e0e9490878bd68429fe24) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java * hadoop-yarn-project/CHANGES.txt App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3725.1.patch YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, so it's inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567369#comment-14567369 ] Hadoop QA commented on YARN-3749: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 20s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 35s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-api. | | {color:red}-1{color} | yarn tests | 49m 27s | Tests failed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 50m 14s | Tests passed in hadoop-yarn-server-resourcemanager. | | {color:green}+1{color} | yarn tests | 1m 52s | Tests passed in hadoop-yarn-server-tests. | | | | 147m 20s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.client.api.impl.TestAMRMClient | | | org.apache.hadoop.yarn.client.api.impl.TestYarnClient | | | org.apache.hadoop.yarn.client.api.impl.TestNMClient | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736544/YARN-3749.2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 5cc3fce | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-tests test log | https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8148/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8148/console | This message was automatically generated. 
We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 on RM failover, even though I had initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 is changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since rm1 and rm2 share the same Configuration instance, and both RMs are initialized before either is started, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 when rm1 starts. So I think it is safer to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
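A minimal sketch of the proposed fix (the call sites are assumptions; MiniYARNCluster's real structure may differ): each RM is initialized with its own copy of the configuration, so a mutation such as updateConnectAddr on one RM cannot leak into the other's view.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Each RM gets a private copy of the shared test configuration.
Configuration rm1Conf = new YarnConfiguration(conf); // copy, not alias
Configuration rm2Conf = new YarnConfiguration(conf);
rm1Conf.set(YarnConfiguration.RM_HA_ID, "rm1");
rm2Conf.set(YarnConfiguration.RM_HA_ID, "rm2");
// resourceManagers[0].init(rm1Conf);
// resourceManagers[1].init(rm2Conf);
{code}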
[jira] [Updated] (YARN-3747) TestLocalDirsHandlerService.java: test directory logDir2 not deleted
[ https://issues.apache.org/jira/browse/YARN-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Moore updated YARN-3747: -- Attachment: YARN-3747.patch Please review this patch - Thank you Copyright 2015 David Moore Licensed under the Apache License, Version 2.0 (the License); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. TestLocalDirsHandlerService.java: test directory logDir2 not deleted Key: YARN-3747 URL: https://issues.apache.org/jira/browse/YARN-3747 Project: Hadoop YARN Issue Type: Bug Components: test, yarn Affects Versions: 2.7.0 Reporter: David Moore Priority: Minor Fix For: 2.7.0 Attachments: YARN-3747.patch During a code review of hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java I noted that logDir2 is never deleted while logDir1 is deleted twice. This is not in keeping with the rest of the function and appears to be a bug. I will be submitting a patch shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
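The description implies a one-line cleanup fix; a hedged sketch of that kind of change (variable names assumed from the test, not copied from the patch):
{code}
import java.io.File;
import org.apache.hadoop.fs.FileUtil;

// Before (as described): logDir1 removed twice, logDir2 never removed.
// FileUtil.fullyDelete(new File(logDir1));
// FileUtil.fullyDelete(new File(logDir1));

// After: each test directory is cleaned up exactly once.
FileUtil.fullyDelete(new File(logDir1));
FileUtil.fullyDelete(new File(logDir2));
{code}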
[jira] [Commented] (YARN-3751) TestAHSWebServices fails after YARN-3467
[ https://issues.apache.org/jira/browse/YARN-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567323#comment-14567323 ] Anubhav Dhoot commented on YARN-3751: - There was no failure for this class in the jenkins run for YARN-3467. The change LGTM. TestAHSWebServices fails after YARN-3467 Key: YARN-3751 URL: https://issues.apache.org/jira/browse/YARN-3751 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Sunil G Attachments: 0001-YARN-3751.patch YARN-3467 changed AppInfo and assumed that used resource is not null. It's not true as this information is not published to timeline server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567379#comment-14567379 ] Hudson commented on YARN-3725: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #204 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/204/]) YARN-3725. App submission via REST API is broken in secure mode due to Timeline DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 5cc3fced957a8471733e0e9490878bd68429fe24) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3725.1.patch YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, so it's inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3748) Cleanup Findbugs volatile warnings
[ https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567343#comment-14567343 ] Sean Busbey commented on YARN-3748: --- +1 lgtm, presuming that failed test passes locally. nit: I think the numContainers in AbstractCSQueue can be made private now maybe? Cleanup Findbugs volatile warnings -- Key: YARN-3748 URL: https://issues.apache.org/jira/browse/YARN-3748 Project: Hadoop YARN Issue Type: Bug Reporter: Gabor Liptak Priority: Minor Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3170: --- Attachment: YARN-3170-009.patch YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3699) Decide if flow version should be part of row key or column
[ https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567744#comment-14567744 ] Sangjin Lee commented on YARN-3699: --- Thanks [~djp] for your comments! Once everyone's comfortable with the decision of not making the flow version part of the row key, we could resolve this JIRA by recording that decision (+1 or -1). Then we could open a separate JIRA for the Phoenix writer to relocate the flow version (remove it from the PK). But a bigger question there is: if we're going with the native HBase schema, what is the status of the Phoenix writer implementation? For me (if it wasn't obvious in previous comments), I'm +1 on the flow version *not* being in the row key. Decide if flow version should be part of row key or column --- Key: YARN-3699 URL: https://issues.apache.org/jira/browse/YARN-3699 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vrushali C Based on discussions in YARN-3411 with [~djp], filing this JIRA to continue the discussion on putting the flow version in the row key or in a column. Whichever approach (Phoenix/HBase) is taken, the JIRA will be updated with the conclusions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3699) Decide if flow version should be part of row key or column
[ https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567756#comment-14567756 ] Li Lu commented on YARN-3699: - I'm +1 on removing the flow version section from the row key. I can make the change to our Phoenix writer. However, I agree with [~sjlee0] that we're not sure about the next-step plan for the Phoenix writer. I'm OK with leaving it as an aggregation-only writer for now. Decide if flow version should be part of row key or column --- Key: YARN-3699 URL: https://issues.apache.org/jira/browse/YARN-3699 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vrushali C Based on discussions in YARN-3411 with [~djp], filing this JIRA to continue the discussion on putting the flow version in the row key or in a column. Whichever approach (Phoenix/HBase) is taken, the JIRA will be updated with the conclusions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3492) AM fails to come up because RM and NM can't connect to each other
[ https://issues.apache.org/jira/browse/YARN-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-3492. --- Resolution: Cannot Reproduce Closing this based on previous comments. Please reopen this in case you run into it again. AM fails to come up because RM and NM can't connect to each other - Key: YARN-3492 URL: https://issues.apache.org/jira/browse/YARN-3492 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: pseudo-distributed cluster on a mac Reporter: Karthik Kambatla Priority: Blocker Attachments: mapred-site.xml, yarn-kasha-nodemanager-kasha-mbp.local.log, yarn-kasha-resourcemanager-kasha-mbp.local.log, yarn-site.xml Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The container gets allocated, but doesn't get launched. The NM can't talk to the RM. Logs to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2872) CapacityScheduler: Add disk I/O resource to DRF
[ https://issues.apache.org/jira/browse/YARN-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-2872: - Assignee: Sunil G CapacityScheduler: Add disk I/O resource to DRF --- Key: YARN-2872 URL: https://issues.apache.org/jira/browse/YARN-2872 Project: Hadoop YARN Issue Type: Sub-task Reporter: Karthik Kambatla Assignee: Sunil G -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3751) TestAHSWebServices fails after YARN-3467
[ https://issues.apache.org/jira/browse/YARN-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567682#comment-14567682 ] Sunil G commented on YARN-3751: --- The Findbugs warning can be skipped: exception handling is done in another method in WebServices.java, so it is flagged here as a false positive. The existing test case in TestAHSWebServices covers this scenario. TestAHSWebServices fails after YARN-3467 Key: YARN-3751 URL: https://issues.apache.org/jira/browse/YARN-3751 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Sunil G Attachments: 0001-YARN-3751.patch YARN-3467 changed AppInfo and assumed that used resource is not null. It's not true as this information is not published to timeline server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2872) CapacityScheduler: Add disk I/O resource to DRF
[ https://issues.apache.org/jira/browse/YARN-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567621#comment-14567621 ] Sunil G commented on YARN-2872: --- Hi [~kasha], I would like to take this up for CS. I will work on a patch for the same. Kindly reassign if needed. CapacityScheduler: Add disk I/O resource to DRF --- Key: YARN-2872 URL: https://issues.apache.org/jira/browse/YARN-2872 Project: Hadoop YARN Issue Type: Sub-task Reporter: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567609#comment-14567609 ] Vinod Kumar Vavilapalli commented on YARN-1462: --- bq. Committed to trunk/branch-2/branch-2.7. [~zjshen]/[~xgong], why are we putting this in 2.7? It looks more like an enhancement to me. Unless there is a strong requirement, we should revert it from branch-2.7. AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.7.1 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3751) TestAHSWebServices fails after YARN-3467
[ https://issues.apache.org/jira/browse/YARN-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567672#comment-14567672 ] Hadoop QA commented on YARN-3751: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 56s | Pre-patch trunk has 3 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 54s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 55s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 35s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 1s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-server-common. | | | | 38m 16s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736418/0001-YARN-3751.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 63e3fee | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8153/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8153/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8153/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8153/console | This message was automatically generated. TestAHSWebServices fails after YARN-3467 Key: YARN-3751 URL: https://issues.apache.org/jira/browse/YARN-3751 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Sunil G Attachments: 0001-YARN-3751.patch YARN-3467 changed AppInfo and assumed that used resource is not null. It's not true as this information is not published to timeline server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException
[ https://issues.apache.org/jira/browse/YARN-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567762#comment-14567762 ] Masatake Iwasaki commented on YARN-3752: My /etc/hosts works for a pseudo-distributed cluster and for the NameNode HA unit tests in HDFS. Hostnames are successfully resolved at first in TestRMFailover too. {noformat} java.net.ConnectException: Call From centos7/127.0.0.1 to 0.0.0.0:28031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused ... java.io.EOFException: End of File Exception between local host is: centos7/127.0.0.1; destination host is: 0.0.0.0:18031; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException ... {noformat} The client fails to create a connection due to UnknownHostException while it retries to connect to the next RM after failover. {noformat} java.io.IOException: java.util.concurrent.ExecutionException: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: centos7:28032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost at org.apache.hadoop.ipc.Client.getConnection(Client.java:1487) at org.apache.hadoop.ipc.Client.call(Client.java:1410) at org.apache.hadoop.ipc.Client.call(Client.java:1371) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy15.getApplications(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:251) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy16.getApplications(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:484) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:461) at org.apache.hadoop.yarn.client.TestRMFailover.verifyClientConnection(TestRMFailover.java:119) at org.apache.hadoop.yarn.client.TestRMFailover.verifyConnections(TestRMFailover.java:133) at org.apache.hadoop.yarn.client.TestRMFailover.testExplicitFailover(TestRMFailover.java:168 . 
Caused by: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: centos7:28032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:744) at org.apache.hadoop.ipc.Client$Connection.init(Client.java:408) at org.apache.hadoop.ipc.Client$1.call(Client.java:1483) at org.apache.hadoop.ipc.Client$1.call(Client.java:1480) at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4767) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) ... 49 more {noformat} It may be a timing/environment issue because the test seems to succeed in QA runs. TestRMFailover fails due to intermittent UnknownHostException - Key: YARN-3752 URL: https://issues.apache.org/jira/browse/YARN-3752 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor The client fails to create a connection due to UnknownHostException while retrying to connect to the next RM after failover in a unit test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1462: Fix Version/s: (was: 2.7.1) 2.8.0 AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567758#comment-14567758 ] Xuan Gong commented on YARN-1462: - Okay. Reverted it from branch-2.7. And changed the fix version from branch-2.7.1 to branch-2.8. AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567851#comment-14567851 ] Jason Lowe commented on YARN-3585: -- Thanks for the patch, Rohith! I think it would be safer/simpler to assume we shouldn't be calling Exit unless NodeManager.main() was invoked (i.e.: we're likely running in a JVM whose sole purpose is to be the nodemanager). In that sense I'm wondering if we should flip the logic to not exit but then have NodeManager.main override that. This probably precludes the need to update existing tests. We should be using ExitUtil instead of System.exit directly. Nit: setexitOnShutdownEvent s/b setExitOnShutdownEvent NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Rohith Priority: Critical Attachments: YARN-3585.patch With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread: {noformat} DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x] leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x] VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition {noformat} and jni leveldb thread stack {noformat} Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8 #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 #3 0x003d830e811d in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
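A minimal sketch of the inverted default suggested above (the class and method names beyond setExitOnShutdownEvent are assumptions, not the actual patch):
{code}
import org.apache.hadoop.util.ExitUtil;

public class NodeManagerShutdownSketch {
  // Default is not to exit, so embedded/test usage stays safe.
  private boolean exitOnShutdownEvent = false;

  void setExitOnShutdownEvent(boolean exit) {
    this.exitOnShutdownEvent = exit;
  }

  void onShutdownEvent() {
    stopServices();
    if (exitOnShutdownEvent) {
      // ExitUtil can be disabled in tests, unlike System.exit().
      ExitUtil.terminate(-1);
    }
  }

  void stopServices() { /* stop NM services, close the leveldb store */ }

  public static void main(String[] args) {
    NodeManagerShutdownSketch nm = new NodeManagerShutdownSketch();
    nm.setExitOnShutdownEvent(true); // only the standalone NM JVM opts in
  }
}
{code}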
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567778#comment-14567778 ] Hudson commented on YARN-1462: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7940 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7940/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException
[ https://issues.apache.org/jira/browse/YARN-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567763#comment-14567763 ] Masatake Iwasaki commented on YARN-3752: While doing some trial-and-error, I found that using 127.0.0.1 instead of 0.0.0.0 for the server addresses fixed this. TestRMFailover fails due to intermittent UnknownHostException - Key: YARN-3752 URL: https://issues.apache.org/jira/browse/YARN-3752 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Client fails to create connection due to UnknownHostException while client retries to connect to next RM after failover in unit test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
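In test-setup terms the workaround amounts to binding the test RMs to loopback instead of the wildcard address; the per-RM-id keys and ports below are the same ones reported in YARN-3749:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

Configuration conf = new YarnConfiguration();
// Bind each RM to loopback rather than 0.0.0.0
conf.set("yarn.resourcemanager.address.rm1", "127.0.0.1:18032");
conf.set("yarn.resourcemanager.address.rm2", "127.0.0.1:28032");
{code}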
[jira] [Commented] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException
[ https://issues.apache.org/jira/browse/YARN-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567775#comment-14567775 ] Masatake Iwasaki commented on YARN-3752: I don't know the exact reason yet, but using independent Configuration instances for each RM in MiniYARNCluster, as YARN-3749 does, also worked. TestRMFailover fails due to intermittent UnknownHostException - Key: YARN-3752 URL: https://issues.apache.org/jira/browse/YARN-3752 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Client fails to create connection due to UnknownHostException while client retries to connect to next RM after failover in unit test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
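A sketch of that alternative workaround, the per-RM configuration copy from YARN-3749, so that one RM's updateConnectAddr cannot clobber the keys the other RM reads:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Give each RM its own Configuration copy instead of sharing one instance.
Configuration rm1Conf = new YarnConfiguration(conf);
Configuration rm2Conf = new YarnConfiguration(conf);
{code}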
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568005#comment-14568005 ] Matthew Jacobs commented on YARN-2194: -- While this may work for the default RHEL7 configuration, it will break if someone happens to have mounted the same controllers at /sys/fs/cgroup/cpuacct,cpu, or if the user mounted other controllers at the same path as well. What do you think about creating the symlink from /sys/fs/cgroup/cpu to the mounted path for the cpu controller in all cases (unless it was actually mounted at /sys/fs/cgroup/cpu, of course)? Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
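A hedged sketch of the symlink idea; the mount discovery is elided, and /sys/fs/cgroup/cpu,cpuacct is just the default RHEL7 location (in general the target would come from parsing /proc/mounts):
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path desired = Paths.get("/sys/fs/cgroup/cpu");
// In practice this would be discovered from /proc/mounts, not hard-coded.
Path actualMount = Paths.get("/sys/fs/cgroup/cpu,cpuacct");

// Only link when the cpu controller is not already mounted at the
// conventional path.
if (!Files.exists(desired)) {
  Files.createSymbolicLink(desired, actualMount);
}
{code}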
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567969#comment-14567969 ] Vinod Kumar Vavilapalli commented on YARN-2194: --- Thinking out loud, should we do OS-specific checks for this? Also, does the newer CGroupsHandlerImpl need to change as well? /cc [~vvasudev]. Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568063#comment-14568063 ] Sidharta Seethana commented on YARN-2194: - Isn't it better to use a different separator that is less likely to be in use (e.g. ':' or '|' instead of ',') when invoking container-executor? Granted, this is a (slightly) bigger change, but it seems like the right thing to do. Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568128#comment-14568128 ] Li Lu commented on YARN-3051: - Hi [~varun_saxena], thanks for the work! Not sure if you've already made progress since the latest patch, but I'm posting some of my comments and questions w.r.t. the reader API design in the 003 patch. I may have more comments in the near future, but I wouldn't mind seeing a new patch before posting them. # I noticed there is a _readerLimit_ for read operations, which works for ATS v1. I'm wondering if it would be fine to use -1 to indicate there is no such limit? Not sure if this feature is already there. # For the {{fromId}} parameter, we may need to be careful about the concept of an id. In timeline v2 we need context information to identify each entity, such as cluster, user, flow, and run. When querying with {{fromId}}, what kind of assumptions should we make on the id here? Are we assuming all entities belong to the same cluster, user, and/or flow; or is the id a concatenation of all this information; or is it something else? # For all the filter-related parameters, I'm not sure if the current object model and storage implementation support a trivial solution. I'd certainly welcome any comments/suggestions on this problem. # Based on the previous two issues, a more general question: shall we focus on an evolution of the v1 API here, or start a v2 reader API design from scratch and then try to make it compatible with the v1 APIs? The current patch looks to be pursuing the evolution approach. # Some APIs require clusterID and appID but take no flow/run information. With the current writer implementations, this implies a full table scan. Maybe we can accept flow and run information as optional parameters so that we can avoid full table scans when the caller does have them? # The current APIs require a pretty long list of parameters. For most use cases, I think we can abstract something much simpler. Do we plan to add those simpler APIs in a higher layer? Passing a lot of nulls when calling the reader API looks suboptimal, but with only these few APIs we may need to do that frequently. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051-YARN-2928.003.patch, YARN-3051-YARN-2928.03.patch, YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
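To make points 2, 5, and 6 of the comment above concrete, here is a purely illustrative shape for one reader call, not the actual YARN-3051 interface: flow/run context is optional so implementations can skip full table scans when it is supplied, and a negative limit means unbounded:
{code}
import java.io.IOException;
import java.util.Set;

// Purely illustrative; not the actual YARN-3051 interface.
public interface TimelineReaderSketch {
  /** Placeholder for the timeline entity record. */
  class Entity {}

  /**
   * clusterId, userId, and appId are required context; flowId and flowRunId
   * are optional (null may force a broader scan). limit < 0 means no limit;
   * fromId is an opaque pagination cursor.
   */
  Set<Entity> getEntities(String clusterId, String userId, String flowId,
      Long flowRunId, String appId, String entityType, long limit,
      String fromId) throws IOException;
}
{code}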
[jira] [Updated] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism
[ https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3542: -- Target Version/s: 2.8.0 Let's do this in the 2.8 timeline before the two implementations diverge more. Re-factor support for CPU as a resource using the new ResourceHandler mechanism --- Key: YARN-3542 URL: https://issues.apache.org/jira/browse/YARN-3542 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sidharta Seethana Assignee: Varun Vasudev Priority: Critical In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier addition of new resource types in the nodemanager (this was used for network as a resource - See YARN-2140 ). We should refactor the existing CPU implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using the new ResourceHandler mechanism. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3748) Cleanup Findbugs volatile warnings
[ https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Liptak updated YARN-3748: --- Attachment: (was: YARN-3748.4.patch) Cleanup Findbugs volatile warnings -- Key: YARN-3748 URL: https://issues.apache.org/jira/browse/YARN-3748 Project: Hadoop YARN Issue Type: Bug Reporter: Gabor Liptak Priority: Minor Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism
[ https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568083#comment-14568083 ] Sidharta Seethana commented on YARN-3542: - +1 to this Re-factor support for CPU as a resource using the new ResourceHandler mechanism --- Key: YARN-3542 URL: https://issues.apache.org/jira/browse/YARN-3542 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sidharta Seethana Assignee: Varun Vasudev Priority: Critical In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier addition of new resource types in the nodemanager (this was used for network as a resource - See YARN-2140 ). We should refactor the existing CPU implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using the new ResourceHandler mechanism. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
Sumana Sathish created YARN-3753: Summary: RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Priority: Critical RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at
[jira] [Assigned] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-3753: - Assignee: Jian He RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568145#comment-14568145 ] Jian He commented on YARN-3753: --- This happens because this exception {{new IOException(Wait for ZKClient creation timed out);}} is not retried by upper level runWithRetries method which causes RM to fail. we've seen quite a few issues regarding the retry logic of zk-store, YARN-2716 should be the long-term solution to fix all these. In the interim, I'm writing a quick work-around patch for this, as this problem makes RM unavailable. RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at
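A sketch of the control flow Jian He describes (the method names mirror the stack trace above, but the bodies are illustrative, not the actual ZKRMStateStore code): the retry loop only catches ZooKeeper-level exceptions, so the IOException thrown while waiting for the ZK client escapes on the first attempt and surfaces as a fatal store error.
{code}
// Illustrative only -- not the actual ZKRMStateStore code.
abstract class ZKActionSketch<T> {
  private final int numRetries = 1000;        // zk-num-retries
  private final long retryIntervalMs = 1000;  // zk-retry-interval-ms

  // May throw IOException("Wait for ZKClient creation timed out") when no
  // ZK client appears within the wait window.
  abstract T runWithCheck() throws Exception;

  T runWithRetries() throws Exception {
    for (int retries = 0;; retries++) {
      try {
        return runWithCheck();
      } catch (org.apache.zookeeper.KeeperException ke) {
        // Only ZK-level errors are retried ...
        if (retries >= numRetries) {
          throw ke;
        }
        Thread.sleep(retryIntervalMs);
      }
      // ... so the IOException escapes on the first attempt and the RM
      // treats it as STATE_STORE_OP_FAILED.
    }
  }
}
{code}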
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568187#comment-14568187 ] zhihai xu commented on YARN-3749: - Hi [~chenchun], thanks for filing and working on this issue. The patch seems reasonable to me. Some nits: 1. It looks like setRpcAddressForRM and setConfForRM are only used by test code. Should we create a new HA test utility file to include these functions? 2. Do we really need the following change at {{MiniYARNCluster#serviceInit}}? {code} conf.set(YarnConfiguration.RM_HA_ID, "rm0"); {code} Because I saw that {{initResourceManager}} also configures {{RM_HA_ID}}. 3. Is there any particular reason to configure {{YarnConfiguration.RM_HA_ID}} as {{RM2_NODE_ID}} instead of {{RM1_NODE_ID}} in ProtocolHATestBase? {code} conf.set(YarnConfiguration.RM_HA_ID, RM2_NODE_ID); {code} We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 when RM failover. But I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032, yarn.resourcemanager.address.rm2=0.0.0.0:28032 After digging, I found it is in ClientRMService where the value of yarn.resourcemanager.address.rm2 changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same instance of configuration in rm1 and rm2 and init both RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during starting of rm1. So I think it is safe to make a copy of configuration when init both of the rm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568189#comment-14568189 ] Sergey Shelukhin commented on YARN-1462: This commit changes the newInstance API, breaking the Tez build. It is hard to stay compatible with both pre-2.8 and 2.8... is it possible to preserve both versions of the method? AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
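The usual fix being asked for is a deprecated compatibility overload that delegates to the new method; a placeholder sketch (the class and parameters are hypothetical, not the actual YARN-1462 signatures):
{code}
import java.util.Collections;
import java.util.Set;

public class RecordSketch {
  // New API: takes the extra tags argument introduced by the change.
  public static RecordSketch newInstance(String queue, Set<String> tags) {
    return new RecordSketch();
  }

  // Retained old arity so pre-2.8 callers (e.g. Tez) keep compiling.
  @Deprecated
  public static RecordSketch newInstance(String queue) {
    return newInstance(queue, Collections.<String>emptySet());
  }
}
{code}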
[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568222#comment-14568222 ] Karthik Kambatla commented on YARN-3753: [~jianhe] - YARN-2716 is ready for review. I can make time for addressing any comments to get this in for trunk and branch-2. Given that, would it make sense to limit this fix to branch-2.7? RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568284#comment-14568284 ] Jian He commented on YARN-3753: --- [~kasha], sure, make sense, this can go into branch-2.7 only. And YARN-2716 can get in for trunk and branch-2. RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at
[jira] [Comment Edited] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568292#comment-14568292 ] Jian He edited comment on YARN-3753 at 6/2/15 12:32 AM: Upload a patch to set the wait time based on numRetries*retry-interval. I reproduced this issue locally in following way. 1. start ZK 2. start RM 3. kill ZK. 4. submit a job - without the patch, RM will fail with the same IOException(Wait for ZKClient creation timed out) - with the patch, after re-start ZK server, RM and job can continue run successfully. was (Author: jianhe): Upload a patch to set the wait time based on numRetries*retry-interval. I reproduced this issue locally in following way. 1. start RM. 2. start ZK. 3. kill ZK. 4. submit a job - without the patch, RM will fail with the same IOException(Wait for ZKClient creation timed out) - with the patch, after re-start ZK server, RM and job can continue run successfully. RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3753.patch RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL 
resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at
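A rough sketch of the approach described in the comment above, deriving the ZK-client wait from the configured retry schedule rather than a fixed value (the YarnConfiguration keys are the real ones; the literal fallback defaults here are only placeholders):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

Configuration conf = new YarnConfiguration();
int numRetries = conf.getInt(YarnConfiguration.RM_ZK_NUM_RETRIES, 1000);
long retryIntervalMs =
    conf.getLong(YarnConfiguration.RM_ZK_RETRY_INTERVAL_MS, 1000);
// Wait at least as long as the whole retry schedule before giving up on
// the ZK client, so a restarted ZK server can still be picked up.
long zkClientWaitMs = numRetries * retryIntervalMs;
{code}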
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568429#comment-14568429 ] Masatake Iwasaki commented on YARN-3749: Thanks for working on this, [~chenchun]. I would like this fix to come in because it seems to affect YARN-3752, which I'm looking into. {quote} 2. Do we really need the following change at MiniYARNCluster#serviceInit conf.set(YarnConfiguration.RM_HA_ID, rm0); Because I saw initResourceManager will also configure RM_HA_ID. {quote} When I tried something similar to the patch, I got the error below because {{HAUtil#getRMHAId}}, called from {{YarnConfiguration#updateConnectAddr}}, expects at most one RM id matching the local node. {noformat} 2015-06-02 10:14:23,648 INFO [Thread-284] service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService failed in state STARTED; cause: org.apache.hadoop.HadoopIllegalArgumentException: The HA Configuration has multiple addresses that match local node's address. org.apache.hadoop.HadoopIllegalArgumentException: The HA Configuration has multiple addresses that match local node's address. at org.apache.hadoop.yarn.conf.HAUtil.getRMHAId(HAUtil.java:204) at org.apache.hadoop.yarn.conf.YarnConfiguration.updateConnectAddr(YarnConfiguration.java:1971) at org.apache.hadoop.conf.Configuration.updateConnectAddr(Configuration.java:2129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.serviceStart(ResourceLocalizationService.java:357) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:467) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:321) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper$1.run(MiniYARNCluster.java:562) {noformat} The check can be bypassed by setting a dummy value for {{yarn.resourcemanager.ha.id}} in the configuration *used by the NodeManager instance*. I think there should at least be a comment explaining that it is a dummy value for unit tests. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 when RM failover. But I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032, yarn.resourcemanager.address.rm2=0.0.0.0:28032 After digging, I found it is in ClientRMService where the value of yarn.resourcemanager.address.rm2 changed to 0.0.0.0:18032. 
See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same instance of configuration in rm1 and rm2 and init both RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during starting of rm1. So I think it is safe to make a copy of configuration when init both of the rm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
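A sketch of the dummy-id workaround described in the comment above (YarnConfiguration.RM_HA_ID is the real key for yarn.resourcemanager.ha.id; the value is arbitrary because the NM never uses it to pick an RM):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Copy the cluster conf for the NodeManager and pin a dummy HA id so
// HAUtil#getRMHAId does not fail when several RM addresses match the
// local (loopback) node in a MiniYARNCluster test.
Configuration nmConf = new YarnConfiguration(conf);
nmConf.set(YarnConfiguration.RM_HA_ID, "rm1"); // dummy value, tests only
{code}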
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568462#comment-14568462 ] Rohith commented on YARN-3733: -- This fix needs to go into 2.7.1. Updated the target version to 2.7.1 On RM restart AM getting more than maximum possible memory when many tasks in queue - Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 , 2 NM , 2 RM one NM - 3 GB 6 v core Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: YARN-3733.patch Steps to reproduce = 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) 2. Configure map and reduce size to 512 MB after changing scheduler minimum size to 512 MB 3. Configure capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured) 4. Submit 30 concurrent task 5. Switch RM Actual = For 12 Jobs AM gets allocated and all 12 starts running No other Yarn child is initiated , *all 12 Jobs in Running state for ever* Expected === Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
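For reference, the expected cap in the report works out as follows, assuming each AM container takes the 512 MB scheduler minimum from the steps above:
{noformat}
total cluster memory = 2 NMs x 3072 MB  = 6144 MB
AM limit             = 0.5 x 6144 MB    = 3072 MB
max concurrent AMs   = 3072 MB / 512 MB = 6
{noformat}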
[jira] [Created] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
Bibin A Chundatt created YARN-3754: -- Summary: Race condition when the NodeManager is shutting down and container is launched Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt The container is launched and returned to ContainerImpl after the NodeManager has closed the DB connection, which results in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt reassigned YARN-3754: -- Assignee: Bibin A Chundatt Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-3754: - Assignee: Sunil G (was: Bibin A Chundatt) Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Sunil G The container is launched and returned to ContainerImpl after the NodeManager has closed the DB connection, which results in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} We can add a check for whether the DB is closed while we move the container from the ACQUIRED state. As per the discussion in YARN-3585, the same has been added. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
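A hedged sketch of the suggested guard (field and method shapes are illustrative, not the actual NMLeveldbStateStoreService code): check a closed flag before touching leveldb so a container transition racing with NM shutdown logs and returns instead of throwing DBException: Closed.
{code}
// Illustrative only.
class StateStoreSketch {
  private volatile boolean closed = false;

  synchronized void close() {
    closed = true;
    // ... close the leveldb handle ...
  }

  synchronized void storeContainerDiagnostics(String containerId,
      String diagnostics) {
    if (closed) {
      // NM is shutting down; skip the write rather than fail the transition.
      return;
    }
    // ... db.put(key(containerId), bytes(diagnostics)) ...
  }
}
{code}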
[jira] [Commented] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568573#comment-14568573 ] Hadoop QA commented on YARN-3170: - \\ \\
| (/) *{color:green}+1 overall{color}* | \\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 2m 58s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | release audit | 0m 20s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | site | 2m 55s | Site still builds. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| | | 6m 18s | | \\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12736746/YARN-3170-010.patch |
| Optional Tests | site |
| git revision | trunk / 990078b |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8162/console |
This message was automatically generated. YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170-010.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.7.patch We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I had initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it is ClientRMService that changes the value of yarn.resourcemanager.address.rm2 to 0.0.0.0:18032. See the following code in ClientRMService:
{code}
clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    server.getListenerAddress());
{code}
Since rm1 and rm2 use the same configuration instance, and both RMs are initialized before either is started, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is still rm2 when rm1 starts. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
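As a rough illustration of the proposed copy, the snippet below gives each RM its own Configuration so that updateConnectAddr() and the HA-id change in one RM cannot leak into the other. The property values come from the description above; the surrounding setup is hypothetical, not the actual MiniYARNCluster code:
{code}
// Illustrative sketch: isolate each RM behind its own Configuration copy.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PerRmConfCopy {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.set("yarn.resourcemanager.address.rm1", "0.0.0.0:18032");
    conf.set("yarn.resourcemanager.address.rm2", "0.0.0.0:28032");

    // Sharing 'conf' means rm2's init (updateConnectAddr, HA-id change)
    // clobbers rm1's view. The copy constructor snapshots the properties
    // so each RM mutates only its own copy.
    Configuration rm1Conf = new Configuration(conf);
    Configuration rm2Conf = new Configuration(conf);
    rm1Conf.set(YarnConfiguration.RM_HA_ID, "rm1");
    rm2Conf.set(YarnConfiguration.RM_HA_ID, "rm2");

    // Each RM now resolves its own address, regardless of init order.
    System.out.println(rm1Conf.get("yarn.resourcemanager.address.rm1"));
    System.out.println(rm2Conf.get("yarn.resourcemanager.address.rm2"));
  }
}
{code}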
[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568409#comment-14568409 ] Hadoop QA commented on YARN-3753: - \\ \\
| (x) *{color:red}-1 overall{color}* | \\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 8s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 37s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 47s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 36s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 50m 33s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 88m 42s | | \\ \\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections | \\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12736694/YARN-3753.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cdc13ef |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8154/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8154/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8154/console |
This message was automatically generated. RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3753.patch RM failed to come up with the following error while submitting a MapReduce job.
{code:title=RM log}
2015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006
java.io.IOException: Wait for ZKClient creation timed out
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at
[jira] [Commented] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568428#comment-14568428 ] Tsuyoshi Ozawa commented on YARN-3170: -- {quote} The Scheduler has a pluggable policy plug-in {quote} I think Allen means the sentence is awkward since pluggable plug-in sounds redundant. Could you fix it? YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568466#comment-14568466 ] Sunil G commented on YARN-3733: --- I feel clusterResource=0,0, lhs=1,1 and rhs=2,2 may happen. But we cannot differentiate which infinity is bigger here, and that's not correct. Could we check for clusterResource=0,0 prior to the *getResourceAsValue()* check and handle it from there? On RM restart AM getting more than maximum possible memory when many tasks in queue - Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 NM, 2 RM; one NM - 3 GB 6 vcore Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: YARN-3733.patch Steps to reproduce:
1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
2. Configure map and reduce size to 512 MB after changing the scheduler minimum size to 512 MB
3. Configure the capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured)
4. Submit 30 concurrent tasks
5. Switch RM
Actual: For 12 jobs the AM gets allocated and all 12 start running. No other YARN child is initiated; *all 12 jobs stay in RUNNING state forever*.
Expected: Only 6 should be running at a time since the max AM allocation is .5 (3072 MB). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
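To make the hazard concrete: a dominant-share comparison divides each resource by the cluster total, so with clusterResource=0,0 every nonzero resource maps to infinity (and zero to NaN) and the shares become incomparable. The sketch below assumes a simplified two-dimensional resource; the names and the componentwise fallback are hypothetical, not the actual DominantResourceCalculator patch:
{code}
// Illustrative sketch of a zero-cluster-resource guard.
public final class DominantShareCompare {
  private DominantShareCompare() {}

  // Dominant share: max over dimensions of (used / clusterTotal). With
  // clusterMem == clusterVcores == 0 the float division yields Infinity
  // (or NaN for 0/0), so lhs=1,1 and rhs=2,2 compare as equal.
  static float dominantShare(long mem, long vcores,
                             long clusterMem, long clusterVcores) {
    return Math.max((float) mem / clusterMem, (float) vcores / clusterVcores);
  }

  static int compare(long lMem, long lVcores, long rMem, long rVcores,
                     long clusterMem, long clusterVcores) {
    if (clusterMem == 0 && clusterVcores == 0) {
      // Degenerate cluster: fall back to a plain componentwise comparison
      // instead of shares, so larger absolute demand still sorts later.
      int byMem = Long.compare(lMem, rMem);
      return byMem != 0 ? byMem : Long.compare(lVcores, rVcores);
    }
    return Float.compare(
        dominantShare(lMem, lVcores, clusterMem, clusterVcores),
        dominantShare(rMem, rVcores, clusterMem, clusterVcores));
  }
}
{code}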
[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3754: --- Description: A container is launched and control returns to ContainerImpl after the NodeManager has already closed the DB connection, resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace*
{code}
2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02
java.io.IOException: org.iq80.leveldb.DBException: Closed
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.iq80.leveldb.DBException: Closed
    at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
    at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
    ... 15 more
{code}
We can add a check for whether the DB is closed when we move the container out of the ACQUIRED state. As per the discussion in YARN-3585, the same has been added here. was: A container is launched and control returns to ContainerImpl after the NodeManager has already closed the DB connection, resulting in {{org.iq80.leveldb.DBException: Closed}}.
*Attaching the exception trace*
{code}
2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02
java.io.IOException: org.iq80.leveldb.DBException: Closed
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568539#comment-14568539 ] Rohith commented on YARN-3585: -- Thanks [~jlowe] for the review. bq. if we should flip the logic to not exit but then have NodeManager.main override that. This probably precludes the need to update existing tests. Makes sense to me. Changed the logic to call the JVM exit only when the NodeManager is instantiated from the main function. bq. We should be using ExitUtil instead of System.exit directly. Done. bq. Nit: setexitOnShutdownEvent s/b setExitOnShutdownEvent This method is not necessary now, since the patch presumes true when it is called only from the main function. I have removed it. Kindly review the updated patch. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3585.patch, YARN-3585.patch With NM recovery enabled, after decommission, the nodemanager log shows stop but the process cannot end. Non-daemon threads:
{noformat}
DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable
Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable
Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable
Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and the JNI leveldb thread stack:
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
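The pattern Rohith describes, exiting the JVM on a SHUTDOWN event only when the NM was started from main(), can be sketched as below. ExitUtil is the real org.apache.hadoop.util helper; the service class itself is a hypothetical stand-in for NodeManager:
{code}
// Illustrative sketch: force a JVM exit on SHUTDOWN only for the real
// daemon entry point, so in-process test instances keep running.
import org.apache.hadoop.util.ExitUtil;

public class ShutdownAwareDaemon {
  private final boolean exitOnShutdownEvent;

  public ShutdownAwareDaemon(boolean exitOnShutdownEvent) {
    this.exitOnShutdownEvent = exitOnShutdownEvent;
  }

  void onShutdownEvent() {
    stopServices(); // close DB handles, stop dispatchers, etc.
    if (exitOnShutdownEvent) {
      // ExitUtil.terminate rather than System.exit: tests can call
      // ExitUtil.disableSystemExit() and assert on the attempted exit.
      ExitUtil.terminate(0);
    }
  }

  private void stopServices() { /* release resources here */ }

  public static void main(String[] args) {
    // Only main() opts into the hard exit; tests construct with 'false'
    // (or never go through main), so no separate setter is needed.
    new ShutdownAwareDaemon(true).onShutdownEvent();
  }
}
{code}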
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.4.patch We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I had initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it is ClientRMService that changes the value of yarn.resourcemanager.address.rm2 to 0.0.0.0:18032. See the following code in ClientRMService:
{code}
clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    server.getListenerAddress());
{code}
Since rm1 and rm2 use the same configuration instance, and both RMs are initialized before either is started, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is still rm2 when rm1 starts. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568467#comment-14568467 ] Sunil G commented on YARN-3733: --- I feel clusterResource=0,0, lhs=1,1 and rhs=2,2 may happen. But we cannot differentiate which infinity is bigger here, and that's not correct. Could we check for clusterResource=0,0 prior to the *getResourceAsValue()* check and handle it from there? On RM restart AM getting more than maximum possible memory when many tasks in queue - Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 NM, 2 RM; one NM - 3 GB 6 vcore Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: YARN-3733.patch Steps to reproduce:
1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
2. Configure map and reduce size to 512 MB after changing the scheduler minimum size to 512 MB
3. Configure the capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured)
4. Submit 30 concurrent tasks
5. Switch RM
Actual: For 12 jobs the AM gets allocated and all 12 start running. No other YARN child is initiated; *all 12 jobs stay in RUNNING state forever*.
Expected: Only 6 should be running at a time since the max AM allocation is .5 (3072 MB). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568433#comment-14568433 ] Zhijie Shen commented on YARN-1462: --- bq. This commit changes newInstance API, breaking Tez build. {{newInstance}} is marked as \@Private, and it's not supposed to be used outside Hadoop. What's the use case in Tez? bq. is it possible to preserve both versions of the method? It's possible, but the question is whether we should do it. Theoretically, compatibility is not required for a private method. If there is a strong use case for letting the app report be created outside Hadoop, we should mark this method \@Public and keep it compatible across releases. AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
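For reference, preserving both versions usually means keeping the old factory overload as a thin delegate with a default for the added parameter. The sketch below shows the general shape only; the Report class and its fields are hypothetical, not the actual ApplicationReport API:
{code}
// Illustrative compatibility overload: old callers keep compiling while
// new callers can pass application tags.
import java.util.Collections;
import java.util.Set;

public class Report {
  private final String id;
  private final Set<String> tags;

  private Report(String id, Set<String> tags) {
    this.id = id;
    this.tags = tags;
  }

  /** New factory carrying the added tags parameter. */
  public static Report newInstance(String id, Set<String> tags) {
    return new Report(id, tags);
  }

  /** Old factory preserved for source compatibility; delegates to the new one. */
  @Deprecated
  public static Report newInstance(String id) {
    return newInstance(id, Collections.<String>emptySet());
  }
}
{code}
Whether this is worth doing for a \@Private method is exactly the judgment call discussed above.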
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.5.patch We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I had initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it is ClientRMService that changes the value of yarn.resourcemanager.address.rm2 to 0.0.0.0:18032. See the following code in ClientRMService:
{code}
clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    server.getListenerAddress());
{code}
Since rm1 and rm2 use the same configuration instance, and both RMs are initialized before either is started, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is still rm2 when rm1 starts. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568499#comment-14568499 ] Bibin A Chundatt commented on YARN-3585: [~rohithsharma] and [~sunilg], I have added JIRA YARN-3754 to track the DB connection close issue. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Rohith Priority: Critical Attachments: YARN-3585.patch With NM recovery enabled, after decommission, the nodemanager log shows stop but the process cannot end. Non-daemon threads:
{noformat}
DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable
Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable
Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable
Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and the JNI leveldb thread stack:
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)