[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325576#comment-14325576 ] Sunil G commented on YARN-2004: --- About the priority inversion problem, I feel we could use the approach below: 1. To identify a lower-priority application that has been waiting for resources over a long period, *lastScheduledContainer* in *SchedulerApplicationAttempt* can be used to get the timestamp of the last allocation. Based on a time-limit configuration, it is then possible to identify the apps that are starving. 2. Identify a few higher-priority applications and decrease their headroom explicitly by one resource request of the lower-priority application. 3. Reset the headroom of the higher-priority applications once the lower-priority application has got its container. Kindly share your thoughts on the same. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to an application while doing scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for application priority. If priority is available, then return the higher-priority job. 2. Otherwise continue with the existing logic such as App ID comparison and then timestamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
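A minimal sketch of the comparator behaviour described in this issue, using a simplified stand-in for FiCaSchedulerApp with priority, application ID, and start-time fields (the class shape and field names here are illustrative assumptions, not the actual scheduler code or the attached patch):

{code:java}
import java.util.Comparator;

// Simplified stand-in for FiCaSchedulerApp; the real class carries much more state.
class App {
  final int priority;        // assumption: a larger value means higher priority
  final long applicationId;  // monotonically increasing application id
  final long startTime;      // submission timestamp

  App(int priority, long applicationId, long startTime) {
    this.priority = priority;
    this.applicationId = applicationId;
    this.startTime = startTime;
  }
}

public class PriorityComparatorSketch {
  // 1. Compare by application priority first (higher priority ordered first).
  // 2. Otherwise fall back to the existing App ID and timestamp comparison.
  static final Comparator<App> APPLICATION_COMPARATOR = (a1, a2) -> {
    if (a1.priority != a2.priority) {
      return Integer.compare(a2.priority, a1.priority); // higher priority sorts first
    }
    if (a1.applicationId != a2.applicationId) {
      return Long.compare(a1.applicationId, a2.applicationId); // older application first
    }
    return Long.compare(a1.startTime, a2.startTime);
  };

  public static void main(String[] args) {
    App low = new App(1, 10L, 1000L);
    App high = new App(5, 11L, 2000L);
    // Prints true: the higher-priority app is ordered before the older, lower-priority one.
    System.out.println(APPLICATION_COMPARATOR.compare(high, low) < 0);
  }
}
{code}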
[jira] [Updated] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2820: Attachment: YARN-2820.003.patch Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. -- Key: YARN-2820 URL: https://issues.apache.org/jira/browse/YARN-2820 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown. {code} 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 
at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal. Stack trace from TestFSRMStateStore failure: {code} 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971) at
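The retry being proposed for FileSystemRMStateStore amounts to wrapping the HDFS write in a bounded retry loop instead of failing the RM on the first IOException. The sketch below only illustrates that idea; the helper, the configuration key name mentioned in its comment, and the retry limits are assumptions, not the contents of the attached patches.

{code:java}
import java.io.IOException;

public class RetryOnIOExceptionSketch {
  /** An operation that may fail with an IOException (e.g. writing a state file to HDFS). */
  interface IOOperation {
    void run() throws IOException;
  }

  /**
   * Run the operation, retrying up to maxRetries times with a fixed sleep between
   * attempts, and rethrow the last IOException if every attempt fails. In a real
   * state store the limits would come from configuration (a hypothetical key such as
   * yarn.resourcemanager.fs.state-store.num-retries).
   */
  static void runWithRetries(IOOperation op, int maxRetries, long retryIntervalMs)
      throws IOException, InterruptedException {
    IOException lastFailure = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        op.run();
        return;              // success: no further retries needed
      } catch (IOException e) {
        lastFailure = e;     // remember the failure, then retry after a pause
        if (attempt < maxRetries) {
          Thread.sleep(retryIntervalMs);
        }
      }
    }
    throw lastFailure;       // all attempts failed: surface the error to the caller
  }

  public static void main(String[] args) throws Exception {
    // Toy usage: the first two "writes" fail the way the log above shows, the third succeeds.
    int[] calls = {0};
    runWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new IOException("Unable to close file because the last block does not have enough number of replicas.");
      }
    }, 5, 10L);
    System.out.println("succeeded after " + calls[0] + " attempts");
  }
}
{code}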
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325589#comment-14325589 ] zhihai xu commented on YARN-2820: - [~ozawa], thanks for the review. Your suggestion is good. I uploaded a new patch YARN-2820.003.patch, which addressed your comment. please review it. thanks zhihai Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. -- Key: YARN-2820 URL: https://issues.apache.org/jira/browse/YARN-2820 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown. {code} 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 
at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal. Stack trace from TestFSRMStateStore failure: {code} 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started at
[jira] [Updated] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2820: Description: Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown. {code} 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 
at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal. Stack trace from TestFSRMStateStore failure: {code} 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at
[jira] [Updated] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2820: Description: Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown. {code} 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 
at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal. Stack trace from TestFSRMStateStore failure: {code} 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at
[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325581#comment-14325581 ] Devaraj K commented on YARN-3197: - Thanks [~varun_saxena] for the patch and [~rohithsharma] for the comment. {code:xml} + + ] of unknown application completed with event + event); {code} Here 'unknown application' may not always be appropriate. Instead, can we think of logging something like 'Unknown container ' + containerStatus.getContainerId() + ' completed with event ' + event? bq. It would be better if the log level changed to DEBUG. In NM restart, these messages are very huge Do you see any other INFO logs coming for the same container? IMO, there should be at least one INFO log about this container status update from the NM after NM restart. Confusing log generated by CapacityScheduler Key: YARN-3197 URL: https://issues.apache.org/jira/browse/YARN-3197 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Hitesh Shah Assignee: Varun Saxena Priority: Minor Attachments: YARN-3197.001.patch 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
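A self-contained sketch of the message being discussed, in place of the old "Null container completed..." line. Only the wording of the log statement comes from the comments above; the types, the null check, and the method shape are stand-ins chosen for illustration.

{code:java}
import java.util.logging.Logger;

public class CompletedContainerLogSketch {
  private static final Logger LOG = Logger.getLogger("CapacityScheduler");

  // Minimal stand-ins for the real YARN types, just enough to show the message format.
  static class ContainerId {
    @Override
    public String toString() { return "container_1423780000000_0001_01_000002"; } // sample id
  }
  static class ContainerStatus {
    ContainerId getContainerId() { return new ContainerId(); }
  }

  // rmContainer would normally be looked up from the attempt's liveContainers;
  // here it is passed in directly so the null branch can be exercised.
  static void completedContainer(Object rmContainer, ContainerStatus containerStatus, String event) {
    if (rmContainer == null) {
      // Suggested replacement for the old "Null container completed..." message.
      LOG.info("Unknown container " + containerStatus.getContainerId()
          + " completed with event " + event);
      return;
    }
    // ... normal completion handling would follow here ...
  }

  public static void main(String[] args) {
    completedContainer(null, new ContainerStatus(), "FINISHED");
  }
}
{code}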
[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.
[ https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325750#comment-14325750 ] Hudson commented on YARN-3207: -- FAILURE: Integrated in Hadoop-Yarn-trunk #842 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/842/]) YARN-3207. Secondary filter matches entites which do not have the key (xgong: rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * hadoop-yarn-project/CHANGES.txt secondary filter matches entites which do not have the key being filtered for. -- Key: YARN-3207 URL: https://issues.apache.org/jira/browse/YARN-3207 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Prakash Ramachandran Assignee: Zhijie Shen Attachments: YARN-3207.1.patch in the leveldb implementation of the TimelineStore the secondary filter matches entities where the key being searched for is not present. ex query from tez ui http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar will match and return the entity even though there is no entity with otherinfo.foo defined. the issue seems to be in {code:title=LeveldbTimelineStore:675} if (vs != null && !vs.contains(filter.getValue())) { filterPassed = false; break; } {code} this should be IMHO vs == null || !vs.contains(filter.getValue()) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
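For reference, a minimal sketch of the corrected check described above: with the old condition an entity whose otherinfo lacks the key slips through the filter, while the proposed condition rejects it. The helper and variable names here are illustrative, not the LeveldbTimelineStore code.

{code:java}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SecondaryFilterSketch {
  /**
   * Returns true when the entity should be rejected for the given secondary-filter value.
   * 'values' is the set of values stored under the filter key for this entity,
   * or null when the key is absent. Per the report, an absent key must NOT match.
   */
  static boolean failsFilter(Set<Object> values, Object filterValue) {
    // Old behaviour was effectively "values != null && !values.contains(filterValue)",
    // which lets entities without the key pass through the filter.
    return values == null || !values.contains(filterValue);
  }

  public static void main(String[] args) {
    Set<Object> withKey = new HashSet<>(Collections.singletonList("bar"));
    System.out.println(failsFilter(withKey, "bar")); // false: value matches, keep the entity
    System.out.println(failsFilter(withKey, "baz")); // true: value differs, filter it out
    System.out.println(failsFilter(null, "bar"));    // true: key absent, filter it out (the fix)
  }
}
{code}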
[jira] [Created] (YARN-3211) Do not use zero as the beginning number for commands for LinuxContainerExecutor
Liang-Chi Hsieh created YARN-3211: - Summary: Do not use zero as the beginning number for commands for LinuxContainerExecutor Key: YARN-3211 URL: https://issues.apache.org/jira/browse/YARN-3211 Project: Hadoop YARN Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor Currently the implementation of LinuxContainerExecutor and container-executor uses numbers as its commands, and the commands begin from zero (INITIALIZE_CONTAINER). LinuxContainerExecutor passes the numeric command as a command-line parameter when running container-executor, and container-executor calls atoi() to parse the command string into an integer. However, atoi() returns zero when it cannot parse the string as an integer. So if you give a non-numeric command, container-executor still accepts it and runs the INITIALIZE_CONTAINER command. I think this is wrong and we should not use zero as the beginning number for the commands. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
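A sketch of the numbering change being argued for, written in Java for consistency with the rest of this digest (the native container-executor is C, where atoi() silently maps unparsable input to 0). INITIALIZE_CONTAINER is named in the description; the other constants and the validation helper are assumptions made for illustration.

{code:java}
public class ContainerExecutorCommandsSketch {
  // Proposed: start command codes at 1 so that a failed numeric parse
  // (atoi() returning 0 for non-numeric input in the native binary)
  // can never collide with a real command.
  static final int INITIALIZE_CONTAINER = 1;
  static final int LAUNCH_CONTAINER = 2;     // hypothetical further commands
  static final int SIGNAL_CONTAINER = 3;
  static final int DELETE_AS_USER = 4;

  /** Hypothetical helper: map a raw command string to a command code, or -1 if invalid. */
  static int parseCommand(String raw) {
    int code;
    try {
      code = Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      return -1;  // non-numeric input is rejected instead of being treated as command 0
    }
    return (code >= INITIALIZE_CONTAINER && code <= DELETE_AS_USER) ? code : -1;
  }

  public static void main(String[] args) {
    System.out.println(parseCommand("1"));      // 1  -> INITIALIZE_CONTAINER
    System.out.println(parseCommand("bogus"));  // -1 -> rejected
    System.out.println(parseCommand("0"));      // -1 -> zero is no longer a valid command
  }
}
{code}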
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325676#comment-14325676 ] Hadoop QA commented on YARN-2820: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699443/YARN-2820.003.patch against trunk revision b6fc1f3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6658//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6658//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6658//console This message is automatically generated. Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. -- Key: YARN-2820 URL: https://issues.apache.org/jira/browse/YARN-2820 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown. {code} 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 
2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at
[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325702#comment-14325702 ] Devaraj K commented on YARN-3197: - rmContainer could be null when the SchedulerApplicationAttempt is null or liveContainers doesn't have the container info. There is a chance that the ApplicationAttempt is running and the container has already completed (been removed from liveContainers), so here we cannot say 'unknown application'. I suggested 'Unknown container' because the RM has removed this container's info and doesn't know about this container any more. Do you see any better message here? Confusing log generated by CapacityScheduler Key: YARN-3197 URL: https://issues.apache.org/jira/browse/YARN-3197 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Hitesh Shah Assignee: Varun Saxena Priority: Minor Attachments: YARN-3197.001.patch 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325610#comment-14325610 ] Rohith commented on YARN-3194: -- Thanks [~jlowe] [~djp] [~jianhe] for the detailed review :-) bq. the container status processing code is almost a duplicate of the same code in StatusUpdateWhenHealthyTransition Agree, this has to be refactored; the majority of the containerStatus-processing code is the same. bq. we don't remove containers that have completed from the launchedContainers map which seems wrong I see, yes. Completed containers should be removed from launchedContainers. bq. I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM IIUC it is only to deal with NMContainerStatus and ContainerStatus, but I am not sure why these two were created differently. What I see is that ContainerStatus is a subset of NMContainerStatus; I think ContainerStatus could have been included inside NMContainerStatus. bq. Is below condition valid for the newly added code in ReconnectNodeTransition too ? Yes, it is applicable since we are keeping the old RMNode object. bq. Add timeout to the test, testAppCleanupWhenNMRstarts - testProcessingContainerStatusesOnNMRestart ? and add more detailed comments about what the test is doing too ? Agree. bq. Could you add a validation that ApplicationMasterService#allocate indeed receives the completed container in this scenario? Agree, I will add it. bq. Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container ? The AM memory is 1024 and the additionally requested container memory is 2048. In the test, the number of requested containers is 1, so AllocatedMB should be AM + requested, i.e. 1024 + 2048 = 3072. After NM restart,completed containers are not released by RM which are sent during NM registration -- Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-yarn-3194-v1.patch On NM restart, the NM sends all the outstanding NMContainerStatus to the RM. But the RM processes only ContainerState.RUNNING. If a container completed while the NM was down, those containers' resources won't be released, which results in applications hanging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3211) Do not use zero as the beginning number for commands for LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated YARN-3211: -- Attachment: YARN-3211.patch Do not use zero as the beginning number for commands for LinuxContainerExecutor --- Key: YARN-3211 URL: https://issues.apache.org/jira/browse/YARN-3211 Project: Hadoop YARN Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor Attachments: YARN-3211.patch Currently the implementation of LinuxContainerExecutor and container-executor uses numbers as its commands, and the commands begin from zero (INITIALIZE_CONTAINER). LinuxContainerExecutor passes the numeric command as a command-line parameter when running container-executor, and container-executor calls atoi() to parse the command string into an integer. However, atoi() returns zero when it cannot parse the string as an integer. So if you give a non-numeric command, container-executor still accepts it and runs the INITIALIZE_CONTAINER command. I think this is wrong and we should not use zero as the beginning number for the commands. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3199) Fair Scheduler documentation improvements
[ https://issues.apache.org/jira/browse/YARN-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325618#comment-14325618 ] Gururaj Shetty commented on YARN-3199: -- [~ka...@cloudera.com] your comment has been incorporated. It will be merged once all the docs are converted to Markdown. Fair Scheduler documentation improvements - Key: YARN-3199 URL: https://issues.apache.org/jira/browse/YARN-3199 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: Gururaj Shetty Priority: Minor Labels: documentation Attachments: YARN-3199.patch {{yarn.scheduler.increment-allocation-mb}} and {{yarn.scheduler.increment-allocation-vcores}} are not documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3194: - Summary: After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node (was: After NM restart,completed containers are not released by RM which are sent during NM registration) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-yarn-3194-v1.patch On NM restart ,NM sends all the outstanding NMContainerStatus to RM. But RM process only ContainerState.RUNNING. If container is completed when NM was down then those containers resources wont be release which result in applications to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325616#comment-14325616 ] Hadoop QA commented on YARN-2820: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699439/YARN-2820.002.patch against trunk revision b6fc1f3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6657//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6657//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6657//console This message is automatically generated. Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. -- Key: YARN-2820 URL: https://issues.apache.org/jira/browse/YARN-2820 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown. {code} 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 
2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at
[jira] [Updated] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3194: - Description: On NM restart, the NM sends all the outstanding NMContainerStatus to the RM during registration. The registration can be treated by the RM as a new node or a reconnecting node, and the RM triggers the corresponding event on the basis of the node-added or node-reconnected state. # Node added event: 2 scenarios can occur here ## New node is registering with different ip:port – NOT A PROBLEM ## Old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM # Node reconnected event: ## Existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted ### NM RESTART NOT Enabled – NOT A PROBLEM ### NM RESTART is Enabled Some applications are running on this node – *Problem is here* Zero applications are running on this node – NOT A PROBLEM Since the NMContainerStatuses are not handled, the RM never gets to know about the completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. This results in applications waiting indefinitely because their pending container requests are not served by the RM. was: On NM restart, the NM sends all the outstanding NMContainerStatus to the RM during registration. The registration can be treated by the RM as a new node or a reconnecting node, and the RM triggers the corresponding event on the basis of the node-added or node-reconnected state. # Node added event: 2 scenarios can occur here ## New node is registering with different ip:port – NOT A PROBLEM ## Old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM # Node reconnected event: ## Existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted ### NM RESTART NOT Enabled – NOT A PROBLEM ### NM RESTART is Enabled Some applications are running on this node – *Problem is here* Zero applications are running on this node – NOT A PROBLEM After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-yarn-3194-v1.patch On NM restart, the NM sends all the outstanding NMContainerStatus to the RM during registration. The registration can be treated by the RM as a new node or a reconnecting node, and the RM triggers the corresponding event on the basis of the node-added or node-reconnected state. # Node added event: 2 scenarios can occur here ## New node is registering with different ip:port – NOT A PROBLEM ## Old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM # Node reconnected event: ## Existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted ### NM RESTART NOT Enabled – NOT A PROBLEM ### NM RESTART is Enabled Some applications are running on this node – *Problem is here* Zero applications are running on this node – NOT A PROBLEM Since the NMContainerStatuses are not handled, the RM never gets to know about the completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered.
This results in applications waiting indefinitely because their pending container requests are not served by the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3194: - Description: On NM restart ,NM sends all the outstanding NMContainerStatus to RM during registration. The registration can be treated by RM as New node or Reconnecting node. RM triggers corresponding event on the basis of node added or node reconnected state. # Node added event : Again here 2 scenario's can occur ## New node is registering with different ip:port – NOT A PROBLEM ## Old node is re-registering because of RESYNC command from RM when RM restart – NOT A PROBLEM # Node reconnected event : ## Existing node is re-registering i.e RM treat it as reconnecting node when RM is not restarted ### NM RESTART NOT Enabled – NOT A PROBLEM ### NM RESTART is Enabled Some applications are running on this node – *Problem is here* Zero applications are running on this node – NOT A PROBLEM was:On NM restart ,NM sends all the outstanding NMContainerStatus to RM. But RM process only ContainerState.RUNNING. If container is completed when NM was down then those containers resources wont be release which result in applications to hang. After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-yarn-3194-v1.patch On NM restart ,NM sends all the outstanding NMContainerStatus to RM during registration. The registration can be treated by RM as New node or Reconnecting node. RM triggers corresponding event on the basis of node added or node reconnected state. # Node added event : Again here 2 scenario's can occur ## New node is registering with different ip:port – NOT A PROBLEM ## Old node is re-registering because of RESYNC command from RM when RM restart – NOT A PROBLEM # Node reconnected event : ## Existing node is re-registering i.e RM treat it as reconnecting node when RM is not restarted ### NM RESTART NOT Enabled – NOT A PROBLEM ### NM RESTART is Enabled Some applications are running on this node – *Problem is here* Zero applications are running on this node – NOT A PROBLEM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325621#comment-14325621 ] zhihai xu commented on YARN-2820: - I checked the warning message, all these 5 findbugs are not related to my change. Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. -- Key: YARN-2820 URL: https://issues.apache.org/jira/browse/YARN-2820 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException. When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown. {code} 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ appattempt_1409135750325_109118_01.new.tmp retrying... 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. 
at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal. Stack trace from TestFSRMStateStore failure: {code} 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876) at
[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325620#comment-14325620 ] Varun Saxena commented on YARN-3197: Hmm...But we do have Container ID. Would it be right to say Unknown container if we are printing ContainerID ? We do not know the Application ID however. Confusing log generated by CapacityScheduler Key: YARN-3197 URL: https://issues.apache.org/jira/browse/YARN-3197 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Hitesh Shah Assignee: Varun Saxena Priority: Minor Attachments: YARN-3197.001.patch 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3212: - Attachment: RMNodeImpl - new.png Attached the new state transition diagram for RMNode. RMNode State Transition Update with DECOMMISSIONING state - Key: YARN-3212 URL: https://issues.apache.org/jira/browse/YARN-3212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Junping Du Assignee: Junping Du Attachments: RMNodeImpl - new.png As proposed in YARN-914, a new state of “DECOMMISSIONING” will be added, entered from the “running” state and triggered by a new event - “decommissioning”. This new state can transition to the “decommissioned” state on Resource_Update if no apps are running on this NM, or when the NM reconnects after restart, or when it receives a DECOMMISSIONED event (after a timeout from the CLI). In addition, it can go back to “running” if the user decides to cancel the previous decommission by calling recommission on the same node. The reaction to other events is similar to the RUNNING state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
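A simplified, self-contained sketch of the transitions just described. This is not the actual RMNodeImpl state machine: the state and event names are taken from the description above, while the method shape and the running-apps check are illustrative assumptions.

{code:java}
public class DecommissioningTransitionSketch {
  enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }
  enum NodeEvent { DECOMMISSIONING, RESOURCE_UPDATE, RECONNECTED, DECOMMISSIONED, RECOMMISSION }

  /** Returns the next state; runningApps is the number of apps still on the NM. */
  static NodeState next(NodeState current, NodeEvent event, int runningApps) {
    if (current == NodeState.RUNNING && event == NodeEvent.DECOMMISSIONING) {
      return NodeState.DECOMMISSIONING;          // graceful decommissioning starts
    }
    if (current == NodeState.DECOMMISSIONING) {
      switch (event) {
        case RESOURCE_UPDATE:
        case RECONNECTED:
          // Finish only once nothing is running on the node any more.
          return runningApps == 0 ? NodeState.DECOMMISSIONED : current;
        case DECOMMISSIONED:
          return NodeState.DECOMMISSIONED;       // forced, e.g. after the CLI timeout
        case RECOMMISSION:
          return NodeState.RUNNING;              // user cancels the decommission
        default:
          return current;                        // other events behave as in RUNNING
      }
    }
    return current;
  }

  public static void main(String[] args) {
    NodeState s = next(NodeState.RUNNING, NodeEvent.DECOMMISSIONING, 2);
    s = next(s, NodeEvent.RESOURCE_UPDATE, 2);   // apps still running: stays DECOMMISSIONING
    s = next(s, NodeEvent.RESOURCE_UPDATE, 0);   // drained: DECOMMISSIONED
    System.out.println(s);
  }
}
{code}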
[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen
[ https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326178#comment-14326178 ] Joep Rottinghuis commented on YARN-3041: Agreed with [~sjlee0] that we should use an enum to enumerate the timeline entity types. Not sure if we should directly use enums, or have TimelineEntity.type be interface TimelineEntityType and have an enum that implements that interface. The latter is more extensible later on (there could be other enums implementing the interface). On the other hand that makes things a bit harder to enumerate over, so perhaps that is overkill. [Data Model] create overall data objects of TS next gen --- Key: YARN-3041 URL: https://issues.apache.org/jira/browse/YARN-3041 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.preliminary.001.patch Per design in YARN-2928, create the ATS entity and events API. Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, flow, flow run, YARN app, ...). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326197#comment-14326197 ] Sunil G commented on YARN-2004: --- Hi Jason, thank you for sharing the thoughts. In a way, we do not need to think about headroom and user limit. Still, I would like to share two scenarios. 1. Similar to MAPREDUCE-314. A job j1 is submitted with lower priority; it has finished its map tasks and its reducers are running. Later j2 and j3 come in and take over the cluster resources. If a map fails, losing some map output, there is no chance of j1 getting a resource until j2 and j3 release resources. In the worst case, j1 will starve for much longer. This was one of the intentions behind temporarily pausing the demand from j2 and j3 for a while and sparing some resources for j1. 2. User limit: assume the factor is 25, so 4 users can take 25% each of the cluster and a 5th user has to wait. Assume the highest priority app is submitted by the 5th user. He may not get resources until the demand from the first 4 users (for existing apps) is over. Do you feel this needs to be handled? Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch Based on the priority of the application, the Capacity Scheduler should be able to give preference to applications while scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with the existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
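For readers following the thread, a minimal sketch of the comparator change described in the issue summary. It assumes the patch adds a getPriority() accessor to FiCaSchedulerApp and that a larger priority value means a more important application; the attached patch may differ.
{code}
// Illustrative only: priority first, then the existing FIFO ordering by application id.
Comparator<FiCaSchedulerApp> applicationComparator =
    new Comparator<FiCaSchedulerApp>() {
      @Override
      public int compare(FiCaSchedulerApp a1, FiCaSchedulerApp a2) {
        // 1. If both applications carry a priority, the higher priority comes first.
        if (a1.getPriority() != null && a2.getPriority() != null) {
          int byPriority = Integer.compare(
              a2.getPriority().getPriority(), a1.getPriority().getPriority());
          if (byPriority != 0) {
            return byPriority;
          }
        }
        // 2. Otherwise fall back to the existing ordering by application id.
        return a1.getApplicationId().compareTo(a2.getApplicationId());
      }
    };
{code}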
[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen
[ https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326183#comment-14326183 ] Joep Rottinghuis commented on YARN-3041: Some additional thoughts: If we have the types strongly typed, do we need to call containers YARN_CONTAINER and YARN_FLOW, or would we be able to capture more generic flows and containers with this as well? Perhaps the framework used to run could be a property of the generic entity. I don't see the advantage of having the user set up the proper relationship. Why not make that part of the constructors and have protected methods to set up the hierarchy correctly? Why introduce a chance of having this all set up wrongly? I think the acceptable entity types for parent-child relationships can be set up in the enum itself; the enums would simply have methods on them and can take constructors. [Data Model] create overall data objects of TS next gen --- Key: YARN-3041 URL: https://issues.apache.org/jira/browse/YARN-3041 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.preliminary.001.patch Per design in YARN-2928, create the ATS entity and events API. Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, flow, flow run, YARN app, ...). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
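A small sketch of the idea that the enum itself can encode the acceptable parent-child relationships, as suggested in the comment above; the constant names and hierarchy are illustrative only.
{code}
// Illustrative: the enum knows which child types it may be a parent of.
enum TimelineEntityKind {
  CLUSTER, FLOW, FLOW_RUN, APPLICATION, APP_ATTEMPT, CONTAINER;

  /** Whether this entity type is an acceptable parent of the given child type. */
  boolean isParentOf(TimelineEntityKind child) {
    switch (this) {
      case CLUSTER:     return child == FLOW;
      case FLOW:        return child == FLOW_RUN;
      case FLOW_RUN:    return child == APPLICATION;
      case APPLICATION: return child == APP_ATTEMPT;
      case APP_ATTEMPT: return child == CONTAINER;
      default:          return false;
    }
  }
}
{code}
The same information could equally be supplied through enum constructors; a method keeps the sketch short.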
[jira] [Updated] (YARN-3194) After NM restart, RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3194: - Attachment: 0001-YARN-3194.patch After NM restart, RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM during registration. The registration can be treated by the RM as a new node or a reconnecting node, and the RM triggers the corresponding event accordingly. # Node added event : two scenarios can occur here ## A new node is registering with a different ip:port – NOT A PROBLEM ## An old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM # Node reconnected event : ## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted ### NM restart NOT enabled – NOT A PROBLEM ### NM restart enabled Some applications are running on this node – *Problem is here* Zero applications are running on this node – NOT A PROBLEM Since the NMContainerStatuses are not handled, the RM never gets to know about completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. This results in applications waiting indefinitely because their pending containers are not served by the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
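As background for the scenario marked *Problem is here*, a hedged helper that is not part of the attached patch: it merely selects the completed containers a restarted NM reports at registration, which is exactly the information the RM needs in order to release resources for a reconnected node instead of dropping it.
{code}
// Illustrative helper only, not the attached patch.
static List<NMContainerStatus> completedContainerStatuses(
    List<NMContainerStatus> reported) {
  List<NMContainerStatus> completed = new ArrayList<NMContainerStatus>();
  for (NMContainerStatus status : reported) {
    if (status.getContainerState() == ContainerState.COMPLETE) {
      completed.add(status);
    }
  }
  return completed;
}
{code}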
[jira] [Commented] (YARN-933) Potential InvalidStateTransitonException: Invalid event: LAUNCHED at FINAL_SAVING
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326235#comment-14326235 ] Rohith commented on YARN-933: - [~jianhe] kindly review the updated patch. Potential InvalidStateTransitonException: Invalid event: LAUNCHED at FINAL_SAVING - Key: YARN-933 URL: https://issues.apache.org/jira/browse/YARN-933 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: J.Andreina Assignee: Rohith Attachments: 0001-YARN-933.patch, 0001-YARN-933.patch, 0004-YARN-933.patch, YARN-933.3.patch, YARN-933.patch am max retries configured as 3 at client and RM side. Step 1: Install cluster with NM on 2 Machines Step 2: Make Ping using ip from RM machine to NM1 machine as successful ,But using Hostname should fail Step 3: Execute a job Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , connection loss happened. Observation : == After AppAttempt_1 has moved to failed state ,release of container for AppAttempt_1 and Application removal are successful. New AppAttempt_2 is sponed. 1. Then again retry for AppAttempt_1 happens. 2. Again RM side it is trying to launch AppAttempt_1, hence fails with InvalidStateTransitonException 3. Client got exited after AppAttempt_1 is been finished [But actually job is still running ], while the appattempts configured is 3 and rest appattempts are all sponed and running. RMLogs: == 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02, 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45 2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_01. 
Got exception: java.lang.reflect.UndeclaredThrowableException 2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495) at
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326238#comment-14326238 ] Jason Lowe commented on YARN-2004: -- For your first scenario, it can happen today without priority. MR jobs ask for resources in waves -- first all the maps, then over time it ramps up reducers. Multiple jobs in the same queue from the same user can collide in different phases. That's the whole point of the headroom calculation and reporting -- to allow AMs to realize this scenario is happening and react to it. In this case what will happen is j1 will see its headroom is zero and start killing reducers to make room for the failed map task. After killing the reducers there will be some free resources in the cluster (if they weren't stolen by another, underserved queue). Then the question becomes who will get those resources. If we're using the default priority, j1 will get first crack at them due to FIFO priority. If j2 or j3 were made higher priority then j1 will see that its headroom is _still_ zero after killing some reducers and will probably kill some more to try to make room. Rinse, repeat until j1 is out of reducers to shoot or gets the resources it needs to run the failed map. For the second scenario, the 5th user will _still_ be the first one to get any spare resources in the queue because he has the highest priority app. Note that the user limit calculation does not involve comparing a user's current limit with other users' usage. It's just a computation of what's available in the queue and what you're allowed based on the configured user limit and user limit factor. So what will happen is the 5th user will continue to consume any free resources in the queue until either the app is satiated or the 5th user hits the 25% cap. If there are no free resources then the 5th user's app will starve (without preemption) just like the rest until resources show up. Again, higher priority just means you're first in line to get resources when they are freed up, and it doesn't change anything else. We can discuss adding preemption into the mix to force higher priority apps to get their requested resources faster in a full queue. However I think the first step is to get priority scheduling working for resources that are free in the queue in the non-preemption case, as that's still very useful in practice. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch Based on the priority of the application, the Capacity Scheduler should be able to give preference to applications while scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with the existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3194) After NM restart, RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326191#comment-14326191 ] Rohith commented on YARN-3194: -- Attached the patch addressing all the above comments. Kindly review the new patch. After NM restart, RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM during registration. The registration can be treated by the RM as a new node or a reconnecting node, and the RM triggers the corresponding event accordingly. # Node added event : two scenarios can occur here ## A new node is registering with a different ip:port – NOT A PROBLEM ## An old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM # Node reconnected event : ## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted ### NM restart NOT enabled – NOT A PROBLEM ### NM restart enabled Some applications are running on this node – *Problem is here* Zero applications are running on this node – NOT A PROBLEM Since the NMContainerStatuses are not handled, the RM never gets to know about completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. This results in applications waiting indefinitely because their pending containers are not served by the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen
[ https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326185#comment-14326185 ] Joep Rottinghuis commented on YARN-3041: I think version may have to be something more than a property on a flow. We need to be able to query by versions. [Data Model] create overall data objects of TS next gen --- Key: YARN-3041 URL: https://issues.apache.org/jira/browse/YARN-3041 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.preliminary.001.patch Per design in YARN-2928, create the ATS entity and events API. Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, flow, flow run, YARN app, ...). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state
Junping Du created YARN-3212: Summary: RMNode State Transition Update with DECOMMISSIONING state Key: YARN-3212 URL: https://issues.apache.org/jira/browse/YARN-3212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Junping Du Assignee: Junping Du As proposed in YARN-914, a new state of “DECOMMISSIONING” will be added; it is entered from the “running” state, triggered by a new “decommissioning” event. The state can transition to “decommissioned” on a Resource_Update when no apps are running on this NM, when the NM reconnects after a restart, or when a DECOMMISSIONED event is received (after the timeout set from the CLI). In addition, it can go back to “running” if the user cancels the pending decommission by recommissioning the same node. The reaction to other events is similar to the RUNNING state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326344#comment-14326344 ] Wangda Tan commented on YARN-3197: -- I think it's better not to say {code} LOG.info("Container [ContainerId: " + containerStatus.getContainerId() + "] of unknown application completed with event " + event); {code} Since we have the containerId within containerStatus, it's better to indicate that we cannot get the RMContainer because the attempt has probably already completed; I suggest printing both the containerId and the applicationId. I think INFO could be fine since it will be at most once for each container. And the logging below is also confusing: {code} if (application == null) { LOG.info("Container " + container + " of unknown application " + appId + " completed with event " + event); return; } {code} If the RM can get the RMContainer, the application is definitely not unknown; this should indicate that the application may be completed as well. Confusing log generated by CapacityScheduler Key: YARN-3197 URL: https://issues.apache.org/jira/browse/YARN-3197 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Hitesh Shah Assignee: Varun Saxena Priority: Minor Attachments: YARN-3197.001.patch 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
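One possible wording along the lines Wangda suggests above (illustrative only, not the attached patch):
{code}
LOG.info("Container " + containerStatus.getContainerId() + " of application "
    + containerStatus.getContainerId().getApplicationAttemptId().getApplicationId()
    + " completed with event " + event
    + ", but the corresponding RMContainer doesn't exist;"
    + " the application attempt has probably already completed");
{code}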
[jira] [Commented] (YARN-3194) After NM restart, RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326328#comment-14326328 ] Hadoop QA commented on YARN-3194: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699499/0001-YARN-3194.patch against trunk revision 2ecea5a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6660//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6660//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6660//console This message is automatically generated. After NM restart, RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM during registration. The registration can be treated by the RM as a new node or a reconnecting node, and the RM triggers the corresponding event accordingly. # Node added event : two scenarios can occur here ## A new node is registering with a different ip:port – NOT A PROBLEM ## An old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM # Node reconnected event : ## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted ### NM restart NOT enabled – NOT A PROBLEM ### NM restart enabled Some applications are running on this node – *Problem is here* Zero applications are running on this node – NOT A PROBLEM Since the NMContainerStatuses are not handled, the RM never gets to know about completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. This results in applications waiting indefinitely because their pending containers are not served by the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3132) RMNodeLabelsManager should remove node from node-to-label mapping when node becomes deactivated
[ https://issues.apache.org/jira/browse/YARN-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326460#comment-14326460 ] Jian He commented on YARN-3132: --- +1 RMNodeLabelsManager should remove node from node-to-label mapping when node becomes deactivated --- Key: YARN-3132 URL: https://issues.apache.org/jira/browse/YARN-3132 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3132.1.patch Using an example to explain: 1) Admin specify host1 has label=x 2) node=host1:123 registered 3) Get node-to-label mapping, return host1/host1:123 4) node=host1:123 unregistered 5) Get node-to-label mapping, still returns host1:123 Probably we should remove host1:123 when it becomes deactivated and no directly label assigned to it (directly assign means admin specify host1:123 has x instead of host1 has x). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
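For context on what the +1 covers, a rough sketch of the behaviour described in the issue above; the method and map names are illustrative, not the committed change.
{code}
// When a node is deactivated, drop its host:port entry from the node-to-labels map
// unless labels were assigned directly to that host:port rather than to the host.
void removeDeactivatedNode(NodeId nodeId,
    Map<NodeId, Set<String>> directlyAssignedLabels,
    Map<NodeId, Set<String>> nodeToLabels) {
  Set<String> direct = directlyAssignedLabels.get(nodeId);
  if (direct == null || direct.isEmpty()) {
    nodeToLabels.remove(nodeId);
  }
}
{code}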
[jira] [Commented] (YARN-2942) Aggregated Log Files should be compacted
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326368#comment-14326368 ] Robert Kanter commented on YARN-2942: - In the end, the cleaner service isn't necessary because the compacted aggregated logs are in the same place as the aggregated logs, so the {{AggregatedLogDeletionService}} takes care of this for us, without any code changes :) I'll upload an updated design doc later today. Aggregated Log Files should be compacted Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2942) Aggregated Log Files should be compacted
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326333#comment-14326333 ] Karthik Kambatla commented on YARN-2942: I have been involved in the design. I like the current design mainly because it is an optimization. In the final document, I didn't quite get what the Cleaner service would do. [~rkanter] - could you elaborate? Aggregated Log Files should be compacted Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3132) RMNodeLabelsManager should remove node from node-to-label mapping when node becomes deactivated
[ https://issues.apache.org/jira/browse/YARN-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326476#comment-14326476 ] Hudson commented on YARN-3132: -- FAILURE: Integrated in Hadoop-trunk-Commit #7146 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7146/]) YARN-3132. RMNodeLabelsManager should remove node from node-to-label mapping when node becomes deactivated. Contributed by Wangda Tan (jianhe: rev f5da5566d9c392a5df71a2dce4c2d0d50eea51ee) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/TestRMNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java * hadoop-yarn-project/CHANGES.txt RMNodeLabelsManager should remove node from node-to-label mapping when node becomes deactivated --- Key: YARN-3132 URL: https://issues.apache.org/jira/browse/YARN-3132 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.7.0 Attachments: YARN-3132.1.patch Using an example to explain: 1) Admin specify host1 has label=x 2) node=host1:123 registered 3) Get node-to-label mapping, return host1/host1:123 4) node=host1:123 unregistered 5) Get node-to-label mapping, still returns host1:123 Probably we should remove host1:123 when it becomes deactivated and no directly label assigned to it (directly assign means admin specify host1:123 has x instead of host1 has x). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326582#comment-14326582 ] Wangda Tan commented on YARN-2004: -- [~sunilg], Thanks for uploading the patch. I just read the comments from [~jlowe], and what he said all makes sense to me. For scenario #1: there are some possible solutions to tackle the priority inversion problem you mentioned, but it is more important to make the CS work with basic priority first. What you describe is more like an adjustable priority, which could be updated according to an application's waiting time or other factors. For scenario #2: it is possible that a user with a higher priority application comes in but there is no available resource in the queue; the preemption policy should reclaim resources from other users, and YARN-2009 should cover it. The general approach of the patch looks good to me. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch Based on the priority of the application, the Capacity Scheduler should be able to give preference to applications while scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with the existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3166) [Source organization] Decide detailed package structures for timeline service v2 components
[ https://issues.apache.org/jira/browse/YARN-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326538#comment-14326538 ] Li Lu commented on YARN-3166: - Hi [~sjlee0], [~rkanter], [~zjshen] and [~vinodkv], would any of you mind taking a look at the conclusion here? I'm trying to finalize our first draft of the module/package structures for timeline v2. Please feel free to let me know if you have any concerns. Thanks! [Source organization] Decide detailed package structures for timeline service v2 components --- Key: YARN-3166 URL: https://issues.apache.org/jira/browse/YARN-3166 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Opening this JIRA to track all discussions on detailed package structures for timeline service v2. This JIRA is for discussion only. For our current timeline service v2 design, the aggregator (previously called writer) implementation is in hadoop-yarn-server's: {{org.apache.hadoop.yarn.server.timelineservice.aggregator}} In YARN-2928's design, the next gen ATS reader is also a server. Maybe we want to put reader related implementations into hadoop-yarn-server's: {{org.apache.hadoop.yarn.server.timelineservice.reader}} Both readers and aggregators will expose features that may be used by YARN and other 3rd party components, such as aggregator/reader APIs. For those features, maybe we would like to expose their interfaces to hadoop-yarn-common's {{org.apache.hadoop.yarn.timelineservice}}? Let's use this JIRA as a centralized place to track all related discussions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-933) Potential InvalidStateTransitonException: Invalid event: LAUNCHED at FINAL_SAVING
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326557#comment-14326557 ] Jian He commented on YARN-933: -- lgtm, +1 Potential InvalidStateTransitonException: Invalid event: LAUNCHED at FINAL_SAVING - Key: YARN-933 URL: https://issues.apache.org/jira/browse/YARN-933 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: J.Andreina Assignee: Rohith Attachments: 0001-YARN-933.patch, 0001-YARN-933.patch, 0004-YARN-933.patch, YARN-933.3.patch, YARN-933.patch am max retries configured as 3 at client and RM side. Step 1: Install cluster with NM on 2 Machines Step 2: Make Ping using ip from RM machine to NM1 machine as successful ,But using Hostname should fail Step 3: Execute a job Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , connection loss happened. Observation : == After AppAttempt_1 has moved to failed state ,release of container for AppAttempt_1 and Application removal are successful. New AppAttempt_2 is sponed. 1. Then again retry for AppAttempt_1 happens. 2. Again RM side it is trying to launch AppAttempt_1, hence fails with InvalidStateTransitonException 3. Client got exited after AppAttempt_1 is been finished [But actually job is still running ], while the appattempts configured is 3 and rest appattempts are all sponed and running. RMLogs: == 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02, 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45 2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_01. 
Got exception: java.lang.reflect.UndeclaredThrowableException 2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495) at
[jira] [Updated] (YARN-1615) Fix typos in FSAppAttempt.java
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-1615: Attachment: YARN-1615-002.patch Attaching a patch. Fix typos in FSAppAttempt.java -- Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Attachments: YARN-1615-002.patch, YARN-1615.patch In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1615) Fix typos in FSAppAttempt.java
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-1615: Description: In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} was: In FSSchedulerApp.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} Target Version/s: 2.7.0 Affects Version/s: (was: 2.2.0) 2.6.0 Summary: Fix typos in FSAppAttempt.java (was: Fix typos in FSSchedulerApp.java) FSScheduler and AppSchedulable were merged into FSAppAttempt by YARN-2399, but the typos still exist. Fix typos in FSAppAttempt.java -- Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Attachments: YARN-1615.patch In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2942) Aggregated Log Files should be compacted
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326637#comment-14326637 ] Karthik Kambatla commented on YARN-2942: Thanks for clarifying that, Robert. Also, I don't think we should use the word compaction for this. I would prefer combined-aggregated-logs or uber-aggregated-logs. Can we split this JIRA into sub-tasks for easier reviewing: curator-ChildReaper, reader/writer, LogCombiner, and NMs calling the LogCombiner (including coordination)? Aggregated Log Files should be compacted Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3034) [Aggregator wireup] Implement RM starting its ATS writer
[ https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326630#comment-14326630 ] Li Lu commented on YARN-3034: - Hi [~Naganarasimha], thanks for the patch! I briefly looked at it, and have some questions about it. * A general question: I think there are some inconsistencies between this patch and the proposed solution for aggregators. In the original design, it is proposed that we need to organize application level aggregators into collections (either on the NMs or on the RM, supposedly implemented as AppLevelServiceManager?), and each server launches its own collection. I could not find the related logic in this patch; am I missing anything here? * I noticed that you refactored some metrics related code in the RM, moving part of it into the new RMTimelineAggregator. Maybe in this JIRA we'd like to focus on setting up the wiring for the aggregator (collections) on the RM, rather than going into the details of the timeline data? We can always resolve those problems in a separate JIRA after we set up the base infrastructure for timeline v2. * About source code organization: currently you're putting RMTimelineAggregator into the hadoop-yarn-server-resourcemanager module, under the org.apache.hadoop.yarn.server.resourcemanager.metrics package. I'm not sure if that's the place we'd like it to be in. YARN-3166 keeps track of code organization related discussions, and you're more than welcome to join the discussion there. I think for now in this JIRA, maybe we want to first focus on making the RM launch its aggregator collection (not blocked by any other JIRAs, though it may interfere with the aggregator refactoring)? [Aggregator wireup] Implement RM starting its ATS writer Key: YARN-3034 URL: https://issues.apache.org/jira/browse/YARN-3034 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3034.20150205-1.patch Per design in YARN-2928, implement resource managers starting their own ATS writers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
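To illustrate the "wiring first" suggestion above, a very rough sketch. RMTimelineAggregator is the class name from the patch under review; the configuration key, the no-arg constructor, and the placement inside ResourceManager#serviceInit are assumptions, not the final code.
{code}
// Hypothetical: have the RM create and own a timeline aggregator service when timeline
// service v2 is enabled, deferring what data gets written to later JIRAs.
@Override
protected void serviceInit(Configuration conf) throws Exception {
  if (conf.getBoolean("yarn.timeline-service.version-2.enabled", false)) {  // assumed key
    // The RM is a CompositeService, so an added child service is started and
    // stopped together with the RM itself.
    addService(new RMTimelineAggregator());  // assumed no-arg constructor
  }
  super.serviceInit(conf);
}
{code}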
[jira] [Created] (YARN-3213) Respect labels in Capacity Scheduler when computing user-limit
Wangda Tan created YARN-3213: Summary: Respect labels in Capacity Scheduler when computing user-limit Key: YARN-3213 URL: https://issues.apache.org/jira/browse/YARN-3213 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Now we can support node-labels in the Capacity Scheduler, but the user-limit computation doesn't respect node-labels enough; we should fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3222) RMNodeImpl#ReconnectNodeTransition should send scheduler events in sequential order
[ https://issues.apache.org/jira/browse/YARN-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326965#comment-14326965 ] Rohith commented on YARN-3222: -- Attaching the logs which gives more information about issue. In the below log, RM has shutdown with NPE while updating node_resource. And observe scheduler events dispatched from AsyncDispatcher in *org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.\**. Here the order is NODE_REMOVED -- NODE_RESOURCE_UPDATE -- NODE_ADDED -- NODE_LABELS_UPDATE {noformat} 2015-02-19 09:14:57,212 INFO [main] util.RackResolver (RackResolver.java:coreResolve(109)) - Resolved 127.0.0.1 to /default-rack 2015-02-19 09:14:57,213 INFO [main] resourcemanager.ResourceTrackerService (ResourceTrackerService.java:registerNodeManager(313)) - Reconnect from the node at: 127.0.0.1 2015-02-19 09:14:57,215 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeReconnectEvent.EventType: RECONNECTED 2015-02-19 09:14:57,215 INFO [main] resourcemanager.ResourceTrackerService (ResourceTrackerService.java:registerNodeManager(343)) - NodeManager from node 127.0.0.1(cmPort: 1234 httpPort: 3) registered with capability: memory:16384, vCores:16, assigned nodeId 127.0.0.1:1234 2015-02-19 09:14:57,215 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(412)) - Processing 127.0.0.1:1234 of type RECONNECTED 2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeRemovedSchedulerEvent.EventType: NODE_REMOVED 2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStartedEvent.EventType: STARTED 2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(412)) - Processing 127.0.0.1:1234 of type STARTED 2015-02-19 09:14:57,266 INFO [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(424)) - 127.0.0.1:1234 Node Transitioned from NEW to RUNNING 2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEvent.EventType: NODE_USABLE 2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeResourceUpdateSchedulerEvent.EventType: NODE_RESOURCE_UPDATE 2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeAddedSchedulerEvent.EventType: NODE_ADDED 2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEvent.EventType: NODE_USABLE 2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeLabelsUpdateSchedulerEvent.EventType: NODE_LABELS_UPDATE 2015-02-19 09:14:57,267 INFO [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1267)) - Removed node 127.0.0.1:1234 clusterResource: memory:0, vCores:0 2015-02-19 09:14:57,267 FATAL [ResourceManager Event Processor] resourcemanager.ResourceManager (ResourceManager.java:run(688)) - Error in handling event type NODE_RESOURCE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:548) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:992) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:679) at java.lang.Thread.run(Thread.java:745) 2015-02-19 09:14:57,280 INFO [ResourceManager Event Processor] resourcemanager.ResourceManager
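For clarity, the ordering implied by the issue title, shown as an illustrative dispatch sequence (not the committed fix); newNode stands for the RMNode built from the re-registration.
{code}
// Dispatch the scheduler events for a reconnected node strictly in sequence, so that
// NODE_RESOURCE_UPDATE can never reach the scheduler before the NODE_ADDED event for
// the new RMNode (which is what caused the NPE in the log above).
rmNode.context.getDispatcher().getEventHandler().handle(
    new NodeRemovedSchedulerEvent(rmNode));
rmNode.context.getDispatcher().getEventHandler().handle(
    new NodeAddedSchedulerEvent(newNode));
rmNode.context.getDispatcher().getEventHandler().handle(
    new NodeResourceUpdateSchedulerEvent(newNode, ResourceOption.newInstance(
        newNode.getTotalCapability(), RMNode.OVER_COMMIT_TIMEOUT_MILLIS_DEFAULT)));
{code}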
[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen
[ https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327028#comment-14327028 ] Zhijie Shen commented on YARN-3041: --- Cool! Thanks for your review, Sangjin! I'll go ahead to commit the patch. [Data Model] create overall data objects of TS next gen --- Key: YARN-3041 URL: https://issues.apache.org/jira/browse/YARN-3041 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.5.patch, YARN-3041.preliminary.001.patch Per design in YARN-2928, create the ATS entity and events API. Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, flow, flow run, YARN app, ...). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326969#comment-14326969 ] Rohith commented on YARN-3197: -- bq. Do you see any other info logs coming for the same container? No other information about the container; only the above log message is printed. Confusing log generated by CapacityScheduler Key: YARN-3197 URL: https://issues.apache.org/jira/browse/YARN-3197 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Hitesh Shah Assignee: Varun Saxena Priority: Minor Attachments: YARN-3197.001.patch 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326980#comment-14326980 ] Rohith commented on YARN-3197: -- bq. I think INFO could be fine since it will be at most once for each container. I agree this log message appears at most once for each container. But IIUC, the above log message would not help to analyze any issue in the cluster; rather, it is only informational. It would appear because the NodeManager may be delayed in identifying that a container has finished and sending its status. Consider NM restart: the NM recovers all the containers and sends all the container statuses (running and completed) while registering. But the application would already have completed, and the scheduler prints the above message, which is not really required; it just fills up the log files. Maybe the above scenario can be considered for changing the log level to DEBUG. Confusing log generated by CapacityScheduler Key: YARN-3197 URL: https://issues.apache.org/jira/browse/YARN-3197 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Hitesh Shah Assignee: Varun Saxena Priority: Minor Attachments: YARN-3197.001.patch 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:39,968 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... 2015-02-12 20:35:40,960 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
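If the message is kept, demoting it to DEBUG as suggested could look like the following (illustrative wording only, not the patch):
{code}
if (LOG.isDebugEnabled()) {
  LOG.debug("Container " + containerStatus.getContainerId()
      + " completed with event " + event
      + ", but the corresponding application/attempt has already finished");
}
{code}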
[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen
[ https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327019#comment-14327019 ] Sangjin Lee commented on YARN-3041: --- LGTM. Thanks for reflecting the latest feedback! I agree with your points for the most part. The update of the design doc is long overdue. I'll try to update the document to reflect all the changes that have taken place so far. We'll file more JIRAs if we need to adjust/update the data model as the work progresses. [Data Model] create overall data objects of TS next gen --- Key: YARN-3041 URL: https://issues.apache.org/jira/browse/YARN-3041 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.5.patch, YARN-3041.preliminary.001.patch Per design in YARN-2928, create the ATS entity and events API. Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, flow, flow run, YARN app, ...). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3040) [Data Model] Implement client-side API for handling flows
[ https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327034#comment-14327034 ] Naganarasimha G R commented on YARN-3040: - Thanks for the briefing, [~rkanter]; my queries and comments are as follows: bq. I think the Entities (YARN-3041) are mainly for writing/reading to/from the ATS store. Most of the information stored in those Entities are not needed by the user when submitting a job. All the user really needs to set is the IDs, and some of these we can make optional or determine automatically (e.g. it's obvious which cluster it's running on) Yes, I agree that Flow, Cluster and Flow run are not required for submitting a job, and hence if we are only passing the entity IDs then tags should be sufficient. But the concern I had was that, based on the design doc section 7 (out of scope), point 1, I am under the assumption that posting of entities to ATS v2 can be done only by the RM, NM and AM, and the client will not be able to post Flow, Flow run and Cluster entities explicitly. Hence I wanted to know the approach for clients to post Flow, Flow run and Cluster entities. And with respect to cluster info, I remember Vrushali mentioning different clusters, like a production and a test cluster, which they wanted to capture explicitly. bq. 100 characters per tag seems like it should be enough; if not, we can maybe increase this limit? It is marked as @Evolving If we are planning to pass entity IDs to map the application hierarchy, then I feel 100 chars per tag should be sufficient; how about making it configurable in case more information needs to be stored per tag? bq. For example, setFlowId(String id) would simply set the tag Yes, I agree that these are not first-class YARN concepts, hence, like you mentioned, YARN applications can take care of simplifying it. +1 for this approach. [Data Model] Implement client-side API for handling flows - Key: YARN-3040 URL: https://issues.apache.org/jira/browse/YARN-3040 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Robert Kanter Per design in YARN-2928, implement client-side API for handling *flows*. Frameworks should be able to define and pass in all attributes of flows and flow runs to YARN, and they should be passed into ATS writers. YARN tags were discussed as a way to handle this piece of information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
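A hedged sketch of the tag-based approach discussed in this thread: only getApplicationTags/setApplicationTags on ApplicationSubmissionContext are existing APIs, and the tag prefixes are made-up placeholders that a client-facing helper such as setFlowId() would hide from users.
{code}
// Illustrative only: encode flow information as YARN application tags at submission time.
void addFlowTags(ApplicationSubmissionContext ctx,
    String flowName, String flowVersion, long flowRunId) {
  Set<String> tags = new HashSet<String>(ctx.getApplicationTags());
  tags.add("TIMELINE_FLOW_NAME_TAG:" + flowName);        // assumed prefix
  tags.add("TIMELINE_FLOW_VERSION_TAG:" + flowVersion);  // assumed prefix
  tags.add("TIMELINE_FLOW_RUN_ID_TAG:" + flowRunId);     // assumed prefix
  ctx.setApplicationTags(tags);  // subject to the 100-character-per-tag limit
}
{code}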
[jira] [Updated] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy
[ https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3217: --- Priority: Major (was: Minor) Remove httpclient dependency from hadoop-yarn-server-web-proxy -- Key: YARN-3217 URL: https://issues.apache.org/jira/browse/YARN-3217 Project: Hadoop YARN Issue Type: Task Reporter: Akira AJISAKA Assignee: Brahma Reddy Battula Attachments: YARN-3217.patch Sub-task of HADOOP-10105. Remove httpclient dependency from WebAppProxyServlet.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy
[ https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3217: --- Labels: (was: newbie) Remove httpclient dependency from hadoop-yarn-server-web-proxy -- Key: YARN-3217 URL: https://issues.apache.org/jira/browse/YARN-3217 Project: Hadoop YARN Issue Type: Task Reporter: Akira AJISAKA Assignee: Brahma Reddy Battula Priority: Minor Attachments: YARN-3217.patch Sub-task of HADOOP-10105. Remove httpclient dependency from WebAppProxyServlet.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy
[ https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327083#comment-14327083 ] Hadoop QA commented on YARN-3217: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699620/YARN-3217.patch against trunk revision 946456c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy: org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServer Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6667//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6667//console This message is automatically generated. Remove httpclient dependency from hadoop-yarn-server-web-proxy -- Key: YARN-3217 URL: https://issues.apache.org/jira/browse/YARN-3217 Project: Hadoop YARN Issue Type: Task Reporter: Akira AJISAKA Assignee: Brahma Reddy Battula Attachments: YARN-3217.patch Sub-task of HADOOP-10105. Remove httpclient dependency from WebAppProxyServlet.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3034) [Aggregator wireup] Implement RM starting its ATS writer
[ https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327072#comment-14327072 ] Naganarasimha G R commented on YARN-3034: - Hi [~gtCarrera9], thanks for reviewing the patch. # _point1_ I think there is a difference in understanding about the approach here. Based on discussions with [~sjlee0] and in the design doc, {quote} _In section 4.1_ RM itself has its own ATS process to be able to write RM-specific timeline events (e.g. application lifecycle events). RM can also use YARN tags to associate events with a specific flow/run/app. The volume of data coming directly from RM should not be great. {quote} IIUC the RM has its own single ATS aggregator (service) and writer, and it differs from the NM, where the NM identifies the AppLevelAggregatorService through service discovery (YARN-3039) and posts the entities through it. # _point2_ Yes, I agree with your point here; I could have kept these modifications separate from this JIRA. I got a similar comment from Sangjin, who asked that both the old and new ATS keep working and that, based on configuration, we pick the appropriate ATS notifications from the RM. Will take care of this in the next patch. # _point3_ Well, I tried to keep it in sync with the existing ATS code (SystemMetricsPublisher). Once my queries are clarified, I thought we could discuss the package structure in YARN-3166. Currently I have the following queries: # Will the RM have its own aggregator (which I feel is correct, as we are publishing only app and app-attempt lifecycle events from the RM), or a collection of application-level aggregators (which doesn't serve any purpose when kept separately in the RM)? As per YARN-3030, a separate AppLevelAggregatorService is created per app in each NM using aux services. # If your understanding is correct (a collection of application-level aggregators for both RM and NM), then I have a few queries based on YARN-3030. #* Why are we starting AppLevelAggregatorService in the NM through aux services? We should have created it from the RM, so that the initial app lifecycle events can be posted to ATS. #* What is the scope of RMTimelineAggregator when we have AppLevelAggregatorService? # If my understanding is correct (the RM has its own ATS aggregator), I have the following queries: #* Based on the discussions we had on Feb 11, I understand that the RM and NM should not directly depend on the timeline service. But in the YARN-3030 patch, BaseAggregatorService.java is in the timeline service project, so where should the RMTimelineAggregator.java class be placed (as it extends BaseAggregatorService)? #* If we plan to handle this similar to the current approach, i.e. send the entity data through a REST client to a timeline writer service (RMTimelineAggregator), where should this service run, i.e. as part of which process, or should it be a daemon on its own? [Aggregator wireup] Implement RM starting its ATS writer Key: YARN-3034 URL: https://issues.apache.org/jira/browse/YARN-3034 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3034.20150205-1.patch Per design in YARN-2928, implement resource managers starting their own ATS writers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
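As a rough illustration of the "single RM-side aggregator" reading of the design doc quoted above, the sketch below shows one way the wiring could look. The names RMTimelineAggregator and BaseAggregatorService come from the in-progress YARN-3030/YARN-3034 patches, but the stand-in base class, the configuration key, and the serviceInit wiring are assumptions added here for illustration only, not the committed design.
{code}
import org.apache.hadoop.service.CompositeService;

// Sketch only: this BaseAggregatorService is a local stand-in for the class of
// the same name in the in-progress YARN-3030 patch, which lives in the timeline
// service module (the dependency question raised in the comment above).
abstract class BaseAggregatorService extends CompositeService {
  BaseAggregatorService(String name) {
    super(name);
  }
}

// Hypothetical single RM-side aggregator that would publish only app and
// app-attempt lifecycle entities, as opposed to the per-app aggregators
// started on the NMs through aux services.
class RMTimelineAggregator extends BaseAggregatorService {
  RMTimelineAggregator() {
    super(RMTimelineAggregator.class.getName());
  }
}

// In ResourceManager#serviceInit (sketch; the config key is an assumption):
//   if (conf.getBoolean("yarn.timeline-service.version-2.enabled", false)) {
//     addService(new RMTimelineAggregator());  // started and stopped with the RM
//   }
{code}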
[jira] [Commented] (YARN-3166) [Source organization] Decide detailed package structures for timeline service v2 components
[ https://issues.apache.org/jira/browse/YARN-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326987#comment-14326987 ] Zhijie Shen commented on YARN-3166: --- bq. I assume they will post data through our clients, am I right here? RM and NM should have the code to start the aggregator too. [Source organization] Decide detailed package structures for timeline service v2 components --- Key: YARN-3166 URL: https://issues.apache.org/jira/browse/YARN-3166 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Open this JIRA to track all discussions on detailed package structures for timeline services v2. This JIRA is for discussion only. For our current timeline service v2 design, aggregator (previously called writer) implementation is in hadoop-yarn-server's: {{org.apache.hadoop.yarn.server.timelineservice.aggregator}} In YARN-2928's design, the next gen ATS reader is also a server. Maybe we want to put reader related implementations into hadoop-yarn-server's: {{org.apache.hadoop.yarn.server.timelineservice.reader}} Both readers and aggregators will expose features that may be used by YARN and other 3rd party components, such as aggregator/reader APIs. For those features, maybe we would like to expose their interfaces to hadoop-yarn-common's {{org.apache.hadoop.yarn.timelineservice}}? Let's use this JIRA as a centralized place to track all related discussions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3041) [Data Model] create overall data objects of TS next gen
[ https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3041: -- Attachment: YARN-3041.5.patch Thanks for the feedback, Sangjin, Vrushali and Joep! We had an offline discussion, and I updated the patch according to it. Here's the summary of the major changes: 1. It is not necessary to have both Flow and FlowRun in the taxonomy, as the two concepts are mostly the same. FlowRun models an individual flow instance of a number of applications, while Flow is the generic perspective of application organization, which may nest multiple FlowRun instances. Hence, we just need FlowRun only, but rename FlowRun to Flow for simplicity. 2. To address the aggregation interval, meaning we may want to query the aggregated information for a particular time window, I changed TimelineMetric to have starttime and endtime attributes. 3. The types of the first-class citizen entities are defined centrally as enums, and the parent-child relationship is defined there too. 4. In the write path, queue is a string attribute of the application while user is a string attribute of the flow, while we still have the entities of both to put the aggregated data on at the reader side. One additional implication is that all the applications are going to be run by the same user as the parent flow. 5. The flow id is the composite user@flow_name(or id)/version/run, which will uniquely identify a flow in the storage. Joep has raised a great point about keeping the types generic to extend the data model beyond YARN, such as to Mesos. I think we can think and discuss more around it, but let's file a separate JIRA to tackle that direction. Here, as mentioned above, let's try to get the first draft of the data model in asap to unblock the aggregator and the reader work. Hopefully it makes sense to the folks here. [Data Model] create overall data objects of TS next gen --- Key: YARN-3041 URL: https://issues.apache.org/jira/browse/YARN-3041 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.5.patch, YARN-3041.preliminary.001.patch Per design in YARN-2928, create the ATS entity and events API. Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, flow, flow run, YARN app, ...). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
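To make point 5 above concrete, the sketch below composes and parses the described flow-id composite (user@flow_name/version/run). The helper class name and parsing rules are assumptions for illustration; they are not taken from the YARN-3041 patch.
{code}
// Illustrative sketch only: the class name and parsing rules are assumptions
// based on the composite format described above (user@flow_name/version/run).
public final class FlowId {
  private final String user;
  private final String flowName;
  private final String version;
  private final String run;

  public FlowId(String user, String flowName, String version, String run) {
    this.user = user;
    this.flowName = flowName;
    this.version = version;
    this.run = run;
  }

  /** Compose the unique storage key, e.g. "sjlee@word_count/v1/42". */
  public String compose() {
    return user + "@" + flowName + "/" + version + "/" + run;
  }

  /** Parse a composite flow id back into its parts (no validation; sketch only). */
  public static FlowId parse(String composite) {
    int at = composite.indexOf('@');
    String[] rest = composite.substring(at + 1).split("/", 3);
    return new FlowId(composite.substring(0, at), rest[0], rest[1], rest[2]);
  }
}
{code}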
[jira] [Updated] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy
[ https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3217: --- Attachment: YARN-3217.patch Remove httpclient dependency from hadoop-yarn-server-web-proxy -- Key: YARN-3217 URL: https://issues.apache.org/jira/browse/YARN-3217 Project: Hadoop YARN Issue Type: Task Reporter: Akira AJISAKA Assignee: Brahma Reddy Battula Priority: Minor Labels: newbie Attachments: YARN-3217.patch Sub-task of HADOOP-10105. Remove httpclient dependency from WebAppProxyServlet.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327088#comment-14327088 ] Robert Kanter commented on YARN-2423: - [~zjshen], can you take another look at the patch? TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Attachments: YARN-2423.004.patch, YARN-2423.005.patch, YARN-2423.006.patch, YARN-2423.007.patch, YARN-2423.patch, YARN-2423.patch, YARN-2423.patch TimelineClient provides the Java method to put timeline entities. It's also good to wrap over all GET APIs (both entity and domain), and deserialize the json response into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3076) YarnClient implementation to retrieve label to node mapping
[ https://issues.apache.org/jira/browse/YARN-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326991#comment-14326991 ] Hadoop QA commented on YARN-3076: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699124/YARN-3076.003.patch against trunk revision b8a14ef. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.conf.TestJobConf org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6664//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6664//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6664//console This message is automatically generated. YarnClient implementation to retrieve label to node mapping --- Key: YARN-3076 URL: https://issues.apache.org/jira/browse/YARN-3076 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-3076.001.patch, YARN-3076.002.patch, YARN-3076.003.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen
[ https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327030#comment-14327030 ] Hadoop QA commented on YARN-3041: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699612/YARN-3041.5.patch against trunk revision 946456c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. Test results: https://builds.apache.org/job/PreCommit-YARN-Build///testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build///console This message is automatically generated. [Data Model] create overall data objects of TS next gen --- Key: YARN-3041 URL: https://issues.apache.org/jira/browse/YARN-3041 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.5.patch, YARN-3041.preliminary.001.patch Per design in YARN-2928, create the ATS entity and events API. Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, flow, flow run, YARN app, ...). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326145#comment-14326145 ] Jason Lowe commented on YARN-2004: -- I'm not sure I understand the priority inversion problem and why we would be changing headroom. The headroom has no priority calculations in it. As I see it, the priority scheduling is _only_ changing the order in which applications are examined when deciding how to assign free resources in a queue. In other words, it does _not_ change the following: - the priority order between queues (i.e.: deciding which queue is first in line to obtain free resources in the cluster) - the user limits within a queue (i.e.: making an app higher priority does not implicitly give the user more room to grow within the queue than normal) - the headroom for an app within the queue (higher priority doesn't change the queue capacity or user limits) For example, a user is running app A then follows up with app B. The user decides app B is pretty important and raises its priority. This doesn't change the user limits within the queue or the headroom of those apps, but it does change which app will be assigned a spare resource if it is available. If the queue is totally full then both apps will be told their headroom is zero. One (or both) of them will need to free up some resources to make progress. When resources become available, app B will have the first chance to claim them since it was made a higher priority than A. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to applications while doing scheduling. The Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for application priority. If priority is available, then return the highest priority job. 2. Otherwise continue with the existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
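The issue description above proposes ordering applications by priority first and falling back to the existing app-ID comparison. The following is a minimal sketch of such a comparator; the Prioritizable interface and its accessors are placeholders for illustration, since the real FiCaSchedulerApp accessors may differ.
{code}
import java.util.Comparator;

// Minimal sketch of a priority-first application comparator. Prioritizable is a
// hypothetical stand-in: getPriority() (higher value = higher priority) and
// getApplicationId() are assumptions, not the actual FiCaSchedulerApp API.
interface Prioritizable {
  int getPriority();
  long getApplicationId();
}

class PriorityFirstComparator implements Comparator<Prioritizable> {
  @Override
  public int compare(Prioritizable a, Prioritizable b) {
    // Higher priority comes first.
    int byPriority = Integer.compare(b.getPriority(), a.getPriority());
    if (byPriority != 0) {
      return byPriority;
    }
    // Fall back to the existing ordering: lower (older) application id first.
    return Long.compare(a.getApplicationId(), b.getApplicationId());
  }
}
{code}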
[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.
[ https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325910#comment-14325910 ] Hudson commented on YARN-3207: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #99 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/99/]) YARN-3207. Secondary filter matches entites which do not have the key (xgong: rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java secondary filter matches entites which do not have the key being filtered for. -- Key: YARN-3207 URL: https://issues.apache.org/jira/browse/YARN-3207 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Prakash Ramachandran Assignee: Zhijie Shen Attachments: YARN-3207.1.patch in the leveldb implementation of the TimelineStore the secondary filter matches entities where the key being searched for is not present. ex query from tez ui http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar will match and return the entity even though there is no entity with otherinfo.foo defined. the issue seems to be in {code:title=LeveldbTimelineStore:675} if (vs != null && !vs.contains(filter.getValue())) { filterPassed = false; break; } {code} this should be IMHO vs == null || !vs.contains(filter.getValue()) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
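For clarity, the suggested fix from the report above can be read as the following small helper: a missing key must fail the filter rather than silently pass it. The extracted method and its parameter types are an illustration only, not the shape of the committed LeveldbTimelineStore change.
{code}
import java.util.Set;

// Sketch only: the suggested secondary-filter check extracted into a helper.
// vs is the set of values stored for the filtered field and is null when the
// entity does not have that field at all.
final class SecondaryFilterCheck {
  static boolean passes(Set<Object> vs, Object filterValue) {
    // Fix suggested in the report: null (missing key) is treated the same as
    // "value not present", so the entity fails the filter.
    return vs != null && vs.contains(filterValue);
  }
}
{code}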
[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.
[ https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325936#comment-14325936 ] Hudson commented on YARN-3207: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2040 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2040/]) YARN-3207. Secondary filter matches entites which do not have the key (xgong: rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java secondary filter matches entites which do not have the key being filtered for. -- Key: YARN-3207 URL: https://issues.apache.org/jira/browse/YARN-3207 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Prakash Ramachandran Assignee: Zhijie Shen Attachments: YARN-3207.1.patch in the leveldb implementation of the TimelineStore the secondary filter matches entities where the key being searched for is not present. ex query from tez ui http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar will match and return the entity even though there is no entity with otherinfo.foo defined. the issue seems to be in {code:title=LeveldbTimelineStore:675} if (vs != null && !vs.contains(filter.getValue())) { filterPassed = false; break; } {code} this should be IMHO vs == null || !vs.contains(filter.getValue()) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.
[ https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326030#comment-14326030 ] Hudson commented on YARN-3207: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2059 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2059/]) YARN-3207. Secondary filter matches entites which do not have the key (xgong: rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java secondary filter matches entites which do not have the key being filtered for. -- Key: YARN-3207 URL: https://issues.apache.org/jira/browse/YARN-3207 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Prakash Ramachandran Assignee: Zhijie Shen Attachments: YARN-3207.1.patch in the leveldb implementation of the TimelineStore the secondary filter matches entities where the key being searched for is not present. ex query from tez ui http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar will match and return the entity even though there is no entity with otherinfo.foo defined. the issue seems to be in {code:title=LeveldbTimelineStore:675} if (vs != null && !vs.contains(filter.getValue())) { filterPassed = false; break; } {code} this should be IMHO vs == null || !vs.contains(filter.getValue()) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1916) Leveldb timeline store applies secondary filters incorrectly
[ https://issues.apache.org/jira/browse/YARN-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326053#comment-14326053 ] Hadoop QA commented on YARN-1916: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639313/YARN-1916.1.patch against trunk revision 2ecea5a. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6659//console This message is automatically generated. Leveldb timeline store applies secondary filters incorrectly Key: YARN-1916 URL: https://issues.apache.org/jira/browse/YARN-1916 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1916.1.patch When applying a secondary filter (fieldname:fieldvalue) in a get entities query, LeveldbTimelineStore retrieves entities that do not have the specified fieldname, in addition to correctly retrieving entities that have the fieldname with the specified fieldvalue. It should not return entities that do not have the fieldname. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.
[ https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325787#comment-14325787 ] Hudson commented on YARN-3207: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #108 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/108/]) YARN-3207. Secondary filter matches entites which do not have the key (xgong: rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java secondary filter matches entites which do not have the key being filtered for. -- Key: YARN-3207 URL: https://issues.apache.org/jira/browse/YARN-3207 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Prakash Ramachandran Assignee: Zhijie Shen Attachments: YARN-3207.1.patch in the leveldb implementation of the TimelineStore the secondary filter matches entities where the key being searched for is not present. ex query from tez ui http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar will match and return the entity even though there is no entity with otherinfo.foo defined. the issue seems to be in {code:title=LeveldbTimelineStore:675} if (vs != null && !vs.contains(filter.getValue())) { filterPassed = false; break; } {code} this should be IMHO vs == null || !vs.contains(filter.getValue()) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.
[ https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325997#comment-14325997 ] Hudson commented on YARN-3207: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #109 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/109/]) YARN-3207. Secondary filter matches entites which do not have the key (xgong: rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * hadoop-yarn-project/CHANGES.txt secondary filter matches entites which do not have the key being filtered for. -- Key: YARN-3207 URL: https://issues.apache.org/jira/browse/YARN-3207 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Prakash Ramachandran Assignee: Zhijie Shen Attachments: YARN-3207.1.patch in the leveldb implementation of the TimelineStore the secondary filter matches entities where the key being searched for is not present. ex query from tez ui http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar will match and return the entity even though there is no entity with otherinfo.foo defined. the issue seems to be in {code:title=LeveldbTimelineStore:675} if (vs != null && !vs.contains(filter.getValue())) { filterPassed = false; break; } {code} this should be IMHO vs == null || !vs.contains(filter.getValue()) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3136: -- Attachment: 0003-YARN-3136.patch Yes [~jlowe], it's good to keep backward compatibility. bq. can be overridden in derived schedulers A new method named *getSchedulerApplication* can be added to AbstractYarnScheduler; it can take the lock by default when accessing the application object from the *applications* map. Later, in CS or other schedulers, we can override it to remove the lock. I attached a patch along these lines. Please see whether it matches what you had in mind. getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
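A rough sketch of the accessor idea described in the comment above follows: a base-class method that takes the lock by default and can be overridden lock-free in schedulers where that is safe. The class and field names are simplified assumptions, not the contents of 0003-YARN-3136.patch.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the override-the-lock idea; names, types, and locking
// granularity are assumptions for illustration only.
class AbstractYarnSchedulerSketch {
  protected final Map<String, Object> applications = new ConcurrentHashMap<>();

  // Default implementation: hold the scheduler lock while reading the map.
  protected synchronized Object getSchedulerApplication(String appId) {
    return applications.get(appId);
  }
}

class CapacitySchedulerSketch extends AbstractYarnSchedulerSketch {
  // A scheduler that knows the lock is unnecessary for this read (the backing
  // map is concurrent) can override the accessor and serve it lock-free.
  @Override
  protected Object getSchedulerApplication(String appId) {
    return applications.get(appId);
  }
}
{code}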
[jira] [Updated] (YARN-914) Support graceful decommission of nodemanager
[ https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-914: Attachment: GracefullyDecommissionofNodeManagerv3.pdf Updated the proposal to incorporate most comments above, including: the AM notification mechanism, naming, UI changes, etc. In addition, added some details on the core state transitions of the RMNode state machine. Will break this down into sub-JIRAs and start the work if there are no further comments on significant issues. Support graceful decommission of nodemanager Key: YARN-914 URL: https://issues.apache.org/jira/browse/YARN-914 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Luke Lu Assignee: Junping Du Attachments: Gracefully Decommission of NodeManager (v1).pdf, Gracefully Decommission of NodeManager (v2).pdf, GracefullyDecommissionofNodeManagerv3.pdf When NMs are decommissioned for non-fault reasons (capacity change etc.), it's desirable to minimize the impact on running applications. Currently if a NM is decommissioned, all running containers on the NM need to be rescheduled on other NMs. Furthermore, for finished map tasks, if their map output is not fetched by the reducers of the job, these map tasks will need to be rerun as well. We propose to introduce a mechanism to optionally gracefully decommission a node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3214) Adding non-exclusive node labels
Wangda Tan created YARN-3214: Summary: Adding non-exclusive node labels Key: YARN-3214 URL: https://issues.apache.org/jira/browse/YARN-3214 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Currently node labels partition the cluster into sub-clusters, so resources cannot be shared between the partitions. With the current implementation of node labels we cannot use the cluster optimally, and the throughput of the cluster will suffer. We are proposing adding non-exclusive node labels: 1. Labeled apps get preference on labeled nodes. 2. If there is no ask for labeled resources, we can assign those nodes to non-labeled apps. 3. If there is any future ask for those resources, we will preempt the non-labeled apps and give the nodes back to labeled apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3214) Add non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3214: - Summary: Add non-exclusive node labels (was: Adding non-exclusive node labels) Add non-exclusive node labels -- Key: YARN-3214 URL: https://issues.apache.org/jira/browse/YARN-3214 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Currently node labels partition the cluster into sub-clusters, so resources cannot be shared between the partitions. With the current implementation of node labels we cannot use the cluster optimally, and the throughput of the cluster will suffer. We are proposing adding non-exclusive node labels: 1. Labeled apps get preference on labeled nodes. 2. If there is no ask for labeled resources, we can assign those nodes to non-labeled apps. 3. If there is any future ask for those resources, we will preempt the non-labeled apps and give the nodes back to labeled apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3215) Respect labels in CapacityScheduler when computing headroom
Wangda Tan created YARN-3215: Summary: Respect labels in CapacityScheduler when computing headroom Key: YARN-3215 URL: https://issues.apache.org/jira/browse/YARN-3215 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
Wangda Tan created YARN-3216: Summary: Max-AM-Resource-Percentage should respect node labels Key: YARN-3216 URL: https://issues.apache.org/jira/browse/YARN-3216 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1615) Fix typos in FSAppAttempt.java
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326700#comment-14326700 ] Hadoop QA commented on YARN-1615: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699540/YARN-1615-002.patch against trunk revision 2aa9979. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6661//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6661//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6661//console This message is automatically generated. Fix typos in FSAppAttempt.java -- Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Attachments: YARN-1615-002.patch, YARN-1615.patch In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3040) [Data Model] Implement client-side API for handling flows
[ https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326716#comment-14326716 ] Robert Kanter commented on YARN-3040: - [~Naganarasimha] # I think the Entities (YARN-3041) are mainly for writing/reading to/from the ATS store. Most of the information stored in those Entities are not needed by the user when submitting a job. All the user really needs to set is the IDs, and some of these we can make optional or determine automatically (e.g. it's obvious which cluster it's running on) # 100 characters per tag seems like it should be enough; if not, we can maybe increase this limit? It is marked as {{@Evolving}} # Like other properties, we can add a method to JobClient or one of those classes that sets the property. For example, {{setFlowId(String id)}} would simply set the tag Flows and related constructs don't currently exist in YARN. Unless we add these as first-class concepts to the rest of YARN outside of the ATS (e.g. instead of only being able to submit YARN applications, you can also submit YARN flows; though this is looking more like Oozie...), I think tags are the only way to track this information. [Data Model] Implement client-side API for handling flows - Key: YARN-3040 URL: https://issues.apache.org/jira/browse/YARN-3040 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Robert Kanter Per design in YARN-2928, implement client-side API for handling *flows*. Frameworks should be able to define and pass in all attributes of flows and flow runs to YARN, and they should be passed into ATS writers. YARN tags were discussed as a way to handle this piece of information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
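As a hedged illustration of the tag-based setFlowId(String id) idea mentioned above, the sketch below layers a flow-id setter on top of the existing YARN application-tags mechanism. The tag prefix and helper class are assumptions, not the actual YARN-3040 API, and any such tag is subject to the ~100-character tag limit discussed earlier.
{code}
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

// Sketch only: the tag prefix and helper are assumptions about how a
// client-side setFlowId(String) could be layered on application tags.
final class FlowTagHelper {
  static final String FLOW_ID_TAG_PREFIX = "TIMELINE_FLOW_ID:";

  static void setFlowId(ApplicationSubmissionContext ctx, String flowId) {
    Set<String> tags = new HashSet<>();
    if (ctx.getApplicationTags() != null) {
      tags.addAll(ctx.getApplicationTags());
    }
    // Encode the flow id as an ordinary application tag.
    tags.add(FLOW_ID_TAG_PREFIX + flowId);
    ctx.setApplicationTags(tags);
  }
}
{code}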
[jira] [Commented] (YARN-2942) Aggregated Log Files should be compacted
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326722#comment-14326722 ] Robert Kanter commented on YARN-2942: - Sure. I think Combined Aggregated Logs is more obvious than Uber Aggregated Logs; we also seem to use Uber for a few different things already. I'll update the design doc and look into splitting up the patch into a few sub-tasks. Aggregated Log Files should be compacted Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326744#comment-14326744 ] Hadoop QA commented on YARN-2942: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699570/CombinedAggregatedLogsProposal_v3.pdf against trunk revision 9a3e292. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6662//console This message is automatically generated. Aggregated Log Files should be combined --- Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CombinedAggregatedLogsProposal_v3.pdf, CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy
[ https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-3217: Labels: newbie (was: ) Remove httpclient dependency from hadoop-yarn-server-web-proxy -- Key: YARN-3217 URL: https://issues.apache.org/jira/browse/YARN-3217 Project: Hadoop YARN Issue Type: Task Reporter: Akira AJISAKA Priority: Minor Labels: newbie Sub-task of HADOOP-10105. Remove httpclient dependency from WebAppProxyServlet.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1615) Fix typos in description about delay scheduling
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1615: - Summary: Fix typos in description about delay scheduling (was: Fix typos in delay scheduler's description) Fix typos in description about delay scheduling --- Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Attachments: YARN-1615-002.patch, YARN-1615.patch In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326924#comment-14326924 ] Rohith commented on YARN-3194: -- Findbugs warnings are unrelated to this JIRA. These warnings will be handled as part of YARN-3204. After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM during registration. The registration can be treated by the RM as a new node or a reconnecting node, and the RM triggers the corresponding event based on the node-added or node-reconnected state. # Node added event: here 2 scenarios can occur ## New node is registering with a different ip:port – NOT A PROBLEM ## Old node is re-registering because of a RESYNC command from the RM when the RM restarts – NOT A PROBLEM # Node reconnected event: ## Existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM is not restarted ### NM RESTART NOT enabled – NOT A PROBLEM ### NM RESTART is enabled Some applications are running on this node – *Problem is here* Zero applications are running on this node – NOT A PROBLEM Since the NMContainerStatuses are not handled, the RM never gets to know about completedContainer and never releases the resources held by containers. The RM will not allocate new containers for the pending resource requests until the completedContainer event is triggered. This results in applications waiting indefinitely because pending container requests are not served by the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326953#comment-14326953 ] Robert Kanter commented on YARN-2942: - {quote}We should try to avoid rereading the entire log file and rewriting again. How about we try the concat approach (with variable length blocks) first before we try the reread+rewrite?{quote} The problem here is that the aggregated log files are not in an append-friendly format (TFile). We'd have to change the file format that they're in (perhaps reusing the similar format I created in this patch), but this wouldn't be backwards compatible. {quote}The long term solution for the later really is HDFS supporting atomic append (with concurrent writers){quote} This would be very useful. Even with the design implemented by this patch, it sounds like it would eventually allow us to get rid of the ZooKeeper locks. Aggregated Log Files should be combined --- Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CombinedAggregatedLogsProposal_v3.pdf, CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
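The concat approach discussed above would rely on the existing FileSystem#concat call. A minimal sketch is shown below; the paths are placeholders, and note the caveat already mentioned: HDFS concat has traditionally required every source block except the last to be full, which is why the idea depends on variable-length block support, and concat is only implemented by HDFS (the base FileSystem throws UnsupportedOperationException).
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of the concat idea; paths are placeholders, not the real
// aggregated-log layout.
public class CombineAggregatedLogs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path combined = new Path("/tmp/logs/user/app_1/combined.log");
    Path[] perNodeLogs = new Path[] {
        new Path("/tmp/logs/user/app_1/node1.log"),
        new Path("/tmp/logs/user/app_1/node2.log")
    };
    // Appends the per-node files onto 'combined' without rereading or
    // rewriting the data (HDFS-only; sources must satisfy concat's block rules).
    fs.concat(combined, perNodeLogs);
  }
}
{code}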
[jira] [Updated] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2942: Attachment: CombinedAggregatedLogsProposal_v3.pdf I've just uploaded CombinedAggregatedLogsProposal_v3.pdf, which has some minor updates. Aggregated Log Files should be combined --- Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CombinedAggregatedLogsProposal_v3.pdf, CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3219) Use CombinedAggregatedLogFormat Writer to combine aggregated log files
Robert Kanter created YARN-3219: --- Summary: Use CombinedAggregatedLogFormat Writer to combine aggregated log files Key: YARN-3219 URL: https://issues.apache.org/jira/browse/YARN-3219 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter The NodeManager should use the {{CombinedAggregatedLogFormat}} from YARN-3218 to append its aggregated log to the per-app log file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3222) RMNodeImpl#ReconnectNodeTransition should send scheduler events in sequential order
Rohith created YARN-3222: Summary: RMNodeImpl#ReconnectNodeTransition should send scheduler events in sequential order Key: YARN-3222 URL: https://issues.apache.org/jira/browse/YARN-3222 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith Priority: Critical When a node is reconnected, RMNodeImpl#ReconnectNodeTransition notifies the scheduler with events such as node_added, node_removed or node_resource_update. These events should be sent in sequential order, i.e. the node_added event first and then the node_resource_update event. But if the node is reconnected with a different http port, the order of scheduler events is node_removed -- node_resource_update -- node_added, which causes the scheduler to not find the node, throw an NPE, and make the RM exit. The Node_Resource_update event should always be triggered via RMNodeEventType.RESOURCE_UPDATE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy
Akira AJISAKA created YARN-3217: --- Summary: Remove httpclient dependency from hadoop-yarn-server-web-proxy Key: YARN-3217 URL: https://issues.apache.org/jira/browse/YARN-3217 Project: Hadoop YARN Issue Type: Task Reporter: Akira AJISAKA Priority: Minor Sub-task of HADOOP-10105. Remove httpclient dependency from WebAppProxyServlet.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3166) [Source organization] Decide detailed package structures for timeline service v2 components
[ https://issues.apache.org/jira/browse/YARN-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326812#comment-14326812 ] Zhijie Shen commented on YARN-3166: --- Thanks for raising the package structure definition. Some comments: 1. Should we sort out the package for RM and NM code of writing actual system data through the aggregator? 2. RM and NM modules will depend on timeline service module? [Source organization] Decide detailed package structures for timeline service v2 components --- Key: YARN-3166 URL: https://issues.apache.org/jira/browse/YARN-3166 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Open this JIRA to track all discussions on detailed package structures for timeline services v2. This JIRA is for discussion only. For our current timeline service v2 design, aggregator (previously called writer) implementation is in hadoop-yarn-server's: {{org.apache.hadoop.yarn.server.timelineservice.aggregator}} In YARN-2928's design, the next gen ATS reader is also a server. Maybe we want to put reader related implementations into hadoop-yarn-server's: {{org.apache.hadoop.yarn.server.timelineservice.reader}} Both readers and aggregators will expose features that may be used by YARN and other 3rd party components, such as aggregator/reader APIs. For those features, maybe we would like to expose their interfaces to hadoop-yarn-common's {{org.apache.hadoop.yarn.timelineservice}}? Let's use this JIRA as a centralized place to track all related discussions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy
[ https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula reassigned YARN-3217: -- Assignee: Brahma Reddy Battula Remove httpclient dependency from hadoop-yarn-server-web-proxy -- Key: YARN-3217 URL: https://issues.apache.org/jira/browse/YARN-3217 Project: Hadoop YARN Issue Type: Task Reporter: Akira AJISAKA Assignee: Brahma Reddy Battula Priority: Minor Labels: newbie Sub-task of HADOOP-10105. Remove httpclient dependency from WebAppProxyServlet.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326877#comment-14326877 ] Tsuyoshi OZAWA commented on YARN-1514: -- Thanks Jian for your review! Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.7.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.5.patch, YARN-1514.5.patch, YARN-1514.6.patch, YARN-1514.7.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations, as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster fails over. Therefore, its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
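For context, a bare-bones sketch of how such a benchmark could time loadState is shown below. The store setup is elided, and the timing loop is only an assumption about the general shape of such a utility, not the tool committed under YARN-1514.
{code}
import org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore;

// Bare-bones sketch: times repeated loadState calls against an already
// initialized and started RMStateStore (e.g. a ZKRMStateStore). Setup elided.
final class LoadStateBenchmark {
  static void run(RMStateStore store, int iterations) throws Exception {
    for (int i = 0; i < iterations; i++) {
      long start = System.nanoTime();
      store.loadState();  // the operation exercised on RM failover
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      System.out.println("loadState iteration " + i + ": " + elapsedMs + " ms");
    }
  }
}
{code}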
[jira] [Commented] (YARN-1615) Fix typos in description about delay scheduling
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326898#comment-14326898 ] Akira AJISAKA commented on YARN-1615: - Thanks Tsuyoshi for review and commit. Fix typos in description about delay scheduling --- Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Fix For: 2.7.0 Attachments: YARN-1615-002.patch, YARN-1615.patch In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2942: Summary: Aggregated Log Files should be combined (was: Aggregated Log Files should be compacted) Aggregated Log Files should be combined --- Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: CompactedAggregatedLogsProposal_v1.pdf, CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, YARN-2942.003.patch Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3220) JHS should display Combined Aggregated Logs when available
Robert Kanter created YARN-3220: --- Summary: JHS should display Combined Aggregated Logs when available Key: YARN-3220 URL: https://issues.apache.org/jira/browse/YARN-3220 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter The JHS should read the Combined Aggregated Log files created by YARN-3219 when the user asks it for logs. When they are unavailable, it should fall back to the regular Aggregated Log files (the current behavior). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
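In other words, the lookup order would be roughly the following; the class and helper names here are placeholders, not the eventual JHS code:
{code}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Placeholder sketch of the intended fallback behavior.
public class CombinedLogFallbackSketch {
  void serveLogs(FileSystem fs, Path combinedLog, Path perNodeLogDir) throws Exception {
    if (fs.exists(combinedLog)) {
      renderCombined(fs, combinedLog);     // YARN-3219 combined format, preferred when present
    } else {
      renderPerNode(fs, perNodeLogDir);    // current behavior: per-node aggregated log files
    }
  }
  void renderCombined(FileSystem fs, Path p) { /* hypothetical */ }
  void renderPerNode(FileSystem fs, Path p)  { /* hypothetical */ }
}
{code}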
[jira] [Updated] (YARN-1615) Fix typos in description about delay scheduling
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1615: - Hadoop Flags: Reviewed Fix typos in description about delay scheduling --- Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Attachments: YARN-1615-002.patch, YARN-1615.patch In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3122) Metrics for container's actual CPU usage
[ https://issues.apache.org/jira/browse/YARN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326864#comment-14326864 ] Hadoop QA commented on YARN-3122: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699581/YARN-3122.002.patch against trunk revision 1c03376. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6663//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6663//console This message is automatically generated. Metrics for container's actual CPU usage Key: YARN-3122 URL: https://issues.apache.org/jira/browse/YARN-3122 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3122.001.patch, YARN-3122.002.patch, YARN-3122.prelim.patch, YARN-3122.prelim.patch It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track CPU usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
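One way to think about the metric (not necessarily how the patch computes it) is CPU time consumed per wall-clock interval, taken from the container's process tree on the NodeManager; the sampling interval below is an assumption:
{code}
import org.apache.hadoop.yarn.util.ResourceCalculatorProcessTree;

// Hedged sketch: cpu% over one sampling interval from the NM's process tree.
public class ContainerCpuSampleSketch {
  static float sampleCpuPercent(ResourceCalculatorProcessTree pTree, long intervalMs)
      throws InterruptedException {
    long cpuBefore = pTree.getCumulativeCpuTime();   // ms of CPU consumed so far
    long wallBefore = System.currentTimeMillis();
    Thread.sleep(intervalMs);
    pTree.updateProcessTree();                       // refresh the process-tree snapshot
    long cpuAfter = pTree.getCumulativeCpuTime();
    long wallAfter = System.currentTimeMillis();
    return 100f * (cpuAfter - cpuBefore) / Math.max(1, wallAfter - wallBefore);
  }
}
{code}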
[jira] [Commented] (YARN-1615) Fix typos in description about delay scheduling
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326867#comment-14326867 ] Hudson commented on YARN-1615: -- FAILURE: Integrated in Hadoop-trunk-Commit #7150 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7150/]) YARN-1615. Fix typos in delay scheduler's description. Contributed by Akira Ajisaka. (ozawa: rev b8a14efdf535d42bcafa58d380bd2c7f4d36f8cb) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/CHANGES.txt Fix typos in description about delay scheduling --- Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Fix For: 2.7.0 Attachments: YARN-1615-002.patch, YARN-1615.patch In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1615) Fix typos in FSAppAttempt.java
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326831#comment-14326831 ] Tsuyoshi OZAWA commented on YARN-1615: -- +1, committing this shortly. Fix typos in FSAppAttempt.java -- Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Attachments: YARN-1615-002.patch, YARN-1615.patch In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1615) Fix typos in delay scheduler's description
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1615: - Summary: Fix typos in delay scheduler's description (was: Fix typos in FSAppAttempt.java) Fix typos in delay scheduler's description -- Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Attachments: YARN-1615-002.patch, YARN-1615.patch In FSAppAttempt.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3218) Implement CombinedAggregatedLogFormat Reader and Writer
[ https://issues.apache.org/jira/browse/YARN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326901#comment-14326901 ] Hadoop QA commented on YARN-3218: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12699591/YARN-3218.001.patch against trunk revision b8a14ef. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6665//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6665//console This message is automatically generated. Implement CombinedAggregatedLogFormat Reader and Writer --- Key: YARN-3218 URL: https://issues.apache.org/jira/browse/YARN-3218 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: YARN-3218.001.patch We need to create a Reader and Writer for the CombinedAggregatedLogFormat -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326737#comment-14326737 ] Jason Lowe commented on YARN-2004: -- I took a closer look at the patch, and the following logic seems suspect:
{code}
+    if (a1.getApplicationPriority() != null
+        && a2.getApplicationPriority() != null
+        && !a1.getApplicationPriority().equals(a2.getApplicationPriority())) {
+      return a2.getApplicationPriority().compareTo(
+          a1.getApplicationPriority());
+    }
{code}
Priority is only considered if both applications have a priority that was set. Do we really want that behavior? I'm thinking of the scenario where all the apps in the queue have no set priority and then one of the apps has its priority set to very high or very low. That has no net effect, since all the other apps being compared in the queue don't have a priority set. A more intuitive behavior is to treat an unset priority as if the app had a default priority, so we aren't implicitly disabling priority checks in some scenarios. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to applications while doing scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check for application priority. If priority is available, then return the highest-priority job. 2. Otherwise continue with the existing logic, such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
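A minimal sketch of the behavior Jason describes, where an unset priority falls back to a default instead of short-circuiting the comparison; the default value and the ordering convention (larger integer means higher priority) are assumptions, not anything from the patch:
{code}
import org.apache.hadoop.yarn.api.records.Priority;

// Hedged sketch: never skip the priority comparison just because one side is unset.
public class PriorityCompareSketch {
  static final Priority DEFAULT_PRIORITY = Priority.newInstance(0);   // assumed default

  static int comparePriorities(Priority p1, Priority p2) {
    Priority left  = (p1 != null) ? p1 : DEFAULT_PRIORITY;   // unset -> default
    Priority right = (p2 != null) ? p2 : DEFAULT_PRIORITY;
    // Higher priority first; 0 means "fall through to app-id / timestamp order".
    return Integer.compare(right.getPriority(), left.getPriority());
  }
}
{code}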
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326822#comment-14326822 ] Hudson commented on YARN-1514: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7149 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7149/]) YARN-1514. Utility to benchmark ZKRMStateStore#loadState for RM HA. Contributed by Tsuyoshi OZAWA (jianhe: rev 1c03376300a46722d4147f5b8f37242f68dba0a2) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/test/YarnTestDriver.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml * hadoop-yarn-project/CHANGES.txt * hadoop-project/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStorePerf.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.7.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.5.patch, YARN-1514.5.patch, YARN-1514.6.patch, YARN-1514.7.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations, as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster fails over, so its execution time directly impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3221) Applications should be able to 're-register'
Sidharta Seethana created YARN-3221: --- Summary: Applications should be able to 're-register' Key: YARN-3221 URL: https://issues.apache.org/jira/browse/YARN-3221 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sidharta Seethana Today, it is not possible for YARN applications to 're-register' in failure/restart scenarios. This is especially problematic for Unmanaged applications - when restarts (normal or otherwise) or other failures necessitate the re-creation of the AMRMClient (along with a reset of the internal RPC counter). The YARN RM disallows an attempt to register again (with the same saved token) with the following exception shown below. This should be fixed. {quote} rmClient.RegisterApplicationMaster org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException:Application Master is already registered : application_1424304845861_0002 at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerApplicationMaster(ApplicationMasterService.java:264) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:90) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
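For illustration, the failure mode corresponds roughly to the client-side sequence below; the host, port, and tracking-URL values are placeholders, and the sketch assumes the same application attempt already registered before the restart:
{code}
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hedged sketch of the scenario: a restarted (e.g. unmanaged) AM rebuilds its
// client and tries to register the same attempt again. Today the second
// registerApplicationMaster call fails with
// InvalidApplicationMasterRequestException instead of being treated as a
// re-registration.
public class ReRegisterSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    AMRMClient<ContainerRequest> client = AMRMClient.createAMRMClient();
    client.init(conf);
    client.start();
    client.registerApplicationMaster("am-host", 0, "");  // placeholder values
    client.stop();
  }
}
{code}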
[jira] [Updated] (YARN-3218) Implement CombinedAggregatedLogFormat Reader and Writer
[ https://issues.apache.org/jira/browse/YARN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-3218: Attachment: YARN-3218.001.patch Implement CombinedAggregatedLogFormat Reader and Writer --- Key: YARN-3218 URL: https://issues.apache.org/jira/browse/YARN-3218 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: YARN-3218.001.patch We need to create a Reader and Writer for the CombinedAggregatedLogFormat -- This message was sent by Atlassian JIRA (v6.3.4#6332)