[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-02-18 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325576#comment-14325576
 ] 

Sunil G commented on YARN-2004:
---

Regarding the priority inversion problem, I feel we could use the approach below (a rough sketch follows the list).

1. To identify lower priority applications that have been waiting for resources over a 
long period, *lastScheduledContainer* in *SchedulerApplicationAttempt* can be 
used to get the timestamp of the last allocation. Based on a time-limit 
configuration, it is then possible to identify the apps that are starving.
2. Identify a few higher priority applications and decrease their headroom explicitly by one 
resource request of the lower priority application.
3. Reset the headroom of the higher priority applications once the lower priority 
application has got its container. 
Kindly share your thoughts on the same.
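
A very rough sketch of step 1, assuming a hypothetical helper class and a map from attempt IDs to their *lastScheduledContainer* timestamps (the class and accessor names are illustrative, not existing YARN code):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: flags lower-priority attempts whose last allocation is
// older than a configurable starvation limit (step 1 above).
class StarvationDetector {
  private final long starvationLimitMs; // would come from a new time-limit config

  StarvationDetector(long starvationLimitMs) {
    this.starvationLimitMs = starvationLimitMs;
  }

  // Returns the attempt IDs that have waited longer than the limit since their
  // last allocation; these become candidates for steps 2 and 3.
  List<String> findStarvingAttempts(Map<String, Long> lastAllocationTimeByAttempt) {
    long now = System.currentTimeMillis();
    List<String> starving = new ArrayList<String>();
    for (Map.Entry<String, Long> e : lastAllocationTimeByAttempt.entrySet()) {
      if (now - e.getValue() > starvationLimitMs) {
        starving.add(e.getKey());
      }
    }
    return starving;
  }
}
{code}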

 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch


 Based on the priority of the application, the Capacity Scheduler should be able 
 to give preference to applications while doing scheduling.
 The Comparator<FiCaSchedulerApp> applicationComparator can be changed as below 
 (a sketch follows).
 1. Check for application priority. If priority is available, then return 
 the highest priority job.
 2. Otherwise continue with the existing logic, such as App ID comparison and 
 then timestamp comparison.
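
A minimal sketch of such a comparator, assuming a getPriority() accessor on FiCaSchedulerApp (the accessor and the ordering direction are assumptions, not the final patch):

{code}
// Sketch only: order by application priority when both apps carry one,
// otherwise fall back to the existing application ID comparison.
Comparator<FiCaSchedulerApp> applicationComparator =
    new Comparator<FiCaSchedulerApp>() {
      @Override
      public int compare(FiCaSchedulerApp a1, FiCaSchedulerApp a2) {
        if (a1.getPriority() != null && a2.getPriority() != null
            && !a1.getPriority().equals(a2.getPriority())) {
          // Higher priority first; the direction depends on Priority semantics.
          return a2.getPriority().compareTo(a1.getPriority());
        }
        // Existing logic: App ID comparison (timestamp comparison would follow).
        return a1.getApplicationId().compareTo(a2.getApplicationId());
      }
    };
{code}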



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2820:

Attachment: YARN-2820.003.patch

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we 
 saw the following IOException cause the RM to shut down.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
at 
 

[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325589#comment-14325589
 ] 

zhihai xu commented on YARN-2820:
-

[~ozawa], thanks for the review. Your suggestion is good.
I uploaded a new patch, YARN-2820.003.patch, which addresses your comment.
Please review it.
Thanks,
zhihai
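
For readers following along, the gist of the proposal is to retry the HDFS write/update a bounded number of times before surfacing the IOException. A hypothetical sketch only (the wrapper name, retry count, and back-off are illustrative and not taken from the actual patch):

{code}
// Hypothetical retry wrapper around the existing write; not the actual patch.
private void writeFileWithRetries(Path outputPath, byte[] data)
    throws Exception {
  int maxRetries = 3;          // would be a new configuration property
  long retryIntervalMs = 1000; // likewise configurable
  IOException lastFailure = null;
  for (int attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      writeFile(outputPath, data); // existing write that can fail with IOException
      return;                      // success, nothing more to do
    } catch (IOException e) {
      lastFailure = e;             // remember the failure and retry after a pause
      Thread.sleep(retryIntervalMs);
    }
  }
  throw lastFailure; // retries exhausted; caller raises STATE_STORE_OP_FAILED
}
{code}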

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we 
 saw the following IOException cause the RM to shut down.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at 
 

[jira] [Updated] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2820:

Description: 
Do retry in FileSystemRMStateStore for better error recovery when update/store 
failure due to IOException.
When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw 
the following IOException cause the RM to shut down.
{code}
2014-10-29 23:49:12,202 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Updating info for attempt: appattempt_1409135750325_109118_01 at: 
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01

2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...

2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...

2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...

2014-10-29 23:49:46,283 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Error updating info for attempt: appattempt_1409135750325_109118_01
java.io.IOException: Unable to close file because the last block does not have 
enough number of replicas.
2014-10-29 23:49:46,284 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Error storing/updating appAttempt: appattempt_1409135750325_109118_01
2014-10-29 23:49:46,916 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause: 
java.io.IOException: Unable to close file because the last block does not have 
enough number of replicas. 
at 
org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) 
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
 
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
 
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) 
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
at java.lang.Thread.run(Thread.java:744) 
{code}
As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
IOException in storeApplicationStateInternal.
Stack trace from TestFSRMStateStore failure:
{code}
 2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
(TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not 
started
   at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
   at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
   at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
  at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 

[jira] [Updated] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2820:

Description: 
Do retry in FileSystemRMStateStore for better error recovery when update/store 
failure due to IOException.
When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw 
the following IOException cause the RM to shut down.
{code}
2014-10-29 23:49:12,202 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Updating info for attempt: appattempt_1409135750325_109118_01 at: 
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01

2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...

2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...

2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...

2014-10-29 23:49:46,283 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Error updating info for attempt: appattempt_1409135750325_109118_01
java.io.IOException: Unable to close file because the last block does not have 
enough number of replicas.
2014-10-29 23:49:46,284 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Error storing/updating appAttempt: appattempt_1409135750325_109118_01
2014-10-29 23:49:46,916 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause: 
java.io.IOException: Unable to close file because the last block does not have 
enough number of replicas. 
at 
org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) 
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
 
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
 
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) 
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
at java.lang.Thread.run(Thread.java:744) 
{code}
As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
IOException in storeApplicationStateInternal.
Stack trace from TestFSRMStateStore failure:
{code}
 2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
(TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not 
started
   at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
   at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
   at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 

[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler

2015-02-18 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325581#comment-14325581
 ] 

Devaraj K commented on YARN-3197:
-

Thanks [~varun_saxena] for the patch and [~rohithsharma] for the comment.

{code:xml}
+  + "] of unknown application completed with event " + event);
{code}
Here 'unknown application' may not always be appropriate. Instead, can we think 
of logging something like "Unknown container " + containerStatus.getContainerId() + 
" completed with event " + event?


bq. It would be better if the log level changed to DEBUG. In NM restart, these 
messages are very huge
Do you see any other INFO logs coming for the same container? IMO, there should 
be at least one INFO log about this container status update from the NM after an 
NM restart.


 Confusing log generated by CapacityScheduler
 

 Key: YARN-3197
 URL: https://issues.apache.org/jira/browse/YARN-3197
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Hitesh Shah
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-3197.001.patch


 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.

2015-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325750#comment-14325750
 ] 

Hudson commented on YARN-3207:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #842 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/842/])
YARN-3207. Secondary filter matches entites which do not have the key (xgong: 
rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java
* hadoop-yarn-project/CHANGES.txt


 secondary filter matches entites which do not have the key being filtered for.
 --

 Key: YARN-3207
 URL: https://issues.apache.org/jira/browse/YARN-3207
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Prakash Ramachandran
Assignee: Zhijie Shen
 Attachments: YARN-3207.1.patch


 In the leveldb implementation of the TimelineStore, the secondary filter 
 matches entities where the key being searched for is not present.
 For example, this query from the Tez UI
 http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar
 will match and return the entity even though there is no entity with 
 otherinfo.foo defined.
 The issue seems to be in 
 {code:title=LeveldbTimelineStore:675}
 if (vs != null && !vs.contains(filter.getValue())) {
   filterPassed = false;
   break;
 }
 {code}
 This should be, IMHO:
 vs == null || !vs.contains(filter.getValue())
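
Applying that suggestion to the snippet above would give (a sketch of the suggested change as stated in the description, not necessarily the committed patch):

{code}
// Treat a missing key as a non-match as well, not only a present-but-different value.
if (vs == null || !vs.contains(filter.getValue())) {
  filterPassed = false;
  break;
}
{code}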



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3211) Do not use zero as the beginning number for commands for LinuxContainerExecutor

2015-02-18 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created YARN-3211:
-

 Summary: Do not use zero as the beginning number for commands for 
LinuxContainerExecutor
 Key: YARN-3211
 URL: https://issues.apache.org/jira/browse/YARN-3211
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Liang-Chi Hsieh
Priority: Minor


Currently the implementation of LinuxContainerExecutor and container-executor 
uses numbers as its commands, and the commands begin from zero 
(INITIALIZE_CONTAINER).

LinuxContainerExecutor passes the numeric command as a command-line parameter 
when it runs container-executor, and container-executor calls atoi() to parse 
the command string into an integer.

However, atoi() returns zero when it cannot parse the string into an integer. 
So if a non-numeric command is given, container-executor still accepts it and 
runs the INITIALIZE_CONTAINER command.

I think this is wrong, and we should not use zero as the beginning number of the 
commands.
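
One way to express the idea, sketched with illustrative names (not the actual YARN constants), is to start the command codes at 1 so that a failed atoi() parse on the native side, which yields 0, can never collide with a valid command:

{code}
// Illustrative only: command codes starting at 1, so 0 (atoi's failure value)
// is never a valid command.
enum ContainerExecutorCommand {
  INITIALIZE_CONTAINER(1),
  LAUNCH_CONTAINER(2),
  SIGNAL_CONTAINER(3),
  DELETE_AS_USER(4);

  private final int code;

  ContainerExecutorCommand(int code) {
    this.code = code;
  }

  int getCode() {
    return code; // passed to container-executor on the command line
  }
}
{code}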



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325676#comment-14325676
 ] 

Hadoop QA commented on YARN-2820:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12699443/YARN-2820.003.patch
  against trunk revision b6fc1f3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 5 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6658//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6658//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6658//console

This message is automatically generated.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we 
 saw the following IOException cause the RM to shut down.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 

[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler

2015-02-18 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325702#comment-14325702
 ] 

Devaraj K commented on YARN-3197:
-

rmContainer could be null when SchedulerApplicationAttempt is null or when 
liveContainers doesn't have the container info. There is also the case where the 
ApplicationAttempt is running and the container has already completed (been 
removed from liveContainers). Here we cannot say 'unknown application'.

I suggested 'Unknown container' because the RM has removed this container's info 
and doesn't know about this container any more. Do you see any better message 
here?


 Confusing log generated by CapacityScheduler
 

 Key: YARN-3197
 URL: https://issues.apache.org/jira/browse/YARN-3197
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Hitesh Shah
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-3197.001.patch


 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration

2015-02-18 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325610#comment-14325610
 ] 

Rohith commented on YARN-3194:
--

Thanks [~jlowe], [~djp], and [~jianhe] for the detailed review :-)

bq. the container status processing code is almost a duplicate of the same code 
in StatusUpdateWhenHealthyTransition
Agree, this has to be refactored. The majority of the containerStatus processing 
code is the same.

bq. we don't remove containers that have completed from the launchedContainers 
map which seems wrong
I see, yes. Completed containers should be removed from launchedContainers (a 
rough sketch is at the end of this comment).

bq. I don't see why we would process container status sent during a reconnect 
differently than a regular status update from the NM
IIUC it is only to deal with NMContainerStatus and ContainerStatus, but I am 
not sure why these two were created differently. What I see is that ContainerStatus 
is a subset of NMContainerStatus; I think ContainerStatus could have been embedded 
inside NMContainerStatus. 

bq. Is below condition valid for the newly added code in 
ReconnectNodeTransition too ? 
Yes, it is applicable since we are keeping the old RMNode object.

bq. Add timeout to the test, testAppCleanupWhenNMRstarts - 
testProcessingContainerStatusesOnNMRestart ? and add more detailed comments 
about what the test is doing too ? 
Agree. 

bq. Could you add a validation that ApplicationMasterService#allocate indeed 
receives the completed container in this scenario?
Agree, I will add it.

bq. Question: does the 3072 include 1024 for the AM container and 2048 for the 
allocated container ? 
AM memory is 1024 and the additional requested container memory is 2048. In the 
test, the number of requested containers is 1, so AllocatedMB should be AM + 
requested, i.e. 1024 + 2048 = 3072.
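
A rough, hypothetical sketch of the reconnect-time handling being discussed here (handleCompletedContainer and the surrounding wiring are illustrative, not the actual patch):

{code}
// Sketch only: on NM re-registration, forward completed containers so the
// scheduler releases their resources, and drop them from launchedContainers.
for (NMContainerStatus status : reportedContainerStatuses) {
  if (status.getContainerState() == ContainerState.COMPLETE) {
    launchedContainers.remove(status.getContainerId()); // per the review comment
    handleCompletedContainer(status); // hypothetical: notify the scheduler
  }
}
{code}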

 After NM restart,completed containers are not released by RM which are sent 
 during NM registration
 --

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-yarn-3194-v1.patch


 On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM, 
 but the RM processes only ContainerState.RUNNING. If a container completed 
 while the NM was down, those containers' resources won't be released, which 
 results in applications hanging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3211) Do not use zero as the beginning number for commands for LinuxContainerExecutor

2015-02-18 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated YARN-3211:
--
Attachment: YARN-3211.patch

 Do not use zero as the beginning number for commands for 
 LinuxContainerExecutor
 ---

 Key: YARN-3211
 URL: https://issues.apache.org/jira/browse/YARN-3211
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Liang-Chi Hsieh
Priority: Minor
 Attachments: YARN-3211.patch


 Current the implementation of LinuxContainerExecutor and container-executor 
 uses some numbers as its commands. The commands begin from zero 
 (INITIALIZE_CONTAINER).
 When LinuxContainerExecutor gives the numeric command as the command line 
 parameter to run container-executor. container-executor calls atoi() to parse 
 the command string to integer.
 However, we know that atoi() will return zero when it can not parse the 
 string to integer. So if you give an non-numeric command, container-executor 
 still accepts it and runs INITIALIZE_CONTAINER command.
 I think it is wrong and we should not use zero as the beginning number of the 
 commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3199) Fair Scheduler documentation improvements

2015-02-18 Thread Gururaj Shetty (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325618#comment-14325618
 ] 

Gururaj Shetty commented on YARN-3199:
--

[~ka...@cloudera.com] your comment is been incorporated. Will be merged once 
all the docs are converted to Markdown.

 Fair Scheduler documentation improvements
 -

 Key: YARN-3199
 URL: https://issues.apache.org/jira/browse/YARN-3199
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.6.0
Reporter: Rohit Agarwal
Assignee: Gururaj Shetty
Priority: Minor
  Labels: documentation
 Attachments: YARN-3199.patch


 {{yarn.scheduler.increment-allocation-mb}} and 
 {{yarn.scheduler.increment-allocation-vcores}} are not documented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node

2015-02-18 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3194:
-
Summary: After NM restart, RM should handle NMCotainerStatuses sent by NM 
while registering if NM is Reconnected node  (was: After NM restart,completed 
containers are not released by RM which are sent during NM registration)

 After NM restart, RM should handle NMCotainerStatuses sent by NM while 
 registering if NM is Reconnected node
 

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-yarn-3194-v1.patch


 On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM, 
 but the RM processes only ContainerState.RUNNING. If a container completed 
 while the NM was down, those containers' resources won't be released, which 
 results in applications hanging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325616#comment-14325616
 ] 

Hadoop QA commented on YARN-2820:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12699439/YARN-2820.002.patch
  against trunk revision b6fc1f3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 5 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6657//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6657//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6657//console

This message is automatically generated.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we 
 saw the following IOException cause the RM to shut down.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 

[jira] [Updated] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node

2015-02-18 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3194:
-
Description: 
On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM during 
registration. The registration can be treated by the RM as a new node or a reconnecting 
node. The RM triggers the corresponding event on the basis of the node-added or 
node-reconnected state. 
# Node added event : Again, 2 scenarios can occur here 
## A new node is registering with a different ip:port – NOT A PROBLEM
## An old node is re-registering because of a RESYNC command from the RM when the RM 
restarts – NOT A PROBLEM

# Node reconnected event : 
## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM 
is not restarted 
### NM RESTART NOT enabled – NOT A PROBLEM
### NM RESTART is enabled 
 Some applications are running on this node – *Problem is here*
 Zero applications are running on this node – NOT A PROBLEM

Since the NMContainerStatuses are not handled, the RM never gets to know about 
completed containers and never releases the resources held by those containers. The RM will not 
allocate new containers for pending resource requests until the 
completedContainer event is triggered. This results in applications waiting 
indefinitely because their pending container requests are not served by the RM.


  was:
On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM during 
registration. The registration can be treated by the RM as a new node or a reconnecting 
node. The RM triggers the corresponding event on the basis of the node-added or 
node-reconnected state. 
# Node added event : Again, 2 scenarios can occur here 
## A new node is registering with a different ip:port – NOT A PROBLEM
## An old node is re-registering because of a RESYNC command from the RM when the RM 
restarts – NOT A PROBLEM

# Node reconnected event : 
## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM 
is not restarted 
### NM RESTART NOT enabled – NOT A PROBLEM
### NM RESTART is enabled 
 Some applications are running on this node – *Problem is here*
 Zero applications are running on this node – NOT A PROBLEM


 After NM restart, RM should handle NMCotainerStatuses sent by NM while 
 registering if NM is Reconnected node
 

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-yarn-3194-v1.patch


 On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM during 
 registration. The registration can be treated by the RM as a new node or 
 a reconnecting node. The RM triggers the corresponding event on the basis of the 
 node-added or node-reconnected state. 
 # Node added event : Again, 2 scenarios can occur here 
 ## A new node is registering with a different ip:port – NOT A PROBLEM
 ## An old node is re-registering because of a RESYNC command from the RM when the 
 RM restarts – NOT A PROBLEM
 # Node reconnected event : 
 ## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when 
 the RM is not restarted 
 ### NM RESTART NOT enabled – NOT A PROBLEM
 ### NM RESTART is enabled 
  Some applications are running on this node – *Problem is here*
  Zero applications are running on this node – NOT A PROBLEM
 Since the NMContainerStatuses are not handled, the RM never gets to know about 
 completed containers and never releases the resources held by those containers. The RM will not 
 allocate new containers for pending resource requests until the 
 completedContainer event is triggered. This results in applications waiting 
 indefinitely because their pending container requests are not served by the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node

2015-02-18 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3194:
-
Description: 
On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM during 
registration. The registration can be treated by the RM as a new node or a reconnecting 
node. The RM triggers the corresponding event on the basis of the node-added or 
node-reconnected state. 
# Node added event : Again, 2 scenarios can occur here 
## A new node is registering with a different ip:port – NOT A PROBLEM
## An old node is re-registering because of a RESYNC command from the RM when the RM 
restarts – NOT A PROBLEM

# Node reconnected event : 
## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM 
is not restarted 
### NM RESTART NOT enabled – NOT A PROBLEM
### NM RESTART is enabled 
 Some applications are running on this node – *Problem is here*
 Zero applications are running on this node – NOT A PROBLEM

  was:On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM, but 
the RM processes only ContainerState.RUNNING. If a container completed while the NM was 
down, those containers' resources won't be released, which results in 
applications hanging.


 After NM restart, RM should handle NMCotainerStatuses sent by NM while 
 registering if NM is Reconnected node
 

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-yarn-3194-v1.patch


 On NM restart, the NM sends all the outstanding NMContainerStatuses to the RM during 
 registration. The registration can be treated by the RM as a new node or 
 a reconnecting node. The RM triggers the corresponding event on the basis of the 
 node-added or node-reconnected state. 
 # Node added event : Again, 2 scenarios can occur here 
 ## A new node is registering with a different ip:port – NOT A PROBLEM
 ## An old node is re-registering because of a RESYNC command from the RM when the 
 RM restarts – NOT A PROBLEM
 # Node reconnected event : 
 ## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when 
 the RM is not restarted 
 ### NM RESTART NOT enabled – NOT A PROBLEM
 ### NM RESTART is enabled 
  Some applications are running on this node – *Problem is here*
  Zero applications are running on this node – NOT A PROBLEM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.

2015-02-18 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325621#comment-14325621
 ] 

zhihai xu commented on YARN-2820:
-

I checked the warning messages; all 5 of these findbugs warnings are not related 
to my change.

 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 --

 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2820.000.patch, YARN-2820.001.patch, 
 YARN-2820.002.patch, YARN-2820.003.patch


 Do retry in FileSystemRMStateStore for better error recovery when 
 update/store failure due to IOException.
 When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we 
 saw the following IOException cause the RM to shut down.
 {code}
 2014-10-29 23:49:12,202 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Updating info for attempt: appattempt_1409135750325_109118_01 at: 
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01
 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
 complete
 /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
 appattempt_1409135750325_109118_01.new.tmp retrying...
 2014-10-29 23:49:46,283 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
 Error updating info for attempt: appattempt_1409135750325_109118_01
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas.
 2014-10-29 23:49:46,284 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
 Error storing/updating appAttempt: appattempt_1409135750325_109118_01
 2014-10-29 23:49:46,916 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
 Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 java.io.IOException: Unable to close file because the last block does not 
 have enough number of replicas. 
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
  
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
 at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
  
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
 at java.lang.Thread.run(Thread.java:744) 
 {code}
 As discussed at YARN-1778, TestFSRMStateStore failure is also due to  
 IOException in storeApplicationStateInternal.
 Stack trace from TestFSRMStateStore failure:
 {code}
  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
 (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still 
 not started
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at 
 

[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler

2015-02-18 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325620#comment-14325620
 ] 

Varun Saxena commented on YARN-3197:


Hmm... but we do have the Container ID. Would it be right to say 'unknown container' 
if we are printing the Container ID? 
We do not know the Application ID, however.

 Confusing log generated by CapacityScheduler
 

 Key: YARN-3197
 URL: https://issues.apache.org/jira/browse/YARN-3197
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Hitesh Shah
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-3197.001.patch


 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state

2015-02-18 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3212:
-
Attachment: RMNodeImpl - new.png

Attaching the new state transition diagram for RMNode.

 RMNode State Transition Update with DECOMMISSIONING state
 -

 Key: YARN-3212
 URL: https://issues.apache.org/jira/browse/YARN-3212
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Junping Du
Assignee: Junping Du
 Attachments: RMNodeImpl - new.png


 As proposed in YARN-914, a new “DECOMMISSIONING” state will be added, reached 
 from the “running” state via a new “decommissioning” event. 
 This new state can transition to “decommissioned” on a Resource_Update when no 
 apps are running on the NM, when the NM reconnects after a restart, or when a 
 DECOMMISSIONED event is received (after a timeout from the CLI).
 In addition, the node can go back to “running” if the user decides to cancel the 
 previous decommission by recommissioning the same node. The reaction to other 
 events is similar to the RUNNING state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen

2015-02-18 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326178#comment-14326178
 ] 

Joep Rottinghuis commented on YARN-3041:


Agreed with [~sjlee0] that 
we should use an enum to enumerate the timeline entity types.
I'm not sure if we should use enums directly, or have
TimelineEntity.type be an interface TimelineEntityType and have an enum that 
implements that interface.
The latter is more extensible later on (there could be other enums implementing 
the interface).
On the other hand, that makes things a bit harder to enumerate over, so perhaps 
it is overkill. A sketch of the pattern is below.
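
A minimal sketch of the interface-plus-enum pattern being weighed here; the enum 
name YarnEntityType and its constants are illustrative assumptions, not the actual 
YARN-3041 API:

{code}
// Illustrative only: an interface that entity types implement, plus one enum of
// built-in YARN types. Other enums could implement the same interface later on.
interface TimelineEntityType {
  String typeName();
}

enum YarnEntityType implements TimelineEntityType {
  CLUSTER, USER, FLOW, FLOW_RUN, YARN_APPLICATION, CONTAINER;

  @Override
  public String typeName() {
    return name();
  }
}

public class EntityTypeDemo {
  public static void main(String[] args) {
    // Enumerating the built-in enum is easy; callers that only see the interface
    // cannot enumerate all implementations, which is the trade-off noted above.
    for (YarnEntityType t : YarnEntityType.values()) {
      System.out.println(t.typeName());
    }
  }
}
{code}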

 [Data Model] create overall data objects of TS next gen
 ---

 Key: YARN-3041
 URL: https://issues.apache.org/jira/browse/YARN-3041
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, 
 YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.preliminary.001.patch


 Per design in YARN-2928, create the ATS entity and events API.
 Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, 
 flow, flow run, YARN app, ...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-02-18 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326197#comment-14326197
 ] 

Sunil G commented on YARN-2004:
---

Hi Jason, thank you for sharing the thoughts.

In one way, we may not have to think about headroom and user limit. Still, I would 
like to share two scenarios.

1. Similar to MAPREDUCE-314. A job j1 is submitted with lower priority and has 
finished its map tasks; its reducers are running. Later, j2 and j3 came in and took 
over the cluster resources. If a map fails, losing some map output, there is no 
chance of j1 getting a resource until j2 and j3 release resources and do not grab 
them again. In a negative scenario, j1 will starve for much longer. This was one of 
the intentions behind temporarily pausing the demand from j2 and j3 for a while and 
sparing some resources for j1.

2. User Limit: Assume the user limit is 25%, so 4 users can each take 25% of the 
cluster. A 5th user has to wait. Assume the highest-priority app is submitted by 
the 5th user. He may not get resources until the demand from the first 4 users (for 
existing apps) is over. Do you feel this needs to be handled?

 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch


 Based on the priority of the application, Capacity Scheduler should be able 
 to give preference to application while doing scheduling.
 ComparatorFiCaSchedulerApp applicationComparator can be changed as below.   
 
 1.Check for Application priority. If priority is available, then return 
 the highest priority job.
 2.Otherwise continue with existing logic such as App ID comparison and 
 then TimeStamp comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen

2015-02-18 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326183#comment-14326183
 ] 

Joep Rottinghuis commented on YARN-3041:


Some additional thoughts:
If we have the types strongly typed, do we need to call containers 
YARN_CONTAINER and YARN_FLOW, or would we be able to capture more generic 
flows and containers with this as well?
Perhaps the framework used to run could be a property of the generic entity.

I don't see the advantage of having the user set up the proper 
relationship. Why not make that part of the constructors and have protected 
methods to set up the hierarchy correctly? Why introduce a chance for this to be 
set up incorrectly?

I think the acceptable entity types for parent-child relationships can be set up 
in the enum itself. The enum constants would simply have methods on them and can 
take constructor arguments; see the sketch below.
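
A possible shape for that, as a hedged sketch only (the enum name, constants, and 
the allowed-parent choices are assumptions, not YARN-3041 code):

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of encoding acceptable parent types inside the enum itself via
// constructor arguments and a method; all names here are made up.
enum EntityType {
  CLUSTER(),
  USER(CLUSTER),
  FLOW(USER),
  FLOW_RUN(FLOW),
  YARN_APPLICATION(FLOW_RUN),
  CONTAINER(YARN_APPLICATION);

  private final Set<EntityType> allowedParents;

  EntityType(EntityType... parents) {
    // A plain HashSet is used because an EnumSet of the enum's own type cannot
    // be created while its constants are still being constructed.
    this.allowedParents = new HashSet<>(Arrays.asList(parents));
  }

  boolean isValidParent(EntityType parent) {
    return allowedParents.contains(parent);
  }
}

public class EntityHierarchyDemo {
  public static void main(String[] args) {
    System.out.println(EntityType.YARN_APPLICATION.isValidParent(EntityType.FLOW_RUN)); // true
    System.out.println(EntityType.YARN_APPLICATION.isValidParent(EntityType.USER));     // false
  }
}
{code}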

 [Data Model] create overall data objects of TS next gen
 ---

 Key: YARN-3041
 URL: https://issues.apache.org/jira/browse/YARN-3041
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, 
 YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.preliminary.001.patch


 Per design in YARN-2928, create the ATS entity and events API.
 Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, 
 flow, flow run, YARN app, ...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node

2015-02-18 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3194:
-
Attachment: 0001-YARN-3194.patch

 After NM restart, RM should handle NMCotainerStatuses sent by NM while 
 registering if NM is Reconnected node
 

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch


 On NM restart ,NM sends all the outstanding NMContainerStatus to RM during 
 registration. The registration can be treated by RM as New node or 
 Reconnecting node. RM triggers corresponding event on the basis of node added 
 or node reconnected state. 
 # Node added event : Again here 2 scenario's can occur 
 ## New node is registering with different ip:port – NOT A PROBLEM
 ## Old node is re-registering because of RESYNC command from RM when RM 
 restart – NOT A PROBLEM
 # Node reconnected event : 
 ## Existing node is re-registering i.e RM treat it as reconnecting node when 
 RM is not restarted 
 ### NM RESTART NOT Enabled – NOT A PROBLEM
 ### NM RESTART is Enabled 
  Some applications are running on this node – *Problem is here*
  Zero applications are running on this node – NOT A PROBLEM
 Since the NMContainerStatuses are not handled, the RM never gets to know about 
 completed containers and never releases the resources held by those containers. 
 The RM will not allocate new containers for pending resource requests until the 
 completedContainer event is triggered. This results in applications waiting 
 indefinitely because their pending container requests are not served by the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-933) Potential InvalidStateTransitonException: Invalid event: LAUNCHED at FINAL_SAVING

2015-02-18 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326235#comment-14326235
 ] 

Rohith commented on YARN-933:
-

[~jianhe] kindly review the updated patch.

 Potential InvalidStateTransitonException: Invalid event: LAUNCHED at 
 FINAL_SAVING
 -

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina
Assignee: Rohith
 Attachments: 0001-YARN-933.patch, 0001-YARN-933.patch, 
 0004-YARN-933.patch, YARN-933.3.patch, YARN-933.patch


 am max retries configured as 3 at client and RM side.
 Step 1: Install cluster with NM on 2 Machines 
 Step 2: Make the ping from the RM machine to the NM1 machine succeed when using 
 the IP, but fail when using the hostname
 Step 3: Execute a job
 Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , 
 connection loss happened.
 Observation :
 ==
 After AppAttempt_1 has moved to the failed state, the release of the container for 
 AppAttempt_1 and the application removal are successful. A new AppAttempt_2 is 
 spawned.
 1. Then a retry for AppAttempt_1 happens again.
 2. Again, on the RM side it tries to launch AppAttempt_1, hence it fails with 
 InvalidStateTransitonException
 3. The client exited after AppAttempt_1 finished [but the job is actually still 
 running], while the configured number of app attempts is 3 and the remaining 
 attempts are all spawned and running.
 RMLogs:
 ==
 2013-07-17 16:22:51,013 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED
 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); 
 maxRetries=45
 2013-07-17 16:36:07,091 INFO 
 org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
 Expired:container_1373952096466_0056_01_01 Timed out after 600 secs
 2013-07-17 16:36:07,093 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED 
 to EXPIRED
 2013-07-17 16:36:07,093 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Registering appattempt_1373952096466_0056_02
 2013-07-17 16:36:07,131 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application appattempt_1373952096466_0056_01 is done. finalState=FAILED
 2013-07-17 16:36:07,131 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 Application removed - appId: application_1373952096466_0056 user: Rex 
 leaf-queue of parent: root #applications: 35
 2013-07-17 16:36:07,132 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application Submission: appattempt_1373952096466_0056_02, 
 2013-07-17 16:36:07,138 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED
 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); 
 maxRetries=45
 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); 
 maxRetries=45
 2013-07-17 16:38:56,207 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error 
 launching appattempt_1373952096466_0056_01. Got exception: 
 java.lang.reflect.UndeclaredThrowableException
 2013-07-17 16:38:56,207 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 LAUNCH_FAILED at FAILED
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
  at 
 

[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-02-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326238#comment-14326238
 ] 

Jason Lowe commented on YARN-2004:
--

For your first scenario, it can happen today without priority.  MR jobs ask for 
resources in waves -- first all the maps, then over time it ramps up reducers.  
Multiple jobs in the same queue from the same user can collide in different 
phases.  That's the whole point of the headroom calculation and reporting -- to 
allow AMs to realize this scenario is happening and react to it.  In this case 
what will happen is j1 will see its headroom is zero and start killing reducers 
to make room for the failed map task.  After killing the reducers there will be 
some free resources in the cluster (if they weren't stolen by another, 
underserved queue).  Then the question goes to who will get those resources.  
If we're using the default priority, j1 will get first crack at them due to 
FIFO priority.  If j2 or j3 were made higher priority then j1 will see that its 
headroom is _still_ zero after killing some reducers and will probably kill 
some more to try to make room.  Rinse, repeat until j1 is out of reducers to 
shoot or gets the resources it needs to run the failed map.

For the second scenario, the 5th user will _still_ be the first one to get any 
spare resources in the queue because he has the highest priority app.  Note 
that the user limit calculation does not involve comparing a user's current 
limit with other users' usage.  It's just a computation of what's available in 
the queue and what you're allowed based on the configured user limit and user 
limit factor.  So what will happen is the 5th user will continue to consume any 
free resources in the queue until either the app is satiated or the 5th user 
hits the 25% cap.  If there are no free resources then the 5th user's app will 
starve (without preemption) just like the rest until resources show up.  Again, 
higher priority just means you're first in line to get resources when they are 
freed up, and it doesn't change anything else.

We can discuss adding preemption into the mix to force higher priority apps to 
get their requested resources faster in a full queue.  However I think the 
first step is to get priority scheduling working for resources that are free in 
the queue in the non-preemption case, as that's still very useful in practice.
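
A toy numeric sketch of that point, with made-up numbers, just to show how a grant 
is bounded by the free queue resources and the per-user cap rather than by other 
users' current usage (this simplified formula is an assumption, not the Capacity 
Scheduler's actual user-limit implementation):

{code}
// Toy illustration only; numbers and the simplified formula are assumptions.
public class UserLimitSketch {
  public static void main(String[] args) {
    int queueCapacity = 100;     // total queue resource, arbitrary units
    int freeInQueue   = 10;      // whatever the other users have left unused
    double userCap    = 0.25;    // 25% per-user cap, as in the scenario above
    int user5Usage    = 0;       // the 5th user's highest-priority app holds nothing yet

    int remainingCap = (int) (queueCapacity * userCap) - user5Usage;
    int grant = Math.min(freeInQueue, remainingCap);
    System.out.println("The 5th user can be granted up to " + grant + " units right now");
  }
}
{code}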

 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch


 Based on the priority of the application, Capacity Scheduler should be able 
 to give preference to application while doing scheduling.
 ComparatorFiCaSchedulerApp applicationComparator can be changed as below.   
 
 1.Check for Application priority. If priority is available, then return 
 the highest priority job.
 2.Otherwise continue with existing logic such as App ID comparison and 
 then TimeStamp comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node

2015-02-18 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326191#comment-14326191
 ] 

Rohith commented on YARN-3194:
--

Attached the patch addressing all the above comments. Kindly review the new 
patch.

 After NM restart, RM should handle NMCotainerStatuses sent by NM while 
 registering if NM is Reconnected node
 

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch


 On NM restart ,NM sends all the outstanding NMContainerStatus to RM during 
 registration. The registration can be treated by RM as New node or 
 Reconnecting node. RM triggers corresponding event on the basis of node added 
 or node reconnected state. 
 # Node added event : Again here 2 scenario's can occur 
 ## New node is registering with different ip:port – NOT A PROBLEM
 ## Old node is re-registering because of RESYNC command from RM when RM 
 restart – NOT A PROBLEM
 # Node reconnected event : 
 ## Existing node is re-registering i.e RM treat it as reconnecting node when 
 RM is not restarted 
 ### NM RESTART NOT Enabled – NOT A PROBLEM
 ### NM RESTART is Enabled 
  Some applications are running on this node – *Problem is here*
  Zero applications are running on this node – NOT A PROBLEM
 Since the NMContainerStatuses are not handled, the RM never gets to know about 
 completed containers and never releases the resources held by those containers. 
 The RM will not allocate new containers for pending resource requests until the 
 completedContainer event is triggered. This results in applications waiting 
 indefinitely because their pending container requests are not served by the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen

2015-02-18 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326185#comment-14326185
 ] 

Joep Rottinghuis commented on YARN-3041:


I think version may have to be something more than a property on a flow. We 
need to be able to query by versions.

 [Data Model] create overall data objects of TS next gen
 ---

 Key: YARN-3041
 URL: https://issues.apache.org/jira/browse/YARN-3041
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, 
 YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.preliminary.001.patch


 Per design in YARN-2928, create the ATS entity and events API.
 Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, 
 flow, flow run, YARN app, ...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state

2015-02-18 Thread Junping Du (JIRA)
Junping Du created YARN-3212:


 Summary: RMNode State Transition Update with DECOMMISSIONING state
 Key: YARN-3212
 URL: https://issues.apache.org/jira/browse/YARN-3212
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Junping Du
Assignee: Junping Du


As proposed in YARN-914, a new “DECOMMISSIONING” state will be added, reached from 
the “running” state via a new “decommissioning” event. 
This new state can transition to “decommissioned” on a Resource_Update when no apps 
are running on the NM, when the NM reconnects after a restart, or when a 
DECOMMISSIONED event is received (after a timeout from the CLI).
In addition, the node can go back to “running” if the user decides to cancel the 
previous decommission by recommissioning the same node. The reaction to other 
events is similar to the RUNNING state.
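
For reference, a toy sketch of the transitions described above (event and state 
names are simplified assumptions; the real RMNodeImpl state machine has many more 
transitions, as shown in the attached diagram):

{code}
import java.util.EnumMap;
import java.util.Map;

// Toy model of the DECOMMISSIONING transitions described in this JIRA.
public class DecommissioningSketch {
  enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }
  enum NodeEvent { DECOMMISSIONING, DECOMMISSIONED, RECOMMISSION }

  static final Map<NodeState, Map<NodeEvent, NodeState>> TRANSITIONS =
      new EnumMap<>(NodeState.class);
  static {
    Map<NodeEvent, NodeState> fromRunning = new EnumMap<>(NodeEvent.class);
    fromRunning.put(NodeEvent.DECOMMISSIONING, NodeState.DECOMMISSIONING);
    TRANSITIONS.put(NodeState.RUNNING, fromRunning);

    Map<NodeEvent, NodeState> fromDecommissioning = new EnumMap<>(NodeEvent.class);
    // Reached on Resource_Update with no running apps, NM reconnect after
    // restart, or the DECOMMISSIONED event after the CLI timeout.
    fromDecommissioning.put(NodeEvent.DECOMMISSIONED, NodeState.DECOMMISSIONED);
    // Cancelling the decommission (recommission) goes back to RUNNING.
    fromDecommissioning.put(NodeEvent.RECOMMISSION, NodeState.RUNNING);
    TRANSITIONS.put(NodeState.DECOMMISSIONING, fromDecommissioning);
  }

  public static void main(String[] args) {
    NodeState s = NodeState.RUNNING;
    s = TRANSITIONS.get(s).get(NodeEvent.DECOMMISSIONING);
    System.out.println(s); // DECOMMISSIONING
  }
}
{code}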



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler

2015-02-18 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326344#comment-14326344
 ] 

Wangda Tan commented on YARN-3197:
--

I think it's better not to say:

{code}
LOG.info("Container [ContainerId: " + containerStatus.getContainerId()
    + "] of unknown application completed with event " + event);
{code}

Since we have the containerId within containerStatus, it's better to indicate that 
we cannot get the RMContainer because the attempt has probably already completed. I 
suggest printing both the containerId and the applicationId.

I think INFO could be fine since it will be at most once for each container.

And the logging below is also confusing:
{code}
if (application == null) {
  LOG.info("Container " + container + " of" + " unknown application "
      + appId + " completed with event " + event);
  return;
}
{code}

If the RM can get the RMContainer, the application is definitely not unknown; the 
message should indicate that the application may have completed as well.
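
One possible shape for the suggested wording, as a self-contained toy demo only 
(placeholder IDs and values; not the committed fix):

{code}
// Hypothetical message wording along the lines suggested above.
public class CompletedContainerLogDemo {
  public static void main(String[] args) {
    String containerId = "container_1234567890123_0001_01_000001";
    String appId = "application_1234567890123_0001";
    String event = "FINISHED";

    // Case 1: RMContainer lookup failed -- print both IDs and hint at the cause.
    System.out.println("Container " + containerId + " of application " + appId
        + " completed with event " + event
        + ", but the corresponding RMContainer doesn't exist;"
        + " the application attempt has likely completed already.");

    // Case 2: application lookup failed -- say it may have finished, not "unknown".
    System.out.println("Container " + containerId + " of application " + appId
        + " completed with event " + event
        + ", but the application may already have finished.");
  }
}
{code}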

 Confusing log generated by CapacityScheduler
 

 Key: YARN-3197
 URL: https://issues.apache.org/jira/browse/YARN-3197
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Hitesh Shah
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-3197.001.patch


 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326328#comment-14326328
 ] 

Hadoop QA commented on YARN-3194:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12699499/0001-YARN-3194.patch
  against trunk revision 2ecea5a.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 5 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6660//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6660//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6660//console

This message is automatically generated.

 After NM restart, RM should handle NMCotainerStatuses sent by NM while 
 registering if NM is Reconnected node
 

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch


 On NM restart ,NM sends all the outstanding NMContainerStatus to RM during 
 registration. The registration can be treated by RM as New node or 
 Reconnecting node. RM triggers corresponding event on the basis of node added 
 or node reconnected state. 
 # Node added event : Again here 2 scenario's can occur 
 ## New node is registering with different ip:port – NOT A PROBLEM
 ## Old node is re-registering because of RESYNC command from RM when RM 
 restart – NOT A PROBLEM
 # Node reconnected event : 
 ## Existing node is re-registering i.e RM treat it as reconnecting node when 
 RM is not restarted 
 ### NM RESTART NOT Enabled – NOT A PROBLEM
 ### NM RESTART is Enabled 
  Some applications are running on this node – *Problem is here*
  Zero applications are running on this node – NOT A PROBLEM
 Since the NMContainerStatuses are not handled, the RM never gets to know about 
 completed containers and never releases the resources held by those containers. 
 The RM will not allocate new containers for pending resource requests until the 
 completedContainer event is triggered. This results in applications waiting 
 indefinitely because their pending container requests are not served by the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3132) RMNodeLabelsManager should remove node from node-to-label mapping when node becomes deactivated

2015-02-18 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326460#comment-14326460
 ] 

Jian He commented on YARN-3132:
---

+1

 RMNodeLabelsManager should remove node from node-to-label mapping when node 
 becomes deactivated
 ---

 Key: YARN-3132
 URL: https://issues.apache.org/jira/browse/YARN-3132
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, client, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-3132.1.patch


 Using an example to explain:
 1) Admin specify host1 has label=x
 2) node=host1:123 registered
 3) Get node-to-label mapping, return host1/host1:123
 4) node=host1:123 unregistered
 5) Get node-to-label mapping, still returns host1:123
 Probably we should remove host1:123 when it becomes deactivated and no 
 label is directly assigned to it (directly assigned means the admin specifies 
 host1:123 has x instead of host1 has x).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2942) Aggregated Log Files should be compacted

2015-02-18 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326368#comment-14326368
 ] 

Robert Kanter commented on YARN-2942:
-

In the end, the cleaner service isn't necessary because the compacted 
aggregated logs are in the same place as the aggregated logs, so the 
{{AggregatedLogDeletionService}} takes care of this for us, without any code 
changes :)
I'll upload an updated design doc later today.

 Aggregated Log Files should be compacted
 

 Key: YARN-2942
 URL: https://issues.apache.org/jira/browse/YARN-2942
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: CompactedAggregatedLogsProposal_v1.pdf, 
 CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, 
 YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
 YARN-2942.003.patch


 Turning on log aggregation allows users to easily store container logs in 
 HDFS and subsequently view them in the YARN web UIs from a central place.  
 Currently, there is a separate log file for each Node Manager.  This can be a 
 problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
 accumulating many (possibly small) files per YARN application.  The current 
 “solution” for this problem is to configure YARN (actually the JHS) to 
 automatically delete these files after some amount of time.  
 We should improve this by compacting the per-node aggregated log files into 
 one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2942) Aggregated Log Files should be compacted

2015-02-18 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326333#comment-14326333
 ] 

Karthik Kambatla commented on YARN-2942:


I have been involved in the design. I like the current design mainly because 
it is an optimization. In the final document, I didn't quite get what the 
Cleaner service would do. [~rkanter] - could you elaborate? 


 Aggregated Log Files should be compacted
 

 Key: YARN-2942
 URL: https://issues.apache.org/jira/browse/YARN-2942
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: CompactedAggregatedLogsProposal_v1.pdf, 
 CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, 
 YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
 YARN-2942.003.patch


 Turning on log aggregation allows users to easily store container logs in 
 HDFS and subsequently view them in the YARN web UIs from a central place.  
 Currently, there is a separate log file for each Node Manager.  This can be a 
 problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
 accumulating many (possibly small) files per YARN application.  The current 
 “solution” for this problem is to configure YARN (actually the JHS) to 
 automatically delete these files after some amount of time.  
 We should improve this by compacting the per-node aggregated log files into 
 one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3132) RMNodeLabelsManager should remove node from node-to-label mapping when node becomes deactivated

2015-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326476#comment-14326476
 ] 

Hudson commented on YARN-3132:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7146 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7146/])
YARN-3132. RMNodeLabelsManager should remove node from node-to-label mapping 
when node becomes deactivated. Contributed by Wangda Tan (jianhe: rev 
f5da5566d9c392a5df71a2dce4c2d0d50eea51ee)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/TestRMNodeLabelsManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java
* hadoop-yarn-project/CHANGES.txt


 RMNodeLabelsManager should remove node from node-to-label mapping when node 
 becomes deactivated
 ---

 Key: YARN-3132
 URL: https://issues.apache.org/jira/browse/YARN-3132
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, client, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Fix For: 2.7.0

 Attachments: YARN-3132.1.patch


 Using an example to explain:
 1) Admin specify host1 has label=x
 2) node=host1:123 registered
 3) Get node-to-label mapping, return host1/host1:123
 4) node=host1:123 unregistered
 5) Get node-to-label mapping, still returns host1:123
 Probably we should remove host1:123 when it becomes deactivated and no 
 label is directly assigned to it (directly assigned means the admin specifies 
 host1:123 has x instead of host1 has x).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-02-18 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326582#comment-14326582
 ] 

Wangda Tan commented on YARN-2004:
--

[~sunilg],
Thanks for uploading the patch.

I just read the comments from [~jlowe]; I think what he said all makes sense to me. 

For scenario #1:
There are some possible solutions to tackle the priority inversion problem you just 
mentioned, but it is more important to get the CS working with basic priority first. 
What you describe is more like an adjustable priority, which could be updated 
according to an application's waiting time or other factors.

For scenario #2:
It is possible that a user with a higher-priority application comes in but there is 
no available resource in the queue; the preemption policy should reclaim resources 
from other users. YARN-2009 should cover it.

The general approach of the patch looks good to me.

 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch


 Based on the priority of the application, Capacity Scheduler should be able 
 to give preference to application while doing scheduling.
 ComparatorFiCaSchedulerApp applicationComparator can be changed as below.   
 
 1.Check for Application priority. If priority is available, then return 
 the highest priority job.
 2.Otherwise continue with existing logic such as App ID comparison and 
 then TimeStamp comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3166) [Source organization] Decide detailed package structures for timeline service v2 components

2015-02-18 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326538#comment-14326538
 ] 

Li Lu commented on YARN-3166:
-

Hi [~sjlee0], [~rkanter], [~zjshen] and [~vinodkv], would any of you mind 
taking a look at the conclusion here? I'm trying to finalize our first draft 
of the module/package structures for timeline v2 here. Please feel free to let me 
know if you have any concerns. Thanks! 

 [Source organization] Decide detailed package structures for timeline service 
 v2 components
 ---

 Key: YARN-3166
 URL: https://issues.apache.org/jira/browse/YARN-3166
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu
Assignee: Li Lu

 Open this JIRA to track all discussions on detailed package structures for 
 timeline services v2. This JIRA is for discussion only.
 For our current timeline service v2 design, aggregator (previously called 
 writer) implementation is in hadoop-yarn-server's:
 {{org.apache.hadoop.yarn.server.timelineservice.aggregator}}
 In YARN-2928's design, the next gen ATS reader is also a server. Maybe we 
 want to put reader related implementations into hadoop-yarn-server's:
 {{org.apache.hadoop.yarn.server.timelineservice.reader}}
 Both readers and aggregators will expose features that may be used by YARN 
 and other 3rd party components, such as aggregator/reader APIs. For those 
 features, maybe we would like to expose their interfaces to 
 hadoop-yarn-common's {{org.apache.hadoop.yarn.timelineservice}}? 
 Let's use this JIRA as a centralized place to track all related discussions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-933) Potential InvalidStateTransitonException: Invalid event: LAUNCHED at FINAL_SAVING

2015-02-18 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326557#comment-14326557
 ] 

Jian He commented on YARN-933:
--

lgtm, +1

 Potential InvalidStateTransitonException: Invalid event: LAUNCHED at 
 FINAL_SAVING
 -

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina
Assignee: Rohith
 Attachments: 0001-YARN-933.patch, 0001-YARN-933.patch, 
 0004-YARN-933.patch, YARN-933.3.patch, YARN-933.patch


 am max retries configured as 3 at client and RM side.
 Step 1: Install cluster with NM on 2 Machines 
 Step 2: Make the ping from the RM machine to the NM1 machine succeed when using 
 the IP, but fail when using the hostname
 Step 3: Execute a job
 Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , 
 connection loss happened.
 Observation :
 ==
 After AppAttempt_1 has moved to the failed state, the release of the container for 
 AppAttempt_1 and the application removal are successful. A new AppAttempt_2 is 
 spawned.
 1. Then a retry for AppAttempt_1 happens again.
 2. Again, on the RM side it tries to launch AppAttempt_1, hence it fails with 
 InvalidStateTransitonException
 3. The client exited after AppAttempt_1 finished [but the job is actually still 
 running], while the configured number of app attempts is 3 and the remaining 
 attempts are all spawned and running.
 RMLogs:
 ==
 2013-07-17 16:22:51,013 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED
 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); 
 maxRetries=45
 2013-07-17 16:36:07,091 INFO 
 org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
 Expired:container_1373952096466_0056_01_01 Timed out after 600 secs
 2013-07-17 16:36:07,093 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED 
 to EXPIRED
 2013-07-17 16:36:07,093 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Registering appattempt_1373952096466_0056_02
 2013-07-17 16:36:07,131 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application appattempt_1373952096466_0056_01 is done. finalState=FAILED
 2013-07-17 16:36:07,131 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 Application removed - appId: application_1373952096466_0056 user: Rex 
 leaf-queue of parent: root #applications: 35
 2013-07-17 16:36:07,132 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application Submission: appattempt_1373952096466_0056_02, 
 2013-07-17 16:36:07,138 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED
 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); 
 maxRetries=45
 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); 
 maxRetries=45
 2013-07-17 16:38:56,207 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error 
 launching appattempt_1373952096466_0056_01. Got exception: 
 java.lang.reflect.UndeclaredThrowableException
 2013-07-17 16:38:56,207 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 LAUNCH_FAILED at FAILED
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
  at 
 

[jira] [Updated] (YARN-1615) Fix typos in FSAppAttempt.java

2015-02-18 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-1615:

Attachment: YARN-1615-002.patch

Attaching a patch.

 Fix typos in FSAppAttempt.java
 --

 Key: YARN-1615
 URL: https://issues.apache.org/jira/browse/YARN-1615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation, scheduler
Affects Versions: 2.6.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Trivial
  Labels: newbie
 Attachments: YARN-1615-002.patch, YARN-1615.patch


 In FSAppAttempt.java there're 4 typos:
 {code}
* containers over rack-local or off-switch containers. To acheive this
* we first only allow node-local assigments for a given prioirty level,
* then relax the locality threshold once we've had a long enough period
* without succesfully scheduling. We measure both the number of missed
 {code}
 They should be fixed as follows:
 {code}
* containers over rack-local or off-switch containers. To achieve this
* we first only allow node-local assignments for a given priority level,
* then relax the locality threshold once we've had a long enough period
* without successfully scheduling. We measure both the number of missed
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1615) Fix typos in FSAppAttempt.java

2015-02-18 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-1615:

  Description: 
In FSAppAttempt.java there're 4 typos:
{code}
   * containers over rack-local or off-switch containers. To acheive this
   * we first only allow node-local assigments for a given prioirty level,
   * then relax the locality threshold once we've had a long enough period
   * without succesfully scheduling. We measure both the number of missed
{code}
They should be fixed as follows:
{code}
   * containers over rack-local or off-switch containers. To achieve this
   * we first only allow node-local assignments for a given priority level,
   * then relax the locality threshold once we've had a long enough period
   * without successfully scheduling. We measure both the number of missed
{code}

  was:
In FSSchedulerApp.java there're 4 typos:
{code}
   * containers over rack-local or off-switch containers. To acheive this
   * we first only allow node-local assigments for a given prioirty level,
   * then relax the locality threshold once we've had a long enough period
   * without succesfully scheduling. We measure both the number of missed
{code}
They should be fixed as follows:
{code}
   * containers over rack-local or off-switch containers. To achieve this
   * we first only allow node-local assignments for a given priority level,
   * then relax the locality threshold once we've had a long enough period
   * without successfully scheduling. We measure both the number of missed
{code}

 Target Version/s: 2.7.0
Affects Version/s: (was: 2.2.0)
   2.6.0
  Summary: Fix typos in FSAppAttempt.java  (was: Fix typos in 
FSSchedulerApp.java)

FSSchedulerApp and AppSchedulable were merged into FSAppAttempt by YARN-2399, but 
the typos still exist.

 Fix typos in FSAppAttempt.java
 --

 Key: YARN-1615
 URL: https://issues.apache.org/jira/browse/YARN-1615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation, scheduler
Affects Versions: 2.6.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Trivial
  Labels: newbie
 Attachments: YARN-1615.patch


 In FSAppAttempt.java there're 4 typos:
 {code}
* containers over rack-local or off-switch containers. To acheive this
* we first only allow node-local assigments for a given prioirty level,
* then relax the locality threshold once we've had a long enough period
* without succesfully scheduling. We measure both the number of missed
 {code}
 They should be fixed as follows:
 {code}
* containers over rack-local or off-switch containers. To achieve this
* we first only allow node-local assignments for a given priority level,
* then relax the locality threshold once we've had a long enough period
* without successfully scheduling. We measure both the number of missed
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2942) Aggregated Log Files should be compacted

2015-02-18 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326637#comment-14326637
 ] 

Karthik Kambatla commented on YARN-2942:


Thanks for clarifying that, Robert. Also, I don't think we should use the word 
compaction for this. I would prefer combined-aggregated-logs or 
uber-aggregated-logs. 

Can we split this JIRA into sub-tasks for easier reviewing: 
curator-ChildReaper, reader/writer, LogCombiner, and NMs calling the 
LogCombiner (including coordination)? 

 Aggregated Log Files should be compacted
 

 Key: YARN-2942
 URL: https://issues.apache.org/jira/browse/YARN-2942
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: CompactedAggregatedLogsProposal_v1.pdf, 
 CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, 
 YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
 YARN-2942.003.patch


 Turning on log aggregation allows users to easily store container logs in 
 HDFS and subsequently view them in the YARN web UIs from a central place.  
 Currently, there is a separate log file for each Node Manager.  This can be a 
 problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
 accumulating many (possibly small) files per YARN application.  The current 
 “solution” for this problem is to configure YARN (actually the JHS) to 
 automatically delete these files after some amount of time.  
 We should improve this by compacting the per-node aggregated log files into 
 one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Aggregator wireup] Implement RM starting its ATS writer

2015-02-18 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326630#comment-14326630
 ] 

Li Lu commented on YARN-3034:
-

Hi [~Naganarasimha], thanks for the patch! I briefly looked at it, and have 
some questions about it. 

* A general question: I think there are some inconsistencies between this patch 
and the proposed solution for aggregators. In the original design, it is 
proposed that we need to organize application-level aggregators into 
collections (either on the NMs or on the RM, supposedly implemented as 
AppLevelServiceManager?), and each server launches its own collection. I could 
not find the related logic in this patch; am I missing anything here? 

* I noticed that you refactored some metrics-related code in the RM, moving 
part of it into the new RMTimelineAggregator. Maybe in this JIRA we'd like to 
focus on setting up the wiring for the aggregator (collections) on the RM, 
rather than getting into the details of the timeline data? We can always resolve 
those problems in a separate JIRA after we set up the base infrastructure for 
timeline v2. 

* About source code organization: currently you're putting RMTimelineAggregator 
into the hadoop-yarn-server-resourcemanager module, under the package 
org.apache.hadoop.yarn.server.resourcemanager.metrics package. I'm not sure if 
that's a place we'd like it to be in. YARN-3166 keeps track of code 
organization related discussions, and you're more than welcome to join the 
discussion there. 

I think for now in this JIRA, maybe we want to first focus on making the RM 
launch its aggregator collection (not blocked by any other JIRAs, though it may 
interfere with the aggregator refactoring)?

 [Aggregator wireup] Implement RM starting its ATS writer
 

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034.20150205-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3213) Respect labels in Capacity Scheduler when computing user-limit

2015-02-18 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-3213:


 Summary: Respect labels in Capacity Scheduler when computing 
user-limit
 Key: YARN-3213
 URL: https://issues.apache.org/jira/browse/YARN-3213
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Wangda Tan
Assignee: Wangda Tan


Now we can support node-labels in Capacity Scheduler, but the user-limit computation 
doesn't fully respect node-labels; we should fix that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3222) RMNodeImpl#ReconnectNodeTransition should send scheduler events in sequential order

2015-02-18 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326965#comment-14326965
 ] 

Rohith commented on YARN-3222:
--

Attaching the logs, which give more information about the issue. In the log below, 
the RM has shut down with an NPE while updating node_resource. Observe the scheduler 
events dispatched from the AsyncDispatcher in 
*org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.\**. Here the 
order is NODE_REMOVED -- NODE_RESOURCE_UPDATE -- NODE_ADDED -- 
NODE_LABELS_UPDATE
{noformat}
2015-02-19 09:14:57,212 INFO  [main] util.RackResolver 
(RackResolver.java:coreResolve(109)) - Resolved 127.0.0.1 to /default-rack
2015-02-19 09:14:57,213 INFO  [main] resourcemanager.ResourceTrackerService 
(ResourceTrackerService.java:registerNodeManager(313)) - Reconnect from the 
node at: 127.0.0.1
2015-02-19 09:14:57,215 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeReconnectEvent.EventType:
 RECONNECTED
2015-02-19 09:14:57,215 INFO  [main] resourcemanager.ResourceTrackerService 
(ResourceTrackerService.java:registerNodeManager(343)) - NodeManager from node 
127.0.0.1(cmPort: 1234 httpPort: 3) registered with capability: memory:16384, 
vCores:16, assigned nodeId 127.0.0.1:1234
2015-02-19 09:14:57,215 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(412)) - Processing 127.0.0.1:1234 of type RECONNECTED
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeRemovedSchedulerEvent.EventType:
 NODE_REMOVED
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStartedEvent.EventType:
 STARTED
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(412)) - Processing 127.0.0.1:1234 of type STARTED
2015-02-19 09:14:57,266 INFO  [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(424)) - 127.0.0.1:1234 Node Transitioned from NEW to 
RUNNING
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEvent.EventType: 
NODE_USABLE
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeResourceUpdateSchedulerEvent.EventType:
 NODE_RESOURCE_UPDATE
2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeAddedSchedulerEvent.EventType:
 NODE_ADDED
2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEvent.EventType: 
NODE_USABLE
2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeLabelsUpdateSchedulerEvent.EventType:
 NODE_LABELS_UPDATE
2015-02-19 09:14:57,267 INFO  [ResourceManager Event Processor] 
capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1267)) - Removed 
node 127.0.0.1:1234 clusterResource: memory:0, vCores:0
2015-02-19 09:14:57,267 FATAL [ResourceManager Event Processor] 
resourcemanager.ResourceManager (ResourceManager.java:run(688)) - Error in 
handling event type NODE_RESOURCE_UPDATE to the scheduler
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:548)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:992)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1119)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:120)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:679)
at java.lang.Thread.run(Thread.java:745)
2015-02-19 09:14:57,280 INFO  [ResourceManager Event Processor] 
resourcemanager.ResourceManager 

[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen

2015-02-18 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327028#comment-14327028
 ] 

Zhijie Shen commented on YARN-3041:
---

Cool! Thanks for your review, Sangjin! I'll go ahead and commit the patch.

 [Data Model] create overall data objects of TS next gen
 ---

 Key: YARN-3041
 URL: https://issues.apache.org/jira/browse/YARN-3041
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, 
 YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.5.patch, 
 YARN-3041.preliminary.001.patch


 Per design in YARN-2928, create the ATS entity and events API.
 Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, 
 flow, flow run, YARN app, ...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler

2015-02-18 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326969#comment-14326969
 ] 

Rohith commented on YARN-3197:
--

bq. Do you see any other info logs coming for the same container?
No information about the container; only the above log message is printed.

 Confusing log generated by CapacityScheduler
 

 Key: YARN-3197
 URL: https://issues.apache.org/jira/browse/YARN-3197
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Hitesh Shah
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-3197.001.patch


 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3197) Confusing log generated by CapacityScheduler

2015-02-18 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326980#comment-14326980
 ] 

Rohith commented on YARN-3197:
--

bq. I think INFO could be fine since it will be at most once for each container.
I agree this log message appears at most once for each container. But IIUC, the above log message would not help to analyze any issue in the cluster; it is only informational. It occurs because the NodeManager may be delayed in identifying that a container has finished and in sending its status.

Consider NM restart: the NM recovers all the containers and sends all the container statuses (running and completed) while registering. But the application would already have completed, and the scheduler prints the above message, which is not really required. It just fills the log files.
Maybe the above scenario can be considered for changing the log level to DEBUG.
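
A minimal sketch of the kind of change being suggested (illustrative only, not the attached YARN-3197 patch; the message mirrors the log lines quoted below):

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Standalone illustration of demoting the "Null container completed" message
// from INFO to DEBUG so replayed container statuses (e.g. after NM restart)
// do not flood the RM log.
public class NullContainerLogSketch {
  private static final Log LOG = LogFactory.getLog(NullContainerLogSketch.class);

  void completedContainer(Object rmContainer) {
    if (rmContainer == null) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Null container completed...");
      }
      return;
    }
    // ... real completion handling would continue here ...
  }
}
{code}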

 Confusing log generated by CapacityScheduler
 

 Key: YARN-3197
 URL: https://issues.apache.org/jira/browse/YARN-3197
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Hitesh Shah
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-3197.001.patch


 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:39,968 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...
 2015-02-12 20:35:40,960 INFO  capacity.CapacityScheduler 
 (CapacityScheduler.java:completedContainer(1190)) - Null container 
 completed...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen

2015-02-18 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327019#comment-14327019
 ] 

Sangjin Lee commented on YARN-3041:
---

LGTM. Thanks for reflecting the latest feedback!

I agree with your points for the most part. The update of the design doc is 
long overdue. I'll try to update the document to reflect all the changes that 
have taken place so far.

We'll file more JIRAs if we need to adjust/update the data model as the work 
progresses.

 [Data Model] create overall data objects of TS next gen
 ---

 Key: YARN-3041
 URL: https://issues.apache.org/jira/browse/YARN-3041
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, 
 YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.5.patch, 
 YARN-3041.preliminary.001.patch


 Per design in YARN-2928, create the ATS entity and events API.
 Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, 
 flow, flow run, YARN app, ...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3040) [Data Model] Implement client-side API for handling flows

2015-02-18 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327034#comment-14327034
 ] 

Naganarasimha G R commented on YARN-3040:
-

Thanks for the briefing, [~rkanter]. My queries and comments are as follows:
bq. I think the Entities (YARN-3041) are mainly for writing/reading to/from the 
ATS store. Most of the information stored in those Entities are not needed by 
the user when submitting a job. All the user really needs to set is the IDs, 
and some of these we can make optional or determine automatically (e.g. it's 
obvious which cluster it's running on)
Yes, I agree that Flow, Cluster, and Flow run are not required for submitting a job, and hence if we only pass the entity IDs then tags should be sufficient. But my concern was based on the design doc, section 7 (out of scope), point 1: I am under the assumption that posting entities to ATSv2 can be done only by the RM, NM, and AM, and that a client will not be able to post Flow, Flow run, and Cluster entities explicitly. Hence I wanted to know the approach for clients to post Flow, Flow run, and Cluster entities. Regarding cluster info, I remember Vrushali mentioning different clusters, such as a production and a test cluster, which they wanted to capture explicitly.
bq. 100 characters per tag seems like it should be enough; if not, we can maybe 
increase this limit? It is marked as @Evolving
If we are planning to pass entity IDs to map the application hierarchy, then I feel 100 characters per tag should be sufficient. How about making it configurable in case more information needs to be stored per tag?
bq. For example, setFlowId(String id) would simply set the tag
Yes, I agree that these are not first-class YARN concepts, hence, as you mentioned, YARN applications can take care of simplifying it. +1 for this approach.


 [Data Model] Implement client-side API for handling flows
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Robert Kanter

 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy

2015-02-18 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3217:
---
Priority: Major  (was: Minor)

 Remove httpclient dependency from hadoop-yarn-server-web-proxy
 --

 Key: YARN-3217
 URL: https://issues.apache.org/jira/browse/YARN-3217
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Akira AJISAKA
Assignee: Brahma Reddy Battula
 Attachments: YARN-3217.patch


 Sub-task of HADOOP-10105. Remove httpclient dependency from 
 WebAppProxyServlet.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy

2015-02-18 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3217:
---
Labels:   (was: newbie)

 Remove httpclient dependency from hadoop-yarn-server-web-proxy
 --

 Key: YARN-3217
 URL: https://issues.apache.org/jira/browse/YARN-3217
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Akira AJISAKA
Assignee: Brahma Reddy Battula
Priority: Minor
 Attachments: YARN-3217.patch


 Sub-task of HADOOP-10105. Remove httpclient dependency from 
 WebAppProxyServlet.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327083#comment-14327083
 ] 

Hadoop QA commented on YARN-3217:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12699620/YARN-3217.patch
  against trunk revision 946456c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy:

org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServer

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6667//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6667//console

This message is automatically generated.

 Remove httpclient dependency from hadoop-yarn-server-web-proxy
 --

 Key: YARN-3217
 URL: https://issues.apache.org/jira/browse/YARN-3217
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Akira AJISAKA
Assignee: Brahma Reddy Battula
 Attachments: YARN-3217.patch


 Sub-task of HADOOP-10105. Remove httpclient dependency from 
 WebAppProxyServlet.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Aggregator wireup] Implement RM starting its ATS writer

2015-02-18 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327072#comment-14327072
 ] 

Naganarasimha G R commented on YARN-3034:
-

Hi [~gtCarrera9], thanks for reviewing the patch.
# _point1_
I think there is a difference in understanding about the approach here. Based on discussions with [~sjlee0] and on the design doc:
{quote}
_In section 4.1_
RM itself has its own ATS process to be able to write RM-specific timeline 
events (e.g. application lifecycle events). RM can also use YARN tags to 
associate events with a specific flow/run/app. The volume of data coming 
directly from RM should not be great.
{quote}
IIUC, the RM has its own single ATS aggregator (service) and writer, and it differs from the NM, where the NM, through service discovery (YARN-3039), identifies the AppLevelAggregatorService and posts the entities through it.
# _point2_
Yes, I agree with your point here; I could have kept these modifications separate from this JIRA. I got a similar comment from Sangjin, who asked that both the old and new ATS keep working and that, based on configuration, we pick the appropriate ATS notifications from the RM. Will take care of this in the next patch.
# _point3_
Well, I tried to keep it in sync with the existing ATS code (SystemMetricsPublisher). Once my queries are clarified, I thought about discussing the package structure in YARN-3166.
 
Currently I have the following queries:
# Will the RM have its own aggregator (which I feel is correct, as we are publishing only app and app-attempt lifecycle events from the RM), or a collection of application-level aggregators (having these separately in the RM doesn't serve any purpose)? As per YARN-3030, a separate AppLevelAggregatorService is created per app in each NM via aux services.
# If your understanding is correct (a collection of application-level aggregators for both RM and NM), then I have a few queries based on YARN-3030.
#* Why are we starting AppLevelAggregatorService in the NM through aux services? We should have created this from the RM, so that initial app lifecycle events can be posted to ATS.
#* What is the scope of RMTimelineAggregator when we have AppLevelAggregatorService?
# If my understanding is correct (the RM has its own ATS aggregator), I have the following queries:
#* Based on the discussions we had on Feb 11, I understand that the RM and NM should not be directly dependent on TimelineService. But in the 3030 patch, BaseAggregatorService.java is in the timeline service project, so where should the RMTimelineAggregator.java class be placed (as it extends BaseAggregatorService)?
#* If we plan to handle this similarly to the current approach, i.e. send the entity data through a REST client to a timeline writer service (RMTimelineAggregator), where should this service run, i.e. as part of which process, or should it be a daemon on its own?

 [Aggregator wireup] Implement RM starting its ATS writer
 

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034.20150205-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3166) [Source organization] Decide detailed package structures for timeline service v2 components

2015-02-18 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326987#comment-14326987
 ] 

Zhijie Shen commented on YARN-3166:
---

bq. I assume they will post data through our clients, am I right here?

RM and NM should have the code to start the aggregator too.

 [Source organization] Decide detailed package structures for timeline service 
 v2 components
 ---

 Key: YARN-3166
 URL: https://issues.apache.org/jira/browse/YARN-3166
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu
Assignee: Li Lu

 Open this JIRA to track all discussions on detailed package structures for 
 timeline services v2. This JIRA is for discussion only.
 For our current timeline service v2 design, aggregator (previously called 
 writer) implementation is in hadoop-yarn-server's:
 {{org.apache.hadoop.yarn.server.timelineservice.aggregator}}
 In YARN-2928's design, the next gen ATS reader is also a server. Maybe we 
 want to put reader related implementations into hadoop-yarn-server's:
 {{org.apache.hadoop.yarn.server.timelineservice.reader}}
 Both readers and aggregators will expose features that may be used by YARN 
 and other 3rd party components, such as aggregator/reader APIs. For those 
 features, maybe we would like to expose their interfaces to 
 hadoop-yarn-common's {{org.apache.hadoop.yarn.timelineservice}}? 
 Let's use this JIRA as a centralized place to track all related discussions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3041) [Data Model] create overall data objects of TS next gen

2015-02-18 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-3041:
--
Attachment: YARN-3041.5.patch

Thanks for the feedback, Sangjin, Vrushali and Joep! We had an offline 
discussion. I updated the patch according to it. Here's the summary of the 
major changes:

1. It is not necessary to have both Flow and FlowRun in the taxonomy, as their concepts are mostly the same. FlowRun models an individual flow instance of a number of applications, while Flow is the generic perspective of application organization, which may nest multiple FlowRun instances. Hence, we just need FlowRun only, but rename FlowRun to Flow for simplicity.

2. To address the aggregation interval, meaning we may want to query the aggregated information for a particular time window, I changed TimelineMetric to have starttime and endtime attributes.

3. The types of the first-class-citizen entities are defined centrally as enums, and the parent-child relationship is defined there too.

4. In the write path, queue is a string attribute of the application while user is a string attribute of the flow, while we still have the entities of both to hold the aggregated data at the reader side. One additional implication is that all the applications are going to be run by the same user as the parent flow.

5. The flow id is the composite user@flow_name(or id)/version/run, which will uniquely identify a flow in the storage.

Joep has raised a great point about keeping the type generic to extend the data model beyond YARN, for example to Mesos. I think we can discuss this more, but let's file a separate JIRA to tackle that direction.

As mentioned above, let's try to get the first draft of the data model in ASAP to unblock the aggregator and reader work. Hopefully it makes sense to the folks here.

 [Data Model] create overall data objects of TS next gen
 ---

 Key: YARN-3041
 URL: https://issues.apache.org/jira/browse/YARN-3041
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, 
 YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.5.patch, 
 YARN-3041.preliminary.001.patch


 Per design in YARN-2928, create the ATS entity and events API.
 Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, 
 flow, flow run, YARN app, ...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy

2015-02-18 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3217:
---
Attachment: YARN-3217.patch

 Remove httpclient dependency from hadoop-yarn-server-web-proxy
 --

 Key: YARN-3217
 URL: https://issues.apache.org/jira/browse/YARN-3217
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Akira AJISAKA
Assignee: Brahma Reddy Battula
Priority: Minor
  Labels: newbie
 Attachments: YARN-3217.patch


 Sub-task of HADOOP-10105. Remove httpclient dependency from 
 WebAppProxyServlet.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users

2015-02-18 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327088#comment-14327088
 ] 

Robert Kanter commented on YARN-2423:
-

[~zjshen], can you take another look at the patch?

 TimelineClient should wrap all GET APIs to facilitate Java users
 

 Key: YARN-2423
 URL: https://issues.apache.org/jira/browse/YARN-2423
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Robert Kanter
 Attachments: YARN-2423.004.patch, YARN-2423.005.patch, 
 YARN-2423.006.patch, YARN-2423.007.patch, YARN-2423.patch, YARN-2423.patch, 
 YARN-2423.patch


 TimelineClient provides the Java method to put timeline entities. It's also 
 good to wrap over all GET APIs (both entity and domain), and deserialize the 
 json response into Java POJO objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3076) YarnClient implementation to retrieve label to node mapping

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326991#comment-14326991
 ] 

Hadoop QA commented on YARN-3076:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12699124/YARN-3076.003.patch
  against trunk revision b8a14ef.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 5 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.conf.TestJobConf
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6664//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6664//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6664//console

This message is automatically generated.

 YarnClient implementation to retrieve label to node mapping
 ---

 Key: YARN-3076
 URL: https://issues.apache.org/jira/browse/YARN-3076
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-3076.001.patch, YARN-3076.002.patch, 
 YARN-3076.003.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3041) [Data Model] create overall data objects of TS next gen

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327030#comment-14327030
 ] 

Hadoop QA commented on YARN-3041:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12699612/YARN-3041.5.patch
  against trunk revision 946456c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build///testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build///console

This message is automatically generated.

 [Data Model] create overall data objects of TS next gen
 ---

 Key: YARN-3041
 URL: https://issues.apache.org/jira/browse/YARN-3041
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: Data_model_proposal_v2.pdf, YARN-3041.2.patch, 
 YARN-3041.3.patch, YARN-3041.4.patch, YARN-3041.5.patch, 
 YARN-3041.preliminary.001.patch


 Per design in YARN-2928, create the ATS entity and events API.
 Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, 
 flow, flow run, YARN app, ...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-02-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326145#comment-14326145
 ] 

Jason Lowe commented on YARN-2004:
--

I'm not sure I understand the priority inversion problem and why we would be 
changing headroom.  The headroom has no priority calculations in it.  As I see 
it, the priority scheduling is _only_ changing the order in which applications 
are examined when deciding how to assign free resources in a queue.  In other 
words, it does _not_ change the following:

- the priority order between queues (i.e.: deciding which queue is first in 
line to obtain free resources in the cluster)
- the user limits within a queue (i.e.: making an app higher priority does not 
implicitly give the user more room to grow within the queue than normal)
- the headroom for an app within the queue (higher priority doesn't change the 
queue capacity or user limits)

For example, a user is running app A then follows up with app B.  The user 
decides app B is pretty important and raises its priority.  This doesn't change 
the user limits within the queue or the headroom of those apps, but it does 
change which app will be assigned a spare resource if it is available.  If the 
queue is totally full then both apps will be told their headroom is zero.  One 
(or both) of them will need to free up some resources to make progress.  When 
resources become available, app B will have the first chance to claim them 
since it was made a higher priority than A.
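
For reference, a sketch of the ordering-only change being discussed (illustrative class and field names, not the attached patch): priority decides which app in the queue is examined first, and nothing else.

{code:java}
import java.util.Comparator;

// Illustrative only: compare by priority first, then fall back to the existing
// application-id (submission order) comparison. Headroom and user limits are
// untouched by this ordering.
class AppInfo {
  final int priority;        // assumed convention: higher value = higher priority
  final long applicationId;

  AppInfo(int priority, long applicationId) {
    this.priority = priority;
    this.applicationId = applicationId;
  }
}

class PriorityThenIdComparator implements Comparator<AppInfo> {
  @Override
  public int compare(AppInfo a, AppInfo b) {
    if (a.priority != b.priority) {
      return Integer.compare(b.priority, a.priority); // higher priority first
    }
    return Long.compare(a.applicationId, b.applicationId); // existing ordering
  }
}
{code}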

 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch


 Based on the priority of the application, Capacity Scheduler should be able 
 to give preference to application while doing scheduling.
 Comparator<FiCaSchedulerApp> applicationComparator can be changed as below.
 
 1.Check for Application priority. If priority is available, then return 
 the highest priority job.
 2.Otherwise continue with existing logic such as App ID comparison and 
 then TimeStamp comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.

2015-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325910#comment-14325910
 ] 

Hudson commented on YARN-3207:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #99 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/99/])
YARN-3207. Secondary filter matches entites which do not have the key (xgong: 
rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java


 secondary filter matches entites which do not have the key being filtered for.
 --

 Key: YARN-3207
 URL: https://issues.apache.org/jira/browse/YARN-3207
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Prakash Ramachandran
Assignee: Zhijie Shen
 Attachments: YARN-3207.1.patch


 in the leveldb implementation of the TimelineStore the secondary filter 
 matches entities where the key being searched for is not present.
 ex query from tez ui
 http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar
 will match and return the entity even though there is no entity with 
 otherinfo.foo defined.
 the issue seems to be in 
 {code:title=LeveldbTimelineStore:675}
 if (vs != null && !vs.contains(filter.getValue())) {
   filterPassed = false;
   break;
 }
 {code}
 this should be IMHO
 vs == null || !vs.contains(filter.getValue())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.

2015-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325936#comment-14325936
 ] 

Hudson commented on YARN-3207:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2040 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2040/])
YARN-3207. Secondary filter matches entites which do not have the key (xgong: 
rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java


 secondary filter matches entites which do not have the key being filtered for.
 --

 Key: YARN-3207
 URL: https://issues.apache.org/jira/browse/YARN-3207
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Prakash Ramachandran
Assignee: Zhijie Shen
 Attachments: YARN-3207.1.patch


 in the leveldb implementation of the TimelineStore the secondary filter 
 matches entities where the key being searched for is not present.
 ex query from tez ui
 http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar
 will match and return the entity even though there is no entity with 
 otherinfo.foo defined.
 the issue seems to be in 
 {code:title=LeveldbTimelineStore:675}
 if (vs != null && !vs.contains(filter.getValue())) {
   filterPassed = false;
   break;
 }
 {code}
 this should be IMHO
 vs == null || !vs.contains(filter.getValue())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.

2015-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326030#comment-14326030
 ] 

Hudson commented on YARN-3207:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2059 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2059/])
YARN-3207. Secondary filter matches entites which do not have the key (xgong: 
rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java


 secondary filter matches entites which do not have the key being filtered for.
 --

 Key: YARN-3207
 URL: https://issues.apache.org/jira/browse/YARN-3207
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Prakash Ramachandran
Assignee: Zhijie Shen
 Attachments: YARN-3207.1.patch


 in the leveldb implementation of the TimelineStore the secondary filter 
 matches entities where the key being searched for is not present.
 ex query from tez ui
 http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar
 will match and return the entity even though there is no entity with 
 otherinfo.foo defined.
 the issue seems to be in 
 {code:title=LeveldbTimelineStore:675}
 if (vs != null && !vs.contains(filter.getValue())) {
   filterPassed = false;
   break;
 }
 {code}
 this should be IMHO
 vs == null || !vs.contains(filter.getValue())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1916) Leveldb timeline store applies secondary filters incorrectly

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326053#comment-14326053
 ] 

Hadoop QA commented on YARN-1916:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12639313/YARN-1916.1.patch
  against trunk revision 2ecea5a.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6659//console

This message is automatically generated.

 Leveldb timeline store applies secondary filters incorrectly
 

 Key: YARN-1916
 URL: https://issues.apache.org/jira/browse/YARN-1916
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.0
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Attachments: YARN-1916.1.patch


 When applying a secondary filter (fieldname:fieldvalue) in a get entities 
 query, LeveldbTimelineStore retrieves entities that do not have the specified 
 fieldname, in addition to correctly retrieving entities that have the 
 fieldname with the specified fieldvalue.  It should not return entities that 
 do not have the fieldname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.

2015-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325787#comment-14325787
 ] 

Hudson commented on YARN-3207:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #108 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/108/])
YARN-3207. Secondary filter matches entites which do not have the key (xgong: 
rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java


 secondary filter matches entites which do not have the key being filtered for.
 --

 Key: YARN-3207
 URL: https://issues.apache.org/jira/browse/YARN-3207
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Prakash Ramachandran
Assignee: Zhijie Shen
 Attachments: YARN-3207.1.patch


 in the leveldb implementation of the TimelineStore the secondary filter 
 matches entities where the key being searched for is not present.
 ex query from tez ui
 http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar
 will match and return the entity even though there is no entity with 
 otherinfo.foo defined.
 the issue seems to be in 
 {code:title=LeveldbTimelineStore:675}
 if (vs != null && !vs.contains(filter.getValue())) {
   filterPassed = false;
   break;
 }
 {code}
 this should be IMHO
 vs == null || !vs.contains(filter.getValue())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3207) secondary filter matches entites which do not have the key being filtered for.

2015-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325997#comment-14325997
 ] 

Hudson commented on YARN-3207:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #109 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/109/])
YARN-3207. Secondary filter matches entites which do not have the key (xgong: 
rev 57db50cbe3ce42618ad6d6869ae337d15b261f4e)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java
* hadoop-yarn-project/CHANGES.txt


 secondary filter matches entites which do not have the key being filtered for.
 --

 Key: YARN-3207
 URL: https://issues.apache.org/jira/browse/YARN-3207
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Prakash Ramachandran
Assignee: Zhijie Shen
 Attachments: YARN-3207.1.patch


 in the leveldb implementation of the TimelineStore the secondary filter 
 matches entities where the key being searched for is not present.
 ex query from tez ui
 http://uvm:8188/ws/v1/timeline/TEZ_DAG_ID/?limit=1&secondaryFilter=foo:bar
 will match and return the entity even though there is no entity with 
 otherinfo.foo defined.
 the issue seems to be in 
 {code:title=LeveldbTimelineStore:675}
 if (vs != null && !vs.contains(filter.getValue())) {
   filterPassed = false;
   break;
 }
 {code}
 this should be IMHO
 vs == null || !vs.contains(filter.getValue())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-02-18 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3136:
--
Attachment: 0003-YARN-3136.patch

Yes [~jlowe]. It's good to keep the backward compatibility.
bq. can be overridden in derived schedulers

A new method named *getSchedulerApplication* can be added in AbstractYarnScheduler, and by default it can take the lock to access the application object from the *applications* map.
Later, in CS or another scheduler, we can override it to remove the lock.
I attached a patch for this. Please see whether it is the same as you mentioned.
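
A minimal sketch of the idea (assumed shapes, not the attached 0003-YARN-3136.patch; String stands in for ApplicationId): the base class takes the lock by default, and a scheduler whose applications map is already concurrent overrides the accessor without the lock.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-ins for AbstractYarnScheduler / CapacityScheduler.
abstract class BaseSchedulerSketch<A> {
  protected final Map<String, A> applications = new ConcurrentHashMap<>();

  // Default: guarded by the scheduler lock for schedulers that need it.
  public synchronized A getSchedulerApplication(String appId) {
    return applications.get(appId);
  }
}

class CapacitySchedulerSketch<A> extends BaseSchedulerSketch<A> {
  // Lock-free override: the ConcurrentHashMap read is already thread-safe, so
  // AM registration no longer contends on the scheduler lock for this lookup.
  @Override
  public A getSchedulerApplication(String appId) {
    return applications.get(appId);
  }
}
{code}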

 getTransferredContainers can be a bottleneck during AM registration
 ---

 Key: YARN-3136
 URL: https://issues.apache.org/jira/browse/YARN-3136
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Sunil G
 Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 
 0003-YARN-3136.patch


 While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
 stuck waiting for the scheduler lock trying to call getTransferredContainers. 
  The scheduler lock is highly contended, especially on a large cluster with 
 many nodes heartbeating, and it would be nice if we could find a way to 
 eliminate the need to grab this lock during this call.  We've already done 
 similar work during AM allocate calls to make sure they don't needlessly grab 
 the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-914) Support graceful decommission of nodemanager

2015-02-18 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-914:

Attachment: GracefullyDecommissionofNodeManagerv3.pdf

Updated the proposal to incorporate most comments above, including: the AM notification mechanism, naming, UI changes, etc. In addition, added some details on the core state transitions for the RMNode state machine. Will break this down into sub-JIRAs and start the work if there are no further comments on significant issues.

 Support graceful decommission of nodemanager
 

 Key: YARN-914
 URL: https://issues.apache.org/jira/browse/YARN-914
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Luke Lu
Assignee: Junping Du
 Attachments: Gracefully Decommission of NodeManager (v1).pdf, 
 Gracefully Decommission of NodeManager (v2).pdf, 
 GracefullyDecommissionofNodeManagerv3.pdf


 When NMs are decommissioned for non-fault reasons (capacity change etc.), 
 it's desirable to minimize the impact to running applications.
 Currently if a NM is decommissioned, all running containers on the NM need to 
 be rescheduled on other NMs. Further more, for finished map tasks, if their 
 map output are not fetched by the reducers of the job, these map tasks will 
 need to be rerun as well.
 We propose to introduce a mechanism to optionally gracefully decommission a 
 node manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3214) Adding non-exclusive node labels

2015-02-18 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-3214:


 Summary: Adding non-exclusive node labels 
 Key: YARN-3214
 URL: https://issues.apache.org/jira/browse/YARN-3214
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan


Currently node labels partition the cluster into sub-clusters, so resources cannot be shared between the partitions.

With the current implementation of node labels we cannot use the cluster optimally, and the throughput of the cluster will suffer.

We are proposing adding non-exclusive node labels:

1. Labeled apps get preference on labeled nodes.
2. If there is no ask for labeled resources, we can assign those nodes to non-labeled apps.
3. If there is any future ask for those resources, we will preempt the non-labeled apps and give the resources back to the labeled apps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3214) Add non-exclusive node labels

2015-02-18 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3214:
-
Summary: Add non-exclusive node labels   (was: Adding non-exclusive node 
labels )

 Add non-exclusive node labels 
 --

 Key: YARN-3214
 URL: https://issues.apache.org/jira/browse/YARN-3214
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan

 Currently node labels partition the cluster to some sub-clusters so resources 
 cannot be shared between partitioned cluster. 
 With the current implementation of node labels we cannot use the cluster 
 optimally and the throughput of the cluster will suffer.
 We are proposing adding non-exclusive node labels:
 1. Labeled apps get the preference on Labeled nodes 
 2. If there is no ask for labeled resources we can assign those nodes to non 
 labeled apps
 3. If there is any future ask for those resources , we will preempt the non 
 labeled apps and give them back to labeled apps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3215) Respect labels in CapacityScheduler when computing headroom

2015-02-18 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-3215:


 Summary: Respect labels in CapacityScheduler when computing 
headroom
 Key: YARN-3215
 URL: https://issues.apache.org/jira/browse/YARN-3215
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Wangda Tan
Assignee: Wangda Tan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3216) Max-AM-Resource-Percentage should respect node labels

2015-02-18 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-3216:


 Summary: Max-AM-Resource-Percentage should respect node labels
 Key: YARN-3216
 URL: https://issues.apache.org/jira/browse/YARN-3216
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1615) Fix typos in FSAppAttempt.java

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326700#comment-14326700
 ] 

Hadoop QA commented on YARN-1615:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12699540/YARN-1615-002.patch
  against trunk revision 2aa9979.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+0 tests included{color}.  The patch appears to be a 
documentation patch that doesn't require tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 5 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6661//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6661//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6661//console

This message is automatically generated.

 Fix typos in FSAppAttempt.java
 --

 Key: YARN-1615
 URL: https://issues.apache.org/jira/browse/YARN-1615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation, scheduler
Affects Versions: 2.6.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Trivial
  Labels: newbie
 Attachments: YARN-1615-002.patch, YARN-1615.patch


 In FSAppAttempt.java there're 4 typos:
 {code}
* containers over rack-local or off-switch containers. To acheive this
* we first only allow node-local assigments for a given prioirty level,
* then relax the locality threshold once we've had a long enough period
* without succesfully scheduling. We measure both the number of missed
 {code}
 They should be fixed as follows:
 {code}
* containers over rack-local or off-switch containers. To achieve this
* we first only allow node-local assignments for a given priority level,
* then relax the locality threshold once we've had a long enough period
* without successfully scheduling. We measure both the number of missed
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3040) [Data Model] Implement client-side API for handling flows

2015-02-18 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326716#comment-14326716
 ] 

Robert Kanter commented on YARN-3040:
-

[~Naganarasimha]
# I think the Entities (YARN-3041) are mainly for writing/reading to/from the 
ATS store.  Most of the information stored in those Entities are not needed by 
the user when submitting a job.  All the user really needs to set is the IDs, 
and some of these we can make optional or determine automatically (e.g. it's 
obvious which cluster it's running on)
# 100 characters per tag seems like it should be enough; if not, we can maybe 
increase this limit?  It is marked as {{@Evolving}}
# Like other properties, we can add a method to JobClient or one of those 
classes that sets the property.  For example, {{setFlowId(String id)}} would 
simply set the tag

Flows and related constructs don't currently exist in YARN.  Unless we add 
these as first-class concepts to the rest of YARN outside of the ATS (e.g. 
instead of only being able to submit YARN applications, you can also submit 
YARN flows; though this is looking more like Oozie...), I think tags are the 
only way to track this information.
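
A sketch of how such a helper might encode flow information as application tags (the tag prefixes and the helper class are hypothetical, not an agreed-upon API):

{code:java}
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

// Hypothetical client-side helper: flow ids ride along as YARN application
// tags, so no new first-class YARN concept is needed.
public final class FlowTagHelper {
  private FlowTagHelper() {}

  public static void setFlowTags(ApplicationSubmissionContext context,
      String flowId, String flowRunId) {
    Set<String> tags = new HashSet<String>(context.getApplicationTags());
    if (flowId != null) {
      tags.add("flow_id:" + flowId);         // must fit within the tag length limit
    }
    if (flowRunId != null) {
      tags.add("flow_run_id:" + flowRunId);
    }
    context.setApplicationTags(tags);
  }
}
{code}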

 [Data Model] Implement client-side API for handling flows
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Robert Kanter

 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2942) Aggregated Log Files should be compacted

2015-02-18 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326722#comment-14326722
 ] 

Robert Kanter commented on YARN-2942:
-

Sure.  I think Combined Aggregated Logs is more obvious than Uber Aggregated 
Logs; we also seem to use Uber for a few different things already.  I'll update 
the design doc and look into splitting up the patch into a few subtasks.

 Aggregated Log Files should be compacted
 

 Key: YARN-2942
 URL: https://issues.apache.org/jira/browse/YARN-2942
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: CompactedAggregatedLogsProposal_v1.pdf, 
 CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, 
 YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
 YARN-2942.003.patch


 Turning on log aggregation allows users to easily store container logs in 
 HDFS and subsequently view them in the YARN web UIs from a central place.  
 Currently, there is a separate log file for each Node Manager.  This can be a 
 problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
 accumulating many (possibly small) files per YARN application.  The current 
 “solution” for this problem is to configure YARN (actually the JHS) to 
 automatically delete these files after some amount of time.  
 We should improve this by compacting the per-node aggregated log files into 
 one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2942) Aggregated Log Files should be combined

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326744#comment-14326744
 ] 

Hadoop QA commented on YARN-2942:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12699570/CombinedAggregatedLogsProposal_v3.pdf
  against trunk revision 9a3e292.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6662//console

This message is automatically generated.

 Aggregated Log Files should be combined
 ---

 Key: YARN-2942
 URL: https://issues.apache.org/jira/browse/YARN-2942
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: CombinedAggregatedLogsProposal_v3.pdf, 
 CompactedAggregatedLogsProposal_v1.pdf, 
 CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, 
 YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
 YARN-2942.003.patch


 Turning on log aggregation allows users to easily store container logs in 
 HDFS and subsequently view them in the YARN web UIs from a central place.  
 Currently, there is a separate log file for each Node Manager.  This can be a 
 problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
 accumulating many (possibly small) files per YARN application.  The current 
 “solution” for this problem is to configure YARN (actually the JHS) to 
 automatically delete these files after some amount of time.  
 We should improve this by compacting the per-node aggregated log files into 
 one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy

2015-02-18 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-3217:

Labels: newbie  (was: )

 Remove httpclient dependency from hadoop-yarn-server-web-proxy
 --

 Key: YARN-3217
 URL: https://issues.apache.org/jira/browse/YARN-3217
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Akira AJISAKA
Priority: Minor
  Labels: newbie

 Sub-task of HADOOP-10105. Remove httpclient dependency from 
 WebAppProxyServlet.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1615) Fix typos in description about delay scheduling

2015-02-18 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1615:
-
Summary: Fix typos in description about delay scheduling  (was: Fix typos 
in delay scheduler's description)

 Fix typos in description about delay scheduling
 ---

 Key: YARN-1615
 URL: https://issues.apache.org/jira/browse/YARN-1615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation, scheduler
Affects Versions: 2.6.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Trivial
  Labels: newbie
 Attachments: YARN-1615-002.patch, YARN-1615.patch


 In FSAppAttempt.java there're 4 typos:
 {code}
* containers over rack-local or off-switch containers. To acheive this
* we first only allow node-local assigments for a given prioirty level,
* then relax the locality threshold once we've had a long enough period
* without succesfully scheduling. We measure both the number of missed
 {code}
 They should be fixed as follows:
 {code}
* containers over rack-local or off-switch containers. To achieve this
* we first only allow node-local assignments for a given priority level,
* then relax the locality threshold once we've had a long enough period
* without successfully scheduling. We measure both the number of missed
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3194) After NM restart, RM should handle NMCotainerStatuses sent by NM while registering if NM is Reconnected node

2015-02-18 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326924#comment-14326924
 ] 

Rohith commented on YARN-3194:
--

Findbugs warnings are unrelated to this JIRA. These warnings will be handled as 
part of YARN-3204.

 After NM restart, RM should handle NMContainerStatuses sent by NM while 
 registering if NM is Reconnected node
 

 Key: YARN-3194
 URL: https://issues.apache.org/jira/browse/YARN-3194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: NM restart is enabled
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch


 On NM restart, the NM sends all outstanding NMContainerStatuses to the RM 
 during registration. The RM treats the registration either as a new node or 
 as a reconnecting node and triggers the corresponding event. 
 # Node added event: two scenarios can occur 
 ## A new node registers with a different ip:port – NOT A PROBLEM
 ## An old node re-registers because of a RESYNC command from the RM after an 
 RM restart – NOT A PROBLEM
 # Node reconnected event: 
 ## An existing node re-registers, i.e. the RM treats it as a reconnecting 
 node when the RM has not restarted 
 ### NM restart NOT enabled – NOT A PROBLEM
 ### NM restart enabled 
 #### Some applications are running on this node – *Problem is here*
 #### Zero applications are running on this node – NOT A PROBLEM
 Since the NMContainerStatuses are not handled, the RM never learns about the 
 completed containers and never releases the resources they held. The RM will 
 not allocate new containers for pending resource requests until the 
 completedContainer event is triggered, so applications wait indefinitely 
 because their pending container requests are never served.
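 As a rough illustration (hypothetical class and callback names, not the actual 
 RM code; the real hook would likely sit in the registration handling), the 
 reconnect path needs something along these lines to surface the completed 
 containers reported at registration:
{code}
import java.util.List;

import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.server.api.protocolrecords.NMContainerStatus;

// Hypothetical helper, not the actual RM code: walk the container statuses
// reported at registration and surface the completed ones so the scheduler
// can release their resources.
final class ReconnectSketch {
  // Assumed callback into the RM, for illustration only.
  interface CompletedContainerSink {
    void containerCompleted(NMContainerStatus status);
  }

  static void handleReportedContainers(List<NMContainerStatus> statuses,
      CompletedContainerSink sink) {
    for (NMContainerStatus status : statuses) {
      if (status.getContainerState() == ContainerState.COMPLETE) {
        // Without this step the RM never learns the container finished and
        // keeps its resources allocated, starving pending requests.
        sink.containerCompleted(status);
      }
    }
  }
}
{code}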



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2942) Aggregated Log Files should be combined

2015-02-18 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326953#comment-14326953
 ] 

Robert Kanter commented on YARN-2942:
-

{quote}We should try to avoid rereading the entire log file and rewriting 
again. How about we try the concat approach (with variable length blocks) first 
before we try the reread+rewrite?{quote}
The problem here is that the aggregated log files are not in an append-friendly 
format (TFile).  We'd have to change the file format that they're in (perhaps 
reusing the similar format I created in this patch), but this wouldn't be 
backwards compatible.

{quote}The long term solution for the later really is HDFS supporting atomic 
append (with concurrent writers){quote}
This would be very useful.  Even with the design implemented by this patch, 
atomic append sounds like it would eventually allow us to get rid of the 
ZooKeeper locks.

 Aggregated Log Files should be combined
 ---

 Key: YARN-2942
 URL: https://issues.apache.org/jira/browse/YARN-2942
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: CombinedAggregatedLogsProposal_v3.pdf, 
 CompactedAggregatedLogsProposal_v1.pdf, 
 CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, 
 YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
 YARN-2942.003.patch


 Turning on log aggregation allows users to easily store container logs in 
 HDFS and subsequently view them in the YARN web UIs from a central place.  
 Currently, there is a separate log file for each Node Manager.  This can be a 
 problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
 accumulating many (possibly small) files per YARN application.  The current 
 “solution” for this problem is to configure YARN (actually the JHS) to 
 automatically delete these files after some amount of time.  
 We should improve this by compacting the per-node aggregated log files into 
 one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2942) Aggregated Log Files should be combined

2015-02-18 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-2942:

Attachment: CombinedAggregatedLogsProposal_v3.pdf

I've just uploaded CombinedAggregatedLogsProposal_v3.pdf, which has some minor 
updates.

 Aggregated Log Files should be combined
 ---

 Key: YARN-2942
 URL: https://issues.apache.org/jira/browse/YARN-2942
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: CombinedAggregatedLogsProposal_v3.pdf, 
 CompactedAggregatedLogsProposal_v1.pdf, 
 CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, 
 YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
 YARN-2942.003.patch


 Turning on log aggregation allows users to easily store container logs in 
 HDFS and subsequently view them in the YARN web UIs from a central place.  
 Currently, there is a separate log file for each Node Manager.  This can be a 
 problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
 accumulating many (possibly small) files per YARN application.  The current 
 “solution” for this problem is to configure YARN (actually the JHS) to 
 automatically delete these files after some amount of time.  
 We should improve this by compacting the per-node aggregated log files into 
 one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3219) Use CombinedAggregatedLogFormat Writer to combine aggregated log files

2015-02-18 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3219:
---

 Summary: Use CombinedAggregatedLogFormat Writer to combine 
aggregated log files
 Key: YARN-3219
 URL: https://issues.apache.org/jira/browse/YARN-3219
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


The NodeManager should use the {{CombinedAggregatedLogFormat}} from YARN-3218 
to append its aggregated log to the per-app log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3222) RMNodeImpl#ReconnectNodeTransition should send scheduler events in sequential order

2015-02-18 Thread Rohith (JIRA)
Rohith created YARN-3222:


 Summary: RMNodeImpl#ReconnectNodeTransition should send scheduler 
events in sequential order
 Key: YARN-3222
 URL: https://issues.apache.org/jira/browse/YARN-3222
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Rohith
Assignee: Rohith
Priority: Critical


When a node reconnects, RMNodeImpl#ReconnectNodeTransition notifies the 
scheduler through node_added, node_removed or node_resource_update events. 
These events must be sent in sequential order, i.e. the node_added event 
first and the node_resource_update event after it.
But if the node reconnects with a different http port, the order of scheduler 
events is node_removed -- node_resource_update -- node_added, so the scheduler 
does not find the node for the resource update, throws an NPE and the RM exits.

The node_resource_update event should always be triggered via 
RMNodeEventType.RESOURCE_UPDATE.
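
As a rough sketch of the intended ordering (hypothetical event names, not the 
actual RMNodeImpl or scheduler event classes):
{code}
import java.util.function.Consumer;

// Illustration only: hypothetical event names, not the RM's real scheduler
// event classes. The point is the ordering: the scheduler must know the node
// before it is asked to update that node's resources.
final class ReconnectOrderingSketch {
  enum NodeEvent { NODE_REMOVED, NODE_ADDED, NODE_RESOURCE_UPDATE }

  static void onReconnect(boolean sameHttpPort, Consumer<NodeEvent> scheduler) {
    if (sameHttpPort) {
      // Node object is reused; only its resources need refreshing.
      scheduler.accept(NodeEvent.NODE_RESOURCE_UPDATE);
    } else {
      // Different http port: remove and re-add first, and only then update
      // resources, so the scheduler always finds the node it is asked to update.
      scheduler.accept(NodeEvent.NODE_REMOVED);
      scheduler.accept(NodeEvent.NODE_ADDED);
      scheduler.accept(NodeEvent.NODE_RESOURCE_UPDATE);
    }
  }
}
{code}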



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy

2015-02-18 Thread Akira AJISAKA (JIRA)
Akira AJISAKA created YARN-3217:
---

 Summary: Remove httpclient dependency from 
hadoop-yarn-server-web-proxy
 Key: YARN-3217
 URL: https://issues.apache.org/jira/browse/YARN-3217
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Akira AJISAKA
Priority: Minor


Sub-task of HADOOP-10105. Remove httpclient dependency from 
WebAppProxyServlet.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3166) [Source organization] Decide detailed package structures for timeline service v2 components

2015-02-18 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326812#comment-14326812
 ] 

Zhijie Shen commented on YARN-3166:
---

Thanks for raising the package structure discussion. Some comments:

1. Should we also sort out the package for the RM and NM code that writes the 
actual system data through the aggregator?

2. Will the RM and NM modules depend on the timeline service module?

 [Source organization] Decide detailed package structures for timeline service 
 v2 components
 ---

 Key: YARN-3166
 URL: https://issues.apache.org/jira/browse/YARN-3166
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Li Lu
Assignee: Li Lu

 Open this JIRA to track all discussions on detailed package structures for 
 timeline services v2. This JIRA is for discussion only.
 For our current timeline service v2 design, aggregator (previously called 
 writer) implementation is in hadoop-yarn-server's:
 {{org.apache.hadoop.yarn.server.timelineservice.aggregator}}
 In YARN-2928's design, the next-gen ATS reader is also a server. Maybe we 
 want to put reader-related implementations into hadoop-yarn-server's:
 {{org.apache.hadoop.yarn.server.timelineservice.reader}}
 Both readers and aggregators will expose features that may be used by YARN 
 and other 3rd party components, such as aggregator/reader APIs. For those 
 features, maybe we would like to expose their interfaces to 
 hadoop-yarn-common's {{org.apache.hadoop.yarn.timelineservice}}? 
 Let's use this JIRA as a centralized place to track all related discussions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3217) Remove httpclient dependency from hadoop-yarn-server-web-proxy

2015-02-18 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula reassigned YARN-3217:
--

Assignee: Brahma Reddy Battula

 Remove httpclient dependency from hadoop-yarn-server-web-proxy
 --

 Key: YARN-3217
 URL: https://issues.apache.org/jira/browse/YARN-3217
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Akira AJISAKA
Assignee: Brahma Reddy Battula
Priority: Minor
  Labels: newbie

 Sub-task of HADOOP-10105. Remove httpclient dependency from 
 WebAppProxyServlet.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA

2015-02-18 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326877#comment-14326877
 ] 

Tsuyoshi OZAWA commented on YARN-1514:
--

Thanks Jian for your review!

 Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
 

 Key: YARN-1514
 URL: https://issues.apache.org/jira/browse/YARN-1514
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Fix For: 2.7.0

 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, 
 YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.5.patch, YARN-1514.5.patch, 
 YARN-1514.6.patch, YARN-1514.7.patch, YARN-1514.wip-2.patch, 
 YARN-1514.wip.patch


 ZKRMStateStore is very sensitive to ZNode-related operations, as discussed in 
 YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is 
 called when an RM-HA cluster fails over, so its execution time directly 
 impacts RM-HA failover time.
 We need a utility to benchmark the execution time of ZKRMStateStore#loadState 
 as a development tool.
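 As a rough illustration of what such a utility measures (this is not the 
 committed TestZKRMStateStorePerf code, just a sketch assuming an 
 already-initialized state store):
{code}
import org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore;
import org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.RMState;

// Illustrative timing harness only; the committed benchmark is
// TestZKRMStateStorePerf, not this class.
final class LoadStateTimer {
  static long timeLoadStateMillis(RMStateStore store) throws Exception {
    long start = System.nanoTime();
    RMState state = store.loadState();   // the call that dominates RM failover time
    long elapsedMillis = (System.nanoTime() - start) / 1000000L;
    System.out.println("loadState: " + elapsedMillis + " ms, "
        + state.getApplicationState().size() + " applications");
    return elapsedMillis;
  }
}
{code}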



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1615) Fix typos in description about delay scheduling

2015-02-18 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326898#comment-14326898
 ] 

Akira AJISAKA commented on YARN-1615:
-

Thanks Tsuyoshi for review and commit.

 Fix typos in description about delay scheduling
 ---

 Key: YARN-1615
 URL: https://issues.apache.org/jira/browse/YARN-1615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation, scheduler
Affects Versions: 2.6.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Trivial
  Labels: newbie
 Fix For: 2.7.0

 Attachments: YARN-1615-002.patch, YARN-1615.patch


 In FSAppAttempt.java there're 4 typos:
 {code}
* containers over rack-local or off-switch containers. To acheive this
* we first only allow node-local assigments for a given prioirty level,
* then relax the locality threshold once we've had a long enough period
* without succesfully scheduling. We measure both the number of missed
 {code}
 They should be fixed as follows:
 {code}
* containers over rack-local or off-switch containers. To achieve this
* we first only allow node-local assignments for a given priority level,
* then relax the locality threshold once we've had a long enough period
* without successfully scheduling. We measure both the number of missed
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2942) Aggregated Log Files should be combined

2015-02-18 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-2942:

Summary: Aggregated Log Files should be combined  (was: Aggregated Log 
Files should be compacted)

 Aggregated Log Files should be combined
 ---

 Key: YARN-2942
 URL: https://issues.apache.org/jira/browse/YARN-2942
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: CompactedAggregatedLogsProposal_v1.pdf, 
 CompactedAggregatedLogsProposal_v2.pdf, YARN-2942-preliminary.001.patch, 
 YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
 YARN-2942.003.patch


 Turning on log aggregation allows users to easily store container logs in 
 HDFS and subsequently view them in the YARN web UIs from a central place.  
 Currently, there is a separate log file for each Node Manager.  This can be a 
 problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
 accumulating many (possibly small) files per YARN application.  The current 
 “solution” for this problem is to configure YARN (actually the JHS) to 
 automatically delete these files after some amount of time.  
 We should improve this by compacting the per-node aggregated log files into 
 one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3220) JHS should display Combined Aggregated Logs when available

2015-02-18 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3220:
---

 Summary: JHS should display Combined Aggregated Logs when available
 Key: YARN-3220
 URL: https://issues.apache.org/jira/browse/YARN-3220
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


The JHS should read the Combined Aggregated Log files created by YARN-3219 when 
the user asks it for logs.  When they are unavailable, it should fall back to 
the regular Aggregated Log files (the current behavior).
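
A hypothetical sketch of that fallback (the combined file name and the serve* 
helpers are placeholders, not anything this JIRA or YARN-3218/3219 define):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical fallback sketch, not the JHS implementation.
final class LogServingSketch {
  static void serveLogs(Configuration conf, Path appLogDir) throws Exception {
    FileSystem fs = appLogDir.getFileSystem(conf);
    Path combined = new Path(appLogDir, "combined.log");   // assumed name
    if (fs.exists(combined)) {
      serveCombinedLog(combined);                 // placeholder
    } else {
      servePerNodeAggregatedLogs(appLogDir);      // current behavior, placeholder
    }
  }

  private static void serveCombinedLog(Path file) { /* placeholder */ }
  private static void servePerNodeAggregatedLogs(Path dir) { /* placeholder */ }
}
{code}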



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1615) Fix typos in description about delay scheduling

2015-02-18 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1615:
-
Hadoop Flags: Reviewed

 Fix typos in description about delay scheduling
 ---

 Key: YARN-1615
 URL: https://issues.apache.org/jira/browse/YARN-1615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation, scheduler
Affects Versions: 2.6.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Trivial
  Labels: newbie
 Attachments: YARN-1615-002.patch, YARN-1615.patch


 In FSAppAttempt.java there're 4 typos:
 {code}
* containers over rack-local or off-switch containers. To acheive this
* we first only allow node-local assigments for a given prioirty level,
* then relax the locality threshold once we've had a long enough period
* without succesfully scheduling. We measure both the number of missed
 {code}
 They should be fixed as follows:
 {code}
* containers over rack-local or off-switch containers. To achieve this
* we first only allow node-local assignments for a given priority level,
* then relax the locality threshold once we've had a long enough period
* without successfully scheduling. We measure both the number of missed
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3122) Metrics for container's actual CPU usage

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326864#comment-14326864
 ] 

Hadoop QA commented on YARN-3122:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12699581/YARN-3122.002.patch
  against trunk revision 1c03376.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6663//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6663//console

This message is automatically generated.

 Metrics for container's actual CPU usage
 

 Key: YARN-3122
 URL: https://issues.apache.org/jira/browse/YARN-3122
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
 Attachments: YARN-3122.001.patch, YARN-3122.002.patch, 
 YARN-3122.prelim.patch, YARN-3122.prelim.patch


 It would be nice to capture resource usage per container, for a variety of 
 reasons. This JIRA is to track CPU usage. 
 YARN-2965 tracks the resource usage on the node, and the two implementations 
 should reuse code as much as possible. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1615) Fix typos in description about delay scheduling

2015-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326867#comment-14326867
 ] 

Hudson commented on YARN-1615:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7150 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7150/])
YARN-1615. Fix typos in delay scheduler's description. Contributed by Akira 
Ajisaka. (ozawa: rev b8a14efdf535d42bcafa58d380bd2c7f4d36f8cb)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
* hadoop-yarn-project/CHANGES.txt


 Fix typos in description about delay scheduling
 ---

 Key: YARN-1615
 URL: https://issues.apache.org/jira/browse/YARN-1615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation, scheduler
Affects Versions: 2.6.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Trivial
  Labels: newbie
 Fix For: 2.7.0

 Attachments: YARN-1615-002.patch, YARN-1615.patch


 In FSAppAttempt.java there're 4 typos:
 {code}
* containers over rack-local or off-switch containers. To acheive this
* we first only allow node-local assigments for a given prioirty level,
* then relax the locality threshold once we've had a long enough period
* without succesfully scheduling. We measure both the number of missed
 {code}
 They should be fixed as follows:
 {code}
* containers over rack-local or off-switch containers. To achieve this
* we first only allow node-local assignments for a given priority level,
* then relax the locality threshold once we've had a long enough period
* without successfully scheduling. We measure both the number of missed
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1615) Fix typos in FSAppAttempt.java

2015-02-18 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326831#comment-14326831
 ] 

Tsuyoshi OZAWA commented on YARN-1615:
--

+1, committing this shortly.

 Fix typos in FSAppAttempt.java
 --

 Key: YARN-1615
 URL: https://issues.apache.org/jira/browse/YARN-1615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation, scheduler
Affects Versions: 2.6.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Trivial
  Labels: newbie
 Attachments: YARN-1615-002.patch, YARN-1615.patch


 In FSAppAttempt.java there're 4 typos:
 {code}
* containers over rack-local or off-switch containers. To acheive this
* we first only allow node-local assigments for a given prioirty level,
* then relax the locality threshold once we've had a long enough period
* without succesfully scheduling. We measure both the number of missed
 {code}
 They should be fixed as follows:
 {code}
* containers over rack-local or off-switch containers. To achieve this
* we first only allow node-local assignments for a given priority level,
* then relax the locality threshold once we've had a long enough period
* without successfully scheduling. We measure both the number of missed
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1615) Fix typos in delay scheduler's description

2015-02-18 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1615:
-
Summary: Fix typos in delay scheduler's description  (was: Fix typos in 
FSAppAttempt.java)

 Fix typos in delay scheduler's description
 --

 Key: YARN-1615
 URL: https://issues.apache.org/jira/browse/YARN-1615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation, scheduler
Affects Versions: 2.6.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Trivial
  Labels: newbie
 Attachments: YARN-1615-002.patch, YARN-1615.patch


 In FSAppAttempt.java there're 4 typos:
 {code}
* containers over rack-local or off-switch containers. To acheive this
* we first only allow node-local assigments for a given prioirty level,
* then relax the locality threshold once we've had a long enough period
* without succesfully scheduling. We measure both the number of missed
 {code}
 They should be fixed as follows:
 {code}
* containers over rack-local or off-switch containers. To achieve this
* we first only allow node-local assignments for a given priority level,
* then relax the locality threshold once we've had a long enough period
* without successfully scheduling. We measure both the number of missed
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3218) Implement CombinedAggregatedLogFormat Reader and Writer

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326901#comment-14326901
 ] 

Hadoop QA commented on YARN-3218:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12699591/YARN-3218.001.patch
  against trunk revision b8a14ef.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6665//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6665//console

This message is automatically generated.

 Implement CombinedAggregatedLogFormat Reader and Writer
 ---

 Key: YARN-3218
 URL: https://issues.apache.org/jira/browse/YARN-3218
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: YARN-3218.001.patch


 We need to create a Reader and Writer for the CombinedAggregatedLogFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-02-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326737#comment-14326737
 ] 

Jason Lowe commented on YARN-2004:
--

I took a closer look at the patch, and the following logic seems suspect:

{code}
+  if (a1.getApplicationPriority() != null
+      && a2.getApplicationPriority() != null
+      && !a1.getApplicationPriority().equals(a2.getApplicationPriority())) {
+    return a2.getApplicationPriority().compareTo(
+        a1.getApplicationPriority());
+  }
{code}

Priority is only considered if both applications have a priority that was set.  
Do we really want that behavior?  I'm thinking of the scenario where none of 
the apps in the queue have a priority set and then one app has its priority 
set very high or very low.  That has no net effect, since all the other apps 
being compared in the queue don't have a priority set.  A more intuitive 
behavior is to treat an unset priority as a default priority, so we aren't 
implicitly disabling priority checks in some scenarios.
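
For illustration, a minimal sketch of that suggestion (not the patch itself; 
the default value is an assumption):
{code}
import org.apache.hadoop.yarn.api.records.Priority;

// Sketch of the suggested behavior, not the patch: an unset (null) priority
// falls back to a default instead of short-circuiting the comparison.
final class DefaultPrioritySketch {
  // Assumed default value, for illustration only.
  private static final Priority DEFAULT_PRIORITY = Priority.newInstance(0);

  static int comparePriorities(Priority p1, Priority p2) {
    Priority left = (p1 != null) ? p1 : DEFAULT_PRIORITY;
    Priority right = (p2 != null) ? p2 : DEFAULT_PRIORITY;
    // Same ordering as the snippet above: compare a2's priority to a1's.
    return right.compareTo(left);
  }
}
{code}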

 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch


 Based on the priority of the application, Capacity Scheduler should be able 
 to give preference to application while doing scheduling.
 Comparator<FiCaSchedulerApp> applicationComparator can be changed as below:
 1.Check for Application priority. If priority is available, then return 
 the highest priority job.
 2.Otherwise continue with existing logic such as App ID comparison and 
 then TimeStamp comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA

2015-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326822#comment-14326822
 ] 

Hudson commented on YARN-1514:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #7149 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7149/])
YARN-1514. Utility to benchmark ZKRMStateStore#loadState for RM HA. Contributed 
by Tsuyoshi OZAWA (jianhe: rev 1c03376300a46722d4147f5b8f37242f68dba0a2)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/test/YarnTestDriver.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml
* hadoop-yarn-project/CHANGES.txt
* hadoop-project/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStorePerf.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java


 Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
 

 Key: YARN-1514
 URL: https://issues.apache.org/jira/browse/YARN-1514
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Fix For: 2.7.0

 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, 
 YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.5.patch, YARN-1514.5.patch, 
 YARN-1514.6.patch, YARN-1514.7.patch, YARN-1514.wip-2.patch, 
 YARN-1514.wip.patch


 ZKRMStateStore is very sensitive to ZNode-related operations, as discussed in 
 YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is 
 called when an RM-HA cluster fails over, so its execution time directly 
 impacts RM-HA failover time.
 We need a utility to benchmark the execution time of ZKRMStateStore#loadState 
 as a development tool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3221) Applications should be able to 're-register'

2015-02-18 Thread Sidharta Seethana (JIRA)
Sidharta Seethana created YARN-3221:
---

 Summary: Applications should be able to 're-register' 
 Key: YARN-3221
 URL: https://issues.apache.org/jira/browse/YARN-3221
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Sidharta Seethana


Today it is not possible for YARN applications to 're-register' in 
failure/restart scenarios. This is especially problematic for unmanaged 
applications, where restarts (normal or otherwise) or other failures 
necessitate re-creating the AMRMClient (along with a reset of the internal 
RPC counter).  The YARN RM disallows an attempt to register again with the 
same saved token, throwing the exception shown below.  This should be fixed.

{quote}
rmClient.RegisterApplicationMaster 
org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException:Application
 Master is already registered : application_1424304845861_0002
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerApplicationMaster(ApplicationMasterService.java:264)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:90)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{quote}
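
For reference, a minimal sketch of the call that triggers this today 
(host/port/URL values are placeholders; this is not a workaround):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException;

// Sketch of the call path that hits this; there is currently no supported
// recovery once the exception is thrown.
final class ReRegisterSketch {
  static void registerAgain(Configuration conf) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    try {
      rmClient.registerApplicationMaster("am-host", 0, "");
    } catch (InvalidApplicationMasterRequestException e) {
      // Thrown when the same attempt registers a second time; this JIRA asks
      // for a supported re-registration path instead of this dead end.
    }
  }
}
{code}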



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3218) Implement CombinedAggregatedLogFormat Reader and Writer

2015-02-18 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-3218:

Attachment: YARN-3218.001.patch

 Implement CombinedAggregatedLogFormat Reader and Writer
 ---

 Key: YARN-3218
 URL: https://issues.apache.org/jira/browse/YARN-3218
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: YARN-3218.001.patch


 We need to create a Reader and Writer for the CombinedAggregatedLogFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

