[jira] [Updated] (YARN-5579) Resourcemanager should surface failed state store operation prominently
[ https://issues.apache.org/jira/browse/YARN-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated YARN-5579:
-------------------------
    Description:
I found the following in Resourcemanager log when I tried to figure out why application got stuck in NEW_SAVING state.
{code}
2016-08-29 18:14:23,486 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1242)) - Maxed out ZK retries. Giving up!
2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:transition(205)) - Error storing app: application_1470517915158_0001
org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:1042)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:639)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:201)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:183)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
    at java.lang.Thread.run(Thread.java:745)
2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(987)) - State store operation failed
org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
{code}
Resourcemanager should surface the above error prominently.
Likely subsequent application submission would encounter the same error.

  was:
I found the following in Resourcemanager log when I tried to figure out why application got stuck in NEW_SAVING state.
{code}
2016-08-29 18:14:23,486 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1242)) - Maxed out ZK retries. Giving up!
2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:transition(205)) - Error storing app:
[jira] [Commented] (YARN-8414) Nodemanager crashes soon if ATSv2 HBase is either down or absent
[ https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510186#comment-16510186 ]

Ted Yu commented on YARN-8414:
------------------------------

HBaseAdmin has the following method:
{code}
  boolean isTableAvailable(TableName tableName, byte[][] splitKeys) throws IOException;
{code}
You can selectively use the above method to ensure that your table is accessible.

> Nodemanager crashes soon if ATSv2 HBase is either down or absent
> ----------------------------------------------------------------
>
>                 Key: YARN-8414
>                 URL: https://issues.apache.org/jira/browse/YARN-8414
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.1.0
>            Reporter: Eric Yang
>            Priority: Critical
>
> Test cluster has 1000 apps running, and a user triggers capacity scheduler
> queue changes. This crashes all node managers. It looks like the node manager
> encounters too many open files while aggregating logs for containers:
> {code}
> 2018-06-07 21:17:59,307 WARN server.AbstractConnector (AbstractConnector.java:handleAcceptFailure(544)) -
> java.io.IOException: Too many open files
>     at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>     at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371)
>     at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:17:59,758 WARN util.SysInfoLinux (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; can't determine memory settings
> 2018-06-07 21:17:59,758 WARN util.SysInfoLinux (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; can't determine memory settings
> 2018-06-07 21:18:00,842 WARN client.ConnectionUtils (ConnectionUtils.java:getStubKey(236)) - Can not resolve host12.example.com, please check your network
> java.net.UnknownHostException: host1.example.com: System error
>     at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>     at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>     at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>     at java.net.InetAddress.getByName(InetAddress.java:1076)
>     at org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
>     at org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189)
>     at org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111)
>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
>     at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
>     at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> Timeline service has thousands of exceptions:
> {code}
> 2018-06-07 21:18:34,182 ERROR client.AsyncProcess (AsyncProcess.java:submit(291)) - Failed to get region location
> java.io.InterruptedIOException
>     at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265)
>     at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
>     at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
>     at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
>     at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
>     at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
>     at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
>     at
[jira] [Comment Edited] (YARN-8414) Nodemanager crashes soon if ATSv2 HBase is either down or absent
[ https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510186#comment-16510186 ]

Ted Yu edited comment on YARN-8414 at 6/12/18 8:49 PM:
------------------------------------------------------

HBaseAdmin has the following method:
{code}
  boolean isTableAvailable(TableName tableName) throws IOException;
{code}
You can selectively use the above method to ensure that your table is accessible.


was (Author: yuzhih...@gmail.com):
HBaseAdmin has the following method:
{code}
  boolean isTableAvailable(TableName tableName, byte[][] splitKeys) throws IOException;
{code}
You can selectively use the above method to ensure that your table is accessible.
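The availability check suggested in this comment could be used as a gate in front of timeline writes. The following is only a minimal sketch under stated assumptions: `TableProbe` and `AvailabilityGate` are hypothetical names, and `TableProbe` merely stands in for the `isTableAvailable(TableName)` call so that the example compiles without HBase on the classpath. It is not taken from the YARN or HBase code base.

```java
import java.io.IOException;

/** Hypothetical stand-in for the single availability call mentioned in the comment. */
interface TableProbe {
    boolean isTableAvailable(String tableName) throws IOException;
}

/** Illustrative gate: callers ask it before attempting a write to the backing table. */
final class AvailabilityGate {
    private final TableProbe probe;
    private final String tableName;

    AvailabilityGate(TableProbe probe, String tableName) {
        this.probe = probe;
        this.tableName = tableName;
    }

    /** Returns true only when the probe positively reports the table as available. */
    boolean canWrite() {
        try {
            return probe.isTableAvailable(tableName);
        } catch (IOException e) {
            // A failed probe is treated like an unavailable table instead of being propagated.
            return false;
        }
    }
}
```

With a gate like this, a writer can skip or defer work while the table is unreachable rather than piling up retries. How often to probe is a separate design choice that this sketch leaves open.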
[jira] [Commented] (YARN-8414) Nodemanager crashes soon if ATSv2 HBase is either down or absent
[ https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509851#comment-16509851 ]

Ted Yu commented on YARN-8414:
------------------------------

bq. TimelineCollector.putEntities is a synchronized method. The throttling might need to be implemented here to avoid excessive call

I think this should be done as well.
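One possible shape for the throttling discussed in this comment is a fixed-window limiter consulted before each call. This is a sketch under assumptions, not the change made in YARN: `CallThrottle`, its parameters, and the injected clock are all hypothetical, and a real fix might prefer a different policy such as blocking or queueing.

```java
import java.util.function.LongSupplier;

/** Illustrative fixed-window throttle; hypothetical, not taken from TimelineCollector. */
final class CallThrottle {
    private final int maxCallsPerWindow;
    private final long windowNanos;
    private final LongSupplier clock; // injected so the behaviour can be checked without sleeping
    private long windowStart;
    private int callsInWindow;

    CallThrottle(int maxCallsPerWindow, long windowNanos, LongSupplier clock) {
        this.maxCallsPerWindow = maxCallsPerWindow;
        this.windowNanos = windowNanos;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Returns true if the caller may proceed, false if the call should be dropped or deferred. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowNanos) {
            // The previous window has elapsed: start a new one and reset the counter.
            windowStart = now;
            callsInWindow = 0;
        }
        if (callsInWindow < maxCallsPerWindow) {
            callsInWindow++;
            return true;
        }
        return false;
    }
}
```

In production the clock would typically be `System::nanoTime`. A caller would check `tryAcquire()` before the expensive write and decide what to do with rejected calls.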
[jira] [Commented] (YARN-8414) Nodemanager crashes soon if ATSv2 HBase is either down or absent
[ https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509826#comment-16509826 ]

Ted Yu commented on YARN-8414:
------------------------------

In the {{ClientScanner}} constructor:
{code}
    this.retries = conf.getInt(HConstants.HBASE_CLIENT_RETRIES_NUMBER,
        HConstants.DEFAULT_HBASE_CLIENT_RETRIES_NUMBER);
{code}
The config key is "hbase.client.retries.number", with a default of 15.
You can lower this value so that the client side fails earlier in this scenario.
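To illustrate why a smaller retry count makes the client give up sooner, here is a generic bounded-retry loop. It is a sketch only: `IoCall` and `BoundedRetry` are hypothetical names, `maxAttempts` loosely plays the role of the retry setting, and this is not HBase's actual retry logic. A real client would normally also wait between attempts, which is omitted here.

```java
import java.io.IOException;

/** Hypothetical operation that may fail with an IOException. */
interface IoCall<T> {
    T run() throws IOException;
}

/** Illustrative retry helper: the attempt budget bounds how long a caller keeps trying. */
final class BoundedRetry {
    private BoundedRetry() {}

    /** Runs op up to maxAttempts times and rethrows the last failure once the budget is spent. */
    static <T> T call(IoCall<T> op, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.run();
            } catch (IOException e) {
                last = e; // remember the failure; the loop retries while attempts remain
            }
        }
        throw last;
    }
}
```

Against a backend that is down, a budget of 3 surfaces the failure after three attempts, whereas a budget of 15 keeps the calling thread busy five times as long before the error becomes visible.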
[jira] [Resolved] (YARN-1869) Access to zkAcl should be synchronized in ZKRMStateStore#addStoreOrUpdateOps()
[ https://issues.apache.org/jira/browse/YARN-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved YARN-1869.
--------------------------
    Resolution: Won't Fix

The method is gone.

> Access to zkAcl should be synchronized in ZKRMStateStore#addStoreOrUpdateOps()
> ------------------------------------------------------------------------------
>
>                 Key: YARN-1869
>                 URL: https://issues.apache.org/jira/browse/YARN-1869
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Priority: Minor
>         Attachments: yarn-1869.patch
>
>
> Here is related code:
> {code}
>     } else {
>       opList.add(Op.create(nodeCreatePath, tokenOs.toByteArray(), zkAcl,
>           CreateMode.PERSISTENT));
>     }
> {code}
> The other methods accessing zkAcl are synchronized.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
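The concern raised in this issue is that a shared field read in one method without the lock can race with synchronized writers elsewhere. A minimal sketch of the general pattern follows, assuming a hypothetical `AclHolder` class with string entries in place of real ZooKeeper ACL objects. It does not reproduce ZKRMStateStore.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only: every access to the shared ACL list goes through the same monitor. */
final class AclHolder {
    private List<String> acl = new ArrayList<>();

    /** Replaces the ACL list while holding the lock. */
    synchronized void setAcl(List<String> newAcl) {
        acl = new ArrayList<>(newAcl);
    }

    /** Returns a copy taken under the lock, so callers never observe a concurrent update. */
    synchronized List<String> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(acl));
    }
}
```

Code that builds an operation list would take one `snapshot()` and use that copy throughout, instead of reading the field directly from an unsynchronized method.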
[jira] [Comment Edited] (YARN-7346) Fix compilation errors against hbase2 beta release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325683#comment-16325683 ]

Ted Yu edited comment on YARN-7346 at 1/23/18 4:37 PM:
---
hbase 2 beta1 has been released. FYI

was (Author: yuzhih...@gmail.com):
New RC for hbase 2 beta1 has been posted. FYI

> Fix compilation errors against hbase2 beta release
> --
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
> Priority: Major
> Attachments: YARN-7346.00.patch, YARN-7346.01.patch,
> YARN-7346.prelim1.patch, YARN-7346.prelim2.patch, YARN-7581.prelim.patch
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 beta release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325683#comment-16325683 ]

Ted Yu commented on YARN-7346:
--
New RC for hbase 2 beta1 has been posted. FYI

> Fix compilation errors against hbase2 beta release
> --
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
> Attachments: YARN-7346.00.patch, YARN-7346.01.patch,
> YARN-7346.prelim1.patch, YARN-7346.prelim2.patch, YARN-7581.prelim.patch
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 beta release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312355#comment-16312355 ]

Ted Yu commented on YARN-7346:
--
Probably because of the following dependency:
{code}
[INFO] org.apache.hbase:hbase-hadoop-compat:jar:3.0.0-SNAPSHOT
[INFO] +- org.apache.hbase:hbase-annotations:test-jar:tests:3.0.0-SNAPSHOT:test
[INFO] +- org.apache.hbase.thirdparty:hbase-shaded-miscellaneous:jar:1.0.1:compile
[INFO] +- commons-logging:commons-logging:jar:1.2:compile
[INFO] +- org.apache.hbase:hbase-metrics-api:jar:3.0.0-SNAPSHOT:compile
{code}
Similar dependency exists for hbase-hadoop2-compat module.

> Fix compilation errors against hbase2 beta release
> --
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
> Attachments: YARN-7346.00.patch, YARN-7346.01.patch,
> YARN-7346.prelim1.patch, YARN-7346.prelim2.patch, YARN-7581.prelim.patch
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 beta release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312338#comment-16312338 ]

Ted Yu commented on YARN-7346:
--
First RC for beta1 is being voted upon.
After the formal release of beta1, there is no need to include staging repo.

> Fix compilation errors against hbase2 beta release
> --
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
> Attachments: YARN-7346.00.patch, YARN-7346.01.patch,
> YARN-7346.prelim1.patch, YARN-7346.prelim2.patch, YARN-7581.prelim.patch
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306611#comment-16306611 ]

Ted Yu commented on YARN-7346:
--
bq. Unless HBase releases beta-1

You can find maven artifacts for beta-1 RC here:
https://repository.apache.org/content/groups/staging/org/apache/hbase/hbase-client/2.0.0-beta-1/

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
> Attachments: YARN-7346.00.patch, YARN-7346.prelim1.patch,
> YARN-7346.prelim2.patch, YARN-7581.prelim.patch
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297469#comment-16297469 ]

Ted Yu commented on YARN-7346:
--
HBASE-19112 has been integrated.
See if rebase is needed for using the new hbase API.

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
> Attachments: YARN-7346.00.patch, YARN-7346.prelim1.patch,
> YARN-7346.prelim2.patch, YARN-7581.prelim.patch
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289446#comment-16289446 ]

Ted Yu commented on YARN-7346:
--
[~rohithsharma] [~vrushalic]:
Can you review the patch?

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
> Attachments: YARN-7346.00.patch, YARN-7346.prelim1.patch,
> YARN-7346.prelim2.patch, YARN-7581.prelim.patch
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7213) [Umbrella] Test and validate HBase-2.0.x with Atsv2
[ https://issues.apache.org/jira/browse/YARN-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256478#comment-16256478 ]

Ted Yu commented on YARN-7213:
--
See HBASE-18368
https://issues.apache.org/jira/browse/HBASE-18368?focusedCommentId=16207257=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16207257

bq. so you have to make do with SELECT x,y WHERE y = "foo" instead

True.

> [Umbrella] Test and validate HBase-2.0.x with Atsv2
> ---
>
> Key: YARN-7213
> URL: https://issues.apache.org/jira/browse/YARN-7213
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Attachments: YARN-7213.prelim.patch, YARN-7213.wip.patch
>
> Hbase-2.0.x officially support hadoop-alpha compilations. And also they are
> getting ready for Hadoop-beta release so that HBase can release their
> versions compatible with Hadoop-beta. So, this JIRA is to keep track of
> HBase-2.0 integration issues.
[jira] [Commented] (YARN-7213) [Umbrella] Test and validate HBase-2.0.x with Atsv2
[ https://issues.apache.org/jira/browse/YARN-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256315#comment-16256315 ]

Ted Yu commented on YARN-7213:
--
Haibo:
Time permitting, formulating simplified Filter test (independent of ATS v2) which shows the test failure is beneficial to hbase community (to prevent regression).

Thanks

> [Umbrella] Test and validate HBase-2.0.x with Atsv2
> ---
>
> Key: YARN-7213
> URL: https://issues.apache.org/jira/browse/YARN-7213
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Attachments: YARN-7213.prelim.patch, YARN-7213.wip.patch
>
> Hbase-2.0.x officially support hadoop-alpha compilations. And also they are
> getting ready for Hadoop-beta release so that HBase can release their
> versions compatible with Hadoop-beta. So, this JIRA is to keep track of
> HBase-2.0 integration issues.
[jira] [Commented] (YARN-7213) [Umbrella] Test and validate HBase-2.0.x with Atsv2
[ https://issues.apache.org/jira/browse/YARN-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255988#comment-16255988 ]

Ted Yu commented on YARN-7213:
--
Recently there were a lot of changes for Filters, starting from (trunk):
{code}
Author: huzheng
Date: Sat May 27 16:58:00 2017 +0800

    HBASE-17678 FilterList with MUST_PASS_ONE may lead to redundant cells returned
{code}
to:
{code}
commit 705b3fa98c97806c7eba63617a99f62d829400d1
Author: huzheng
Date: Tue Oct 24 15:30:55 2017 +0800

    HBASE-19057 Fix other code review comments about FilterList improvement
{code}
One approach is to step back before commit HBASE-17678, and progressively find which commit causes the test to fail.

> [Umbrella] Test and validate HBase-2.0.x with Atsv2
> ---
>
> Key: YARN-7213
> URL: https://issues.apache.org/jira/browse/YARN-7213
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Attachments: YARN-7213.prelim.patch, YARN-7213.wip.patch
>
> Hbase-2.0.x officially support hadoop-alpha compilations. And also they are
> getting ready for Hadoop-beta release so that HBase can release their
> versions compatible with Hadoop-beta. So, this JIRA is to keep track of
> HBase-2.0 integration issues.
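The approach suggested in the comment above (go back to a commit before HBASE-17678 where the test passes, then search forward for the first commit where it fails) is a bisection. A minimal sketch in Java follows, assuming the commits are ordered oldest to newest and that the test keeps failing once it has started to fail. The class name, the commit labels, and the predicate are hypothetical stand-ins; in practice the predicate would build HBase at the given commit and run the failing ATSv2 test.

```java
import java.util.List;
import java.util.function.Predicate;

final class CommitBisect {
    /**
     * Returns the index of the first commit for which testFails is true,
     * or -1 if the test passes at every commit.
     * Assumes commits are ordered oldest to newest and that every commit
     * after the first failing one also fails.
     */
    static <T> int firstFailing(List<T> commits, Predicate<T> testFails) {
        int lo = 0;               // every index below lo is known to pass
        int hi = commits.size();  // every index at or above hi is known to fail
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (testFails.test(commits.get(mid))) {
                hi = mid;         // first failure is at mid or earlier
            } else {
                lo = mid + 1;     // first failure is after mid
            }
        }
        return lo == commits.size() ? -1 : lo;
    }

    public static void main(String[] args) {
        // Hypothetical commit labels; real use would list the FilterList commits.
        List<String> commits = List.of("c0", "c1", "c2", "c3", "c4");
        // Stand-in for "check out this commit, build, run the test".
        Predicate<String> testFails = c -> c.compareTo("c3") >= 0;
        int idx = firstFailing(commits, testFails);
        System.out.println(idx < 0 ? "no failing commit" : "first failing commit: " + commits.get(idx));
    }
}
```

With N commits in the range this needs about log2(N) build-and-test runs rather than N, which matters when each run involves a full HBase build. `git bisect` automates the same search directly on the repository history.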
[jira] [Commented] (YARN-7213) [Umbrella] Test and validate HBase-2.0.x with Atsv2
[ https://issues.apache.org/jira/browse/YARN-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255982#comment-16255982 ]

Ted Yu commented on YARN-7213:
--
I took a brief look at TestTimelineReaderWebServicesHBaseStorage.java which passes filter criteria thru URL parameters.

If test can be simplified (involving SingleColumnValueFilter), that would make debugging easier for hbase developers.

> [Umbrella] Test and validate HBase-2.0.x with Atsv2
> ---
>
> Key: YARN-7213
> URL: https://issues.apache.org/jira/browse/YARN-7213
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Attachments: YARN-7213.prelim.patch, YARN-7213.wip.patch
>
> Hbase-2.0.x officially support hadoop-alpha compilations. And also they are
> getting ready for Hadoop-beta release so that HBase can release their
> versions compatible with Hadoop-beta. So, this JIRA is to keep track of
> HBase-2.0 integration issues.
[jira] [Commented] (YARN-7213) [Umbrella] Test and validate HBase-2.0.x with Atsv2
[ https://issues.apache.org/jira/browse/YARN-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254693#comment-16254693 ]

Ted Yu commented on YARN-7213:
--
[~openinx]:
You have made many changes to Filters.
Mind giving Haibo a hand?

> [Umbrella] Test and validate HBase-2.0.x with Atsv2
> ---
>
> Key: YARN-7213
> URL: https://issues.apache.org/jira/browse/YARN-7213
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Attachments: YARN-7213.prelim.patch, YARN-7213.wip.patch
>
> Hbase-2.0.x officially support hadoop-alpha compilations. And also they are
> getting ready for Hadoop-beta release so that HBase can release their
> versions compatible with Hadoop-beta. So, this JIRA is to keep track of
> HBase-2.0 integration issues.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251819#comment-16251819 ]

Ted Yu commented on YARN-7346:
--
[~ram_krish]:
You can find branch used by Haibo from above.

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243363#comment-16243363 ]

Ted Yu commented on YARN-7346:
--
I am not sure a different folder helps.
As long as mapreduce.tar.gz, containing un-relocated hbase jars, is on the classpath for (hbase) mapreduce jobs, we may see some problem.
e.g. HBASE-19169

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243098#comment-16243098 ]

Ted Yu commented on YARN-7346:
--
Have ATS v2 developers considered shading hbase jars?

With shading, regardless of hbase version ATS v2 uses, hbase mapreduce job can succeed.

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238106#comment-16238106 ]

Ted Yu commented on YARN-7346:
--
When looking at the contents of mapreduce.tar.gz for hadoop3 beta1:
{code}
-rw-r--r-- jenkins/users 1304466 2017-10-17 16:16 hadoop/share/hadoop/yarn/lib/hbase-client-1.2.6.jar
-rw-r--r-- jenkins/users 4179597 2017-10-17 16:16 hadoop/share/hadoop/yarn/lib/hbase-server-1.2.6.jar
-rw-r--r-- jenkins/users 580945 2017-10-17 16:16 hadoop/share/hadoop/yarn/lib/hbase-common-1.2.6.jar
-rw-r--r-- jenkins/users 4365774 2017-10-17 16:16 hadoop/share/hadoop/yarn/lib/hbase-protocol-1.2.6.jar
-rw-r--r-- jenkins/users 100710 2017-10-17 16:16 hadoop/share/hadoop/yarn/lib/hbase-hadoop2-compat-1.2.6.jar
{code}
The above wouldn't work for hbase2 release.

When can hbase developers have artifact which uses hbase2 alpha4 or later?

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
> Priority: Major
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
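The check described in the comment above (looking through the tarball listing for bundled hbase 1.x jars) can be automated. The following is only a rough sketch, assuming jar names of the form `hbase-<module>-<major>.<minor>.<patch>.jar` as in the listing; the class and method names are hypothetical, and jars with qualifier suffixes (for example a `-beta-1` version) are deliberately not matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HBaseJarScan {
    // Matches e.g. "hbase-client-1.2.6.jar" at the end of a path; group 2 is the major version.
    private static final Pattern HBASE_JAR =
        Pattern.compile("(hbase-[a-z0-9-]+?)-(\\d+)\\.\\d+\\.\\d+\\.jar$");

    /** Returns the file names of hbase jars whose major version is below requiredMajor. */
    static List<String> jarsBelowMajor(List<String> paths, int requiredMajor) {
        List<String> hits = new ArrayList<>();
        for (String path : paths) {
            Matcher m = HBASE_JAR.matcher(path);
            if (m.find() && Integer.parseInt(m.group(2)) < requiredMajor) {
                hits.add(m.group(0)); // the jar file name without its directory
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Paths taken from the listing in the comment above.
        List<String> paths = List.of(
            "hadoop/share/hadoop/yarn/lib/hbase-client-1.2.6.jar",
            "hadoop/share/hadoop/yarn/lib/hbase-hadoop2-compat-1.2.6.jar");
        for (String jar : jarsBelowMajor(paths, 2)) {
            System.out.println("pre-2.x hbase jar on classpath: " + jar);
        }
    }
}
```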
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220795#comment-16220795 ]

Ted Yu commented on YARN-7346:
--
Please watch HBASE-19092

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-1869) Access to zkAcl should be synchronized in ZKRMStateStore#addStoreOrUpdateOps()
[ https://issues.apache.org/jira/browse/YARN-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214487#comment-16214487 ]

Ted Yu commented on YARN-1869:
--
Currently addStoreOrUpdateOps() has 4 arguments, instead of 5.

> Access to zkAcl should be synchronized in ZKRMStateStore#addStoreOrUpdateOps()
> --
>
> Key: YARN-1869
> URL: https://issues.apache.org/jira/browse/YARN-1869
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Ted Yu
> Priority: Minor
> Attachments: yarn-1869.patch
>
> Here is related code:
> {code}
>     } else {
>       opList.add(Op.create(nodeCreatePath, tokenOs.toByteArray(), zkAcl,
>           CreateMode.PERSISTENT));
>     }
> {code}
> The other methods accessing zkAcl are synchronized.
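The concern in the issue description above is that one method reads a shared field without holding the lock that every other accessor takes. A minimal, self-contained sketch of the pattern follows. It is not the actual ZKRMStateStore code: the class `ZkAclHolder`, its methods, and the use of strings in place of ZooKeeper ACL and Op objects are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ZkAclHolder {
    // Shared mutable state; stands in for the zkAcl field named in the issue.
    private List<String> zkAcl = new ArrayList<>();

    /** Writer: replaces the ACL list while holding the object's monitor. */
    synchronized void setAcl(List<String> newAcl) {
        zkAcl = new ArrayList<>(newAcl);
    }

    /**
     * Reader: builds a description of a "create" operation that uses the current ACL.
     * Declaring it synchronized means it takes the same monitor as setAcl, so it
     * always sees the most recently stored list. Without the keyword, the Java
     * memory model would allow this method to observe a stale reference.
     */
    synchronized List<String> buildCreateOp(String path) {
        List<String> op = new ArrayList<>();
        op.add(path);
        op.addAll(zkAcl);
        return op;
    }

    public static void main(String[] args) {
        ZkAclHolder holder = new ZkAclHolder();
        holder.setAcl(List.of("world:anyone:rwcda")); // hypothetical ACL string
        System.out.println(holder.buildCreateOp("/example/path"));
    }
}
```

If the field is only ever assigned once before any other thread can see the object, making it `final` is a simpler alternative to synchronizing every reader.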
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210290#comment-16210290 ]

Ted Yu commented on YARN-7346:
--
bq. few bugs causing ATSv2 unit tests failure

Please surface the bug(s) if 2.0.0-alpha4-SNAPSHOT still has it.

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Vrushali C
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208639#comment-16208639 ]

Ted Yu commented on YARN-7346:
--
2.0.0-alpha4 hasn't come out yet.
Please build / install hbase-2 locally.

I normally use the following command line parameters:
{code}
-Phadoop-3.0 -Dhadoop-three.version=3.0.0-beta1 -Dhadoop-two.version=3.0.0-beta1 -Djetty.version=9.3.19.v20170502
{code}

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Vrushali C
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Commented] (YARN-7346) Fix compilation errors against hbase2 alpha release
[ https://issues.apache.org/jira/browse/YARN-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208111#comment-16208111 ]

Ted Yu commented on YARN-7346:
--
Please build 2.0.0-alpha4-SNAPSHOT locally before hbase 2 alpha4 is released - hbase APIs are still moving.

> Fix compilation errors against hbase2 alpha release
> ---
>
> Key: YARN-7346
> URL: https://issues.apache.org/jira/browse/YARN-7346
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Vrushali C
>
> When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3,
> I got the following errors:
> https://pastebin.com/Ms4jYEVB
> This issue is to fix the compilation errors.
[jira] [Created] (YARN-7346) Fix compilation errors against hbase2 alpha release
Ted Yu created YARN-7346:

Summary: Fix compilation errors against hbase2 alpha release
Key: YARN-7346
URL: https://issues.apache.org/jira/browse/YARN-7346
Project: Hadoop YARN
Issue Type: Bug
Reporter: Ted Yu

When compiling hadoop-yarn-server-timelineservice-hbase against 2.0.0-alpha3, I got the following errors:
https://pastebin.com/Ms4jYEVB
This issue is to fix the compilation errors.
[jira] [Commented] (YARN-6707) [ATSv2] Update HBase version to 1.2.6
[ https://issues.apache.org/jira/browse/YARN-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045588#comment-16045588 ]

Ted Yu commented on YARN-6707:
--
It would take some time for hbase community to agree on the next stable release.

Please go ahead with commit.

> [ATSv2] Update HBase version to 1.2.6
> -
>
> Key: YARN-6707
> URL: https://issues.apache.org/jira/browse/YARN-6707
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-5355
> Reporter: Varun Saxena
> Assignee: Vrushali C
> Attachments: YARN-6707-YARN-5355.001.patch
[jira] [Commented] (YARN-6707) [ATSv2] Update HBase version to 1.2.6
[ https://issues.apache.org/jira/browse/YARN-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045344#comment-16045344 ]

Ted Yu commented on YARN-6707:
--
hbase 1.3.1 has been released.

Do you want to use that?

> [ATSv2] Update HBase version to 1.2.6
> -
>
> Key: YARN-6707
> URL: https://issues.apache.org/jira/browse/YARN-6707
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-5355
> Reporter: Varun Saxena
> Assignee: Vrushali C
> Attachments: YARN-6707-YARN-5355.001.patch
[jira] [Resolved] (YARN-1872) DistributedShell occasionally keeps running endlessly
[ https://issues.apache.org/jira/browse/YARN-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved YARN-1872.
--
Resolution: Cannot Reproduce

> DistributedShell occasionally keeps running endlessly
> -
>
> Key: YARN-1872
> URL: https://issues.apache.org/jira/browse/YARN-1872
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Hong Zhiguo
> Attachments: TestDistributedShell.out, YARN-1872.patch
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/520/console :
> TestDistributedShell#testDSShellWithCustomLogPropertyFile failed and
> TestDistributedShell#testDSShell timed out.
[jira] [Commented] (YARN-1872) DistributedShell occasionally keeps running endlessly
[ https://issues.apache.org/jira/browse/YARN-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658592#comment-15658592 ]

Ted Yu commented on YARN-1872:
--
Looks like this is no longer an issue.

> DistributedShell occasionally keeps running endlessly
> -
>
> Key: YARN-1872
> URL: https://issues.apache.org/jira/browse/YARN-1872
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Hong Zhiguo
> Attachments: TestDistributedShell.out, YARN-1872.patch
>
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/520/console :
> TestDistributedShell#testDSShellWithCustomLogPropertyFile failed and
> TestDistributedShell#testDSShell timed out.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5579) Resourcemanager should surface failed state store operation prominently
[ https://issues.apache.org/jira/browse/YARN-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated YARN-5579:
-
Labels: states (was: )

> Resourcemanager should surface failed state store operation prominently
> ---
>
> Key: YARN-5579
> URL: https://issues.apache.org/jira/browse/YARN-5579
> Project: Hadoop YARN
> Issue Type: Task
> Affects Versions: 2.7.3
> Reporter: Ted Yu
> Labels: states
>
> I found the following in Resourcemanager log when I tried to figure out why application got stuck in NEW_SAVING state.
> {code}
> 2016-08-29 18:14:23,486 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1242)) - Maxed out ZK retries. Giving up!
> 2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:transition(205)) - Error storing app: application_1470517915158_0001
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
>     at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
>     at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:1042)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:639)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:201)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:183)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>     at java.lang.Thread.run(Thread.java:745)
> 2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(987)) - State store operation failed
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
>     at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
>     at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
>     at
>
[jira] [Updated] (YARN-5579) Resourcemanager should surface failed state store operation prominently
[ https://issues.apache.org/jira/browse/YARN-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated YARN-5579:
-
Affects Version/s: 2.7.3

> Resourcemanager should surface failed state store operation prominently
> ---
>
> Key: YARN-5579
> URL: https://issues.apache.org/jira/browse/YARN-5579
> Project: Hadoop YARN
> Issue Type: Task
> Affects Versions: 2.7.3
> Reporter: Ted Yu
>
> I found the following in Resourcemanager log when I tried to figure out why application got stuck in NEW_SAVING state.
> {code}
> 2016-08-29 18:14:23,486 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1242)) - Maxed out ZK retries. Giving up!
> 2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:transition(205)) - Error storing app: application_1470517915158_0001
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
>     at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
>     at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:1042)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:639)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:201)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:183)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>     at java.lang.Thread.run(Thread.java:745)
> 2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(987)) - State store operation failed
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
>     at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
>     at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
>     at
>
[jira] [Created] (YARN-5579) Resourcemanager should surface failed state store operation prominently
Ted Yu created YARN-5579: Summary: Resourcemanager should surface failed state store operation prominently Key: YARN-5579 URL: https://issues.apache.org/jira/browse/YARN-5579 Project: Hadoop YARN Issue Type: Task Reporter: Ted Yu I found the following in Resourcemanager log when I tried to figure out why application got stuck in NEW_SAVING state. {code} 2016-08-29 18:14:23,486 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1242)) - Maxed out ZK retries. Giving up! 2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:transition(205)) - Error storing app: application_1470517915158_0001 org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed at org.apache.zookeeper.KeeperException.create(KeeperException.java:123) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:1042) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:639) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:201) at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:183) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) at java.lang.Thread.run(Thread.java:745) 2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(987)) - State store operation failed org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed at org.apache.zookeeper.KeeperException.create(KeeperException.java:123) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207) at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009) {code} Resourcemanager should surface the above error prominently. Likely subsequent application submission would encounter the same error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172795#comment-15172795 ]

Ted Yu commented on YARN-4736:

bq. so planning to test with hbase-1.0.3 tar.

There have been more releases since 1.0.3, e.g. you can try out the 1.2.0 release.
BufferedMutatorImpl#flush() appeared in the stack trace. However, if the HBase cluster was shut down, the flush wouldn't succeed.
I haven't seen the above issue happen on a live 1.x cluster.

> Issues with HBaseTimelineWriterImpl
>
> Key: YARN-4736
> URL: https://issues.apache.org/jira/browse/YARN-4736
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Naganarasimha G R
> Assignee: Vrushali C
> Priority: Critical
> Labels: yarn-2928-1st-milestone
> Attachments: hbaseException.log, threaddump.log
>
> Faced some issues while running ATSv2 in a single-node Hadoop cluster; HBase with embedded ZooKeeper was launched on the same node.
> # Due to some NPE issues I could see that the NM was trying to shut down, but the NM daemon process did not complete the shutdown due to the locks.
> # Got some exceptions related to HBase after the application finished execution successfully.
> Will attach logs and the trace for the same.
[jira] [Resolved] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved YARN-3025.
Resolution: Later

Provide API for retrieving blacklisted nodes

Key: YARN-3025
URL: https://issues.apache.org/jira/browse/YARN-3025
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Ted Yu
Assignee: Ted Yu
Attachments: yarn-3025-v1.txt, yarn-3025-v2.txt, yarn-3025-v3.txt

We have the following method which updates blacklist:
{code}
public synchronized void updateBlacklist(List<String> blacklistAdditions,
    List<String> blacklistRemovals) {
{code}
Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions.
The new API can be:
{code}
public synchronized List<String> getBlacklistedNodes()
{code}
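The proposal in YARN-3025 can be pictured with a small stand-alone sketch. The class name `BlacklistTracker`, the backing set, and the order in which additions and removals are applied are all assumptions made for illustration; only the two method signatures come from the issue text, and this is not the code in the attached patches.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the class that owns the blacklist.
class BlacklistTracker {
  // Insertion-ordered set so the returned list has a stable order.
  private final Set<String> blacklist = new LinkedHashSet<>();

  // Same signature as the method quoted in the issue.
  // Assumption: additions are applied before removals.
  public synchronized void updateBlacklist(List<String> blacklistAdditions,
      List<String> blacklistRemovals) {
    if (blacklistAdditions != null) {
      blacklist.addAll(blacklistAdditions);
    }
    if (blacklistRemovals != null) {
      blacklist.removeAll(blacklistRemovals);
    }
  }

  // Proposed getter: returns a read-only snapshot, so a caller never
  // observes later updates through the returned list.
  public synchronized List<String> getBlacklistedNodes() {
    return Collections.unmodifiableList(new ArrayList<>(blacklist));
  }
}
```

A new AM attempt could call `getBlacklistedNodes()` once after failover and seed its own bookkeeping from the result. Returning a copy rather than the live set keeps the synchronized methods as the only way to touch shared state.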
[jira] [Updated] (YARN-1869) Access to zkAcl should be synchronized in ZKRMStateStore#addStoreOrUpdateOps()
[ https://issues.apache.org/jira/browse/YARN-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated YARN-1869:
Description:
Here is related code:
{code}
} else {
  opList.add(Op.create(nodeCreatePath, tokenOs.toByteArray(), zkAcl, CreateMode.PERSISTENT));
}
{code}
The other methods accessing zkAcl are synchronized.

was:
Here is related code:
{code}
} else {
  opList.add(Op.create(nodeCreatePath, tokenOs.toByteArray(), zkAcl, CreateMode.PERSISTENT));
}
{code}
The other methods accessing zkAcl are synchronized.

Access to zkAcl should be synchronized in ZKRMStateStore#addStoreOrUpdateOps()

Key: YARN-1869
URL: https://issues.apache.org/jira/browse/YARN-1869
Project: Hadoop YARN
Issue Type: Bug
Reporter: Ted Yu
Priority: Minor
Attachments: yarn-1869.patch

Here is related code:
{code}
} else {
  opList.add(Op.create(nodeCreatePath, tokenOs.toByteArray(), zkAcl, CreateMode.PERSISTENT));
}
{code}
The other methods accessing zkAcl are synchronized.
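The concern in YARN-1869 is that a field written under a lock is read without it. A minimal sketch of the consistent version follows. `AclStore`, its `acl` field (standing in for zkAcl), and the string-based operations are hypothetical and do not use the ZooKeeper API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of a state store; "acl" plays the role of zkAcl.
class AclStore {
  private List<String> acl = new ArrayList<>();
  private final List<String> ops = new ArrayList<>();

  // Writers of the shared field hold the object's monitor...
  public synchronized void setAcl(List<String> newAcl) {
    acl = new ArrayList<>(newAcl);
  }

  // ...so the method that reads the field while building operations
  // holds the same monitor and always sees a fully published value.
  public synchronized void addCreateOp(String path) {
    ops.add("create " + path + " acl=" + acl);
  }

  // Snapshot of the operations built so far.
  public synchronized List<String> getOps() {
    return new ArrayList<>(ops);
  }
}
```

Without `synchronized` on `addCreateOp`, the Java memory model gives no guarantee that a thread building operations sees the most recent list assigned by `setAcl`.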
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598781#comment-14598781 ]

Ted Yu commented on YARN-3815:

[~jrottinghuis]:
Your description makes sense.
Cell tag is supported since hbase 0.98+ so we can use it to mark completion.

[Aggregation] Application/Flow/User/Queue Level Aggregations

Key: YARN-3815
URL: https://issues.apache.org/jira/browse/YARN-3815
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
Attachments: Timeline Service Nextgen Flow, User, Queue Level Aggregations (v1).pdf

Per previous discussions in some design documents for YARN-2928, the basic scenario is the query for stats can happen on:
- Application level, expect return: an application with aggregated stats
- Flow level, expect return: aggregated stats for a flow_run, flow_version and flow
- User level, expect return: aggregated stats for applications submitted by user
- Queue level, expect return: aggregated stats for applications within the Queue
Application states is the basic building block for all other level aggregations. We can provide Flow/User/Queue level aggregated statistics info based on application states (a dedicated table for application states is needed which is missing from previous design documents like HBase/Phoenix schema design).
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596173#comment-14596173 ]

Ted Yu commented on YARN-3815:

My comment is related to usage of hbase.

bq. under framework_specific_metrics column family

Since column family name appears in every KeyValue, it would be better to use very short column family name.
e.g. f_m for framework metrics.

[Aggregation] Application/Flow/User/Queue Level Aggregations

Key: YARN-3815
URL: https://issues.apache.org/jira/browse/YARN-3815
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
Attachments: Timeline Service Nextgen Flow, User, Queue Level Aggregations (v1).pdf

Per previous discussions in some design documents for YARN-2928, the basic scenario is the query for stats can happen on:
- Application level, expect return: an application with aggregated stats
- Flow level, expect return: aggregated stats for a flow_run, flow_version and flow
- User level, expect return: aggregated stats for applications submitted by user
- Queue level, expect return: aggregated stats for applications within the Queue
Application states is the basic building block for all other level aggregations. We can provide Flow/User/Queue level aggregated statistics info based on application states (a dedicated table for application states is needed which is missing from previous design documents like HBase/Phoenix schema design).
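The size argument in the comment above can be made concrete with a little arithmetic. The sketch below only counts the bytes of the family name itself and ignores any block encoding or compression the table may use. The class and method names are hypothetical, and the cell count in the usage note is an arbitrary example.

```java
import java.nio.charset.StandardCharsets;

class FamilyNameOverhead {
  // Bytes a family name occupies each time it is stored with a cell.
  static int familyBytes(String family) {
    return family.getBytes(StandardCharsets.UTF_8).length;
  }

  // Raw bytes saved across `cells` cells by switching from longName
  // to shortName (before any encoding or compression).
  static long bytesSaved(String longName, String shortName, long cells) {
    return (long) (familyBytes(longName) - familyBytes(shortName)) * cells;
  }
}
```

`framework_specific_metrics` is 26 bytes and `f_m` is 3, so each cell carries 23 fewer bytes of family name. Over one million cells that is 23,000,000 bytes of raw key data.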
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596616#comment-14596616 ]

Ted Yu commented on YARN-3815:

bq. in the spirit of readless increments as used in Tephra

Readless increment feature is implemented in cdap, called delta write.
Please take a look at:
cdap-hbase-compat-0.98/src/main/java/co/cask/cdap/data2/increment/hbase98/IncrementHandler.java
cdap-hbase-compat-0.98//src/main/java/co/cask/cdap/data2/increment/hbase98/IncrementSummingScanner.java
The implementation uses hbase coprocessor, BTW

[Aggregation] Application/Flow/User/Queue Level Aggregations

Key: YARN-3815
URL: https://issues.apache.org/jira/browse/YARN-3815
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
Attachments: Timeline Service Nextgen Flow, User, Queue Level Aggregations (v1).pdf

Per previous discussions in some design documents for YARN-2928, the basic scenario is the query for stats can happen on:
- Application level, expect return: an application with aggregated stats
- Flow level, expect return: aggregated stats for a flow_run, flow_version and flow
- User level, expect return: aggregated stats for applications submitted by user
- Queue level, expect return: aggregated stats for applications within the Queue
Application states is the basic building block for all other level aggregations. We can provide Flow/User/Queue level aggregated statistics info based on application states (a dedicated table for application states is needed which is missing from previous design documents like HBase/Phoenix schema design).
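For readers unfamiliar with the idea, a readless increment ("delta write") can be illustrated with a toy in-memory model: the write path appends a delta without reading the current value, and the read path sums the stored deltas. Everything below is a hypothetical sketch; it does not reproduce the cdap classes named above or use any HBase coprocessor API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of delta writes; not the cdap or HBase implementation.
class DeltaCounterStore {
  private final Map<String, List<Long>> deltas = new HashMap<>();

  // Write path: append the delta; no read of the current value is needed.
  void increment(String key, long delta) {
    deltas.computeIfAbsent(key, k -> new ArrayList<>()).add(delta);
  }

  // Read path: sum all stored deltas, as a summing scanner would.
  long read(String key) {
    long sum = 0;
    for (long d : deltas.getOrDefault(key, new ArrayList<>())) {
      sum += d;
    }
    return sum;
  }

  // Compaction: collapse the deltas into one value to bound read cost.
  void compact(String key) {
    List<Long> merged = new ArrayList<>();
    merged.add(read(key));
    deltas.put(key, merged);
  }

  // Number of stored entries for a key, to make compaction observable.
  int storedCells(String key) {
    return deltas.getOrDefault(key, new ArrayList<>()).size();
  }
}
```

The trade-off is that writes become cheap appends while reads do more work until a compaction step merges the deltas.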
[jira] [Reopened] (YARN-2764) counters.LimitExceededException shouldn't abort AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu reopened YARN-2764: -- counters.LimitExceededException shouldn't abort AsyncDispatcher --- Key: YARN-2764 URL: https://issues.apache.org/jira/browse/YARN-2764 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Ted Yu Labels: counters I saw the following in container log: {code} 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attemptattempt_1414221548789_0023_r_03_0 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1414221548789_0023_r_03 Task Transitioned from RUNNING to SUCCEEDED 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 24 2014-10-25 10:28:55,053 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1414221548789_0023Job Transitioned from RUNNING to COMMITTING 2014-10-25 10:28:55,054 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT 2014-10-25 10:28:55,177 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120 at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:101) at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:106) at 
org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:203) at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:348) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1754) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1737) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1718) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:1089) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2049) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2045) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:996) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1289) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1285) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) 2014-10-25 10:28:55,185 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. 
{code} Counter limit was exceeded when JobFinishedEvent was created. Better handling of LimitExceededException should be provided so that AsyncDispatcher can continue functioning.
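One way to picture the requested behavior is a dispatch loop that records a recoverable handler failure and moves on to the next event instead of exiting. The sketch below is single-threaded and hypothetical: `TolerantDispatcher` is not the YARN AsyncDispatcher, and `IllegalStateException` merely stands in for LimitExceededException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, single-threaded stand-in for an event dispatcher.
class TolerantDispatcher {
  private final List<String> failures = new ArrayList<>();

  // Dispatches every event and returns how many were handled.
  // A recoverable handler failure is recorded instead of ending the loop.
  int dispatchAll(List<String> events, Consumer<String> handler) {
    int handled = 0;
    for (String event : events) {
      try {
        handler.accept(event);
        handled++;
      } catch (IllegalStateException e) { // stand-in for LimitExceededException
        failures.add(event + ": " + e.getMessage());
      }
    }
    return handled;
  }

  // Failures seen so far, so they can still be surfaced to an operator.
  List<String> getFailures() {
    return failures;
  }
}
```

Catching only a narrow exception type is deliberate in this sketch: unexpected errors still propagate, while a known, non-fatal condition such as a counter limit does not take the dispatcher down.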
[jira] [Resolved] (YARN-2350) TestApplicationMasterServiceOnHA fails with InvalidToken exception
[ https://issues.apache.org/jira/browse/YARN-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved YARN-2350. -- Resolution: Cannot Reproduce TestApplicationMasterServiceOnHA fails with InvalidToken exception -- Key: YARN-2350 URL: https://issues.apache.org/jira/browse/YARN-2350 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/622 : {code} Running org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 8.591 sec FAILURE! - in org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA testAllocateOnHA(org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA) Time elapsed: 8.408 sec ERROR! org.apache.hadoop.security.token.SecretManager$InvalidToken: Given AMRMToken for application : appattempt_1000_0001_00 seems to have been generated illegally. at org.apache.hadoop.ipc.Client.call(Client.java:1411) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy85.allocate(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy86.allocate(Unknown Source) at org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA.testAllocateOnHA(TestApplicationMasterServiceOnHA.java:84) {code} This is reproducible 
locally.
[jira] [Resolved] (YARN-1296) schedulerAllocateTimer is accessed without holding samplerLock in ResourceSchedulerWrapper
[ https://issues.apache.org/jira/browse/YARN-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved YARN-1296.
Resolution: Later

schedulerAllocateTimer is accessed without holding samplerLock in ResourceSchedulerWrapper

Key: YARN-1296
URL: https://issues.apache.org/jira/browse/YARN-1296
Project: Hadoop YARN
Issue Type: Bug
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Minor
Attachments: yarn-1296-v1.patch

Here is related code:
{code}
public Allocation allocate(ApplicationAttemptId attemptId,
    List<ResourceRequest> resourceRequests,
    List<ContainerId> containerIds,
    List<String> strings, List<String> strings2) {
  if (metricsON) {
    final Timer.Context context = schedulerAllocateTimer.time();
{code}
samplerLock should be used to guard the access.
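The guard requested in YARN-1296 follows the usual lock/try/finally pattern. The sketch below assumes the lock is a `java.util.concurrent.locks.Lock`; `GuardedTimer` and its fields are hypothetical and replace the metrics timer with two plain counters so the example stays self-contained.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical miniature of a wrapper whose timing state can be reset at runtime.
class GuardedTimer {
  private final Lock samplerLock = new ReentrantLock();
  private long totalNanos; // guarded by samplerLock
  private int samples;     // guarded by samplerLock

  // Records one sample; the guarded fields are only touched under the lock.
  void record(long nanos) {
    samplerLock.lock();
    try {
      totalNanos += nanos;
      samples++;
    } finally {
      samplerLock.unlock();
    }
  }

  // Resets the state, as a metrics re-registration might; also under the lock.
  void reset() {
    samplerLock.lock();
    try {
      totalNanos = 0;
      samples = 0;
    } finally {
      samplerLock.unlock();
    }
  }

  // Mean of the recorded samples, or 0 when none were recorded.
  long meanNanos() {
    samplerLock.lock();
    try {
      return samples == 0 ? 0 : totalNanos / samples;
    } finally {
      samplerLock.unlock();
    }
  }
}
```

Releasing the lock in `finally` matters here: an exception thrown while timing must not leave the lock held and block every later allocation call.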
[jira] [Resolved] (YARN-2178) TestApplicationMasterService sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved YARN-2178.
Resolution: Cannot Reproduce

TestApplicationMasterService sometimes fails in trunk

Key: YARN-2178
URL: https://issues.apache.org/jira/browse/YARN-2178
Project: Hadoop YARN
Issue Type: Test
Reporter: Ted Yu
Priority: Minor
Labels: test

From https://builds.apache.org/job/Hadoop-Yarn-trunk/587/ :
{code}
Running org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 55.763 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
testInvalidContainerReleaseRequest(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService) Time elapsed: 41.336 sec <<< FAILURE!
java.lang.AssertionError: AppAttempt state is not correct (timedout) expected:<ALLOCATED> but was:<SCHEDULED>
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.failNotEquals(Assert.java:743)
  at org.junit.Assert.assertEquals(Assert.java:118)
  at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
  at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:401)
  at org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService.testInvalidContainerReleaseRequest(TestApplicationMasterService.java:143)
{code}
[jira] [Resolved] (YARN-2764) counters.LimitExceededException shouldn't abort AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved YARN-2764. -- Resolution: Later counters.LimitExceededException shouldn't abort AsyncDispatcher --- Key: YARN-2764 URL: https://issues.apache.org/jira/browse/YARN-2764 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Ted Yu Labels: counters I saw the following in container log: {code} 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attemptattempt_1414221548789_0023_r_03_0 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1414221548789_0023_r_03 Task Transitioned from RUNNING to SUCCEEDED 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 24 2014-10-25 10:28:55,053 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1414221548789_0023Job Transitioned from RUNNING to COMMITTING 2014-10-25 10:28:55,054 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT 2014-10-25 10:28:55,177 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120 at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:101) at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:106) at 
org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:203) at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:348) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1754) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1737) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1718) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:1089) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2049) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2045) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:996) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1289) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1285) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) 2014-10-25 10:28:55,185 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. 
{code} Counter limit was exceeded when JobFinishedEvent was created. Better handling of LimitExceededException should be provided so that AsyncDispatcher can continue functioning.
[jira] [Resolved] (YARN-2133) Make entity Id specification in TestTimelineWebServices amenable for future test cases
[ https://issues.apache.org/jira/browse/YARN-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved YARN-2133.
Resolution: Later

Make entity Id specification in TestTimelineWebServices amenable for future test cases

Key: YARN-2133
URL: https://issues.apache.org/jira/browse/YARN-2133
Project: Hadoop YARN
Issue Type: Test
Reporter: Ted Yu
Priority: Minor

Currently each test case in TestTimelineWebServices uses different entity Ids / types.
When a new test case is added, the developer has to go over the existing cases and find an unused entity Id.
A unique entity Id can be specified by introducing an AtomicInteger field in TestTimelineWebServices that is incremented at the beginning of each test.
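The AtomicInteger idea from YARN-2133 fits in a few lines. The class name `EntityIdSource` and the `id_` prefix are assumptions for illustration; the issue only says that a counter should be incremented at the beginning of each test.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper a test class could use instead of hand-picked IDs.
class EntityIdSource {
  private static final AtomicInteger NEXT = new AtomicInteger();

  // Call at the start of each test to obtain an ID no other test has used.
  static String nextEntityId() {
    return "id_" + NEXT.incrementAndGet();
  }
}
```

Because `incrementAndGet` is atomic, the IDs stay unique even if test methods run in parallel, and a developer adding a test no longer needs to scan existing cases for a free ID.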
[jira] [Updated] (YARN-2764) counters.LimitExceededException shouldn't abort AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-2764: - Description: I saw the following in container log: {code} 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attemptattempt_1414221548789_0023_r_03_0 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1414221548789_0023_r_03 Task Transitioned from RUNNING to SUCCEEDED 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 24 2014-10-25 10:28:55,053 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1414221548789_0023Job Transitioned from RUNNING to COMMITTING 2014-10-25 10:28:55,054 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT 2014-10-25 10:28:55,177 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120 at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:101) at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:106) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:203) at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:348) at 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1754) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1737) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1718) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:1089) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2049) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2045) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:996) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1289) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1285) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) 2014-10-25 10:28:55,185 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. {code} Counter limit was exceeded when JobFinishedEvent was created. Better handling of LimitExceededException should be provided so that AsyncDispatcher can continue functioning. 
was: I saw the following in container log: {code} 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attemptattempt_1414221548789_0023_r_03_0 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1414221548789_0023_r_03 Task Transitioned from RUNNING to SUCCEEDED 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 24 2014-10-25 10:28:55,053 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1414221548789_0023Job Transitioned from RUNNING to COMMITTING 2014-10-25 10:28:55,054 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT 2014-10-25 10:28:55,177 FATAL
[jira] [Commented] (YARN-2706) Math.abs() is called on random integer in DefaultContainerExecutor#getWorkingDir()
[ https://issues.apache.org/jira/browse/YARN-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341605#comment-14341605 ]

Ted Yu commented on YARN-2706:

lgtm

Math.abs() is called on random integer in DefaultContainerExecutor#getWorkingDir()

Key: YARN-2706
URL: https://issues.apache.org/jira/browse/YARN-2706
Project: Hadoop YARN
Issue Type: Bug
Reporter: Ted Yu
Assignee: haosdent
Priority: Minor
Attachments: YARN-2706.patch

Here is the code:
{code}
long randomPosition = Math.abs(r.nextLong()) % totalAvailable;
{code}
See http://stackoverflow.com/questions/7567350/findbugs-rv-absolute-value-of-random-int-warning
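The problem flagged in YARN-2706 is that `Math.abs(Long.MIN_VALUE)` is still `Long.MIN_VALUE`, so the quoted expression can produce a negative position. The sketch below contrasts that form with two common alternatives. The class and method names are hypothetical, and these are not necessarily what the attached patch does.

```java
class RandomPosition {
  // Form quoted in the issue: negative when value == Long.MIN_VALUE,
  // because Math.abs cannot represent the positive counterpart.
  static long viaAbs(long value, long totalAvailable) {
    return Math.abs(value) % totalAvailable;
  }

  // Clearing the sign bit always yields a value in [0, Long.MAX_VALUE].
  static long viaMask(long value, long totalAvailable) {
    return (value & Long.MAX_VALUE) % totalAvailable;
  }

  // Math.floorMod returns a result in [0, totalAvailable) for a positive modulus.
  static long viaFloorMod(long value, long totalAvailable) {
    return Math.floorMod(value, totalAvailable);
  }
}
```

With `value = Long.MIN_VALUE` and `totalAvailable = 10`, `viaAbs` returns -8, while `viaMask` returns 0 and `viaFloorMod` returns 2. Both alternatives assume `totalAvailable` is positive.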
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338736#comment-14338736 ]

Ted Yu commented on YARN-3025:

Ping [~zjshen]

Provide API for retrieving blacklisted nodes

Key: YARN-3025
URL: https://issues.apache.org/jira/browse/YARN-3025
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Ted Yu
Assignee: Ted Yu
Attachments: yarn-3025-v1.txt, yarn-3025-v2.txt, yarn-3025-v3.txt

We have the following method which updates blacklist:
{code}
public synchronized void updateBlacklist(List<String> blacklistAdditions,
    List<String> blacklistRemovals) {
{code}
Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions.
The new API can be:
{code}
public synchronized List<String> getBlacklistedNodes()
{code}
[jira] [Commented] (YARN-2777) Mark the end of individual log in aggregated log
[ https://issues.apache.org/jira/browse/YARN-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338746#comment-14338746 ]

Ted Yu commented on YARN-2777:

@Varun:
{code}
out.println("End of LogType:");
out.println(fileType);
{code}
Can you put the above two onto the same line?
Thanks

Mark the end of individual log in aggregated log

Key: YARN-2777
URL: https://issues.apache.org/jira/browse/YARN-2777
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Ted Yu
Assignee: Varun Saxena
Labels: log-aggregation
Attachments: YARN-2777.001.patch

Below is snippet of aggregated log showing hbase master log:
{code}
LogType: hbase-hbase-master-ip-172-31-34-167.log
LogUploadTime: 29-Oct-2014 22:31:55
LogLength: 24103045
Log Contents:
Wed Oct 29 15:43:57 UTC 2014 Starting master on ip-172-31-34-167
...
  at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
  at org.apache.hadoop.hbase.Chore.run(Chore.java:80)
  at java.lang.Thread.run(Thread.java:745)
LogType: hbase-hbase-master-ip-172-31-34-167.out
{code}
Since logs from various daemons are aggregated in one log file, it would be desirable to mark the end of one log before starting with the next.
e.g. with such a line:
{code}
End of LogType: hbase-hbase-master-ip-172-31-34-167.log
{code}
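The review request above amounts to emitting the marker and the log name with a single `println`, so the output matches the one-line form proposed in the issue. A minimal sketch follows; the class `LogMarker` and the method `printEndMarker` are hypothetical names, not code from the attached patch.

```java
import java.io.PrintStream;

class LogMarker {
  // Writes the end marker and the log name on one line, for example
  // "End of LogType: hbase-hbase-master-ip-172-31-34-167.log".
  static void printEndMarker(PrintStream out, String fileType) {
    out.println("End of LogType: " + fileType);
  }
}
```

Keeping the marker on one line means a reader can find the end of a given log with a single-line text search, which two separate `println` calls would not allow.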
[jira] [Commented] (YARN-2777) Mark the end of individual log in aggregated log
[ https://issues.apache.org/jira/browse/YARN-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338975#comment-14338975 ]

Ted Yu commented on YARN-2777:

lgtm

Mark the end of individual log in aggregated log

Key: YARN-2777
URL: https://issues.apache.org/jira/browse/YARN-2777
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Ted Yu
Assignee: Varun Saxena
Labels: log-aggregation
Attachments: YARN-2777.001.patch, YARN-2777.002.patch

Below is snippet of aggregated log showing hbase master log:
{code}
LogType: hbase-hbase-master-ip-172-31-34-167.log
LogUploadTime: 29-Oct-2014 22:31:55
LogLength: 24103045
Log Contents:
Wed Oct 29 15:43:57 UTC 2014 Starting master on ip-172-31-34-167
...
  at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
  at org.apache.hadoop.hbase.Chore.run(Chore.java:80)
  at java.lang.Thread.run(Thread.java:745)
LogType: hbase-hbase-master-ip-172-31-34-167.out
{code}
Since logs from various daemons are aggregated in one log file, it would be desirable to mark the end of one log before starting with the next.
e.g. with such a line:
{code}
End of LogType: hbase-hbase-master-ip-172-31-34-167.log
{code}
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324607#comment-14324607 ] Ted Yu commented on YARN-3025: -- I talked to Jian He; he suggested adding a field in AllocateResponse so that ApplicationMasterProtocol#allocate() can be enhanced to return blacklisted nodes. [~zjshen]: What do you think? Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu Assignee: Ted Yu Attachments: yarn-3025-v1.txt, yarn-3025-v2.txt, yarn-3025-v3.txt We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-3025: - Attachment: yarn-3025-v3.txt work in progress: need to add the PBImpl classes. Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu Assignee: Ted Yu Attachments: yarn-3025-v1.txt, yarn-3025-v2.txt, yarn-3025-v3.txt We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-3025: - Attachment: yarn-3025-v2.txt Patch v2 does what was proposed above. The next step is to add a getter for blacklisted nodes in ApplicationMasterProtocol. Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu Assignee: Ted Yu Attachments: yarn-3025-v1.txt, yarn-3025-v2.txt We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320885#comment-14320885 ] Ted Yu commented on YARN-3025: -- Looking into ApplicationMasterService#allocate(): {code} Allocation allocation = this.rScheduler.allocate(appAttemptId, ask, release, blacklistAdditions, blacklistRemovals); {code} Blacklist information can be retrieved from YarnScheduler. How about adding the following API to YarnScheduler? {code} List<String> getBlacklistedNodes(AllocationId); {code} Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu Assignee: Ted Yu Attachments: yarn-3025-v1.txt We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu reassigned YARN-3025: Assignee: Ted Yu Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu Assignee: Ted Yu We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2764) counters.LimitExceededException shouldn't abort AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-2764: - Labels: counters (was: ) counters.LimitExceededException shouldn't abort AsyncDispatcher --- Key: YARN-2764 URL: https://issues.apache.org/jira/browse/YARN-2764 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Ted Yu Labels: counters I saw the following in container log: {code} 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attemptattempt_1414221548789_0023_r_03_0 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1414221548789_0023_r_03 Task Transitioned from RUNNING to SUCCEEDED 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 24 2014-10-25 10:28:55,053 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1414221548789_0023Job Transitioned from RUNNING to COMMITTING 2014-10-25 10:28:55,054 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT 2014-10-25 10:28:55,177 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120 at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:101) at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:106) at 
org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:203) at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:348) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1754) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1737) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1718) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:1089) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2049) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2045) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:996) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1289) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1285) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) 2014-10-25 10:28:55,185 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. 
{code} Counter limit was exceeded when JobFinishedEvent was created. Better handling of LimitExceededException should be provided so that AsyncDispatcher can continue functioning. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
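One way to read "better handling" here is that a failure inside a single event handler is recorded and the dispatch loop moves on to the next event. The sketch below illustrates only that idea under stated assumptions: `TolerantDispatcher` and its members are hypothetical names, this is not how Hadoop's AsyncDispatcher is implemented, and whether it is safe to continue after a given exception is a separate design question for the real code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical dispatcher, not Hadoop's AsyncDispatcher: a handler failure
// is recorded and later events are still delivered.
final class TolerantDispatcher<E> {
    private final Consumer<E> handler;
    private final List<String> failures = new ArrayList<>();

    TolerantDispatcher(Consumer<E> handler) {
        this.handler = handler;
    }

    // Delivers one event; an unchecked exception from the handler is
    // captured instead of terminating the caller's dispatch loop.
    void dispatch(E event) {
        try {
            handler.accept(event);
        } catch (RuntimeException e) {
            failures.add(event + ": " + e.getMessage());
        }
    }

    // Read-only view of the failures recorded so far.
    List<String> failures() {
        return Collections.unmodifiableList(failures);
    }
}
```

In this sketch an event whose handler throws (for example because a counter limit is exceeded) ends up in `failures()`, while events dispatched afterwards are processed normally.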
[jira] [Resolved] (YARN-2650) TestRMRestart#testRMRestartGetApplicationList sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved YARN-2650. -- Resolution: Cannot Reproduce TestRMRestart#testRMRestartGetApplicationList sometimes fails in trunk -- Key: YARN-2650 URL: https://issues.apache.org/jira/browse/YARN-2650 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor Attachments: TestRMRestart.tar.gz I got the following failure running on Linux: {code} TestRMRestart.testRMRestartGetApplicationList:952 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:952) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:64) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306155#comment-14306155 ] Ted Yu commented on YARN-3025: -- bq. then we can see the efficient way to persist them into the state store to overcome RM restarting Sounds good. Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2858) TestRMHA#testFailoverAndTransitions fails in trunk against Java 8
[ https://issues.apache.org/jira/browse/YARN-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved YARN-2858. -- Resolution: Cannot Reproduce TestRMHA#testFailoverAndTransitions fails in trunk against Java 8 - Key: YARN-2858 URL: https://issues.apache.org/jira/browse/YARN-2858 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/4/console : {code} Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 51.034 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMHA testFailoverAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA) Time elapsed: 30.021 sec ERROR! java.lang.Exception: test timed out after 3 milliseconds at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:698) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:641) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1218) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240) at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.get(WebResource.java:503) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.checkActiveRMWebServices(TestRMHA.java:157) at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.checkActiveRMFunctionality(TestRMHA.java:142) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testFailoverAndTransitions(TestRMHA.java:211) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved YARN-2871. -- Resolution: Cannot Reproduce TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294467#comment-14294467 ] Ted Yu commented on YARN-3025: -- bq. If we want to make sure the blacklisted nodes is recoverable after RM crashing The above is desirable. Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292145#comment-14292145 ] Ted Yu commented on YARN-3025: -- The persistence of blacklisted nodes doesn't have to be 1-to-1 with each heartbeat from the AM. The RM can decide on a proper interval. Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
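Decoupling persistence from heartbeats, as suggested above, could be done by writing only when something has changed and a minimum interval has elapsed. The following is a sketch of that throttling idea only; `ThrottledPersister`, `markDirty`, and `maybePersist` are hypothetical names, the caller supplies the clock value, and nothing here reflects how the RM state store actually works.

```java
// Hypothetical throttle, not part of YARN: persists at most once per
// interval, and only if state changed since the last write.
final class ThrottledPersister {
    private final long intervalMillis;
    private long lastPersistMillis;
    private boolean persistedOnce;
    private boolean dirty;

    ThrottledPersister(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Called whenever the blacklist changes (e.g. on a heartbeat).
    synchronized void markDirty() {
        dirty = true;
    }

    // Runs persistAction if there are unsaved changes and the interval has
    // passed since the previous write; returns whether it ran.
    synchronized boolean maybePersist(long nowMillis, Runnable persistAction) {
        boolean due = !persistedOnce || nowMillis - lastPersistMillis >= intervalMillis;
        if (!dirty || !due) {
            return false;
        }
        persistAction.run();
        lastPersistMillis = nowMillis;
        persistedOnce = true;
        dirty = false;
        return true;
    }
}
```

With an interval of one second, many heartbeats that each call `markDirty()` would result in at most one write per second; the trade-off is that changes made since the last write can be lost if the RM stops in between.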
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292981#comment-14292981 ] Ted Yu commented on YARN-3025: -- Tsuyoshi's comment makes sense. Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3081) Potential indefinite wait in ContainerManagementProtocolProxy#addProxyToCache()
[ https://issues.apache.org/jira/browse/YARN-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-3081: - Attachment: yarn-3081-001.patch Potential indefinite wait in ContainerManagementProtocolProxy#addProxyToCache() --- Key: YARN-3081 URL: https://issues.apache.org/jira/browse/YARN-3081 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: yarn-3081-001.patch {code} if (!removedProxy) { // all of the proxies are currently in use and already scheduled // for removal, so we need to wait until at least one of them closes try { this.wait(); {code} The above code can wait for a condition that has already been satisfied, leading to indefinite wait. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3081) Potential indefinite wait in ContainerManagementProtocolProxy#addProxyToCache()
Ted Yu created YARN-3081: Summary: Potential indefinite wait in ContainerManagementProtocolProxy#addProxyToCache() Key: YARN-3081 URL: https://issues.apache.org/jira/browse/YARN-3081 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Priority: Minor {code} if (!removedProxy) { // all of the proxies are currently in use and already scheduled // for removal, so we need to wait until at least one of them closes try { this.wait(); {code} The above code can wait for a condition that has already been satisfied, leading to indefinite wait. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
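A common guard against waiting for a condition that already holds is to re-check the condition in a loop while holding the monitor, and to bound the wait with a timeout. The sketch below shows that general pattern only; it is not the attached YARN-3081 patch, and `ProxySlots`, `acquire`, and `release` are hypothetical names that do not exist in ContainerManagementProtocolProxy.

```java
// Hypothetical slot counter illustrating a guarded, bounded wait.
final class ProxySlots {
    private int free;

    ProxySlots(int capacity) {
        this.free = capacity;
    }

    // Takes a slot, waiting up to timeoutMillis for one to be released.
    // The condition is checked before every wait, so a release that
    // happened earlier is never missed. Returns false on timeout or
    // interruption.
    synchronized boolean acquire(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (free == 0) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            try {
                // remainingNanos > 0, so this is never the unbounded wait(0, 0).
                wait(remainingNanos / 1_000_000L, (int) (remainingNanos % 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        free--;
        return true;
    }

    // Returns a slot and wakes any waiting threads.
    synchronized void release() {
        free++;
        notifyAll();
    }
}
```

The `while` loop also covers spurious wakeups, which the Java documentation for `Object.wait` says callers must tolerate.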
[jira] [Commented] (YARN-3003) Provide API for client to retrieve label to node mapping
[ https://issues.apache.org/jira/browse/YARN-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284205#comment-14284205 ] Ted Yu commented on YARN-3003: -- For message LabelsToNodeIdProto, should it be named LabelsToNodeIdsProto since the nodeId field is repeated? Provide API for client to retrieve label to node mapping Key: YARN-3003 URL: https://issues.apache.org/jira/browse/YARN-3003 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Ted Yu Assignee: Varun Saxena Attachments: YARN-3003.001.patch Currently YarnClient#getNodeToLabels() returns the mapping from NodeId to set of labels associated with the node. Client (such as Slider) may be interested in label to node mapping - given label, return the nodes with this label. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3072) Dependency on io.netty in hadoop-nfs pom.xml can be dropped
Ted Yu created YARN-3072: Summary: Dependency on io.netty in hadoop-nfs pom.xml can be dropped Key: YARN-3072 URL: https://issues.apache.org/jira/browse/YARN-3072 Project: Hadoop YARN Issue Type: Task Reporter: Ted Yu Assignee: Ted Yu Priority: Minor hadoop-nfs pom.xml has compile time dependency on io.netty This dependency can be dropped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282006#comment-14282006 ] Ted Yu commented on YARN-3025: -- [~bikassaha]: Can you provide your opinion? Thanks Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3070) TestRMAdminCLI#testHelp fails for transitionToActive command
Ted Yu created YARN-3070: Summary: TestRMAdminCLI#testHelp fails for transitionToActive command Key: YARN-3070 URL: https://issues.apache.org/jira/browse/YARN-3070 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor {code} testError(new String[] { "-help", "-transitionToActive" }, "Usage: yarn rmadmin [-transitionToActive <serviceId>" + " [--forceactive]]", dataErr, 0); {code} fails with: {code} java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testError(TestRMAdminCLI.java:547) at org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:335) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3070) TestRMAdminCLI#testHelp fails for transitionToActive command
[ https://issues.apache.org/jira/browse/YARN-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281665#comment-14281665 ] Ted Yu commented on YARN-3070: -- Thanks Junping for taking care of this. TestRMAdminCLI#testHelp fails for transitionToActive command Key: YARN-3070 URL: https://issues.apache.org/jira/browse/YARN-3070 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Junping Du Priority: Minor Attachments: YARN-3070.patch {code} testError(new String[] { "-help", "-transitionToActive" }, "Usage: yarn rmadmin [-transitionToActive <serviceId>" + " [--forceactive]]", dataErr, 0); {code} fails with: {code} java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testError(TestRMAdminCLI.java:547) at org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:335) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3025) Provide API for retrieving blacklisted nodes
Ted Yu created YARN-3025: Summary: Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
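The pairing of the existing update method with the proposed getter can be sketched in isolation as below. The two method signatures follow the issue description; everything else is an assumption made for illustration: the class name `BlacklistTracker` is hypothetical, the sorted set is only there to give a stable order, and the real YARN class holding `updateBlacklist` may store its state differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the class that owns updateBlacklist().
final class BlacklistTracker {
    // Sorted so that getBlacklistedNodes() returns a deterministic order.
    private final Set<String> blacklist = new TreeSet<>();

    // Applies additions first, then removals, so a node listed in both
    // ends up not blacklisted. Null lists are treated as empty.
    public synchronized void updateBlacklist(List<String> blacklistAdditions,
                                             List<String> blacklistRemovals) {
        if (blacklistAdditions != null) {
            blacklist.addAll(blacklistAdditions);
        }
        if (blacklistRemovals != null) {
            blacklist.removeAll(blacklistRemovals);
        }
    }

    // Returns a snapshot copy, so callers cannot mutate internal state.
    public synchronized List<String> getBlacklistedNodes() {
        return new ArrayList<>(blacklist);
    }
}
```

A new AM attempt could call `getBlacklistedNodes()` once after failover and seed its own view from the returned snapshot.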
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271727#comment-14271727 ] Ted Yu commented on YARN-3025: -- bq. RM probably does not persist this information Looks like RM should persist blacklisted nodes to ride over RM restart. Provide API for retrieving blacklisted nodes Key: YARN-3025 URL: https://issues.apache.org/jira/browse/YARN-3025 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu We have the following method which updates blacklist: {code} public synchronized void updateBlacklist(List<String> blacklistAdditions, List<String> blacklistRemovals) { {code} Upon AM failover, there should be an API which returns the blacklisted nodes so that the new AM can make consistent decisions. The new API can be: {code} public synchronized List<String> getBlacklistedNodes() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2213) Change proxy-user cookie log in AmIpFilter to DEBUG
[ https://issues.apache.org/jira/browse/YARN-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267100#comment-14267100 ] Ted Yu commented on YARN-2213: -- lgtm Change proxy-user cookie log in AmIpFilter to DEBUG --- Key: YARN-2213 URL: https://issues.apache.org/jira/browse/YARN-2213 Project: Hadoop YARN Issue Type: Task Reporter: Ted Yu Assignee: Varun Saxena Priority: Minor Attachments: YARN-2213.001.patch I saw a lot of the following lines in AppMaster log: {code} 14/06/24 17:12:36 WARN web.SliderAmIpFilter: Could not find proxy-user cookie, so user will not be set 14/06/24 17:12:39 WARN web.SliderAmIpFilter: Could not find proxy-user cookie, so user will not be set 14/06/24 17:12:39 WARN web.SliderAmIpFilter: Could not find proxy-user cookie, so user will not be set {code} For long running app, this would consume considerable log space. Log level should be changed to DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3003) Provide API for client to retrieve label to node mapping
[ https://issues.apache.org/jira/browse/YARN-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266505#comment-14266505 ] Ted Yu commented on YARN-3003: -- +1 to the API Wangda described. Provide API for client to retrieve label to node mapping Key: YARN-3003 URL: https://issues.apache.org/jira/browse/YARN-3003 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Ted Yu Assignee: Varun Saxena Currently YarnClient#getNodeToLabels() returns the mapping from NodeId to set of labels associated with the node. Client (such as Slider) may be interested in label to node mapping - given label, return the nodes with this label. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3003) Provide API for client to retrieve label to node mapping
[ https://issues.apache.org/jira/browse/YARN-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263262#comment-14263262 ] Ted Yu commented on YARN-3003: -- Thanks for taking this, Varun. What do you think of the following API: {code} public abstract Map<String, Set<NodeId>> getNodeToLabels(List<String> labels) {code} If the labels parameter is null or empty, all mappings would be returned. Otherwise only mappings for the selected labels would be returned. Provide API for client to retrieve label to node mapping Key: YARN-3003 URL: https://issues.apache.org/jira/browse/YARN-3003 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu Assignee: Varun Saxena Priority: Minor Currently YarnClient#getNodeToLabels() returns the mapping from NodeId to set of labels associated with the node. Client (such as Slider) may be interested in label to node mapping - given label, return the nodes with this label. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
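The semantics proposed above (null or empty `labels` returns every mapping, otherwise only the selected labels) can be illustrated by inverting a node-to-labels map on the client side. This is a sketch under stated assumptions, not the YARN-3003 implementation: `LabelIndex` and `labelsToNodes` are hypothetical names, and plain strings stand in for `NodeId`.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: inverts node -> labels into label -> nodes.
final class LabelIndex {
    private LabelIndex() {}

    // nodeToLabels maps a node identifier (a String here, standing in for
    // NodeId) to its labels. If labels is null or empty, all label
    // mappings are returned; otherwise only the requested labels.
    static Map<String, Set<String>> labelsToNodes(Map<String, Set<String>> nodeToLabels,
                                                  Collection<String> labels) {
        boolean selectAll = labels == null || labels.isEmpty();
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : nodeToLabels.entrySet()) {
            for (String label : entry.getValue()) {
                if (selectAll || labels.contains(label)) {
                    result.computeIfAbsent(label, k -> new TreeSet<>()).add(entry.getKey());
                }
            }
        }
        return result;
    }
}
```

Doing this inversion on the client requires fetching the full node-to-labels mapping first, which is one reason a server-side API like the one proposed may be preferable.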
[jira] [Updated] (YARN-2777) Mark the end of individual log in aggregated log
[ https://issues.apache.org/jira/browse/YARN-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-2777: - Labels: log-aggregation (was: ) Mark the end of individual log in aggregated log Key: YARN-2777 URL: https://issues.apache.org/jira/browse/YARN-2777 Project: Hadoop YARN Issue Type: Improvement Reporter: Ted Yu Labels: log-aggregation Below is snippet of aggregated log showing hbase master log: {code} LogType: hbase-hbase-master-ip-172-31-34-167.log LogUploadTime: 29-Oct-2014 22:31:55 LogLength: 24103045 Log Contents: Wed Oct 29 15:43:57 UTC 2014 Starting master on ip-172-31-34-167 ... at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124) at org.apache.hadoop.hbase.Chore.run(Chore.java:80) at java.lang.Thread.run(Thread.java:745) LogType: hbase-hbase-master-ip-172-31-34-167.out {code} Since logs from various daemons are aggregated in one log file, it would be desirable to mark the end of one log before starting with the next. e.g. with such a line: {code} End of LogType: hbase-hbase-master-ip-172-31-34-167.log {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2764) counters.LimitExceededException shouldn't abort AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262748#comment-14262748 ] Ted Yu commented on YARN-2764: -- Comment on this issue is appreciated. counters.LimitExceededException shouldn't abort AsyncDispatcher --- Key: YARN-2764 URL: https://issues.apache.org/jira/browse/YARN-2764 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Ted Yu I saw the following in container log: {code} 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attemptattempt_1414221548789_0023_r_03_0 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1414221548789_0023_r_03 Task Transitioned from RUNNING to SUCCEEDED 2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 24 2014-10-25 10:28:55,053 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1414221548789_0023Job Transitioned from RUNNING to COMMITTING 2014-10-25 10:28:55,054 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT 2014-10-25 10:28:55,177 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120 at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:101) at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95) at 
org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:106) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:203) at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:348) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1754) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1737) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1718) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:1089) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2049) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2045) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:996) at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1289) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1285) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) 2014-10-25 10:28:55,185 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. {code} Counter limit was exceeded when JobFinishedEvent was created. Better handling of LimitExceededException should be provided so that AsyncDispatcher can continue functioning. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
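One possible reading of the request above is that the dispatch loop could catch a runtime exception raised by a handler, record it, and keep draining the queue instead of exiting. The sketch below illustrates that idea only. `ResilientDispatcher`, its `Runnable` events, and `drainAll()` are hypothetical names invented for illustration; this is not YARN's AsyncDispatcher, and whether skipping a failed event is safe for a given state machine is a separate question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for a dispatcher; NOT YARN's AsyncDispatcher.
class ResilientDispatcher {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final List<RuntimeException> failures = new ArrayList<>();

  void enqueue(Runnable event) {
    queue.add(event);
  }

  // Runs every queued event. A handler that throws is recorded and skipped,
  // so later events are still dispatched. Returns the number that succeeded.
  int drainAll() {
    int handled = 0;
    Runnable event;
    while ((event = queue.poll()) != null) {
      try {
        event.run();
        handled++;
      } catch (RuntimeException e) {
        failures.add(e); // surface the failure without stopping the loop
      }
    }
    return handled;
  }

  List<RuntimeException> failures() {
    return failures;
  }
}
```

In this sketch an event that fails the way the JobFinishedEvent construction did would end up in `failures()` while the events queued behind it still run.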
[jira] [Updated] (YARN-2988) Graph#save() may leak resource
[ https://issues.apache.org/jira/browse/YARN-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-2988: - Attachment: YARN-2988-002.patch How about this patch ? Graph#save() may leak resource -- Key: YARN-2988 URL: https://issues.apache.org/jira/browse/YARN-2988 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: YARN-2988-001.patch, YARN-2988-002.patch In hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/Graph.java : {code} public void save(String filepath) throws IOException { OutputStreamWriter fout = new OutputStreamWriter( new FileOutputStream(filepath), Charset.forName("UTF-8")); fout.write(generateGraphViz()); fout.close(); {code} The close of fout should be enclosed in a finally clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
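For illustration, the leak described above can be avoided with try-with-resources, which has the same effect as closing in a finally clause. This is a minimal sketch and not the content of either attached patch. `GraphSketch` and its fixed-string `generateGraphViz()` are hypothetical stand-ins for the real Graph class.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Graph; generateGraphViz() returns a fixed string here.
class GraphSketch {
  String generateGraphViz() {
    return "digraph G {\n}\n";
  }

  // Both resources are closed when the try block exits,
  // whether write() returns normally or throws.
  public void save(String filepath) throws IOException {
    try (FileOutputStream out = new FileOutputStream(filepath);
         Writer fout = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
      fout.write(generateGraphViz());
    }
  }
}
```

Declaring the stream and the writer as separate resources means the file descriptor is released even if constructing the writer were to fail.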
[jira] [Assigned] (YARN-2988) Graph#save() may leak file descriptors
[ https://issues.apache.org/jira/browse/YARN-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu reassigned YARN-2988: Assignee: Ted Yu (was: Tsuyoshi OZAWA) Graph#save() may leak file descriptors -- Key: YARN-2988 URL: https://issues.apache.org/jira/browse/YARN-2988 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Fix For: 2.7.0 Attachments: YARN-2988-001.patch, YARN-2988-002.patch In hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/Graph.java : {code} public void save(String filepath) throws IOException { OutputStreamWriter fout = new OutputStreamWriter( new FileOutputStream(filepath), Charset.forName("UTF-8")); fout.write(generateGraphViz()); fout.close(); {code} The close of fout should be enclosed in a finally clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2988) Graph#save() may leak resource
Ted Yu created YARN-2988: Summary: Graph#save() may leak resource Key: YARN-2988 URL: https://issues.apache.org/jira/browse/YARN-2988 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Priority: Minor In hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/Graph.java : {code} public void save(String filepath) throws IOException { OutputStreamWriter fout = new OutputStreamWriter( new FileOutputStream(filepath), Charset.forName("UTF-8")); fout.write(generateGraphViz()); fout.close(); {code} The close of fout should be enclosed in a finally clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2988) Graph#save() may leak resource
[ https://issues.apache.org/jira/browse/YARN-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-2988: - Attachment: YARN-2988-001.patch Graph#save() may leak resource -- Key: YARN-2988 URL: https://issues.apache.org/jira/browse/YARN-2988 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: YARN-2988-001.patch In hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/Graph.java : {code} public void save(String filepath) throws IOException { OutputStreamWriter fout = new OutputStreamWriter( new FileOutputStream(filepath), Charset.forName("UTF-8")); fout.write(generateGraphViz()); fout.close(); {code} The close of fout should be enclosed in a finally clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2930) TestRMRestart#testRMRestartRecoveringNodeLabelManager sometimes fails against Java 8
Ted Yu created YARN-2930: Summary: TestRMRestart#testRMRestartRecoveringNodeLabelManager sometimes fails against Java 8 Key: YARN-2930 URL: https://issues.apache.org/jira/browse/YARN-2930 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/31/console : {code} testRMRestartRecoveringNodeLabelManager[0](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 0.136 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<2> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartRecoveringNodeLabelManager(TestRMRestart.java:2100) testRMRestartRecoveringNodeLabelManager[1](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 0.081 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<2> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartRecoveringNodeLabelManager(TestRMRestart.java:2100) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2914) Potential race condition in SharedCacheUploaderMetrics/CleanerMetrics/ClientSCMMetrics#getInstance()
[ https://issues.apache.org/jira/browse/YARN-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234734#comment-14234734 ] Ted Yu commented on YARN-2914: -- lgtm. I triggered a QA run manually. Potential race condition in SharedCacheUploaderMetrics/CleanerMetrics/ClientSCMMetrics#getInstance() Key: YARN-2914 URL: https://issues.apache.org/jira/browse/YARN-2914 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Ted Yu Assignee: Varun Saxena Priority: Minor Fix For: 2.7.0 Attachments: YARN-2914.002.patch, YARN-2914.patch {code} public static ClientSCMMetrics getInstance() { ClientSCMMetrics topMetrics = Singleton.INSTANCE.impl; if (topMetrics == null) { throw new IllegalStateException( {code} getInstance() doesn't hold the lock on Singleton.this. This may result in IllegalStateException being thrown prematurely. [~ctrezzo] reported that SharedCacheUploaderMetrics also has the same kind of race condition. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2604) Scheduler should consider max-allocation-* in conjunction with the largest node
[ https://issues.apache.org/jira/browse/YARN-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232512#comment-14232512 ] Ted Yu commented on YARN-2604: -- Should Fix Version be 2.7.0 ? Scheduler should consider max-allocation-* in conjunction with the largest node --- Key: YARN-2604 URL: https://issues.apache.org/jira/browse/YARN-2604 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Robert Kanter Attachments: YARN-2604.patch, YARN-2604.patch, YARN-2604.patch, YARN-2604.patch, YARN-2604.patch, YARN-2604.patch If the scheduler max-allocation-* values are larger than the resources available on the largest node in the cluster, an application requesting resources between the two values will be accepted by the scheduler but the requests will never be satisfied. The app essentially hangs forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
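The situation described above can be made concrete with a small sketch: if the effective limit were the smaller of the configured maximum and the largest node's capacity, a request between the two values would be rejected up front rather than accepted and never satisfied. This is only one way to read the issue, not the attached patch. `MaxAllocationSketch`, `effectiveMaxMb`, and the plain `int` memory values are hypothetical.

```java
// Hypothetical helper; NOT the scheduler's actual code or the attached patch.
class MaxAllocationSketch {
  // Effective maximum = min(configured maximum, capacity of the largest node).
  // With no nodes registered, fall back to the configured maximum.
  static int effectiveMaxMb(int configuredMaxMb, int[] nodeCapacitiesMb) {
    if (nodeCapacitiesMb.length == 0) {
      return configuredMaxMb;
    }
    int largest = 0;
    for (int capacity : nodeCapacitiesMb) {
      largest = Math.max(largest, capacity);
    }
    return Math.min(configuredMaxMb, largest);
  }

  // A request above the effective maximum can never be placed on any node.
  static boolean isSatisfiable(int requestMb, int configuredMaxMb, int[] nodeCapacitiesMb) {
    return requestMb <= effectiveMaxMb(configuredMaxMb, nodeCapacitiesMb);
  }
}
```

With a configured maximum of 8192 MB and nodes of 4096 MB and 6144 MB, a 7000 MB request falls between the two values and would be flagged as unsatisfiable under this rule.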
[jira] [Created] (YARN-2914) Potential race condition in ClientSCMMetrics#getInstance()
Ted Yu created YARN-2914: Summary: Potential race condition in ClientSCMMetrics#getInstance() Key: YARN-2914 URL: https://issues.apache.org/jira/browse/YARN-2914 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Priority: Minor {code} public static ClientSCMMetrics getInstance() { ClientSCMMetrics topMetrics = Singleton.INSTANCE.impl; if (topMetrics == null) { throw new IllegalStateException( {code} getInstance() doesn't hold the lock on Singleton.this. This may result in IllegalStateException being thrown prematurely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
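To illustrate the race described above: if initialization runs under the singleton's lock but the getter reads the field without it, a getter call that overlaps initialization can observe null. One way to close that window is to read the field under the same lock. The sketch below assumes an enum-held singleton with a synchronized `init()`. `MetricsSketch`, `initSingleton()`, and `get()` are hypothetical names, and this is not the actual ClientSCMMetrics source or the attached patch.

```java
// Hypothetical stand-in for a metrics singleton; NOT the actual ClientSCMMetrics code.
class MetricsSketch {
  private enum Singleton {
    INSTANCE;

    private MetricsSketch impl;

    // Creates the instance once, under the enum constant's lock.
    synchronized MetricsSketch init() {
      if (impl == null) {
        impl = new MetricsSketch();
      }
      return impl;
    }

    // Reads under the same lock, so a caller racing with init() waits
    // for it to finish instead of seeing a half-completed state.
    synchronized MetricsSketch get() {
      return impl;
    }
  }

  static MetricsSketch initSingleton() {
    return Singleton.INSTANCE.init();
  }

  static MetricsSketch getInstance() {
    MetricsSketch metrics = Singleton.INSTANCE.get();
    if (metrics == null) {
      throw new IllegalStateException("metrics singleton has not been initialized");
    }
    return metrics;
  }
}
```

The exception is still thrown when `getInstance()` is called before any initialization has started; the lock only removes the case where initialization is in progress on another thread.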
[jira] [Updated] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-2871: - Description: From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: -> at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: -> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} was: From trunk build #746: {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: -> at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: -> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: -> at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: -> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
Ted Yu created YARN-2871: Summary: TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From trunk build #746: {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: -> at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: -> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2864) TestRMWebServicesAppsModification fails in trunk
Ted Yu created YARN-2864: Summary: TestRMWebServicesAppsModification fails in trunk Key: YARN-2864 URL: https://issues.apache.org/jira/browse/YARN-2864 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/console : {code} Tests run: 32, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 151.14 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testGetNewApplicationAndSubmit[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 0.276 sec ERROR! java.lang.NoClassDefFoundError: org/apache/hadoop/io/FastByteComparisons at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.apache.hadoop.io.WritableComparator.compareBytes(WritableComparator.java:187) at org.apache.hadoop.io.BinaryComparable.compareTo(BinaryComparable.java:50) at org.apache.hadoop.io.BinaryComparable.equals(BinaryComparable.java:72) at org.apache.hadoop.io.Text.equals(Text.java:348) at java.util.ArrayList.indexOf(ArrayList.java:216) at java.util.ArrayList.contains(ArrayList.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppSubmit(TestRMWebServicesAppsModification.java:844) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testGetNewApplicationAndSubmit(TestRMWebServicesAppsModification.java:726) testGetNewApplicationAndSubmit[3](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 0.225 sec ERROR! 
java.lang.NoClassDefFoundError: org/apache/hadoop/io/FastByteComparisons at org.apache.hadoop.io.WritableComparator.compareBytes(WritableComparator.java:187) at org.apache.hadoop.io.BinaryComparable.compareTo(BinaryComparable.java:50) at org.apache.hadoop.io.BinaryComparable.equals(BinaryComparable.java:72) at org.apache.hadoop.io.Text.equals(Text.java:348) at java.util.ArrayList.indexOf(ArrayList.java:216) at java.util.ArrayList.contains(ArrayList.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppSubmit(TestRMWebServicesAppsModification.java:844) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testGetNewApplicationAndSubmit(TestRMWebServicesAppsModification.java:726) {code} Running on a MacBook (with Java 1.7.0_60), I got: {code} Running org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification Tests run: 32, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 146.749 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testGetNewApplicationAndSubmit[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 0.185 sec FAILURE! java.lang.AssertionError: expected:<Accepted> but was:<Internal Server Error> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppSubmit(TestRMWebServicesAppsModification.java:799) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testGetNewApplicationAndSubmit(TestRMWebServicesAppsModification.java:726) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2858) TestRMHA#testFailoverAndTransitions fails in trunk against Java 8
Ted Yu created YARN-2858: Summary: TestRMHA#testFailoverAndTransitions fails in trunk against Java 8 Key: YARN-2858 URL: https://issues.apache.org/jira/browse/YARN-2858 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/4/console : {code} Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 51.034 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMHA testFailoverAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA) Time elapsed: 30.021 sec ERROR! java.lang.Exception: test timed out after 30000 milliseconds at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:698) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:641) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1218) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240) at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.get(WebResource.java:503) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.checkActiveRMWebServices(TestRMHA.java:157) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.checkActiveRMFunctionality(TestRMHA.java:142) at
org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testFailoverAndTransitions(TestRMHA.java:211) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2842) TestApplicationClientProtocolOnHA fails against Java 8
Ted Yu created YARN-2842: Summary: TestApplicationClientProtocolOnHA fails against Java 8 Key: YARN-2842 URL: https://issues.apache.org/jira/browse/YARN-2842 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/1/consoleFull : {code} testGetNewApplicationOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 8.959 sec ERROR! java.net.ConnectException: Call From asf908.gq1.ygridcore.net/67.195.81.152 to asf908.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy17.getNewApplication(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:217) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at 
java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getNewApplication(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNewApplication(YarnClientImpl.java:206) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createApplication(YarnClientImpl.java:214) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetNewApplicationOnHA(TestApplicationClientProtocolOnHA.java:76) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2842) TestApplicationClientProtocolOnHA fails against Java 8
[ https://issues.apache.org/jira/browse/YARN-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved YARN-2842. -- Resolution: Duplicate Should have searched :-) TestApplicationClientProtocolOnHA fails against Java 8 -- Key: YARN-2842 URL: https://issues.apache.org/jira/browse/YARN-2842 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/1/consoleFull : {code} testGetNewApplicationOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 8.959 sec ERROR! java.net.ConnectException: Call From asf908.gq1.ygridcore.net/67.195.81.152 to asf908.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy17.getNewApplication(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:217) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getNewApplication(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNewApplication(YarnClientImpl.java:206) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createApplication(YarnClientImpl.java:214) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetNewApplicationOnHA(TestApplicationClientProtocolOnHA.java:76) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)