[jira] [Comment Edited] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
[ https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737005#comment-17737005 ]

Prabhu Joseph edited comment on YARN-11501 at 6/26/23 6:19 AM:
---

>> I am not able to trace ClusterNodeTracker#updateMaxResources -> RMNodeImpl.getState in trunk code. Any private change?

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change. During the initial analysis we tried to fix the locking in _StatusUpdateWhenHealthyTransition_._hasScheduledAMContainers_ (which locks _RMNode_ first and then _SchedulerNode_), but the fix at our private change (_ClusterNodeTracker_._updateMaxResources_ -> _RMNodeImpl_._getState_, which locks _SchedulerNode_ first and then _RMNode_) was easier. This deadlock cannot happen without the private change, so I will mark this issue as invalid.
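The lock-ordering conflict described in this comment can be sketched in isolation. The following is a minimal, self-contained illustration, not YARN code: `rmNodeLock` and `schedulerNodeLock` are hypothetical stand-ins for the RMNode and SchedulerNode read-write locks, and `tryLock` with a timeout replaces `lock()` so the demo times out instead of hanging the way the real deadlock does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderDemo {
    // Hypothetical stand-ins for the RMNode and SchedulerNode locks in the report.
    static final ReentrantReadWriteLock rmNodeLock = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock schedulerNodeLock = new ReentrantReadWriteLock();

    /** Runs the two opposing lock orders; returns whether each thread timed out on its second lock. */
    static boolean[] run() throws InterruptedException {
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        boolean[] timedOut = new boolean[2];

        // Like StatusUpdateWhenHealthyTransition.hasScheduledAMContainers: RMNode first, then SchedulerNode.
        Thread t1 = new Thread(() -> {
            rmNodeLock.writeLock().lock();
            try {
                bothHoldFirstLock.countDown();
                bothHoldFirstLock.await();      // wait until the other thread holds its first lock
                // tryLock with a timeout instead of lock(), so the demo cannot hang forever.
                boolean got = schedulerNodeLock.readLock().tryLock(300, TimeUnit.MILLISECONDS);
                timedOut[0] = !got;
                if (got) schedulerNodeLock.readLock().unlock();
                bothAttempted.countDown();
                bothAttempted.await();          // hold the first lock until both attempts finish
            } catch (InterruptedException ignored) {
            } finally {
                rmNodeLock.writeLock().unlock();
            }
        });

        // Like CapacityScheduler#removeNode: SchedulerNode first, then RMNode.
        Thread t2 = new Thread(() -> {
            schedulerNodeLock.writeLock().lock();
            try {
                bothHoldFirstLock.countDown();
                bothHoldFirstLock.await();
                boolean got = rmNodeLock.readLock().tryLock(300, TimeUnit.MILLISECONDS);
                timedOut[1] = !got;
                if (got) rmNodeLock.readLock().unlock();
                bothAttempted.countDown();
                bothAttempted.await();
            } catch (InterruptedException ignored) {
            } finally {
                schedulerNodeLock.writeLock().unlock();
            }
        });

        t1.start(); t2.start();
        t1.join();  t2.join();
        return timedOut;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] timedOut = run();
        // Both time out: each thread waits on the lock the other one holds.
        System.out.println("thread 1 stuck on SchedulerNode lock: " + timedOut[0]);
        System.out.println("thread 2 stuck on RMNode lock: " + timedOut[1]);
    }
}
```

With `lock()` in place of `tryLock`, both threads would park forever, exactly as in the jstack output; picking a single global acquisition order (or not calling into the other object's locked state at all) removes the cycle.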
> ResourceManager deadlock due to
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> ----------------------------------------------------------
>
>                 Key: YARN-11501
>                 URL: https://issues.apache.org/jira/browse/YARN-11501
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.4.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Critical
>
> We have seen a deadlock in ResourceManager: StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holds the lock on RMNode while waiting to lock SchedulerNode, whereas CapacityScheduler#removeNode has taken the lock on SchedulerNode and is waiting to lock RMNode.
>
> cc *Vishal Vyas*
>
> {code:java}
> Found one Java-level deadlock:
> ==============================
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x000717e6ff60, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x0007168a7a38, (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x000717e6ff60, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
>
> Java stack information for the threads listed above:
> ====================================================
> "qtp1401737458-850":
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for <0x000717e6ff60> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>   at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>   at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>   at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at ...
> {code}
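The "Found one Java-level deadlock" section in the jstack output above can also be obtained programmatically through `ThreadMXBean`, which is sometimes convenient for monitoring a long-running ResourceManager. The sketch below is illustrative only: the deadlock it creates for demonstration uses two plain `ReentrantLock`s with opposite acquisition orders, not YARN's actual locks, and all class and method names here are made up for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockDetector {

    /** Names of currently deadlocked threads, or an empty list. findDeadlockedThreads()
     *  covers ownable synchronizers such as ReentrantReadWriteLock, the kind in the jstack output. */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();
        List<String> names = new ArrayList<>();
        if (ids != null) {
            for (ThreadInfo info : mx.getThreadInfo(ids)) {
                names.add(info.getThreadName());
            }
        }
        return names;
    }

    /** Deliberately deadlocks two daemon threads by acquiring locks a and b in opposite orders. */
    static void createDemoDeadlock() throws InterruptedException {
        ReentrantLock a = new ReentrantLock();
        ReentrantLock b = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        Runnable[] bodies = {
            () -> { a.lock(); bothHoldFirst.countDown(); await(bothHoldFirst); b.lock(); },
            () -> { b.lock(); bothHoldFirst.countDown(); await(bothHoldFirst); a.lock(); },
        };
        for (Runnable body : bodies) {
            Thread t = new Thread(body);
            t.setDaemon(true);      // let the JVM exit even though these threads never finish
            t.start();
        }
        bothHoldFirst.await();      // both threads now hold their first lock; the cycle forms next
    }

    private static void await(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        createDemoDeadlock();
        // Poll until the JVM notices the lock cycle (usually within milliseconds).
        List<String> names = deadlockedThreadNames();
        long deadline = System.currentTimeMillis() + 5000;
        while (names.isEmpty() && System.currentTimeMillis() < deadline) {
            Thread.sleep(50);
            names = deadlockedThreadNames();
        }
        System.out.println("deadlocked threads: " + names);
    }
}
```

A periodic call to `deadlockedThreadNames()` from a watchdog thread would surface exactly the three-thread cycle shown in the trace, without needing an external jstack.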
[jira] [Commented] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
[ https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737005#comment-17737005 ]

Prabhu Joseph commented on YARN-11501:
--

>> I am not able to trace ClusterNodeTracker#updateMaxResources -> RMNodeImpl.getState in trunk code. Any private change?

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change. During the initial analysis we tried to fix the locking in _StatusUpdateWhenHealthyTransition_._hasScheduledAMContainers_ (which locks _RMNode_ first and then _SchedulerNode_), but the fix at our private change (_ClusterNodeTracker_._updateMaxResources_ -> _RMNodeImpl_._getState_, which locks _SchedulerNode_ first and then _RMNode_) was easier. This deadlock cannot happen without the private change, so I will mark this issue as invalid.
[jira] [Resolved] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
[ https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prabhu Joseph resolved YARN-11501.
--
    Resolution: Invalid
[jira] [Commented] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
[ https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727866#comment-17727866 ]

Prabhu Joseph commented on YARN-11501:
--

[~srinivasst] Hope you are doing well. If you get some bandwidth, could you take a look at this and share some ideas on how to fix it?
[jira] [Updated] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
[ https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prabhu Joseph updated YARN-11501:
--
    Priority: Critical  (was: Major)
[jira] [Updated] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
[ https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prabhu Joseph updated YARN-11501:
--
    Description: We have seen a deadlock in ResourceManager: StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holds the lock on RMNode while waiting to lock SchedulerNode, whereas CapacityScheduler#removeNode has taken the lock on SchedulerNode and is waiting to lock RMNode. cc *Vishal Vyas*
[jira] [Created] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
Prabhu Joseph created YARN-11501:
Summary: ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
Key: YARN-11501
URL: https://issues.apache.org/jira/browse/YARN-11501
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.4.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph

We have seen a deadlock in the ResourceManager: StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holds the lock on RMNode while waiting to lock SchedulerNode, whereas CapacityScheduler#removeNode has taken the lock on SchedulerNode and is waiting to lock RMNode.

{code:java}
Found one Java-level deadlock:
=============================
"qtp1401737458-850":
  waiting for ownable synchronizer 0x000717e6ff60, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "RM Event dispatcher"
"RM Event dispatcher":
  waiting for ownable synchronizer 0x0007168a7a38, (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
  which is held by "SchedulerEventDispatcher:Event Processor"
"SchedulerEventDispatcher:Event Processor":
  waiting for ownable synchronizer 0x000717e6ff60, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "RM Event dispatcher"

Java stack information for the threads listed above:
===================================================
"qtp1401737458-850":
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for <0x000717e6ff60> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
  at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464) at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558) at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:927) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:180) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82) at
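The thread dump above is a classic lock-ordering cycle: the status-update path locks RMNode and then SchedulerNode, while the remove-node path locks SchedulerNode and then RMNode. A minimal sketch of the standard remedy, a single global acquisition order, is below; the lock fields and method names are hypothetical stand-ins for illustration, not the actual RM classes:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderingSketch {
    // Hypothetical stand-ins for the per-RMNode and per-SchedulerNode locks.
    public static final ReentrantReadWriteLock rmNodeLock = new ReentrantReadWriteLock();
    public static final ReentrantReadWriteLock schedulerNodeLock = new ReentrantReadWriteLock();

    // Path 1 (cf. StatusUpdateWhenHealthyTransition): RMNode first, then SchedulerNode.
    public static void statusUpdatePath() {
        rmNodeLock.readLock().lock();
        try {
            schedulerNodeLock.readLock().lock();
            try {
                // ... a check such as hasScheduledAMContainers(...) would run here ...
            } finally {
                schedulerNodeLock.readLock().unlock();
            }
        } finally {
            rmNodeLock.readLock().unlock();
        }
    }

    // Path 2 (cf. the removeNode side): the SAME order, RMNode first, then
    // SchedulerNode. With both paths agreeing on one global order, the
    // circular wait seen in the thread dump cannot form.
    public static void removeNodePath() {
        rmNodeLock.writeLock().lock();
        try {
            schedulerNodeLock.writeLock().lock();
            try {
                // ... node removal would run here ...
            } finally {
                schedulerNodeLock.writeLock().unlock();
            }
        } finally {
            rmNodeLock.writeLock().unlock();
        }
    }
}
```

As the follow-up comment notes, the actual cycle here involved a private change, so this is the general technique rather than the specific patch.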
[jira] [Updated] (YARN-11466) Graceful Decommission for Shuffle Services
[ https://issues.apache.org/jira/browse/YARN-11466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11466:
- Description: Currently, YARN Graceful Decommission waits for the completion of both the running containers and the running applications (https://issues.apache.org/jira/browse/YARN-9608) of those containers launched on the node under decommission. This adds an unnecessarily huge cost to users on cloud deployments, as most of the idle nodes under decommission are waiting for running applications to complete. This feature aims to improve the Graceful Decommission logic by waiting for the actual shuffle data to be consumed by dependent tasks rather than for the entire application. Below is the high-level design I have in mind. Add a new interface (say AuxiliaryShuffleService extends AuxiliaryService) through which the workloads' (Spark, Tez, MapReduce) ShuffleHandlers expose shuffle data metrics (like whether shuffle data is present or not). The NodeManager periodically collects the shuffle data metrics from the configured AuxiliaryShuffleServices and sends them along with the heartbeat to the ResourceManager. The graceful decommission logic running inside the ResourceManager waits until the shuffle data is consumed, with a maximum wait time up to the configured graceful decommission timeout.

was: Currently, YARN Graceful Decommission waits for the completion of both running containers and the running applications of those containers launched on the node under decommission. This adds unnecessary cost to users on cloud deployments. This feature aims to improve the Graceful Decommission logic by waiting for the actual shuffle data to be consumed by dependent tasks rather than the entire application. Below is the high-level design I have in mind.
Add a new interface (say AuxiliaryShuffleService extends AuxiliaryService) through which the workloads (Spark, Tez, MapReduce) ShuffleHandler exposes shuffle data metrics (like shuffle data being present or not). NodeManager periodically collects the shuffle data metrics from the configured AuxiliaryShuffleServices and sends them along with the heartbeat to the ResourceManager. The graceful decommission logic runs inside ResourceManager waits until the shuffle data is consumed, with a maximum wait time up to the configured graceful decommission timeout. > Graceful Decommission for Shuffle Services > -- > > Key: YARN-11466 > URL: https://issues.apache.org/jira/browse/YARN-11466 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Currently, YARN Graceful Decommission waits for the completion of both > running containers and the running applications > (https://issues.apache.org/jira/browse/YARN-9608) of those containers > launched on the node under decommission. This adds unnecessary huge cost to > users on cloud deployments as most of the idle nodes are under decommission > waiting for the running application to complete. > This feature aims to improve the Graceful Decommission logic by waiting for > the actual shuffle data to be consumed by dependent tasks rather than the > entire application. Below is the high-level design I have in mind. > Add a new interface (say AuxiliaryShuffleService extends AuxiliaryService) > through which the workloads (Spark, Tez, MapReduce) ShuffleHandler exposes > shuffle data metrics (like shuffle data being present or not). NodeManager > periodically collects the shuffle data metrics from the configured > AuxiliaryShuffleServices and sends them along with the heartbeat to the > ResourceManager. 
The graceful decommission logic running inside the > ResourceManager waits until the shuffle data is consumed, with a maximum wait time up to the > configured graceful decommission timeout. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
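The proposed flow can be sketched end to end. The interface, method, and class names below are hypothetical placeholders for the design described above, not existing YARN APIs:

```java
import java.util.List;

public class ShuffleDecommissionSketch {
    // Hypothetical hook the design proposes: each workload's ShuffleHandler
    // reports whether any shuffle data is still waiting to be consumed.
    public interface AuxiliaryShuffleService {
        boolean hasPendingShuffleData();
    }

    // NodeManager side: aggregate across the configured shuffle services;
    // this flag would be piggybacked on the heartbeat to the ResourceManager.
    public static boolean nodeHasPendingShuffleData(List<AuxiliaryShuffleService> services) {
        return services.stream().anyMatch(AuxiliaryShuffleService::hasPendingShuffleData);
    }

    // ResourceManager side: a decommissioning node may be deactivated once no
    // shuffle data is pending, or once the graceful decommission timeout expires.
    public static boolean canDecommission(boolean pendingShuffleData, long elapsedMs, long timeoutMs) {
        return !pendingShuffleData || elapsedMs >= timeoutMs;
    }
}
```

This keeps the decision in the ResourceManager, matching the existing graceful decommission logic, while the NodeManager only forwards a cheap aggregated flag.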
[jira] [Created] (YARN-11494) Acquired Containers are killed when the node is reconnected
Prabhu Joseph created YARN-11494:
Summary: Acquired Containers are killed when the node is reconnected
Key: YARN-11494
URL: https://issues.apache.org/jira/browse/YARN-11494
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.3.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph

When a NodeManager is reconnected, the ResourceManager marks the acquired containers on that node as LOST, which leads to job failures.

{code}
2023-04-10 02:57:16,412 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC Server handler 41 on 8025): Reconnect from the node at: node1
2023-04-10 02:57:16,412 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC Server handler 41 on 8025): NodeManager from node node1(cmPort: 8041 httpPort: 8042) registered with capability: , assigned nodeId node1:8041, node labels { CORE }
2023-04-10 02:57:16,413 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_e15_1677844874019_238016_01_02 Container Transitioned from ACQUIRED to KILLED
{code}
[jira] [Created] (YARN-11466) Graceful Decommission for Shuffle Services
Prabhu Joseph created YARN-11466: Summary: Graceful Decommission for Shuffle Services Key: YARN-11466 URL: https://issues.apache.org/jira/browse/YARN-11466 Project: Hadoop YARN Issue Type: New Feature Reporter: Prabhu Joseph Assignee: Prabhu Joseph Currently, YARN Graceful Decommission waits for the completion of both running containers and the running applications of those containers launched on the node under decommission. This adds unnecessary cost to users on cloud deployments. This feature aims to improve the Graceful Decommission logic by waiting for the actual shuffle data to be consumed by dependent tasks rather than the entire application. Below is the high-level design I have in mind. Add a new interface (say AuxiliaryShuffleService extends AuxiliaryService) through which the workloads (Spark, Tez, MapReduce) ShuffleHandler exposes shuffle data metrics (like shuffle data being present or not). NodeManager periodically collects the shuffle data metrics from the configured AuxiliaryShuffleServices and sends them along with the heartbeat to the ResourceManager. The graceful decommission logic runs inside ResourceManager waits until the shuffle data is consumed, with a maximum wait time up to the configured graceful decommission timeout. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11457) NodeManager Resource Leak when handling a container log with colon
Prabhu Joseph created YARN-11457:
Summary: NodeManager Resource Leak when handling a container log with colon
Key: YARN-11457
URL: https://issues.apache.org/jira/browse/YARN-11457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 3.3.3
Reporter: Prabhu Joseph
Assignee: Vineeth Naroju
Attachments: Screenshot 2023-03-16 at 1.02.22 PM.png, Screenshot 2023-03-16 at 1.02.45 PM.png, Screenshot 2023-03-16 at 1.02.57 PM.png

The NodeManager leaks resources when handling a container log file whose name contains a colon. The illegal file name is not handled, which leads to a resource leak on the NodeManager side.

{code:java}
2023-03-14 11:03:53,390 WARN org.apache.hadoop.util.concurrent.ExecutorHelper (ContainersLauncher #2683): Caught exception in thread ContainersLauncher #2683:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: taskmanager.log.2023-03-14 09:44-1
  at org.apache.hadoop.fs.Path.initialize(Path.java:263)
  at org.apache.hadoop.fs.Path.(Path.java:221)
  at org.apache.hadoop.fs.Path.(Path.java:129)
  at org.apache.hadoop.fs.Globber.doGlob(Globber.java:270)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2096)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2078)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.handleContainerExitWithFailure(ContainerLaunch.java:653)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.handleContainerExitCode(ContainerLaunch.java:593)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:337)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:101)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: taskmanager.log.2023-03-14 09:44-1
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:260)
  ... 14 more
{code}

The NodeManager status details show the application stuck in FINISHING_CONTAINER_WAIT and containers stuck in the KILLING state.

!Screenshot 2023-03-16 at 1.02.57 PM.png|height=100,width=250! !Screenshot 2023-03-16 at 1.02.45 PM.png|height=100,width=250! !Screenshot 2023-03-16 at 1.02.22 PM.png|height=250,width=250!
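The root cause is visible with plain java.net.URI, independent of Hadoop: the trace shows Path.initialize building a URI from the file name, and java.net.URI rejects a non-null scheme combined with a relative path, which is the check a log name containing a colon ends up triggering. A minimal reproduction of that check:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonFileNameDemo {
    // Returns the URI parser's complaint for a scheme + relative-path
    // combination, or null when the name is accepted. This mirrors the
    // java.net.URI.checkPath rejection seen in the stack trace for
    // "taskmanager.log.2023-03-14 09:44-1".
    public static String uriFailure(String fileName) {
        try {
            // A non-null scheme with a path that does not start with '/' is illegal.
            new URI("file", null, fileName, null);
            return null;
        } catch (URISyntaxException e) {
            return e.getReason(); // "Relative path in absolute URI"
        }
    }
}
```

A hedged fix sketch on the NodeManager side would be to catch the resulting IllegalArgumentException around the globStatus call in handleContainerExitWithFailure, or to sanitize candidate log names first, so one bad file name cannot strand the container in the KILLING state; the actual patch may differ.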
[jira] [Created] (YARN-11455) All RMs in HA are stuck in standby when the ZK connection is disconnected
Prabhu Joseph created YARN-11455:
Summary: All RMs in HA are stuck in standby when the ZK connection is disconnected
Key: YARN-11455
URL: https://issues.apache.org/jira/browse/YARN-11455
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.3.3, 2.10.1
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph

All RMs in HA are stuck in standby when the ZK connection held by the active RM is disconnected.

{code:java}
2023-02-22 13:08:19,832 INFO org.apache.hadoop.ha.ActiveStandbyElector (main-EventThread): Session disconnected. Entering neutral mode...
2023-02-22 13:08:19,832 WARN org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService (main-EventThread): Lost contact with Zookeeper. Transitioning to standby in 1 ms if connection is not reestablished.{code}

*Repro:* Send a Disconnected event to the active RM using the code below.

{code:java}
zkConnectionState = ConnectionState.DISCONNECTED;
enterNeutralMode();
{code}
[jira] [Comment Edited] (YARN-11417) RM Crashes when changing Node Label of a Node in Distributed Configuration
[ https://issues.apache.org/jira/browse/YARN-11417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680718#comment-17680718 ] Prabhu Joseph edited comment on YARN-11417 at 1/25/23 5:36 PM: --- When the NodeManager node label is changed to a new label and restarted, it resyncs to the ResourceManager with the new label. CapacityScheduler receives the NODE_LABELS_UPDATE event, which removes the node from the nodesList of the old node partition in the {{nodesPerLabel}} map <{{partition}}, {{nodesList}}> part of {{ClusterNodeTracker}}#{{updateNodesPerPartition}}. Then {{CapacityScheduler}} receives NODE_REMOVED which removes the node from the {{ClusterNodeTracker}} and also removes the node from the nodesList of the new partition in {{nodesPerLabel}}, which will fail with NPE as the new partition is not yet present in the {{nodesPerLabel}} map and will be added only after the NODE_ADDED event. In the absence of a new partition, {{ClusterNodeTracker}}#{{removeNode}} can skip removing the node from the {{nodesPerLabel}} as anyway that is already removed during NODE_LABELS_UPDATE. was (Author: prabhu joseph): When the NodeManager node label is changed to a new label and restarted, it resyncs to the ResourceManager with the new label. CapacityScheduler receives the NODE_LABELS_UPDATE event, which removes the node from the nodesList of the old node partition in the nodesPerLabel map part of ClusterNodeTracker#updateNodesPerPartition. Then CapacityScheduler receives NODE_REMOVED which removes the node from the ClusterNodeTracker and also removes the node from the nodesList of the new partition in nodesPerLabel, which will fail with NPE as the new partition is not yet present in the nodesPerLabel map and will be added only after the NODE_ADDED event. In the absence of a new partition, ClusterNodeTracker#removeNode can skip removing the node from the nodesPerLabel as anyway that is already removed during NODE_LABELS_UPDATE. 
> RM Crashes when changing Node Label of a Node in Distributed Configuration > -- > > Key: YARN-11417 > URL: https://issues.apache.org/jira/browse/YARN-11417 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.3.3 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > > RM Crashes when changing Node Label of a Node in Distributed Configuration. > {code:java} > 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher > (SchedulerEventDispatcher:Event Processor): Error in handling event type > NODE_REMOVED to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83) > at java.lang.Thread.run(Thread.java:750) > {code} > *Repro* > 1. Two NodeManagers with CORE Node Label > {code:java} > yarn.nodemanager.node-labels.provider.configured-node-partition=CORE > yarn.node-labels.enabled = true > yarn.node-labels.configuration-type = distributed > yarn.nodemanager.node-labels.provider = config > {code} > 2. Remove the Node Label from one of the node to make it Default Partition > and restart nodemanager. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
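The guard the comment proposes can be illustrated with a plain map; this is a simplified stand-in for ClusterNodeTracker's nodesPerLabel bookkeeping, not the actual class:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class NodesPerLabelSketch {
    // <partition, nodesList>: a new partition only appears after NODE_ADDED.
    public static final Map<String, Set<String>> nodesPerLabel = new HashMap<>();

    // Unguarded removal: NPEs when NODE_REMOVED arrives for a partition that
    // was never added, which is the crash in ClusterNodeTracker.removeNode.
    public static void removeNodeUnguarded(String partition, String node) {
        nodesPerLabel.get(partition).remove(node);
    }

    // Guarded removal, as the comment suggests: when the partition is absent,
    // skip silently, since NODE_LABELS_UPDATE already dropped the node.
    public static boolean removeNodeGuarded(String partition, String node) {
        Set<String> nodes = nodesPerLabel.get(partition);
        return nodes != null && nodes.remove(node);
    }
}
```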
[jira] [Commented] (YARN-11417) RM Crashes when changing Node Label of a Node in Distributed Configuration
[ https://issues.apache.org/jira/browse/YARN-11417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680718#comment-17680718 ] Prabhu Joseph commented on YARN-11417: -- When the NodeManager node label is changed to a new label and restarted, it resyncs to the ResourceManager with the new label. CapacityScheduler receives the NODE_LABELS_UPDATE event, which removes the node from the nodesList of the old node partition in the nodesPerLabel map part of ClusterNodeTracker#updateNodesPerPartition. Then CapacityScheduler receives NODE_REMOVED which removes the node from the ClusterNodeTracker and also removes the node from the nodesList of the new partition in nodesPerLabel, which will fail with NPE as the new partition is not yet present in the nodesPerLabel map and will be added only after the NODE_ADDED event. In the absence of a new partition, ClusterNodeTracker#removeNode can skip removing the node from the nodesPerLabel as anyway that is already removed during NODE_LABELS_UPDATE. > RM Crashes when changing Node Label of a Node in Distributed Configuration > -- > > Key: YARN-11417 > URL: https://issues.apache.org/jira/browse/YARN-11417 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.3.3 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > > RM Crashes when changing Node Label of a Node in Distributed Configuration. 
> {code:java} > 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher > (SchedulerEventDispatcher:Event Processor): Error in handling event type > NODE_REMOVED to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83) > at java.lang.Thread.run(Thread.java:750) > {code} > *Repro* > 1. Two NodeManagers with CORE Node Label > {code:java} > yarn.nodemanager.node-labels.provider.configured-node-partition=CORE > yarn.node-labels.enabled = true > yarn.node-labels.configuration-type = distributed > yarn.nodemanager.node-labels.provider = config > {code} > 2. Remove the Node Label from one of the node to make it Default Partition > and restart nodemanager. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11421) Graceful Decommission ignores launched containers and gets deactivated before timeout
[ https://issues.apache.org/jira/browse/YARN-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678667#comment-17678667 ] Prabhu Joseph commented on YARN-11421:
--
[~abhishekd0907] This looks the same as [YARN-10873|https://issues.apache.org/jira/browse/YARN-10873]

> Graceful Decommission ignores launched containers and gets deactivated before timeout
> -
>
> Key: YARN-11421
> URL: https://issues.apache.org/jira/browse/YARN-11421
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.2.1, 3.3.1, 3.3.4
> Reporter: Abhishek Dixit
> Priority: Major
>
> During Graceful Decommission, a node gets deactivated before the timeout even though there are launched containers on that node.
> We have observed cases where the graceful decommission signal is sent to a node while containers are being launched on the NodeManager; in such cases the ResourceManager moves the node from the Decommissioning to the Decommissioned state because launched containers are not checked in DeactivateNodeTransition.
> We suggest using a multi-arc transition instead of DeactivateNodeTransition, which checks for AM containers from the scheduler and then decides whether to keep the node in the Decommissioning state or move it to the Decommissioned state.
>
> {code:java}
> .addTransition(NodeState.DECOMMISSIONING, NodeState.DECOMMISSIONED,
>     RMNodeEventType.DECOMMISSION,
>     new DeactivateNodeTransition(NodeState.DECOMMISSIONED)){code}
>
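The suggested multi-arc transition amounts to choosing the target state at transition time instead of hard-coding it. A minimal sketch, with state names from the issue but a hypothetical method shape:

```java
public class DecommissionArcSketch {
    public enum NodeState { DECOMMISSIONING, DECOMMISSIONED }

    // Multi-arc handling of the DECOMMISSION event: instead of unconditionally
    // jumping to DECOMMISSIONED, consult the scheduler for launched/AM
    // containers and stay in DECOMMISSIONING while any remain, until the
    // graceful decommission timeout expires.
    public static NodeState onDecommission(int launchedContainers, boolean timeoutExpired) {
        if (launchedContainers > 0 && !timeoutExpired) {
            return NodeState.DECOMMISSIONING;
        }
        return NodeState.DECOMMISSIONED;
    }
}
```

In the real state machine this would be expressed as a MultipleArcTransition returning the chosen state, rather than the single-target addTransition shown in the issue.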
[jira] [Commented] (YARN-11414) ClusterMetricsInfo shows wrong availableMB when node labels enabled
[ https://issues.apache.org/jira/browse/YARN-11414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17676495#comment-17676495 ] Prabhu Joseph commented on YARN-11414:
--
[~maniraj...@gmail.com] ClusterMetricsInfo currently shows available and allocated resources only for the Default Partition, while QueueMetrics already reports them at the partition level. This Jira intends to change ClusterMetrics to show cluster-wide values, which will help all the schedulers. Do you have any comments?

> ClusterMetricsInfo shows wrong availableMB when node labels enabled
>
> Key: YARN-11414
> URL: https://issues.apache.org/jira/browse/YARN-11414
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.3
> Reporter: Prabhu Joseph
> Assignee: Ashutosh Gupta
> Priority: Major
>
> ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows
> availableMB of Default Partition alone.
[jira] [Updated] (YARN-11417) RM Crashes when changing Node Label of a Node in Distributed Configuration
[ https://issues.apache.org/jira/browse/YARN-11417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11417: - Description: RM Crashes when changing Node Label of a Node in Distributed Configuration. {code:java} 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type NODE_REMOVED to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83) at java.lang.Thread.run(Thread.java:750) {code} *Repro* 1. Two NodeManagers with CORE Node Label {code:java} yarn.nodemanager.node-labels.provider.configured-node-partition=CORE yarn.node-labels.enabled = true yarn.node-labels.configuration-type = distributed yarn.nodemanager.node-labels.provider = config {code} 2. Remove the Node Label from one of the node to make it Default Partition and restart nodemanager. was: RM Crashes when changing Node Label of a Node in Distributed Configuration. 
{code} 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type NODE_REMOVED to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83) at java.lang.Thread.run(Thread.java:750) {code} *Repro* 1. Two NodeManagers with CORE Node Label {code} yarn.nodemanager.node-labels.provider.configured-node-partition=CORE yarn.node-labels.enabled = true yarn.node-labels.configuration-type = distributed yarn.nodemanager.node-labels.provider = config {code} 2. Change the Node Label of one of the node into TASK and restart nodemanager. > RM Crashes when changing Node Label of a Node in Distributed Configuration > -- > > Key: YARN-11417 > URL: https://issues.apache.org/jira/browse/YARN-11417 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.3.3 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > > RM Crashes when changing Node Label of a Node in Distributed Configuration. 
> {code:java} > 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher > (SchedulerEventDispatcher:Event Processor): Error in handling event type > NODE_REMOVED to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83) > at java.lang.Thread.run(Thread.java:750) > {code} > *Repro* > 1. Two NodeManagers with CORE Node Label > {code:java} > yarn.nodemanager.node-labels.provider.configured-node-partition=CORE > yarn.node-labels.enabled = true > yarn.node-labels.configuration-type = distributed > yarn.nodemanager.node-labels.provider = config > {code} > 2. Remove the Node Label from one of the node to make it Default Partition > and restart nodemanager. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail:
[jira] [Created] (YARN-11417) RM Crashes when changing Node Label of a Node in Distributed Configuration
Prabhu Joseph created YARN-11417: Summary: RM Crashes when changing Node Label of a Node in Distributed Configuration Key: YARN-11417 URL: https://issues.apache.org/jira/browse/YARN-11417 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.3.3 Reporter: Prabhu Joseph Assignee: Prabhu Joseph RM Crashes when changing Node Label of a Node in Distributed Configuration. {code} 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type NODE_REMOVED to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83) at java.lang.Thread.run(Thread.java:750) {code} *Repro* 1. Two NodeManagers with CORE Node Label {code} yarn.nodemanager.node-labels.provider.configured-node-partition=CORE yarn.node-labels.enabled = true yarn.node-labels.configuration-type = distributed yarn.nodemanager.node-labels.provider = config {code} 2. Change the Node Label of one of the node into TASK and restart nodemanager. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11414) ClusterMetricsInfo shows wrong availableMB when node labels enabled
[ https://issues.apache.org/jira/browse/YARN-11414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11414: - Description: ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows availableMB of Default Partition alone. (was: ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows availableMB of Default Partition alone. This is a regression from Hadoop-3.2.1 where it has shown cluster wide availableMB.) > ClusterMetricsInfo shows wrong availableMB when node labels enabled > > > Key: YARN-11414 > URL: https://issues.apache.org/jira/browse/YARN-11414 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.3 >Reporter: Prabhu Joseph >Assignee: Ashutosh Gupta >Priority: Major > > ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows > availableMB of Default Partition alone.
[jira] [Assigned] (YARN-11414) ClusterMetricsInfo shows wrong availableMB when node labels enabled
[ https://issues.apache.org/jira/browse/YARN-11414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-11414: Assignee: Ashutosh Gupta (was: Prabhu Joseph) > ClusterMetricsInfo shows wrong availableMB when node labels enabled > > > Key: YARN-11414 > URL: https://issues.apache.org/jira/browse/YARN-11414 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.3 >Reporter: Prabhu Joseph >Assignee: Ashutosh Gupta >Priority: Major > > ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows > availableMB of Default Partition alone. This is a regression from > Hadoop-3.2.1, which showed cluster-wide availableMB.
[jira] [Created] (YARN-11414) ClusterMetricsInfo shows wrong availableMB when node labels enabled
Prabhu Joseph created YARN-11414: Summary: ClusterMetricsInfo shows wrong availableMB when node labels enabled Key: YARN-11414 URL: https://issues.apache.org/jira/browse/YARN-11414 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.3.3 Reporter: Prabhu Joseph Assignee: Prabhu Joseph ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows availableMB of Default Partition alone. This is a regression from Hadoop-3.2.1, which showed cluster-wide availableMB.
[jira] [Commented] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure
[ https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653796#comment-17653796 ] Prabhu Joseph commented on YARN-11403: -- [~bteke] Currently, the Maximum Allocation value is the maximum of the healthy NodeManager capabilities ({{yarn.nodemanager.resource.memory-mb}}). If no healthy NodeManager is running, it falls back to the configured maximum allocation ({{yarn.scheduler.maximum-allocation-mb}}). This part is correct and is not going to change. When a node is under decommission, the capability of that node is dynamically updated to the amount of resources in use. This updated value is also considered in the maximum-allocation calculation, which leads to inconsistent maximum-allocation values and causes job failures. For example, consider a cluster with two worker nodes, node1 (100 GB) and node2 (100 GB), and a configured maxAllocation of 20 GB. If both nodes become UNHEALTHY for any reason, MaximumAllocation reverts to the configured value of 20 GB. This part is correct. However, suppose one node is UNHEALTHY and the other is under decommission with 1 GB in use; the maximum allocation is now 1 GB. This is wrong and leads to job failures; the expected value in this scenario is 20 GB. The fix planned in this Jira is to exclude the capability of nodes under decommission from the maximum-allocation calculation. > Decommission Node reduces the maximumAllocation and leads to Job Failure > > > Key: YARN-11403 > URL: https://issues.apache.org/jira/browse/YARN-11403 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Prabhu Joseph >Assignee: Vinay Devadiga >Priority: Major > > When a node is put into Decommission, ClusterNodeTracker updates the > maximumAllocation to the totalResources in use from that node. This could > lead to Job Failure (with below error message) when the Job requests for a > container of size greater than the new maximumAllocation. 
> {code:java} > 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in > a row. > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request! Cannot allocate containers as requested resource is greater > than maximum allowed allocation. Requested resource type=[vcores], Requested > resource= vCores:2147483647>, maximum allowed allocation=, please > note that maximum allowed allocation is calculated by scheduler based on > maximum resource of registered NodeManagers, which might be less than > configured maximum allocation= > {code} > *Repro:* > 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager > Resource Memory 10GB and configured maxAllocation is 10GB. > 2. Submit SparkPi Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say > ApplicationMaster (2GB) is launched on node1. > 3. Put both nodes into Decommission. This makes maxAllocation come down to > 2GB. > 4. The SparkPi Job fails as it requests an Executor Size of 4GB whereas > maxAllocation is only 2GB.
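The maximum-allocation calculation discussed in the comment above can be sketched as follows. This is a simplified illustration under the assumption of a single memory resource; the class and method names are hypothetical, not the actual ClusterNodeTracker code:

```java
import java.util.List;

// Simplified sketch of deriving the effective maximum allocation from
// registered NodeManager capabilities. Names are illustrative.
public class MaxAllocationSketch {
    public static class Node {
        public final long capabilityMb;
        public final boolean healthy;
        public final boolean decommissioning;
        public Node(long capabilityMb, boolean healthy, boolean decommissioning) {
            this.capabilityMb = capabilityMb;
            this.healthy = healthy;
            this.decommissioning = decommissioning;
        }
    }

    // configuredMaxMb plays the role of yarn.scheduler.maximum-allocation-mb
    public static long effectiveMaxAllocation(List<Node> nodes, long configuredMaxMb) {
        return nodes.stream()
                // the planned fix: skip decommissioning nodes, whose
                // capability has been shrunk to the resources still in use
                .filter(n -> n.healthy && !n.decommissioning)
                .mapToLong(n -> n.capabilityMb)
                .max()
                // no eligible node registered: fall back to the configured value
                .orElse(configuredMaxMb);
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                new Node(1024, true, true),    // decommissioning, 1 GB in use
                new Node(10240, false, false)  // UNHEALTHY
        );
        // Excluding the decommissioning node leaves no eligible node,
        // so the configured 20 GB maximum applies instead of 1 GB.
        System.out.println(effectiveMaxAllocation(nodes, 20480));
    }
}
```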
[jira] [Assigned] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure
[ https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-11403: Assignee: Vinay Devadiga (was: Prabhu Joseph) > Decommission Node reduces the maximumAllocation and leads to Job Failure > > > Key: YARN-11403 > URL: https://issues.apache.org/jira/browse/YARN-11403 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Prabhu Joseph >Assignee: Vinay Devadiga >Priority: Major > > When a node is put into Decommission, ClusterNodeTracker updates the > maximumAllocation to the totalResources in use from that node. This could > lead to Job Failure (with below error message) when the Job requests for a > container of size greater than the new maximumAllocation. > {code:java} > 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in > a row. > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request! Cannot allocate containers as requested resource is greater > than maximum allowed allocation. Requested resource type=[vcores], Requested > resource= vCores:2147483647>, maximum allowed allocation=, please > note that maximum allowed allocation is calculated by scheduler based on > maximum resource of registered NodeManagers, which might be less than > configured maximum allocation= > {code} > *Repro:* > 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager > Resource Memory 10GB and configured maxAllocation is 10GB. > 2. Submit SparkPi Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say > ApplicationMaster (2GB) is launched on node1. > 3. Put both nodes into Decommission. This makes maxAllocation to come down to > 2GB. > 4. The SparkPi Job fails as it requests for Executor Size of 4GB whereas > maxAllocation is only 2GB. 
[jira] [Updated] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure
[ https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11403: - Description: When a node is put into Decommission, ClusterNodeTracker updates the maximumAllocation to the totalResources in use from that node. This could lead to Job Failure (with below error message) when the Job requests for a container of size greater than the new maximumAllocation. {code:java} 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a row. org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[vcores], Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation= {code} *Repro:* 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager Resource Memory 10GB and configured maxAllocation is 10GB. 2. Submit SparkPi Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say ApplicationMaster (2GB) is launched on node1. 3. Put both nodes into Decommission. This makes maxAllocation to come down to 2GB. 4. The SparkPi Job fails as it requests for Executor Size of 4GB whereas maxAllocation is only 2GB. was: When a node is put into Decommission, ClusterNodeTracker updates the maximumAllocation to the totalResources in use from that node. This could lead to Job Failure (with below error message) when the Job requests for a container of size greater than the new maximumAllocation. {code} 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a row. org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! 
Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[vcores], Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation= {code} *Repro:* 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager Resource Memory 10GB and configured maxAllocation is 10GB. 2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark before it requests for Executors) 3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes maxAllocation to come down to 2GB. 4. Now notify the Spark Job. It requests for 4GB executor Size but the new maxAllocation is 2GB and so will fail. > Decommission Node reduces the maximumAllocation and leads to Job Failure > > > Key: YARN-11403 > URL: https://issues.apache.org/jira/browse/YARN-11403 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > When a node is put into Decommission, ClusterNodeTracker updates the > maximumAllocation to the totalResources in use from that node. This could > lead to Job Failure (with below error message) when the Job requests for a > container of size greater than the new maximumAllocation. > {code:java} > 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in > a row. > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request! Cannot allocate containers as requested resource is greater > than maximum allowed allocation. 
Requested resource type=[vcores], Requested > resource= vCores:2147483647>, maximum allowed allocation=, please > note that maximum allowed allocation is calculated by scheduler based on > maximum resource of registered NodeManagers, which might be less than > configured maximum allocation= > {code} > *Repro:* > 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager > Resource Memory 10GB and configured maxAllocation is 10GB. > 2. Submit SparkPi Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say > ApplicationMaster (2GB) is launched on node1. > 3. Put both nodes into Decommission. This makes maxAllocation come down to > 2GB. > 4. The SparkPi Job fails as it requests an Executor Size of 4GB whereas > maxAllocation is only 2GB.
[jira] [Updated] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure
[ https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11403: - Description: When a node is put into Decommission, ClusterNodeTracker updates the maximumAllocation to the totalResources in use from that node. This could lead to Job Failure (with below error message) when the Job requests for a container of size greater than the new maximumAllocation. {code} 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a row. org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[vcores], Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation= {code} *Repro:* 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager Resource Memory 10GB and configured maxAllocation is 10GB. 2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark before it requests for Executors) 3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes maxAllocation to come down to 2GB. 4. Now notify the Spark Job. It requests for 4GB executor Size but the new maxAllocation is 2GB and so will fail. was: When a node is put into Decommission, ClusterNodeTracker updates the maximumAllocation to the totalResources in use from that node. This could lead to Job Failure (with below error message) when the Job requests for a container of size greater than the new maximumAllocation. {code} 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a row. 
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[vcores], Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation= {code} **Repro:** 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager Resource Memory 10GB and configured maxAllocation is 10GB. 2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark before it requests for Executors) 3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes maxAllocation to come down to 2GB. 4. Now notify the Spark Job. It requests for 4GB executor Size but the new maxAllocation is 2GB and so will fail. > Decommission Node reduces the maximumAllocation and leads to Job Failure > > > Key: YARN-11403 > URL: https://issues.apache.org/jira/browse/YARN-11403 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > When a node is put into Decommission, ClusterNodeTracker updates the > maximumAllocation to the totalResources in use from that node. This could > lead to Job Failure (with below error message) when the Job requests for a > container of size greater than the new maximumAllocation. > {code} > 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in > a row. > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request! Cannot allocate containers as requested resource is greater > than maximum allowed allocation. 
Requested resource type=[vcores], Requested > resource= vCores:2147483647>, maximum allowed allocation=, please > note that maximum allowed allocation is calculated by scheduler based on > maximum resource of registered NodeManagers, which might be less than > configured maximum allocation= > {code} > *Repro:* > 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager > Resource Memory 10GB and configured maxAllocation is 10GB. > 2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say > ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark > before it requests for Executors) > 3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes > maxAllocation come down to 2GB. > 4. Now notify the Spark Job. It requests a 4GB Executor Size but the new > maxAllocation is 2GB, so it will fail.
[jira] [Created] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure
Prabhu Joseph created YARN-11403: Summary: Decommission Node reduces the maximumAllocation and leads to Job Failure Key: YARN-11403 URL: https://issues.apache.org/jira/browse/YARN-11403 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.3.4 Reporter: Prabhu Joseph Assignee: Prabhu Joseph When a node is put into Decommission, ClusterNodeTracker updates the maximumAllocation to the totalResources in use from that node. This could lead to Job Failure (with below error message) when the Job requests for a container of size greater than the new maximumAllocation. {code} 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a row. org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[vcores], Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation= {code} *Repro:* 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager Resource Memory 10GB and configured maxAllocation is 10GB. 2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark before it requests for Executors) 3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes maxAllocation come down to 2GB. 4. Now notify the Spark Job. It requests a 4GB Executor Size but the new maxAllocation is 2GB, so it will fail.
[jira] [Commented] (YARN-11401) Separate AppMaster cleanup events and launcher event into different resource pools
[ https://issues.apache.org/jira/browse/YARN-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649306#comment-17649306 ] Prabhu Joseph commented on YARN-11401: -- [~Daniel Ma] This is a duplicate of [YARN-11251|https://issues.apache.org/jira/browse/YARN-11251]. > Separate AppMaster cleanup events and launcher event into different resource > pools > -- > > Key: YARN-11401 > URL: https://issues.apache.org/jira/browse/YARN-11401 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Daniel Ma >Priority: Major > Labels: pull-request-available > > Currently, there is only one thread pool to handle AM launch and cleanup > events in the ResourceManager. > In some cases, too many cleanup events can leave AM launches stuck for a long > time. > So in this patch, we divide the shared thread pool into two separate ones to > handle the different event types, so that a flood of cleanup events does not > block launcher events from being handled in a timely manner, and vice versa.
[jira] [Resolved] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared
[ https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-11285. -- Resolution: Duplicate > LocalizedResources are leaked and its LocalPath are not cleared > --- > > Key: YARN-11285 > URL: https://issues.apache.org/jira/browse/YARN-11285 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.1 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > LocalizedResources are leaked and its LocalPath are not cleared from NM Local > Directories. > Each container has separate LocalizedResource object and separate local path > like below. > {code} >/mnt/yarn/usercache/hive/filecache/6/2552419: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552420: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552421: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552422: >total 28456 > {code} > NM logs will be filled with below > {code} > 2022-08-07 09:00:00,275 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource > (IPC Server handler 4 on 8040): Resource > hdfs://hdfscluster/user/svc_di_data_eng/.hiveJars/hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar(->/mnt/yarn/usercache/data_eng_user/filecache/2498262/hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar) > transitioned from LOCALIZED to null > 2022-08-07 09:00:00,340 INFO > org.apache.hadoop.yarn.util.ProcfsBasedProcessTree (Container Monitor): > SmapBasedCumulativeRssmem (bytes) : 
0 > 2022-08-07 09:00:00,386 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource > (IPC Server handler 9 on 8040): Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: > LOCALIZED at LOCALIZED > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:198) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:186) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:58) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.processHeartbeat(ResourceLocalizationService.java:1048) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:722) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:356) > at > org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:48) > at > org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:63) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489) > {code}
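The InvalidStateTransitionException above comes from YARN's generic, table-driven state machine rejecting an event that has no registered transition for the current state. A simplified illustration of that failure mode; this is not the actual StateMachineFactory API, and the states and events are reduced to the minimum needed to show the duplicate LOCALIZED event:

```java
// Minimal sketch of a transition check that rejects an event with no
// registered transition, mirroring "Invalid event: LOCALIZED at LOCALIZED".
public class TransitionSketch {
    enum State { INIT, DOWNLOADING, LOCALIZED }
    enum Event { REQUEST, LOCALIZED }

    static State transition(State current, Event event) {
        if (current == State.INIT && event == Event.REQUEST) {
            return State.DOWNLOADING;
        }
        if (current == State.DOWNLOADING && event == Event.LOCALIZED) {
            return State.LOCALIZED;
        }
        // No registered transition for this (state, event) pair.
        throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }

    public static void main(String[] args) {
        State s = transition(State.INIT, Event.REQUEST); // INIT -> DOWNLOADING
        s = transition(s, Event.LOCALIZED);              // DOWNLOADING -> LOCALIZED
        try {
            transition(s, Event.LOCALIZED);              // duplicate localization event
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```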
[jira] [Assigned] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
[ https://issues.apache.org/jira/browse/YARN-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-11355: Assignee: Vineeth Naroju (was: Prabhu Joseph) > YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 > -- > > Key: YARN-11355 > URL: https://issues.apache.org/jira/browse/YARN-11355 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Vineeth Naroju >Priority: Major > > YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during > the initial retry. > *Repro:* > {code:java} > 1. YARN Cluster with three master nodes rm1, rm2 and rm3 > 2. rm3 is active > 3. yarn node -list or any other yarn client call takes more than 30 seconds. > {code} > The initial failover to rm2 is immediate, but the failover to rm3 happens only > after ~30000 ms. The current RetryPolicy does not honor the number of master > nodes; it has to perform at least one immediate failover to every rm. > {code:java} > 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: > Failing over to rm2 > 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Call From local to remote:8032 failed on > connection exception: java.net.ConnectException: Connection refused; For more > details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 > failover attempts. Trying to failover after sleeping for 21139ms. > {code} > > *Workaround:* > Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something like 100. > This will do an immediate failover to rm3, but there will be too many retries > when there is no active ResourceManager. > >
[jira] [Commented] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
[ https://issues.apache.org/jira/browse/YARN-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620888#comment-17620888 ] Prabhu Joseph commented on YARN-11355: -- Yes, done. > YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 > -- > > Key: YARN-11355 > URL: https://issues.apache.org/jira/browse/YARN-11355 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Vineeth Naroju >Priority: Major > > YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during > the initial retry. > *Repro:* > {code:java} > 1. YARN Cluster with three master nodes rm1, rm2 and rm3 > 2. rm3 is active > 3. yarn node -list or any other yarn client call takes more than 30 seconds. > {code} > The initial failover to rm2 is immediate, but the failover to rm3 happens only > after ~30000 ms. The current RetryPolicy does not honor the number of master > nodes; it has to perform at least one immediate failover to every rm. > {code:java} > 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: > Failing over to rm2 > 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Call From local to remote:8032 failed on > connection exception: java.net.ConnectException: Connection refused; For more > details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 > failover attempts. Trying to failover after sleeping for 21139ms. > {code} > > *Workaround:* > Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something like 100. > This will do an immediate failover to rm3, but there will be too many retries > when there is no active ResourceManager. > >
[jira] [Updated] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
[ https://issues.apache.org/jira/browse/YARN-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11355: - Description: YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during the initial retry. *Repro:* {code:java} 1. YARN Cluster with three master nodes rm1, rm2 and rm3 2. rm3 is active 3. yarn node -list or any other yarn client call takes more than 30 seconds. {code} The initial failover to rm2 is immediate, but the failover to rm3 happens only after ~30000 ms. The current RetryPolicy does not honor the number of master nodes; it has to perform at least one immediate failover to every rm. {code:java} 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From local to remote:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover attempts. Trying to failover after sleeping for 21139ms. {code} *Workaround:* Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something like 100. This will do an immediate failover to rm3, but there will be too many retries when there is no active ResourceManager. was: YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during the initial retry. *Repro:* {code:java} 1. YARN Cluster with three master nodes rm1, rm2 and rm3 2. rm3 is active 3. yarn node -list or any other yarn client call takes more than 30 seconds. {code} The initial failover to rm2 is immediate, but the failover to rm3 happens only after ~30000 ms. The current RetryPolicy does not honor the number of master nodes; it has to perform at least one immediate failover to every rm. 
{code:java} 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From local to remote:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover attempts. Trying to failover after sleeping for 21139ms. {code} *Workaround:* Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something like 100. This will do an immediate failover to rm3, but there will be too many retries when there is no active ResourceManager. > YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 > -- > > Key: YARN-11355 > URL: https://issues.apache.org/jira/browse/YARN-11355 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during > the initial retry. > *Repro:* > {code:java} > 1. YARN Cluster with three master nodes rm1, rm2 and rm3 > 2. rm3 is active > 3. yarn node -list or any other yarn client call takes more than 30 seconds. > {code} > The initial failover to rm2 is immediate, but the failover to rm3 happens only > after ~30000 ms. The current RetryPolicy does not honor the number of master > nodes; it has to perform at least one immediate failover to every rm. 
> {code:java} > 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: > Failing over to rm2 > 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Call From local to remote:8032 failed on > connection exception: java.net.ConnectException: Connection refused; For more > details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 > failover attempts. Trying to failover after sleeping for 21139ms. > {code} > > *Workaround:* > Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something like 100. > This will do an immediate failover to rm3, but there will be too many retries > when there is no active ResourceManager. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
Prabhu Joseph created YARN-11355: Summary: YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 Key: YARN-11355 URL: https://issues.apache.org/jira/browse/YARN-11355 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 3.4.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph YARN Client fails over immediately to rm2 but takes ~30000ms to fail over to rm3 during the initial retry. *Repro:* {code:java} 1. YARN Cluster with three master nodes rm1, rm2 and rm3 2. rm3 is active 3. yarn node -list or any other yarn client call takes more than 30 seconds. {code} The initial failover to rm2 is immediate, but the subsequent failover to rm3 happens only after ~30000ms. The current RetryPolicy does not honor the number of master nodes. It has to perform at least one immediate failover to every RM. {code:java} 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From local to remote:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover attempts. Trying to failover after sleeping for 21139ms. {code} *Workaround:* Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something like 100. This will do an immediate failover to rm3, but there will be too many retries when there is no active ResourceManager. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
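The fix direction described above — a retry policy that honors the number of configured RMs — can be sketched as follows. This is a hypothetical illustration, not the actual Hadoop RetryPolicy API; the class and method names are made up for the example, and the 30000ms default mirrors yarn.resourcemanager.connect.retry-interval.ms.

```java
// Hypothetical sketch: a failover delay calculator that tries every
// configured RM once with no sleep before applying the long retry interval.
public class FailoverDelayDemo {
    /**
     * @param failoverAttempt     number of failovers performed so far
     * @param numResourceManagers configured RM count (rm1, rm2, rm3, ...)
     * @param retryIntervalMs     e.g. yarn.resourcemanager.connect.retry-interval.ms (default 30000)
     * @return milliseconds to sleep before the next failover
     */
    public static long delayMs(int failoverAttempt, int numResourceManagers, long retryIntervalMs) {
        // The first (numResourceManagers - 1) failovers are immediate so that
        // every RM gets one fast attempt; only then does the long backoff start.
        if (failoverAttempt < numResourceManagers - 1) {
            return 0L;
        }
        return retryIntervalMs;
    }

    public static void main(String[] args) {
        // Three RMs: rm1 -> rm2 and rm2 -> rm3 are immediate, then 30s backoff.
        System.out.println(delayMs(0, 3, 30000)); // 0
        System.out.println(delayMs(1, 3, 30000)); // 0
        System.out.println(delayMs(2, 3, 30000)); // 30000
    }
}
```

With this shape, a client facing a down rm1 and rm2 reaches the active rm3 without the ~30000ms sleep, while a cluster with no active RM still backs off instead of hammering all three.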
[jira] [Commented] (YARN-11352) Support new API to get the total resource available in Yarn
[ https://issues.apache.org/jira/browse/YARN-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620841#comment-17620841 ] Prabhu Joseph commented on YARN-11352: -- Thanks [~SanjayKumarSahu] for reporting the issue. Currently, Tez split calculation is based on AMRMClient#getAvailableResources, which returns the headroom computed from the configured Queue/User/Partition limits. Basing the split calculation on the total YARN cluster resource instead would lead to high task parallelism, with the Tez job waiting for queue/user/partition resources it will never get. {code:java} /** * Get the currently available resources in the cluster. * A valid value is available after a call to allocate has been made * @return Currently available resources */ public abstract Resource getAvailableResources(); {code} > Support new API to get the total resource available in Yarn > --- > > Key: YARN-11352 > URL: https://issues.apache.org/jira/browse/YARN-11352 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacity scheduler, resourcemanager, yarn >Affects Versions: 3.4.0 >Reporter: Sanjay Kumar Sahu >Priority: Major > > Hive needs the total resource available in YARN via the AMRMClient interface. This > helps Hive decide the split count (fix the split calculation logic for Hive > on Tez/LLAP in clusters). > > The improvement is identified as a problem in split calculation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
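The headroom-versus-cluster-total argument in the comment above comes down to simple arithmetic. The sketch below is purely illustrative (the numbers and names are invented, not taken from the issue): sizing splits from the total cluster resource instead of the AMRMClient headroom over-parallelizes a job that is capped by its queue/user limit.

```java
// Illustrative arithmetic only: split count derived from available memory.
public class SplitCountDemo {
    public static int splits(long availableMemMb, long perTaskMemMb) {
        return (int) (availableMemMb / perTaskMemMb);
    }

    public static void main(String[] args) {
        long clusterTotalMb = 1_000_000;  // the whole cluster
        long queueHeadroomMb = 100_000;   // what this job's queue/user limit actually allows
        long perTaskMb = 4_096;
        // Headroom-based sizing: tasks the job can really run concurrently.
        System.out.println(splits(queueHeadroomMb, perTaskMb)); // 24
        // Total-based sizing: ten times as many tasks, most of which would
        // wait for resources the job will never be granted.
        System.out.println(splits(clusterTotalMb, perTaskMb)); // 244
    }
}
```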
[jira] [Updated] (YARN-11255) Support loading alternative docker client config from system environment
[ https://issues.apache.org/jira/browse/YARN-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11255: - Labels: (was: pull-request-available) > Support loading alternative docker client config from system environment > > > Key: YARN-11255 > URL: https://issues.apache.org/jira/browse/YARN-11255 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Major > Fix For: 3.4.0 > > > When using YARN docker support, the hadoop shell supports > {code:java} > -docker_client_config{code} > to pass the client config file that contains the security token, which is used to generate the > docker config for each job as a temporary file. > Other applications that submit jobs to YARN, e.g. Spark, which load the > docker settings via system environment variables, e.g. > {code:java} > spark.executorEnv.* {code} > are not able to add those authorization tokens because these system > environment variables aren't considered by YARN. > This adds a generic solution to handle these kinds of cases without making changes in > Spark code or other frameworks. > E.g., > when using a remote container registry, > {{YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG}} must reference the config.json > file containing the credentials used to authenticate. 
> {code:java} > DOCKER_IMAGE_NAME=hadoop-docker > DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json > spark-submit --master yarn \ > --deploy-mode cluster \ > --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \ > --conf > spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \ > --conf > spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG > \ > --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \ > --conf > spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME > \ > --conf > spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG > \ > sparkR.R{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11255) Support loading alternative docker client config from system environment
[ https://issues.apache.org/jira/browse/YARN-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-11255. -- Fix Version/s: 3.4.0 Resolution: Fixed > Support loading alternative docker client config from system environment > > > Key: YARN-11255 > URL: https://issues.apache.org/jira/browse/YARN-11255 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > When using YARN docker support, the hadoop shell supports > {code:java} > -docker_client_config{code} > to pass the client config file that contains the security token, which is used to generate the > docker config for each job as a temporary file. > Other applications that submit jobs to YARN, e.g. Spark, which load the > docker settings via system environment variables, e.g. > {code:java} > spark.executorEnv.* {code} > are not able to add those authorization tokens because these system > environment variables aren't considered by YARN. > This adds a generic solution to handle these kinds of cases without making changes in > Spark code or other frameworks. > E.g., > when using a remote container registry, > {{YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG}} must reference the config.json > file containing the credentials used to authenticate. 
> {code:java} > DOCKER_IMAGE_NAME=hadoop-docker > DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json > spark-submit --master yarn \ > --deploy-mode cluster \ > --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \ > --conf > spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \ > --conf > spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG > \ > --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \ > --conf > spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME > \ > --conf > spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG > \ > sparkR.R{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11299) NMWebService endpoint to expose tracked LocalizedResources and the references
Prabhu Joseph created YARN-11299: Summary: NMWebService endpoint to expose tracked LocalizedResources and the references Key: YARN-11299 URL: https://issues.apache.org/jira/browse/YARN-11299 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 3.3.0 Reporter: Prabhu Joseph Assignee: Samrat Deb NMWebService endpoint to expose the tracked LocalizedResources and their references. This will be useful for monitoring and debugging purposes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared
[ https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11285: - Description: LocalizedResources are leaked and their LocalPaths are not cleared from the NM local directories. Each container has a separate LocalizedResource object and a separate local path, like below. {code} /mnt/yarn/usercache/hive/filecache/6/2552419: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552420: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552421: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552422: total 28456 {code} The NM logs will be filled with entries like the following: {code} 2022-08-07 09:00:00,275 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource (IPC Server handler 4 on 8040): Resource hdfs://hdfscluster/user/svc_di_data_eng/.hiveJars/hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar(->/mnt/yarn/usercache/data_eng_user/filecache/2498262/hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar) transitioned from LOCALIZED to null 2022-08-07 09:00:00,340 INFO org.apache.hadoop.yarn.util.ProcfsBasedProcessTree (Container Monitor): SmapBasedCumulativeRssmem (bytes) : 0 2022-08-07 09:00:00,386 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource (IPC Server handler 9 on 8040): Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LOCALIZED at LOCALIZED at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:198) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:186) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:58) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.processHeartbeat(ResourceLocalizationService.java:1048) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:722) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:356) at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:48) at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:63) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489) {code} was: LocalizedResources are leaked and their LocalPaths are not cleared from the NM local directories. Each container has a separate LocalizedResource object and a separate local path, like below. {code} /mnt/yarn/usercache/hive/filecache/6/2552419: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552420: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552421: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24
[jira] [Updated] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared
[ https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11285: - Description: LocalizedResources are leaked and their LocalPaths are not cleared from the NM local directories. Each container has a separate LocalizedResource object and a separate local path, like below. {code} /mnt/yarn/usercache/hive/filecache/6/2552419: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552420: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552421: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552422: total 28456 {code} was: LocalizedResources are leaked and their LocalPaths are not cleared from the NM local directories. When multiple containers are initialized at the same time, the LocalResourcesTrackerImpl REQUEST handler could create and handle multiple LocalizedResource objects for the same input path due to a race condition in [the below code|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java#L149] {code} case REQUEST: LocalResourceRequest req = event.getLocalResourceRequest(); LocalizedResource rsrc = localrsrc.get(req); if (null == rsrc) { rsrc = new LocalizedResource(req, dispatcher); localrsrc.put(req, rsrc); } rsrc.handle(event); {code} Each container will have a separate LocalizedResource object and a separate local path, like below. 
{code} /mnt/yarn/usercache/hive/filecache/6/2552419: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552420: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552421: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552422: total 28456 {code} > LocalizedResources are leaked and its LocalPath are not cleared > --- > > Key: YARN-11285 > URL: https://issues.apache.org/jira/browse/YARN-11285 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.1 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > LocalizedResources are leaked and its LocalPath are not cleared from NM Local > Directories. > Each container has separate LocalizedResource object and separate local path > like below. > {code} >/mnt/yarn/usercache/hive/filecache/6/2552419: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552420: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552421: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552422: >total 28456 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared
[ https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11285: - Attachment: (was: TestConcurrency.java) > LocalizedResources are leaked and its LocalPath are not cleared > --- > > Key: YARN-11285 > URL: https://issues.apache.org/jira/browse/YARN-11285 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.1 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > LocalizedResources are leaked and its LocalPath are not cleared from NM Local > Directories. > Each container has separate LocalizedResource object and separate local path > like below. > {code} >/mnt/yarn/usercache/hive/filecache/6/2552419: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552420: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552421: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552422: >total 28456 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11196) NUMA Awareness support in DefaultContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-11196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-11196. -- Fix Version/s: 3.4.0 Resolution: Fixed > NUMA Awareness support in DefaultContainerExecutor > -- > > Key: YARN-11196 > URL: https://issues.apache.org/jira/browse/YARN-11196 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.3.3 >Reporter: Prabhu Joseph >Assignee: Samrat Deb >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > [YARN-5764|https://issues.apache.org/jira/browse/YARN-5764] has added support > of NUMA Awareness for Containers launched through LinuxContainerExecutor. > This feature is useful to have in DefaultContainerExecutor as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared
[ https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11285: - Attachment: TestConcurrency.java > LocalizedResources are leaked and its LocalPath are not cleared > --- > > Key: YARN-11285 > URL: https://issues.apache.org/jira/browse/YARN-11285 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.1 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: TestConcurrency.java > > > LocalizedResources are leaked and its LocalPath are not cleared from NM Local > Directories. When multiple containers are initialized at same time, > LocalResourcesTrackerImpl REQUEST handler could create and handle multiple > LocalizedResource object for the same input path due to race condition in > [below > code|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java#L149] > {code} > case REQUEST: > LocalResourceRequest req = event.getLocalResourceRequest(); > LocalizedResource rsrc = localrsrc.get(req); > > if (null == rsrc) { > rsrc = new LocalizedResource(req, dispatcher); > localrsrc.put(req, rsrc); > } > rsrc.handle(event); > {code} > Each container will have separate LocalizedResource object and separate local > path like below. 
> {code} >/mnt/yarn/usercache/hive/filecache/6/2552419: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552420: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552421: >total 28456 >-r-x-- 1 yarn yarn 29135164 Aug 7 10:24 > hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar >/mnt/yarn/usercache/hive/filecache/6/2552422: >total 28456 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared
Prabhu Joseph created YARN-11285: Summary: LocalizedResources are leaked and its LocalPath are not cleared Key: YARN-11285 URL: https://issues.apache.org/jira/browse/YARN-11285 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.2.1 Reporter: Prabhu Joseph Assignee: Prabhu Joseph LocalizedResources are leaked and their LocalPaths are not cleared from the NM local directories. When multiple containers are initialized at the same time, the LocalResourcesTrackerImpl REQUEST handler could create and handle multiple LocalizedResource objects for the same input path due to a race condition in [the below code|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java#L149] {code} case REQUEST: LocalResourceRequest req = event.getLocalResourceRequest(); LocalizedResource rsrc = localrsrc.get(req); if (null == rsrc) { rsrc = new LocalizedResource(req, dispatcher); localrsrc.put(req, rsrc); } rsrc.handle(event); {code} Each container will have a separate LocalizedResource object and a separate local path, like below. 
{code} /mnt/yarn/usercache/hive/filecache/6/2552419: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552420: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552421: total 28456 -r-x-- 1 yarn yarn 29135164 Aug 7 10:24 hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar /mnt/yarn/usercache/hive/filecache/6/2552422: total 28456 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
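The non-atomic get/put shape in the REQUEST handler quoted above can be reproduced and contrasted with an atomic alternative in miniature. This is a simplified sketch using plain JDK types — the class and method names are illustrative, not the actual NodeManager classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the race: a check-then-act get/put lets two threads requesting
// the same resource each create their own tracked entry, while an atomic
// computeIfAbsent guarantees a single shared entry per request key.
public class TrackerRaceDemo {
    private final Map<String, Object> localrsrc = new ConcurrentHashMap<>();

    // Race-prone shape, mirroring the quoted REQUEST handler.
    public Object getOrCreateRacy(String req) {
        Object rsrc = localrsrc.get(req);
        if (rsrc == null) {           // two threads can both observe null here...
            rsrc = new Object();
            localrsrc.put(req, rsrc); // ...and the second put overwrites the first,
        }                             // leaking the first entry's local path
        return rsrc;
    }

    // Atomic shape: at most one tracked object per request key.
    public Object getOrCreateAtomic(String req) {
        return localrsrc.computeIfAbsent(req, k -> new Object());
    }

    public static void main(String[] args) {
        TrackerRaceDemo demo = new TrackerRaceDemo();
        Object a = demo.getOrCreateAtomic("hive-exec.jar");
        Object b = demo.getOrCreateAtomic("hive-exec.jar");
        System.out.println(a == b); // true: both callers share one entry
    }
}
```

computeIfAbsent's atomicity guarantee (the mapping function runs at most once per absent key) is exactly what the check-then-act version lacks.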
[jira] [Created] (YARN-11251) Separate ThreadPool for AMLauncher Launch and Clean Events
Prabhu Joseph created YARN-11251: Summary: Separate ThreadPool for AMLauncher Launch and Clean Events Key: YARN-11251 URL: https://issues.apache.org/jira/browse/YARN-11251 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.4.0 Reporter: Prabhu Joseph Assignee: Samrat Deb We have seen many AM launch failures due to Token Expired or Container Liveliness Expiry errors when the AM launch threads are busy retrying connections to AM hosts (Spot Instances) that are down. Having separate thread pools for the Cleanup and Launch events will reduce these AM launch failures. *Token Expired* {code} 2022-07-19 14:56:33,486 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl (IPC Server handler 39 on 8041): Unauthorized request to start container. This token is expired. current time is 1658242593486 found 1658242289457 Note: System times on machines may be out of sync. Check system time and time zones. {code} *Container Liveliness Expiry* {code} 2022-07-19 16:06:48,663 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1656573205571_2357731_01_01 Container Transitioned from ACQUIRED to EXPIRED 2022-07-19 16:10:08,663 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor (Ping Checker): Expired: Timed out after 600 secs {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
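The proposal above — isolating launch work from cleanup work so slow launch retries cannot starve cleanup — can be sketched with plain JDK executors. This is an illustrative shape only, not the actual AMLauncher code; the pool sizes and names are arbitrary assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: route LAUNCH and CLEANUP events to separate pools. With a single
// shared pool, long connect retries against a dead AM host occupy threads
// and delay cleanup events queued behind them; separate pools isolate them.
public class AmLauncherPoolsDemo {
    private final ExecutorService launchPool = Executors.newFixedThreadPool(50);
    private final ExecutorService cleanupPool = Executors.newFixedThreadPool(10);

    enum EventType { LAUNCH, CLEANUP }

    public void handle(EventType type, Runnable work) {
        if (type == EventType.LAUNCH) {
            launchPool.submit(work);
        } else {
            cleanupPool.submit(work);
        }
    }

    public void shutdown() {
        launchPool.shutdown();
        cleanupPool.shutdown();
    }

    /** Waits for both pools to drain after shutdown(). */
    public boolean awaitQuiescence(long millis) {
        try {
            return launchPool.awaitTermination(millis, TimeUnit.MILLISECONDS)
                && cleanupPool.awaitTermination(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The design point is queue isolation: a cleanup event never waits behind a launch thread stuck in connection retries, so token expiry during cleanup becomes less likely.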
[jira] [Resolved] (YARN-11200) Backport YARN-5764 NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-11200. -- Fix Version/s: 2.10.3 Resolution: Fixed > Backport YARN-5764 NUMA awareness support for launching containers > > > Key: YARN-11200 > URL: https://issues.apache.org/jira/browse/YARN-11200 > Project: Hadoop YARN > Issue Type: Task > Components: nodemanager >Reporter: Prabhu Joseph >Assignee: Samrat Deb >Priority: Major > Labels: pull-request-available > Fix For: 2.10.3 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Few users who are on 2.10 are looking for NUMA Support in YARN. Backporting > YARN-5764 to 2.10.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception
[ https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571353#comment-17571353 ] Prabhu Joseph commented on YARN-11210: -- Thanks [~aajisaka]. I was not aware of that. > Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration > exception > -- > > Key: YARN-11210 > URL: https://issues.apache.org/jira/browse/YARN-11210 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Kevin Wikant >Assignee: Kevin Wikant >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > h2. Description of Problem > Applications which call YARN RMAdminCLI (i.e. the YARN ResourceManager client) > synchronously can be blocked for up to 15 minutes with the default > configuration of "yarn.resourcemanager.connect.max-wait.ms"; this is not an > issue in and of itself, but there is a non-retryable IllegalArgumentException > thrown within the YARN ResourceManager client that is getting > swallowed and treated as a retryable "connection exception", meaning that it > gets retried for 15 minutes. > The purpose of this JIRA (and PR) is to modify the YARN client so that it > does not retry on this non-retryable exception. > h2. Background Information > The YARN ResourceManager client treats connection exceptions as retryable, and with > the default value of "yarn.resourcemanager.connect.max-wait.ms" it will attempt > to connect to the ResourceManager for up to 15 minutes when facing > "connection exceptions". This arguably makes sense because connection > exceptions are in some cases transient and can be recovered from without any > action needed from the client. See the example below, where the YARN ResourceManager > client was able to recover from connection issues that resulted from the > ResourceManager process being down. 
> {quote}> yarn rmadmin -refreshNodes > 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at > /0.0.0.0:8033 > 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Your endpoint configuration is wrong; For more > details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while > invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over > null after 1 failover attempts. 
Trying to failover after sleeping for 41061ms. > 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:41:28 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Your endpoint configuration is wrong; For more > details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort],
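The fix direction described above (fail fast on a non-retryable exception instead of treating everything as a connection problem) can be sketched as follows. This is illustrative Java under stated assumptions, not the actual Hadoop `RetryPolicy` API; `isRetryable` is a hypothetical helper:

```java
import java.net.ConnectException;

public class RetrySketch {
    // Hypothetical helper: decides whether an exception is worth retrying.
    // A kerberos misconfiguration surfaces as IllegalArgumentException and
    // will never succeed on retry, so it must not be classified as a
    // transient "connection exception".
    public static boolean isRetryable(Throwable t) {
        // Unwrap to the root cause before classifying, so a wrapped
        // IllegalArgumentException is not mistaken for a connection failure.
        Throwable cause = t;
        while (cause.getCause() != null) {
            cause = cause.getCause();
        }
        return cause instanceof ConnectException;
    }

    public static void main(String[] args) {
        System.out.println(isRetryable(new ConnectException("refused")));
        System.out.println(isRetryable(
            new RuntimeException(new IllegalArgumentException("Can't get Kerberos realm"))));
    }
}
```

With a check like this in place, the client gives up immediately on the kerberos configuration error rather than looping for the full 15-minute connect window.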
[jira] [Resolved] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception
[ https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-11210. -- Fix Version/s: 3.4.0 Resolution: Fixed > Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration > exception > -- > > Key: YARN-11210 > URL: https://issues.apache.org/jira/browse/YARN-11210 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Kevin Wikant >Assignee: Kevin Wikant >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h
[jira] [Commented] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception
[ https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571182#comment-17571182 ] Prabhu Joseph commented on YARN-11210: -- [~aajisaka] Could you add [~KevinWikant] as a contributor to YARN? > Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration > exception > -- > > Key: YARN-11210 > URL: https://issues.apache.org/jira/browse/YARN-11210 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Kevin Wikant >Priority: Major > Labels: pull-request-available
[jira] [Resolved] (YARN-11198) Deletion of assigned resources (e.g. GPU's, NUMA, FPGA's) from State Store
[ https://issues.apache.org/jira/browse/YARN-11198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-11198. -- Fix Version/s: 3.4.0 Resolution: Fixed > Deletion of assigned resources (e.g. GPU's, NUMA, FPGA's) from State Store > -- > > Key: YARN-11198 > URL: https://issues.apache.org/jira/browse/YARN-11198 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.3 >Reporter: Prabhu Joseph >Assignee: Samrat Deb >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > [YARN-7033|https://issues.apache.org/jira/browse/YARN-7033] provided support > for recovering the resources assigned to a container, but did not delete them > from the State Store when the container is removed after the configured duration > yarn.nodemanager.duration-to-track-stopped-containers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
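The fix the issue asks for can be sketched as below. This is a minimal sketch with hypothetical names, not the actual NodeManager state-store API: the point is that when a stopped container ages out of the tracking window, its recovered resource assignments must be deleted along with the container record.

```java
import java.util.HashMap;
import java.util.Map;

public class StateStoreSketch {
    // Stand-in for the persisted mapping container -> assigned resources
    // (GPUs, NUMA nodes, FPGAs) that YARN-7033 introduced for recovery.
    private final Map<String, String> containerResources = new HashMap<>();

    public void storeAssignedResources(String containerId, String resources) {
        containerResources.put(containerId, resources);
    }

    public boolean hasAssignedResources(String containerId) {
        return containerResources.containsKey(containerId);
    }

    // Called once duration-to-track-stopped-containers expires for a
    // stopped container; the fix is to also drop the assigned resources
    // here instead of leaving them in the store forever.
    public void removeContainer(String containerId) {
        containerResources.remove(containerId);
    }

    public static void main(String[] args) {
        StateStoreSketch store = new StateStoreSketch();
        store.storeAssignedResources("container_1631559260564_0009_01_000001", "gpu:0,1");
        store.removeContainer("container_1631559260564_0009_01_000001");
        System.out.println(store.hasAssignedResources("container_1631559260564_0009_01_000001"));
    }
}
```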
[jira] [Created] (YARN-11198) Deletion of assigned resources (e.g. GPU's, NUMA, FPGA's) from State Store
Prabhu Joseph created YARN-11198: Summary: Deletion of assigned resources (e.g. GPU's, NUMA, FPGA's) from State Store Key: YARN-11198 URL: https://issues.apache.org/jira/browse/YARN-11198 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.3.3 Reporter: Prabhu Joseph Assignee: Samrat Deb
[jira] [Created] (YARN-11196) NUMA Awareness support in DefaultContainerExecutor
Prabhu Joseph created YARN-11196: Summary: NUMA Awareness support in DefaultContainerExecutor Key: YARN-11196 URL: https://issues.apache.org/jira/browse/YARN-11196 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 3.3.3 Reporter: Prabhu Joseph Assignee: Samrat Deb [YARN-5764|https://issues.apache.org/jira/browse/YARN-5764] added NUMA Awareness support for containers launched through the LinuxContainerExecutor. This feature would be useful to have in the DefaultContainerExecutor as well.
[jira] [Updated] (YARN-11195) Document how to configure NUMA in YARN
[ https://issues.apache.org/jira/browse/YARN-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-11195: - Summary: Document how to configure NUMA in YARN (was: Doc on how to configure NUMA in YARN) > Document how to configure NUMA in YARN > -- > > Key: YARN-11195 > URL: https://issues.apache.org/jira/browse/YARN-11195 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Affects Versions: 3.3.3 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > [YARN-5764|https://issues.apache.org/jira/browse/YARN-5764] added NUMA > Awareness support for launching containers. This improves workload > performance on machines with NUMA support, such as EC2 m5.24x. > Currently this feature works only with the LinuxContainerExecutor, not the > DefaultContainerExecutor. We have seen users configure it on the > DefaultContainerExecutor by mistake and observe no improvement. > We suggest documenting how to enable NUMA in YARN.
[jira] [Created] (YARN-11195) Doc on how to configure NUMA in YARN
Prabhu Joseph created YARN-11195: Summary: Doc on how to configure NUMA in YARN Key: YARN-11195 URL: https://issues.apache.org/jira/browse/YARN-11195 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 3.3.3 Reporter: Prabhu Joseph Assignee: Prabhu Joseph
[jira] [Resolved] (YARN-9971) YARN Native Service HttpProbe logs THIS_HOST in error messages
[ https://issues.apache.org/jira/browse/YARN-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-9971. - Fix Version/s: 3.4.0 Resolution: Fixed Thanks [~groot] for the patch. I have committed the patch to trunk. > YARN Native Service HttpProbe logs THIS_HOST in error messages > -- > > Key: YARN-9971 > URL: https://issues.apache.org/jira/browse/YARN-9971 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Ashutosh Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > YARN Native Service HttpProbe logs THIS_HOST in error messages. While > logging, it missed using the URL string with the substitutions applied. > {code:java} > 2019-11-12 19:25:47,317 [pool-7-thread-1] INFO probe.HttpProbe - Probe > http://${THIS_HOST}:18010/master-status failed for IP 172.27.75.198: > java.net.ConnectException: Connection refused (Connection refused) > {code}
[jira] [Commented] (YARN-11030) ClassNotFoundException when aux service class is loaded from customized classpath
[ https://issues.apache.org/jira/browse/YARN-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453909#comment-17453909 ] Prabhu Joseph commented on YARN-11030: -- [~hadachi] Thanks for reporting the issue. This looks like a duplicate of [YARN-9967|https://issues.apache.org/jira/browse/YARN-9967]. Can you confirm? Thanks. > ClassNotFoundException when aux service class is loaded from customized > classpath > - > > Key: YARN-11030 > URL: https://issues.apache.org/jira/browse/YARN-11030 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.3.1 >Reporter: Hiroyuki Adachi >Priority: Minor > > NodeManager failed to load the aux service with ClassNotFoundException while > loading the class from the customized classpath. > {noformat} > > > value="org.apache.spark.network.yarn.YarnShuffleService"/> > value="/tmp/spark-3.1.2-yarn-shuffle.jar"/> > > {noformat} > {noformat} > 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: > classpath: [file:/tmp/spark-3.1.2-yarn-shuffle.jar] > 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: > system classes: [org.apache.spark.network.yarn.YarnShuffleService] > 2021-12-06 15:32:09,169 INFO org.apache.hadoop.service.AbstractService: > Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in > state INITED > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042) > Caused by: java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.ja > va:165) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452) > ... 
10 more > 2021-12-06 15:32:09,172 INFO org.apache.hadoop.service.AbstractService: > Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl > > failed in state INITED{noformat} > > YARN-9075 may cause this problem. The default system classes were changed by > this patch. > Before YARN-9075: isSystemClass() returns false since the system classes does > not contain the aux service class itself, and the class will be loaded from > the customized classpath. > [https://github.com/apache/hadoop/blob/rel/release-3.3.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ApplicationClassLoader.java#L176] > {noformat} > 2021-12-06 15:50:21,332 INFO org.apache.hadoop.util.ApplicationClassLoader: > classpath:
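The behavior change the reporter traces to YARN-9075 can be sketched as below. This is a simplified model of the system-class check (modeled on `org.apache.hadoop.util.ApplicationClassLoader` semantics, not a copy of it): when the aux service class itself appears in the system-classes list, loading is delegated to the parent classloader, which does not have the customized jar on its classpath, so the lookup ends in `ClassNotFoundException`.

```java
import java.util.List;

public class SystemClassSketch {
    // Simplified matching: a class is a "system class" if it equals an entry
    // or lives under an entry treated as a package prefix. The real
    // ApplicationClassLoader matching is richer (wildcards, negations).
    public static boolean isSystemClass(String name, List<String> systemClasses) {
        for (String entry : systemClasses) {
            if (name.equals(entry) || name.startsWith(entry + ".")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Per the log above, the system-classes list contains the aux service
        // class itself, so it is delegated to the parent loader and the
        // customized classpath [file:/tmp/spark-3.1.2-yarn-shuffle.jar]
        // is never consulted.
        List<String> system = List.of("org.apache.spark.network.yarn.YarnShuffleService");
        System.out.println(
            isSystemClass("org.apache.spark.network.yarn.YarnShuffleService", system));
    }
}
```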
[jira] [Commented] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files
[ https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17450654#comment-17450654 ] Prabhu Joseph commented on YARN-10975: -- I have committed the patch to trunk. Thanks [~Sushma_28]. > EntityGroupFSTimelineStore#ActiveLogParser parses already processed files > -- > > Key: YARN-10975 > URL: https://issues.apache.org/jira/browse/YARN-10975 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Ravuri Sushma sree >Priority: Major > Labels: pull-request-available > Attachments: YARN-10975.001.patch, YARN-10975.002.patch, > YARN-10975.003.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > EntityGroupFSTimelineStore#ActiveLogParser parses already processed files > again and again even though there is no change in the file. This leads to > unnecessary load on the DFS where summary files reside and on the Timeline Store where > timeline entities are present. > {code} > 2021-10-10 19:20:43,940 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 275 msec > 2021-10-10 19:21:44,079 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 341 msec > 2021-10-10 19:22:44,065 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 335 msec > 2021-10-10 19:23:44,038 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 370 msec > 2021-10-10 19:24:44,087 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 317 msec > 2021-10-10 19:25:44,092 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 336 msec > {code}
[jira] [Commented] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files
[ https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449430#comment-17449430 ] Prabhu Joseph commented on YARN-10975: -- Thanks [~Sushma_28] for the patch. The patch looks good to me. Will commit it shortly. > EntityGroupFSTimelineStore#ActiveLogParser parses already processed files > -- > > Key: YARN-10975 > URL: https://issues.apache.org/jira/browse/YARN-10975 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: YARN-10975.001.patch, YARN-10975.002.patch, > YARN-10975.003.patch
[jira] [Commented] (YARN-7982) Do ACLs check while retrieving entity-types per application
[ https://issues.apache.org/jira/browse/YARN-7982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444007#comment-17444007 ] Prabhu Joseph commented on YARN-7982: - [~dmmkr] Yes, sure. Can you provide a patch for 3.2? I will help review and commit it. Thanks. > Do ACLs check while retrieving entity-types per application > --- > > Key: YARN-7982 > URL: https://issues.apache.org/jira/browse/YARN-7982 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-7982-001.patch, YARN-7982-002.patch, > YARN-7982-003.patch, YARN-7982-004.patch > > > The REST end point {{/apps/$appid/entity-types}} retrieves all the entity-types > for a given application. This needs to be guarded with an ACL check. > {code} > [yarn@yarn-ats-3 ~]$ curl > "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1552297011473_0002?user.name=ambari-qa1; > {"exception":"ForbiddenException","message":"java.lang.Exception: User > ambari-qa1 is not allowed to read TimelineService V2 > data.","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} > [yarn@yarn-ats-3 ~]$ curl > "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1552297011473_0002/entity-types?user.name=ambari-qa1; > ["YARN_APPLICATION_ATTEMPT","YARN_CONTAINER"] > {code}
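The guard the issue asks for can be sketched as below. This is a hedged sketch with hypothetical names, not the actual timeline reader classes: the entity-types endpoint should run the same read-access check that already protects the `/apps/$appid` endpoint before returning anything.

```java
import java.util.List;

public class AclGuardSketch {
    // Stand-in for the real ACL evaluation (admin ACLs, reader whitelist, etc.).
    public static boolean isAllowedToRead(String user, String allowedUser) {
        return user.equals(allowedUser);
    }

    // Mirrors the curl example above: a user that gets ForbiddenException on
    // /apps/$appid must also be rejected on /apps/$appid/entity-types.
    public static List<String> getEntityTypes(String user, String allowedUser) {
        if (!isAllowedToRead(user, allowedUser)) {
            throw new SecurityException(
                "User " + user + " is not allowed to read TimelineService V2 data.");
        }
        return List.of("YARN_APPLICATION_ATTEMPT", "YARN_CONTAINER");
    }

    public static void main(String[] args) {
        System.out.println(getEntityTypes("yarn", "yarn"));
        try {
            getEntityTypes("ambari-qa1", "yarn");
        } catch (SecurityException e) {
            System.out.println("Forbidden: " + e.getMessage());
        }
    }
}
```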
[jira] [Commented] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files
[ https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441137#comment-17441137 ] Prabhu Joseph commented on YARN-10975: -- The main issue is in the code below, which always returns 0. Every time the file is processed, the offset is set to 0, so the next pass starts processing from 0 again. {code} bytesParsed = parser.getCurrentLocation().getCharOffset() + 1; LOG.trace("Parser now at offset {}", bytesParsed); {code} > EntityGroupFSTimelineStore#ActiveLogParser parses already processed files > -- > > Key: YARN-10975 > URL: https://issues.apache.org/jira/browse/YARN-10975 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Ravuri Sushma sree >Priority: Major
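The intended behavior (skip a summary log whose recorded parse offset has not moved) can be sketched as below. Names here are hypothetical illustrations, not the actual `EntityGroupFSTimelineStore` API; the point is that once the tracked offset stops resetting to 0, an unchanged file is not re-parsed every cycle.

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetTracker {
    // Last successfully parsed offset per summary-log path.
    private final Map<String, Long> parsedOffsets = new HashMap<>();

    // True only when the file has bytes beyond what was already parsed.
    public boolean needsParsing(String path, long currentLength) {
        return currentLength > parsedOffsets.getOrDefault(path, 0L);
    }

    public void recordParsed(String path, long offset) {
        parsedOffsets.put(path, offset);
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker();
        String log = "summarylog-appattempt_1631559260564_0009_01";
        System.out.println(tracker.needsParsing(log, 1024)); // never seen: parse
        tracker.recordParsed(log, 1024);
        System.out.println(tracker.needsParsing(log, 1024)); // unchanged: skip
        System.out.println(tracker.needsParsing(log, 2048)); // grew: parse tail
    }
}
```

With the bug described in the comment, `recordParsed` was effectively always called with 0, so `needsParsing` returned true on every pass even for unchanged files.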
[jira] [Assigned] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files
[ https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-10975: Assignee: Ravuri Sushma sree (was: Prabhu Joseph) > EntityGroupFSTimelineStore#ActiveLogParser parses already processed files > -- > > Key: YARN-10975 > URL: https://issues.apache.org/jira/browse/YARN-10975 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Ravuri Sushma sree >Priority: Major > > EntityGroupFSTimelineStore#ActiveLogParser parses already processed files > again and again even though there is no change in the file. This leads to > unnecessary load on DFS where summary files resides and Timeline Store where > timeline entities present. > {code} > 2021-10-10 19:20:43,940 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 275 msec > 2021-10-10 19:21:44,079 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 341 msec > 2021-10-10 19:22:44,065 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 335 msec > 2021-10-10 19:23:44,038 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 370 msec > 2021-10-10 19:24:44,087 INFO timeline.LogInfo - Parsed 6 entities from > 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 317 msec > 2021-10-10 19:25:44,092 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 336 msec > {code}
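One way to avoid the repeated parsing described in this issue is to remember per-file metadata and skip files whose metadata has not moved since the last pass. The sketch below is illustrative only; the class and method names are invented here and are not the actual EntityGroupFSTimelineStore API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: remember the last-seen length of each summary log and reparse
// only when the file has grown. This avoids re-reading unchanged files
// from DFS and re-writing the same entities into the Timeline Store.
public class ParsedFileTracker {
    private final Map<String, Long> lastSeenLength = new HashMap<>();

    /** Returns true only when the file is new or changed since the last parse. */
    public boolean needsParse(String path, long currentLength) {
        Long previous = lastSeenLength.get(path);
        if (previous != null && previous == currentLength) {
            return false; // unchanged: skip the DFS read and store write
        }
        lastSeenLength.put(path, currentLength);
        return true;
    }
}
```

In the log excerpt above, the same summarylog file is parsed every minute with no change; a tracker like this would parse it once and skip the later passes until the file grows.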
[jira] [Updated] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files
[ https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10975: - Description: EntityGroupFSTimelineStore#ActiveLogParser parses already processed files again and again even though there is no change in the file. This leads to unnecessary load on DFS where summary files resides and Timeline Store where timeline entities present. {code} 2021-10-10 19:20:43,940 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 275 msec 2021-10-10 19:21:44,079 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 341 msec 2021-10-10 19:22:44,065 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 335 msec 2021-10-10 19:23:44,038 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 370 msec 2021-10-10 19:24:44,087 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 317 msec 2021-10-10 19:25:44,092 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 336 msec {code} was: EntityGroupFSTimelineStore#ActiveLogParser parses already processed files again and again. 
This leads to unnecessary load on DFS where summary files resides and Timeline Store where timeline entities present. {code} 2021-10-10 19:20:43,940 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 275 msec 2021-10-10 19:21:44,079 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 341 msec 2021-10-10 19:22:44,065 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 335 msec 2021-10-10 19:23:44,038 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 370 msec 2021-10-10 19:24:44,087 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 317 msec 2021-10-10 19:25:44,092 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 336 msec {code} > EntityGroupFSTimelineStore#ActiveLogParser parses already processed files > -- > > Key: YARN-10975 > URL: https://issues.apache.org/jira/browse/YARN-10975 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > EntityGroupFSTimelineStore#ActiveLogParser parses already processed files > again and again even though there 
is no change in the file. This leads to > unnecessary load on DFS where summary files resides and Timeline Store where > timeline entities present. > {code} > 2021-10-10 19:20:43,940 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 275 msec > 2021-10-10 19:21:44,079 INFO timeline.LogInfo - Parsed 6 entities from > hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 > in 341
[jira] [Created] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files
Prabhu Joseph created YARN-10975: Summary: EntityGroupFSTimelineStore#ActiveLogParser parses already processed files Key: YARN-10975 URL: https://issues.apache.org/jira/browse/YARN-10975 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 3.3.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph EntityGroupFSTimelineStore#ActiveLogParser parses already processed files again and again. This leads to unnecessary load on the DFS where the summary files reside and on the Timeline Store where the timeline entities are kept. {code} 2021-10-10 19:20:43,940 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 275 msec 2021-10-10 19:21:44,079 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 341 msec 2021-10-10 19:22:44,065 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 335 msec 2021-10-10 19:23:44,038 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 370 msec 2021-10-10 19:24:44,087 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 317 msec 2021-10-10 19:25:44,092 INFO timeline.LogInfo - Parsed 6 entities from hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893 in 336 msec {code}
[jira] [Commented] (YARN-10896) RM fail over is not reporting the nodes DECOMMISSIONED
[ https://issues.apache.org/jira/browse/YARN-10896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420718#comment-17420718 ] Prabhu Joseph commented on YARN-10896: -- Thanks [~Sushil-K-S] for the patch. {code} assertEquals(2, rm.getRMContext().getInactiveRMNodes().size()); {code} 1. Why does it return 2? Only one inactive node is present, right? > RM fail over is not reporting the nodes DECOMMISSIONED > --- > > Key: YARN-10896 > URL: https://issues.apache.org/jira/browse/YARN-10896 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Sushil Ks >Assignee: Sushil Ks >Priority: Major > Attachments: YARN-10896.001.patch > > > Whenever we add host entries into the exclude file in order to > DECOMMISSION the NodeManager, we issue the *yarn rmadmin -refreshNodes* > command to transition the nodes from RUNNING to DECOMMISSIONED state. However, > if a failover to the standby resource manager happens and the exclude file has > the list of hosts to be disallowed, then these disallowed nodes are never > seen through the Cluster Metrics on the new active resource manager. > The host entries present in the exclude files are listed > in the Cluster Metrics whenever the resource manager is restarted, i.e. as part of > the service init of *NodeListManager*; however, during failover this info is > lost. Hence this patch tries to set the *DECOMMISSIONED* nodes inside the RM > Context so that they are available through Cluster Metrics whenever we issue the > *yarn rmadmin -refreshNodes* command.
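The patch's idea above can be sketched in miniature: on `refreshNodes`, record every excluded host as DECOMMISSIONED so it survives a failover. The map and state strings below are simplified stand-ins for `RMContext#getInactiveRMNodes`, not the actual patch code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch: when refreshNodes runs on the newly active RM, mark each
// host from the exclude file as DECOMMISSIONED in an inactive-nodes
// map, so cluster metrics still report it after failover.
public class ExcludeListSync {
    public static Map<String, String> recordExcluded(Set<String> excludedHosts) {
        Map<String, String> inactiveNodes = new HashMap<>();
        for (String host : excludedHosts) {
            inactiveNodes.put(host, "DECOMMISSIONED");
        }
        return inactiveNodes;
    }
}
```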
[jira] [Commented] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner
[ https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411333#comment-17411333 ] Prabhu Joseph commented on YARN-10884: -- Thanks [~Swathi Chandrashekar] for the patch. Have committed it in trunk. > EntityGroupFSTimelineStore fails to parse log files which has empty owner > - > > Key: YARN-10884 > URL: https://issues.apache.org/jira/browse/YARN-10884 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: SwathiChandrashekar >Priority: Major > Fix For: 3.3.1 > > Time Spent: 1h > Remaining Estimate: 0h > > Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] - > Wasb FileSystem sets owner as empty during append operation. > ATS1.5 fails to read such files with below error > {code:java} > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > The store fetches the file's owner to check ACLs. When the ACL check is disabled, > this is not required. I suggest falling back to an anonymous user when the owner is > empty: > {code} > if (owner.isEmpty()) { > user = "anonymous"; > } else { > user = owner; > } > {code}
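The suggested fallback can be written as a small helper; `resolveUser` is a hypothetical name for illustration, not the actual LogInfo method.

```java
// Sketch of the suggested fallback: when the file owner is empty (as Wasb
// may report after an append, per HADOOP-17848), fall back to a placeholder
// user instead of failing UserGroupInformation.createRemoteUser with
// "Null user". The method name resolveUser is hypothetical.
public class OwnerFallback {
    public static String resolveUser(String owner) {
        // Guard against both null and empty owner strings.
        if (owner == null || owner.isEmpty()) {
            return "anonymous";
        }
        return owner;
    }
}
```

This keeps parsing working when ACL checks are disabled, where the owner's identity is never actually consulted.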
[jira] [Updated] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner
[ https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10884: - Labels: (was: pull-request-available) > EntityGroupFSTimelineStore fails to parse log files which has empty owner > - > > Key: YARN-10884 > URL: https://issues.apache.org/jira/browse/YARN-10884 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: SwathiChandrashekar >Priority: Major > Fix For: 3.3.1 > > Time Spent: 1h > Remaining Estimate: 0h > > Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] - > Wasb FileSystem sets owner as empty during append operation. > ATS1.5 fails to read such files with below error > {code:java} > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > The store fetches the file's owner to check ACLs. When the ACL check is disabled, > this is not required. I suggest falling back to an anonymous user when the owner is > empty: > {code} > if (owner.isEmpty()) { > user = "anonymous"; > } else { > user = owner; > } > {code}
[jira] [Assigned] (YARN-10933) Building Timeline Delegation Token Service Text is not needed on unsecure clusters
[ https://issues.apache.org/jira/browse/YARN-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-10933: Assignee: SwathiChandrashekar (was: Prabhu Joseph) > Building Timeline Delegation Token Service Text is not needed on unsecure > clusters > -- > > Key: YARN-10933 > URL: https://issues.apache.org/jira/browse/YARN-10933 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: SwathiChandrashekar >Priority: Major > > Yarn client commands fail with the error below when the ATS1.5 TimelineServer is not > reachable. On an unsecure cluster, building the Timeline token service is not required. > {code:java} > java.lang.IllegalArgumentException: java.net.UnknownHostException: > timelineserver-0 > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445) > at > org.apache.hadoop.yarn.util.timeline.TimelineUtils.buildTimelineTokenService(TimelineUtils.java:163) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:183) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at org.apache.hadoop.yarn.client.cli.YarnCLI.<init>(YarnCLI.java:47) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.<init>(ApplicationCLI.java:65) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:115) > Caused by: java.net.UnknownHostException: timelineserver-0 > {code}
[jira] [Created] (YARN-10933) Building Timeline Delegation Token Service Text is not needed on unsecure clusters
Prabhu Joseph created YARN-10933: Summary: Building Timeline Delegation Token Service Text is not needed on unsecure clusters Key: YARN-10933 URL: https://issues.apache.org/jira/browse/YARN-10933 Project: Hadoop YARN Issue Type: Bug Components: timelineclient Affects Versions: 3.3.1 Reporter: Prabhu Joseph Assignee: Prabhu Joseph Yarn client commands fail with the error below when the ATS1.5 TimelineServer is not reachable. On an unsecure cluster, building the Timeline token service is not required. {code:java} java.lang.IllegalArgumentException: java.net.UnknownHostException: timelineserver-0 at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445) at org.apache.hadoop.yarn.util.timeline.TimelineUtils.buildTimelineTokenService(TimelineUtils.java:163) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:183) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.client.cli.YarnCLI.<init>(YarnCLI.java:47) at org.apache.hadoop.yarn.client.cli.ApplicationCLI.<init>(ApplicationCLI.java:65) at org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:115) Caused by: java.net.UnknownHostException: timelineserver-0 {code}
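The proposed guard can be sketched as follows. The `securityEnabled` flag stands in for a check like `UserGroupInformation.isSecurityEnabled()`, and the body is illustrative, not the actual YarnClientImpl code.

```java
// Sketch: skip building the timeline delegation token service text when
// security is off. Delegation tokens are only needed on secure clusters,
// so on an unsecure cluster an unreachable timeline server should not be
// able to fail client initialization with UnknownHostException.
public class TimelineTokenGuard {
    public static String buildTokenService(boolean securityEnabled,
                                           String timelineServerAddress) {
        if (!securityEnabled) {
            return null; // no token service text needed; avoids host resolution
        }
        // Placeholder for SecurityUtil.buildTokenService(...), which resolves
        // the timeline server host and fails when it is unreachable.
        return timelineServerAddress;
    }
}
```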
[jira] [Resolved] (YARN-10873) Graceful Decommission ignores launched containers and gets deactivated before timeout
[ https://issues.apache.org/jira/browse/YARN-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-10873. -- Resolution: Fixed > Graceful Decommission ignores launched containers and gets deactivated before > timeout > - > > Key: YARN-10873 > URL: https://issues.apache.org/jira/browse/YARN-10873 > Project: Hadoop YARN > Issue Type: Bug > Components: RM >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: Srinivas S T >Priority: Major > Fix For: 3.4.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > A gracefully decommissioning node gets deactivated before the timeout even though > there are launched containers. > On a status update from a node that is DECOMMISSIONING, the RM transitions the > node to DECOMMISSIONED before the timeout if there are no running applications. > These running applications are derived from the container statuses reported by the > NodeManager. We have observed containers being launched on the NodeManager while, > at the same time, the ResourceManager forcefully decommissions the node. > This affects Livy interactive jobs, which support only one application > attempt. > I suggest checking FiCaSchedulerNode to identify whether there are any launched > containers and deciding whether to forcefully decommission or not. > {code} > public static class StatusUpdateWhenHealthyTransition implements > MultipleArcTransition<RMNodeImpl, RMNodeEvent> { > @Override > public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) { > .
> if (isNodeDecommissioning) { > List keepAliveApps = statusEvent.getKeepAliveAppIds(); > if (rmNode.runningApplications.isEmpty() && > (keepAliveApps == null || keepAliveApps.isEmpty())) { > RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED); > return NodeState.DECOMMISSIONED; > } > } > {code} > *ResourceManager Logs:* > {code} > 2021-06-16 08:45:04,140 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: > Launching masterappattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,141 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up container Container: [ContainerId: container_1623830067124_0382_01_01, > AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: > 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM > appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,141 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: > Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,141 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: > Creating password for appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,154 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done > launching container Container: [ContainerId: > container_1623830067124_0382_01_01, AllocationRequestId: 0, Version: 0, > NodeId: node1:34753, NodeHttpAddress: > 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM > appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,776 INFO > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully > decommission node node1:34753 with state RUNNING > 2021-06-16 
08:45:04,776 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node > node1:34753 in DECOMMISSIONING. > 2021-06-16 08:45:04,776 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 > Node Transitioned from RUNNING to DECOMMISSIONING > 2021-06-16 08:45:05,131 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating > Node node1:34753 as it is now DECOMMISSIONED > 2021-06-16 08:45:05,131 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 > Node Transitioned from DECOMMISSIONING to DECOMMISSIONED > 2021-06-16 08:45:05,131 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1623830067124_0382_01_01 Container Transitioned from ACQUIRED > to KILLED > {code}
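The extra condition the issue proposes can be sketched as a pure predicate; the parameters below are illustrative placeholders, and a real fix would consult FiCaSchedulerNode for the launched-container count rather than take it as an argument.

```java
import java.util.List;

// Sketch of the suggested check: before deactivating a DECOMMISSIONING
// node, also require that no containers have been launched on it. The
// first two conditions mirror the quoted StatusUpdateWhenHealthyTransition
// logic; the launchedContainers guard is the addition this issue proposes.
public class DecommissionGuard {
    public static boolean canDeactivate(List<String> runningApps,
                                        List<String> keepAliveApps,
                                        int launchedContainers) {
        return runningApps.isEmpty()
            && (keepAliveApps == null || keepAliveApps.isEmpty())
            && launchedContainers == 0; // extra guard for just-launched AMs
    }
}
```

In the log above, the AM container was launched at 08:45:04,154 but the node was deactivated at 08:45:05,131 because no running application had been reported yet; the launched-container guard would have kept the node in DECOMMISSIONING.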
[jira] [Updated] (YARN-10873) Graceful Decommission ignores launched containers and gets deactivated before timeout
[ https://issues.apache.org/jira/browse/YARN-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10873: - Labels: (was: pull-request-available) > Graceful Decommission ignores launched containers and gets deactivated before > timeout > - > > Key: YARN-10873 > URL: https://issues.apache.org/jira/browse/YARN-10873 > Project: Hadoop YARN > Issue Type: Bug > Components: RM >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: Srinivas S T >Priority: Major > Fix For: 3.4.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Graceful Decommission of a Node gets deactivated before timeout even though > there are launched containers. > On Status update from Node which is in Decommissioning, RM transitions the > node to DECOMMISSIONED before timeout if there are no running applications. > These running applications are added from the Container Statuses from > NodeManager. We have observed Containers are launched at NodeManager and at > the same time ResourceManager forcefully decommissions the node. > This affects the Livy Interactive jobs which supports only one application > attempt. > Will suggest to check FicaSchedulerNode to identify if there are any launched > containers and determine whether to forcefully decommission or not. > {code} > public static class StatusUpdateWhenHealthyTransition implements > MultipleArcTransition { > @Override > public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) { > . 
> if (isNodeDecommissioning) { > List keepAliveApps = statusEvent.getKeepAliveAppIds(); > if (rmNode.runningApplications.isEmpty() && > (keepAliveApps == null || keepAliveApps.isEmpty())) { > RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED); > return NodeState.DECOMMISSIONED; > } > } > {code} > *ResourceManager Logs:* > {code} > 2021-06-16 08:45:04,140 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: > Launching masterappattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,141 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up container Container: [ContainerId: container_1623830067124_0382_01_01, > AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: > 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM > appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,141 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: > Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,141 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: > Creating password for appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,154 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done > launching container Container: [ContainerId: > container_1623830067124_0382_01_01, AllocationRequestId: 0, Version: 0, > NodeId: node1:34753, NodeHttpAddress: > 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM > appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,776 INFO > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully > decommission node node1:34753 with state RUNNING > 2021-06-16 
08:45:04,776 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node > node1:34753 in DECOMMISSIONING. > 2021-06-16 08:45:04,776 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 > Node Transitioned from RUNNING to DECOMMISSIONING > 2021-06-16 08:45:05,131 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating > Node node1:34753 as it is now DECOMMISSIONED > 2021-06-16 08:45:05,131 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 > Node Transitioned from DECOMMISSIONING to DECOMMISSIONED > 2021-06-16 08:45:05,131 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1623830067124_0382_01_01 Container Transitioned from ACQUIRED > to KILLED > {code}
[jira] [Updated] (YARN-10873) Graceful Decommission ignores launched containers and gets deactivated before timeout
[ https://issues.apache.org/jira/browse/YARN-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10873: - Fix Version/s: 3.4.0 > Graceful Decommission ignores launched containers and gets deactivated before > timeout > - > > Key: YARN-10873 > URL: https://issues.apache.org/jira/browse/YARN-10873 > Project: Hadoop YARN > Issue Type: Bug > Components: RM >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: Srinivas S T >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Graceful Decommission of a Node gets deactivated before timeout even though > there are launched containers. > On Status update from Node which is in Decommissioning, RM transitions the > node to DECOMMISSIONED before timeout if there are no running applications. > These running applications are added from the Container Statuses from > NodeManager. We have observed Containers are launched at NodeManager and at > the same time ResourceManager forcefully decommissions the node. > This affects the Livy Interactive jobs which supports only one application > attempt. > Will suggest to check FicaSchedulerNode to identify if there are any launched > containers and determine whether to forcefully decommission or not. > {code} > public static class StatusUpdateWhenHealthyTransition implements > MultipleArcTransition { > @Override > public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) { > . 
> if (isNodeDecommissioning) { > List keepAliveApps = statusEvent.getKeepAliveAppIds(); > if (rmNode.runningApplications.isEmpty() && > (keepAliveApps == null || keepAliveApps.isEmpty())) { > RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED); > return NodeState.DECOMMISSIONED; > } > } > {code} > *ResourceManager Logs:* > {code} > 2021-06-16 08:45:04,140 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: > Launching masterappattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,141 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up container Container: [ContainerId: container_1623830067124_0382_01_01, > AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: > 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM > appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,141 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: > Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,141 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: > Creating password for appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,154 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done > launching container Container: [ContainerId: > container_1623830067124_0382_01_01, AllocationRequestId: 0, Version: 0, > NodeId: node1:34753, NodeHttpAddress: > 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM > appattempt_1623830067124_0382_01 > 2021-06-16 08:45:04,776 INFO > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully > decommission node node1:34753 with state RUNNING > 2021-06-16 
08:45:04,776 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node > node1:34753 in DECOMMISSIONING. > 2021-06-16 08:45:04,776 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 > Node Transitioned from RUNNING to DECOMMISSIONING > 2021-06-16 08:45:05,131 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating > Node node1:34753 as it is now DECOMMISSIONED > 2021-06-16 08:45:05,131 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 > Node Transitioned from DECOMMISSIONING to DECOMMISSIONED > 2021-06-16 08:45:05,131 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1623830067124_0382_01_01 Container Transitioned from ACQUIRED > to KILLED > {code}
[jira] [Assigned] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner
[ https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-10884: Assignee: SwathiChandrashekar (was: Prabhu Joseph) > EntityGroupFSTimelineStore fails to parse log files which has empty owner > - > > Key: YARN-10884 > URL: https://issues.apache.org/jira/browse/YARN-10884 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: SwathiChandrashekar >Priority: Major > > Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] - > Wasb FileSystem sets owner as empty during append operation. > ATS1.5 fails to read such files with below error > {code:java} > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > It gets ownership of the file to check ACL. In case of disabled ACL check, > this is not required. Will suggest to add anonymous user in case of empty > user. > {code} > if (owner.isEmpty()) { > user = "anonymous"; > } else { > user = owner; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
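The suggested guard can be isolated as a small helper so that UserGroupInformation.createRemoteUser() is never handed an empty name. A minimal sketch — the method name `resolveUser` is illustrative, not from the patch:

```java
public class OwnerFallback {
    // Fall back to a placeholder user when the file owner is empty (as Wasb
    // leaves it after append, per HADOOP-17848). This is only safe when the
    // ACL check is disabled, since the real owner is unknown.
    static String resolveUser(String owner) {
        return (owner == null || owner.isEmpty()) ? "anonymous" : owner;
    }

    public static void main(String[] args) {
        System.out.println(resolveUser(""));      // anonymous
        System.out.println(resolveUser("hive"));  // hive
    }
}
```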
[jira] [Updated] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner
[ https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10884: - Description: Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] - Wasb FileSystem sets owner as empty during append operation. ATS1.5 fails to read such files with below error {code:java} java.lang.IllegalArgumentException: Null user at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271) at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258) at org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141) at org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} It gets ownership of the file to check ACL. In case of disabled ACL check, this is not required. Will suggest to add anonymous user in case of empty user. 
{code} if (owner.isEmpty()) { user = "anonymous"; } else { user = owner; } {code} was: Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] Hadoop NativeAzureFileSystem append removes ownership set on the file - ASF JIRA (apache.org)] - Wasb FileSystem sets owner as empty during append operation. ATS1.5 fails to read such files with below error {code:java} java.lang.IllegalArgumentException: Null user at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271) at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258) at org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141) at org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} It gets ownership of the file to check ACL. In case of disabled ACL check, this is not required. Will suggest to add anonymous user in case of empty user. 
{code} if (owner.isEmpty()) { user = "anonymous"; } else { user = owner; } {code} > EntityGroupFSTimelineStore fails to parse log files which has empty owner > - > > Key: YARN-10884 > URL: https://issues.apache.org/jira/browse/YARN-10884 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] - > Wasb FileSystem sets owner as empty during append operation. > ATS1.5 fails to read such files with below error > {code:java} >
[jira] [Updated] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner
[ https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10884: - Description: Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] Hadoop NativeAzureFileSystem append removes ownership set on the file - ASF JIRA (apache.org)] - Wasb FileSystem sets owner as empty during append operation. ATS1.5 fails to read such files with below error {code:java} java.lang.IllegalArgumentException: Null user at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271) at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258) at org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141) at org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} It gets ownership of the file to check ACL. In case of disabled ACL check, this is not required. Will suggest to add anonymous user in case of empty user. 
{code} if (owner.isEmpty()) { user = "anonymous"; } else { user = owner; } {code} was: Due to [HADOOP-17848|[HADOOP-17848] Hadoop NativeAzureFileSystem append removes ownership set on the file - ASF JIRA (apache.org)] - Wasb FileSystem sets owner as empty during append operation. ATS1.5 fails to read such files with below error {code:java} java.lang.IllegalArgumentException: Null user at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271) at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258) at org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141) at org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} It gets ownership of the file to check ACL. In case of disabled ACL check, this is not required. Will suggest to add anonymous user in case of empty user. 
{code} if (owner.isEmpty()) { user = "anonymous"; } else { user = owner; } {code} > EntityGroupFSTimelineStore fails to parse log files which has empty owner > - > > Key: YARN-10884 > URL: https://issues.apache.org/jira/browse/YARN-10884 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] > Hadoop NativeAzureFileSystem append removes ownership set on the file - ASF > JIRA (apache.org)] -
[jira] [Created] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner
Prabhu Joseph created YARN-10884: Summary: EntityGroupFSTimelineStore fails to parse log files which has empty owner Key: YARN-10884 URL: https://issues.apache.org/jira/browse/YARN-10884 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 3.3.1 Reporter: Prabhu Joseph Assignee: Prabhu Joseph Due to [HADOOP-17848|[HADOOP-17848] Hadoop NativeAzureFileSystem append removes ownership set on the file - ASF JIRA (apache.org)] - Wasb FileSystem sets owner as empty during append operation. ATS1.5 fails to read such files with below error {code:java} java.lang.IllegalArgumentException: Null user at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271) at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258) at org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141) at org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} It gets ownership of the file to 
check ACL. In case of disabled ACL check, this is not required. Will suggest to add anonymous user in case of empty user. {code} if (owner.isEmpty()) { user = "anonymous"; } else { user = owner; } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390099#comment-17390099 ] Prabhu Joseph commented on YARN-10848: -- Hi [~pbacsko], IMO this breaks the existing behavior of DefaultResourceCalculator. DefaultResourceCalculator is useful when the workloads are not CPU intensive, like MapReduce and Tez, and users need not worry about CPU configuration here. >> IMO whether a container "fits in" or not should depend on both values DominantResourceCalculator provides this support; users can configure it if they want to consider both memory and CPU resources in scheduling. > Vcore allocation problem with DefaultResourceCalculator > --- > > Key: YARN-10848 > URL: https://issues.apache.org/jira/browse/YARN-10848 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Minni Mittal >Priority: Major > Labels: pull-request-available > Attachments: TestTooManyContainers.java > > Time Spent: 20m > Remaining Estimate: 0h > > If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating > containers even if we run out of vcores. > CS checks the available resources at two places. 
The first check is > {{CapacityScheduler.allocateContainerOnSingleNode()}}: > {noformat} > if (calculator.computeAvailableContainers(Resources > .add(node.getUnallocatedResource(), > node.getTotalKillableResources()), > minimumAllocation) <= 0) { > LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient " > + "available or preemptible resource for minimum allocation"); > {noformat} > The second, which is more important, is located in > {{RegularContainerAllocator.assignContainer()}}: > {noformat} > if (!Resources.fitsIn(rc, capability, totalResource)) { > LOG.warn("Node : " + node.getNodeID() > + " does not have sufficient resource for ask : " + pendingAsk > + " node total capability : " + node.getTotalResource()); > // Skip this locality request > ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation( > activitiesManager, node, application, schedulerKey, > ActivityDiagnosticConstant. > NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST > + getResourceDiagnostics(capability, totalResource), > ActivityLevel.NODE); > return ContainerAllocation.LOCALITY_SKIPPED; > } > {noformat} > Here, {{rc}} is the resource calculator instance, the other two values are: > {noformat} > Resource capability = pendingAsk.getPerAllocationResource(); > Resource available = node.getUnallocatedResource(); > {noformat} > There is a repro unit test attatched to this case, which can demonstrate the > problem. The root cause is that we pass the resource calculator to > {{Resource.fitsIn()}}. Instead, we should use an overridden version, just > like in {{FSAppAttempt.assignContainer()}}: > {noformat} >// Can we allocate a container on this node? 
> if (Resources.fitsIn(capability, available)) { > // Inform the application of the new container for this request > RMContainer allocatedContainer = > allocate(type, node, schedulerKey, pendingAsk, > reservedContainer); > {noformat} > In CS, if we switch to DominantResourceCalculator OR use > {{Resources.fitsIn()}} without the calculator in > {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit > test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
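The root cause described above hinges on the difference between the two fitsIn variants: the calculator-aware overload delegates the comparison to the configured calculator (and DefaultResourceCalculator compares only memory), while the calculator-free overload compares every resource dimension. A simplified stand-alone sketch of that difference — the Resource record and method names below are illustrative stand-ins, not the real org.apache.hadoop.yarn.util.resource types:

```java
public class FitsInDemo {
    record Resource(long memory, int vcores) {}

    // Mirrors Resources.fitsIn(rc, capability, available) under
    // DefaultResourceCalculator: only the single memory dimension is compared.
    static boolean fitsInMemoryOnly(Resource ask, Resource avail) {
        return ask.memory() <= avail.memory();
    }

    // Mirrors the calculator-free Resources.fitsIn(smaller, bigger):
    // every resource dimension must fit.
    static boolean fitsInAllDimensions(Resource ask, Resource avail) {
        return ask.memory() <= avail.memory() && ask.vcores() <= avail.vcores();
    }

    public static void main(String[] args) {
        Resource ask = new Resource(1024, 1);
        Resource node = new Resource(8192, 0); // out of vcores, memory left

        System.out.println(fitsInMemoryOnly(ask, node));     // true  -> CS keeps allocating
        System.out.println(fitsInAllDimensions(ask, node));  // false -> allocation skipped
    }
}
```

This is why switching to DominantResourceCalculator, or dropping the calculator argument as FSAppAttempt does, makes the failing unit test pass: both paths end up checking vcores as well as memory.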
[jira] [Commented] (YARN-10854) Support marking inactive node as untracked without configured include path
[ https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390079#comment-17390079 ] Prabhu Joseph commented on YARN-10854: -- Thanks [~Tao Yang] for the patch. This is very useful to us as well; otherwise we would have ended up adding a lot of code changes to manage the include node list. > Support marking inactive node as untracked without configured include path > -- > > Key: YARN-10854 > URL: https://issues.apache.org/jira/browse/YARN-10854 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-10854.001.patch, YARN-10854.002.patch, > YARN-10854.003.patch > > > Currently, inactive nodes which have been decommissioned/shutdown/lost for a > while (past the expiration time defined via > {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by > default) and which exist in neither the include nor the exclude file can be > marked as untracked nodes and removed from RM state (YARN-4311). It's very > useful when auto-scaling is enabled in an elastic cloud environment, since it > avoids unlimited growth of inactive nodes (mostly decommissioned nodes). > But this only works when the include path is configured, which does not match > most of our cloud environments: they configure no whitelist of nodes, so that > node auto-scaling can be controlled easily without further security > requirements. > So I propose to support marking inactive nodes as untracked without a > configured include path; to stay compatible with former versions, we can add > a switch config for this. > Any thoughts/suggestions/feedback are welcome! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
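The untracked-node rule the proposal extends — inactive longer than the configured timeout and present in neither the include nor the exclude list — can be sketched as a predicate. This is an illustrative sketch, not the actual NodesListManager code; the `allowWithoutIncludePath` flag stands in for the proposed (unnamed here) switch config:

```java
import java.util.Set;

public class UntrackedNodeCheck {
    // Default of yarn.resourcemanager.node-removal-untracked.timeout-ms (60 seconds).
    static final long UNTRACKED_TIMEOUT_MS = 60_000L;

    // A node becomes untracked when it has been inactive past the timeout and
    // appears in neither the include nor the exclude list. The proposal extends
    // this to clusters where no include path is configured at all, behind a switch.
    static boolean isUntracked(String host, boolean includePathConfigured,
                               Set<String> includes, Set<String> excludes,
                               long inactiveMs, boolean allowWithoutIncludePath) {
        if (inactiveMs < UNTRACKED_TIMEOUT_MS) {
            return false; // not inactive long enough yet
        }
        if (!includePathConfigured) {
            // Pre-YARN-10854 behavior: never untracked. With the switch on,
            // only the exclude list is consulted.
            return allowWithoutIncludePath && !excludes.contains(host);
        }
        return !includes.contains(host) && !excludes.contains(host);
    }

    public static void main(String[] args) {
        // No include path configured: previously never untracked, now governed by the switch.
        System.out.println(isUntracked("node1", false, Set.of(), Set.of(), 120_000, true));  // true
        System.out.println(isUntracked("node1", false, Set.of(), Set.of(), 120_000, false)); // false
    }
}
```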
[jira] [Created] (YARN-10873) Graceful Decommission ignores launched containers and gets deactivated before timeout
Prabhu Joseph created YARN-10873: Summary: Graceful Decommission ignores launched containers and gets deactivated before timeout Key: YARN-10873 URL: https://issues.apache.org/jira/browse/YARN-10873 Project: Hadoop YARN Issue Type: Bug Components: RM Affects Versions: 3.3.1 Reporter: Prabhu Joseph Assignee: Srinivas S T A gracefully decommissioning node gets deactivated before the timeout even though there are launched containers. On a status update from a node in DECOMMISSIONING, the RM transitions the node to DECOMMISSIONED before the timeout if there are no running applications. These running applications are derived from the container statuses reported by the NodeManager. We have observed containers being launched at the NodeManager while, at the same time, the ResourceManager forcefully decommissions the node. This affects Livy interactive jobs, which support only one application attempt. I suggest checking FiCaSchedulerNode for launched containers to determine whether to forcefully decommission or not. {code} public static class StatusUpdateWhenHealthyTransition implements MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> { @Override public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) { ... 
if (isNodeDecommissioning) { List keepAliveApps = statusEvent.getKeepAliveAppIds(); if (rmNode.runningApplications.isEmpty() && (keepAliveApps == null || keepAliveApps.isEmpty())) { RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED); return NodeState.DECOMMISSIONED; } } {code} *ResourceManager Logs:* {code} 2021-06-16 08:45:04,140 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1623830067124_0382_01 2021-06-16 08:45:04,141 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1623830067124_0382_01_01, AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM appattempt_1623830067124_0382_01 2021-06-16 08:45:04,141 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_01 2021-06-16 08:45:04,141 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1623830067124_0382_01 2021-06-16 08:45:04,154 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1623830067124_0382_01_01, AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM appattempt_1623830067124_0382_01 2021-06-16 08:45:04,776 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node1:34753 with state RUNNING 2021-06-16 08:45:04,776 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
node1:34753 in DECOMMISSIONING. 2021-06-16 08:45:04,776 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 Node Transitioned from RUNNING to DECOMMISSIONING 2021-06-16 08:45:05,131 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node node1:34753 as it is now DECOMMISSIONED 2021-06-16 08:45:05,131 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 Node Transitioned from DECOMMISSIONING to DECOMMISSIONED 2021-06-16 08:45:05,131 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1623830067124_0382_01_01 Container Transitioned from ACQUIRED to KILLED {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
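The suggested fix — consulting the scheduler's view of the node before deactivating — can be sketched as follows. This is a hedged sketch, not the actual transition code; `SchedulerNodeView` is a stand-in for FiCaSchedulerNode, whose container count the real check would consult:

```java
import java.util.List;

public class DecommissionGuard {
    interface SchedulerNodeView {
        // Containers launched/allocated on this node as seen by the scheduler.
        int getNumContainers();
    }

    // Deactivate a DECOMMISSIONING node only when the RMNode sees no running
    // applications, no keep-alive apps, AND the scheduler sees no containers.
    // The extra scheduler check closes the window where a container (e.g. an AM
    // in ACQUIRED state) has been allocated but its status has not yet reached
    // the RMNode via a NodeManager heartbeat.
    static boolean shouldDeactivate(List<String> runningApps,
                                    List<String> keepAliveApps,
                                    SchedulerNodeView schedulerNode) {
        return runningApps.isEmpty()
            && (keepAliveApps == null || keepAliveApps.isEmpty())
            && schedulerNode.getNumContainers() == 0;
    }

    public static void main(String[] args) {
        // An AM container was just launched but no status has reached the RMNode yet.
        SchedulerNodeView nodeWithAm = () -> 1;
        System.out.println(shouldDeactivate(List.of(), null, nodeWithAm)); // false: keep waiting

        SchedulerNodeView idleNode = () -> 0;
        System.out.println(shouldDeactivate(List.of(), null, idleNode));   // true: safe to deactivate
    }
}
```

In the log excerpt above, the AM container was launched at 08:45:04,154 and the node was deactivated at 08:45:05,131; a guard like this would have kept the node in DECOMMISSIONING until the container finished or the timeout expired.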
[jira] [Updated] (YARN-10871) Aborted AM is considered as App Failure when user sets MaxAttempts as 1
[ https://issues.apache.org/jira/browse/YARN-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10871: - Description: When an AM Container is ABORTED due to Node Decommission, the AppAttempt failure is not counted. But if user sets number of attempts as 1, then YARN considers the ABORTED AM as a failure. {code} int numberOfFailure = app.getNumFailedAppAttempts(); if (app.maxAppAttempts == 1) { // If the user explicitly set the attempts to 1 then there are likely // correctness issues if the AM restarts for any reason. LOG.info("Max app attempts is 1 for " + app.applicationId + ", preventing further attempts."); numberOfFailure = app.maxAppAttempts; } {code} Livy sets the number of attempts as 1 since it's Rpc Server does not yet support multiple connections for the same registered app. But in our case AM is ABORTED before even the AM starts (AM was in ACQUIRED state) Usually users won't decommission the node where the Container is in RUNNING state (where the session is established). But the decommission can happen on nodes where the container is in ACQUIRED or ALLOCATED state. Will suggest to expose an config where user can decide whether to consider this as a failure or not. was: When an AM Container is ABORTED due to Node Decommission, the AppAttempt failure is not counted. But if user sets number of attempts as 1, then YARN considers the ABORTED AM as a failure. {code} int numberOfFailure = app.getNumFailedAppAttempts(); if (app.maxAppAttempts == 1) { // If the user explicitly set the attempts to 1 then there are likely // correctness issues if the AM restarts for any reason. LOG.info("Max app attempts is 1 for " + app.applicationId + ", preventing further attempts."); numberOfFailure = app.maxAppAttempts; } {code} Livy sets the number of attempts as 1 since it's Rpc Server does not yet support multiple connections for the same registered app. 
But in our case AM is ABORTED before even the AM starts (AM was in ACAUIRED state) Usually users won't decommission the node where the Container is in RUNNING state (where the session is established). But the decommission can happen on nodes where the container is in ACQUIRED or ALLOCATED state. Will suggest to expose an config where user can decide whether to consider this as a failure or not. > Aborted AM is considered as App Failure when user sets MaxAttempts as 1 > --- > > Key: YARN-10871 > URL: https://issues.apache.org/jira/browse/YARN-10871 > Project: Hadoop YARN > Issue Type: Bug > Components: RM >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: Srinivas S T >Priority: Major > > When an AM Container is ABORTED due to Node Decommission, the AppAttempt > failure is not counted. But if user sets number of attempts as 1, then YARN > considers the ABORTED AM as a failure. > {code} > int numberOfFailure = app.getNumFailedAppAttempts(); > if (app.maxAppAttempts == 1) { > // If the user explicitly set the attempts to 1 then there are likely > // correctness issues if the AM restarts for any reason. > LOG.info("Max app attempts is 1 for " + app.applicationId > + ", preventing further attempts."); > numberOfFailure = app.maxAppAttempts; > } > {code} > Livy sets the number of attempts as 1 since it's Rpc Server does not yet > support multiple connections for the same registered app. But in our case AM > is ABORTED before even the AM starts (AM was in ACQUIRED state) > Usually users won't decommission the node where the Container is in RUNNING > state (where the session is established). But the decommission can happen on > nodes where the container is in ACQUIRED or ALLOCATED state. > Will suggest to expose an config where user can decide whether to consider > this as a failure or not. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
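The proposed knob — not counting an ABORTED AM as a failure even when maxAppAttempts is 1 — could be sketched like this. The config flag and helper are hypothetical illustrations; the real logic lives in the RMAppImpl failure-counting path quoted above:

```java
public class AmFailureCount {
    enum ExitStatus { ABORTED, FAILED, SUCCEEDED }

    // With maxAppAttempts == 1, YARN currently pins numberOfFailure to 1 for any
    // attempt exit, preventing further attempts. The sketched flag lets an
    // ABORTED exit (e.g. node decommission before the AM ran) be exempted.
    static int numberOfFailure(int failedAttempts, int maxAppAttempts,
                               ExitStatus lastExit, boolean countAbortedAsFailure) {
        if (maxAppAttempts == 1) {
            if (lastExit == ExitStatus.ABORTED && !countAbortedAsFailure) {
                return failedAttempts; // do not force a failure for an aborted AM
            }
            return maxAppAttempts; // existing behavior: prevent further attempts
        }
        return failedAttempts;
    }

    public static void main(String[] args) {
        System.out.println(numberOfFailure(0, 1, ExitStatus.ABORTED, false)); // 0
        System.out.println(numberOfFailure(0, 1, ExitStatus.ABORTED, true));  // 1
    }
}
```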
[jira] [Assigned] (YARN-10871) Aborted AM is considered as App Failure when user sets MaxAttempts as 1
[ https://issues.apache.org/jira/browse/YARN-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-10871: Assignee: Srinivas S T (was: Prabhu Joseph) > Aborted AM is considered as App Failure when user sets MaxAttempts as 1 > --- > > Key: YARN-10871 > URL: https://issues.apache.org/jira/browse/YARN-10871 > Project: Hadoop YARN > Issue Type: Bug > Components: RM >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: Srinivas S T >Priority: Major > > When an AM Container is ABORTED due to Node Decommission, the AppAttempt > failure is not counted. But if user sets number of attempts as 1, then YARN > considers the ABORTED AM as a failure. > {code} > int numberOfFailure = app.getNumFailedAppAttempts(); > if (app.maxAppAttempts == 1) { > // If the user explicitly set the attempts to 1 then there are likely > // correctness issues if the AM restarts for any reason. > LOG.info("Max app attempts is 1 for " + app.applicationId > + ", preventing further attempts."); > numberOfFailure = app.maxAppAttempts; > } > {code} > Livy sets the number of attempts as 1 since it's Rpc Server does not yet > support multiple connections for the same registered app. But in our case AM > is ABORTED before even the AM starts (AM was in ACAUIRED state) > Usually users won't decommission the node where the Container is in RUNNING > state (where the session is established). But the decommission can happen on > nodes where the container is in ACQUIRED or ALLOCATED state. > Will suggest to expose an config where user can decide whether to consider > this as a failure or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10871) Aborted AM is considered as App Failure when user sets MaxAttempts as 1
Prabhu Joseph created YARN-10871: Summary: Aborted AM is considered as App Failure when user sets MaxAttempts as 1 Key: YARN-10871 URL: https://issues.apache.org/jira/browse/YARN-10871 Project: Hadoop YARN Issue Type: Bug Components: RM Affects Versions: 3.3.1 Reporter: Prabhu Joseph Assignee: Prabhu Joseph When an AM Container is ABORTED due to Node Decommission, the AppAttempt failure is not counted. But if user sets number of attempts as 1, then YARN considers the ABORTED AM as a failure. {code} int numberOfFailure = app.getNumFailedAppAttempts(); if (app.maxAppAttempts == 1) { // If the user explicitly set the attempts to 1 then there are likely // correctness issues if the AM restarts for any reason. LOG.info("Max app attempts is 1 for " + app.applicationId + ", preventing further attempts."); numberOfFailure = app.maxAppAttempts; } {code} Livy sets the number of attempts as 1 since it's Rpc Server does not yet support multiple connections for the same registered app. But in our case AM is ABORTED before even the AM starts (AM was in ACAUIRED state) Usually users won't decommission the node where the Container is in RUNNING state (where the session is established). But the decommission can happen on nodes where the container is in ACQUIRED or ALLOCATED state. Will suggest to expose an config where user can decide whether to consider this as a failure or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10857) YarnClient Caching Addresses
[ https://issues.apache.org/jira/browse/YARN-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-10857: Assignee: Prabhu Joseph > YarnClient Caching Addresses > > > Key: YARN-10857 > URL: https://issues.apache.org/jira/browse/YARN-10857 > Project: Hadoop YARN > Issue Type: Improvement > Components: client, yarn >Reporter: Steve Suh >Assignee: Prabhu Joseph >Priority: Minor > > We have noticed that when the YarnClient is initialized and used, it is not > very resilient when dns or /etc/hosts is modified in the following scenario: > Take for instance the following (and reproducable) sequence of events that > can occur on a service that instantiates and uses YarnClient. > - Yarn has rm HA enabled (*yarn.resourcemanager.ha.enabled* is *true*) and > there are two rms (rm1 and rm2). > - *yarn.client.failover-proxy-provider* is set to > *org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider* > 1)rm2 is currently the active rm > 2)/etc/hosts (or dns) is missing host information for rm2 > 3)A service is started and it initializes the YarnClient at startup. > 4)At some point in time after YarnClient is done initializing, /etc/hosts > is updated and contains host information for rm2 > 5)Yarn is queried, for instance calling *yarnclient.getApplications()* > 6)All YarnClient attempts to communicate with rm2 fail with > UnknownHostExceptions, even though /etc/hosts now contains host information > for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
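The reported symptom is consistent with resolving the RM address once at proxy-creation time: a java.net.InetSocketAddress built while the host is unknown stays permanently unresolved. A minimal stand-alone illustration of the underlying Java behavior (this is not YarnClient code; `rm2.invalid` is a deliberately unresolvable hostname):

```java
import java.net.InetSocketAddress;

public class StaleAddressDemo {
    public static void main(String[] args) {
        // Resolution happens once, in the constructor. If the host is unknown
        // at that moment, the address object is permanently marked unresolved...
        InetSocketAddress addr = new InetSocketAddress("rm2.invalid", 8032);
        System.out.println(addr.isUnresolved()); // true

        // ...so recovering after /etc/hosts or DNS is fixed requires building a
        // fresh InetSocketAddress rather than reusing the cached one.
        InetSocketAddress retry = new InetSocketAddress(addr.getHostName(), addr.getPort());
        System.out.println(retry.getPort()); // 8032
    }
}
```

A client that caches the proxy (and, inside it, the resolved address) at initialization will therefore keep throwing UnknownHostException for rm2 until the address is re-created.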
[jira] [Updated] (YARN-10866) RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby host info is missing
[ https://issues.apache.org/jira/browse/YARN-10866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10866: - Description: RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby host info is missing in /etc/hosts {code} 2021-07-19 13:07:18,892 ERROR [Listener at 0.0.0.0/45951] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.IllegalArgumentException: java.net.UnknownHostException: resourcemanager-1.resourcemanager at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:466) at org.apache.hadoop.yarn.client.ClientRMProxy.getTokenService(ClientRMProxy.java:154) at org.apache.hadoop.yarn.client.ClientRMProxy.getAMRMTokenService(ClientRMProxy.java:139) at org.apache.hadoop.yarn.client.ClientRMProxy.setAMRMTokenService(ClientRMProxy.java:81) at org.apache.hadoop.yarn.client.ClientRMProxy.getRMAddress(ClientRMProxy.java:100) at org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal(ConfiguredRMFailoverProxyProvider.java:76) at org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75) at org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:194) at org.apache.hadoop.yarn.client.RMProxy.newProxyInstance(RMProxy.java:130) at org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:103) {code} was: RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby host info is missing {code} 2021-07-19 13:07:18,892 ERROR [Listener at 0.0.0.0/45951] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.IllegalArgumentException: java.net.UnknownHostException: resourcemanager-1.resourcemanager at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:466) at org.apache.hadoop.yarn.client.ClientRMProxy.getTokenService(ClientRMProxy.java:154) at 
org.apache.hadoop.yarn.client.ClientRMProxy.getAMRMTokenService(ClientRMProxy.java:139) at org.apache.hadoop.yarn.client.ClientRMProxy.setAMRMTokenService(ClientRMProxy.java:81) at org.apache.hadoop.yarn.client.ClientRMProxy.getRMAddress(ClientRMProxy.java:100) at org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal(ConfiguredRMFailoverProxyProvider.java:76) at org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75) at org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:194) at org.apache.hadoop.yarn.client.RMProxy.newProxyInstance(RMProxy.java:130) at org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:103) {code} > RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if > standby host info is missing > --- > > Key: YARN-10866 > URL: https://issues.apache.org/jira/browse/YARN-10866 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.3.1 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if > standby host info is missing in /etc/hosts > {code} > 2021-07-19 13:07:18,892 ERROR [Listener at 0.0.0.0/45951] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster > java.lang.IllegalArgumentException: java.net.UnknownHostException: > resourcemanager-1.resourcemanager > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:466) > at > org.apache.hadoop.yarn.client.ClientRMProxy.getTokenService(ClientRMProxy.java:154) > at > org.apache.hadoop.yarn.client.ClientRMProxy.getAMRMTokenService(ClientRMProxy.java:139) > at > org.apache.hadoop.yarn.client.ClientRMProxy.setAMRMTokenService(ClientRMProxy.java:81) > at > org.apache.hadoop.yarn.client.ClientRMProxy.getRMAddress(ClientRMProxy.java:100) > at > 
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal(ConfiguredRMFailoverProxyProvider.java:76) > at > org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75) > at > org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:194) > at > org.apache.hadoop.yarn.client.RMProxy.newProxyInstance(RMProxy.java:130) > at > org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:103) > {code} -- This message was sent by Atlassian Jira
[jira] [Created] (YARN-10866) RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby host info is missing
Prabhu Joseph created YARN-10866: Summary: RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby host info is missing Key: YARN-10866 URL: https://issues.apache.org/jira/browse/YARN-10866 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 3.3.1 Reporter: Prabhu Joseph Assignee: Prabhu Joseph RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby host info is missing {code} 2021-07-19 13:07:18,892 ERROR [Listener at 0.0.0.0/45951] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.IllegalArgumentException: java.net.UnknownHostException: resourcemanager-1.resourcemanager at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:466) at org.apache.hadoop.yarn.client.ClientRMProxy.getTokenService(ClientRMProxy.java:154) at org.apache.hadoop.yarn.client.ClientRMProxy.getAMRMTokenService(ClientRMProxy.java:139) at org.apache.hadoop.yarn.client.ClientRMProxy.setAMRMTokenService(ClientRMProxy.java:81) at org.apache.hadoop.yarn.client.ClientRMProxy.getRMAddress(ClientRMProxy.java:100) at org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal(ConfiguredRMFailoverProxyProvider.java:76) at org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75) at org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:194) at org.apache.hadoop.yarn.client.RMProxy.newProxyInstance(RMProxy.java:130) at org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:103) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10840) yarn app status fails with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/YARN-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10840: - Attachment: YARN-10840-001.patch > yarn app status fails with ArrayIndexOutOfBoundsException > -- > > Key: YARN-10840 > URL: https://issues.apache.org/jira/browse/YARN-10840 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Abhinaba Sarkar >Assignee: Abhinaba Sarkar >Priority: Major > Attachments: YARN-10840-001.patch > > > Array index out of bounds exception in the ClientAMService.getStatus() - > {code:java} > 2021-07-04 20:00:24,488 [IPC Server handler 0 on 25347] INFO ipc.Server - > IPC Server handler 0 on 25347, call Call#163 Retry#0 > org.apache.hadoop.yarn.service.ClientAMProtocol.getStatus from 10.0.0.10:42446 > org.codehaus.jackson.map.JsonMappingException: Index: 11, Size: 11 (through > reference chain: > org.apache.hadoop.yarn.service.api.records.Service["components"]->java.util.ArrayList[0]->org.apache.hadoop.yarn.service.api.records.Component["containers"]->java.util.ArrayList[11]) > at > org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:218) > at > org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:197) > at > org.codehaus.jackson.map.ser.std.SerializerBase.wrapAndThrow(SerializerBase.java:166) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:127) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71) > at > org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86) > at > org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446) > at > org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150) > at > 
org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:122) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71) > at > org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86) > at > org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446) > at > org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150) > at > org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112) > at > org.codehaus.jackson.map.ser.StdSerializerProvider._serializeValue(StdSerializerProvider.java:610) > at > org.codehaus.jackson.map.ser.StdSerializerProvider.serializeValue(StdSerializerProvider.java:256) > at > org.codehaus.jackson.map.ObjectMapper._configAndWriteValue(ObjectMapper.java:2575) > at > org.codehaus.jackson.map.ObjectMapper.writeValueAsString(ObjectMapper.java:2097) > at > org.apache.hadoop.yarn.service.utils.JsonSerDeser.toJson(JsonSerDeser.java:249) > at > org.apache.hadoop.yarn.service.ClientAMService.getStatus(ClientAMService.java:125) > at > org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.getStatus(ClientAMProtocolPBServiceImpl.java:59) > at > org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:6159) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > Caused by: java.lang.IndexOutOfBoundsException: Index: 11, Size: 11 > at java.util.ArrayList.rangeCheck(ArrayList.java:659) > at java.util.ArrayList.get(ArrayList.java:435) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:106) > ... 27 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (YARN-10840) yarn app status fails with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/YARN-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10840: - Attachment: (was: YARN-10840-001.patch) > yarn app status fails with ArrayIndexOutOfBoundsException > -- > > Key: YARN-10840 > URL: https://issues.apache.org/jira/browse/YARN-10840 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Abhinaba Sarkar >Assignee: Abhinaba Sarkar >Priority: Major > Attachments: YARN-10840-001.patch > > > Array index out of bounds exception in the ClientAMService.getStatus() - > {code:java} > 2021-07-04 20:00:24,488 [IPC Server handler 0 on 25347] INFO ipc.Server - > IPC Server handler 0 on 25347, call Call#163 Retry#0 > org.apache.hadoop.yarn.service.ClientAMProtocol.getStatus from 10.0.0.10:42446 > org.codehaus.jackson.map.JsonMappingException: Index: 11, Size: 11 (through > reference chain: > org.apache.hadoop.yarn.service.api.records.Service["components"]->java.util.ArrayList[0]->org.apache.hadoop.yarn.service.api.records.Component["containers"]->java.util.ArrayList[11]) > at > org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:218) > at > org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:197) > at > org.codehaus.jackson.map.ser.std.SerializerBase.wrapAndThrow(SerializerBase.java:166) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:127) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71) > at > org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86) > at > org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446) > at > org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150) > at > 
org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:122) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71) > at > org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86) > at > org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446) > at > org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150) > at > org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112) > at > org.codehaus.jackson.map.ser.StdSerializerProvider._serializeValue(StdSerializerProvider.java:610) > at > org.codehaus.jackson.map.ser.StdSerializerProvider.serializeValue(StdSerializerProvider.java:256) > at > org.codehaus.jackson.map.ObjectMapper._configAndWriteValue(ObjectMapper.java:2575) > at > org.codehaus.jackson.map.ObjectMapper.writeValueAsString(ObjectMapper.java:2097) > at > org.apache.hadoop.yarn.service.utils.JsonSerDeser.toJson(JsonSerDeser.java:249) > at > org.apache.hadoop.yarn.service.ClientAMService.getStatus(ClientAMService.java:125) > at > org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.getStatus(ClientAMProtocolPBServiceImpl.java:59) > at > org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:6159) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > Caused by: java.lang.IndexOutOfBoundsException: Index: 11, Size: 11 > at java.util.ArrayList.rangeCheck(ArrayList.java:659) > at java.util.ArrayList.get(ArrayList.java:435) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:106) > ... 27 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (YARN-10840) yarn app status fails with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/YARN-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10840: - Summary: yarn app status fails with ArrayIndexOutOfBoundsException (was: yarn app status fails with arrayindexoutofbounsexception) > yarn app status fails with ArrayIndexOutOfBoundsException > -- > > Key: YARN-10840 > URL: https://issues.apache.org/jira/browse/YARN-10840 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Abhinaba Sarkar >Assignee: Abhinaba Sarkar >Priority: Major > Attachments: YARN-10840-001.patch > > > Array index out of bounds exception in the ClientAMService.getStatus() - > {code:java} > 2021-07-04 20:00:24,488 [IPC Server handler 0 on 25347] INFO ipc.Server - > IPC Server handler 0 on 25347, call Call#163 Retry#0 > org.apache.hadoop.yarn.service.ClientAMProtocol.getStatus from 10.0.0.10:42446 > org.codehaus.jackson.map.JsonMappingException: Index: 11, Size: 11 (through > reference chain: > org.apache.hadoop.yarn.service.api.records.Service["components"]->java.util.ArrayList[0]->org.apache.hadoop.yarn.service.api.records.Component["containers"]->java.util.ArrayList[11]) > at > org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:218) > at > org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:197) > at > org.codehaus.jackson.map.ser.std.SerializerBase.wrapAndThrow(SerializerBase.java:166) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:127) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71) > at > org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86) > at > org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446) > at > 
org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150) > at > org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:122) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71) > at > org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86) > at > org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446) > at > org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150) > at > org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112) > at > org.codehaus.jackson.map.ser.StdSerializerProvider._serializeValue(StdSerializerProvider.java:610) > at > org.codehaus.jackson.map.ser.StdSerializerProvider.serializeValue(StdSerializerProvider.java:256) > at > org.codehaus.jackson.map.ObjectMapper._configAndWriteValue(ObjectMapper.java:2575) > at > org.codehaus.jackson.map.ObjectMapper.writeValueAsString(ObjectMapper.java:2097) > at > org.apache.hadoop.yarn.service.utils.JsonSerDeser.toJson(JsonSerDeser.java:249) > at > org.apache.hadoop.yarn.service.ClientAMService.getStatus(ClientAMService.java:125) > at > org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.getStatus(ClientAMProtocolPBServiceImpl.java:59) > at > org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:6159) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) > at 
java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > Caused by: java.lang.IndexOutOfBoundsException: Index: 11, Size: 11 > at java.util.ArrayList.rangeCheck(ArrayList.java:659) > at java.util.ArrayList.get(ArrayList.java:435) > at > org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:106) > ... 27 more >
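The "Index: 11, Size: 11" pattern in the trace above means the containers list shrank between Jackson's `size()` check and its indexed `get(i)`: the serializer walked a live list that another thread was mutating. A common fix for this class of bug (a sketch under assumed names, not necessarily the committed YARN-10840 patch) is to serialize an immutable snapshot taken under the owner's lock.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: hand the serializer a snapshot copy instead of the live list.
// walkByIndex mimics how Jackson's IndexedListSerializer iterates -- size
// captured up front, then get(i) -- which throws if the list shrinks
// concurrently. Names are illustrative, not ClientAMService internals.
final class SnapshotSerialization {
    static int walkByIndex(List<String> list) {
        int visited = 0;
        int size = list.size();     // size captured before iteration
        for (int i = 0; i < size; i++) {
            list.get(i);            // IndexOutOfBounds if the list shrank
            visited++;
        }
        return visited;
    }

    // Snapshot under the owner's lock, then serialize the copy safely.
    static List<String> snapshot(List<String> live) {
        synchronized (live) {
            return new ArrayList<>(live);
        }
    }
}
```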
[jira] [Resolved] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-10820. -- Resolution: Fixed > Make GetClusterNodesRequestPBImpl thread safe > - > > Key: YARN-10820 > URL: https://issues.apache.org/jira/browse/YARN-10820 > Project: Hadoop YARN > Issue Type: Task > Components: client >Affects Versions: 3.1.0, 3.3.0 >Reporter: Prabhu Joseph >Assignee: SwathiChandrashekar >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > yarn node list intermittently fails with below > {code:java} > 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on > [resourcemanager-1], so propagating back to caller. > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 > at java.util.ArrayList.add(ArrayList.java:465) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception: java.lang.UnsupportedOperationException on > [resourcemanager-0], so propagating back to caller. 
> Exception in thread "main" java.lang.UnsupportedOperationException > at > java.util.Collections$UnmodifiableCollection.add(Collections.java:1057) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >
[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370200#comment-17370200 ] Prabhu Joseph commented on YARN-10820: -- Thanks [~Swathi Chandrashekar] for the patch. Have committed in 3.4.0. > Make GetClusterNodesRequestPBImpl thread safe > - > > Key: YARN-10820 > URL: https://issues.apache.org/jira/browse/YARN-10820 > Project: Hadoop YARN > Issue Type: Task > Components: client >Affects Versions: 3.1.0, 3.3.0 >Reporter: Prabhu Joseph >Assignee: SwathiChandrashekar >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > yarn node list intermittently fails with below > {code:java} > 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on > [resourcemanager-1], so propagating back to caller. > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 > at java.util.ArrayList.add(ArrayList.java:465) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception: java.lang.UnsupportedOperationException on > [resourcemanager-0], so propagating back to caller. 
> Exception in thread "main" java.lang.UnsupportedOperationException > at > java.util.Collections$UnmodifiableCollection.add(Collections.java:1057) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
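Both traces above (`ArrayIndexOutOfBoundsException` from `ArrayList.add` and `UnsupportedOperationException` from an unmodifiable collection) are classic symptoms of two threads racing through the same `getProto()`/`mergeLocalToProto()` path. The usual PBImpl-style remedy, sketched below with simplified, illustrative names (a stand-in, not the actual `GetClusterNodesRequestPBImpl` patch), is to synchronize every method that reads or rebuilds the cached proto.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of making a PBImpl-like record thread safe: all access to the
// local fields and the cached proto goes through synchronized methods, so
// two callers cannot interleave the merge-and-build sequence. The String
// "proto" stands in for the real protobuf message.
final class ThreadSafeRequest {
    private List<String> nodeStates = new ArrayList<>();
    private String proto;           // cached built proto, rebuilt on demand

    synchronized void setNodeStates(List<String> states) {
        this.nodeStates = new ArrayList<>(states);
        this.proto = null;          // invalidate the cached proto
    }

    synchronized String getProto() {
        if (proto == null) {
            // Merge local fields into the builder atomically.
            proto = String.join(",", nodeStates);
        }
        return proto;
    }
}
```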
[jira] [Updated] (YARN-10810) YARN Native Service Definition is not backward compatible
[ https://issues.apache.org/jira/browse/YARN-10810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10810: - Attachment: YARN-10810-001.patch > YARN Native Service Definition is not backward compatible > - > > Key: YARN-10810 > URL: https://issues.apache.org/jira/browse/YARN-10810 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10810-001.patch > > > YARN Native Service Spec PlacementScope value was *NODE* in hadoop-3.1 > version but got changed to *node* in hadoop-3.3. This causes older Service > Client (hadoop-3.1) to fail while getting the status from new Api Server > (hadoop-3.3). This looks caused due to jackson upgrade. > > {code:java} > 2021-06-07 06:08:40,095 INFO utils.ServiceApiUtil: Loading service definition > from hdfs://prabhuhdfs/user/root/.yarn/services/llap0/llap0.json > 2021-06-07 06:08:40,798 ERROR utils.JsonSerDeser: Exception while parsing > json : org.codehaus.jackson.map.JsonMappingException: Can not construct > instance of org.apache.hadoop.yarn.service.api.records.PlacementScope from > String value 'node': value not one of declared Enum instance names > at [Source: java.io.StringReader@72c927f1; line: 27, column: 33] (through > reference chain: > org.apache.hadoop.yarn.service.api.records.Service["components"]->org.apache.hadoop.yarn.service.api.records.Component["placement_policy"]->org.apache.hadoop.yarn.service.api.records.PlacementPolicy["constraints"]->org.apache.hadoop.yarn.service.api.records.PlacementConstraint["scope"]) > "placement_policy" : { > "constraints" : [ { > "name" : null, > "type" : "ANTI_AFFINITY", > "scope" : "node", > "target_tags" : [ "llap" ], > "node_attributes" : { }, > "node_partitions" : [ ], > "min_cardinality" : null, > "max_cardinality" : null > } ] > },org.codehaus.jackson.map.JsonMappingException: Can not construct > instance 
of org.apache.hadoop.yarn.service.api.records.PlacementScope from > String value 'node': value not one of declared Enum instance names > at [Source: java.io.StringReader@72c927f1; line: 27, column: 33] (through > reference chain: > org.apache.hadoop.yarn.service.api.records.Service["components"]->org.apache.hadoop.yarn.service.api.records.Component["placement_policy"]->org.apache.hadoop.yarn.service.api.records.PlacementPolicy["constraints"]->org.apache.hadoop.yarn.service.api.records.PlacementConstraint["scope"]) > at > org.codehaus.jackson.map.JsonMappingException.from(JsonMappingException.java:163) > at > org.codehaus.jackson.map.deser.StdDeserializationContext.weirdStringException(StdDeserializationContext.java:243) > at > org.codehaus.jackson.map.deser.std.EnumDeserializer.deserialize(EnumDeserializer.java:80) > at > org.codehaus.jackson.map.deser.std.EnumDeserializer.deserialize(EnumDeserializer.java:23) > at > org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299) > at > org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414) > at > org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697) > at > org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580) > at > org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:217) > at > org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:194) > at > org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:30) > at > org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299) > at > org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414) > at > 
org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697) > at > org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580) > at > org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299) > at > org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414) > at > org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697) > at >
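The enum-case mismatch above can be sketched with a small, self-contained example. The names mirror (but are not) the YARN source: a placement-scope enum whose lookup tolerates both the hadoop-3.1 serialized form *NODE* and the hadoop-3.3 form *node*, which is the kind of tolerant parsing a backward-compatible fix needs.

```java
// Illustrative sketch only, not the actual YARN patch: a scope enum whose
// lookup accepts both the old uppercase ("NODE") and new lowercase ("node")
// serialized forms, so specs written by either version still parse.
public class PlacementScopeCompat {

    public enum Scope {
        NODE("node"),
        RACK("rack");

        private final String value;

        Scope(String value) {
            this.value = value;
        }

        /** Serialized form emitted by newer releases (lowercase). */
        public String toValue() {
            return value;
        }

        /** Case-insensitive lookup: "NODE" and "node" map to the same constant. */
        public static Scope fromValue(String text) {
            for (Scope s : values()) {
                if (s.value.equalsIgnoreCase(text)) {
                    return s;
                }
            }
            throw new IllegalArgumentException("Unknown scope: " + text);
        }
    }

    public static void main(String[] args) {
        // Both the hadoop-3.1 and hadoop-3.3 spellings resolve to one constant.
        System.out.println(Scope.fromValue("NODE") == Scope.fromValue("node")); // true
    }
}
```

With Jackson specifically, the same effect can be had by registering a custom deserializer or enabling a case-insensitive enum-reading feature, but the tolerant `fromValue` lookup keeps the compatibility logic in the record class itself.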
[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362724#comment-17362724 ] Prabhu Joseph commented on YARN-10820: -- Hi [~bibinchundatt], Could you please add [~Swathi Chandrashekar] as a contributor. Thanks. > Make GetClusterNodesRequestPBImpl thread safe > - > > Key: YARN-10820 > URL: https://issues.apache.org/jira/browse/YARN-10820 > Project: Hadoop YARN > Issue Type: Task > Components: client >Affects Versions: 3.1.0, 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > yarn node list intermittently fails with the below error > {code:java} > 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on > [resourcemanager-1], so propagating back to caller. > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 > at java.util.ArrayList.add(ArrayList.java:465) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at >
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception: java.lang.UnsupportedOperationException on > [resourcemanager-0], so propagating back to caller. 
> Exception in thread "main" java.lang.UnsupportedOperationException > at > java.util.Collections$UnmodifiableCollection.add(Collections.java:1057) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at >
[jira] [Created] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe
Prabhu Joseph created YARN-10820: Summary: Make GetClusterNodesRequestPBImpl thread safe Key: YARN-10820 URL: https://issues.apache.org/jira/browse/YARN-10820 Project: Hadoop YARN Issue Type: Task Components: client Affects Versions: 3.3.0, 3.1.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph yarn node list intermittently fails with the below error {code:java} 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on [resourcemanager-1], so propagating back to caller. Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 at java.util.ArrayList.add(ArrayList.java:465) at org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: Invocation returned exception: java.lang.UnsupportedOperationException on [resourcemanager-0], so propagating back to caller. 
Exception in thread "main" java.lang.UnsupportedOperationException at java.util.Collections$UnmodifiableCollection.add(Collections.java:1057) at org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at
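The two stack traces land in the same place: several request-hedging threads call getProto() at once, and the lazy merge of local state into the shared protobuf builder is not atomic. A minimal sketch of the fix direction, using a stand-in class rather than the real PBImpl, guards the lazy-merge path with the instance lock:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in sketch for the race in GetClusterNodesRequestPBImpl: everything
// here is illustrative, not the actual YARN patch. Marking the accessors
// that touch the lazily merged state as synchronized makes the record safe
// to share across the hedging proxy's threads.
public class ThreadSafePbSketch {

    static class ClusterNodesRequest {
        private List<String> nodeStates = new ArrayList<>();
        private List<String> proto;      // stands in for the built proto message
        private boolean viaProto = false;

        public synchronized void setNodeStates(List<String> states) {
            this.nodeStates = new ArrayList<>(states);
            this.viaProto = false;       // local state is now newer than proto
        }

        // Without synchronized, two threads can interleave inside this block
        // and corrupt the shared list -- the ArrayIndexOutOfBoundsException /
        // UnsupportedOperationException seen in the stack traces above.
        public synchronized List<String> getProto() {
            if (!viaProto) {
                proto = Collections.unmodifiableList(new ArrayList<>(nodeStates));
                viaProto = true;
            }
            return proto;
        }
    }

    public static void main(String[] args) throws Exception {
        ClusterNodesRequest req = new ClusterNodesRequest();
        req.setNodeStates(List.of("RUNNING", "NEW"));
        // Hammer getProto() from several threads, as the hedging proxy does.
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(req::getProto);
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println(req.getProto());
    }
}
```

An alternative is to build the proto eagerly on every setter, but synchronizing the lazy merge keeps the common read path cheap once the proto is built.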
[jira] [Commented] (YARN-10792) Set Completed AppAttempt LogsLink to Log Server Url
[ https://issues.apache.org/jira/browse/YARN-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359448#comment-17359448 ] Prabhu Joseph commented on YARN-10792: -- Thanks [~abhinaba.sarkar] for the contribution. I have committed the patch to 3.4. > Set Completed AppAttempt LogsLink to Log Server Url > --- > > Key: YARN-10792 > URL: https://issues.apache.org/jira/browse/YARN-10792 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Abhinaba Sarkar >Priority: Major > Attachments: YARN-10792-001.patch, YARN-10792-002.patch, > YARN-10792-003.patch > > > Completed AppAttempts listed by the YARN UI have a logslink pointing to the > NodeManager containerlogs url. The completed container logs will be under the > aggregated log path, so the NM ContainerLogsPage redirects to the Log Server Url. > On frequent Scale Down, these NMs won't be available, which makes it difficult > to look up appattempt logs of completed apps from the RM UI. Setting the > logslink for Completed AppAttempts to the LogServer url will avoid this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10810) YARN Native Service Definition is not backward compatible
Prabhu Joseph created YARN-10810: Summary: YARN Native Service Definition is not backward compatible Key: YARN-10810 URL: https://issues.apache.org/jira/browse/YARN-10810 Project: Hadoop YARN Issue Type: Bug Components: yarn-native-services Affects Versions: 3.3.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph YARN Native Service Spec PlacementScope value was *NODE* in hadoop-3.1 version but got changed to *node* in hadoop-3.3. This causes an older Service Client (hadoop-3.1) to fail while getting the status from the new Api Server (hadoop-3.3). This looks to be caused by the Jackson upgrade. {code:java} 2021-06-07 06:08:40,095 INFO utils.ServiceApiUtil: Loading service definition from hdfs://prabhuhdfs/user/root/.yarn/services/llap0/llap0.json 2021-06-07 06:08:40,798 ERROR utils.JsonSerDeser: Exception while parsing json : org.codehaus.jackson.map.JsonMappingException: Can not construct instance of org.apache.hadoop.yarn.service.api.records.PlacementScope from String value 'node': value not one of declared Enum instance names at [Source: java.io.StringReader@72c927f1; line: 27, column: 33] (through reference chain: org.apache.hadoop.yarn.service.api.records.Service["components"]->org.apache.hadoop.yarn.service.api.records.Component["placement_policy"]->org.apache.hadoop.yarn.service.api.records.PlacementPolicy["constraints"]->org.apache.hadoop.yarn.service.api.records.PlacementConstraint["scope"]) "placement_policy" : { "constraints" : [ { "name" : null, "type" : "ANTI_AFFINITY", "scope" : "node", "target_tags" : [ "llap" ], "node_attributes" : { }, "node_partitions" : [ ], "min_cardinality" : null, "max_cardinality" : null } ] },org.codehaus.jackson.map.JsonMappingException: Can not construct instance of org.apache.hadoop.yarn.service.api.records.PlacementScope from String value 'node': value not one of declared Enum instance names at [Source: java.io.StringReader@72c927f1; line: 27, column: 33] (through reference chain: 
org.apache.hadoop.yarn.service.api.records.Service["components"]->org.apache.hadoop.yarn.service.api.records.Component["placement_policy"]->org.apache.hadoop.yarn.service.api.records.PlacementPolicy["constraints"]->org.apache.hadoop.yarn.service.api.records.PlacementConstraint["scope"]) at org.codehaus.jackson.map.JsonMappingException.from(JsonMappingException.java:163) at org.codehaus.jackson.map.deser.StdDeserializationContext.weirdStringException(StdDeserializationContext.java:243) at org.codehaus.jackson.map.deser.std.EnumDeserializer.deserialize(EnumDeserializer.java:80) at org.codehaus.jackson.map.deser.std.EnumDeserializer.deserialize(EnumDeserializer.java:23) at org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299) at org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414) at org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697) at org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580) at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:217) at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:194) at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:30) at org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299) at org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414) at org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697) at org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580) at org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299) at 
org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414) at org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697) at org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580) at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:217) at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:194) at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:30) at
[jira] [Commented] (YARN-10792) Set Completed AppAttempt LogsLink to Log Server Url
[ https://issues.apache.org/jira/browse/YARN-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358782#comment-17358782 ] Prabhu Joseph commented on YARN-10792: -- Thanks [~abhinaba.sarkar] for the patch. [^YARN-10792-003.patch] looks fine, +1. Failed test cases are not related to this patch. {{TestCapacitySchedulerSurgicalPreemption#testPriorityPreemptionWithNodeLabels}} and {{TestFSRMStateStore}} are running fine locally. Looks like an intermittent issue; will check whether a Jira exists to track it and, if not, will report one. Will commit this patch by tomorrow EOD if there are no other comments. Thanks. > Set Completed AppAttempt LogsLink to Log Server Url > --- > > Key: YARN-10792 > URL: https://issues.apache.org/jira/browse/YARN-10792 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Abhinaba Sarkar >Priority: Major > Attachments: YARN-10792-001.patch, YARN-10792-002.patch, > YARN-10792-003.patch > > > Completed AppAttempts listed by the YARN UI have a logslink pointing to the > NodeManager containerlogs url. The completed container logs will be under the > aggregated log path, so the NM ContainerLogsPage redirects to the Log Server Url. > On frequent Scale Down, these NMs won't be available, which makes it difficult > to look up appattempt logs of completed apps from the RM UI. Setting the > logslink for Completed AppAttempts to the LogServer url will avoid this issue.
[jira] [Comment Edited] (YARN-10792) Set Completed AppAttempt LogsLink to Log Server Url
[ https://issues.apache.org/jira/browse/YARN-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358782#comment-17358782 ] Prabhu Joseph edited comment on YARN-10792 at 6/7/21, 6:21 PM: --- Thanks [~abhinaba.sarkar] for the patch. [^YARN-10792-003.patch] looks good to me, +1. Failed test cases are not related to this patch. {{TestCapacitySchedulerSurgicalPreemption#testPriorityPreemptionWithNodeLabels}} and {{TestFSRMStateStore}} are running fine locally. Looks like an intermittent issue; will check whether a Jira exists to track it and, if not, will report one. Will commit this patch by tomorrow EOD if there are no other comments. Thanks. was (Author: prabhu joseph): Thanks [~abhinaba.sarkar] for the patch. [^YARN-10792-003.patch] looks fine, +1. Failed test cases are not related to this patch. {{TestCapacitySchedulerSurgicalPreemption#testPriorityPreemptionWithNodeLabels}} and {{TestFSRMStateStore}} are running fine locally. Looks like an intermittent issue; will check whether a Jira exists to track it and, if not, will report one. Will commit this patch by tomorrow EOD if there are no other comments. Thanks. > Set Completed AppAttempt LogsLink to Log Server Url > --- > > Key: YARN-10792 > URL: https://issues.apache.org/jira/browse/YARN-10792 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Abhinaba Sarkar >Priority: Major > Attachments: YARN-10792-001.patch, YARN-10792-002.patch, > YARN-10792-003.patch > > > Completed AppAttempts listed by the YARN UI have a logslink pointing to the > NodeManager containerlogs url. The completed container logs will be under the > aggregated log path, so the NM ContainerLogsPage redirects to the Log Server Url. > On frequent Scale Down, these NMs won't be available, which makes it difficult > to look up appattempt logs of completed apps from the RM UI. Setting the > logslink for Completed AppAttempts to the LogServer url will avoid this issue.
[jira] [Created] (YARN-10792) Set Completed AppAttempt LogsLink to Log Server Url
Prabhu Joseph created YARN-10792: Summary: Set Completed AppAttempt LogsLink to Log Server Url Key: YARN-10792 URL: https://issues.apache.org/jira/browse/YARN-10792 Project: Hadoop YARN Issue Type: Improvement Components: webapp Affects Versions: 3.3.0 Reporter: Prabhu Joseph Assignee: Abhinaba Sarkar Completed AppAttempts listed by the YARN UI have a logslink pointing to the NodeManager containerlogs url. The completed container logs will be under the aggregated log path, so the NM ContainerLogsPage redirects to the Log Server Url. On frequent Scale Down, these NMs won't be available, which makes it difficult to look up appattempt logs of completed apps from the RM UI. Setting the logslink for Completed AppAttempts to the LogServer url will avoid this issue.
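The idea can be sketched as a small link-building helper. The config key yarn.log.server.url is real YARN configuration, but the helper, its parameters, and the path layout below are illustrative assumptions, not the committed patch:

```java
// Hypothetical sketch: pick the NodeManager link for a live attempt, but the
// aggregated-log server link once the attempt has completed (and its NM may
// have been scaled down). The log-server path layout here is an assumption.
public class LogsLinkSketch {

    public static String logsLink(boolean attemptFinished,
                                  String nmHttpAddress,   // e.g. "nm-host:8042"
                                  String logServerUrl,    // value of yarn.log.server.url
                                  String nodeId,          // e.g. "nm-host:45454"
                                  String containerId,
                                  String user) {
        if (!attemptFinished) {
            // Live attempt: the NM web UI still serves the container logs.
            return "http://" + nmHttpAddress + "/node/containerlogs/"
                    + containerId + "/" + user;
        }
        // Completed attempt: aggregated logs survive NM scale-down,
        // so a log-server link stays valid.
        return logServerUrl + "/" + nodeId + "/" + containerId + "/"
                + containerId + "/" + user;
    }

    public static void main(String[] args) {
        System.out.println(logsLink(true, "nm-host:8042",
                "http://jhs-host:19888/jobhistory/logs",
                "nm-host:45454", "container_1_0001_01_000001", "hadoop"));
    }
}
```

The key design point is that the choice depends on attempt state, not on whether the NM happens to be reachable at render time.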
[jira] [Resolved] (YARN-10742) Discard old domain data from RollingLevelDBTimelineStore
[ https://issues.apache.org/jira/browse/YARN-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-10742. -- Resolution: Duplicate > Discard old domain data from RollingLevelDBTimelineStore > > > Key: YARN-10742 > URL: https://issues.apache.org/jira/browse/YARN-10742 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Discard old domain data from domaindb and ownerdb in > RollingLevelDBTimelineStore
[jira] [Resolved] (YARN-10741) Discard old domain data from RollingLevelDBTimelineStore
[ https://issues.apache.org/jira/browse/YARN-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-10741. -- Resolution: Duplicate > Discard old domain data from RollingLevelDBTimelineStore > > > Key: YARN-10741 > URL: https://issues.apache.org/jira/browse/YARN-10741 > Project: Hadoop YARN > Issue Type: Task > Components: timelineserver >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Discard old domain data from domaindb and ownerdb in > RollingLevelDBTimelineStore.
[jira] [Created] (YARN-10742) Discard old domain data from RollingLevelDBTimelineStore
Prabhu Joseph created YARN-10742: Summary: Discard old domain data from RollingLevelDBTimelineStore Key: YARN-10742 URL: https://issues.apache.org/jira/browse/YARN-10742 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 3.3.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph Discard old domain data from domaindb and ownerdb in RollingLevelDBTimelineStore
[jira] [Created] (YARN-10741) Discard old domain data from RollingLevelDBTimelineStore
Prabhu Joseph created YARN-10741: Summary: Discard old domain data from RollingLevelDBTimelineStore Key: YARN-10741 URL: https://issues.apache.org/jira/browse/YARN-10741 Project: Hadoop YARN Issue Type: Task Components: timelineserver Affects Versions: 3.3.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph Discard old domain data from domaindb and ownerdb in RollingLevelDBTimelineStore.