[jira] [Comment Edited] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-26 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737005#comment-17737005
 ] 

Prabhu Joseph edited comment on YARN-11501 at 6/26/23 6:19 AM:
---

>> I am not able to trace ClusterNodeTracker#updateMaxResources -> 
>> RMNodeImpl.getState .. in trunk code . Any private change ??

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change. 
During initial analysis, we were trying to fix the locking at 
_StatusUpdateWhenHealthyTransition.hasScheduledAMContainers_ (which locks 
_RMNode_ first and then _SchedulerNode_), but we found it easier to fix our 
private change (_ClusterNodeTracker.updateMaxResources_ -> 
_RMNodeImpl.getState_, which locks _SchedulerNode_ first and then 
_RMNode_).

This deadlock issue won't happen without the private change, so I will mark 
this invalid.


was (Author: prabhu joseph):
>> I am not able to trace ClusterNodeTracker#updateMaxResources -> 
>> RMNodeImpl.getState .. in trunk code . Any private change ??

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change. 
During initial analysis, we were trying to fix the locking at 
_StatusUpdateWhenHealthyTransition.hasScheduledAMContainers_ (which locks 
_RMNode_ first and then _SchedulerNode_), but we found it easier to fix our 
private change (_ClusterNodeTracker.updateMaxResources_ -> 
_RMNodeImpl.getState_, which locks _SchedulerNode_ first and then 
_RMNode_).

This deadlock issue won't happen without the private change, so I will mark 
this invalid.

 

 

 

 

 

 

 

> ResourceManager deadlock due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> --
>
> Key: YARN-11501
> URL: https://issues.apache.org/jira/browse/YARN-11501
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
>
> We have seen a deadlock in ResourceManager due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holding the lock 
> on RMNode and waiting to lock SchedulerNode, whereas 
> CapacityScheduler#removeNode has taken the lock on SchedulerNode and is 
> waiting to lock RMNode. 
> cc *Vishal Vyas*
>  
> {code:java}
> Found one Java-level deadlock:
> =
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x0007168a7a38, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> Java stack information for the threads listed above:
> ===
> "qtp1401737458-850":
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x000717e6ff60> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>   at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> 

[jira] [Commented] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-26 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737005#comment-17737005
 ] 

Prabhu Joseph commented on YARN-11501:
--

>> I am not able to trace ClusterNodeTracker#updateMaxResources -> 
>> RMNodeImpl.getState .. in trunk code . Any private change ??

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change. 
During initial analysis, we were trying to fix the locking at 
_StatusUpdateWhenHealthyTransition.hasScheduledAMContainers_ (which locks 
_RMNode_ first and then _SchedulerNode_), but we found it easier to fix our 
private change (_ClusterNodeTracker.updateMaxResources_ -> 
_RMNodeImpl.getState_, which locks _SchedulerNode_ first and then 
_RMNode_).

This deadlock issue won't happen without the private change, so I will mark 
this invalid.
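
For reference, the underlying pattern is a plain lock-ordering inversion. Below is a minimal, self-contained sketch (plain Java with illustrative names, not the actual YARN code) that reproduces the same two-thread deadlock the jstack output shows:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal sketch of the lock-ordering inversion (illustrative names only,
// not the actual YARN code): one thread locks RMNode then SchedulerNode,
// the other locks SchedulerNode then RMNode.
public class LockOrderingDeadlockSketch {

  private static final ReentrantReadWriteLock rmNodeLock = new ReentrantReadWriteLock();
  private static final ReentrantReadWriteLock schedulerNodeLock = new ReentrantReadWriteLock();

  public static void main(String[] args) {
    // Like StatusUpdateWhenHealthyTransition.hasScheduledAMContainers:
    // holds the RMNode lock, then needs the SchedulerNode lock.
    Thread rmEventDispatcher = new Thread(() -> {
      rmNodeLock.writeLock().lock();
      try {
        sleep(100);                              // widen the race window
        schedulerNodeLock.readLock().lock();     // blocks: other thread holds the write lock
        schedulerNodeLock.readLock().unlock();
      } finally {
        rmNodeLock.writeLock().unlock();
      }
    }, "RM Event dispatcher");

    // Like CapacityScheduler#removeNode (via the private
    // ClusterNodeTracker.updateMaxResources -> RMNodeImpl.getState path):
    // holds the SchedulerNode lock, then needs the RMNode lock.
    Thread schedulerEventDispatcher = new Thread(() -> {
      schedulerNodeLock.writeLock().lock();
      try {
        sleep(100);
        rmNodeLock.readLock().lock();            // blocks: other thread holds the write lock
        rmNodeLock.readLock().unlock();
      } finally {
        schedulerNodeLock.writeLock().unlock();
      }
    }, "SchedulerEventDispatcher:Event Processor");

    rmEventDispatcher.start();
    schedulerEventDispatcher.start();
    // With the sleeps above, each thread ends up waiting for the lock the
    // other one holds: the same deadlock as in the reported jstack output.
  }

  private static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
  }
}
{code}

The fix in either place is the same: make both paths acquire the two locks in the same order, or avoid holding one lock while acquiring the other.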

 

 

 

 

 

 

 

> ResourceManager deadlock due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> --
>
> Key: YARN-11501
> URL: https://issues.apache.org/jira/browse/YARN-11501
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
>
> We have seen a deadlock in ResourceManager due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holding the lock 
> on RMNode and waiting to lock SchedulerNode, whereas 
> CapacityScheduler#removeNode has taken the lock on SchedulerNode and is 
> waiting to lock RMNode. 
> cc *Vishal Vyas*
>  
> {code:java}
> Found one Java-level deadlock:
> =
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x0007168a7a38, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> Java stack information for the threads listed above:
> ===
> "qtp1401737458-850":
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x000717e6ff60> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>   at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>   at 
> 

[jira] [Resolved] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-26 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-11501.
--
Resolution: Invalid

> ResourceManager deadlock due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> --
>
> Key: YARN-11501
> URL: https://issues.apache.org/jira/browse/YARN-11501
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
>
> We have seen a deadlock in ResourceManager due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holding the lock 
> on RMNode and waiting to lock SchedulerNode, whereas 
> CapacityScheduler#removeNode has taken the lock on SchedulerNode and is 
> waiting to lock RMNode. 
> cc *Vishal Vyas*
>  
> {code:java}
> Found one Java-level deadlock:
> =
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x0007168a7a38, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> Java stack information for the threads listed above:
> ===
> "qtp1401737458-850":
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x000717e6ff60> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>   at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:927)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>   at 
> 

[jira] [Commented] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-05-31 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727866#comment-17727866
 ] 

Prabhu Joseph commented on YARN-11501:
--

[~srinivasst]  Hope you are doing well. If you get some bandwidth, could you 
take a look at this and share some ideas on how to fix it?

> ResourceManager deadlock due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> --
>
> Key: YARN-11501
> URL: https://issues.apache.org/jira/browse/YARN-11501
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
>
> We have seen a deadlock in ResourceManager due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holding the lock 
> on RMNode and waiting to lock SchedulerNode, whereas 
> CapacityScheduler#removeNode has taken the lock on SchedulerNode and is 
> waiting to lock RMNode. 
> cc *Vishal Vyas*
>  
> {code:java}
> Found one Java-level deadlock:
> =
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x0007168a7a38, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> Java stack information for the threads listed above:
> ===
> "qtp1401737458-850":
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x000717e6ff60> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>   at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>   at 
> 

[jira] [Updated] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-05-31 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11501:
-
Priority: Critical  (was: Major)

> ResourceManager deadlock due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> --
>
> Key: YARN-11501
> URL: https://issues.apache.org/jira/browse/YARN-11501
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
>
> We have seen a deadlock in ResourceManager due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holding the lock 
> on RMNode and waiting to lock SchedulerNode, whereas 
> CapacityScheduler#removeNode has taken the lock on SchedulerNode and is 
> waiting to lock RMNode. 
> cc *Vishal Vyas*
>  
> {code:java}
> Found one Java-level deadlock:
> =
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x0007168a7a38, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> Java stack information for the threads listed above:
> ===
> "qtp1401737458-850":
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x000717e6ff60> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>   at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:927)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>   at 
> 

[jira] [Updated] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-05-31 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11501:
-
Description: 
We have seen a deadlock in ResourceManager due to 
StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holding the lock on 
RMNode and waiting to lock SchedulerNode, whereas CapacityScheduler#removeNode 
has taken the lock on SchedulerNode and is waiting to lock RMNode. 

cc *Vishal Vyas*

 
{code:java}
Found one Java-level deadlock:
=
"qtp1401737458-850":
  waiting for ownable synchronizer 0x000717e6ff60, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "RM Event dispatcher"
"RM Event dispatcher":
  waiting for ownable synchronizer 0x0007168a7a38, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
  which is held by "SchedulerEventDispatcher:Event Processor"
"SchedulerEventDispatcher:Event Processor":
  waiting for ownable synchronizer 0x000717e6ff60, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "RM Event dispatcher"

Java stack information for the threads listed above:
===
"qtp1401737458-850":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000717e6ff60> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at 
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
at 
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:927)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:180)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
at 

[jira] [Updated] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-05-31 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11501:
-
Description: 
We have seen a deadlock in ResourceManager due to 
StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holding the lock on 
RMNode and waiting to lock SchedulerNode, whereas CapacityScheduler#removeNode 
has taken the lock on SchedulerNode and is waiting to lock RMNode. 

cc Vishal Vyas

 
{code:java}
Found one Java-level deadlock:
=
"qtp1401737458-850":
  waiting for ownable synchronizer 0x000717e6ff60, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "RM Event dispatcher"
"RM Event dispatcher":
  waiting for ownable synchronizer 0x0007168a7a38, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
  which is held by "SchedulerEventDispatcher:Event Processor"
"SchedulerEventDispatcher:Event Processor":
  waiting for ownable synchronizer 0x000717e6ff60, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "RM Event dispatcher"

Java stack information for the threads listed above:
===
"qtp1401737458-850":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000717e6ff60> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at 
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
at 
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:927)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:180)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
at 

[jira] [Created] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-05-31 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11501:


 Summary: ResourceManager deadlock due to 
StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
 Key: YARN-11501
 URL: https://issues.apache.org/jira/browse/YARN-11501
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.4.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


We have seen a deadlock in ResourceManager due to 
StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holding the lock on 
RMNode and waiting to lock SchedulerNode, whereas CapacityScheduler#removeNode 
has taken the lock on SchedulerNode and is waiting to lock RMNode. 

 
{code:java}

Found one Java-level deadlock:
=
"qtp1401737458-850":
  waiting for ownable synchronizer 0x000717e6ff60, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "RM Event dispatcher"
"RM Event dispatcher":
  waiting for ownable synchronizer 0x0007168a7a38, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
  which is held by "SchedulerEventDispatcher:Event Processor"
"SchedulerEventDispatcher:Event Processor":
  waiting for ownable synchronizer 0x000717e6ff60, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "RM Event dispatcher"

Java stack information for the threads listed above:
===
"qtp1401737458-850":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000717e6ff60> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at 
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
at 
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:927)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:180)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at 

[jira] [Updated] (YARN-11466) Graceful Decommission for Shuffle Services

2023-05-25 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11466:
-
Description: 
Currently, YARN Graceful Decommission waits for the completion of both the 
running containers and the running applications 
(https://issues.apache.org/jira/browse/YARN-9608) of those containers launched 
on the node under decommission. This adds an unnecessarily high cost for users 
on cloud deployments, as most of the idle nodes sit in decommissioning, waiting 
for the running applications to complete.

This feature aims to improve the Graceful Decommission logic by waiting only for 
the actual shuffle data to be consumed by dependent tasks rather than for the 
entire application. Below is the high-level design I have in mind.

Add a new interface (say AuxiliaryShuffleService extends AuxiliaryService) 
through which the workloads' (Spark, Tez, MapReduce) shuffle handlers expose 
shuffle data metrics (such as whether shuffle data is still present). The 
NodeManager periodically collects these metrics from the configured 
AuxiliaryShuffleServices and sends them with its heartbeat to the 
ResourceManager. The graceful decommission logic running inside the 
ResourceManager then waits until the shuffle data is consumed, with a maximum 
wait time up to the configured graceful decommission timeout.

  was:
Currently, YARN Graceful Decommission waits for the completion of both running 
containers and the running applications of those containers launched on the 
node under decommission. This adds unnecessary cost to users on cloud 
deployments. This feature aims to improve the Graceful Decommission logic by 
waiting for the actual shuffle data to be consumed by dependent tasks rather 
than the entire application.

Below is the high-level design I have in mind.

Add a new interface (say AuxiliaryShuffleService extends AuxiliaryService) 
through which the workloads' (Spark, Tez, MapReduce) shuffle handlers expose 
shuffle data metrics (such as whether shuffle data is still present). The 
NodeManager periodically collects these metrics from the configured 
AuxiliaryShuffleServices and sends them with its heartbeat to the 
ResourceManager. The graceful decommission logic running inside the 
ResourceManager then waits until the shuffle data is consumed, with a maximum 
wait time up to the configured graceful decommission timeout.




> Graceful Decommission for Shuffle Services
> --
>
> Key: YARN-11466
> URL: https://issues.apache.org/jira/browse/YARN-11466
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> Currently, YARN Graceful Decommission waits for the completion of both the 
> running containers and the running applications 
> (https://issues.apache.org/jira/browse/YARN-9608) of those containers 
> launched on the node under decommission. This adds an unnecessarily high cost 
> for users on cloud deployments, as most of the idle nodes sit in 
> decommissioning, waiting for the running applications to complete.
> This feature aims to improve the Graceful Decommission logic by waiting only 
> for the actual shuffle data to be consumed by dependent tasks rather than for 
> the entire application. Below is the high-level design I have in mind.
> Add a new interface (say AuxiliaryShuffleService extends AuxiliaryService) 
> through which the workloads' (Spark, Tez, MapReduce) shuffle handlers expose 
> shuffle data metrics (such as whether shuffle data is still present). The 
> NodeManager periodically collects these metrics from the configured 
> AuxiliaryShuffleServices and sends them with its heartbeat to the 
> ResourceManager. The graceful decommission logic running inside the 
> ResourceManager then waits until the shuffle data is consumed, with a maximum 
> wait time up to the configured graceful decommission timeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11494) Acquired Containers are killed when the node is reconnected

2023-05-12 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11494:


 Summary: Acquired Containers are killed when the node is 
reconnected
 Key: YARN-11494
 URL: https://issues.apache.org/jira/browse/YARN-11494
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.3.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


When a NodeManager is reconnected, the ResourceManager marks the acquired 
containers on that node as LOST, which leads to job failures.

{code}
2023-04-10 02:57:16,412 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC 
Server handler 41 on 8025): Reconnect from the node at: node1
2023-04-10 02:57:16,412 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC 
Server handler 41 on 8025): NodeManager from node node1(cmPort: 8041 httpPort: 
8042) registered with capability: , assigned nodeId 
node1:8041, node labels { CORE } 
2023-04-10 02:57:16,413 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl 
(ResourceManager Event Processor): container_e15_1677844874019_238016_01_02 
Container Transitioned from ACQUIRED to KILLED
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11466) Graceful Decommission for Shuffle Services

2023-04-16 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11466:


 Summary: Graceful Decommission for Shuffle Services
 Key: YARN-11466
 URL: https://issues.apache.org/jira/browse/YARN-11466
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Currently, YARN Graceful Decommission waits for the completion of both running 
containers and the running applications of those containers launched on the 
node under decommission. This adds unnecessary cost to users on cloud 
deployments. This feature aims to improve the Graceful Decommission logic by 
waiting for the actual shuffle data to be consumed by dependent tasks rather 
than the entire application.

Below is the high-level design I have in mind.

Add a new interface (say AuxiliaryShuffleService extends AuxiliaryService) 
through which the workloads' (Spark, Tez, MapReduce) shuffle handlers expose 
shuffle data metrics (such as whether shuffle data is still present). The 
NodeManager periodically collects these metrics from the configured 
AuxiliaryShuffleServices and sends them with its heartbeat to the 
ResourceManager. The graceful decommission logic running inside the 
ResourceManager then waits until the shuffle data is consumed, with a maximum 
wait time up to the configured graceful decommission timeout.
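
A minimal sketch of the shape such an interface could take. AuxiliaryService is the existing YARN class; the subclass name comes from the description above, and the method below is only a placeholder to illustrate the idea, not a committed API:

{code:java}
import org.apache.hadoop.yarn.server.api.AuxiliaryService;

/**
 * Sketch of the proposed extension point. A shuffle handler (Spark, Tez,
 * MapReduce) would extend this instead of AuxiliaryService directly, so the
 * NodeManager can report via its heartbeat whether shuffle data produced on
 * this node is still needed by downstream tasks.
 */
public abstract class AuxiliaryShuffleService extends AuxiliaryService {

  protected AuxiliaryShuffleService(String name) {
    super(name);
  }

  /**
   * @return true if this node still holds shuffle data that has not yet been
   *         fully consumed by dependent tasks. (Placeholder method name.)
   */
  public abstract boolean hasPendingShuffleData();
}
{code}

The ResourceManager side would then keep a DECOMMISSIONING node alive only while this flag (as reported in the node heartbeat) is true, bounded by the graceful decommission timeout.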





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11457) NodeManager Resource Leak when handling a container log with colon

2023-03-16 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11457:


 Summary: NodeManager Resource Leak when handling a container log 
with colon 
 Key: YARN-11457
 URL: https://issues.apache.org/jira/browse/YARN-11457
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.3.3
Reporter: Prabhu Joseph
Assignee: Vineeth Naroju
 Attachments: Screenshot 2023-03-16 at 1.02.22 PM.png, Screenshot 
2023-03-16 at 1.02.45 PM.png, Screenshot 2023-03-16 at 1.02.57 PM.png

The NodeManager leaks resources when handling a container log file whose name 
contains a colon. The illegal file name is not handled, which leads to a 
resource leak on the NodeManager side.

 
{code:java}
2023-03-14 11:03:53,390 WARN org.apache.hadoop.util.concurrent.ExecutorHelper 
(ContainersLauncher #2683): Caught exception in thread ContainersLauncher 
#2683: 
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: taskmanager.log.2023-03-14 09:44-1
at org.apache.hadoop.fs.Path.initialize(Path.java:263)
at org.apache.hadoop.fs.Path.<init>(Path.java:221)
at org.apache.hadoop.fs.Path.<init>(Path.java:129)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:270)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2096)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2078)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.handleContainerExitWithFailure(ContainerLaunch.java:653)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.handleContainerExitCode(ContainerLaunch.java:593)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:337)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:101)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
taskmanager.log.2023-03-14 09:44-1
at java.net.URI.checkPath(URI.java:1823)
at java.net.URI.<init>(URI.java:745)
at org.apache.hadoop.fs.Path.initialize(Path.java:260)
... 14 more 
{code}

The NodeManager status details show the application stuck in FINISHING_CONTAINER_WAIT 
and containers stuck in the KILLING state.

 !Screenshot 2023-03-16 at 1.02.57 PM.png|height=100,width=250!

 !Screenshot 2023-03-16 at 1.02.45 PM.png|height=100,width=250!

 !Screenshot 2023-03-16 at 1.02.22 PM.png|height=250,width=250!
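
A minimal sketch of the kind of defensive handling this needs, using the failing file name from the trace above and a hypothetical log directory. It only illustrates catching the per-file IllegalArgumentException so container exit handling can continue; it is not the actual patch:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: a log file name containing a colon makes Path/globStatus throw
// IllegalArgumentException ("Relative path in absolute URI"). Catching it per
// file lets container exit handling continue instead of aborting and leaking
// the container on the NodeManager side.
public class ContainerLogGlobSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.getLocal(new Configuration());
    String containerLogDir = "/tmp/container_logs";          // hypothetical directory
    String fileName = "taskmanager.log.2023-03-14 09:44-1";  // name with a colon
    try {
      FileStatus[] matches = fs.globStatus(new Path(containerLogDir, fileName + "*"));
      if (matches != null) {
        System.out.println("Matched " + matches.length + " log file(s)");
      }
    } catch (IllegalArgumentException e) {
      // Illegal file name: log and skip this file instead of letting the
      // exception escape and abort the ContainersLauncher cleanup.
      System.err.println("Skipping log file with illegal name: " + e.getMessage());
    }
  }
}
{code}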




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11455) All RMs in HA are stuck in standby when the ZK connection is disconnected

2023-03-07 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11455:


 Summary: All RMs in HA are stuck in standby when the ZK connection 
is disconnected
 Key: YARN-11455
 URL: https://issues.apache.org/jira/browse/YARN-11455
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.3.3, 2.10.1
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


All RMs in HA are stuck in standby when the ZK connection held by the active RM 
is disconnected.
{code:java}
2023-02-22 13:08:19,832 INFO org.apache.hadoop.ha.ActiveStandbyElector 
(main-EventThread): Session disconnected. Entering neutral mode...
2023-02-22 13:08:19,832 WARN 
org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService
 (main-EventThread): Lost contact with Zookeeper. Transitioning to standby in 
1 ms if connection is not reestablished.{code}
 

*Repro:*

Send a Disconnected event to the active RM using the code below.
{code:java}
zkConnectionState = ConnectionState.DISCONNECTED;
enterNeutralMode();
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-11417) RM Crashes when changing Node Label of a Node in Distributed Configuration

2023-01-25 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680718#comment-17680718
 ] 

Prabhu Joseph edited comment on YARN-11417 at 1/25/23 5:36 PM:
---

When the NodeManager node label is changed to a new label and the NodeManager is 
restarted, it resyncs to the ResourceManager with the new label. CapacityScheduler 
receives the NODE_LABELS_UPDATE event, which removes the node from the nodesList 
of the old node partition in the {{nodesPerLabel}} map <{{partition}}, {{nodesList}}> 
as part of {{ClusterNodeTracker}}#{{updateNodesPerPartition}}. Then 
{{CapacityScheduler}} receives NODE_REMOVED, which removes the node from the 
{{ClusterNodeTracker}} and also removes the node from the nodesList of the new 
partition in {{nodesPerLabel}}; this fails with an NPE because the new partition 
is not yet present in the {{nodesPerLabel}} map and will be added only after the 
NODE_ADDED event.

In the absence of the new partition, {{ClusterNodeTracker}}#{{removeNode}} can 
skip removing the node from {{nodesPerLabel}}, as it was already removed 
during NODE_LABELS_UPDATE.
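
A self-contained sketch of that guard, using a plain map in place of the real ClusterNodeTracker internals (class and method names below are illustrative only, not the actual patch):

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the proposed guard: removing a node from a partition that is
// not tracked yet becomes a no-op instead of an NPE.
public class NodesPerLabelSketch {

  private final Map<String, Set<String>> nodesPerLabel = new HashMap<>();

  public void addNode(String partition, String nodeId) {
    nodesPerLabel.computeIfAbsent(partition, p -> new HashSet<>()).add(nodeId);
  }

  public void removeNode(String partition, String nodeId) {
    Set<String> nodes = nodesPerLabel.get(partition);
    if (nodes == null) {
      // New partition not added yet (NODE_ADDED has not fired); the node was
      // already dropped from its old partition during NODE_LABELS_UPDATE.
      return;
    }
    nodes.remove(nodeId);
    if (nodes.isEmpty()) {
      nodesPerLabel.remove(partition);
    }
  }

  public static void main(String[] args) {
    NodesPerLabelSketch tracker = new NodesPerLabelSketch();
    tracker.addNode("CORE", "node1");
    // Label changed to DEFAULT; NODE_REMOVED now targets the untracked
    // partition, which would have thrown an NPE without the guard.
    tracker.removeNode("DEFAULT", "node1");   // no NPE, simply a no-op
  }
}
{code}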



was (Author: prabhu joseph):
When the NodeManager node label is changed to a new label and the NodeManager is 
restarted, it resyncs to the ResourceManager with the new label. CapacityScheduler 
receives the NODE_LABELS_UPDATE event, which removes the node from the nodesList 
of the old node partition in the nodesPerLabel map <partition, nodesList> as part of 
ClusterNodeTracker#updateNodesPerPartition. Then CapacityScheduler receives 
NODE_REMOVED, which removes the node from the ClusterNodeTracker and also 
removes the node from the nodesList of the new partition in nodesPerLabel; this 
fails with an NPE because the new partition is not yet present in the 
nodesPerLabel map and will be added only after the NODE_ADDED event.

In the absence of the new partition, ClusterNodeTracker#removeNode can skip 
removing the node from nodesPerLabel, as it was already removed 
during NODE_LABELS_UPDATE.


> RM Crashes when changing Node Label of a Node in Distributed Configuration
> --
>
> Key: YARN-11417
> URL: https://issues.apache.org/jira/browse/YARN-11417
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
>
> RM Crashes when changing Node Label of a Node in Distributed Configuration.
> {code:java}
> 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> NODE_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> *Repro*
> 1. Two NodeManagers with CORE Node Label
> {code:java}
> yarn.nodemanager.node-labels.provider.configured-node-partition=CORE
> yarn.node-labels.enabled = true
> yarn.node-labels.configuration-type = distributed
> yarn.nodemanager.node-labels.provider = config
> {code}
> 2. Remove the Node Label from one of the node to make it Default Partition 
> and restart nodemanager.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11417) RM Crashes when changing Node Label of a Node in Distributed Configuration

2023-01-25 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680718#comment-17680718
 ] 

Prabhu Joseph commented on YARN-11417:
--

When the NodeManager node label is changed to a new label and the NodeManager is 
restarted, it resyncs to the ResourceManager with the new label. CapacityScheduler 
receives the NODE_LABELS_UPDATE event, which removes the node from the nodesList 
of the old node partition in the nodesPerLabel map <partition, nodesList> as part of 
ClusterNodeTracker#updateNodesPerPartition. Then CapacityScheduler receives 
NODE_REMOVED, which removes the node from the ClusterNodeTracker and also 
removes the node from the nodesList of the new partition in nodesPerLabel; this 
fails with an NPE because the new partition is not yet present in the 
nodesPerLabel map and will be added only after the NODE_ADDED event.

In the absence of the new partition, ClusterNodeTracker#removeNode can skip 
removing the node from nodesPerLabel, as it was already removed 
during NODE_LABELS_UPDATE.


> RM Crashes when changing Node Label of a Node in Distributed Configuration
> --
>
> Key: YARN-11417
> URL: https://issues.apache.org/jira/browse/YARN-11417
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
>
> RM Crashes when changing Node Label of a Node in Distributed Configuration.
> {code:java}
> 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> NODE_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> *Repro*
> 1. Two NodeManagers with CORE Node Label
> {code:java}
> yarn.nodemanager.node-labels.provider.configured-node-partition=CORE
> yarn.node-labels.enabled = true
> yarn.node-labels.configuration-type = distributed
> yarn.nodemanager.node-labels.provider = config
> {code}
> 2. Remove the Node Label from one of the node to make it Default Partition 
> and restart nodemanager.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11421) Graceful Decommission ignores launched containers and gets deactivated before timeout

2023-01-19 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678667#comment-17678667
 ] 

Prabhu Joseph commented on YARN-11421:
--

[~abhishekd0907] This looks the same as 
[YARN-10873|https://issues.apache.org/jira/browse/YARN-10873].

> Graceful Decommission ignores launched containers and gets deactivated before 
> timeout
> -
>
> Key: YARN-11421
> URL: https://issues.apache.org/jira/browse/YARN-11421
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1, 3.3.1, 3.3.4
>Reporter: Abhishek Dixit
>Priority: Major
>
> During Graceful Decommission, a node gets deactivated before the timeout even 
> though there are launched containers on that node.
> We have observed cases where the graceful decommission signal is sent to a 
> node at the same time containers are being launched on its NodeManager. In 
> such cases the ResourceManager moves the node from Decommissioning to 
> Decommissioned state because launched containers are not checked in 
> DeactivateNodeTransition.
> We suggest using a multiple-arc transition instead of DeactivateNodeTransition, 
> one that checks the scheduler for AM containers and then decides whether to 
> keep the node in Decommissioning state or move it to Decommissioned state.
>  
> {code:java}
> .addTransition(NodeState.DECOMMISSIONING, NodeState.DECOMMISSIONED, 
> RMNodeEventType.DECOMMISSION,  new 
> DeactivateNodeTransition(NodeState.DECOMMISSIONED)){code}
>  
>  
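
A self-contained toy sketch of the multiple-arc idea suggested in the 
description above. It is illustrative only, does not use the real RMNodeImpl or 
StateMachineFactory types, and the launched-container check is hypothetical.

{code:java}
import java.util.Set;

public class MultiArcDecommissionDemo {
  enum NodeState { DECOMMISSIONING, DECOMMISSIONED }

  // The transition itself picks one of several post-states, the way YARN's
  // MultipleArcTransition does: stay DECOMMISSIONING while containers are
  // still launched on the node, otherwise move to DECOMMISSIONED.
  static NodeState onDecommission(Set<String> launchedContainers) {
    return launchedContainers.isEmpty()
        ? NodeState.DECOMMISSIONED     // safe to deactivate now
        : NodeState.DECOMMISSIONING;   // keep waiting for containers or the timeout
  }

  public static void main(String[] args) {
    System.out.println(onDecommission(Set.of("container_e01_0001_01_000002")));
    System.out.println(onDecommission(Set.of()));
  }
}
{code}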






[jira] [Commented] (YARN-11414) ClusterMetricsInfo shows wrong availableMB when node labels enabled

2023-01-12 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17676495#comment-17676495
 ] 

Prabhu Joseph commented on YARN-11414:
--

[~maniraj...@gmail.com] We see that ClusterMetricsInfo shows available and 
allocated resources only for the Default Partition, while QueueMetrics already 
reports them at the partition level. This Jira intends to change ClusterMetrics 
to show cluster-wide values, which will help all the schedulers.
Do you have any comments?
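
As a minimal, self-contained illustration (not the actual ClusterMetricsInfo 
code, and the per-partition numbers are made up), the cluster-wide availableMB 
is simply the sum over all partitions rather than the Default Partition alone:

{code:java}
import java.util.Map;

public class ClusterWideAvailableMbDemo {
  public static void main(String[] args) {
    // Hypothetical per-partition availableMB values ("" is the DEFAULT partition).
    Map<String, Long> availableMbPerPartition =
        Map.of("", 8_192L, "CORE", 65_536L, "TASK", 32_768L);

    long clusterWideAvailableMb = availableMbPerPartition.values().stream()
        .mapToLong(Long::longValue).sum();

    // Today only the DEFAULT partition (8192 MB) is reported;
    // the cluster-wide value here is 106496 MB.
    System.out.println("clusterWideAvailableMb = " + clusterWideAvailableMb);
  }
}
{code}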

> ClusterMetricsInfo shows wrong availableMB when node labels enabled 
> 
>
> Key: YARN-11414
> URL: https://issues.apache.org/jira/browse/YARN-11414
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.3
>Reporter: Prabhu Joseph
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
>
> ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows 
> availableMB of Default Partition alone. 






[jira] [Updated] (YARN-11417) RM Crashes when changing Node Label of a Node in Distributed Configuration

2023-01-11 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11417:
-
Description: 
RM Crashes when changing Node Label of a Node in Distributed Configuration.
{code:java}
2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher 
(SchedulerEventDispatcher:Event Processor): Error in handling event type 
NODE_REMOVED to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83)
at java.lang.Thread.run(Thread.java:750)

{code}
*Repro*

1. Two NodeManagers with CORE Node Label
{code:java}
yarn.nodemanager.node-labels.provider.configured-node-partition=CORE
yarn.node-labels.enabled = true
yarn.node-labels.configuration-type = distributed
yarn.nodemanager.node-labels.provider = config
{code}
2. Remove the Node Label from one of the node to make it Default Partition and 
restart nodemanager.

 

  was:
RM Crashes when changing Node Label of a Node in Distributed Configuration.

{code}
2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher 
(SchedulerEventDispatcher:Event Processor): Error in handling event type 
NODE_REMOVED to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83)
at java.lang.Thread.run(Thread.java:750)

{code}


*Repro*

1. Two NodeManagers with CORE Node Label

{code}
yarn.nodemanager.node-labels.provider.configured-node-partition=CORE
yarn.node-labels.enabled = true
yarn.node-labels.configuration-type = distributed
yarn.nodemanager.node-labels.provider = config
{code}

2. Change the Node Label of one of the node into TASK and restart nodemanager.

 


> RM Crashes when changing Node Label of a Node in Distributed Configuration
> --
>
> Key: YARN-11417
> URL: https://issues.apache.org/jira/browse/YARN-11417
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
>
> RM Crashes when changing Node Label of a Node in Distributed Configuration.
> {code:java}
> 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> NODE_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> *Repro*
> 1. Two NodeManagers with CORE Node Label
> {code:java}
> yarn.nodemanager.node-labels.provider.configured-node-partition=CORE
> yarn.node-labels.enabled = true
> yarn.node-labels.configuration-type = distributed
> yarn.nodemanager.node-labels.provider = config
> {code}
> 2. Remove the Node Label from one of the node to make it Default Partition 
> and restart nodemanager.
>  




[jira] [Created] (YARN-11417) RM Crashes when changing Node Label of a Node in Distributed Configuration

2023-01-11 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11417:


 Summary: RM Crashes when changing Node Label of a Node in 
Distributed Configuration
 Key: YARN-11417
 URL: https://issues.apache.org/jira/browse/YARN-11417
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.3.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


RM Crashes when changing Node Label of a Node in Distributed Configuration.

{code}
2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher 
(SchedulerEventDispatcher:Event Processor): Error in handling event type 
NODE_REMOVED to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83)
at java.lang.Thread.run(Thread.java:750)

{code}


*Repro*

1. Two NodeManagers with CORE Node Label

{code}
yarn.nodemanager.node-labels.provider.configured-node-partition=CORE
yarn.node-labels.enabled = true
yarn.node-labels.configuration-type = distributed
yarn.nodemanager.node-labels.provider = config
{code}

2. Change the Node Label of one of the node into TASK and restart nodemanager.

 






[jira] [Updated] (YARN-11414) ClusterMetricsInfo shows wrong availableMB when node labels enabled

2023-01-11 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11414:
-
Description: ClusterMetricsInfo shows wrong availableMB when node labels 
enabled. It shows availableMB of Default Partition alone.   (was: 
ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows 
availableMB of Default Partition alone. This is a regression from Hadoop-3.2.1 
where it has shown cluster wide availableMB.)

> ClusterMetricsInfo shows wrong availableMB when node labels enabled 
> 
>
> Key: YARN-11414
> URL: https://issues.apache.org/jira/browse/YARN-11414
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.3
>Reporter: Prabhu Joseph
>Assignee: Ashutosh Gupta
>Priority: Major
>
> ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows 
> availableMB of Default Partition alone. 






[jira] [Assigned] (YARN-11414) ClusterMetricsInfo shows wrong availableMB when node labels enabled

2023-01-11 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-11414:


Assignee: Ashutosh Gupta  (was: Prabhu Joseph)

> ClusterMetricsInfo shows wrong availableMB when node labels enabled 
> 
>
> Key: YARN-11414
> URL: https://issues.apache.org/jira/browse/YARN-11414
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.3
>Reporter: Prabhu Joseph
>Assignee: Ashutosh Gupta
>Priority: Major
>
> ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows 
> availableMB of Default Partition alone. This is a regression from 
> Hadoop-3.2.1 where it has shown cluster wide availableMB.






[jira] [Created] (YARN-11414) ClusterMetricsInfo shows wrong availableMB when node labels enabled

2023-01-10 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11414:


 Summary: ClusterMetricsInfo shows wrong availableMB when node 
labels enabled 
 Key: YARN-11414
 URL: https://issues.apache.org/jira/browse/YARN-11414
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


ClusterMetricsInfo shows wrong availableMB when node labels enabled. It shows 
availableMB of Default Partition alone. This is a regression from Hadoop-3.2.1 
where it has shown cluster wide availableMB.






[jira] [Commented] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure

2023-01-02 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653796#comment-17653796
 ] 

Prabhu Joseph commented on YARN-11403:
--

[~bteke] Currently, the Maximum Allocation value is the maximum of the healthy 
NodeManager capabilities ({{yarn.nodemanager.resource.memory-mb}}). If there is 
no healthy NodeManager running, it falls back to the configured maximum 
allocation ({{yarn.scheduler.maximum-allocation-mb}}). This part is correct and 
is not going to be changed.

When a node is under decommission, the capability of that node is updated 
dynamically to the amount of resource in use. This updated value is also 
considered in the maximum allocation calculation, which leads to inconsistent 
maximum allocation values and causes job failures.

For example, consider a cluster with two worker nodes, node1 (100 GB) and node2 
(100 GB), and a configured maxAllocation of 20 GB.

If both nodes become UNHEALTHY for any reason, the maximum allocation reverts 
to the configured value of 20 GB. This part is correct.

However, suppose one node is UNHEALTHY and the other is under decommission with 
a usage of 1 GB; the maximum allocation is now 1 GB. This is wrong and leads to 
job failures. The expected value in this scenario is 20 GB.

The fix planned in this Jira is to exclude the capability of a node that is 
being decommissioned from the maximum allocation calculation. 
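
A rough, self-contained sketch of the intended calculation (illustrative only, 
not the ClusterNodeTracker code; the node states and capabilities below are 
made up): nodes that are DECOMMISSIONING are skipped, so their dynamically 
shrunk capability can no longer drag the maximum allocation down.

{code:java}
import java.util.List;
import java.util.OptionalLong;

public class MaxAllocationDemo {
  enum State { RUNNING, DECOMMISSIONING, UNHEALTHY }

  record Node(String id, State state, long capabilityMb) {}

  static long maxAllocationMb(List<Node> nodes, long configuredMaxMb) {
    OptionalLong largestHealthy = nodes.stream()
        .filter(n -> n.state() == State.RUNNING) // skip DECOMMISSIONING and UNHEALTHY
        .mapToLong(Node::capabilityMb)
        .max();
    // With no healthy NodeManager registered, fall back to the configured maximum.
    return largestHealthy.orElse(configuredMaxMb);
  }

  public static void main(String[] args) {
    List<Node> nodes = List.of(
        new Node("node1", State.UNHEALTHY, 100 * 1024),
        new Node("node2", State.DECOMMISSIONING, 1024)); // only 1 GB still in use
    // Expected 20480 (20 GB), not the 1 GB reported by the decommissioning node.
    System.out.println(maxAllocationMb(nodes, 20 * 1024));
  }
}
{code}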

> Decommission Node reduces the maximumAllocation and leads to Job Failure
> 
>
> Key: YARN-11403
> URL: https://issues.apache.org/jira/browse/YARN-11403
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.4
>Reporter: Prabhu Joseph
>Assignee: Vinay Devadiga
>Priority: Major
>
> When a node is put into Decommission, ClusterNodeTracker updates the 
> maximumAllocation to the totalResources in use from that node. This could 
> lead to Job Failure (with below error message) when the Job requests for a 
> container of size greater than the new maximumAllocation.
> {code:java}
> 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in 
> a row.
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[vcores], Requested 
> resource= vCores:2147483647>, maximum allowed allocation=, please 
> note that maximum allowed allocation is calculated by scheduler based on 
> maximum resource of registered NodeManagers, which might be less than 
> configured maximum allocation=
> {code}
> *Repro:*
> 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager 
> Resource Memory 10GB and configured maxAllocation is 10GB.
> 2. Submit SparkPi Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say 
> ApplicationMaster (2GB) is launched on node1. 
> 3. Put both nodes into Decommission. This makes maxAllocation to come down to 
> 2GB.
> 4. The SparkPi Job fails as it requests for Executor Size of 4GB whereas 
> maxAllocation is only 2GB.






[jira] [Assigned] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure

2022-12-28 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-11403:


Assignee: Vinay Devadiga  (was: Prabhu Joseph)

> Decommission Node reduces the maximumAllocation and leads to Job Failure
> 
>
> Key: YARN-11403
> URL: https://issues.apache.org/jira/browse/YARN-11403
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.4
>Reporter: Prabhu Joseph
>Assignee: Vinay Devadiga
>Priority: Major
>
> When a node is put into Decommission, ClusterNodeTracker updates the 
> maximumAllocation to the totalResources in use from that node. This could 
> lead to Job Failure (with below error message) when the Job requests for a 
> container of size greater than the new maximumAllocation.
> {code:java}
> 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in 
> a row.
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[vcores], Requested 
> resource= vCores:2147483647>, maximum allowed allocation=, please 
> note that maximum allowed allocation is calculated by scheduler based on 
> maximum resource of registered NodeManagers, which might be less than 
> configured maximum allocation=
> {code}
> *Repro:*
> 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager 
> Resource Memory 10GB and configured maxAllocation is 10GB.
> 2. Submit SparkPi Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say 
> ApplicationMaster (2GB) is launched on node1. 
> 3. Put both nodes into Decommission. This makes maxAllocation to come down to 
> 2GB.
> 4. The SparkPi Job fails as it requests for Executor Size of 4GB whereas 
> maxAllocation is only 2GB.






[jira] [Updated] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure

2022-12-27 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11403:
-
Description: 
When a node is put into Decommission, ClusterNodeTracker updates the 
maximumAllocation to the totalResources in use from that node. This could lead 
to Job Failure (with below error message) when the Job requests for a container 
of size greater than the new maximumAllocation.
{code:java}
22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a 
row.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request! Cannot allocate containers as requested resource is greater 
than maximum allowed allocation. Requested resource type=[vcores], Requested 
resource=, 
maximum allowed allocation=, please note that maximum 
allowed allocation is calculated by scheduler based on maximum resource of 
registered NodeManagers, which might be less than configured maximum 
allocation=
{code}
*Repro:*

1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager 
Resource Memory 10GB and configured maxAllocation is 10GB.
2. Submit SparkPi Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say 
ApplicationMaster (2GB) is launched on node1. 
3. Put both nodes into Decommission. This makes maxAllocation to come down to 
2GB.
4. The SparkPi Job fails as it requests for Executor Size of 4GB whereas 
maxAllocation is only 2GB.

  was:
When a node is put into Decommission, ClusterNodeTracker updates the 
maximumAllocation to the totalResources in use from that node. This could lead 
to Job Failure (with below error message) when the Job requests for a container 
of size greater than the new maximumAllocation.

{code}
22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a 
row.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request! Cannot allocate containers as requested resource is greater 
than maximum allowed allocation. Requested resource type=[vcores], Requested 
resource=, 
maximum allowed allocation=, please note that maximum 
allowed allocation is calculated by scheduler based on maximum resource of 
registered NodeManagers, which might be less than configured maximum 
allocation=
{code}

*Repro:*

1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager 
Resource Memory 10GB and configured maxAllocation is 10GB.
2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say 
ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark 
before it requests for Executors)
3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes 
maxAllocation to come down to 2GB.
4. Now notify the Spark Job. It requests for 4GB executor Size but the new 
maxAllocation is 2GB and so will fail.







> Decommission Node reduces the maximumAllocation and leads to Job Failure
> 
>
> Key: YARN-11403
> URL: https://issues.apache.org/jira/browse/YARN-11403
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.4
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> When a node is put into Decommission, ClusterNodeTracker updates the 
> maximumAllocation to the totalResources in use from that node. This could 
> lead to Job Failure (with below error message) when the Job requests for a 
> container of size greater than the new maximumAllocation.
> {code:java}
> 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in 
> a row.
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[vcores], Requested 
> resource= vCores:2147483647>, maximum allowed allocation=, please 
> note that maximum allowed allocation is calculated by scheduler based on 
> maximum resource of registered NodeManagers, which might be less than 
> configured maximum allocation=
> {code}
> *Repro:*
> 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager 
> Resource Memory 10GB and configured maxAllocation is 10GB.
> 2. Submit SparkPi Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say 
> ApplicationMaster (2GB) is launched on node1. 
> 3. Put both nodes into Decommission. This makes maxAllocation to come down to 
> 2GB.
> 4. The SparkPi Job fails as it requests for Executor Size of 4GB whereas 
> maxAllocation is only 2GB.




[jira] [Updated] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure

2022-12-27 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11403:
-
Description: 
When a node is put into Decommission, ClusterNodeTracker updates the 
maximumAllocation to the totalResources in use from that node. This could lead 
to Job Failure (with below error message) when the Job requests for a container 
of size greater than the new maximumAllocation.

{code}
22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a 
row.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request! Cannot allocate containers as requested resource is greater 
than maximum allowed allocation. Requested resource type=[vcores], Requested 
resource=, 
maximum allowed allocation=, please note that maximum 
allowed allocation is calculated by scheduler based on maximum resource of 
registered NodeManagers, which might be less than configured maximum 
allocation=
{code}

*Repro:*

1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager 
Resource Memory 10GB and configured maxAllocation is 10GB.
2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say 
ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark 
before it requests for Executors)
3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes 
maxAllocation to come down to 2GB.
4. Now notify the Spark Job. It requests for 4GB executor Size but the new 
maxAllocation is 2GB and so will fail.






  was:
When a node is put into Decommission, ClusterNodeTracker updates the 
maximumAllocation to the totalResources in use from that node. This could lead 
to Job Failure (with below error message) when the Job requests for a container 
of size greater than the new maximumAllocation.

{code}
22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a 
row.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request! Cannot allocate containers as requested resource is greater 
than maximum allowed allocation. Requested resource type=[vcores], Requested 
resource=, 
maximum allowed allocation=, please note that maximum 
allowed allocation is calculated by scheduler based on maximum resource of 
registered NodeManagers, which might be less than configured maximum 
allocation=
{code}

**Repro:**

1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager 
Resource Memory 10GB and configured maxAllocation is 10GB.
2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say 
ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark 
before it requests for Executors)
3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes 
maxAllocation to come down to 2GB.
4. Now notify the Spark Job. It requests for 4GB executor Size but the new 
maxAllocation is 2GB and so will fail.







> Decommission Node reduces the maximumAllocation and leads to Job Failure
> 
>
> Key: YARN-11403
> URL: https://issues.apache.org/jira/browse/YARN-11403
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.4
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> When a node is put into Decommission, ClusterNodeTracker updates the 
> maximumAllocation to the totalResources in use from that node. This could 
> lead to Job Failure (with below error message) when the Job requests for a 
> container of size greater than the new maximumAllocation.
> {code}
> 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in 
> a row.
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[vcores], Requested 
> resource= vCores:2147483647>, maximum allowed allocation=, please 
> note that maximum allowed allocation is calculated by scheduler based on 
> maximum resource of registered NodeManagers, which might be less than 
> configured maximum allocation=
> {code}
> *Repro:*
> 1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager 
> Resource Memory 10GB and configured maxAllocation is 10GB.
> 2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say 
> ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark 
> before it requests for Executors)
> 3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes 
> maxAllocation to come down to 2GB.
> 4. Now notify the Spark Job. It requests for 4GB executor Size but the new 
> maxAllocation is 2GB and so will fail.




[jira] [Created] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure

2022-12-27 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11403:


 Summary: Decommission Node reduces the maximumAllocation and leads 
to Job Failure
 Key: YARN-11403
 URL: https://issues.apache.org/jira/browse/YARN-11403
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.4
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


When a node is put into Decommission, ClusterNodeTracker updates the 
maximumAllocation to the totalResources in use from that node. This could lead 
to Job Failure (with below error message) when the Job requests for a container 
of size greater than the new maximumAllocation.

{code}
22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a 
row.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request! Cannot allocate containers as requested resource is greater 
than maximum allowed allocation. Requested resource type=[vcores], Requested 
resource=, 
maximum allowed allocation=, please note that maximum 
allowed allocation is calculated by scheduler based on maximum resource of 
registered NodeManagers, which might be less than configured maximum 
allocation=
{code}

**Repro:**

1. Cluster with two worker nodes - node1 and node2 each with YARN NodeManager 
Resource Memory 10GB and configured maxAllocation is 10GB.
2. Submit Spark Job (ApplicationMaster Size: 2GB, Executor Size: 4GB). Say 
ApplicationMaster (2GB) is launched on node1. (Add a wait condition in Spark 
before it requests for Executors)
3. Put node1 into Decommission and make node2 into UNHEALTHY. This makes 
maxAllocation to come down to 2GB.
4. Now notify the Spark Job. It requests for 4GB executor Size but the new 
maxAllocation is 2GB and so will fail.











[jira] [Commented] (YARN-11401) Separate AppMaster cleanup events and launcher event into different resource pools

2022-12-19 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649306#comment-17649306
 ] 

Prabhu Joseph commented on YARN-11401:
--

[~Daniel Ma] This is a duplicate of 
[YARN-11251|https://issues.apache.org/jira/browse/YARN-11251]. 

> Separate AppMaster cleanup events and launcher event into different resource 
> pools
> --
>
> Key: YARN-11401
> URL: https://issues.apache.org/jira/browse/YARN-11401
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Daniel Ma
>Priority: Major
>  Labels: pull-request-available
>
> Currently, there is only one thread pool to handle AM launch and cleanup 
> event by ResourceManager, 
> In some cases, too many cleanup event will lead to AM launch stuck for a long 
> time.
> So in this patch, We divide the shared thread pool into two separated ones to 
> handle different event in case that cleanup event flood in blocking launcher 
> events being timely handled and vise versa.






[jira] [Resolved] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared

2022-10-20 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-11285.
--
Resolution: Duplicate

> LocalizedResources are leaked and its LocalPath are not cleared
> ---
>
> Key: YARN-11285
> URL: https://issues.apache.org/jira/browse/YARN-11285
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> LocalizedResources are leaked and their LocalPaths are not cleared from NM 
> Local Directories.  
> Each container has a separate LocalizedResource object and a separate local 
> path like below.
> {code}
>/mnt/yarn/usercache/hive/filecache/6/2552419:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552420:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552421:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552422:
>total 28456
> {code}
> NM logs will be filled with below
> {code}
> 2022-08-07 09:00:00,275 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource
>  (IPC Server handler 4 on 8040): Resource 
> hdfs://hdfscluster/user/svc_di_data_eng/.hiveJars/hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar(->/mnt/yarn/usercache/data_eng_user/filecache/2498262/hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar)
>  transitioned from LOCALIZED to null
> 2022-08-07 09:00:00,340 INFO 
> org.apache.hadoop.yarn.util.ProcfsBasedProcessTree (Container Monitor): 
> SmapBasedCumulativeRssmem (bytes) : 0
> 2022-08-07 09:00:00,386 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource
>  (IPC Server handler 9 on 8040): Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> LOCALIZED at LOCALIZED
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:198)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:186)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:58)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.processHeartbeat(ResourceLocalizationService.java:1048)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:722)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:356)
> at 
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:48)
> at 
> org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:63)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
> {code}




[jira] [Assigned] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3

2022-10-20 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-11355:


Assignee: Vineeth Naroju  (was: Prabhu Joseph)

> YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
> --
>
> Key: YARN-11355
> URL: https://issues.apache.org/jira/browse/YARN-11355
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Vineeth Naroju
>Priority: Major
>
> YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during 
> initial retry.
> *Repro:*
> {code:java}
> 1. YARN Cluster with three master nodes rm1,rm2 and rm3
> 2. rm3 is active
> 3. yarn node -list or any other yarn client calls takes more than 30 seconds.
>  {code}
> The initial failover to rm2 is immediate but then the failover to rm3 is 
> after ~30000 ms. Current RetryPolicy does not honor the number of master 
> nodes. It has to perform at least one immediate failover to every rm.
> {code:java}
> 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: 
> Failing over to rm2
> 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Call From local to remote:8032 failed on 
> connection exception: java.net.ConnectException: Connection refused; For more 
> details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 
> failover attempts. Trying to failover after sleeping for 21139ms.
> {code}
>  
> *Workaround:*
> Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something 
> like 100. This will do an immediate failover to rm3, but there will be too 
> many retries when there is no active resourcemanager.
>  
>  






[jira] [Commented] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3

2022-10-20 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620888#comment-17620888
 ] 

Prabhu Joseph commented on YARN-11355:
--

Yes Done.

> YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
> --
>
> Key: YARN-11355
> URL: https://issues.apache.org/jira/browse/YARN-11355
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Vineeth Naroju
>Priority: Major
>
> YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during 
> initial retry.
> *Repro:*
> {code:java}
> 1. YARN Cluster with three master nodes rm1,rm2 and rm3
> 2. rm3 is active
> 3. yarn node -list or any other yarn client calls takes more than 30 seconds.
>  {code}
> The initial failover to rm2 is immediate but then the failover to rm3 is 
> after ~30000 ms. Current RetryPolicy does not honor the number of master 
> nodes. It has to perform at least one immediate failover to every rm.
> {code:java}
> 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: 
> Failing over to rm2
> 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Call From local to remote:8032 failed on 
> connection exception: java.net.ConnectException: Connection refused; For more 
> details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 
> failover attempts. Trying to failover after sleeping for 21139ms.
> {code}
>  
> *Workaround:*
> Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something 
> like 100. This will do an immediate failover to rm3, but there will be too 
> many retries when there is no active resourcemanager.
>  
>  






[jira] [Updated] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3

2022-10-20 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11355:
-
Description: 
YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during 
initial retry.

*Repro:*
{code:java}
1. YARN Cluster with three master nodes rm1,rm2 and rm3
2. rm3 is active
3. yarn node -list or any other yarn client calls takes more than 30 seconds.
 {code}
The initial failover to rm2 is immediate but then the failover to rm3 is after 
~30000 ms. Current RetryPolicy does not honor the number of master nodes. It 
has to perform at least one immediate failover to every rm.
{code:java}
2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing 
over to rm2
2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: 
java.net.ConnectException: Call From local to remote:8032 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover 
attempts. Trying to failover after sleeping for 21139ms.
{code}
 

*Workaround:*

Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something 
like 100. This will do an immediate failover to rm3, but there will be too many 
retries when there is no active resourcemanager.
 

 

  was:
YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during 
initial retry.

*Repro:*
{code:java}
1. YARN Cluster with three master nodes rm1,rm2 and rm3
2. rm3 is active
3. yarn node -list or any other yarn client calls takes more than 30 seconds.
 {code}
The initial failover to rm2 is immediate but then the failover to rm3 is after 
~30000 ms. Current RetryPolicy does not honor the number of master nodes. It 
has to perform at least one immediate failover to every rm.
{code:java}
2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing 
over to rm2
2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: 
java.net.ConnectException: Call From local to remote:8032 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover 
attempts. Trying to failover after sleeping for 21139ms.
{code}
 

*{*}Workaround:{*}*

Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something 
like 100. This will do an immediate failover to rm3, but there will be too many 
retries when there is no active resourcemanager.
 

 


> YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
> --
>
> Key: YARN-11355
> URL: https://issues.apache.org/jira/browse/YARN-11355
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during 
> initial retry.
> *Repro:*
> {code:java}
> 1. YARN Cluster with three master nodes rm1,rm2 and rm3
> 2. rm3 is active
> 3. yarn node -list or any other yarn client calls takes more than 30 seconds.
>  {code}
> The initial failover to rm2 is immediate but then the failover to rm3 is 
> after ~30000 ms. Current RetryPolicy does not honor the number of master 
> nodes. It has to perform at least one immediate failover to every rm.
> {code:java}
> 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: 
> Failing over to rm2
> 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Call From local to remote:8032 failed on 
> connection exception: java.net.ConnectException: Connection refused; For more 
> details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 
> failover attempts. Trying to failover after sleeping for 21139ms.
> {code}
>  
> *Workaround:*
> Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something 
> like 100. This will do an immediate failover to rm3, but there will be too 
> many retries when there is no active resourcemanager.
>  
>  






[jira] [Created] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3

2022-10-20 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11355:


 Summary: YARN Client Failovers immediately to rm2 but takes 
~30000ms to rm3
 Key: YARN-11355
 URL: https://issues.apache.org/jira/browse/YARN-11355
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 3.4.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during 
initial retry.

*Repro:*
{code:java}
1. YARN Cluster with three master nodes rm1,rm2 and rm3
2. rm3 is active
3. yarn node -list or any other yarn client calls takes more than 30 seconds.
 {code}
The initial failover to rm2 is immediate but then the failover to rm3 is after 
~30000 ms. Current RetryPolicy does not honor the number of master nodes. It 
has to perform at least one immediate failover to every rm.
{code:java}
2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing 
over to rm2
2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: 
java.net.ConnectException: Call From local to remote:8032 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover 
attempts. Trying to failover after sleeping for 21139ms.
{code}
 

*{*}Workaround:{*}*

Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something 
like 100. This will do an immediate failover to rm3, but there will be too many 
retries when there is no active resourcemanager.
 

 






[jira] [Commented] (YARN-11352) Support new API to get the total resource available in Yarn

2022-10-20 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620841#comment-17620841
 ] 

Prabhu Joseph commented on YARN-11352:
--

Thanks [~SanjayKumarSahu] for reporting the issue. Currently, Tez split 
calculation is based on AMRMClient#getAvailableResources, which is the headroom 
derived from the queue, user, and partition limits. Basing the split 
calculation on the total YARN cluster resource instead would lead to high task 
parallelism, with the Tez job waiting for resources of other 
queues/users/partitions that it will never get.

 
{code:java}
  /**
   * Get the currently available resources in the cluster.
   * A valid value is available after a call to allocate has been made
   * @return Currently available resources
   */
  public abstract Resource getAvailableResources();
{code}
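
As a minimal sketch of how an ApplicationMaster reads this headroom (it assumes 
the code runs inside an AM container with a valid AMRMToken; the host, port, 
and tracking URL below are placeholders), note that the value is populated only 
after an allocate() call:

{code:java}
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class HeadroomProbe {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();
    try {
      rmClient.registerApplicationMaster("localhost", 0, "");
      rmClient.allocate(0.0f); // headroom is only valid after an allocate call
      Resource headroom = rmClient.getAvailableResources();
      // Reflects the queue/user/partition limits, not the full cluster size.
      System.out.println("Headroom: " + headroom.getMemorySize() + " MB, "
          + headroom.getVirtualCores() + " vcores");
      rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    } finally {
      rmClient.stop();
    }
  }
}
{code}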

> Support new API to get the total resource available in Yarn
> ---
>
> Key: YARN-11352
> URL: https://issues.apache.org/jira/browse/YARN-11352
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacity scheduler, resourcemanager, yarn
>Affects Versions: 3.4.0
>Reporter: Sanjay Kumar Sahu
>Priority: Major
>
> Hive needs the total resource available in YARN via the AMRMClient interface. 
> This helps Hive decide the split count (fix the split calculation logic for 
> Hive on Tez/LLAP in clusters).
>  
> The improvement addresses a problem identified in the split calculation.
>  






[jira] [Updated] (YARN-11255) Support loading alternative docker client config from system environment

2022-09-21 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11255:
-
Labels:   (was: pull-request-available)

> Support loading alternative docker client config from system environment
> 
>
> Key: YARN-11255
> URL: https://issues.apache.org/jira/browse/YARN-11255
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Major
> Fix For: 3.4.0
>
>
> When using YARN docker support, the hadoop shell supports 
> {code:java}
> -docker_client_config{code}
>  to pass a client config file containing the security token, from which the 
> docker config for each job is generated as a temporary file.
> Other applications that submit jobs to YARN, e.g. Spark, load the docker 
> settings via system environment variables such as 
> {code:java}
> spark.executorEnv.* {code}
> and cannot add those authorization tokens, because this system environment is 
> not considered by YARN.
> Add a generic solution to handle these kinds of cases without making changes 
> in Spark code or other frameworks.
> E.g.
> When using remote container registry, the 
> {{YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG}} must reference the config.json
> file containing the credentials used to authenticate.
> {code:java}
> DOCKER_IMAGE_NAME=hadoop-docker 
> DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json
> spark-submit --master yarn \
> --deploy-mode cluster \
> --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
> --conf 
> spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
> --conf 
> spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG
>  \
> --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
> --conf 
> spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME
>  \
> --conf 
> spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG
>  \
> sparkR.R{code}






[jira] [Resolved] (YARN-11255) Support loading alternative docker client config from system environment

2022-09-21 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-11255.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

> Support loading alternative docker client config from system environment
> 
>
> Key: YARN-11255
> URL: https://issues.apache.org/jira/browse/YARN-11255
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When using YARN docker support, the hadoop shell supports 
> {code:java}
> -docker_client_config{code}
>  to pass a client config file containing the security token, from which the 
> docker config for each job is generated as a temporary file.
> Other applications that submit jobs to YARN, e.g. Spark, load the docker 
> settings via system environment variables such as 
> {code:java}
> spark.executorEnv.* {code}
> and cannot add those authorization tokens, because this system environment is 
> not considered by YARN.
> Add a generic solution to handle these kinds of cases without making changes 
> in Spark code or other frameworks.
> E.g.
> When using remote container registry, the 
> {{YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG}} must reference the config.json
> file containing the credentials used to authenticate.
> {code:java}
> DOCKER_IMAGE_NAME=hadoop-docker 
> DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json
> spark-submit --master yarn \
> --deploy-mode cluster \
> --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
> --conf 
> spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
> --conf 
> spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG
>  \
> --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
> --conf 
> spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME
>  \
> --conf 
> spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG
>  \
> sparkR.R{code}






[jira] [Created] (YARN-11299) NMWebService endpoint to expose tracked LocalizedResources and the references

2022-09-07 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11299:


 Summary: NMWebService endpoint to expose tracked 
LocalizedResources and the references
 Key: YARN-11299
 URL: https://issues.apache.org/jira/browse/YARN-11299
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.3.0
Reporter: Prabhu Joseph
Assignee: Samrat Deb


Add an NMWebService endpoint to expose the tracked LocalizedResources and their 
references. This will be useful for monitoring and debugging purposes.








[jira] [Updated] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared

2022-08-31 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11285:
-
Description: 
LocalizedResources are leaked and their LocalPaths are not cleared from NM 
Local Directories.  

Each container has a separate LocalizedResource object and a separate local 
path like below.
{code}
   /mnt/yarn/usercache/hive/filecache/6/2552419:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552420:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552421:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552422:
   total 28456
{code}


NM logs will be filled with below

{code}
2022-08-07 09:00:00,275 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource
 (IPC Server handler 4 on 8040): Resource 
hdfs://hdfscluster/user/svc_di_data_eng/.hiveJars/hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar(->/mnt/yarn/usercache/data_eng_user/filecache/2498262/hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar)
 transitioned from LOCALIZED to null
2022-08-07 09:00:00,340 INFO org.apache.hadoop.yarn.util.ProcfsBasedProcessTree 
(Container Monitor): SmapBasedCumulativeRssmem (bytes) : 0
2022-08-07 09:00:00,386 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource
 (IPC Server handler 9 on 8040): Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
LOCALIZED at LOCALIZED
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:198)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:186)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:58)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.processHeartbeat(ResourceLocalizationService.java:1048)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:722)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:356)
at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:48)
at 
org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:63)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
{code}


  was:
LocalizedResources are leaked and its LocalPath are not cleared from NM Local 
Directories.  

Each container has separate LocalizedResource object and separate local path 
like below.
{code}
   /mnt/yarn/usercache/hive/filecache/6/2552419:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552420:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552421:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 

[jira] [Updated] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared

2022-08-30 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11285:
-
Description: 
LocalizedResources are leaked and its LocalPath are not cleared from NM Local 
Directories.  

Each container has separate LocalizedResource object and separate local path 
like below.
{code}
   /mnt/yarn/usercache/hive/filecache/6/2552419:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552420:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552421:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552422:
   total 28456
{code}



  was:
LocalizedResources are leaked and its LocalPath are not cleared from NM Local 
Directories. When multiple containers are initialized at same time, 
LocalResourcesTrackerImpl REQUEST handler could create and handle multiple 
LocalizedResource object for the same input path due to race condition in 
[below 
code|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java#L149]

{code}
case REQUEST:
LocalResourceRequest req = event.getLocalResourceRequest();
LocalizedResource rsrc = localrsrc.get(req);
 
  if (null == rsrc) {
rsrc = new LocalizedResource(req, dispatcher);
localrsrc.put(req, rsrc);
  }
 rsrc.handle(event);
{code}


Each container will have separate LocalizedResource object and separate local 
path like below.
{code}
   /mnt/yarn/usercache/hive/filecache/6/2552419:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552420:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552421:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552422:
   total 28456
{code}




> LocalizedResources are leaked and its LocalPath are not cleared
> ---
>
> Key: YARN-11285
> URL: https://issues.apache.org/jira/browse/YARN-11285
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> LocalizedResources are leaked and its LocalPath are not cleared from NM Local 
> Directories.  
> Each container has separate LocalizedResource object and separate local path 
> like below.
> {code}
>/mnt/yarn/usercache/hive/filecache/6/2552419:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552420:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552421:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552422:
>total 28456
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared

2022-08-30 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11285:
-
Attachment: (was: TestConcurrency.java)

> LocalizedResources are leaked and its LocalPath are not cleared
> ---
>
> Key: YARN-11285
> URL: https://issues.apache.org/jira/browse/YARN-11285
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> LocalizedResources are leaked and its LocalPath are not cleared from NM Local 
> Directories.  
> Each container has separate LocalizedResource object and separate local path 
> like below.
> {code}
>/mnt/yarn/usercache/hive/filecache/6/2552419:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552420:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552421:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552422:
>total 28456
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11196) NUMA Awareness support in DefaultContainerExecutor

2022-08-29 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-11196.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

> NUMA Awareness support in DefaultContainerExecutor
> --
>
> Key: YARN-11196
> URL: https://issues.apache.org/jira/browse/YARN-11196
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.3.3
>Reporter: Prabhu Joseph
>Assignee: Samrat Deb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> [YARN-5764|https://issues.apache.org/jira/browse/YARN-5764] has added support 
> of NUMA Awareness for Containers launched through LinuxContainerExecutor. 
> This feature is useful to have in DefaultContainerExecutor as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared

2022-08-29 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11285:
-
Attachment: TestConcurrency.java

> LocalizedResources are leaked and its LocalPath are not cleared
> ---
>
> Key: YARN-11285
> URL: https://issues.apache.org/jira/browse/YARN-11285
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: TestConcurrency.java
>
>
> LocalizedResources are leaked and its LocalPath are not cleared from NM Local 
> Directories. When multiple containers are initialized at same time, 
> LocalResourcesTrackerImpl REQUEST handler could create and handle multiple 
> LocalizedResource object for the same input path due to race condition in 
> [below 
> code|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java#L149]
> {code}
> case REQUEST:
> LocalResourceRequest req = event.getLocalResourceRequest();
> LocalizedResource rsrc = localrsrc.get(req);
>  
>   if (null == rsrc) {
> rsrc = new LocalizedResource(req, dispatcher);
> localrsrc.put(req, rsrc);
>   }
>  rsrc.handle(event);
> {code}
> Each container will have separate LocalizedResource object and separate local 
> path like below.
> {code}
>/mnt/yarn/usercache/hive/filecache/6/2552419:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552420:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552421:
>total 28456
>-r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
> hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar
>/mnt/yarn/usercache/hive/filecache/6/2552422:
>total 28456
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11285) LocalizedResources are leaked and its LocalPath are not cleared

2022-08-29 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11285:


 Summary: LocalizedResources are leaked and its LocalPath are not 
cleared
 Key: YARN-11285
 URL: https://issues.apache.org/jira/browse/YARN-11285
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.2.1
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


LocalizedResources are leaked and their LocalPaths are not cleared from the NM 
local directories. When multiple containers are initialized at the same time, 
the LocalResourcesTrackerImpl REQUEST handler can create and handle multiple 
LocalizedResource objects for the same input path, due to a race condition in 
the [code 
below|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java#L149].

{code}
case REQUEST:
LocalResourceRequest req = event.getLocalResourceRequest();
LocalizedResource rsrc = localrsrc.get(req);
 
  if (null == rsrc) {
rsrc = new LocalizedResource(req, dispatcher);
localrsrc.put(req, rsrc);
  }
 rsrc.handle(event);
{code}


Each container will then have a separate LocalizedResource object and a 
separate local path, like below.
{code}
   /mnt/yarn/usercache/hive/filecache/6/2552419:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552420:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552421:
   total 28456
   -r-x-- 1 yarn yarn 29135164 Aug  7 10:24 
hive-exec-2.3.4.50-3fd48f33b0c0b82ab431013f0fe794dfe75c31a5027567e6865cccbb49de862b.jar

   /mnt/yarn/usercache/hive/filecache/6/2552422:
   total 28456
{code}
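
As an illustration of the pattern that would avoid creating multiple tracked 
objects for the same key (placeholder types only, not the committed fix and not 
the real LocalizedResource classes), an atomic compute-if-absent guarantees a 
single instance per request even under concurrent REQUEST events:

{code:java}
// Illustrative pattern only: concurrent requests for the same key must resolve
// to a single tracked object. computeIfAbsent does the lookup and the insert
// atomically, unlike the separate get()/put() in the snippet linked above.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SingleTrackedResourceDemo {
  private final ConcurrentMap<String, Object> tracked = new ConcurrentHashMap<>();

  // Returns the one tracked object per key, even with concurrent callers.
  public Object getOrTrack(String key) {
    return tracked.computeIfAbsent(key, k -> new Object());
  }

  public static void main(String[] args) throws Exception {
    SingleTrackedResourceDemo demo = new SingleTrackedResourceDemo();
    Object[] seen = new Object[2];
    Thread t1 = new Thread(() -> seen[0] = demo.getOrTrack("hive-exec.jar"));
    Thread t2 = new Thread(() -> seen[1] = demo.getOrTrack("hive-exec.jar"));
    t1.start(); t2.start(); t1.join(); t2.join();
    System.out.println("same instance: " + (seen[0] == seen[1]));  // prints true
  }
}
{code}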





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11251) Separate ThreadPool for AMLauncher Launch and Clean Events

2022-08-11 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11251:


 Summary: Separate ThreadPool for AMLauncher Launch and Clean Events
 Key: YARN-11251
 URL: https://issues.apache.org/jira/browse/YARN-11251
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Prabhu Joseph
Assignee: Samrat Deb


We have seen too many AM launch failures due to token expiry or container 
liveliness expiry when the AM launcher threads are busy retrying connections to 
AM hosts (Spot Instances) that are down. Having separate thread pools for the 
cleanup and launch events will reduce these AM launch failures.

*Token Expired*
{code}
2022-07-19 14:56:33,486 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl 
(IPC Server handler 39 on 8041): Unauthorized request to start container.
This token is expired. current time is 1658242593486 found 1658242289457
Note: System times on machines may be out of sync. Check system time and time 
zones.
{code}

*Container Liveliness Expiry*
{code}
2022-07-19 16:06:48,663 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl 
(ResourceManager Event Processor): container_1656573205571_2357731_01_01 
Container Transitioned from ACQUIRED to EXPIRED

2022-07-19 16:10:08,663 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor (Ping Checker): 
Expired: 
Timed out after 600 secs
{code}
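
A minimal sketch of the idea with placeholder names (not the actual AMLauncher 
wiring): route launch and cleanup work to independent pools so that launch 
threads stuck retrying a dead AM host cannot delay cleanups, and vice versa.

{code:java}
// Hypothetical sketch only; the class, enum and pool sizes are placeholders
// rather than the real AMLauncher / AMLauncherEventType implementation.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SplitPoolLauncherDemo {
  enum EventType { LAUNCH, CLEANUP }

  private final ExecutorService launchPool = Executors.newFixedThreadPool(50);
  private final ExecutorService cleanupPool = Executors.newFixedThreadPool(50);

  // Route each event to its own pool so one event type cannot starve the other.
  public void handle(EventType type, Runnable work) {
    (type == EventType.LAUNCH ? launchPool : cleanupPool).execute(work);
  }

  public void stop() {
    launchPool.shutdown();
    cleanupPool.shutdown();
  }
}
{code}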






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11200) Backport YARN-5764 NUMA awareness support for launching containers

2022-07-28 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-11200.
--
Fix Version/s: 2.10.3
   Resolution: Fixed

> Backport YARN-5764  NUMA awareness support for launching containers 
> 
>
> Key: YARN-11200
> URL: https://issues.apache.org/jira/browse/YARN-11200
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Reporter: Prabhu Joseph
>Assignee: Samrat Deb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.10.3
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Few users who are on 2.10 are looking for NUMA Support in YARN. Backporting 
> YARN-5764 to 2.10.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception

2022-07-26 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571353#comment-17571353
 ] 

Prabhu Joseph commented on YARN-11210:
--

Thanks [~aajisaka]. I was not aware of that.

> Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration 
> exception
> --
>
> Key: YARN-11210
> URL: https://issues.apache.org/jira/browse/YARN-11210
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Description of Problem
> Applications which call YARN RMAdminCLI (i.e. YARN ResourceManager client) 
> synchronously can be blocked for up to 15 minutes with the default 
> configuration of "yarn.resourcemanager.connect.max-wait.ms"; this is not an 
> issue in of itself, but there is a non-retryable IllegalArgumentException 
> exception thrown within the YARN ResourceManager client that is getting 
> swallowed & treated as a retryable "connection exception" meaning that it 
> gets retried for 15 minutes.
> The purpose of this JIRA (and PR) is to modify the YARN client so that it 
> does not retry on this non-retryable exception.
> h2. Background Information
> YARN ResourceManager client treats connection exceptions as retryable & with 
> the default value of "yarn.resourcemanager.connect.max-wait.ms" will attempt 
> to connect to the ResourceManager for up to 15 minutes when facing 
> "connection exceptions". This arguably makes sense because connection 
> exceptions are in some cases transient & can be recovered from without any 
> action needed from the client. See example below where YARN ResourceManager 
> client was able to recover from connection issues that resulted from the 
> ResourceManager process being down.
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over 
> null after 1 failover attempts. Trying to failover after sleeping for 41061ms.
> 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:28 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], 
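
As a rough illustration of the behaviour this JIRA asks for (not the merged 
patch), a Hadoop RetryPolicy wrapper can fail fast on the non-retryable 
exception type while delegating everything else to the existing policy:

{code:java}
// Illustrative wrapper only, not the committed YARN-11210 change: surface an
// IllegalArgumentException (e.g. a bad kerberos/endpoint configuration)
// immediately instead of treating it as a retryable connection problem.
import org.apache.hadoop.io.retry.RetryPolicy;

public class FailFastOnIllegalArgument implements RetryPolicy {
  private final RetryPolicy delegate;

  public FailFastOnIllegalArgument(RetryPolicy delegate) {
    this.delegate = delegate;
  }

  @Override
  public RetryAction shouldRetry(Exception e, int retries, int failovers,
      boolean isIdempotentOrAtMostOnce) throws Exception {
    if (e instanceof IllegalArgumentException) {
      return RetryAction.FAIL;  // non-retryable: fail immediately
    }
    return delegate.shouldRetry(e, retries, failovers, isIdempotentOrAtMostOnce);
  }
}
{code}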

[jira] [Resolved] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception

2022-07-26 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-11210.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

> Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration 
> exception
> --
>
> Key: YARN-11210
> URL: https://issues.apache.org/jira/browse/YARN-11210
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Description of Problem
> Applications which call YARN RMAdminCLI (i.e. YARN ResourceManager client) 
> synchronously can be blocked for up to 15 minutes with the default 
> configuration of "yarn.resourcemanager.connect.max-wait.ms"; this is not an 
> issue in of itself, but there is a non-retryable IllegalArgumentException 
> exception thrown within the YARN ResourceManager client that is getting 
> swallowed & treated as a retryable "connection exception" meaning that it 
> gets retried for 15 minutes.
> The purpose of this JIRA (and PR) is to modify the YARN client so that it 
> does not retry on this non-retryable exception.
> h2. Background Information
> YARN ResourceManager client treats connection exceptions as retryable & with 
> the default value of "yarn.resourcemanager.connect.max-wait.ms" will attempt 
> to connect to the ResourceManager for up to 15 minutes when facing 
> "connection exceptions". This arguably makes sense because connection 
> exceptions are in some cases transient & can be recovered from without any 
> action needed from the client. See example below where YARN ResourceManager 
> client was able to recover from connection issues that resulted from the 
> ResourceManager process being down.
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over 
> null after 1 failover attempts. Trying to failover after sleeping for 41061ms.
> 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:28 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], 

[jira] [Commented] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception

2022-07-25 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571182#comment-17571182
 ] 

Prabhu Joseph commented on YARN-11210:
--

[~aajisaka] Could you make [~KevinWikant] a contributor to YARN?

> Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration 
> exception
> --
>
> Key: YARN-11210
> URL: https://issues.apache.org/jira/browse/YARN-11210
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Description of Problem
> Applications which call YARN RMAdminCLI (i.e. YARN ResourceManager client) 
> synchronously can be blocked for up to 15 minutes with the default 
> configuration of "yarn.resourcemanager.connect.max-wait.ms"; this is not an 
> issue in of itself, but there is a non-retryable IllegalArgumentException 
> exception thrown within the YARN ResourceManager client that is getting 
> swallowed & treated as a retryable "connection exception" meaning that it 
> gets retried for 15 minutes.
> The purpose of this JIRA (and PR) is to modify the YARN client so that it 
> does not retry on this non-retryable exception.
> h2. Background Information
> YARN ResourceManager client treats connection exceptions as retryable & with 
> the default value of "yarn.resourcemanager.connect.max-wait.ms" will attempt 
> to connect to the ResourceManager for up to 15 minutes when facing 
> "connection exceptions". This arguably makes sense because connection 
> exceptions are in some cases transient & can be recovered from without any 
> action needed from the client. See example below where YARN ResourceManager 
> client was able to recover from connection issues that resulted from the 
> ResourceManager process being down.
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over 
> null after 1 failover attempts. Trying to failover after sleeping for 41061ms.
> 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:28 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> 

[jira] [Resolved] (YARN-11198) Deletion of assigned resources (e.g. GPU's, NUMA, FPGA's) from State Store

2022-07-13 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-11198.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

> Deletion of assigned resources (e.g. GPU's, NUMA, FPGA's) from State Store
> --
>
> Key: YARN-11198
> URL: https://issues.apache.org/jira/browse/YARN-11198
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.3
>Reporter: Prabhu Joseph
>Assignee: Samrat Deb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> [YARN-7033|https://issues.apache.org/jira/browse/YARN-7033] provided support 
> to recover  assigned resources to container. But did not delete them from 
> State Store as part of removal of container after the configured duration 
> yarn.nodemanager.duration-to-track-stopped-containers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11198) Deletion of assigned resources (e.g. GPU's, NUMA, FPGA's) from State Store

2022-06-27 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11198:


 Summary: Deletion of assigned resources (e.g. GPU's, NUMA, FPGA's) 
from State Store
 Key: YARN-11198
 URL: https://issues.apache.org/jira/browse/YARN-11198
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.3
Reporter: Prabhu Joseph
Assignee: Samrat Deb


[YARN-7033|https://issues.apache.org/jira/browse/YARN-7033] provided support to 
recover the resources assigned to a container, but it did not delete them from 
the state store when the container is removed after the configured duration 
yarn.nodemanager.duration-to-track-stopped-containers.
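
As an illustration of the cleanup being requested, a sketch of deleting the 
stored assigned-resource entries together with the container record; the key 
prefix and helper below are hypothetical and do not reflect the real NM 
state-store schema:

{code:java}
// Hypothetical sketch only: the LevelDB key prefix is invented for illustration
// and is not the actual NM state-store layout.
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import static java.nio.charset.StandardCharsets.UTF_8;

public class AssignedResourceCleanup {
  // Delete every assigned-resource entry stored under the given container id.
  public static void removeAssignedResources(DB db, String containerId)
      throws Exception {
    String prefix = "ContainerManager/containers/" + containerId
        + "/assignedResources/";                      // illustrative prefix
    try (DBIterator it = db.iterator()) {
      for (it.seek(prefix.getBytes(UTF_8)); it.hasNext(); it.next()) {
        byte[] key = it.peekNext().getKey();
        if (!new String(key, UTF_8).startsWith(prefix)) {
          break;                                      // left the prefix range
        }
        db.delete(key);
      }
    }
  }
}
{code}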





--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11196) NUMA Awareness support in DefaultContainerExecutor

2022-06-24 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11196:


 Summary: NUMA Awareness support in DefaultContainerExecutor
 Key: YARN-11196
 URL: https://issues.apache.org/jira/browse/YARN-11196
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.3.3
Reporter: Prabhu Joseph
Assignee: Samrat Deb


[YARN-5764|https://issues.apache.org/jira/browse/YARN-5764] has added NUMA 
awareness support for containers launched through the LinuxContainerExecutor. 
This feature would be useful to have in the DefaultContainerExecutor as well.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11195) Document how to configure NUMA in YARN

2022-06-23 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11195:
-
Summary: Document how to configure NUMA in YARN  (was: Doc on how to 
configure NUMA in YARN)

> Document how to configure NUMA in YARN
> --
>
> Key: YARN-11195
> URL: https://issues.apache.org/jira/browse/YARN-11195
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.3.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> [YARN-5764|https://issues.apache.org/jira/browse/YARN-5764] has added NUMA 
> Awareness support for launching containers. This improves the workload 
> performance on machines which has NUMA support like EC2 m5.24x.  
> Currently this feature works only on LinuxContainerExecutor and not on 
> DefaultContainerExecutor. Have seen users configuring on a 
> DefaultContainerExecutor by mistake and has not found any improvement. 
> Suggest to document how to enable NUMA in YARN.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11195) Doc on how to configure NUMA in YARN

2022-06-23 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-11195:


 Summary: Doc on how to configure NUMA in YARN
 Key: YARN-11195
 URL: https://issues.apache.org/jira/browse/YARN-11195
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Affects Versions: 3.3.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


[YARN-5764|https://issues.apache.org/jira/browse/YARN-5764] has added NUMA 
awareness support for launching containers. This improves workload performance 
on machines that have NUMA support, such as EC2 m5.24x.

Currently this feature works only with the LinuxContainerExecutor and not with 
the DefaultContainerExecutor. We have seen users configure it on the 
DefaultContainerExecutor by mistake and observe no improvement. We suggest 
documenting how to enable NUMA in YARN.
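
For reference when writing the doc, a minimal yarn-site.xml sketch of the NUMA 
settings introduced by YARN-5764 (these take effect only with the 
LinuxContainerExecutor; the property names should be verified against the 
NodeManager documentation):

{code}
<property>
  <name>yarn.nodemanager.numa-awareness.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.numa-awareness.read-topology</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.numa-awareness.numactl.cmd</name>
  <value>/usr/bin/numactl</value>
</property>
{code}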



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9971) YARN Native Service HttpProbe logs THIS_HOST in error messages

2022-06-21 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-9971.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Thanks [~groot] for the patch. I have committed it to trunk.

> YARN Native Service HttpProbe logs THIS_HOST in error messages
> --
>
> Key: YARN-9971
> URL: https://issues.apache.org/jira/browse/YARN-9971
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Ashutosh Gupta
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> YARN Native Service HttpProbe logs THIS_HOST in error messages. While 
> logging, missed to use the replaced url string.
> {code:java}
> 2019-11-12 19:25:47,317 [pool-7-thread-1] INFO  probe.HttpProbe - Probe 
> http://${THIS_HOST}:18010/master-status failed for IP 172.27.75.198: 
> java.net.ConnectException: Connection refused (Connection refused)
> {code}
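
A minimal sketch of the pattern being fixed (illustrative only, not the actual 
HttpProbe code): resolve the ${THIS_HOST} template per IP before connecting and 
reuse that resolved URL in the error message.

{code:java}
// Illustrative only: log the resolved URL rather than the raw template so the
// host that was actually probed appears in the error message.
import java.net.HttpURLConnection;
import java.net.URL;

public class ProbeLogDemo {
  public static boolean probe(String urlTemplate, String ip) {
    String resolved = urlTemplate.replace("${THIS_HOST}", ip);
    try {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(resolved).openConnection();
      conn.setConnectTimeout(2000);
      int code = conn.getResponseCode();
      return code >= 200 && code < 300;
    } catch (Exception e) {
      // Use 'resolved', not 'urlTemplate', in the failure log.
      System.out.println("Probe " + resolved + " failed for IP " + ip + ": " + e);
      return false;
    }
  }
}
{code}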



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11030) ClassNotFoundException when aux service class is loaded from customized classpath

2021-12-06 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453909#comment-17453909
 ] 

Prabhu Joseph commented on YARN-11030:
--

[~hadachi] Thanks for reporting the issue. This looks like a duplicate of 
[YARN-9967|https://issues.apache.org/jira/browse/YARN-9967]. Can you confirm? 
Thanks.

> ClassNotFoundException when aux service class is loaded from customized 
> classpath
> -
>
> Key: YARN-11030
> URL: https://issues.apache.org/jira/browse/YARN-11030
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Hiroyuki Adachi
>Priority: Minor
>
> NodeManager failed to load the aux service with ClassNotFoundException while 
> loading the class from the customized classpath.
> {noformat}
> 
>   
>    value="org.apache.spark.network.yarn.YarnShuffleService"/>
>    value="/tmp/spark-3.1.2-yarn-shuffle.jar"/>
>   
>  {noformat}
> {noformat}
> 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> classpath: [file:/tmp/spark-3.1.2-yarn-shuffle.jar]
> 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> system classes: [org.apache.spark.network.yarn.YarnShuffleService]
> 2021-12-06 15:32:09,169 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in
>  state INITED
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>         at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>         at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>         at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
>         at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:348)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.ja
> va:165)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452)
>         ... 10 more
> 2021-12-06 15:32:09,172 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  
> failed in state INITED{noformat}
>  
> YARN-9075 may cause this problem. The default system classes were changed by 
> this patch.
> Before YARN-9075: isSystemClass() returns false since the system classes does 
> not contain the aux service class itself, and the class will be loaded from 
> the customized classpath.
> [https://github.com/apache/hadoop/blob/rel/release-3.3.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ApplicationClassLoader.java#L176]
> {noformat}
> 2021-12-06 15:50:21,332 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> classpath: 
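
A simplified, self-contained illustration of the routing rule described above 
(not the real ApplicationClassLoader, whose system-classes matching also 
supports wildcards and exclusions): once the aux-service class name itself 
appears in the system-classes list, loading is delegated to the parent (NM) 
classpath, which does not contain the custom jar, hence the 
ClassNotFoundException.

{code:java}
// Simplified demo of system-class routing; matching here is a plain prefix
// check, which is a deliberate simplification of the real isSystemClass().
import java.util.List;

public class SystemClassRoutingDemo {
  static String loadedFrom(String className, List<String> systemClasses) {
    boolean system = systemClasses.stream().anyMatch(className::startsWith);
    return system ? "parent classloader (NM classpath)" : "custom classpath jar";
  }

  public static void main(String[] args) {
    String aux = "org.apache.spark.network.yarn.YarnShuffleService";
    // Aux-service name not in the system classes: loaded from the custom jar.
    System.out.println(loadedFrom(aux, List.of("java.", "org.apache.hadoop.")));
    // Aux-service name listed as a system class (the post-YARN-9075 situation
    // shown in the log above): delegated to the parent, which cannot find it.
    System.out.println(loadedFrom(aux, List.of(aux)));
  }
}
{code}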

[jira] [Commented] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files

2021-11-29 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17450654#comment-17450654
 ] 

Prabhu Joseph commented on YARN-10975:
--

Have committed the patch to trunk. Thanks [~Sushma_28].

> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> --
>
> Key: YARN-10975
> URL: https://issues.apache.org/jira/browse/YARN-10975
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Ravuri Sushma sree
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10975.001.patch, YARN-10975.002.patch, 
> YARN-10975.003.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> again and again even though there is no change in the file. This leads to 
> unnecessary load on DFS where summary files resides and Timeline Store where 
> timeline entities present.
> {code}
> 2021-10-10 19:20:43,940 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 275 msec
> 2021-10-10 19:21:44,079 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 341 msec
> 2021-10-10 19:22:44,065 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 335 msec
> 2021-10-10 19:23:44,038 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 370 msec
> 2021-10-10 19:24:44,087 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 317 msec
> 2021-10-10 19:25:44,092 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 336 msec
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files

2021-11-25 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449430#comment-17449430
 ] 

Prabhu Joseph commented on YARN-10975:
--

Thanks [~Sushma_28] for the patch. The patch looks good to me. Will commit it 
shortly.

> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> --
>
> Key: YARN-10975
> URL: https://issues.apache.org/jira/browse/YARN-10975
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: YARN-10975.001.patch, YARN-10975.002.patch, 
> YARN-10975.003.patch
>
>
> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> again and again even though there is no change in the file. This leads to 
> unnecessary load on DFS where summary files resides and Timeline Store where 
> timeline entities present.
> {code}
> 2021-10-10 19:20:43,940 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 275 msec
> 2021-10-10 19:21:44,079 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 341 msec
> 2021-10-10 19:22:44,065 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 335 msec
> 2021-10-10 19:23:44,038 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 370 msec
> 2021-10-10 19:24:44,087 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 317 msec
> 2021-10-10 19:25:44,092 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 336 msec
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7982) Do ACLs check while retrieving entity-types per application

2021-11-15 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444007#comment-17444007
 ] 

Prabhu Joseph commented on YARN-7982:
-

[~dmmkr] Yes, sure. Can you provide a patch for 3.2? I will help review and 
commit it. Thanks.

> Do ACLs check while retrieving entity-types per application
> ---
>
> Key: YARN-7982
> URL: https://issues.apache.org/jira/browse/YARN-7982
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-7982-001.patch, YARN-7982-002.patch, 
> YARN-7982-003.patch, YARN-7982-004.patch
>
>
> REST end point {{/apps/$appid/entity-types}} retrieves all the entity-types 
> for given application. This need to be guarded with ACL check
> {code}
> [yarn@yarn-ats-3 ~]$ curl 
> "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1552297011473_0002?user.name=ambari-qa1;
> {"exception":"ForbiddenException","message":"java.lang.Exception: User 
> ambari-qa1 is not allowed to read TimelineService V2 
> data.","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> [yarn@yarn-ats-3 ~]$ curl 
> "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1552297011473_0002/entity-types?user.name=ambari-qa1;
> ["YARN_APPLICATION_ATTEMPT","YARN_CONTAINER"]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files

2021-11-09 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441137#comment-17441137
 ] 

Prabhu Joseph commented on YARN-10975:
--

The main issue is in the code below, which always returns 0. Every time the 
file is processed the offset is reset to 0, so the next pass starts parsing 
from 0 again.

{code}
bytesParsed = parser.getCurrentLocation().getCharOffset() + 1;
LOG.trace("Parser now at offset {}", bytesParsed);
{code}
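
For illustration (assuming a Jackson JsonParser; this is not the committed 
patch): when character-offset tracking is unavailable, getCharOffset() returns 
-1, so the expression above collapses to 0. A small guard can keep the 
previously known offset instead of resetting it:

{code:java}
// Illustrative guard only: fall back to the last good offset when the parser
// cannot report a character offset (getCharOffset() == -1).
import com.fasterxml.jackson.core.JsonLocation;
import com.fasterxml.jackson.core.JsonParser;

public final class ParseOffsets {
  static long nextOffset(JsonParser parser, long previousOffset) {
    JsonLocation loc = parser.getCurrentLocation();
    long charOffset = loc.getCharOffset();  // -1 when offset tracking is unavailable
    return charOffset >= 0 ? charOffset + 1 : previousOffset;
  }
}
{code}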

> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> --
>
> Key: YARN-10975
> URL: https://issues.apache.org/jira/browse/YARN-10975
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Ravuri Sushma sree
>Priority: Major
>
> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> again and again even though there is no change in the file. This leads to 
> unnecessary load on DFS where summary files resides and Timeline Store where 
> timeline entities present.
> {code}
> 2021-10-10 19:20:43,940 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 275 msec
> 2021-10-10 19:21:44,079 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 341 msec
> 2021-10-10 19:22:44,065 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 335 msec
> 2021-10-10 19:23:44,038 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 370 msec
> 2021-10-10 19:24:44,087 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 317 msec
> 2021-10-10 19:25:44,092 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 336 msec
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files

2021-10-18 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-10975:


Assignee: Ravuri Sushma sree  (was: Prabhu Joseph)

> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> --
>
> Key: YARN-10975
> URL: https://issues.apache.org/jira/browse/YARN-10975
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Ravuri Sushma sree
>Priority: Major
>
> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> again and again even though there is no change in the file. This leads to 
> unnecessary load on DFS where summary files resides and Timeline Store where 
> timeline entities present.
> {code}
> 2021-10-10 19:20:43,940 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 275 msec
> 2021-10-10 19:21:44,079 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 341 msec
> 2021-10-10 19:22:44,065 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 335 msec
> 2021-10-10 19:23:44,038 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 370 msec
> 2021-10-10 19:24:44,087 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 317 msec
> 2021-10-10 19:25:44,092 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 336 msec
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files

2021-10-13 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10975:
-
Description: 
EntityGroupFSTimelineStore#ActiveLogParser parses already processed files again 
and again even though there is no change in the file. This leads to unnecessary 
load on DFS where summary files resides and Timeline Store where timeline 
entities present.

{code}
2021-10-10 19:20:43,940 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 275 msec
2021-10-10 19:21:44,079 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 341 msec
2021-10-10 19:22:44,065 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 335 msec
2021-10-10 19:23:44,038 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 370 msec
2021-10-10 19:24:44,087 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 317 msec
2021-10-10 19:25:44,092 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 336 msec
{code}

  was:
EntityGroupFSTimelineStore#ActiveLogParser parses already processed files again 
and again. This leads to unnecessary load on the DFS where the summary files 
reside and on the Timeline Store where the timeline entities are kept.

{code}
2021-10-10 19:20:43,940 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 275 msec
2021-10-10 19:21:44,079 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 341 msec
2021-10-10 19:22:44,065 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 335 msec
2021-10-10 19:23:44,038 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 370 msec
2021-10-10 19:24:44,087 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 317 msec
2021-10-10 19:25:44,092 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 336 msec
{code}


> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> --
>
> Key: YARN-10975
> URL: https://issues.apache.org/jira/browse/YARN-10975
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> EntityGroupFSTimelineStore#ActiveLogParser parses already processed files 
> again and again even though there is no change in the file. This leads to 
> unnecessary load on the DFS where the summary files reside and on the 
> Timeline Store where the timeline entities are kept.
> {code}
> 2021-10-10 19:20:43,940 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 275 msec
> 2021-10-10 19:21:44,079 INFO  timeline.LogInfo - Parsed 6 entities from 
> hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
>  in 341 

[jira] [Created] (YARN-10975) EntityGroupFSTimelineStore#ActiveLogParser parses already processed files

2021-10-13 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10975:


 Summary: EntityGroupFSTimelineStore#ActiveLogParser parses already 
processed files 
 Key: YARN-10975
 URL: https://issues.apache.org/jira/browse/YARN-10975
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 3.3.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


EntityGroupFSTimelineStore#ActiveLogParser parses already processed files again 
and again. This leads to unnecessary load on the DFS where the summary files 
reside and on the Timeline Store where the timeline entities are kept.

{code}
2021-10-10 19:20:43,940 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 275 msec
2021-10-10 19:21:44,079 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 341 msec
2021-10-10 19:22:44,065 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 335 msec
2021-10-10 19:23:44,038 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 370 msec
2021-10-10 19:24:44,087 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 317 msec
2021-10-10 19:25:44,092 INFO  timeline.LogInfo - Parsed 6 entities from 
hdfs:/prabhuJoseph/atshistory/active/application_1631559260564_0009/appattempt_1631559260564_0009_01/summarylog-appattempt_1631559260564_0009_01_2331123893
 in 336 msec
{code}
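
For illustration only, a minimal sketch of one way to skip unchanged summary 
logs, assuming a per-file record of length and modification time is kept 
between scans (the class and method names below are hypothetical, not the 
committed fix):

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: remembers how far each summary log was parsed so an
// unchanged file can be skipped on the next ActiveLogParser scan.
public class ParsedLogTracker {

  // Last observed state of a summary log.
  private static final class ParsedState {
    final long length;
    final long modificationTime;

    ParsedState(long length, long modificationTime) {
      this.length = length;
      this.modificationTime = modificationTime;
    }
  }

  private final Map<Path, ParsedState> parsed = new HashMap<>();

  // True when the file grew or changed since the last successful parse.
  public boolean needsParsing(FileSystem fs, Path summaryLog) throws IOException {
    FileStatus status = fs.getFileStatus(summaryLog);
    ParsedState previous = parsed.get(summaryLog);
    return previous == null
        || previous.length != status.getLen()
        || previous.modificationTime != status.getModificationTime();
  }

  // Record the state of the file after a successful parse.
  public void markParsed(FileSystem fs, Path summaryLog) throws IOException {
    FileStatus status = fs.getFileStatus(summaryLog);
    parsed.put(summaryLog,
        new ParsedState(status.getLen(), status.getModificationTime()));
  }
}
{code}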



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10896) RM fail over is not reporting the nodes DECOMMISSIONED

2021-09-27 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420718#comment-17420718
 ] 

Prabhu Joseph commented on YARN-10896:
--

Thanks [~Sushil-K-S] for the patch. 

{code} assertEquals(2, rm.getRMContext().getInactiveRMNodes().size()); {code}

1. Why does it return 2? There is only one inactive node present, right?

> RM fail over is not reporting the nodes DECOMMISSIONED 
> ---
>
> Key: YARN-10896
> URL: https://issues.apache.org/jira/browse/YARN-10896
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Sushil Ks
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10896.001.patch
>
>
> Whenever we add host entries into the exclude file in order to DECOMMISSION 
> the NodeManager, we issue the *yarn rmadmin -refreshNodes* command to 
> transition the nodes from the RUNNING to the DECOMMISSIONED state. However, 
> if a fail over to the standby ResourceManager happens and the exclude file 
> has the list of hosts to be disallowed, then these disallowed nodes are never 
> seen through the Cluster Metrics on the new active ResourceManager. 
> Whatever host entries are present in the exclude files are listed in the 
> Cluster Metrics whenever the ResourceManager is restarted, i.e. as part of 
> the service init of *NodeListManager*; however, during fail over this info is 
> lost. Hence this patch tries to set the *DECOMMISSIONED* nodes inside the RM 
> Context so that they are available through the Cluster Metrics whenever we 
> issue the *yarn rmadmin -refreshNodes* command.
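
A minimal sketch of the idea, assuming the exclude-file hosts are already 
available as a set and that a suitable placeholder RMNode can be built by the 
caller (the helper below is illustrative, not the attached patch):

{code:java}
import java.util.Set;
import java.util.function.Function;

import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;

// Hypothetical helper: seed the inactive-node map from the exclude list so
// Cluster Metrics can still report DECOMMISSIONED hosts after a failover.
public final class DecommissionedNodeSeeder {

  private DecommissionedNodeSeeder() {
  }

  // nodeFactory builds a placeholder RMNode for a host the RM has never seen;
  // how that record is constructed is left to the real patch.
  public static void seed(RMContext rmContext, Set<String> excludedHosts,
      Function<NodeId, RMNode> nodeFactory) {
    for (String host : excludedHosts) {
      NodeId nodeId = NodeId.newInstance(host, -1);
      rmContext.getInactiveRMNodes().computeIfAbsent(nodeId, nodeFactory);
    }
  }
}
{code}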



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner

2021-09-07 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411333#comment-17411333
 ] 

Prabhu Joseph commented on YARN-10884:
--

Thanks [~Swathi Chandrashekar] for the patch. I have committed it to trunk.

> EntityGroupFSTimelineStore fails to parse log files which has empty owner
> -
>
> Key: YARN-10884
> URL: https://issues.apache.org/jira/browse/YARN-10884
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
> Fix For: 3.3.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848], the 
> Wasb FileSystem sets the owner as empty during an append operation. 
> ATS1.5 fails to read such files with the below error 
> {code:java}
>  java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> It gets the ownership of the file to check the ACL. When the ACL check is 
> disabled, this is not required. Will suggest falling back to an anonymous 
> user in case of an empty owner.
> {code}
> if (owner.isEmpty()) {
>   user = "anonymous";
> } else {
>   user = owner;
> }
> {code}
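
For context, a small sketch of where such a fallback would sit, since the 
failure above comes from UserGroupInformation.createRemoteUser rejecting an 
empty user; the wrapper class below is hypothetical, and the real change would 
presumably live in LogInfo.parsePath:

{code:java}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical wrapper showing the guard; "anonymous" is the fallback user
// proposed in the description above.
public final class OwnerFallback {

  private OwnerFallback() {
  }

  // Never hand an empty owner to createRemoteUser(), which rejects it with
  // the "Null user" IllegalArgumentException seen in the stack trace.
  public static UserGroupInformation ugiFor(FileStatus appDirStatus) {
    String owner = appDirStatus.getOwner();
    String user = (owner == null || owner.isEmpty()) ? "anonymous" : owner;
    return UserGroupInformation.createRemoteUser(user);
  }
}
{code}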



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner

2021-09-07 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10884:
-
Labels:   (was: pull-request-available)

> EntityGroupFSTimelineStore fails to parse log files which has empty owner
> -
>
> Key: YARN-10884
> URL: https://issues.apache.org/jira/browse/YARN-10884
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
> Fix For: 3.3.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848], the 
> Wasb FileSystem sets the owner as empty during an append operation. 
> ATS1.5 fails to read such files with the below error 
> {code:java}
>  java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> It gets the ownership of the file to check the ACL. When the ACL check is 
> disabled, this is not required. Will suggest falling back to an anonymous 
> user in case of an empty owner.
> {code}
> if (owner.isEmpty()) {
>   user = "anonymous";
> } else {
>   user = owner;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10933) Building Timeline Delegation Token Service Text is not needed on unsecure clusters

2021-09-06 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-10933:


Assignee: SwathiChandrashekar  (was: Prabhu Joseph)

> Building Timeline Delegation Token Service Text is not needed on unsecure 
> clusters
> --
>
> Key: YARN-10933
> URL: https://issues.apache.org/jira/browse/YARN-10933
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
>
> Yarn Client Commands fail with the below error when the ATS1.5 TimelineServer 
> is not reachable. On an unsecure cluster, building the Timeline Token Service 
> is not required.
> {code:java}
> java.lang.IllegalArgumentException: java.net.UnknownHostException: 
> timelineserver-0
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
> at 
> org.apache.hadoop.yarn.util.timeline.TimelineUtils.buildTimelineTokenService(TimelineUtils.java:163)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:183)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.client.cli.YarnCLI.<init>(YarnCLI.java:47)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.<init>(ApplicationCLI.java:65)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:115)
> Caused by: java.net.UnknownHostException: timelineserver-0
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10933) Building Timeline Delegation Token Service Text is not needed on unsecure clusters

2021-09-06 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10933:


 Summary: Building Timeline Delegation Token Service Text is not 
needed on unsecure clusters
 Key: YARN-10933
 URL: https://issues.apache.org/jira/browse/YARN-10933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineclient
Affects Versions: 3.3.1
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Yarn Client Commands fail with the below error when the ATS1.5 TimelineServer 
is not reachable. On an unsecure cluster, building the Timeline Token Service 
is not required.

{code:java}
java.lang.IllegalArgumentException: java.net.UnknownHostException: 
timelineserver-0
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
at 
org.apache.hadoop.yarn.util.timeline.TimelineUtils.buildTimelineTokenService(TimelineUtils.java:163)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:183)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.client.cli.YarnCLI.<init>(YarnCLI.java:47)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.<init>(ApplicationCLI.java:65)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:115)
Caused by: java.net.UnknownHostException: timelineserver-0
 {code}
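
A minimal sketch of the proposed guard, assuming the check is done where 
YarnClientImpl builds the token service Text; whether this matches the 
eventual patch is not confirmed here:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.util.timeline.TimelineUtils;

// Hypothetical helper: build the timeline delegation token service Text only
// when security is enabled, so an unreachable timeline server host does not
// break YARN CLI commands on unsecure clusters.
public final class TimelineTokenServiceGuard {

  private TimelineTokenServiceGuard() {
  }

  public static Text buildIfSecure(Configuration conf) {
    if (!UserGroupInformation.isSecurityEnabled()) {
      return new Text();  // no delegation tokens are used on unsecure clusters
    }
    return TimelineUtils.buildTimelineTokenService(conf);
  }
}
{code}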



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10873) Graceful Decommission ignores launched containers and gets deactivated before timeout

2021-08-17 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-10873.
--
Resolution: Fixed

> Graceful Decommission ignores launched containers and gets deactivated before 
> timeout
> -
>
> Key: YARN-10873
> URL: https://issues.apache.org/jira/browse/YARN-10873
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: Srinivas S T
>Priority: Major
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> A gracefully decommissioning node gets deactivated before the timeout even 
> though there are launched containers. 
> On a status update from a node which is in DECOMMISSIONING, the RM 
> transitions the node to DECOMMISSIONED before the timeout if there are no 
> running applications. These running applications are derived from the 
> container statuses reported by the NodeManager. We have observed containers 
> being launched at the NodeManager while, at the same time, the 
> ResourceManager forcefully decommissions the node.
> This affects Livy Interactive jobs, which support only one application 
> attempt.
> Will suggest checking FiCaSchedulerNode to identify whether there are any 
> launched containers and determine whether to forcefully decommission or not.
> {code}
>   public static class StatusUpdateWhenHealthyTransition implements
>   MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
> @Override
> public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {
>   .
>   if (isNodeDecommissioning) {
> List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
> if (rmNode.runningApplications.isEmpty() &&
> (keepAliveApps == null || keepAliveApps.isEmpty())) {
>   RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
>   return NodeState.DECOMMISSIONED;
> }
>   }
> {code}
> *ResourceManager Logs:*
> {code}
> 2021-06-16 08:45:04,140 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: 
> Launching masterappattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up container Container: [ContainerId: container_1623830067124_0382_01_01, 
> AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource:  vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
> appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  Creating password for appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,154 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done 
> launching container Container: [ContainerId: 
> container_1623830067124_0382_01_01, AllocationRequestId: 0, Version: 0, 
> NodeId: node1:34753, NodeHttpAddress: 
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource:  vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
> appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node1:34753 with state RUNNING
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node1:34753 in DECOMMISSIONING.
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
> Node Transitioned from RUNNING to DECOMMISSIONING
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
> Node node1:34753 as it is now DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
> Node Transitioned from DECOMMISSIONING to DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1623830067124_0382_01_01 Container Transitioned from ACQUIRED 
> to KILLED
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10873) Graceful Decommission ignores launched containers and gets deactivated before timeout

2021-08-17 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10873:
-
Labels:   (was: pull-request-available)

> Graceful Decommission ignores launched containers and gets deactivated before 
> timeout
> -
>
> Key: YARN-10873
> URL: https://issues.apache.org/jira/browse/YARN-10873
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: Srinivas S T
>Priority: Major
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> A gracefully decommissioning node gets deactivated before the timeout even 
> though there are launched containers. 
> On a status update from a node which is in DECOMMISSIONING, the RM 
> transitions the node to DECOMMISSIONED before the timeout if there are no 
> running applications. These running applications are derived from the 
> container statuses reported by the NodeManager. We have observed containers 
> being launched at the NodeManager while, at the same time, the 
> ResourceManager forcefully decommissions the node.
> This affects Livy Interactive jobs, which support only one application 
> attempt.
> Will suggest checking FiCaSchedulerNode to identify whether there are any 
> launched containers and determine whether to forcefully decommission or not.
> {code}
>   public static class StatusUpdateWhenHealthyTransition implements
>   MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
> @Override
> public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {
>   .
>   if (isNodeDecommissioning) {
> List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
> if (rmNode.runningApplications.isEmpty() &&
> (keepAliveApps == null || keepAliveApps.isEmpty())) {
>   RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
>   return NodeState.DECOMMISSIONED;
> }
>   }
> {code}
> *ResourceManager Logs:*
> {code}
> 2021-06-16 08:45:04,140 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: 
> Launching masterappattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up container Container: [ContainerId: container_1623830067124_0382_01_01, 
> AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource:  vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
> appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  Creating password for appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,154 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done 
> launching container Container: [ContainerId: 
> container_1623830067124_0382_01_01, AllocationRequestId: 0, Version: 0, 
> NodeId: node1:34753, NodeHttpAddress: 
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource:  vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
> appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node1:34753 with state RUNNING
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node1:34753 in DECOMMISSIONING.
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
> Node Transitioned from RUNNING to DECOMMISSIONING
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
> Node node1:34753 as it is now DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
> Node Transitioned from DECOMMISSIONING to DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1623830067124_0382_01_01 Container Transitioned from ACQUIRED 
> to KILLED
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Updated] (YARN-10873) Graceful Decommission ignores launched containers and gets deactivated before timeout

2021-08-17 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10873:
-
Fix Version/s: 3.4.0

> Graceful Decommission ignores launched containers and gets deactivated before 
> timeout
> -
>
> Key: YARN-10873
> URL: https://issues.apache.org/jira/browse/YARN-10873
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: Srinivas S T
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> A gracefully decommissioning node gets deactivated before the timeout even 
> though there are launched containers. 
> On a status update from a node which is in DECOMMISSIONING, the RM 
> transitions the node to DECOMMISSIONED before the timeout if there are no 
> running applications. These running applications are derived from the 
> container statuses reported by the NodeManager. We have observed containers 
> being launched at the NodeManager while, at the same time, the 
> ResourceManager forcefully decommissions the node.
> This affects Livy Interactive jobs, which support only one application 
> attempt.
> Will suggest checking FiCaSchedulerNode to identify whether there are any 
> launched containers and determine whether to forcefully decommission or not.
> {code}
>   public static class StatusUpdateWhenHealthyTransition implements
>   MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
> @Override
> public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {
>   .
>   if (isNodeDecommissioning) {
> List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
> if (rmNode.runningApplications.isEmpty() &&
> (keepAliveApps == null || keepAliveApps.isEmpty())) {
>   RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
>   return NodeState.DECOMMISSIONED;
> }
>   }
> {code}
> *ResourceManager Logs:*
> {code}
> 2021-06-16 08:45:04,140 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: 
> Launching masterappattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up container Container: [ContainerId: container_1623830067124_0382_01_01, 
> AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource:  vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
> appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  Creating password for appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,154 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done 
> launching container Container: [ContainerId: 
> container_1623830067124_0382_01_01, AllocationRequestId: 0, Version: 0, 
> NodeId: node1:34753, NodeHttpAddress: 
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource:  vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
> appattempt_1623830067124_0382_01
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node1:34753 with state RUNNING
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node1:34753 in DECOMMISSIONING.
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
> Node Transitioned from RUNNING to DECOMMISSIONING
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
> Node node1:34753 as it is now DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
> Node Transitioned from DECOMMISSIONING to DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1623830067124_0382_01_01 Container Transitioned from ACQUIRED 
> to KILLED
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Assigned] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner

2021-08-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-10884:


Assignee: SwathiChandrashekar  (was: Prabhu Joseph)

> EntityGroupFSTimelineStore fails to parse log files which has empty owner
> -
>
> Key: YARN-10884
> URL: https://issues.apache.org/jira/browse/YARN-10884
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
>
> Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848], the 
> Wasb FileSystem sets the owner as empty during an append operation. 
> ATS1.5 fails to read such files with the below error 
> {code:java}
>  java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> It gets the ownership of the file to check the ACL. When the ACL check is 
> disabled, this is not required. Will suggest falling back to an anonymous 
> user in case of an empty owner.
> {code}
> if (owner.isEmpty()) {
>   user = "anonymous";
> } else {
>   user = owner;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner

2021-08-15 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10884:
-
Description: 
Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848], the 
Wasb FileSystem sets the owner as empty during an append operation. 

ATS1.5 fails to read such files with the below error 
{code:java}

 java.lang.IllegalArgumentException: Null user
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}

It gets the ownership of the file to check the ACL. When the ACL check is 
disabled, this is not required. Will suggest falling back to an anonymous user 
in case of an empty owner.

{code}
if (owner.isEmpty()) {
  user = "anonymous";
} else {
  user = owner;
}
{code}


  was:
Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] Hadoop 
NativeAzureFileSystem append removes ownership set on the file - ASF JIRA 
(apache.org)] - Wasb FileSystem sets owner as empty during append operation. 

ATS1.5 fails to read such files with below error 
{code:java}

 java.lang.IllegalArgumentException: Null user
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}

It gets ownership of the file to check ACL. In case of disabled ACL check, this 
is not required. Will suggest to add anonymous user in case of empty user.

{code}
if (owner.isEmpty()) {
  user = "anonymous";
} else {
  user = owner;
}
{code}



> EntityGroupFSTimelineStore fails to parse log files which has empty owner
> -
>
> Key: YARN-10884
> URL: https://issues.apache.org/jira/browse/YARN-10884
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848], the 
> Wasb FileSystem sets the owner as empty during an append operation. 
> ATS1.5 fails to read such files with the below error 
> {code:java}
>  

[jira] [Updated] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner

2021-08-15 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10884:
-
Description: 
Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] Hadoop 
NativeAzureFileSystem append removes ownership set on the file - ASF JIRA 
(apache.org)] - Wasb FileSystem sets owner as empty during append operation. 

ATS1.5 fails to read such files with below error 
{code:java}

 java.lang.IllegalArgumentException: Null user
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}

It gets ownership of the file to check ACL. In case of disabled ACL check, this 
is not required. Will suggest to add anonymous user in case of empty user.

{code}
if (owner.isEmpty()) {
  user = "anonymous";
} else {
  user = owner;
}
{code}


  was:
Due to [HADOOP-17848|[HADOOP-17848] Hadoop NativeAzureFileSystem append removes 
ownership set on the file - ASF JIRA (apache.org)] - Wasb FileSystem sets owner 
as empty during append operation. 

ATS1.5 fails to read such files with below error 
{code:java}

 java.lang.IllegalArgumentException: Null user
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}

It gets ownership of the file to check ACL. In case of disabled ACL check, this 
is not required. Will suggest to add anonymous user in case of empty user.

{code}
if (owner.isEmpty()) {
  user = "anonymous";
} else {
  user = owner;
}
{code}



> EntityGroupFSTimelineStore fails to parse log files which has empty owner
> -
>
> Key: YARN-10884
> URL: https://issues.apache.org/jira/browse/YARN-10884
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848] 
> Hadoop NativeAzureFileSystem append removes ownership set on the file - ASF 
> JIRA (apache.org)] - 

[jira] [Created] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner

2021-08-15 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10884:


 Summary: EntityGroupFSTimelineStore fails to parse log files which 
has empty owner
 Key: YARN-10884
 URL: https://issues.apache.org/jira/browse/YARN-10884
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 3.3.1
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Due to [HADOOP-17848|[HADOOP-17848] Hadoop NativeAzureFileSystem append removes 
ownership set on the file - ASF JIRA (apache.org)] - Wasb FileSystem sets owner 
as empty during append operation. 

ATS1.5 fails to read such files with below error 
{code:java}

 java.lang.IllegalArgumentException: Null user
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
at 
org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}

It gets ownership of the file to check ACL. In case of disabled ACL check, this 
is not required. Will suggest to add anonymous user in case of empty user.

{code}
if (owner.isEmpty()) {
  user = "anonymous";
} else {
  user = owner;
}
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator

2021-07-29 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390099#comment-17390099
 ] 

Prabhu Joseph commented on YARN-10848:
--

Hi [~pbacsko], IMO this is breaking the existing behavior of 
DefaultResourceCalculator. DefaultResourceCalculator is useful when the 
workloads are not CPU intensive, like MapReduce and Tez, and the user need not 
worry about CPU configurations here.

>> IMO whether a container "fits in" or not should depend on both values

DominantResourceCalculator provides this support, which users configure if 
they want to consider both memory and CPU resources in scheduling.




> Vcore allocation problem with DefaultResourceCalculator
> ---
>
> Key: YARN-10848
> URL: https://issues.apache.org/jira/browse/YARN-10848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Reporter: Peter Bacsko
>Assignee: Minni Mittal
>Priority: Major
>  Labels: pull-request-available
> Attachments: TestTooManyContainers.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating 
> containers even if we run out of vcores.
> CS checks the available resources in two places. The first check is 
> {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
> if (calculator.computeAvailableContainers(Resources
> .add(node.getUnallocatedResource(), 
> node.getTotalKillableResources()),
> minimumAllocation) <= 0) {
>   LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>   + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in 
> {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
> if (!Resources.fitsIn(rc, capability, totalResource)) {
>   LOG.warn("Node : " + node.getNodeID()
>   + " does not have sufficient resource for ask : " + pendingAsk
>   + " node total capability : " + node.getTotalResource());
>   // Skip this locality request
>   ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>   activitiesManager, node, application, schedulerKey,
>   ActivityDiagnosticConstant.
>   NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>   + getResourceDiagnostics(capability, totalResource),
>   ActivityLevel.NODE);
>   return ContainerAllocation.LOCALITY_SKIPPED;
> }
> {noformat}
> Here, {{rc}} is the resource calculator instance, the other two values are:
> {noformat}
> Resource capability = pendingAsk.getPerAllocationResource();
> Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case, which can demonstrate the 
> problem. The root cause is that we pass the resource calculator to 
> {{Resource.fitsIn()}}. Instead, we should use an overridden version, just 
> like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
>// Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
>   // Inform the application of the new container for this request
>   RMContainer allocatedContainer =
>   allocate(type, node, schedulerKey, pendingAsk,
>   reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use 
> {{Resources.fitsIn()}} without the calculator in 
> {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit 
> test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).
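
For illustration, a small self-contained example of the difference discussed 
above, assuming the two Resources.fitsIn overloads (with and without a 
ResourceCalculator) behave as described in the report:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

// With DefaultResourceCalculator only memory is compared, so an ask that
// exceeds the node's vcores still "fits"; the calculator-free overload
// compares every dimension.
public final class FitsInDemo {

  public static void main(String[] args) {
    ResourceCalculator rc = new DefaultResourceCalculator();
    Resource ask = Resource.newInstance(1024, 8);        // 1 GB, 8 vcores
    Resource nodeTotal = Resource.newInstance(8192, 4);  // 8 GB, only 4 vcores

    // Memory-only comparison: true even though the vcore ask is too large.
    System.out.println(Resources.fitsIn(rc, ask, nodeTotal));

    // Dimension-wise comparison: false because 8 vcores > 4 vcores.
    System.out.println(Resources.fitsIn(ask, nodeTotal));
  }
}
{code}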



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-29 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390079#comment-17390079
 ] 

Prabhu Joseph commented on YARN-10854:
--

Thanks [~Tao Yang] for the patch. This is very useful to us as well; otherwise 
we would have ended up adding a lot of code changes to manage the include node 
list.

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch, 
> YARN-10854.003.patch
>
>
> Currently, inactive nodes which have been decommissioned/shutdown/lost for a 
> while (a specified expiration time defined via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and which exist in neither the include nor the exclude file can be 
> marked as untracked nodes and removed from the RM state (YARN-4311). It's 
> very useful when auto-scaling is enabled in an elastic cloud environment, as 
> it avoids an unlimited increase of inactive nodes (mostly decommissioned 
> nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments that run without a configured white list of 
> nodes so that auto-scaling of nodes can be controlled easily without further 
> security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path; to be compatible with the former versions, we can 
> add a switch config for this.
> Any thoughts/suggestions/feedback are welcome!
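
A minimal sketch of how such a switch could look; the config flag and method 
shape are purely illustrative, not the attached patches:

{code:java}
import java.util.Set;

// Hypothetical check: when no include file is configured, only mark inactive
// nodes untracked if the proposed switch is enabled.
public final class UntrackedNodeCheck {

  private UntrackedNodeCheck() {
  }

  public static boolean isUntracked(String hostName, String includeFilePath,
      Set<String> includedHosts, Set<String> excludedHosts,
      boolean untrackedWithoutIncludePath) {
    if (includeFilePath == null || includeFilePath.isEmpty()) {
      // Former behaviour: never untracked without an include list; the
      // proposed switch relaxes that for auto-scaled cloud environments.
      if (!untrackedWithoutIncludePath) {
        return false;
      }
      return !excludedHosts.contains(hostName);
    }
    return !includedHosts.contains(hostName) && !excludedHosts.contains(hostName);
  }
}
{code}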



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10873) Graceful Decommission ignores launched containers and gets deactivated before timeout

2021-07-27 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10873:


 Summary: Graceful Decommission ignores launched containers and 
gets deactivated before timeout
 Key: YARN-10873
 URL: https://issues.apache.org/jira/browse/YARN-10873
 Project: Hadoop YARN
  Issue Type: Bug
  Components: RM
Affects Versions: 3.3.1
Reporter: Prabhu Joseph
Assignee: Srinivas S T


A gracefully decommissioning node gets deactivated before the timeout even 
though there are launched containers. 

On a status update from a node which is in DECOMMISSIONING, the RM transitions 
the node to DECOMMISSIONED before the timeout if there are no running 
applications. These running applications are derived from the container 
statuses reported by the NodeManager. We have observed containers being 
launched at the NodeManager while, at the same time, the ResourceManager 
forcefully decommissions the node.

This affects Livy Interactive jobs, which support only one application 
attempt.

Will suggest checking FiCaSchedulerNode to identify whether there are any 
launched containers and determine whether to forcefully decommission or not.

{code}
  public static class StatusUpdateWhenHealthyTransition implements
  MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
@Override
public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {
  .
  if (isNodeDecommissioning) {
List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
if (rmNode.runningApplications.isEmpty() &&
(keepAliveApps == null || keepAliveApps.isEmpty())) {
  RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
  return NodeState.DECOMMISSIONED;
}
  }
{code}


*ResourceManager Logs:*
{code}
2021-06-16 08:45:04,140 INFO 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching 
masterappattempt_1623830067124_0382_01
2021-06-16 08:45:04,141 INFO 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up 
container Container: [ContainerId: container_1623830067124_0382_01_01, 
AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 
927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 
10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
appattempt_1623830067124_0382_01
2021-06-16 08:45:04,141 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: 
Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_01
2021-06-16 08:45:04,141 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: 
Creating password for appattempt_1623830067124_0382_01
2021-06-16 08:45:04,154 INFO 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done 
launching container Container: [ContainerId: 
container_1623830067124_0382_01_01, AllocationRequestId: 0, Version: 0, 
NodeId: node1:34753, NodeHttpAddress: 
927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 
10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
appattempt_1623830067124_0382_01


2021-06-16 08:45:04,776 INFO 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
decommission node node1:34753 with state RUNNING
2021-06-16 08:45:04,776 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
node1:34753 in DECOMMISSIONING.
2021-06-16 08:45:04,776 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
Node Transitioned from RUNNING to DECOMMISSIONING
2021-06-16 08:45:05,131 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
Node node1:34753 as it is now DECOMMISSIONED
2021-06-16 08:45:05,131 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
Node Transitioned from DECOMMISSIONING to DECOMMISSIONED
2021-06-16 08:45:05,131 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1623830067124_0382_01_01 Container Transitioned from ACQUIRED to 
KILLED
{code}
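
A minimal sketch of the suggested check against the scheduler's view of the 
node; the helper below is illustrative and not the committed change:

{code:java}
import java.util.List;

import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode;

// Hypothetical helper: before a DECOMMISSIONING node is deactivated early,
// consult the scheduler's view of the node so containers that are already
// allocated or acquired (but not yet reported as running applications) keep
// the node alive until the decommission timeout.
public final class LaunchedContainerCheck {

  private LaunchedContainerCheck() {
  }

  public static boolean hasLaunchedContainers(SchedulerNode schedulerNode) {
    if (schedulerNode == null) {
      return false;  // node already removed from the scheduler
    }
    List<RMContainer> containers =
        schedulerNode.getCopiedListOfRunningContainers();
    return containers != null && !containers.isEmpty();
  }
}
{code}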



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10871) Aborted AM is considered as App Failure when user sets MaxAttempts as 1

2021-07-24 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10871:
-
Description: 
When an AM Container is ABORTED due to Node Decommission, the AppAttempt 
failure is not counted. But if user sets number of attempts as 1, then YARN 
considers the ABORTED AM as a failure. 

{code}
  int numberOfFailure = app.getNumFailedAppAttempts();
  if (app.maxAppAttempts == 1) {
// If the user explicitly set the attempts to 1 then there are likely
// correctness issues if the AM restarts for any reason.
LOG.info("Max app attempts is 1 for " + app.applicationId
+ ", preventing further attempts.");
numberOfFailure = app.maxAppAttempts;
  } 
{code}

Livy sets the number of attempts as 1 since it's Rpc Server does not yet 
support multiple connections for the same registered app. But in our case AM is 
ABORTED before even the AM starts (AM was in ACQUIRED state)

Usually users won't decommission the node where the Container is in RUNNING 
state (where the session is established). But the decommission can happen on 
nodes where the container is in ACQUIRED or ALLOCATED state. 

Will suggest to expose an config where user can decide whether to consider this 
as a failure or not. 
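
A minimal, self-contained sketch of the proposed behaviour (the class, method 
names and the opt-out flag are hypothetical, not existing YARN code): only 
force the failure for maxAppAttempts == 1 when the last attempt exited for a 
reason that should count towards retries.

{code:java}
// Hypothetical sketch of the proposed check; names and the flag are
// illustrative and not part of the actual RMAppImpl code.
public class AbortedAmFailureSketch {

  static int effectiveFailures(int failedAttempts, int maxAppAttempts,
      boolean lastAttemptAborted, boolean countAbortedAsFailure) {
    if (maxAppAttempts == 1 && (countAbortedAsFailure || !lastAttemptAborted)) {
      // Today's behaviour: a single-attempt app is always marked failed.
      return maxAppAttempts;
    }
    return failedAttempts;
  }

  public static void main(String[] args) {
    // AM ABORTED by a node decommission and the user opted out of counting it:
    System.out.println(effectiveFailures(0, 1, true, false)); // prints 0
    // Default behaviour stays unchanged:
    System.out.println(effectiveFailures(0, 1, true, true));  // prints 1
  }
}
{code}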

  was:
When an AM Container is ABORTED due to Node Decommission, the AppAttempt 
failure is not counted. But if user sets number of attempts as 1, then YARN 
considers the ABORTED AM as a failure. 

{code}
  int numberOfFailure = app.getNumFailedAppAttempts();
  if (app.maxAppAttempts == 1) {
// If the user explicitly set the attempts to 1 then there are likely
// correctness issues if the AM restarts for any reason.
LOG.info("Max app attempts is 1 for " + app.applicationId
+ ", preventing further attempts.");
numberOfFailure = app.maxAppAttempts;
  } 
{code}

Livy sets the number of attempts as 1 since it's Rpc Server does not yet 
support multiple connections for the same registered app. But in our case AM is 
ABORTED before even the AM starts (AM was in ACAUIRED state)

Usually users won't decommission the node where the Container is in RUNNING 
state (where the session is established). But the decommission can happen on 
nodes where the container is in ACQUIRED or ALLOCATED state. 

Will suggest to expose an config where user can decide whether to consider this 
as a failure or not. 


> Aborted AM is considered as App Failure when user sets MaxAttempts as 1
> ---
>
> Key: YARN-10871
> URL: https://issues.apache.org/jira/browse/YARN-10871
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: Srinivas S T
>Priority: Major
>
> When an AM Container is ABORTED due to Node Decommission, the AppAttempt 
> failure is not counted. But if the user sets the number of attempts to 1, 
> then YARN still considers the ABORTED AM a failure. 
> {code}
>   int numberOfFailure = app.getNumFailedAppAttempts();
>   if (app.maxAppAttempts == 1) {
>     // If the user explicitly set the attempts to 1 then there are likely
>     // correctness issues if the AM restarts for any reason.
>     LOG.info("Max app attempts is 1 for " + app.applicationId
>         + ", preventing further attempts.");
>     numberOfFailure = app.maxAppAttempts;
>   }
> {code}
> Livy sets the number of attempts to 1 since its RPC server does not yet 
> support multiple connections for the same registered app. But in our case the 
> AM is ABORTED before it even starts (the AM was in the ACQUIRED state).
> Usually users won't decommission a node whose container is in the RUNNING 
> state (where the session is established). But a decommission can happen on 
> nodes whose containers are in the ACQUIRED or ALLOCATED state. 
> We suggest exposing a config that lets the user decide whether to count this 
> as a failure or not. 






[jira] [Assigned] (YARN-10871) Aborted AM is considered as App Failure when user sets MaxAttempts as 1

2021-07-23 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-10871:


Assignee: Srinivas S T  (was: Prabhu Joseph)

> Aborted AM is considered as App Failure when user sets MaxAttempts as 1
> ---
>
> Key: YARN-10871
> URL: https://issues.apache.org/jira/browse/YARN-10871
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: Srinivas S T
>Priority: Major
>
> When an AM Container is ABORTED due to Node Decommission, the AppAttempt 
> failure is not counted. But if the user sets the number of attempts to 1, 
> then YARN still considers the ABORTED AM a failure. 
> {code}
>   int numberOfFailure = app.getNumFailedAppAttempts();
>   if (app.maxAppAttempts == 1) {
>     // If the user explicitly set the attempts to 1 then there are likely
>     // correctness issues if the AM restarts for any reason.
>     LOG.info("Max app attempts is 1 for " + app.applicationId
>         + ", preventing further attempts.");
>     numberOfFailure = app.maxAppAttempts;
>   }
> {code}
> Livy sets the number of attempts to 1 since its RPC server does not yet 
> support multiple connections for the same registered app. But in our case the 
> AM is ABORTED before it even starts (the AM was in the ACQUIRED state).
> Usually users won't decommission a node whose container is in the RUNNING 
> state (where the session is established). But a decommission can happen on 
> nodes whose containers are in the ACQUIRED or ALLOCATED state. 
> We suggest exposing a config that lets the user decide whether to count this 
> as a failure or not. 






[jira] [Created] (YARN-10871) Aborted AM is considered as App Failure when user sets MaxAttempts as 1

2021-07-23 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10871:


 Summary: Aborted AM is considered as App Failure when user sets 
MaxAttempts as 1
 Key: YARN-10871
 URL: https://issues.apache.org/jira/browse/YARN-10871
 Project: Hadoop YARN
  Issue Type: Bug
  Components: RM
Affects Versions: 3.3.1
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


When an AM Container is ABORTED due to Node Decommission, the AppAttempt 
failure is not counted. But if the user sets the number of attempts to 1, then 
YARN still considers the ABORTED AM a failure. 

{code}
  int numberOfFailure = app.getNumFailedAppAttempts();
  if (app.maxAppAttempts == 1) {
    // If the user explicitly set the attempts to 1 then there are likely
    // correctness issues if the AM restarts for any reason.
    LOG.info("Max app attempts is 1 for " + app.applicationId
        + ", preventing further attempts.");
    numberOfFailure = app.maxAppAttempts;
  }
{code}

Livy sets the number of attempts to 1 since its RPC server does not yet 
support multiple connections for the same registered app. But in our case the 
AM is ABORTED before it even starts (the AM was in the ACQUIRED state).

Usually users won't decommission a node whose container is in the RUNNING 
state (where the session is established). But a decommission can happen on 
nodes whose containers are in the ACQUIRED or ALLOCATED state. 

We suggest exposing a config that lets the user decide whether to count this 
as a failure or not. 






[jira] [Assigned] (YARN-10857) YarnClient Caching Addresses

2021-07-20 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-10857:


Assignee: Prabhu Joseph

> YarnClient Caching Addresses
> 
>
> Key: YARN-10857
> URL: https://issues.apache.org/jira/browse/YARN-10857
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client, yarn
>Reporter: Steve Suh
>Assignee: Prabhu Joseph
>Priority: Minor
>
> We have noticed that when the YarnClient is initialized and used, it is not 
> very resilient when DNS or /etc/hosts is modified in the following scenario:
> Take for instance the following (and reproducible) sequence of events that 
> can occur on a service that instantiates and uses YarnClient. 
>   - Yarn has rm HA enabled (*yarn.resourcemanager.ha.enabled* is *true*) and 
> there are two rms (rm1 and rm2).
>   - *yarn.client.failover-proxy-provider* is set to 
> *org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider*
> 1)rm2 is currently the active rm
> 2)/etc/hosts (or dns) is missing host information for rm2
> 3)A service is started and it initializes the YarnClient at startup.
> 4)At some point in time after YarnClient is done initializing, /etc/hosts 
> is updated and contains host information for rm2
> 5)Yarn is queried, for instance calling *yarnclient.getApplications()*
> 6)All YarnClient attempts to communicate with rm2 fail with 
> UnknownHostExceptions, even though /etc/hosts now contains host information 
> for it.
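
A minimal, standalone sketch of the failure mode described above (plain Java, 
not the YarnClient code): an address object created while the host is unknown 
stays unresolved for its whole lifetime, so a long-lived client that keeps 
reusing it keeps failing even after /etc/hosts is fixed.

{code:java}
import java.net.InetSocketAddress;

public class StaleAddressSketch {
  public static void main(String[] args) {
    // Resolution happens once, at construction time.
    InetSocketAddress rm2 = new InetSocketAddress("rm2.example.invalid", 8032);
    System.out.println(rm2.isUnresolved()); // true if the lookup failed here

    // Fixing DNS or /etc/hosts later does not change this instance; a fresh
    // InetSocketAddress (or a re-created client/proxy) is needed to pick up
    // the new mapping.
  }
}
{code}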






[jira] [Updated] (YARN-10866) RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby host info is missing

2021-07-20 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10866:
-
Description: 
RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby 
host info is missing in /etc/hosts

{code}
2021-07-19 13:07:18,892 ERROR [Listener at 0.0.0.0/45951] 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.IllegalArgumentException: java.net.UnknownHostException: 
resourcemanager-1.resourcemanager
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:466)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.getTokenService(ClientRMProxy.java:154)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.getAMRMTokenService(ClientRMProxy.java:139)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.setAMRMTokenService(ClientRMProxy.java:81)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.getRMAddress(ClientRMProxy.java:100)
at 
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal(ConfiguredRMFailoverProxyProvider.java:76)
at 
org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75)
at 
org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:194)
at 
org.apache.hadoop.yarn.client.RMProxy.newProxyInstance(RMProxy.java:130)
at org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:103)
{code}
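
A hedged, standalone sketch of one way to tolerate this (illustrative only, 
not the actual provider fix): resolve each configured RM host up front and 
hedge only across the resolvable ones, instead of failing the whole 
initialization on the first UnknownHostException.

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HedgingInitSketch {
  public static void main(String[] args) {
    List<String> usable = new ArrayList<>();
    for (String host : Arrays.asList("rm1.example.invalid", "localhost")) {
      try {
        InetAddress.getByName(host);   // resolve up front
        usable.add(host);              // hedge only across resolvable RMs
      } catch (UnknownHostException e) {
        System.err.println("Skipping unresolvable RM host: " + host);
      }
    }
    System.out.println("Hedging across: " + usable);
  }
}
{code}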


  was:
RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby 
host info is missing

{code}
2021-07-19 13:07:18,892 ERROR [Listener at 0.0.0.0/45951] 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.IllegalArgumentException: java.net.UnknownHostException: 
resourcemanager-1.resourcemanager
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:466)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.getTokenService(ClientRMProxy.java:154)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.getAMRMTokenService(ClientRMProxy.java:139)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.setAMRMTokenService(ClientRMProxy.java:81)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.getRMAddress(ClientRMProxy.java:100)
at 
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal(ConfiguredRMFailoverProxyProvider.java:76)
at 
org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75)
at 
org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:194)
at 
org.apache.hadoop.yarn.client.RMProxy.newProxyInstance(RMProxy.java:130)
at org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:103)
{code}



> RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if 
> standby host info is missing
> ---
>
> Key: YARN-10866
> URL: https://issues.apache.org/jira/browse/YARN-10866
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if 
> standby host info is missing in /etc/hosts
> {code}
> 2021-07-19 13:07:18,892 ERROR [Listener at 0.0.0.0/45951] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
> java.lang.IllegalArgumentException: java.net.UnknownHostException: 
> resourcemanager-1.resourcemanager
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:466)
> at 
> org.apache.hadoop.yarn.client.ClientRMProxy.getTokenService(ClientRMProxy.java:154)
> at 
> org.apache.hadoop.yarn.client.ClientRMProxy.getAMRMTokenService(ClientRMProxy.java:139)
> at 
> org.apache.hadoop.yarn.client.ClientRMProxy.setAMRMTokenService(ClientRMProxy.java:81)
> at 
> org.apache.hadoop.yarn.client.ClientRMProxy.getRMAddress(ClientRMProxy.java:100)
> at 
> org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal(ConfiguredRMFailoverProxyProvider.java:76)
> at 
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75)
> at 
> org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:194)
> at 
> org.apache.hadoop.yarn.client.RMProxy.newProxyInstance(RMProxy.java:130)
> at 
> org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:103)
> {code}




[jira] [Created] (YARN-10866) RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby host info is missing

2021-07-20 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10866:


 Summary: RequestHedgingRMFailoverProxyProvider fails to connect to 
Active RM if standby host info is missing
 Key: YARN-10866
 URL: https://issues.apache.org/jira/browse/YARN-10866
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 3.3.1
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


RequestHedgingRMFailoverProxyProvider fails to connect to Active RM if standby 
host info is missing

{code}
2021-07-19 13:07:18,892 ERROR [Listener at 0.0.0.0/45951] 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.IllegalArgumentException: java.net.UnknownHostException: 
resourcemanager-1.resourcemanager
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:466)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.getTokenService(ClientRMProxy.java:154)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.getAMRMTokenService(ClientRMProxy.java:139)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.setAMRMTokenService(ClientRMProxy.java:81)
at 
org.apache.hadoop.yarn.client.ClientRMProxy.getRMAddress(ClientRMProxy.java:100)
at 
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal(ConfiguredRMFailoverProxyProvider.java:76)
at 
org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75)
at 
org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:194)
at 
org.apache.hadoop.yarn.client.RMProxy.newProxyInstance(RMProxy.java:130)
at org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:103)
{code}







[jira] [Updated] (YARN-10840) yarn app status fails with ArrayIndexOutOfBoundsException

2021-07-07 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10840:
-
Attachment: YARN-10840-001.patch

> yarn app status fails with ArrayIndexOutOfBoundsException 
> --
>
> Key: YARN-10840
> URL: https://issues.apache.org/jira/browse/YARN-10840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Abhinaba Sarkar
>Assignee: Abhinaba Sarkar
>Priority: Major
> Attachments: YARN-10840-001.patch
>
>
> Array index out of bounds exception in the ClientAMService.getStatus() - 
> {code:java}
> 2021-07-04 20:00:24,488 [IPC Server handler 0 on 25347] INFO  ipc.Server - 
> IPC Server handler 0 on 25347, call Call#163 Retry#0 
> org.apache.hadoop.yarn.service.ClientAMProtocol.getStatus from 10.0.0.10:42446
> org.codehaus.jackson.map.JsonMappingException: Index: 11, Size: 11 (through 
> reference chain: 
> org.apache.hadoop.yarn.service.api.records.Service["components"]->java.util.ArrayList[0]->org.apache.hadoop.yarn.service.api.records.Component["containers"]->java.util.ArrayList[11])
>   at 
> org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:218)
>   at 
> org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:197)
>   at 
> org.codehaus.jackson.map.ser.std.SerializerBase.wrapAndThrow(SerializerBase.java:166)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:127)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71)
>   at 
> org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86)
>   at 
> org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446)
>   at 
> org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150)
>   at 
> org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:122)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71)
>   at 
> org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86)
>   at 
> org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446)
>   at 
> org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150)
>   at 
> org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112)
>   at 
> org.codehaus.jackson.map.ser.StdSerializerProvider._serializeValue(StdSerializerProvider.java:610)
>   at 
> org.codehaus.jackson.map.ser.StdSerializerProvider.serializeValue(StdSerializerProvider.java:256)
>   at 
> org.codehaus.jackson.map.ObjectMapper._configAndWriteValue(ObjectMapper.java:2575)
>   at 
> org.codehaus.jackson.map.ObjectMapper.writeValueAsString(ObjectMapper.java:2097)
>   at 
> org.apache.hadoop.yarn.service.utils.JsonSerDeser.toJson(JsonSerDeser.java:249)
>   at 
> org.apache.hadoop.yarn.service.ClientAMService.getStatus(ClientAMService.java:125)
>   at 
> org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.getStatus(ClientAMProtocolPBServiceImpl.java:59)
>   at 
> org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:6159)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 11, Size: 11
>   at java.util.ArrayList.rangeCheck(ArrayList.java:659)
>   at java.util.ArrayList.get(ArrayList.java:435)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:106)
>   ... 27 more
> {code}
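
The IndexOutOfBoundsException while serializing the live containers list 
typically indicates the list was modified while Jackson was iterating it. A 
hedged, standalone sketch of the usual mitigation (not the actual 
ClientAMService fix): serialize a point-in-time copy of the mutable collection.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import org.codehaus.jackson.map.ObjectMapper;

public class SnapshotSerializeSketch {
  public static void main(String[] args) throws Exception {
    List<String> containers = new CopyOnWriteArrayList<>();
    containers.add("container_1623830067124_0382_01_000001");

    // Snapshot first, then serialize: concurrent adds/removes on the live
    // list can no longer shift indexes while the mapper walks it.
    List<String> snapshot = new ArrayList<>(containers);
    System.out.println(new ObjectMapper().writeValueAsString(snapshot));
  }
}
{code}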





[jira] [Updated] (YARN-10840) yarn app status fails with ArrayIndexOutOfBoundsException

2021-07-07 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10840:
-
Attachment: (was: YARN-10840-001.patch)

> yarn app status fails with ArrayIndexOutOfBoundsException 
> --
>
> Key: YARN-10840
> URL: https://issues.apache.org/jira/browse/YARN-10840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Abhinaba Sarkar
>Assignee: Abhinaba Sarkar
>Priority: Major
> Attachments: YARN-10840-001.patch
>
>
> Array index out of bounds exception in the ClientAMService.getStatus() - 
> {code:java}
> 2021-07-04 20:00:24,488 [IPC Server handler 0 on 25347] INFO  ipc.Server - 
> IPC Server handler 0 on 25347, call Call#163 Retry#0 
> org.apache.hadoop.yarn.service.ClientAMProtocol.getStatus from 10.0.0.10:42446
> org.codehaus.jackson.map.JsonMappingException: Index: 11, Size: 11 (through 
> reference chain: 
> org.apache.hadoop.yarn.service.api.records.Service["components"]->java.util.ArrayList[0]->org.apache.hadoop.yarn.service.api.records.Component["containers"]->java.util.ArrayList[11])
>   at 
> org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:218)
>   at 
> org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:197)
>   at 
> org.codehaus.jackson.map.ser.std.SerializerBase.wrapAndThrow(SerializerBase.java:166)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:127)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71)
>   at 
> org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86)
>   at 
> org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446)
>   at 
> org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150)
>   at 
> org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:122)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71)
>   at 
> org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86)
>   at 
> org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446)
>   at 
> org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150)
>   at 
> org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112)
>   at 
> org.codehaus.jackson.map.ser.StdSerializerProvider._serializeValue(StdSerializerProvider.java:610)
>   at 
> org.codehaus.jackson.map.ser.StdSerializerProvider.serializeValue(StdSerializerProvider.java:256)
>   at 
> org.codehaus.jackson.map.ObjectMapper._configAndWriteValue(ObjectMapper.java:2575)
>   at 
> org.codehaus.jackson.map.ObjectMapper.writeValueAsString(ObjectMapper.java:2097)
>   at 
> org.apache.hadoop.yarn.service.utils.JsonSerDeser.toJson(JsonSerDeser.java:249)
>   at 
> org.apache.hadoop.yarn.service.ClientAMService.getStatus(ClientAMService.java:125)
>   at 
> org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.getStatus(ClientAMProtocolPBServiceImpl.java:59)
>   at 
> org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:6159)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 11, Size: 11
>   at java.util.ArrayList.rangeCheck(ArrayList.java:659)
>   at java.util.ArrayList.get(ArrayList.java:435)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:106)
>   ... 27 more
> {code}





[jira] [Updated] (YARN-10840) yarn app status fails with ArrayIndexOutOfBoundsException

2021-07-05 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10840:
-
Summary: yarn app status fails with ArrayIndexOutOfBoundsException   (was: 
yarn app status fails with arrayindexoutofbounsexception)

> yarn app status fails with ArrayIndexOutOfBoundsException 
> --
>
> Key: YARN-10840
> URL: https://issues.apache.org/jira/browse/YARN-10840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Abhinaba Sarkar
>Assignee: Abhinaba Sarkar
>Priority: Major
> Attachments: YARN-10840-001.patch
>
>
> Array index out of bounds exception in the ClientAMService.getStatus() - 
> {code:java}
> 2021-07-04 20:00:24,488 [IPC Server handler 0 on 25347] INFO  ipc.Server - 
> IPC Server handler 0 on 25347, call Call#163 Retry#0 
> org.apache.hadoop.yarn.service.ClientAMProtocol.getStatus from 10.0.0.10:42446
> org.codehaus.jackson.map.JsonMappingException: Index: 11, Size: 11 (through 
> reference chain: 
> org.apache.hadoop.yarn.service.api.records.Service["components"]->java.util.ArrayList[0]->org.apache.hadoop.yarn.service.api.records.Component["containers"]->java.util.ArrayList[11])
>   at 
> org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:218)
>   at 
> org.codehaus.jackson.map.JsonMappingException.wrapWithPath(JsonMappingException.java:197)
>   at 
> org.codehaus.jackson.map.ser.std.SerializerBase.wrapAndThrow(SerializerBase.java:166)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:127)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71)
>   at 
> org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86)
>   at 
> org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446)
>   at 
> org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150)
>   at 
> org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:122)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:71)
>   at 
> org.codehaus.jackson.map.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:86)
>   at 
> org.codehaus.jackson.map.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:446)
>   at 
> org.codehaus.jackson.map.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:150)
>   at 
> org.codehaus.jackson.map.ser.BeanSerializer.serialize(BeanSerializer.java:112)
>   at 
> org.codehaus.jackson.map.ser.StdSerializerProvider._serializeValue(StdSerializerProvider.java:610)
>   at 
> org.codehaus.jackson.map.ser.StdSerializerProvider.serializeValue(StdSerializerProvider.java:256)
>   at 
> org.codehaus.jackson.map.ObjectMapper._configAndWriteValue(ObjectMapper.java:2575)
>   at 
> org.codehaus.jackson.map.ObjectMapper.writeValueAsString(ObjectMapper.java:2097)
>   at 
> org.apache.hadoop.yarn.service.utils.JsonSerDeser.toJson(JsonSerDeser.java:249)
>   at 
> org.apache.hadoop.yarn.service.ClientAMService.getStatus(ClientAMService.java:125)
>   at 
> org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.getStatus(ClientAMProtocolPBServiceImpl.java:59)
>   at 
> org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:6159)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 11, Size: 11
>   at java.util.ArrayList.rangeCheck(ArrayList.java:659)
>   at java.util.ArrayList.get(ArrayList.java:435)
>   at 
> org.codehaus.jackson.map.ser.std.StdContainerSerializers$IndexedListSerializer.serializeContents(StdContainerSerializers.java:106)
>   ... 27 more
> 

[jira] [Resolved] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe

2021-06-27 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-10820.
--
Resolution: Fixed

> Make GetClusterNodesRequestPBImpl thread safe
> -
>
> Key: YARN-10820
> URL: https://issues.apache.org/jira/browse/YARN-10820
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: client
>Affects Versions: 3.1.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> yarn node list intermittently fails with the error below
> {code:java}
> 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on 
> [resourcemanager-1], so propagating back to caller.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>  at java.util.ArrayList.add(ArrayList.java:465)
>  at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.UnsupportedOperationException on 
> [resourcemanager-0], so propagating back to caller.
> Exception in thread "main" java.lang.UnsupportedOperationException
> at 
> java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 
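
A hedged, standalone sketch of the thread-safety idea behind the fix 
(illustrative names, not the actual PBImpl code): guard the merge-and-build 
step so two threads hedging the same request object cannot interleave their 
updates on the underlying builder.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ThreadSafeRequestSketch {
  private final List<String> nodeStates = new ArrayList<>();
  private List<String> proto; // lazily built view shared across threads

  public synchronized void setNodeStates(List<String> states) {
    nodeStates.clear();
    nodeStates.addAll(states);
    proto = null; // force a rebuild on the next getProto()
  }

  public synchronized List<String> getProto() {
    // Only one thread at a time merges local state into the built form,
    // so hedged callers cannot interleave addAll() on the same builder.
    if (proto == null) {
      proto = Collections.unmodifiableList(new ArrayList<>(nodeStates));
    }
    return proto;
  }
}
{code}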

[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe

2021-06-27 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370200#comment-17370200
 ] 

Prabhu Joseph commented on YARN-10820:
--

Thanks [~Swathi Chandrashekar] for the patch. Have committed it to 3.4.0. 

> Make GetClusterNodesRequestPBImpl thread safe
> -
>
> Key: YARN-10820
> URL: https://issues.apache.org/jira/browse/YARN-10820
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: client
>Affects Versions: 3.1.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> yarn node list intermittently fails with the error below
> {code:java}
> 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on 
> [resourcemanager-1], so propagating back to caller.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>  at java.util.ArrayList.add(ArrayList.java:465)
>  at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.UnsupportedOperationException on 
> [resourcemanager-0], so propagating back to caller.
> Exception in thread "main" java.lang.UnsupportedOperationException
> at 
> java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> 

[jira] [Updated] (YARN-10810) YARN Native Service Definition is not backward compatible

2021-06-17 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10810:
-
Attachment: YARN-10810-001.patch

> YARN Native Service Definition is not backward compatible
> -
>
> Key: YARN-10810
> URL: https://issues.apache.org/jira/browse/YARN-10810
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10810-001.patch
>
>
> The YARN Native Service Spec PlacementScope value was *NODE* in the hadoop-3.1 
> version but got changed to *node* in hadoop-3.3. This causes an older Service 
> Client (hadoop-3.1) to fail while getting the status from the new Api Server 
> (hadoop-3.3). This looks to be caused by the jackson upgrade.
>  
> {code:java}
> 2021-06-07 06:08:40,095 INFO utils.ServiceApiUtil: Loading service definition 
> from hdfs://prabhuhdfs/user/root/.yarn/services/llap0/llap0.json
> 2021-06-07 06:08:40,798 ERROR utils.JsonSerDeser: Exception while parsing 
> json : org.codehaus.jackson.map.JsonMappingException: Can not construct 
> instance of org.apache.hadoop.yarn.service.api.records.PlacementScope from 
> String value 'node': value not one of declared Enum instance names
>  at [Source: java.io.StringReader@72c927f1; line: 27, column: 33] (through 
> reference chain: 
> org.apache.hadoop.yarn.service.api.records.Service["components"]->org.apache.hadoop.yarn.service.api.records.Component["placement_policy"]->org.apache.hadoop.yarn.service.api.records.PlacementPolicy["constraints"]->org.apache.hadoop.yarn.service.api.records.PlacementConstraint["scope"])
> "placement_policy" : {
>   "constraints" : [ {
> "name" : null,
> "type" : "ANTI_AFFINITY",
> "scope" : "node",
> "target_tags" : [ "llap" ],
> "node_attributes" : { },
> "node_partitions" : [ ],
> "min_cardinality" : null,
> "max_cardinality" : null
>   } ]
> },org.codehaus.jackson.map.JsonMappingException: Can not construct 
> instance of org.apache.hadoop.yarn.service.api.records.PlacementScope from 
> String value 'node': value not one of declared Enum instance names
>  at [Source: java.io.StringReader@72c927f1; line: 27, column: 33] (through 
> reference chain: 
> org.apache.hadoop.yarn.service.api.records.Service["components"]->org.apache.hadoop.yarn.service.api.records.Component["placement_policy"]->org.apache.hadoop.yarn.service.api.records.PlacementPolicy["constraints"]->org.apache.hadoop.yarn.service.api.records.PlacementConstraint["scope"])
> at 
> org.codehaus.jackson.map.JsonMappingException.from(JsonMappingException.java:163)
> at 
> org.codehaus.jackson.map.deser.StdDeserializationContext.weirdStringException(StdDeserializationContext.java:243)
> at 
> org.codehaus.jackson.map.deser.std.EnumDeserializer.deserialize(EnumDeserializer.java:80)
> at 
> org.codehaus.jackson.map.deser.std.EnumDeserializer.deserialize(EnumDeserializer.java:23)
> at 
> org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299)
> at 
> org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414)
> at 
> org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697)
> at 
> org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580)
> at 
> org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:217)
> at 
> org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:194)
> at 
> org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:30)
> at 
> org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299)
> at 
> org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414)
> at 
> org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697)
> at 
> org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580)
> at 
> org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299)
> at 
> org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414)
> at 
> org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697)
> at 
> 
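
A hedged sketch of a backward-compatible enum (illustrative, not the actual 
PlacementScope class): accept both the old upper-case spelling ("NODE", 
hadoop-3.1) and the new lower-case wire value ("node", hadoop-3.3) when 
deserializing, so either spec version parses.

{code:java}
import org.codehaus.jackson.annotate.JsonCreator;
import org.codehaus.jackson.annotate.JsonValue;

public enum PlacementScopeSketch {
  NODE("node"),
  RACK("rack");

  private final String value;

  PlacementScopeSketch(String value) {
    this.value = value;
  }

  @JsonValue
  public String toValue() {
    return value; // what the newer version writes ("node", "rack")
  }

  @JsonCreator
  public static PlacementScopeSketch fromValue(String text) {
    for (PlacementScopeSketch scope : values()) {
      // Tolerate both the old enum name ("NODE") and the new value ("node").
      if (scope.value.equalsIgnoreCase(text)
          || scope.name().equalsIgnoreCase(text)) {
        return scope;
      }
    }
    throw new IllegalArgumentException("Unknown placement scope: " + text);
  }
}
{code}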

[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe

2021-06-14 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362724#comment-17362724
 ] 

Prabhu Joseph commented on YARN-10820:
--

Hi [~bibinchundatt], could you please add [~Swathi Chandrashekar] as a 
contributor? Thanks.

> Make GetClusterNodesRequestPBImpl thread safe
> -
>
> Key: YARN-10820
> URL: https://issues.apache.org/jira/browse/YARN-10820
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: client
>Affects Versions: 3.1.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> yarn node list intermittently fails with the error below
> {code:java}
> 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on 
> [resourcemanager-1], so propagating back to caller.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>  at java.util.ArrayList.add(ArrayList.java:465)
>  at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.UnsupportedOperationException on 
> [resourcemanager-0], so propagating back to caller.
> Exception in thread "main" java.lang.UnsupportedOperationException
> at 
> java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> 

[jira] [Created] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe

2021-06-13 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10820:


 Summary: Make GetClusterNodesRequestPBImpl thread safe
 Key: YARN-10820
 URL: https://issues.apache.org/jira/browse/YARN-10820
 Project: Hadoop YARN
  Issue Type: Task
  Components: client
Affects Versions: 3.3.0, 3.1.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


yarn node list intermittently fails with the error below
{code:java}
2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: 
Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on 
[resourcemanager-1], so propagating back to caller.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
 at java.util.ArrayList.add(ArrayList.java:465)
 at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
 at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
 at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
 at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
 at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
 at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)



2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: 
Invocation returned exception: java.lang.UnsupportedOperationException on 
[resourcemanager-0], so propagating back to caller.
Exception in thread "main" java.lang.UnsupportedOperationException
at 
java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 

[jira] [Commented] (YARN-10792) Set Completed AppAttempt LogsLink to Log Server Url

2021-06-08 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359448#comment-17359448
 ] 

Prabhu Joseph commented on YARN-10792:
--

Thanks [~abhinaba.sarkar] for the contribution. Have committed the patch to 3.4

> Set Completed AppAttempt LogsLink to Log Server Url
> ---
>
> Key: YARN-10792
> URL: https://issues.apache.org/jira/browse/YARN-10792
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: webapp
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Abhinaba Sarkar
>Priority: Major
> Attachments: YARN-10792-001.patch, YARN-10792-002.patch, 
> YARN-10792-003.patch
>
>
> Completed AppAttempts listed by the YARN UI have a logs link pointing to the 
> NodeManager containerlogs URL. The completed container logs will be under the 
> aggregated log path, so the NM ContainerLogsPage redirects to the Log Server 
> URL. With frequent scale-down, these NMs won't be available, which makes it 
> difficult to look up the appattempt logs of completed apps from the RM UI. 
> Setting the logs link for Completed AppAttempts to the Log Server URL will 
> avoid this issue. 
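
A hedged, standalone sketch of the idea (the URL layouts and names below are 
illustrative, not the exact webapp code): once an attempt has finished, build 
its logs link from the log server URL rather than from the NodeManager address, 
which may no longer exist after scale-down.

{code:java}
public class LogsLinkSketch {
  static String logsLink(boolean attemptFinished, String nmHttpAddress,
      String logServerUrl, String nodeId, String containerId, String user) {
    if (attemptFinished) {
      // Aggregated logs served by the log server (path shape is illustrative).
      return logServerUrl + "/" + nodeId + "/" + containerId + "/"
          + containerId + "/" + user;
    }
    // Running attempt: link to the NodeManager's container logs page.
    return "http://" + nmHttpAddress + "/node/containerlogs/"
        + containerId + "/" + user;
  }

  public static void main(String[] args) {
    System.out.println(logsLink(true, "node1:8042",
        "http://historyserver:19888/jobhistory/logs",
        "node1:34753", "container_1623830067124_0382_01_000001", "hadoop"));
  }
}
{code}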






[jira] [Created] (YARN-10810) YARN Native Service Definition is not backward compatible

2021-06-08 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10810:


 Summary: YARN Native Service Definition is not backward compatible
 Key: YARN-10810
 URL: https://issues.apache.org/jira/browse/YARN-10810
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Affects Versions: 3.3.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


The YARN Native Service Spec PlacementScope value was *NODE* in the hadoop-3.1 
version but got changed to *node* in hadoop-3.3. This causes an older Service 
Client (hadoop-3.1) to fail while getting the status from the new Api Server 
(hadoop-3.3). This looks to be caused by the jackson upgrade.

 
{code:java}

2021-06-07 06:08:40,095 INFO utils.ServiceApiUtil: Loading service definition from hdfs://prabhuhdfs/user/root/.yarn/services/llap0/llap0.json
2021-06-07 06:08:40,798 ERROR utils.JsonSerDeser: Exception while parsing json : org.codehaus.jackson.map.JsonMappingException: Can not construct instance of org.apache.hadoop.yarn.service.api.records.PlacementScope from String value 'node': value not one of declared Enum instance names
 at [Source: java.io.StringReader@72c927f1; line: 27, column: 33] (through reference chain: org.apache.hadoop.yarn.service.api.records.Service["components"]->org.apache.hadoop.yarn.service.api.records.Component["placement_policy"]->org.apache.hadoop.yarn.service.api.records.PlacementPolicy["constraints"]->org.apache.hadoop.yarn.service.api.records.PlacementConstraint["scope"])
"placement_policy" : {
  "constraints" : [ {
    "name" : null,
    "type" : "ANTI_AFFINITY",
    "scope" : "node",
    "target_tags" : [ "llap" ],
    "node_attributes" : { },
    "node_partitions" : [ ],
    "min_cardinality" : null,
    "max_cardinality" : null
  } ]
},org.codehaus.jackson.map.JsonMappingException: Can not construct instance of org.apache.hadoop.yarn.service.api.records.PlacementScope from String value 'node': value not one of declared Enum instance names
 at [Source: java.io.StringReader@72c927f1; line: 27, column: 33] (through reference chain: org.apache.hadoop.yarn.service.api.records.Service["components"]->org.apache.hadoop.yarn.service.api.records.Component["placement_policy"]->org.apache.hadoop.yarn.service.api.records.PlacementPolicy["constraints"]->org.apache.hadoop.yarn.service.api.records.PlacementConstraint["scope"])
at org.codehaus.jackson.map.JsonMappingException.from(JsonMappingException.java:163)
at org.codehaus.jackson.map.deser.StdDeserializationContext.weirdStringException(StdDeserializationContext.java:243)
at org.codehaus.jackson.map.deser.std.EnumDeserializer.deserialize(EnumDeserializer.java:80)
at org.codehaus.jackson.map.deser.std.EnumDeserializer.deserialize(EnumDeserializer.java:23)
at org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299)
at org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580)
at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:217)
at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:194)
at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:30)
at org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299)
at org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580)
at org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:299)
at org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:414)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:697)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:580)
at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:217)
at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:194)
at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:30)
at 

[jira] [Commented] (YARN-10792) Set Completed AppAttempt LogsLink to Log Server Url

2021-06-07 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358782#comment-17358782
 ] 

Prabhu Joseph commented on YARN-10792:
--

Thanks [~abhinaba.sarkar] for the patch. [^YARN-10792-003.patch] looks fine, +1.

The failed test cases are not related to this patch. 
{{TestCapacitySchedulerSurgicalPreemption#testPriorityPreemptionWithNodeLabels}} 
and {{TestFSRMStateStore}} run fine locally; this looks like an intermittent 
issue. Will check whether a Jira already tracks it and report one if not.

Will commit this patch by tomorrow EOD if there are no other comments. Thanks.

> Set Completed AppAttempt LogsLink to Log Server Url
> ---
>
> Key: YARN-10792
> URL: https://issues.apache.org/jira/browse/YARN-10792
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: webapp
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Abhinaba Sarkar
>Priority: Major
> Attachments: YARN-10792-001.patch, YARN-10792-002.patch, 
> YARN-10792-003.patch
>
>
> Completed AppAttempts listed by the YARN UI have their logs link pointing to 
> the NodeManager container logs URL. The completed container logs are under 
> the aggregated log path, so the NM ContainerLogsPage redirects to the Log 
> Server URL. On frequent scale-down, these NMs are no longer available, which 
> makes it difficult to look up AppAttempt logs of completed apps from the RM 
> UI. Setting the logs link for completed AppAttempts to the Log Server URL 
> avoids this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10792) Set Completed AppAttempt LogsLink to Log Server Url

2021-06-07 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358782#comment-17358782
 ] 

Prabhu Joseph edited comment on YARN-10792 at 6/7/21, 6:21 PM:
---

Thanks [~abhinaba.sarkar] for the patch. [^YARN-10792-003.patch] looks good to 
me, +1.

The failed test cases are not related to this patch. 
{{TestCapacitySchedulerSurgicalPreemption#testPriorityPreemptionWithNodeLabels}} 
and {{TestFSRMStateStore}} run fine locally; this looks like an intermittent 
issue. Will check whether a Jira already tracks it and report one if not.

Will commit this patch by tomorrow EOD if there are no other comments. Thanks.


was (Author: prabhu joseph):
Thanks [~abhinaba.sarkar] for the patch. [^YARN-10792-003.patch] looks fine, +1.

The failed test cases are not related to this patch. 
{{TestCapacitySchedulerSurgicalPreemption#testPriorityPreemptionWithNodeLabels}} 
and {{TestFSRMStateStore}} run fine locally; this looks like an intermittent 
issue. Will check whether a Jira already tracks it and report one if not.

Will commit this patch by tomorrow EOD if there are no other comments. Thanks.

> Set Completed AppAttempt LogsLink to Log Server Url
> ---
>
> Key: YARN-10792
> URL: https://issues.apache.org/jira/browse/YARN-10792
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: webapp
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Abhinaba Sarkar
>Priority: Major
> Attachments: YARN-10792-001.patch, YARN-10792-002.patch, 
> YARN-10792-003.patch
>
>
> Completed AppAttempts listed by the YARN UI have their logs link pointing to 
> the NodeManager container logs URL. The completed container logs are under 
> the aggregated log path, so the NM ContainerLogsPage redirects to the Log 
> Server URL. On frequent scale-down, these NMs are no longer available, which 
> makes it difficult to look up AppAttempt logs of completed apps from the RM 
> UI. Setting the logs link for completed AppAttempts to the Log Server URL 
> avoids this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10792) Set Completed AppAttempt LogsLink to Log Server Url

2021-05-28 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10792:


 Summary: Set Completed AppAttempt LogsLink to Log Server Url
 Key: YARN-10792
 URL: https://issues.apache.org/jira/browse/YARN-10792
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: webapp
Affects Versions: 3.3.0
Reporter: Prabhu Joseph
Assignee: Abhinaba Sarkar


Completed AppAttempts listed by the YARN UI have their logs link pointing to 
the NodeManager container logs URL. The completed container logs are under the 
aggregated log path, so the NM ContainerLogsPage redirects to the Log Server 
URL. On frequent scale-down, these NMs are no longer available, which makes it 
difficult to look up AppAttempt logs of completed apps from the RM UI. Setting 
the logs link for completed AppAttempts to the Log Server URL avoids this 
issue.
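
As illustration, a hedged sketch of building such a link, assuming the usual aggregated-log page layout {{<yarn.log.server.url>/<nodeId>/<containerId>/<containerId>/<user>}} (the helper and all names below are illustrative, not the actual patch):

{code:java}
// Illustrative only: build a log-server link for a completed AppAttempt's AM
// container instead of pointing at the (possibly decommissioned) NodeManager.
// Assumption: the log server serves aggregated logs at
//   <yarn.log.server.url>/<nodeId>/<containerId>/<containerId>/<user>
public final class LogServerLinkSketch {

  private LogServerLinkSketch() {
  }

  static String completedAttemptLogsLink(String logServerUrl,   // yarn.log.server.url
                                         String nodeId,         // e.g. "nm-host:45454"
                                         String amContainerId,  // AM container id
                                         String user) {
    return logServerUrl + "/" + nodeId + "/" + amContainerId + "/"
        + amContainerId + "/" + user;
  }

  public static void main(String[] args) {
    System.out.println(completedAttemptLogsLink(
        "http://historyserver:19888/jobhistory/logs",
        "nm-host:45454",
        "container_e01_1620000000000_0001_01_000001",
        "hadoop"));
  }
}
{code}

Running attempts would keep the existing NodeManager link; only completed attempts would switch to the log-server link.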



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10742) Discard old domain data from RollingLevelDBTimelineStore

2021-04-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-10742.
--
Resolution: Duplicate

> Discard old domain data from RollingLevelDBTimelineStore
> 
>
> Key: YARN-10742
> URL: https://issues.apache.org/jira/browse/YARN-10742
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> Discard old domain data from domaindb and ownerdb in 
> RollingLevelDBTimelineStore



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10741) Discard old domain data from RollingLevelDBTimelineStore

2021-04-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-10741.
--
Resolution: Duplicate

> Discard old domain data from RollingLevelDBTimelineStore
> 
>
> Key: YARN-10741
> URL: https://issues.apache.org/jira/browse/YARN-10741
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> Discard old domain data from domaindb and ownerdb in 
> RollingLevelDBTimelineStore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10742) Discard old domain data from RollingLevelDBTimelineStore

2021-04-16 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10742:


 Summary: Discard old domain data from RollingLevelDBTimelineStore
 Key: YARN-10742
 URL: https://issues.apache.org/jira/browse/YARN-10742
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 3.3.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Discard old domain data from domaindb and ownerdb in RollingLevelDBTimelineStore
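
For context, a hedged sketch of the kind of retention sweep this implies, using the iq80/leveldbjni API; the key layout and the trailing-timestamp assumption below are illustrative, not the store's actual schema:

{code:java}
import java.io.File;
import java.util.Map;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.Options;

// Illustrative only: walk a LevelDB (e.g. a domain db) and delete entries
// whose assumed trailing 8-byte big-endian write timestamp is older than the
// retention window.
public class DomainRetentionSweepSketch {

  public static void discardOlderThan(File dbPath, long retentionMs) throws Exception {
    DB db = JniDBFactory.factory.open(dbPath, new Options().createIfMissing(false));
    long cutoff = System.currentTimeMillis() - retentionMs;
    try (DBIterator it = db.iterator()) {
      for (it.seekToFirst(); it.hasNext(); ) {
        Map.Entry<byte[], byte[]> entry = it.next();
        byte[] key = entry.getKey();
        // Assumption: the last 8 bytes of the key encode the write time.
        if (key.length >= 8 && readLong(key, key.length - 8) < cutoff) {
          db.delete(key);
        }
      }
    } finally {
      db.close();
    }
  }

  private static long readLong(byte[] b, int off) {
    long v = 0;
    for (int i = 0; i < 8; i++) {
      v = (v << 8) | (b[off + i] & 0xff);
    }
    return v;
  }
}
{code}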



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10741) Discard old domain data from RollingLevelDBTimelineStore

2021-04-16 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10741:


 Summary: Discard old domain data from RollingLevelDBTimelineStore
 Key: YARN-10741
 URL: https://issues.apache.org/jira/browse/YARN-10741
 Project: Hadoop YARN
  Issue Type: Task
  Components: timelineserver
Affects Versions: 3.3.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Discard old domain data from domaindb and ownerdb in 
RollingLevelDBTimelineStore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


