[jira] [Updated] (YARN-6606) The implementation of LocalizationStatus in ContainerStatusProto
[ https://issues.apache.org/jira/browse/YARN-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SammiChen updated YARN-6606: Fix Version/s: (was: 2.9.1) > The implementation of LocalizationStatus in ContainerStatusProto > > > Key: YARN-6606 > URL: https://issues.apache.org/jira/browse/YARN-6606 > Project: Hadoop YARN > Issue Type: Task > Components: nodemanager > Affects Versions: 2.9.0 > Reporter: Bingxue Qiu > Priority: Major > Attachments: YARN-6606.1.patch, YARN-6606.2.patch > > > We have a use case that requires the full implementation of localization status in > ContainerStatusProto > ([Continuous-resource-localization|https://issues.apache.org/jira/secure/attachment/12825041/Continuous-resource-localization.pdf]), > so we have implemented it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
[ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SammiChen updated YARN-6661: Fix Version/s: (was: 2.9.1) > Too much CLEANUP event hang ApplicationMasterLauncher thread pool > - > > Key: YARN-6661 > URL: https://issues.apache.org/jira/browse/YARN-6661 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler > Affects Versions: 2.7.2 > Environment: hadoop 2.7.2 > Reporter: JackZhou > Priority: Major > > Someone else has already run into a similar problem and fixed it; see YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details. > But I think that fix does not solve the problem completely. Below is the problem I encountered: > There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800 apps. > I failed over my active RM, and the RM recovered all 1800 apps. > When an application fails over, the RM waits for the AM container to register itself. > But there is a bug in my AM (introduced intentionally), so it never registers. > The RM therefore waits about 10 minutes for the AM to expire, then sends a CLEANUP event to the > ApplicationMasterLauncher thread pool. Because there are about 1800 apps, this ties up the ApplicationMasterLauncher > thread pool for a long time. I have already applied the patch (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so > a CLEANUP event holds a thread for 10 * 20 = 200 s. But with 1800 apps, each thread is > blocked for 1800 / 50 * 200 s = 7200 s, i.e. about 2 hours. > Because the AM has not registered itself within 10 minutes, the RM retries and creates a new application attempt. > The application attempt is allocated a container by the RM and sends a LAUNCH event to the > ApplicationMasterLauncher thread pool. > Because the 1800 CLEANUP events tie up the 50-thread pool for about 2 hours, the application attempt cannot > start the AM container within 10 minutes.
> It then expires as well and also sends a CLEANUP event to the ApplicationMasterLauncher > thread pool. > As you can see, none of my applications can actually run. > Each of them has 5 application attempts, as follows, and each keeps retrying: > appattempt_1495786030132_4000_05 > appattempt_1495786030132_4000_04 > appattempt_1495786030132_4000_03 > appattempt_1495786030132_4000_02 > appattempt_1495786030132_4000_01 > So all of my apps have hung for several hours, and none of them can really run. > I think this is a bug: CLEANUP and LAUNCH should be treated as different events, > with a separate thread (or some other mechanism) handling LAUNCH events. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
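The separation suggested at the end of this report, handling LAUNCH and CLEANUP on distinct thread pools so a cleanup storm cannot starve launches, can be sketched in isolation. This is illustrative only, not the actual ApplicationMasterLauncher code; the class name, pool sizes, and event enum are invented for the example:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SplitLauncherSketch {
    enum EventType { LAUNCH, CLEANUP }

    // Hypothetical split: two pools instead of one shared pool, so a
    // backlog of slow CLEANUP events can no longer block LAUNCH events.
    private final ExecutorService launchPool  = Executors.newFixedThreadPool(2);
    private final ExecutorService cleanupPool = Executors.newFixedThreadPool(2);

    public void handle(EventType type, Runnable work) {
        // Route each event to the pool dedicated to its type.
        (type == EventType.LAUNCH ? launchPool : cleanupPool).execute(work);
    }

    public void shutdown() {
        launchPool.shutdownNow();
        cleanupPool.shutdownNow();
    }
}
```

Even with the cleanup pool fully occupied by slow tasks, a launch submitted afterwards still runs promptly, which is the property the reporter is asking for.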
[jira] [Updated] (YARN-6645) Bug fix in ContainerImpl when calling the symLink of LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-6645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SammiChen updated YARN-6645: Fix Version/s: (was: 2.9.1) > Bug fix in ContainerImpl when calling the symLink of LinuxContainerExecutor > --- > > Key: YARN-6645 > URL: https://issues.apache.org/jira/browse/YARN-6645 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Reporter: Bingxue Qiu > Priority: Major > Attachments: error when creating symlink.png > > > When creating a symlink after a resource is localized in our clusters, an > IOException is thrown because the nmPrivateDir doesn't exist. We have added a > patch to fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
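The shape of the fix described above, ensuring the private directory exists before the link is created, can be shown with a standalone java.nio sketch. This is not the actual LinuxContainerExecutor code; the method name and signature are invented:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SymlinkSketch {
    // Create the parent directories first: Files.createSymbolicLink throws
    // a NoSuchFileException (an IOException) when the directory that should
    // hold the link does not exist -- the failure mode in this report.
    public static Path safeSymlink(Path nmPrivateDir, String linkName, Path target)
            throws IOException {
        Files.createDirectories(nmPrivateDir);
        Path link = nmPrivateDir.resolve(linkName);
        Files.deleteIfExists(link); // keep the operation idempotent on retries
        return Files.createSymbolicLink(link, target);
    }
}
```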
[jira] [Updated] (YARN-7601) Incorrect container states recovered as LevelDB uses alphabetical order
[ https://issues.apache.org/jira/browse/YARN-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SammiChen updated YARN-7601: Target Version/s: 2.9.2 (was: 2.9.1) > Incorrect container states recovered as LevelDB uses alphabetical order > --- > > Key: YARN-7601 > URL: https://issues.apache.org/jira/browse/YARN-7601 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Sampada Dehankar > Assignee: Sampada Dehankar > Priority: Major > Attachments: YARN-7601.001.patch, YARN-7601.002.patch > > > LevelDB stores key-value pairs in alphabetical order. The container id > concatenated with its state is used as the key. So, no matter which states a container > goes through in its life cycle, the order of the following values > retrieved from LevelDB is always going to be: > LAUNCHED > PAUSED > QUEUED > For example, if a container is LAUNCHED, then PAUSED, then LAUNCHED again, the > recovered container state is currently PAUSED instead of LAUNCHED. > We propose to store the timestamp as the value in the calls to > > storeContainerLaunched > storeContainerPaused > storeContainerQueued > > so that the correct container state is recovered based on timestamps. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
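The proposal above can be illustrated with a minimal sketch in which a sorted TreeMap stands in for LevelDB's lexicographic key order; the key layout and names are invented for the example and are not the NM state-store code:

```java
import java.util.Map;
import java.util.TreeMap;

public class RecoveryOrderSketch {
    // Keys are containerId + "/" + state; values are event timestamps.
    // Iterating keys alone always visits "launched" < "paused" < "queued"
    // alphabetically, so the last key seen may not be the last state the
    // container was actually in. Picking the entry with the newest
    // timestamp recovers the true final state.
    static String recoverLastState(TreeMap<String, Long> store, String containerId) {
        String recovered = null;
        long latest = Long.MIN_VALUE;
        for (Map.Entry<String, Long> e : store.entrySet()) {
            if (e.getValue() > latest) {
                latest = e.getValue();
                recovered = e.getKey().substring(containerId.length() + 1);
            }
        }
        return recovered;
    }
}
```

In the LAUNCHED -> PAUSED -> LAUNCHED example, the alphabetically last key is still the PAUSED one, while the timestamp-based recovery returns LAUNCHED.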
[jira] [Updated] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
[ https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SammiChen updated YARN-4858: Target Version/s: 2.9.2 (was: 2.9.1) > start-yarn and stop-yarn scripts to support timeline and sharedcachemanager > --- > > Key: YARN-4858 > URL: https://issues.apache.org/jira/browse/YARN-4858 > Project: Hadoop YARN > Issue Type: Improvement > Components: scripts >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Labels: oct16-easy > Attachments: YARN-4858-001.patch, YARN-4858-branch-2.001.patch > > > The start-yarn and stop-yarn scripts don't have any (even commented out) > support for the timeline and sharedcachemanager > Proposed: > * bash and cmd start-yarn scripts have commented out start actions > * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SammiChen updated YARN-7652: Target Version/s: 2.9.2 (was: 2.9.1) > Handle AM register requests asynchronously in FederationInterceptor > --- > > Key: YARN-7652 > URL: https://issues.apache.org/jira/browse/YARN-7652 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Affects Versions: 2.9.0, 3.0.0 >Reporter: Subru Krishnan >Assignee: Botong Huang >Priority: Major > > We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in > {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has > outdated info about a _SubCluster_. This is because we handle AM register > requests synchronously. This jira proposes to move to async similar to how we > operate with allocate invocations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
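The sync-to-async move this jira proposes can be sketched generically: submit the register call to an executor and hand the caller a Future instead of blocking. The class name, pool, and the sleep standing in for a slow StateStore lookup are all illustrative, not the FederationInterceptor API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncRegisterSketch {
    private final ExecutorService registerPool = Executors.newCachedThreadPool();

    // Queue the (possibly slow) register call instead of blocking the
    // caller, so an outdated sub-cluster entry cannot stall the AM path.
    public Future<String> registerApplicationMaster(String amId) {
        return registerPool.submit(() -> {
            Thread.sleep(50); // stand-in for a slow StateStore lookup
            return "registered:" + amId;
        });
    }

    public void shutdown() {
        registerPool.shutdownNow();
    }
}
```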
[jira] [Updated] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.
[ https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SammiChen updated YARN-7450: Target Version/s: 2.9.2 (was: 2.9.1) > ATS Client should retry on intermittent Kerberos issues. > > > Key: YARN-7450 > URL: https://issues.apache.org/jira/browse/YARN-7450 > Project: Hadoop YARN > Issue Type: Improvement > Components: ATSv2 >Affects Versions: 2.7.3 > Environment: Hadoop-2.7.3 >Reporter: Ravi Prakash >Priority: Major > > We saw a stack trace (posted in the first comment) in the ResourceManager > logs for the TimelineClientImpl not being able to relogin from keytab. > I'm guessing there was an intermittent issue that failed the kerberos relogin > from keytab. However, I'm assuming this was *not* retried because I only saw > one instance of this stack trace. I propose that this operation should have > been retried. > It seems, this caused events at the ResourceManager to queue up and > eventually stop responding to even basic {{yarn application -list}} commands. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
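The retry the reporter proposes amounts to wrapping the relogin in a bounded retry loop with backoff. A generic sketch follows; the attempt count, backoff, and helper name are made up, and this is not the TimelineClientImpl API:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Retry a flaky action a few times with a fixed backoff instead of
    // failing permanently on one intermittent error.
    public static <T> T retry(Callable<T> action, int maxAttempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs); // back off before the next try
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```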
[jira] [Updated] (YARN-6918) Remove acls after queue delete to avoid memory leak
[ https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SammiChen updated YARN-6918: Target Version/s: 3.0.2, 2.9.2 (was: 2.9.1, 3.0.2) > Remove acls after queue delete to avoid memory leak > --- > > Key: YARN-6918 > URL: https://issues.apache.org/jira/browse/YARN-6918 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Major > Attachments: YARN-6918.001.patch, YARN-6918.002.patch > > > Acl for deleted queue need to removed from allAcls to avoid leak > (Priority,YarnAuthorizer) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
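The leak and its fix reduce to a map-hygiene rule: whatever adds an entry per queue must have a matching removal on queue delete. A minimal sketch, with invented field and method names rather than the YarnAuthorizer code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class QueueAclsSketch {
    private final Map<String, String> allAcls = new ConcurrentHashMap<>();

    public void addQueue(String queuePath, String acl) {
        allAcls.put(queuePath, acl);
    }

    // Without this removal, entries for deleted queues stay in allAcls
    // forever and the map grows without bound -- the leak in this jira.
    public void deleteQueue(String queuePath) {
        allAcls.remove(queuePath);
    }

    public int aclCount() {
        return allAcls.size();
    }
}
```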
[jira] [Updated] (YARN-7649) RMContainer state transition exception after container update
[ https://issues.apache.org/jira/browse/YARN-7649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SammiChen updated YARN-7649: Target Version/s: 3.0.2, 2.9.2 (was: 2.9.1, 3.0.2) > RMContainer state transition exception after container update > - > > Key: YARN-7649 > URL: https://issues.apache.org/jira/browse/YARN-7649 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.0 >Reporter: Weiwei Yang >Assignee: Arun Suresh >Priority: Major > > I've been seen this in a cluster deployment as well as in UT, run > {{TestAMRMClient#testAMRMClientWithContainerPromotion}} could reproduce this, > it doesn't fail the test case but following error message is shown up in the > log > {noformat} > 2017-12-13 19:41:31,817 ERROR rmcontainer.RMContainerImpl > (RMContainerImpl.java:handle(480)) - Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: > RELEASED at ALLOCATED > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:478) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:675) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:155) > at > 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:748) > 2017-12-13 19:41:31,817 ERROR rmcontainer.RMContainerImpl > (RMContainerImpl.java:handle(481)) - Invalid event RELEASED on container > container_1513165290804_0001_01_03 > {noformat} > this seems to be related to YARN-6251. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387355#comment-16387355 ] SammiChen commented on YARN-7652: - Hi [~botong], is it still on target for 2.9.1? If not, can we push it out from 2.9.1 to next release? > Handle AM register requests asynchronously in FederationInterceptor > --- > > Key: YARN-7652 > URL: https://issues.apache.org/jira/browse/YARN-7652 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Affects Versions: 2.9.0, 3.0.0 >Reporter: Subru Krishnan >Assignee: Botong Huang >Priority: Major > > We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in > {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has > outdated info about a _SubCluster_. This is because we handle AM register > requests synchronously. This jira proposes to move to async similar to how we > operate with allocate invocations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7601) Incorrect container states recovered as LevelDB uses alphabetical order
[ https://issues.apache.org/jira/browse/YARN-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387353#comment-16387353 ] SammiChen commented on YARN-7601: - Hi [~sampada15], is it still on target for 2.9.1? If not, can we push it out from 2.9.1 to the next release? > Incorrect container states recovered as LevelDB uses alphabetical order > --- > > Key: YARN-7601 > URL: https://issues.apache.org/jira/browse/YARN-7601 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Sampada Dehankar > Assignee: Sampada Dehankar > Priority: Major > Attachments: YARN-7601.001.patch, YARN-7601.002.patch > > > LevelDB stores key-value pairs in alphabetical order. The container id > concatenated with its state is used as the key. So, no matter which states a container > goes through in its life cycle, the order of the following values > retrieved from LevelDB is always going to be: > LAUNCHED > PAUSED > QUEUED > For example, if a container is LAUNCHED, then PAUSED, then LAUNCHED again, the > recovered container state is currently PAUSED instead of LAUNCHED. > We propose to store the timestamp as the value in the calls to > > storeContainerLaunched > storeContainerPaused > storeContainerQueued > > so that the correct container state is recovered based on timestamps. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7511) NPE in ContainerLocalizer when localization failed for running container
[ https://issues.apache.org/jira/browse/YARN-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383107#comment-16383107 ] SammiChen commented on YARN-7511: - Hi [~Tao Yang], is it still on target for 2.9.1? if not, can we push this out from 2.9.1 to next release? > NPE in ContainerLocalizer when localization failed for running container > > > Key: YARN-7511 > URL: https://issues.apache.org/jira/browse/YARN-7511 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-7511.001.patch > > > Error log: > {noformat} > 2017-09-30 20:14:32,839 FATAL [AsyncDispatcher event handler] > org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106) > at > java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.resourceLocalizationFailed(ResourceSet.java:151) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:821) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:813) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1335) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:95) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1372) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1365) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:834) > 2017-09-30 20:14:32,842 INFO [AsyncDispatcher ShutDown handler] > org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. > {noformat} > To reproduce this problem: > 1. Container was running and ContainerManagerImpl#localize was called for > this container > 2. Localization failed in ResourceLocalizationService$LocalizerRunner#run and > sent out a ContainerResourceFailedEvent with a null LocalResourceRequest. > 3. NPE in ResourceLocalizationFailedWhileRunningTransition#transition --> > container.resourceSet.resourceLocalizationFailed(null) > I think we can fix this problem by ensuring that the request is not null > before removing it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
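The suggested fix, guarding against a null request before the remove, follows from ConcurrentHashMap rejecting null keys. A sketch with simplified types, not the real ResourceSet signature:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ResourceSetSketch {
    private final ConcurrentMap<String, String> pendingResources =
            new ConcurrentHashMap<>();

    public void track(String request) {
        pendingResources.put(request, "pending");
    }

    // ConcurrentHashMap.remove(null) throws NullPointerException -- the
    // exact crash in the stack trace above -- so check the request first.
    public boolean resourceLocalizationFailed(String request) {
        if (request == null) {
            return false; // nothing to clean up; dispatcher stays alive
        }
        return pendingResources.remove(request) != null;
    }
}
```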
[jira] [Commented] (YARN-7736) Fix itemization in YARN federation document
[ https://issues.apache.org/jira/browse/YARN-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383103#comment-16383103 ] SammiChen commented on YARN-7736: - Hi [~ajisakaa], is it still on target for 2.9.1? If not, can we push this out from 2.9.1 to next release? > Fix itemization in YARN federation document > --- > > Key: YARN-7736 > URL: https://issues.apache.org/jira/browse/YARN-7736 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Akira Ajisaka >Priority: Minor > Labels: newbie > > https://hadoop.apache.org/docs/r3.0.0/hadoop-yarn/hadoop-yarn-site/Federation.html > {noformat} > Assumptions: > * We assume reasonably good connectivity across sub-clusters (e.g., we are > not looking to federate across DC yet, though future investigations of this > are not excluded). > * We rely on HDFS federation (or equivalently scalable DFS solutions) to take > care of scalability of the store side. > {noformat} > Blank line should be inserted before itemization to render correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
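For reference, the fix is only a blank line between the lead-in and the bullets so Markdown renders them as a list rather than running them into the paragraph:

```markdown
Assumptions:

* We assume reasonably good connectivity across sub-clusters (e.g., we are
not looking to federate across DC yet, though future investigations of this
are not excluded).
* We rely on HDFS federation (or equivalently scalable DFS solutions) to take
care of scalability of the store side.
```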
[jira] [Commented] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
[ https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383040#comment-16383040 ] SammiChen commented on YARN-4858: - Hi [~ste...@apache.org], does this still target for 2.9.1? If not, can we push this out to next 2.9.2 release? > start-yarn and stop-yarn scripts to support timeline and sharedcachemanager > --- > > Key: YARN-4858 > URL: https://issues.apache.org/jira/browse/YARN-4858 > Project: Hadoop YARN > Issue Type: Improvement > Components: scripts >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Labels: oct16-easy > Attachments: YARN-4858-001.patch, YARN-4858-branch-2.001.patch > > > The start-yarn and stop-yarn scripts don't have any (even commented out) > support for the timeline and sharedcachemanager > Proposed: > * bash and cmd start-yarn scripts have commented out start actions > * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.
[ https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383031#comment-16383031 ] SammiChen commented on YARN-7450: - Hi [~raviprak], does this still target for 2.9.1? If not, can we push this out to next 2.9.2 release? > ATS Client should retry on intermittent Kerberos issues. > > > Key: YARN-7450 > URL: https://issues.apache.org/jira/browse/YARN-7450 > Project: Hadoop YARN > Issue Type: Improvement > Components: ATSv2 >Affects Versions: 2.7.3 > Environment: Hadoop-2.7.3 >Reporter: Ravi Prakash >Priority: Major > > We saw a stack trace (posted in the first comment) in the ResourceManager > logs for the TimelineClientImpl not being able to relogin from keytab. > I'm guessing there was an intermittent issue that failed the kerberos relogin > from keytab. However, I'm assuming this was *not* retried because I only saw > one instance of this stack trace. I propose that this operation should have > been retried. > It seems, this caused events at the ResourceManager to queue up and > eventually stop responding to even basic {{yarn application -list}} commands. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org