[ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-7450:
-------------------------------
    Description: 
We saw a stack trace (posted in the first comment) in the ResourceManager logs 
for the TimelineClientImpl not being able to relogin from keytab.

I'm guessing there was an intermittent issue that caused the Kerberos relogin 
from keytab to fail. However, I'm assuming this was *not* retried because I 
only saw one instance of this stack trace.  I propose that this operation be 
retried.
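
To make the proposal concrete, here is a minimal sketch of a bounded retry with exponential backoff around an operation that can fail with an IOException. It is not taken from the Hadoop source: the class name RetryingRelogin, the retry helper, and its maxAttempts and initialDelayMs parameters are hypothetical, and the failing operation in main is simulated. In a real client the callable would presumably wrap the keytab relogin (for example UserGroupInformation#checkTGTAndReloginFromKeytab, which throws IOException), but whether that is the right place to retry is an assumption.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryingRelogin {

    /**
     * Hypothetical helper: runs op and retries when it throws IOException,
     * up to maxAttempts times in total, doubling the delay between attempts.
     * Any other exception is propagated immediately without a retry.
     */
    public static <T> T retry(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs); // back off before the next attempt
                    delayMs *= 2;
                }
            }
        }
        // All attempts failed: surface the last failure to the caller.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated relogin that fails twice and then succeeds (assumption:
        // a real caller would invoke the Kerberos relogin here instead).
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated relogin failure");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Bounding the number of attempts matters here: an unbounded retry inside the event-handling path could stall the ResourceManager in the same way the original failure apparently did.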

It seems this caused events at the ResourceManager to queue up, and the 
ResourceManager eventually stopped responding to even basic 
{{yarn application -list}} commands.

  was:
We saw a stack trace (posted in the first comment) in the ResourceManager logs 
for the TimelineClientImpl not being able to relogin from keytab.

I'm guessing there was an intermittent network issue that caused the Kerberos 
relogin from keytab to fail. However, I'm assuming this was *not* retried 
because I only saw one instance of this stack trace.  I propose that this 
operation be retried.

It seems this caused events at the ResourceManager to queue up, and the 
ResourceManager eventually stopped responding to even basic 
{{yarn application -list}} commands.


> ATS Client should retry on intermittent Kerberos issues.
> --------------------------------------------------------
>
>                 Key: YARN-7450
>                 URL: https://issues.apache.org/jira/browse/YARN-7450
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: ATSv2
>    Affects Versions: 2.7.3
>         Environment: Hadoop-2.7.3
>            Reporter: Ravi Prakash
>
> We saw a stack trace (posted in the first comment) in the ResourceManager 
> logs for the TimelineClientImpl not being able to relogin from keytab.
> I'm guessing there was an intermittent issue that caused the Kerberos relogin 
> from keytab to fail. However, I'm assuming this was *not* retried because I 
> only saw one instance of this stack trace.  I propose that this operation be 
> retried.
> It seems this caused events at the ResourceManager to queue up, and the 
> ResourceManager eventually stopped responding to even basic 
> {{yarn application -list}} commands.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
