[ https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240899#comment-16240899 ]
Ravi Prakash commented on YARN-7450:
------------------------------------
{code}
2017-10-29 02:30:30,260 ERROR org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: Error when publishing entity [YARN_APPLICATION,application_1507181091525_3046]
com.sun.jersey.api.client.ClientHandlerException: java.io.IOException: Login failure for <SOME_PRINCIPAL>@<SOME_REALM> from keytab <SOME_KEYTAB_FILE>
at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:235)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:184)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:246)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:483)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:332)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:329)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1719)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:329)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:314)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.putEntity(SystemMetricsPublisher.java:452)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.publishApplicationCreatedEvent(SystemMetricsPublisher.java:265)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.handleSystemMetricsEvent(SystemMetricsPublisher.java:220)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:469)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:464)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Login failure for <SOME_PRINCIPAL>@<SOME_REALM> from keytab <SOME_KEYTAB_FILE>
at org.apache.hadoop.security.UserGroupInformation.reloginFromKeytab(UserGroupInformation.java:1109)
at org.apache.hadoop.security.UserGroupInformation.checkTGTAndReloginFromKeytab(UserGroupInformation.java:1042)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:500)
at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:159)
at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
... 23 more
Caused by: javax.security.auth.login.LoginException: Generic error (description in e-text) (60) - LOOKING_UP_CLIENT
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804)
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.hadoop.security.UserGroupInformation.reloginFromKeytab(UserGroupInformation.java:1101)
... 27 more
Caused by: KrbException: Generic error (description in e-text) (60) - LOOKING_UP_CLIENT
at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:82)
at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316)
at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361)
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776)
... 40 more
Caused by: KrbException: Identifier doesn't match expected value (906)
at sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
at sun.security.krb5.internal.ASRep.init(ASRep.java:64)
at sun.security.krb5.internal.ASRep.<init>(ASRep.java:59)
at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:60)
... 43 more
{code}
> ATS Client should retry on intermittent Kerberos issues.
> --------------------------------------------------------
>
> Key: YARN-7450
> URL: https://issues.apache.org/jira/browse/YARN-7450
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: ATSv2
> Affects Versions: 2.7.3
> Environment: Hadoop-2.7.3
> Reporter: Ravi Prakash
>
> We saw a stack trace (posted in the first comment) in the ResourceManager
> logs showing that TimelineClientImpl was unable to relogin from keytab.
> I'm guessing an intermittent network issue caused the Kerberos relogin
> from keytab to fail. However, I'm assuming this was *not* retried because I
> only saw one instance of this stack trace. I propose that this operation
> be retried.
> It seems this caused events at the ResourceManager to queue up, and the
> ResourceManager eventually stopped responding to even basic
> {{yarn application -list}} commands.
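
For illustration, the retry proposed in the description could look roughly like the sketch below. It is only a sketch under stated assumptions, not the actual Hadoop change: the class {{ReloginRetrySketch}}, the method {{retryOnIOException}}, and the parameters {{maxAttempts}} and {{backoffMillis}} are hypothetical names that do not exist in Hadoop. It assumes the transient failure surfaces as a {{java.io.IOException}}, as the "Login failure" in the trace does, and retries with a simple linear backoff.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Hadoop: retries an operation that may fail
// with a transient IOException (for example a Kerberos relogin from keytab).
final class ReloginRetrySketch {

  private ReloginRetrySketch() {}

  // Calls op up to maxAttempts times. Only IOException triggers a retry;
  // any other exception propagates immediately.
  static <T> T retryOnIOException(Callable<T> op, int maxAttempts, long backoffMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (IOException e) {
        last = e; // remember the failure in case every attempt fails
        if (attempt < maxAttempts) {
          // Linear backoff: wait longer after each failed attempt.
          Thread.sleep(backoffMillis * attempt);
        }
      }
    }
    // All attempts failed with IOException; rethrow the most recent one.
    throw last;
  }
}
```

In a real patch, the wrapped operation would presumably be the call seen in the trace, {{UserGroupInformation.checkTGTAndReloginFromKeytab()}}, and the attempt count and backoff would come from configuration rather than hard-coded arguments.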