Wei-Chiu Chuang created HADOOP-15378:
----------------------------------------

             Summary: Hadoop client unable to relogin because a remote DataNode 
has an incorrect krb5.conf
                 Key: HADOOP-15378
                 URL: https://issues.apache.org/jira/browse/HADOOP-15378
             Project: Hadoop Common
          Issue Type: Bug
          Components: security
         Environment: CDH5.8.3, Kerberized, Impala
            Reporter: Wei-Chiu Chuang


This is a very weird bug.

We received a report that a Hadoop client (the Impala Catalog server) failed to 
relogin and crashed every few hours. The initial indication suggested the 
symptom matched HADOOP-13433.

However, after we patched HADOOP-13433 (as well as HADOOP-15143), the Impala 
Catalog server still kept crashing.

 
{noformat}
W0114 05:49:24.676743 41444 UserGroupInformation.java:1838] 
PriviledgedActionException as:impala/host1.example....@example.com 
(auth:KERBEROS) 
cause:org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): 
Failure to initialize security context
W0114 05:49:24.680363 41444 UserGroupInformation.java:1137] The first kerberos 
ticket is not TGT(the server principal is hdfs/host2.example....@example.com), 
remove and destroy it.
W0114 05:49:24.680501 41444 UserGroupInformation.java:1137] The first kerberos 
ticket is not TGT(the server principal is hdfs/host3.example....@example.com), 
remove and destroy it.
W0114 05:49:24.680593 41444 UserGroupInformation.java:1153] Warning, no 
kerberos ticket found while attempting to renew ticket{noformat}
The error “Failure to initialize security context” is suspicious here. Catalogd 
was unable to log in because of a Kerberos issue. The JDK expects the first 
Kerberos ticket of a principal to be a TGT; however, it seems that after this 
error, because the login did not succeed, the first ticket was no longer a TGT. 
The patch for HADOOP-13433 then removed the principal's other tickets, because 
it assumes a TGT is present among the principal's tickets, which is not true in 
this case. In the end, it removed all of the tickets.
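
For context, here is a minimal, simplified sketch of the kind of ticket-order 
cleanup described above: iterate the Subject's Kerberos tickets and drop any 
whose server principal is not a TGT. This is only an illustration of the 
behavior discussed here, not the actual UserGroupInformation#fixKerberosTicketOrder 
code; the class and method names are hypothetical, only the JDK APIs are real.
{code:java}
import java.util.Iterator;
import javax.security.auth.DestroyFailedException;
import javax.security.auth.Subject;
import javax.security.auth.kerberos.KerberosTicket;

public class TicketOrderSketch {
  // Drop every Kerberos ticket in the Subject whose server principal is not a
  // TGT (i.e. does not start with "krbtgt/"). If the Subject holds no TGT at
  // all, this loop ends up removing every ticket, which is the situation the
  // log above shows.
  static void dropNonTgtTickets(Subject subject) {
    Iterator<Object> it = subject.getPrivateCredentials().iterator();
    while (it.hasNext()) {
      Object cred = it.next();
      if (!(cred instanceof KerberosTicket)) {
        continue;
      }
      KerberosTicket ticket = (KerberosTicket) cred;
      if (!ticket.getServer().getName().startsWith("krbtgt/")) {
        it.remove(); // remove the credential from the Subject
        try {
          ticket.destroy(); // and destroy the ticket itself
        } catch (DestroyFailedException e) {
          // best effort; the ticket is already gone from the Subject
        }
      }
    }
  }
}
{code}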

And then
{noformat}
W0114 05:49:24.681946 41443 UserGroupInformation.java:1838] 
PriviledgedActionException as:impala/host1.example....@example.com 
(auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
[Caused by GSSException: No valid credentials provided (Mechanism level: Failed 
to find any Kerberos tgt)]
{noformat}
The error “Failed to find any Kerberos tgt” is typically an indication that the 
user’s Kerberos ticket has expired. However, that is definitely not the case 
here, since the ticket had been obtained only a little over 8 hours earlier.
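
As a sanity check, a TGT's lifetime can be inspected directly, either with 
klist on the host or programmatically from the login Subject. The sketch below 
uses only standard JDK APIs; the class and method names are hypothetical.
{code:java}
import javax.security.auth.Subject;
import javax.security.auth.kerberos.KerberosTicket;

public class TgtExpiryCheck {
  // Print the validity window of every TGT held by the login Subject, so that
  // ticket expiry can be ruled in or out as the cause of
  // "Failed to find any Kerberos tgt".
  static void printTgtLifetimes(Subject subject) {
    for (KerberosTicket t : subject.getPrivateCredentials(KerberosTicket.class)) {
      if (t.getServer().getName().startsWith("krbtgt/")) {
        System.out.println("TGT for " + t.getClient().getName()
            + " valid from " + t.getStartTime()
            + " until " + t.getEndTime()
            + ", current: " + t.isCurrent());
      }
    }
  }
}
{code}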

After we patched HADOOP-13433, the error handling code hit an NPE, as reported 
in HADOOP-15143.

 
{code:java}
I0114 05:50:26.758565 6384 RetryInvocationHandler.java:148] Exception while invoking listCachePools of class ClientNamenodeProtocolTranslatorPB over host4.example.com/10.0.121.66:8020 after 2 fail over attempts. Trying to fail over immediately. Java exception follows:
java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "host1.example.com/10.0.121.45"; destination host is: "host4.example.com":8020;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
    at org.apache.hadoop.ipc.Client.call(Client.java:1506)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy9.listCachePools(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.listCachePools(ClientNamenodeProtocolTranslatorPB.java:1261)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy10.listCachePools(Unknown Source)
    at org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:55)
    at org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:33)
    at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequest(BatchedRemoteIterator.java:77)
    at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequestIfNeeded(BatchedRemoteIterator.java:85)
    at org.apache.hadoop.fs.BatchedRemoteIterator.hasNext(BatchedRemoteIterator.java:99)
    at com.cloudera.impala.catalog.CatalogServiceCatalog$CachePoolReader.run(CatalogServiceCatalog.java:193)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Couldn't set up IO streams
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:826)
    at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1555)
    at org.apache.hadoop.ipc.Client.call(Client.java:1478)
    ... 23 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.security.UserGroupInformation.fixKerberosTicketOrder(UserGroupInformation.java:1136)
    at org.apache.hadoop.security.UserGroupInformation.reloginFromTicketCache(UserGroupInformation.java:1272)
    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:697)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1835)
    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:681)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:769)
    ... 26 more
{code}
 

In any case, HADOOP-15143 does not fix the problem, since the valid credentials 
had already been removed from the user's UGI.

Also, even after HADOOP-13433 was applied, I still saw the following error:
{noformat}
W0113 14:33:44.254727 1255277 UserGroupInformation.java:1838] 
PriviledgedActionException as:impala/host1.example....@example.com 
(auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
[Caused by GSSException: No valid credentials provided (Mechanism level: The 
ticket isn't for us (35) - BAD TGS SERVER NAME)]
{noformat}
 

We traced the SaslException back to a particular remote DataNode. It turned out 
that the DataNode's host had an incorrect krb5.conf, which pointed to a 
decommissioned KDC, inconsistent with the rest of the hosts in the cluster. 
Upon authentication, it seems the error handling for the SaslException has a 
race condition, and occasionally it removes a valid credential from the UGI. 
When we corrected the krb5.conf on that host, this error disappeared.
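
For reference, the part of krb5.conf that matters here is the realm-to-KDC 
mapping, which must be consistent on every host in the cluster and point only 
to live KDCs. A hypothetical example of such a mapping is below; the realm and 
KDC host names are placeholders, not values from this incident.
{noformat}
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc1.example.com
        kdc = kdc2.example.com
        admin_server = kdc1.example.com
    }
{noformat}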

This is not easy to reproduce, but we are told the issue seems to occur after an 
Impala user performs a "refresh" command. Presumably this command forces the 
Impala catalog server to connect to the problematic DataNode and triggers the 
buggy code.

 

I don't have a patch to fix it, but I want to raise this issue for discussion.


