[jira] [Commented] (HADOOP-15378) Hadoop client unable to relogin because a remote DataNode has an incorrect krb5.conf

2018-04-16 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439204#comment-16439204
 ] 

Steve Loughran commented on HADOOP-15378:
-

bq. unfortunately current CDH5 doesn't have KDiag (I thought of backporting it 
but I forgot).

There's a self-contained version of KDiag designed to run against older Hadoop 
versions: https://github.com/steveloughran/kdiag

Grab it, build it against CDH, and share it with the support team. They'll appreciate it.

> Hadoop client unable to relogin because a remote DataNode has an incorrect 
> krb5.conf
> 
>
> Key: HADOOP-15378
> URL: https://issues.apache.org/jira/browse/HADOOP-15378
> Project: Hadoop Common
> Issue Type: Bug
> Components: security
> Affects Versions: 2.6.0
> Environment: CDH5.8.3, Kerberized, Impala
> Reporter: Wei-Chiu Chuang
> Priority: Critical
>
> This is a very weird bug.
> We received a report where a Hadoop client (Impala Catalog server) failed to 
> relogin and crashed every few hours. Initial indications suggested the 
> symptom matched HADOOP-13433.
> But after we patched HADOOP-13433 (as well as HADOOP-15143), the Impala Catalog 
> server still kept crashing.
>  
> {noformat}
> W0114 05:49:24.676743 41444 UserGroupInformation.java:1838] 
> PriviledgedActionException as:impala/host1.example@example.com 
> (auth:KERBEROS) 
> cause:org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException):
>  Failure to initialize security context
> W0114 05:49:24.680363 41444 UserGroupInformation.java:1137] The first 
> kerberos ticket is not TGT(the server principal is 
> hdfs/host2.example@example.com), remove and destroy it.
> W0114 05:49:24.680501 41444 UserGroupInformation.java:1137] The first 
> kerberos ticket is not TGT(the server principal is 
> hdfs/host3.example@example.com), remove and destroy it.
> W0114 05:49:24.680593 41444 UserGroupInformation.java:1153] Warning, no 
> kerberos ticket found while attempting to renew ticket
> {noformat}
> The error “Failure to initialize security context” is suspicious here. 
> Catalogd was unable to log in because of a Kerberos issue. The JDK expects 
> the first Kerberos ticket of a principal to be a TGT; however, it seems that 
> after this error, because the login did not succeed, the first 
> ticket was no longer a TGT. The HADOOP-13433 patch removed the principal’s 
> other tickets because it expects a TGT to be among the principal’s tickets, 
> which was untrue in this case. So in the end, it removed all tickets.
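
The behaviour described above can be illustrated with a small model. This is only a sketch, not the Hadoop source: tickets are reduced to their server principal strings, and the class and method names (`TicketOrderSketch`, `dropLeadingServiceTickets`) are hypothetical. It assumes the cleanup walks the tickets in order and discards each one until it reaches a TGT, which is consistent with the "remove and destroy it" warnings in the log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Simplified model of a "make the TGT the first ticket" cleanup; not Hadoop code. */
final class TicketOrderSketch {

    /** A TGT's server principal has the form krbtgt/REALM@REALM. */
    static boolean isTgt(String serverPrincipal) {
        return serverPrincipal.startsWith("krbtgt/");
    }

    /**
     * Removes tickets from the front of the list until a TGT comes first.
     * Returns true if a TGT was found, false if every ticket was removed.
     */
    static boolean dropLeadingServiceTickets(List<String> serverPrincipals) {
        Iterator<String> it = serverPrincipals.iterator();
        while (it.hasNext()) {
            String principal = it.next();
            if (isTgt(principal)) {
                return true; // a TGT is now first, stop removing
            }
            it.remove(); // a service ticket ahead of the TGT is discarded
        }
        return false; // no TGT at all: the list is now empty
    }

    public static void main(String[] args) {
        // Only service tickets are present, as in the log excerpt above.
        List<String> tickets = new ArrayList<>(Arrays.asList(
                "hdfs/host2.example@example.com",
                "hdfs/host3.example@example.com"));
        boolean tgtFound = dropLeadingServiceTickets(tickets);
        System.out.println("TGT found: " + tgtFound + ", tickets left: " + tickets.size());
    }
}
```

With no TGT among the tickets, the loop discards both service tickets and reports that none is left, which matches the sequence of warnings shown in the log.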
> And then
> {noformat}
> W0114 05:49:24.681946 41443 UserGroupInformation.java:1838] 
> PriviledgedActionException as:impala/host1.example@example.com 
> (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos tgt)]
> {noformat}
> The error “Failed to find any Kerberos tgt” typically indicates that 
> the user’s Kerberos ticket has expired. However, that’s definitely not the 
> case here, since only a little over 8 hours had elapsed.
> After we patched HADOOP-13433, the error handling code threw an NPE, as 
> reported in HADOOP-15143.
>  
> {code:java}
> I0114 05:50:26.758565 6384 RetryInvocationHandler.java:148] Exception while 
> invoking listCachePools of class ClientNamenodeProtocolTranslatorPB over 
> host4.example.com/10.0.121.66:8020 after 2 fail over attempts. Trying to fail 
> over immediately. Java exception follows: java.io.IOException: Failed on 
> local exception: java.io.IOException: Couldn't set up IO streams; Host 
> Details : local host is: "host1.example.com/10.0.121.45"; destination host 
> is: "host4.example.com":8020; at 
> org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1506) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1439) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>  at com.sun.proxy.$Proxy9.listCachePools(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.listCachePools(ClientNamenodeProtocolTranslatorPB.java:1261)
>  at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>  at com.sun.proxy.$Proxy10.listCachePools(Unknown Source) at 
> org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:55)
>  at 

[jira] [Commented] (HADOOP-15378) Hadoop client unable to relogin because a remote DataNode has an incorrect krb5.conf

2018-04-13 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437545#comment-16437545
 ] 

Wei-Chiu Chuang commented on HADOOP-15378:
--

Thank you, Steve. BTW your book is valuable in troubleshooting this issue.

No, unfortunately current CDH5 doesn't have KDiag (I thought of backporting 
it but I forgot).
We did ask for the JDK Kerberos debug output and the Hadoop debug log, but we 
corrected the invalid krb5.conf before debug logging was put in place.
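
For anyone who needs to capture the JDK Kerberos debug output in a similar case, here is a minimal sketch. `sun.security.krb5.debug` and `sun.security.spnego.debug` are JDK system properties; the class name `KerberosDebugFlags` is hypothetical. The sketch assumes the properties must be set before any Kerberos class is loaded, so passing them as `-D` options on the JVM command line is usually the safer route.

```java
/** Turns on JDK-level Kerberos tracing; a sketch, not part of Hadoop. */
final class KerberosDebugFlags {

    /** Call this as early as possible, before any login or GSS call is made. */
    static void enable() {
        // JDK Kerberos protocol tracing, printed to standard output.
        System.setProperty("sun.security.krb5.debug", "true");
        // SPNEGO negotiation tracing.
        System.setProperty("sun.security.spnego.debug", "true");
    }

    public static void main(String[] args) {
        enable();
        System.out.println("krb5 debug = " + System.getProperty("sun.security.krb5.debug"));
    }
}
```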


[jira] [Commented] (HADOOP-15378) Hadoop client unable to relogin because a remote DataNode has an incorrect krb5.conf

2018-04-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432583#comment-16432583
 ] 

Steve Loughran commented on HADOOP-15378:
-

This is bizarre even in the category of bizarre Kerberos errors. Really great 
to have you share this. Happy to have a section on it in 
https://github.com/steveloughran/kerberos_and_hadoop ; maybe just a link to 
this in the tales of "weird things", with "The ticket isn't for us" getting a 
callout in error messages.

Have you thought of running KDiag on the system to see what it showed up, and 
whether it could be improved? Maybe something to check the auth status of IPC 
endpoints: give it a list of endpoints and it'll try to handshake with all of 
them, without bothering to actually say anything afterwards. Could be 
parallelisable.
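
A rough sketch of how such a parallel endpoint check could be structured is shown here. It is not KDiag code: `EndpointProbeSketch` and `probeAll` are hypothetical names, and the handshake is passed in as a plain predicate so the example runs without a cluster. A real check would have to perform the SASL/Kerberos handshake against each IPC endpoint in its place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Probes a list of endpoints in parallel; the handshake itself is a placeholder. */
final class EndpointProbeSketch {

    /** Runs the handshake against every endpoint and records whether it succeeded. */
    static Map<String, Boolean> probeAll(List<String> endpoints,
                                         Predicate<String> handshake,
                                         int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String endpoint : endpoints) {
                tasks.add(() -> handshake.test(endpoint));
            }
            // invokeAll blocks until every probe has finished.
            List<Future<Boolean>> futures = pool.invokeAll(tasks);
            Map<String, Boolean> results = new LinkedHashMap<>();
            for (int i = 0; i < endpoints.size(); i++) {
                boolean ok;
                try {
                    ok = futures.get(i).get();
                } catch (ExecutionException e) {
                    ok = false; // a handshake that throws counts as a failure
                }
                results.put(endpoints.get(i), ok);
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("probe interrupted", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder handshake: pretend that host2 is the misconfigured node.
        Predicate<String> fakeHandshake = endpoint -> !endpoint.startsWith("host2");
        List<String> endpoints = Arrays.asList(
                "host2.example.com:8020", "host3.example.com:8020", "host4.example.com:8020");
        probeAll(endpoints, fakeHandshake, 3)
                .forEach((endpoint, ok) -> System.out.println(endpoint + " -> " + (ok ? "OK" : "FAILED")));
    }
}
```

Keeping the handshake behind a predicate means the fan-out logic can be tested on its own, and a node with a broken krb5.conf would show up as a single failing entry in the result map.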


[jira] [Commented] (HADOOP-15378) Hadoop client unable to relogin because a remote DataNode has an incorrect krb5.conf

2018-04-10 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432530#comment-16432530
 ] 

Wei-Chiu Chuang commented on HADOOP-15378:
--

[~Apache9] I would appreciate it if you could also look at this one, since you 
were the author of HADOOP-13433.

> Hadoop client unable to relogin because a remote DataNode has an incorrect 
> krb5.conf
> 
>
> Key: HADOOP-15378
> URL: https://issues.apache.org/jira/browse/HADOOP-15378
> Project: Hadoop Common
> Issue Type: Bug
> Components: security
> Environment: CDH5.8.3, Kerberized, Impala
> Reporter: Wei-Chiu Chuang
> Priority: Critical
>
> This is a very weird bug.
> We received a report where a Hadoop client (Impala Catalog server) failed to 
> relogin and crashed every few hours. Initial indications suggested the 
> symptom matched HADOOP-13433.
> But after we patched HADOOP-13433 (as well as HADOOP-15143), the Impala Catalog 
> server still kept crashing.
>  
> {noformat}
> W0114 05:49:24.676743 41444 UserGroupInformation.java:1838] 
> PriviledgedActionException as:impala/host1.example@example.com 
> (auth:KERBEROS) 
> cause:org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException):
>  Failure to initialize security context
> W0114 05:49:24.680363 41444 UserGroupInformation.java:1137] The first 
> kerberos ticket is not TGT(the server principal is 
> hdfs/host2.example@example.com), remove and destroy it.
> W0114 05:49:24.680501 41444 UserGroupInformation.java:1137] The first 
> kerberos ticket is not TGT(the server principal is 
> hdfs/host3.example@example.com), remove and destroy it.
> W0114 05:49:24.680593 41444 UserGroupInformation.java:1153] Warning, no 
> kerberos ticket found while attempting to renew ticket
> {noformat}
> The error “Failure to initialize security context” is suspicious here. 
> Catalogd was unable to log in because of a Kerberos issue. The JDK expects 
> the first Kerberos ticket of a principal to be a TGT; however, it seems that 
> after this error, because the login did not succeed, the first 
> ticket was no longer a TGT. The HADOOP-13433 patch removed the principal’s 
> other tickets because it expects a TGT to be among the principal’s tickets, 
> which was untrue in this case. So in the end, it removed all tickets.
> And then
> {noformat}
> W0114 05:49:24.681946 41443 UserGroupInformation.java:1838] 
> PriviledgedActionException as:impala/host1.example@example.com 
> (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos tgt)]
> {noformat}
> The error “Failed to find any Kerberos tgt” typically indicates that 
> the user’s Kerberos ticket has expired. However, that’s definitely not the 
> case here, since only a little over 8 hours had elapsed.
> After we patched HADOOP-13433, the error handling code threw an NPE, as 
> reported in HADOOP-15143.
>  
> {code:java}
> I0114 05:50:26.758565 6384 RetryInvocationHandler.java:148] Exception while 
> invoking listCachePools of class ClientNamenodeProtocolTranslatorPB over 
> host4.example.com/10.0.121.66:8020 after 2 fail over attempts. Trying to fail 
> over immediately. Java exception follows: java.io.IOException: Failed on 
> local exception: java.io.IOException: Couldn't set up IO streams; Host 
> Details : local host is: "host1.example.com/10.0.121.45"; destination host 
> is: "host4.example.com":8020; at 
> org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1506) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1439) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>  at com.sun.proxy.$Proxy9.listCachePools(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.listCachePools(ClientNamenodeProtocolTranslatorPB.java:1261)
>  at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>  at com.sun.proxy.$Proxy10.listCachePools(Unknown Source) at 
> org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:55)
>  at 
> org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:33)
>  at 
> org.apache.hadoop.fs.BatchedRemoteIterator.makeRequest(BatchedRemoteIterator.java:77)
>  at 
>