Hi,

we have a Kerberos-secured cluster and are currently facing issues with Ambari Metrics. After starting Ambari Metrics everything is fine, but after a couple of days we get alerts from Ambari like this:

  NameNode Service RPC Processing Latency (Hourly)
  Unable to retrieve metrics from the Ambari Metrics service.

When I check the logs of the Metrics Collector I can find entries like:

  2018-03-28 11:19:47,013 WARN org.apache.hadoop.security.UserGroupInformation: Exception encountered while running the renewal command for amshbase/[email protected]. (TGT end time:1522228847000, renewalFailures: org.apache.hadoop.metrics2.lib.MutableGaugeInt@388f50cd,renewalFailuresTotal: org.apache.hadoop.metrics2.lib.MutableGaugeLong@7d8dc9b8)
  ExitCodeException exitCode=1: kinit: KDC can't fulfill requested option while renewing credentials
      at org.apache.hadoop.util.Shell.runCommand(Shell.java:954)
      at org.apache.hadoop.util.Shell.run(Shell.java:855)
      at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1163)
      at org.apache.hadoop.util.Shell.execCommand(Shell.java:1257)
      at org.apache.hadoop.util.Shell.execCommand(Shell.java:1239)
      at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:987)
      at java.lang.Thread.run(Thread.java:745)
  2018-03-28 11:19:47,014 ERROR org.apache.hadoop.security.UserGroupInformation: TGT is expired. Aborting renew thread for amshbase/[email protected].

After that I see aggregation errors:

  2018-03-28 11:27:08,188 INFO TimelineClusterAggregatorMinute: Started Timeline aggregator thread @ Wed Mar 28 11:27:08 CEST 2018
  2018-03-28 11:27:08,189 INFO TimelineClusterAggregatorMinute: Skipping aggregation function not owned by this instance.
  2018-03-28 11:27:08,205 ERROR TimelineMetricHostAggregatorHourly: Exception during aggregating metrics.
  java.sql.SQLTimeoutException: Operation timed out.
      at org.apache.phoenix.exception.SQLExceptionCode$14.newException(SQLExceptionCode.java:364)
      at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
      at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:831)

So this seems to be related to Kerberos. When I check the log of the KDC there is not much info:

  Mar 28 11:19:47 sql.cl.psiori.com krb5kdc[879](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 10.11.1.21: TICKET NOT RENEWABLE: authtime 0, amshbase/[email protected] for krbtgt/[email protected], KDC can't fulfill requested option
  ...
  Mar 28 11:20:48 sql.cl.psiori.com krb5kdc[879](info): AS_REQ (4 etypes {18 17 16 23}) 10.11.1.21: ISSUE: authtime 1522228848, etypes {rep=18 tkt=18 ses=18}, amshbase/[email protected] for krbtgt/[email protected]
  Mar 28 11:20:48 sql.cl.psiori.com krb5kdc[879](info): TGS_REQ (4 etypes {18 17 16 23}) 10.11.1.21: ISSUE: authtime 1522228848, etypes {rep=18 tkt=18 ses=18}, amshbase/[email protected] for nn/[email protected]

When I check the principal amshbase/[email protected] in the KDC I get the following:

  Principal: amshbase/[email protected]
  Expiration date: [never]
  Last password change: Mo Mär 19 11:24:05 CET 2018
  Password expiration date: [never]
  Maximum ticket life: 1 day 00:00:00
  Maximum renewable life: 0 days 00:00:00
  Last modified: Mo Mär 19 11:24:05 CET 2018 (admin/[email protected])
  Last successful authentication: [never]
  Last failed authentication: [never]
  Failed password attempts: 0
  Number of keys: 2
  Key: vno 1, aes256-cts-hmac-sha1-96
  Key: vno 1, aes128-cts-hmac-sha1-96
  MKey: vno 1
  Attributes:
  Policy: [none]

Is that normal?
Maximum renewable life is set to 0, so ticket renewal is not possible. But that is also true for all other principals in the KDC, and all other services work normally.

This is the content of krb5.conf:

  [libdefaults]
    renew_lifetime = 7d
    forwardable = true
    default_realm = PSIORI.COM
    ticket_lifetime = 24h
    dns_lookup_realm = false
    dns_lookup_kdc = false
    default_ccache_name = /tmp/krb5cc_%{uid}
    #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
    #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

  [domain_realm]
    .cl.psiori.com = PSIORI.COM
    cl.psiori.com = PSIORI.COM

  [logging]
    default = FILE:/var/log/krb5kdc.log
    admin_server = FILE:/var/log/kadmind.log
    kdc = FILE:/var/log/krb5kdc.log

  [realms]
    PSIORI.COM = {
      admin_server = sql.cl.psiori.com
      kdc = sql.cl.psiori.com
    }

I have not applied any changes to kdc.conf, so it has the default content:

  [kdcdefaults]
    kdc_ports = 88
    kdc_tcp_ports = 88

  [realms]
    EXAMPLE.COM = {
      #master_key_type = aes256-cts
      acl_file = /var/kerberos/krb5kdc/kadm5.acl
      dict_file = /usr/share/dict/words
      admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
      supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
    }

Is there any misconfiguration? When I restart the service, everything is fine again (for some time).

Any suggestions or help are very welcome.

Best regards,
Alex
