All,

We took Sumit’s advice and enabled sssd’s debug level 7 on the “domain”
section of sssd.conf, on about 2300 non-prod Linux servers.

FYI – beware if you do this!  We found cases where the
sssd_amer.company.com_log file grew to 8 GB within 24 hrs.  So you’ll likely
have to fine-tune your sssd logrotate file, or take even more drastic action.

RECAP:  Randomly on 0.24% of our Linux servers, after 30 days sssd will
drop off the AD domain.  We find this occurs during the automatic Kerberos
Host Keytab renewal.  The KVNO number in AD is one more than the latest
KVNO number in /etc/krb5.keytab file.
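
The mismatch can be confirmed from the client side by listing the keytab
with 'klist -k' and comparing the highest KVNO against the
msDS-KeyVersionNumber value that AD reports.  A minimal Python sketch,
assuming the usual MIT 'klist -k' layout (KVNO first, principal last); the
function names and the sample text are illustrative only:

```python
import re


def max_keytab_kvno(klist_output):
    """Return the highest KVNO found in `klist -k` style output."""
    kvnos = []
    for line in klist_output.splitlines():
        # Entry lines start with a number and end with a principal@REALM;
        # the "Keytab name", "KVNO Principal" and dashed header lines do not.
        m = re.match(r"\s*(\d+)\s+.*@", line)
        if m:
            kvnos.append(int(m.group(1)))
    if not kvnos:
        raise ValueError("no keytab entries found")
    return max(kvnos)


def keytab_is_stale(klist_output, ad_kvno):
    """True when AD holds a newer key version than the local keytab."""
    return max_keytab_kvno(klist_output) < ad_kvno


# Hypothetical sample modeled on the host discussed in this thread.
SAMPLE = """\
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- ----------------------------------------------------------------
  16 host/kewnlr2cu2app01.amer.company.com@AMER.COMPANY.COM
  17 host/kewnlr2cu2app01.amer.company.com@AMER.COMPANY.COM
"""

if __name__ == "__main__":
    print(max_keytab_kvno(SAMPLE), keytab_is_stale(SAMPLE, ad_kvno=18))
```

The AD-side number still has to be fetched separately (for example with
the get-adcomputer query shown further down); the sketch only covers the
comparison.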

Thanks to sssd debug level 7, we now have verbose ‘adcli update’ output in
our sssd_<domain>.company.com_log files for two such culprits.  The output
shows the same error condition in both cases.  Here is example output:

(2021-09-28  3:44:23): [be[amer.company.com]]
[ad_machine_account_password_renewal_done] (0x1000): --- adcli output
start---

 * Found realm in keytab: AMER.COMPANY.COM

 * Found computer name in keytab: KEWNLR2CU2APP01

 * Found service principal in keytab: host/kewnlr2cu2app01.amer.company.com

 * Found host qualified name in keytab: kewnlr2cu2app01.amer.company.com

 * Found service principal in keytab: host/KEWNLR2CU2APP01

 * Found service principal in keytab: RestrictedKrbHost/KEWNLR2CU2APP01

 * Found service principal in keytab: RestrictedKrbHost/kewnlr2cu2app01.amer.company.com

 * Using fully qualified name: kewnlr2cu2app01.amer.company.com

 * Using domain name: amer.company.com

 * Using computer account name: KEWNLR2CU2APP01

 * Using domain realm: amer.company.com

 * Sending NetLogon ping to domain controller:
AUSDC16AMER23.amer.company.com

 * Received NetLogon info from: AUSDC16AMER23.amer.company.com

 * Wrote out krb5.conf snippet to
/tmp/adcli-krb5-HRsQ9K/krb5.d/adcli-krb5-conf-yBNrRI

 * Authenticated as default/reset computer account: KEWNLR2CU2APP01

 * Using GSS-SPNEGO for SASL bind

 * Looked up short domain name: AMERICAS

 * Looked up domain SID: S-1-5-21-1802859667-647903414-1863928812

 * Using fully qualified name: kewnlr2cu2app01.amer.company.com

 * Using domain name: amer.company.com

 * Using computer account name: KEWNLR2CU2APP01

 * Using domain realm: amer.company.com

 * Using fully qualified name: kewnlr2cu2app01.amer.company.com

 * Enrolling computer name: KEWNLR2CU2APP01

 * Generated 120 character computer password

 * Using keytab: FILE:/etc/krb5.keytab

 * Found computer account for KEWNLR2CU2APP01$ at:
CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com

 * Retrieved kvno '17' for computer account in directory:
CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com

 * Sending NetLogon ping to domain controller:
AUSDC16AMER23.amer.company.com

 * Received NetLogon info from: AUSDC16AMER23.amer.company.com

 ! Cannot change computer password: Authentication error

adcli: updating membership with domain amer.company.com failed: Cannot
change computer password: Authentication error

---adcli output end---
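
With thousands of hosts logging at this level, it helps to pull only the
failed renewals out of the domain log.  The following Python sketch is one
possible way to do that, assuming the start/end markers, the
"Received NetLogon info from:" lines and the '!' error prefix look like the
excerpt above.  The function name and the returned fields are illustrative
choices, not something sssd or adcli provides:

```python
import re

TS_RE = re.compile(r"\((\d{4}-\d{2}-\d{2})\s+(\d{1,2}:\d{2}:\d{2})\)")
START_RE = re.compile(r"-+\s*adcli\s+output\s+start\s*-+")
END_RE = re.compile(r"-+\s*adcli\s+output\s+end\s*-+")
DC_RE = re.compile(r"Received NetLogon info from:\s*(\S+)")
ERR_RE = re.compile(r"^[ \t]*![ \t]*(.+)$", re.MULTILINE)


def failed_adcli_runs(log_text):
    """Return one dict per adcli output block that contains a '!' error line."""
    failures = []
    pos = 0
    while True:
        start = START_RE.search(log_text, pos)
        if start is None:
            break
        end = END_RE.search(log_text, start.end())
        if end is None:
            break  # truncated block; stop rather than guess
        body = log_text[start.end():end.start()]
        errors = ERR_RE.findall(body)
        if errors:
            # Nearest sssd timestamp preceding the start marker.
            stamps = TS_RE.findall(log_text, pos, start.start())
            dcs = DC_RE.findall(body)
            failures.append({
                "timestamp": " ".join(stamps[-1]) if stamps else None,
                "last_dc": dcs[-1] if dcs else None,
                "error": errors[0].strip(),
            })
        pos = end.end()
    return failures
```

Only the first line of each error is kept, so a message that was wrapped
onto a second line is reported truncated.  The "last_dc" field is simply
the last DC that answered a NetLogon ping inside the block, which makes it
easy to see whether failing runs cluster on particular domain controllers.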



Within 1.5 mins of the above, we receive errors in /var/log/messages as
below:

Sep 28 03:45:51 kewnlr2cu2app01 sssd[ldap_child[288005]][288005]: Failed to
initialize credentials using keytab [MEMORY:/etc/krb5.keytab]:
Preauthentication failed. Unable to create GSSAPI-encrypted LDAP connection.

We verified that the latest KVNO in the /etc/krb5.keytab file is still 17,
while in AD the KVNO is now 18.  Also, the time of the last password change
in AD corresponds exactly to the adcli timestamp above:

PS C:\Users\spike_white> get-adcomputer kewnlr2cu2app01 -Property
'PasswordLastSet'





DistinguishedName :
CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com

DNSHostName       : kewnlr2cu2app01.amer.company.com

...

Name              : KEWNLR2CU2APP01

ObjectClass       : computer

...

PasswordLastSet   : 9/28/2021 3:44:23 AM

SamAccountName    : KEWNLR2CU2APP01$

...

UserPrincipalName : host/kewnlr2cu2app01.amer.company....@amer.company.com



PS C:\Users\spike_white> get-adcomputer kewnlr2cu2app01 -property
msDS-KeyVersionNumber





DistinguishedName     :
CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com

DNSHostName           : kewnlr2cu2app01.amer.company.com

...

msDS-KeyVersionNumber : 18
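
The timestamp correlation above can also be checked mechanically.  A small
Python sketch, assuming both timestamps are expressed in the same time zone
(the values above suggest so, but that is worth verifying per site); the
helper names and the 60-second tolerance are illustrative:

```python
import re
from datetime import datetime, timedelta


def parse_sssd_stamp(text):
    """Parse an sssd log timestamp such as '2021-09-28  3:44:23'."""
    y, mo, d, h, mi, s = map(int, re.findall(r"\d+", text))
    return datetime(y, mo, d, h, mi, s)


def parse_ad_stamp(text):
    """Parse a US-style PowerShell timestamp such as '9/28/2021 3:44:23 AM'."""
    m = re.fullmatch(
        r"\s*(\d+)/(\d+)/(\d+)\s+(\d+):(\d+):(\d+)\s*([AP]M)\s*",
        text,
        re.IGNORECASE,
    )
    if m is None:
        raise ValueError("unrecognized timestamp: %r" % text)
    mo, d, y, h, mi, s = (int(g) for g in m.groups()[:6])
    # Convert the 12-hour clock to 24-hour (12 AM -> 0, 12 PM -> 12).
    h = h % 12 + (12 if m.group(7).upper() == "PM" else 0)
    return datetime(y, mo, d, h, mi, s)


def same_event(sssd_stamp, ad_stamp, tolerance=timedelta(seconds=60)):
    """True when the two timestamps are within `tolerance` of each other."""
    delta = abs(parse_sssd_stamp(sssd_stamp) - parse_ad_stamp(ad_stamp))
    return delta <= tolerance
```

For the host above, same_event("2021-09-28  3:44:23", "9/28/2021 3:44:23 AM")
evaluates to True.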



Of course, after this, the adcli output in the sssd_<domain>.company.com_log
file continues to show Kerberos pre-authentication errors, because adcli
update is now using the old machine account password while AD has the new
one:



(2021-09-28  4:13:42): [be[amer.company.com]]
[ad_machine_account_password_renewal_done] (0x1000): --- adcli output
start---

 * Found realm in keytab: AMER.COMPANY.COM

 * Found computer name in keytab: KEWNLR2CU2APP01

 * Found service principal in keytab: host/kewnlr2cu2app01.amer.company.com

 * Found host qualified name in keytab: kewnlr2cu2app01.amer.company.com

 * Found service principal in keytab: host/KEWNLR2CU2APP01

 * Found service principal in keytab: RestrictedKrbHost/KEWNLR2CU2APP01

 * Found service principal in keytab: RestrictedKrbHost/kewnlr2cu2app01.amer.company.com

 * Using fully qualified name: kewnlr2cu2app01.amer.company.com

 * Using domain name: amer.company.com

 * Using computer account name: KEWNLR2CU2APP01

 * Using domain realm: amer.company.com

 * Discovering domain controllers: _ldap._tcp.amer.company.com

 * Sending NetLogon ping to domain controller:
RDUDC16AMER04.amer.company.com

 * Received NetLogon info from: RDUDC16AMER04.amer.company.com

 * Discovering site domain controllers: _ldap._tcp.AMERAustin._sites.dc._msdcs.amer.company.com

 * Sending NetLogon ping to domain controller:
AUSDC16AMER34.amer.company.com

 * Received NetLogon info from: AUSDC16AMER34.amer.company.com

 * Wrote out krb5.conf snippet to
/tmp/adcli-krb5-i7P6zR/krb5.d/adcli-krb5-conf-vkBoqT

 ! Couldn't authenticate as machine account: KEWNLR2CU2APP01:
Preauthentication failed

adcli: couldn't connect to amer.company.com domain: Couldn't authenticate
as machine account: KEWNLR2CU2APP01: Preauthentication failed

---adcli output end---



In summary, for some reason adcli update, after attempting to set the
machine account password, thinks that it has received a Kerberos
authentication error.  In fact, it has successfully set the machine
account password in AD.  Because it thinks the password change failed, it
does not update the entries in the local /etc/krb5.keytab file.



We have our AD admins examining the AD domain controller logs now (since we
have an exact DC name, exact time and exact client FQDN above).



At this point, we’re unsure whether this is an adcli problem or an AD
problem.



Does adcli update attempt to authenticate back to the same AD DC with the
new password?  Or does it randomly pick an AD DC to authenticate back to
with the new password?



Spike White

On Wed, Aug 25, 2021 at 10:32 AM Spike White <spikewhit...@gmail.com> wrote:

> Sssd experts,
>
> *Short summary: * How can we troubleshoot sssd’s ‘Automatic Kerberos Host
> Keytab Renewal’ process?    We have ~0.4%  of our Linux servers dropping
> off the AD domain monthly.
>
> *Longer explanation:*
>
> Over the past two years, we have on-boarded sssd as our Linux AD
> integration component.  Largely displacing a former commercial product that
> did the same.
>
> We have ~20K Linux servers that are sssd-enabled.  A mix of RHEL6,
> RHEL7, RHEL8, Oracle Linux 6, 7 and 8.   We have ~7K Linux servers still on
> the old commercial product.  (For certain edge-case scenarios, such as
> DMZs, the commercial product works better.)
>
> Our AD forest is a single AD forest, with 4 regional child domains.  All
> with transitive trust.  Sssd auto-discovers parent domain and all 4 child
> domains, no problem – whenever it’s adcli joined to its regional local
> domain.
>
> Why am I writing this?
>
> Because we are researching an ongoing problem reported by L1 server ops.
> About 70 – 80 sssd-enabled Linux servers / month drop off the domain.  Out
> of our current sssd-enabled population of ~20K servers, that’s not
> horrible.  But still it should be better.  (Our former commercial product
> did better.)
>
> It’s not limited to one particular OS, OS version, build location or
> region.  We have surveyed; it seems to occur randomly among all OS
> versions, regions and locations.
>
> To be clear, it’s extremely likely that this behavior arises from some
> subtle misconfiguration on our part – not from any sssd or adcli or
> Kerberos bug.  We have a couple of configuration improvements we’re
> pursuing.  (Kerberos max ticket lifetime mismatch between AD and
> /etc/krb5.conf file for instance.)
>
> We are taking sssd’s default settings for
> ad_maximum_machine_account_password_age and
> ad_machine_account_password_renewal_opts.   So after 30 days, sssd will
> attempt daily to renew the host Kerberos keytab file.  It should re-attempt
> daily if not renewed.  By company policy, our AD disables any machine
> accounts that have not renewed their credentials in 40 days.   So when we
> find servers that have dropped off the domain, it’s because they have not
> renewed their AD machine accounts in 40 days.
>
> We have had SRs open with our OS vendors (Red Hat and Oracle respectively)
> for months now, to no great help.  (They gave a few suggestions, but none
> panned out.)
>
> We thought we were hitting this bug:
>
> https://github.com/SSSD/sssd/issues/4762
>
> But packet captures proved that adcli update is using TCP on RHEL7/8.
> Thus, this might be a potential problem, but only on RHEL6.  (BTW
> ‘udp_preference_limit = 0’ doesn’t force use of TCP for the kpasswd
> invocation in RHEL6 – it still uses UDP.  Thus, the recommended work-around
> for this bug doesn’t work.)
>
> So that isn’t our underlying problem.
>
> We’re at a loss now – as you can see, we’re grasping at straws.
>
> How can we troubleshoot sssd’s ‘automatic Kerberos Host keytab renewal’
> process?  Whenever we inspect a particular server it works.  We can’t run
> all sssd clients at debug level 9;  it fills up /var/log filesystem after a
> few days of that.  We’re interested in troubleshooting that one particular
> sssd process on all clients;  not all parts of sssd.
>
> Other than a steep learning curve (on our part), obscure situations (like
> DMZ auto-discovery of AD controllers) and exotic scenarios (like above),
> we’re quite happy with our 2 yr journey of direct AD integration with
> sssd.    Obviously, the troubleshooting tools on RHEL6 are very minimal.
> But certainly, overall the quality of sssd on RHEL7/8 is excellent.  AD
> integration has innumerable devils in the details; I’m amazed that sssd
> performs as well as it does against our multi-domain forest.
>
> Spike
>
> PS the problem with sssd auto-discovery of AD controllers in DMZs has been
> fixed in a recent sssd release.  The better discovery algorithm was
> implemented – same one used by Windows clients and commercial products.
> It’s just that that recent sssd version is not available on RHEL7 or 8.
>
>
>
>
>