On 10/7/21 12:01, Spike White wrote:
FYI -- update on this situation.

The AD DC logs are no help.  They show the exact same response sent back for a successful machine account password renewal as for a failed renewal.

One of the AD administrators has identified a particular AD DC NIC teaming configuration that they state has caused problems with Kerberos in the past.  It's on a small percentage of their AD DCs and they will work to correct it.  They will keep us apprised of progress.

I'm skeptical that's the underlying root cause -- for two reasons:
1.  If Kerberos were sensitive to this, it should affect all Kerberos operations (Kerberos auth, etc.) and not just the kpasswd operations.
2.  This is not occurring on our older RHEL6 and RHEL7 builds, which are AD-integrated via our older commercial AD integration product.  It's occurring only on our sssd-integrated builds.

At this point, we've turned off debug level 7 (it was filling up our /var/log filesystems and we have the verbose adcli update output from at least two failed clients).   We're going to take the alternate suggestion of setting ad_maximum_machine_account_password_age to 0 (which stops sssd from updating the password) and run a cron job to do 'adcli update'.
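
A minimal sketch of that sssd.conf change, assuming a domain section named [domain/example.com] (a placeholder, not our real domain).  Per the sssd-ad man page, a value of 0 disables sssd's own renewal attempts, which would leave the cron-driven 'adcli update' as the only renewal path:

```
[domain/example.com]
# 0 disables sssd's built-in machine account password renewal;
# renewal is then left to an external 'adcli update' cron job.
ad_maximum_machine_account_password_age = 0
```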

We're wrapping this adcli update with tcpdump to get the exact kpasswd request/response packets, and also running it with KRB5_TRACE set.

We want to call adcli update exactly as sssd calls it. From SOURCES/sssd-2.4.0/src/providers/ad/ad_machine_pw_renewal.c, this appears to be how sssd calls external program /usr/sbin/adcli to do its adcli update:

      /usr/sbin/adcli update --verbose --domain=$AD_DOMAIN --host-keytab=/etc/krb5.keytab --host-fqdn=$FQDN --computer-password-lifetime=30

because we aren't doing any Samba stuff.


Question: how would Samba stuff be relevant to updating the Kerberos ticket using adcli?



Is that the correct invocation?   We'll set computer-password-lifetime lower, say to 7, because we want to see examples more frequently, to find failed updates.

BTW, the packet capture on a successful machine account password renewal is only 8K, so that very targeted debug will not swamp our /var/log or /tmp filesystems.
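
A rough sketch of what such a wrapper could look like, not the script actually in use.  The function names, the LOG_DIR location, the placeholder domain and the capture filter (port 88 for Kerberos, port 464 for kpasswd) are all assumptions; the adcli arguments are the ones quoted earlier in this message.

```shell
#!/bin/bash
# Hypothetical wrapper around 'adcli update'. All names, paths and the
# tcpdump filter below are illustrative assumptions.

AD_DOMAIN="${AD_DOMAIN:-example.com}"        # placeholder AD domain
FQDN="${FQDN:-$(hostname -f 2>/dev/null || hostname 2>/dev/null || echo localhost)}"
LOG_DIR="${LOG_DIR:-/var/tmp/adcli-trace}"   # assumed output directory

# Print the adcli arguments, one per line.
# $1 = computer password lifetime in days (defaults to 30).
adcli_update_args() {
    printf '%s\n' \
        update \
        --verbose \
        "--domain=${AD_DOMAIN}" \
        --host-keytab=/etc/krb5.keytab \
        "--host-fqdn=${FQDN}" \
        "--computer-password-lifetime=${1:-30}"
}

# Run one traced update: packet capture + KRB5_TRACE + verbose adcli output.
run_traced_update() {
    local stamp pcap trace tcpdump_pid rc args
    stamp="$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$LOG_DIR" || return 1
    pcap="$LOG_DIR/kpasswd-$stamp.pcap"
    trace="$LOG_DIR/krb5-trace-$stamp.log"

    # Capture only Kerberos and kpasswd traffic to keep the pcap small.
    tcpdump -i any -w "$pcap" 'port 88 or port 464' >/dev/null 2>&1 &
    tcpdump_pid=$!
    sleep 2   # give tcpdump a moment to start capturing

    mapfile -t args < <(adcli_update_args "${1:-30}")
    KRB5_TRACE="$trace" /usr/sbin/adcli "${args[@]}" \
        >"$LOG_DIR/adcli-$stamp.log" 2>&1
    rc=$?

    sleep 2   # let trailing packets arrive before stopping the capture
    kill "$tcpdump_pid" 2>/dev/null
    wait "$tcpdump_pid" 2>/dev/null
    return "$rc"
}
```

A root cron job could source this file and call `run_traced_update 7`; each run would leave a pcap, a KRB5_TRACE log and the verbose adcli output under LOG_DIR, named with the same timestamp so a failed run can be matched across all three.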

Spike

On Wed, Aug 25, 2021 at 10:32 AM Spike White <spikewhit...@gmail.com> wrote:

    Sssd experts,

    *_Short summary:_* How can we troubleshoot sssd’s ‘Automatic
    Kerberos Host Keytab Renewal’ process?    We have ~0.4% of our Linux
    servers dropping off the AD domain monthly.

    *_Longer explanation:_*

    Over the past two years, we have on-boarded sssd as our Linux AD
    integration component, largely displacing a former commercial
    product that did the same.

    We have about 20K Linux servers that are sssd-enabled: a mix of
    RHEL6, RHEL7, RHEL8, Oracle Linux 6, 7 and 8.   We have ~7K Linux
    servers still on the old commercial product.  (For certain edge-case
    scenarios, such as DMZs, the commercial product works better.)

    Our AD forest is a single forest with 4 regional child domains,
    all with transitive trust.  Sssd auto-discovers the parent domain and
    all 4 child domains, no problem – whenever it’s adcli joined to its
    regional local domain.

    Why am I writing this?

    Because we are researching an ongoing problem reported by L1 server
    ops.  About 70 – 80 sssd-enabled Linux servers / month drop off the
    domain.  Out of our current sssd-enabled population of ~20K servers,
    that’s not horrible.  But still it should be better.  (Our former
    commercial product did better.)

    It’s not limited to one particular OS, OS version, build location or
    region.  We have surveyed; it seems to occur randomly among all OS
    versions, regions and locations.

    To be clear, it’s extremely likely that this behavior arises from
    some subtle misconfiguration on our part – not from any sssd or
    adcli or Kerberos bug.  We have a couple of configuration
    improvements we’re pursuing. (Kerberos max ticket lifetime mismatch
    between AD and /etc/krb5.conf file for instance.)

    We are taking sssd’s default settings for
    ad_maximum_machine_account_password_age and
    ad_machine_account_password_renewal_opts. So after 30 days, sssd
    will attempt daily to renew the host Kerberos keytab file.  It
    should re-attempt daily if not renewed.  By company policy, our AD
    disables any machine accounts that have not renewed their
    credentials in 40 days.   So when we find servers that have dropped
    off the domain, it’s because they have not renewed their AD machine
    accounts in 40 days.
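
    For reference, a sketch of those defaults written out explicitly,
    assuming the values documented in the sssd-ad man page (the domain
    section name is a placeholder).  With a 30-day age threshold, a
    24-hour check interval and a 40-day disable policy, a client gets
    roughly ten daily renewal attempts before its account is disabled:

    ```
    [domain/example.com]
    # Renew the machine account password once it is older than 30 days.
    ad_maximum_machine_account_password_age = 30
    # <interval>:<initial delay>, both in seconds: run the check every
    # 24 h, with the first check 750 s after sssd starts.
    ad_machine_account_password_renewal_opts = 86400:750
    ```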

    We have had SRs open with our OS vendors (Red Hat and Oracle
    respectively) for months now, to no great help.  (They gave a few
    suggestions, but none panned out.)

    We thought we were hitting this bug:

    https://github.com/SSSD/sssd/issues/4762

    But packet captures proved that adcli update is using TCP on
    RHEL7/8.  Thus, this might be a potential problem, but only on
    RHEL6.  (BTW ‘udp_preference_limit = 0’ doesn’t force use of TCP for
    the kpasswd invocation in RHEL6 – it still uses UDP.  Thus, the
    recommended work-around for this bug doesn’t work.)

    So that isn’t our underlying problem.

    We’re at a loss now – as you can see, we’re grasping at straws.

    How can we troubleshoot sssd’s ‘automatic Kerberos Host keytab
    renewal’ process?  Whenever we inspect a particular server it
    works.  We can’t run all sssd clients at debug level 9;  it fills up
    /var/log filesystem after a few days of that.  We’re interested in
    troubleshooting that one particular sssd process on all clients; not
    all parts of sssd.

    Other than a steep learning curve (on our part), obscure situations
    (like DMZ auto-discovery of AD controllers) and exotic scenarios
    (like above), we’re quite happy with our 2 yr journey of direct AD
    integration with sssd.    Obviously, the troubleshooting tools on
    RHEL6 are very minimal.  But certainly, overall the quality of sssd
    on RHEL7/8 is excellent. AD integration has innumerable devils in
    the details; I’m amazed that sssd performs as well as it does
    against our multi-domain forest.

    Spike

    PS the problem with sssd auto-discovery of AD controllers in DMZs
    has been fixed in a recent sssd release. A better discovery
    algorithm was implemented – the same one used by Windows clients and
    commercial products. It’s just that this recent sssd version is not on
    RHEL7 or 8.



_______________________________________________
sssd-users mailing list -- sssd-users@lists.fedorahosted.org
To unsubscribe send an email to sssd-users-le...@lists.fedorahosted.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahosted.org
Do not reply to spam on the list, report it: 
https://pagure.io/fedora-infrastructure