Hi SSSD users,

I need some help debugging a strange failure we're having with SSSD. We have 
a small CentOS Linux cluster in our data center that uses GSSAPI and Active 
Directory for authentication. On the head nodes, sssd periodically enters an 
error state and begins restarting every few minutes, and while it is 
restarting, host authentication fails. From the logs it appears that sssd is 
stuck in a loop, restarting itself roughly every two minutes. A number of log 
entries are written while this is happening; I've attached snippets below.

I've been on quite a journey over the past few weeks trying to get sssd+AD 
working again in our environment. A few months ago our AD admins added a new 
site to our domain that is unreachable from on premises. This led to a number 
of strange issues until we set up our configuration to limit which DCs we 
talk to (the relevant sssd.conf settings are shown after the error output 
below). Despite pinning the hosts to the proper AD site and servers, we 
continue to see periodic errors on these busy head nodes. While in the error 
state, the sssctl domain-status command usually fails with the following 
error:

# sssctl domain-status example.com
Unable to get online status [3]: Communication error
org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes 
include: the remote application did not send a reply, the message bus security 
policy blocked the reply, the reply timeout expired, or the network connection 
was broken.
Check that SSSD is running and the InfoPipe responder is enabled. Make sure 
'ifp' is listed in the 'services' option in sssd.conf.
Unable to get online status
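
For reference, this is roughly how we pinned these hosts to the reachable 
site and DCs in sssd.conf (the site name and DC hostnames below are 
placeholders, not our real values):

[sssd]
services = nss, pam, ifp
domains = example.com

[domain/example.com]
id_provider = ad
# pin to the AD site our network can actually reach (placeholder name)
ad_site = OnPremSite
# restrict lookups to specific DCs instead of relying on SRV discovery
ad_server = dc1.example.com, dc2.example.com
ad_backup_server = dc3.example.com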

ifp is listed in the services option, and on working nodes this command works 
correctly. To describe what happens in the failed state, I'll walk through 
the logs. The first indication of a problem is the following entries in 
/var/log/messages when the restart loop starts:

12245:Nov  4 10:53:00 host-0 sssd[be[example.com]]: Shutting down
12246:Nov  4 10:53:00 host-0 sssd[be[example.com]]: Starting up

These entries then repeat every few minutes.
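
(The leading numbers on those lines are just line numbers from grep; the 
entries were pulled with something like the following:)

grep -n 'sssd\[be\[example\.com\]\]' /var/log/messages | tail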

The krb5_child log shows errors like these when the ifp backend is not working:

(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] 
(0x0100): cmd [241] uid [266812] gid [286812] validate [true] enterprise 
principal [true] offline [false] UPN [u...@example.com]
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] 
(0x0100): ccname: [KEYRING:persistent:286812] old_ccname: [not set] keytab: 
[/etc/krb5.keytab]
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [check_use_fast] 
(0x0100): Not using FAST.
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [privileged_krb5_setup] 
(0x0080): Cannot open the PAC responder socket
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] 
(0x0100): No specific renewable lifetime requested.
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] 
(0x0100): No specific lifetime requested.
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] 
[set_canonicalize_option] (0x0100): Canonicalization is set to [true]
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [sss_send_pac] 
(0x0040): sss_pac_make_request failed [-1][2].
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [validate_tgt] 
(0x0040): sss_send_pac failed, group membership for user with principal 
[user\@EXAMPLE.COM] might not be correct.
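
(Side note: I'm not sure whether the "Cannot open the PAC responder socket" 
message is related. Assuming the default pipe directory, something like the 
following shows which responders are configured and which responder sockets 
actually exist:)

grep -E '^\s*services' /etc/sssd/sssd.conf
ls -l /var/lib/sss/pipes/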

The nss log shows these failures and then an auto-reconnect after entries 
like "[cache_req_common_dp_recv] (0x0040): CR #1171: Data Provider Error: 3, 
5, (null)"; see below:

(Wed Nov  4 10:52:28 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR 
#1171: Data Provider Error: 3, 5, (null)
(Wed Nov  4 10:52:58 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR 
#1174: Data Provider Error: 3, 5, (null)
(Wed Nov  4 10:53:00 2020) [sssd[nss]] [sbus_dispatch] (0x0020): Performing 
auto-reconnect
(Wed Nov  4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Making 
reconnection attempt 1 to 
[unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Reconnected 
to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:53:01 2020) [sssd[nss]] [nss_dp_reconnect_init] (0x0020): 
Reconnected to the Data Provider.
(Wed Nov  4 10:53:01 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR 
#1178: Data Provider Error: 3, 5, (null)

The ifp log shows the following; note the "killing children" entry at the bottom:

(Wed Nov  4 10:53:00 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing 
auto-reconnect
(Wed Nov  4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making 
reconnection attempt 1 to 
[unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected 
to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:55:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing 
auto-reconnect
(Wed Nov  4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making 
reconnection attempt 1 to 
[unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected 
to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:57:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing 
auto-reconnect
(Wed Nov  4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making 
reconnection attempt 1 to 
[unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected 
to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:59:10 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing 
auto-reconnect
(Wed Nov  4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making 
reconnection attempt 1 to 
[unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected 
to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 11:01:11 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing 
auto-reconnect
(Wed Nov  4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making 
reconnection attempt 1 to 
[unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected 
to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 11:02:01 2020) [sssd[ifp]] [orderly_shutdown] (0x0010): SIGTERM: 
killing children

Usually clearing the cache and restarting sssd is enough to fix the issue, 
but not always. It's hard to diagnose because it affects production and 
people's ability to even use the cluster, and I can't reproduce it on demand.
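
For reference, recovery is roughly the following: stop sssd, clear the cached 
databases, and start it again.

systemctl stop sssd
rm -f /var/lib/sss/db/*      # or: sss_cache -E to invalidate without deleting
systemctl start sssd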

Any idea what might be causing these failures?

Thanks in advance,
Josh