Hi, what is in sssd_example.com.log?

Seems like most (all?) of the errors you quote are due to an sssd_be[example.com] hang or crash.
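
If that log is quiet at the default level, here is a minimal sketch of how the backend's verbosity could be raised so the next hang is captured (this assumes your domain section is literally named [domain/example.com] and that a brief sssd restart is acceptable on a head node):

    # /etc/sssd/sssd.conf
    [domain/example.com]
    # level 9 is verbose enough to show why sssd_be is shutting down;
    # drop it back down once the failure has been captured
    debug_level = 9

    # apply the change and watch the backend log
    systemctl restart sssd
    tail -f /var/log/sssd/sssd_example.com.log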

On Wed, Nov 4, 2020 at 8:56 PM Josh Sonstroem <[email protected]> wrote:
>
> Hi SSSD users,
>
> I need some help debugging this strange failure we are having with sssd.
> We've got an issue on a small CentOS Linux cluster in our data center that
> uses GSSAPI and Active Directory for auth. On the head-nodes the sssd
> application periodically enters an error state and begins restarting every
> few minutes, and while it's restarting host authentication fails. From the
> logs it appears that the sssd application is in a loop restarting itself
> every 2 or so minutes. There are numerous logs that are written during the
> occurrence of this issue. I've attached snippets below.
>
> I've been on quite a journey over the past few weeks trying to get
> sssd+ad working again in our environment. A few months ago our AD admins
> added a new site to our domain that is unreachable from on-premise. This
> led to a bunch of strange issues until we set up our configs to limit which
> DCs we talk to, and despite limiting the hosts to the proper AD site and
> servers we continue to have periodic errors on these busy head nodes. While
> in the error state, the sssctl domain-status command will usually fail with
> the following error:
>
> # sssctl domain-status example.com
> Unable to get online status [3]: Communication error
> org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible
> causes include: the remote application did not send a reply, the message
> bus security policy blocked the reply, the reply timeout expired, or the
> network connection was broken.
> Check that SSSD is running and the InfoPipe responder is enabled. Make
> sure 'ifp' is listed in the 'services' option in sssd.conf.
> Unable to get online status
>
> Ifp is enabled in the services, and on working nodes this works correctly.
> To try and describe what is happening in the fail state I'll walk thru the
> logs. The first indication of an issue is that in /var/log/messages we see
> the following entries when the application restart loop starts going:
>
> 12245:Nov 4 10:53:00 host-0 sssd[be[example.com]]: Shutting down
> 12246:Nov 4 10:53:00 host-0 sssd[be[example.com]]: Starting up
>
> And then these repeat every few minutes.
>
> The krb5_child log shows errors like this when the ifp backend is not
> working:
>
> (Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] (0x0100): cmd [241] uid [266812] gid [286812] validate [true] enterprise principal [true] offline [false] UPN [[email protected]]
> (Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] (0x0100): ccname: [KEYRING:persistent:286812] old_ccname: [not set] keytab: [/etc/krb5.keytab]
> (Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [check_use_fast] (0x0100): Not using FAST.
> (Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [privileged_krb5_setup] (0x0080): Cannot open the PAC responder socket
> (Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] (0x0100): No specific renewable lifetime requested.
> (Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] (0x0100): No specific lifetime requested.
> (Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_canonicalize_option] (0x0100): Canonicalization is set to [true]
> (Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [sss_send_pac] (0x0040): sss_pac_make_request failed [-1][2].
> (Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [validate_tgt] (0x0040): sss_send_pac failed, group membership for user with principal [user\@EXAMPLE.COM] might not be correct.
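
As an aside, the "Cannot open the PAC responder socket" message in the krb5_child snippet above typically just means the pac responder is not enabled or not running. A sketch of what the services line would look like with it enabled (the nss/pam/ifp set below is an assumption for illustration, not taken from your config):

    # /etc/sssd/sssd.conf
    [sssd]
    # add "pac" if you want the PAC responder; without it, the
    # krb5_child socket message is expected
    services = nss, pam, ifp, pac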

> The nss log shows these failures and then the auto-reconnect after seeing
> something like this: "[cache_req_common_dp_recv] (0x0040): CR #1171: Data
> Provider Error: 3, 5, (null)". See below:
>
> (Wed Nov 4 10:52:28 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1171: Data Provider Error: 3, 5, (null)
> (Wed Nov 4 10:52:58 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1174: Data Provider Error: 3, 5, (null)
> (Wed Nov 4 10:53:00 2020) [sssd[nss]] [sbus_dispatch] (0x0020): Performing auto-reconnect
> (Wed Nov 4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 10:53:01 2020) [sssd[nss]] [nss_dp_reconnect_init] (0x0020): Reconnected to the Data Provider.
> (Wed Nov 4 10:53:01 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1178: Data Provider Error: 3, 5, (null)
>
> The ifp log shows the following; notice the "killing children" at the bottom:
>
> (Wed Nov 4 10:53:00 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
> (Wed Nov 4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 10:55:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
> (Wed Nov 4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 10:57:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
> (Wed Nov 4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 10:59:10 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
> (Wed Nov 4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 11:01:11 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
> (Wed Nov 4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov 4 11:02:01 2020) [sssd[ifp]] [orderly_shutdown] (0x0010): SIGTERM: killing children
>
> Usually clearing the cache and restarting sssd are enough to fix this
> issue, but not always.
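
For reference, the cache-clearing step mentioned above usually amounts to something like the following sketch (sss_cache -E only marks cached entries as expired, while removing the files under /var/lib/sss/db wipes the cache, including cached credentials, completely):

    # invalidate all cached entries
    sss_cache -E

    # or, for a completely clean slate
    systemctl stop sssd
    rm -f /var/lib/sss/db/*
    systemctl start sssd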

> It's hard to diagnose because it affects production and people's ability
> to even use the cluster, and I can't recreate it on the fly.
>
> Any idea of what might be causing these failures?
>
> Thanks in advance,
> Josh
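
Also, since you mention pinning the hosts to the reachable site: for comparison, a sketch of how that is typically expressed in the AD provider. The host names and site name below are placeholders, not values from your setup, and ad_site requires a reasonably recent sssd:

    [domain/example.com]
    ad_server = dc1.example.com
    ad_backup_server = dc2.example.com
    ad_site = OnPrem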
