Hi,

what is in sssd_example.com.log?

It seems like most (all?) of the errors you quote are due to
sssd_be[example.com] hanging or crashing.
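
If that log is quiet, it usually helps to raise debug_level in the domain
section so the backend logs what it is doing right before the hang. A
minimal sketch, assuming the section name matches your domain:

[domain/example.com]
debug_level = 9

Then reproduce the failure and look at
/var/log/sssd/sssd_example.com.log around the "Shutting down" /
"Starting up" timestamps.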


On Wed, Nov 4, 2020 at 8:56 PM Josh Sonstroem <[email protected]> wrote:

>
> Hi SSSD users,
>
> I need some help debugging a strange failure we are having with sssd.
> We have an issue on a small CentOS Linux cluster in our data center
> that uses GSSAPI and Active Directory for auth. On the head nodes the
> sssd application periodically enters an error state and begins
> restarting every few minutes, and while it is restarting, host
> authentication fails. From the logs it appears that sssd is stuck in a
> loop, restarting itself roughly every 2 minutes. Several logs are
> written while this is happening; I've attached snippets below.
>
> I've been on quite a journey over the past few weeks trying to get
> sssd+AD working again in our environment. A few months ago our AD
> admins added a new site to our domain that is unreachable from
> on-premise. This led to a number of strange issues until we set up our
> configs to limit which DCs we talk to, but despite limiting the hosts
> to the proper AD site and servers, we continue to see periodic errors
> on these busy head nodes. While in the error state, the sssctl
> domain-status command will usually fail with the following error:
>
> # sssctl domain-status example.com
> Unable to get online status [3]: Communication error
> org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible
> causes include: the remote application did not send a reply, the message
> bus security policy blocked the reply, the reply timeout expired, or the
> network connection was broken.
> Check that SSSD is running and the InfoPipe responder is enabled. Make
> sure 'ifp' is listed in the 'services' option in sssd.conf.
> Unable to get online status
>
> The ifp responder is enabled in the services list (the relevant
> sssd.conf line is sketched below), and on working nodes this command
> succeeds.
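>
> For reference, a services line with the ifp responder enabled looks
> something like this (a sketch, not our exact config):
>
>   [sssd]
>   services = nss, pam, ifp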
> To describe what happens in the failed state, I'll walk through the
> logs. The first indication of a problem is in /var/log/messages, where
> we see the following entries once the restart loop starts:
>
> 12245:Nov  4 10:53:00 host-0 sssd[be[example.com]]: Shutting down
> 12246:Nov  4 10:53:00 host-0 sssd[be[example.com]]: Starting up
>
> These entries then repeat every few minutes.
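>
> A quick grep, sketched here and dependent on the exact syslog format,
> shows the cadence of the restarts:
>
>   grep 'be\[example.com\]' /var/log/messages | grep 'Starting up'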
>
> The krb5_child log shows errors like the following while the backend is
> not working:
>
> (Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer]
> (0x0100): cmd [241] uid [266812] gid [286812] validate [true] enterprise
> principal [true] offline [false] UPN [[email protected]]
> (Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer]
> (0x0100): ccname: [KEYRING:persistent:286812] old_ccname: [not set] keytab:
> [/etc/krb5.keytab]
> (Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [check_use_fast]
> (0x0100): Not using FAST.
> (Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]]
> [privileged_krb5_setup] (0x0080): Cannot open the PAC responder socket
> (Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]]
> [set_lifetime_options] (0x0100): No specific renewable lifetime requested.
> (Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]]
> [set_lifetime_options] (0x0100): No specific lifetime requested.
> (Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]]
> [set_canonicalize_option] (0x0100): Canonicalization is set to [true]
> (Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [sss_send_pac]
> (0x0040): sss_pac_make_request failed [-1][2].
> (Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [validate_tgt]
> (0x0040): sss_send_pac failed, group membership for user with principal
> [user\@EXAMPLE.COM] might not be correct.
>
> The nss log shows repeated Data Provider errors such as
> "[cache_req_common_dp_recv] (0x0040): CR #1171: Data Provider Error:
> 3, 5, (null)", followed by an auto-reconnect; see below:
>
> (Wed Nov  4 10:52:28 2020) [sssd[nss]] [cache_req_common_dp_recv]
> (0x0040): CR #1171: Data Provider Error: 3, 5, (null)
> (Wed Nov  4 10:52:58 2020) [sssd[nss]] [cache_req_common_dp_recv]
> (0x0040): CR #1174: Data Provider Error: 3, 5, (null)
> (Wed Nov  4 10:53:00 2020) [sssd[nss]] [sbus_dispatch] (0x0020):
> Performing auto-reconnect
> (Wed Nov  4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Making
> reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/
> sbus-dp_example.com]
> (Wed Nov  4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080):
> Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov  4 10:53:01 2020) [sssd[nss]] [nss_dp_reconnect_init] (0x0020):
> Reconnected to the Data Provider.
> (Wed Nov  4 10:53:01 2020) [sssd[nss]] [cache_req_common_dp_recv]
> (0x0040): CR #1178: Data Provider Error: 3, 5, (null)
>
> The ifp log shows the following; note the "killing children" entry at
> the bottom:
>
> (Wed Nov  4 10:53:00 2020) [sssd[ifp]] [sbus_dispatch] (0x0020):
> Performing auto-reconnect
> (Wed Nov  4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making
> reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/
> sbus-dp_example.com]
> (Wed Nov  4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080):
> Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov  4 10:55:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020):
> Performing auto-reconnect
> (Wed Nov  4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making
> reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/
> sbus-dp_example.com]
> (Wed Nov  4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080):
> Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov  4 10:57:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020):
> Performing auto-reconnect
> (Wed Nov  4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making
> reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/
> sbus-dp_example.com]
> (Wed Nov  4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080):
> Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov  4 10:59:10 2020) [sssd[ifp]] [sbus_dispatch] (0x0020):
> Performing auto-reconnect
> (Wed Nov  4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making
> reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/
> sbus-dp_example.com]
> (Wed Nov  4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080):
> Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov  4 11:01:11 2020) [sssd[ifp]] [sbus_dispatch] (0x0020):
> Performing auto-reconnect
> (Wed Nov  4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making
> reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/
> sbus-dp_example.com]
> (Wed Nov  4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080):
> Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
> (Wed Nov  4 11:02:01 2020) [sssd[ifp]] [orderly_shutdown] (0x0010):
> SIGTERM: killing children
>
> Usually clearing the cache and restarting sssd is enough to fix the
> issue, but not always; the exact steps we use are sketched below. It's
> hard to diagnose because it affects production and people's ability to
> use the cluster at all, and I can't recreate it on demand.
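>
> Roughly, "clearing the cache and restarting" means something like this
> (a sketch, assuming the default cache location under /var/lib/sss/db/):
>
>   sss_cache -E          # invalidate all cached entries
>   systemctl restart sssd
>
> or, more drastically, stop sssd, remove the cache files under
> /var/lib/sss/db/, and start it again.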
>
> Any idea of what might be causing these failures?
>
> Thanks in advance,
> Josh
_______________________________________________
sssd-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedorahosted.org/archives/list/[email protected]
