Thanks again for the reply.  I think this explains the random slowness.
The users in question belong to LOTS and LOTS of groups on the AD server,
and the group memberships among users are many and complex.  It sounds
like I could also attack some of the slowness by increasing the cache
timeouts, if the conf file allows that level of control.
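
A minimal sketch of what that might look like in sssd.conf.  The domain
name and the values are placeholders made up for illustration.
entry_cache_timeout and entry_cache_group_timeout are options documented in
sssd.conf(5), but whether raising them actually helps the PAC case on this
version is an assumption and untested:

```ini
[domain/example.com]
# Placeholder domain name; values are illustrative, not recommendations.
# Default lifetime for cached entries, in seconds (5400 is the documented default).
entry_cache_timeout = 5400
# Longer lifetime for group entries only, so fewer groups go stale between logins.
entry_cache_group_timeout = 86400
```

The trade-off is that group membership changes made in AD would take
correspondingly longer to show up on the clients.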

Thanks for helping me understand the issue and the role of PAC in it.


-Jim

On 2019-09-19 23:52, Jakub Hrozek wrote:
> On Thu, Sep 19, 2019 at 05:41:00PM -0700, Jim Burwell wrote:
>> Thanks for the response.  Will respond inline.
>>
>> On 2019-09-19 00:07, Jakub Hrozek wrote:
>>> On Wed, Sep 18, 2019 at 06:25:31PM -0700, Jim Burwell wrote:
>>>> Hi,
>>>>
>>>> I recently encountered issues where logins on Linux clients using SSSD
>>>> and the AD provider, pointed directly at an AD server, were randomly
>>>> slow.  Randomly meaning, some clients experienced no slowness at all,
>>>> other clients consistently had slow logins (30+ seconds sometimes), and
>>>> yet other clients had random normal/fast logins, and frequent slow logins.
>>>>
>>>> Through troubleshooting, log analysis and experimentation, it appears
>>>> the fix for this issue is to turn off the PAC service.  Once "pac" was
>>>> removed from the "services =" line in sssd.conf, the problem client
>>>> boxes were suddenly consistently fast in terms of user logins.
>>>>
>>>> This deployment has the clients talking directly to AD servers, which they
>>>> look up via the normal AD DNS entries, and uses Unix POSIX attributes in AD
>>>> for uidNumber and gidNumber etc. (i.e. it's not doing any SID -> Unix ID
>>>> translations, it's just pulling them directly from LDAP attributes).
>>>>
>>>> I guess my questions are:
>>>>
>>>>  1. What does PAC actually do?  I've read that it lists a user's groups as
>>>>     part of a KRB5 response, but also that it might be involved in
>>>>     cross-domain trusts.
>>> There is a lot of information about PAC in the PDF linked here:
>>>     
>>> https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-pac/166d8064-c863-41e1-9c23-edaaa5f36962?redirectedfrom=MSDN
>>> and a more readable version e.g. here:
>>>     https://www.freeipa.org/page/Howto/Inspecting_the_PAC
>>>
>>> In general, Windows gives you the authoritative set of groups the user
>>> is a member of only after login, so parsing the groups out of the PAC is
>>> the most reliable way. And in older versions of SSSD, especially with
>>> IPA-AD trusts, it was even the only way, IOW 'id' would only display
>>> groups after you log in. Newer versions try to approximate the groups
>>> with other means, mostly the tokenGroups attribute.
>> I'll take a look at these links to increase my understanding.
>>>>  2. When is PAC needed?  Is it only needed for deployments using IPA?
>>> It was strictly needed with some quite old IPA provider versions and
>>> recommended at some point for the AD provider also, but in the meantime, we
>>> improved the tokenGroups codepath, so the PAC provider is no longer used
>>> for the AD provider, at least by default.
>> OK.  Good to know.  I based my "services =" line on many example configs
>> I've seen similar to our particular architecture, which is why I
>> included "pac".  It seemed to be recommended to use with the AD provider
>> from my memory, but now when I do a cursory search, I see most of the
>> example configs no longer include "pac".  I thought I was going crazy.
>>
>>>>  3. Is there any impact in turning off PAC if the architecture doesn't
>>>>     involve IPA in the mix?
>>> As said above, it is the most reliable way, but if sssd is giving you
>>> the group membership you expect also w/o using the PAC, then feel free
>>> to not use it.
>> So far I've only removed it from services on problem hosts to fix the
>> "slow logins" problem.
>>>>  4. Why would PAC slow down such an architecture seemingly randomly?
>>> I guess you might be using an older version of SSSD? In the older
>>> versions, the PAC was processed as part of the krb5_child process, so if
>>> the PAC processing was taking too long, the krb5_child was timing out.
>>> In newer versions, the PAC handling was reworked and is now evaluated
>>> differently.
>>>
>>> The PAC data is cached iirc, so when the slowdown occurred, I guess it
>>> was when the PAC data was out of date in the cache.
>> This makes sense, because in some cases these systems would not
>> successfully complete a login until I increased various timeouts in
>> sssd.conf.  Then they'd take 20-30s to log in.
>>
>> But it was quite random, which is what has me confused.  These systems
>> were running identical OS, package sets, and were on the same network,
>> in many cases connected to the same set of switches (blade servers). 
>> Most have no issue, and login is fast (2-3s).  Others took 20-30s! 
>> Suspecting networking issues, we moved one to a different network whose
>> clients weren't having slow-login problems to see if it changed
>> anything, and it didn't. 
> The slow part is parsing the PAC locally. The PAC includes a list of
> SIDs and for each SID, the PAC responder would ask sssd_be if the
> corresponding SID is known and to refresh it if the corresponding cache
> entry is stale.
>
> So the flow used to go like this:
>     sssd_pam -> sssd_be -> krb5_child -> sssd_pac -> sssd_be (for each
>     SID) -> (for each expired SID) AD LDAP
> This was too slow. As for why it was random, I guess if some users
> had overlapping group memberships, many of the groups could have been
> updated when another user with similar group memberships logged in, and
> then at some point more than a critical mass of groups would go stale in
> the cache and sssd_be ended up updating them all.
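
A toy model of the flow described above, as a sketch only.  This is not
SSSD code: the GroupCache class, the SIDs, the group counts and the
5400-second lifetime are all made up for illustration.  It only shows how
the number of per-SID refreshes a login has to wait for depends on what
other logins have already refreshed in the shared cache:

```python
from dataclasses import dataclass, field


@dataclass
class GroupCache:
    """Toy stand-in for the SSSD cache: SID -> time of last refresh."""
    ttl: int                                   # hypothetical cache lifetime, seconds
    refreshed_at: dict = field(default_factory=dict)

    def is_stale(self, sid: str, now: int) -> bool:
        last = self.refreshed_at.get(sid)
        return last is None or now - last >= self.ttl

    def login(self, pac_sids: list, now: int) -> int:
        """Walk the SIDs from the PAC and refresh each stale one.

        Returns the number of refreshes, i.e. the number of simulated
        LDAP round trips this login had to wait for.
        """
        round_trips = 0
        for sid in pac_sids:
            if self.is_stale(sid, now):
                self.refreshed_at[sid] = now   # simulated LDAP lookup
                round_trips += 1
        return round_trips


# Made-up SIDs: alice and bob share 300 groups, and each has 50 of their own.
shared = [f"S-1-5-21-0-0-0-{n}" for n in range(1000, 1300)]
alice = shared + [f"S-1-5-21-0-0-0-{n}" for n in range(2000, 2050)]
bob = shared + [f"S-1-5-21-0-0-0-{n}" for n in range(3000, 3050)]

cache = GroupCache(ttl=5400)

cold = cache.login(alice, now=0)        # empty cache: every SID is refreshed
warm = cache.login(bob, now=60)         # shared groups are still fresh
repeat = cache.login(alice, now=120)    # everything alice needs is fresh
expired = cache.login(bob, now=6000)    # all of bob's entries are past the TTL

print(cold, warm, repeat, expired)      # prints: 350 50 0 350
```

The same user on the same host pays for 50 refreshes in one login and 350
in another, depending only on timing and on who logged in before.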
>
>> What ultimately fixed it was disabling the PAC service.  But near
>> identical systems except for the IP address, sitting right beside the
>> problem systems have no issues with pac enabled!  Very strange!
>>
>> Disabling pac made these problem clients behave like the ones that
>> weren't having issues, and logins take 2-3s.
>>
>> These are all Ubuntu 16.04 LTS systems which are running sssd 1.13-4-1
>> (or higher if there are patches, right now I don't have access to look
>> at them).  So I'm not sure if this is using the older code, or the newer
>> code.  Do you remember?
> git log remembers :-)
>
> and tells me that the "new" PAC approach was implemented in 1.14.
>
>> I guess this version is "old" since it came out in early 2017, and the
>> latest 1.x is 1.16.4 (I presume 1.x development has stopped except for
>> bug fixes?).  Latest of course is 2.2.2!  So it seems way behind when
>> looking at it that way.  :-)
> Yes and no. The 1.16.x branch is the stable, long-term support branch and
> we'll be supporting it for as long as RHEL-7 is supported. It's true that the 1.16
> branch no longer receives many new features and that most of the
> development happens with the 2.x branch, but bug fixes and selected new
> features are still backported to 1.16.x as well. This is not to say the
> 2.x branch is not stable, it is used in RHEL-8 after all, but the large
> amount of changes increases the chance that something would break by
> accident.


_______________________________________________
sssd-users mailing list -- sssd-users@lists.fedorahosted.org