Re: [Freeipa-users] performance scaling of sssd / freeipa

2017-01-26 Thread Sullivan, Daniel [CRI]
Sumit,

Thank you for the detailed and thorough reply; I really appreciate you taking the 
time to get back to me.

What you describe makes sense and is in line with my thoughts and observations 
(and with what I took from Jakub's performance tuning doc).  You articulated this 
very well; thank you.

We had already increased the 389ds thread number from 30 to 180 and increased 
the number of processor cores to 6 (based on reading 389’s documentation, roughly 
30 threads per core is appropriate).  This actually helped the problem 
significantly.

This morning, we upped that number again to 12 cores and 360 threads for 389 
DS, and this made the situation even better.  The problem has largely gone away 
at this point.
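
In case it helps anyone else, the thread count change is just an update to 
nsslapd-threadnumber in cn=config; roughly what we ran (Directory Manager bind 
and instance name are examples for our environment, adjust as needed):

ldapmodify -x -H ldap://localhost -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-threadnumber
nsslapd-threadnumber: 360
EOF
# restart the directory server instance (or use ipactl restart) to be safe
systemctl restart dirsrv@XXX-XXX-UCHICAGO-EDU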

I turned on the ignore_group_members setting long ago; it did not help with 
this particular problem.

I considered enabling enumeration to try to address this issue, though for some 
reason I thought that feature wasn’t implemented for AD domains.

Anyway, I think we are good for now, at least on this issue, especially thanks 
to your description of what is actually occurring.

Thank you for your time and for your help!

Best,

Dan

> On Jan 26, 2017, at 9:15 AM, Sumit Bose  wrote:
> 
> On Wed, Jan 25, 2017 at 10:58:34PM +, Sullivan, Daniel [CRI] wrote:
>> Hi,
>> 
>> My apologies for resurrecting this thread.  This issue is still ongoing; at 
>> this point we’ve been looking at it for over a week and now have more than 
>> one staff member analyzing and trying to resolve it on a full-time basis.  I 
>> have some more information that I was hoping a seasoned IPA expert could 
>> take a look at.  At this point I am fairly certain it is a performance 
>> tuning issue in either sssd or FreeIPA on our domain controllers.  It 
>> looks to me like the main issue is that when looking up the same user across 
>> a large number of nodes in parallel, all of our available 389ds threads get 
>> blocked in '__lll_robust_lock_wait ()’ for operations involving 
>> ipa_extdom_common.c.  This usually occurs on one of our two DCs, but 
>> occasionally on both.  For example, 179 of the 199 threads in the attached 
>> output are in the state __lll_robust_lock_wait ().  All of the 
>> us...@xxx.uchicago.edu entries in this attachment are the same user.
>> 
>> Here is more information about this issue (some of it repeated for 
>> convenience):
>> 
>>  1.  We currently have 2 domain controllers.  Each has 6 processor cores and 
>> 180 threads allocated for 389ds.  We have gone through Red Hat’s performance 
>> tuning guide for directory services, made what we felt were appropriate 
>> changes, and made additional tuning modifications to lower eviction rates 
>> and raise cache hit numbers for 389ds.  We have approximately 220 
>> connections to our domain controllers (from "cn=monitor”); depending on the 
>> test, I’ve seen as many as 190 connected to a single DC.
>>  2.  We are using an AD domain where all of our users and groups reside.
>>  3.  I induce this by looking up a user (using the id command) on a large 
>> number of nodes (maybe 200), for a user that has never been looked up before 
>> and is not cached on either the client or the DC.
>>  4.  Before I induce the problem, I can look up entries in LDAP without 
>> delay or problem (i.e. the LDAP server is performant and responsive, I can 
>> inspect cn=monitor or cn=config and get instantaneous results).
>>  5.  When I do induce the issue, the LDAP server basically becomes 
>> unresponsive (which is expected based on the attached output).  Servicing a 
>> query using the ldapsearch tool (for either cn=monitor or cn=config) can take 
>> upwards of 1-2 minutes or longer.  Eventually the LDAP server will 
>> ‘recover’, i.e. I do not typically need to restart IPA services to get this 
>> working again.
>>  6.  After a lookup fails, subsequent parallel lookups succeed and return 
>> the desired record (presumably from the cache).
>>  7.  It appears that these failures are also characterized by a 
>> corresponding "[monitor_hup] (0x0020): Received SIGHUP.”  in the sssd log.
>>  8.  Right before the problem occurs I see a brief spike in CPU utilization 
>> of the ns-slapd process, then the utilization basically drops to 0 once the 
>> threads are blocked in ns-slapd.
>>  9.  Since we are doing computation in our IPA environment, it is important 
>> that we can perform these types of parallel operations against our IPA 
>> environment at the scale we are testing.
>> 
>> I feel like we are either DoSing the LDAP server or the sss_be / sss_nss 
>> processes, although I am not sure.   Right now we are in the process of 
>> deploying an additional domain controller to see if that helps with 
>> distribution of load.  If anybody could provide any sort of information with 
>> respect to addressing the issue in the attached trace, I would be very grateful.
> 
> I think your observations are due to the fact

Re: [Freeipa-users] performance scaling of sssd / freeipa

2017-01-26 Thread Sumit Bose
On Wed, Jan 25, 2017 at 10:58:34PM +, Sullivan, Daniel [CRI] wrote:
> Hi,
> 
> My apologies for resurrecting this thread.  This issue is still ongoing; at 
> this point we’ve been looking at it for over a week and now have more than 
> one staff member analyzing and trying to resolve it on a full-time basis.  I 
> have some more information that I was hoping a seasoned IPA expert could 
> take a look at.  At this point I am fairly certain it is a performance 
> tuning issue in either sssd or FreeIPA on our domain controllers.  It 
> looks to me like the main issue is that when looking up the same user across 
> a large number of nodes in parallel, all of our available 389ds threads get 
> blocked in '__lll_robust_lock_wait ()’ for operations involving 
> ipa_extdom_common.c.  This usually occurs on one of our two DCs, but 
> occasionally on both.  For example, 179 of the 199 threads in the attached 
> output are in the state __lll_robust_lock_wait ().  All of the 
> us...@xxx.uchicago.edu entries in this attachment are the same user.
> 
> Here is more information about this issue (some of it repeated for 
> convenience):
> 
>   1.  We currently have 2 domain controllers.  Each has 6 processor cores and 
> 180 threads allocated for 389ds.  We have gone through Red Hat’s performance 
> tuning guide for directory services, made what we felt were appropriate 
> changes, and made additional tuning modifications to lower eviction rates 
> and raise cache hit numbers for 389ds.  We have approximately 220 
> connections to our domain controllers (from "cn=monitor”); depending on the 
> test, I’ve seen as many as 190 connected to a single DC.
>   2.  We are using an AD domain where all of our users and groups reside.
>   3.  I induce this by looking up a user (using the id command) on a large 
> number of nodes (maybe 200), for a user that has never been looked up before 
> and is not cached on either the client or the DC.
>   4.  Before I induce the problem, I can look up entries in LDAP without 
> delay or problem (i.e. the LDAP server is performant and responsive, I can 
> inspect cn=monitor or cn=config and get instantaneous results).
>   5.  When I do induce the issue, the LDAP server basically becomes 
> unresponsive (which is expected based on the attached output).  Servicing a 
> query using the ldapsearch tool (for either cn=monitor or cn=config) can take 
> upwards of 1-2 minutes or longer.  Eventually the LDAP server will ‘recover’, 
> i.e. I do not typically need to restart IPA services to get this working 
> again.
>   6.  After a lookup fails, subsequent parallel lookups succeed and return 
> the desired record (presumably from the cache).
>   7.  It appears that these failures are also characterized by a 
> corresponding "[monitor_hup] (0x0020): Received SIGHUP.”  in the sssd log.
>   8.  Right before the problem occurs I see a brief spike in CPU utilization 
> of the ns-slapd process, then the utilization basically drops to 0 once the 
> threads are blocked in ns-slapd.
>   9.  Since we are doing computation in our IPA environment, it is important 
> that we can perform these types of parallel operations against our IPA 
> environment at the scale we are testing.
> 
> I feel like we are either DoSing the LDAP server or the sss_be / sss_nss 
> processes, although I am not sure.   Right now we are in the process of 
> deploying an additional domain controller to see if that helps with 
> distribution of load.  If anybody could provide any sort of information with 
> respect to addressing the issue in the attached trace, I would be very grateful.

I think your observations are due to the fact that SSSD currently
serializes requests coming from a single process. Your clients call the
extdom extended LDAP operation on the IPA server to get the information
about the user from the trusted domain. The extdom plugin runs inside
389ds, and each client connection runs in a different thread. To get
the information about the user from the trusted domain, the extdom plugin
calls SSSD, and this is where the serialization happens, i.e. all
threads have to wait until the first one gets its results before the
next thread can talk to SSSD.

With an empty cache the initial lookup of a user and all its groups will
take some time, and since you used quite a number of clients, all 389ds
worker threads will be "busy" waiting to talk to SSSD. That makes it hard
even for other requests, including the ones which do not need to talk
to SSSD, to get through, because there are no free worker threads.
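
If you want to confirm this while the problem is happening, counting how many
ns-slapd worker threads are sitting in the lock wait should show it, e.g.
something like this (just a sketch; it needs pstack from the gdb package):

pstack $(pidof ns-slapd) | grep -c __lll_robust_lock_wait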

To improve the situation, setting 'ignore_group_members=True' as
described on
https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/
(which you already mentioned) might help.
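
If I remember the post correctly, on the IPA servers this boils down to a small
sssd.conf change in the domain section, roughly like this (domain name is just
an example, and sssd needs a restart afterwards):

[domain/xxx.xxx.uchicago.edu]
subdomain_inherit = ignore_group_members
ignore_group_members = True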

Although in general not recommended, depending on the size of the trusted
domain (i.e. the number of users and groups in the trusted

Re: [Freeipa-users] performance scaling of sssd / freeipa

2017-01-20 Thread Sullivan, Daniel [CRI]
Thank you for responding, Lukas.  This is actually a domain controller that 
trusts an AD domain; as far as I know, winbindd was never installed to fulfill 
any purpose other than IPA itself (the machine was deployed specifically for 
the purpose of being an IPA DC).  Hopefully this sounds reasonable and sane…

And, no, winbind is not configured in nsswitch.

Dan

> On Jan 20, 2017, at 4:48 PM, Lukas Slebodnik  wrote:
> 
> On (20/01/17 20:18), Sullivan, Daniel [CRI] wrote:
>> Sorry to clutter people's inboxes.  I found another piece of what I believe 
>> to be useful information.  When this occurs the following entry also appears 
>> in /var/log/messages.
>> 
>> Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]: [2017/01/20 
>> 13:54:33.942448,  0] ipa_sam.c:4193(bind_callback_cleanup)
>> Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]:   kerberos error: 
>> code=-1765328228, message=Cannot contact any KDC for realm 
>> ‘XXX.XXX.UCHICAGO.EDU'
>> Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]: [2017/01/20 
>> 13:54:33.943497,  0] ../source3/lib/smbldap.c:998(smbldap_connect_system)
>> Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]:   failed to bind to 
>> server ldapi://%2fvar%2frun%2fslapd-XXX-XXX-UCHICAGO-EDU.socket with 
>> dn="[Anonymous bind]" Error: Local error
>> Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]:   #011(unknown)
>> Jan 20 13:55:16 xxx.xxx.uchicago.edu winbindd[7090]: [2017/01/20 
>> 13:55:16.970304,  0] ipa_sam.c:4193(bind_callback_cleanup)
>> Jan 20 13:55:16 xxx.xxx.uchicago.edu winbindd[7090]:   kerberos error: 
>> code=-1765328228, message=Cannot contact any KDC for realm 
>> ‘XXX.XXX.UCHICAGO.EDU'
>> Jan 20 14:00:01 xxx.xxx.uchicago.edu systemd[1]: Created slice user-0.slice.
>> 
> May I ask why you have configured sssd and winbind on the same machine?
> Do you also have winbind configured in /etc/nsswitch.conf?
> 
> LS



Re: [Freeipa-users] performance scaling of sssd / freeipa

2017-01-20 Thread Lukas Slebodnik
On (20/01/17 20:18), Sullivan, Daniel [CRI] wrote:
>Sorry to clutter people's inboxes.  I found another piece of what I believe to 
>be useful information.  When this occurs the following entry also appears in 
>/var/log/messages.
>
>Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]: [2017/01/20 
>13:54:33.942448,  0] ipa_sam.c:4193(bind_callback_cleanup)
>Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]:   kerberos error: 
>code=-1765328228, message=Cannot contact any KDC for realm 
>‘XXX.XXX.UCHICAGO.EDU'
>Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]: [2017/01/20 
>13:54:33.943497,  0] ../source3/lib/smbldap.c:998(smbldap_connect_system)
>Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]:   failed to bind to 
>server ldapi://%2fvar%2frun%2fslapd-XXX-XXX-UCHICAGO-EDU.socket with 
>dn="[Anonymous bind]" Error: Local error
>Jan 20 13:54:33 xxx.xxx.uchicago.edu winbindd[7090]:   #011(unknown)
>Jan 20 13:55:16 xxx.xxx.uchicago.edu winbindd[7090]: [2017/01/20 
>13:55:16.970304,  0] ipa_sam.c:4193(bind_callback_cleanup)
>Jan 20 13:55:16 xxx.xxx.uchicago.edu winbindd[7090]:   kerberos error: 
>code=-1765328228, message=Cannot contact any KDC for realm 
>‘XXX.XXX.UCHICAGO.EDU'
>Jan 20 14:00:01 xxx.xxx.uchicago.edu systemd[1]: Created slice user-0.slice.
>
May I ask why you have configured sssd and winbind on the same machine?
Do you also have winbind configured in /etc/nsswitch.conf?

LS


Re: [Freeipa-users] performance scaling of sssd / freeipa

2017-01-20 Thread Sullivan, Daniel [CRI]
Sorry, I didn’t realize you might want all of the sssd logs… Working on it.

Dan

> On Jan 20, 2017, at 10:27 AM, Sumit Bose  wrote:
> 
> On Fri, Jan 20, 2017 at 03:41:46PM +, Sullivan, Daniel [CRI] wrote:
>> Hi,
>> 
>> I have some more information on this issue.  I’m tracing it down through the 
>> slapd logs and I am continuing to struggle; I was hoping that somebody 
>> could possibly help me, given this additional information.
>> 
>> In the domain log (on the DC), I see an ldap_search_ext operation 155 with 
>> a timeout of 60:
>> 
>> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with 
>> [(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))][cn=Default
>>  Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu].
>> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 155
>> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_add] 
>> (0x2000): New operation 155 timeout 60
>> 
>> Then, on the slapd log for the server, I see the incoming request.  Based on 
>> the result, it looks like the request completes successfully within 1 second:
>> 
>> [20/Jan/2017:08:42:35.179890746 -0600] conn=1518 op=31 SRCH base="cn=Default 
>> Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu" scope=2 
>> filter="(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))" 
>> attrs=ALL
>> [20/Jan/2017:08:42:35.683485674 -0600] conn=1518 op=31 RESULT err=0 tag=101 
>> nentries=0 etime=1 notes=U
>> 
>> Then, subsequently, in the domain log (on the DC), I see the same operation 
>> 155 time out (expectedly, almost a minute later).
> 
> So it looks like ns-slapd sent the response but it never reached SSSD in
> time. Can you send what is happening in the SSSD logs between 08:42:34
> and 08:43:34 (if you prefer you can send it to me directly)?  Maybe there
> is an operation which blocks SSSD for too long, so that it returns too
> late to the main loop to process the response?
> 
> bye,
> Sumit
> 
>> 
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_op_timeout] (0x1000): Issuing timeout for 155
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_op_destructor] (0x1000): Abandoning operation 155
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed 
>> [110]: Connection timed out
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_get_ad_override_done] (0x0040): ipa_get_ad_override request failed.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_id_op_destroy] (0x4000): releasing operation connection
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_initgr_get_overrides_override_done] (0x0040): IPA override lookup 
>> failed: 110
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_id_get_groups_overrides_done] (0x0040): IPA resolve user groups 
>> overrides failed [110].
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [be_mark_dom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu offline
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [be_mark_subdom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu as 
>> inactive
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: 
>> [110]: Connection timed out.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [110]: 
>> Connection timed out.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_id_op_destroy] (0x4000): releasing operation connection
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_reply_std_set] (0x0080): DP Error is OK on failed request?
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_done] 
>> (0x0400): DP Request [Account #4292]: Request handler finished [0]: Success
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [_dp_req_recv] 
>> (0x0400): DP Request [Account #4292]: Receiving request data.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_req_reply_list_success] (0x0400): DP Request [Account #4292]: Finished. 
>> Success.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_req_reply_std] (0x1000): DP Request [Account #4292]: Returning 
>> [Success]: 0,110,Success
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_table_value_destructor] (0x0400): Removing 
>> [0:1:0x0001:1:1::bsdad.uchicago.edu:name=user...@domain.uchicago.edu] from 
>> reply table
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_req_destructor] (0x0400): DP Request [Account #4292]: Request removed.

Re: [Freeipa-users] performance scaling of sssd / freeipa

2017-01-20 Thread Sullivan, Daniel [CRI]
Sumit,

Thank you for taking the time to help me; I know you are very active in 
supporting this project and I am thankful that you are willing to help me 
analyze this data.  As per your request, here is the log data from SSSD for the 
time window you specified.  I anonymized the data, but used find/replace so the 
AAA/BBB/CCC/domain substitutions are consistent.

I really appreciate your help.

Thank you,

Dan

(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_process_result] (0x2000): Trace: sh[0x7f5b5c3eec20], connected[1], 
ops[0x7f5b5c3f7ed0], ldap[0x7f5b5c3df9a0]
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_process_message] (0x4000): Message type: [LDAP_RES_SEARCH_RESULT]
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_get_generic_op_finished] (0x0400): Search result: Success(0), no errmsg 
set
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_op_destructor] (0x2000): Operation 153 finished
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_get_ad_override_done] (0x4000): No override found with filter 
[(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:AAA))].
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_id_op_destroy] (0x4000): releasing operation connection
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_initgr_get_overrides_step] (0x1000): Processing group 1/4
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_initgr_get_overrides_step] (0x1000): Fetching group BBB
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_id_op_connect_step] (0x4000): reusing cached connection
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_get_ad_override_connect_done] (0x4000): Searching for overrides in view 
[Default Trust View] with filter 
[(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:BBB))].
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_print_server] 
(0x2000): Searching 10.50.178.21:389
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with 
[(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:BBB))][cn=Default Trust 
View,cn=views,cn=accounts,dc=ipa,dc=cri,dc=uchicago,dc=edu].
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 154
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_add] 
(0x2000): New operation 154 timeout 60
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_process_result] (0x2000): Trace: sh[0x7f5b5c3eec20], connected[1], 
ops[0x7f5b5c3f7ed0], ldap[0x7f5b5c3df9a0]
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_process_result] (0x2000): Trace: end of ldap_result list
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_process_result] (0x2000): Trace: sh[0x7f5b5c3eec20], connected[1], 
ops[0x7f5b5c3f7ed0], ldap[0x7f5b5c3df9a0]
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_process_message] (0x4000): Message type: [LDAP_RES_SEARCH_RESULT]
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_get_generic_op_finished] (0x0400): Search result: Success(0), no errmsg 
set
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_op_destructor] (0x2000): Operation 154 finished
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_get_ad_override_done] (0x4000): No override found with filter 
[(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:BBB))].
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_id_op_destroy] (0x4000): releasing operation connection
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_initgr_get_overrides_step] (0x1000): Processing group 2/4
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_initgr_get_overrides_step] (0x1000): Fetching group CCC
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_id_op_connect_step] (0x4000): reusing cached connection
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_get_ad_override_connect_done] (0x4000): Searching for overrides in view 
[Default Trust View] with filter 
[(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:CCC))].
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_print_server] 
(0x2000): Searching 10.50.178.21:389
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with 
[(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:CCC))][cn=Default Trust 
View,cn=views,cn=accounts,dc=ipa,dc=cri,dc=uchicago,dc=edu].
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 155
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_add] 
(0x2000): New operation 155 timeout 60
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.e

Re: [Freeipa-users] performance scaling of sssd / freeipa

2017-01-20 Thread Sumit Bose
On Fri, Jan 20, 2017 at 03:41:46PM +, Sullivan, Daniel [CRI] wrote:
> Hi,
> 
> I have some more information on this issue.  I’m tracing it down through the 
> slapd logs and I am continuing to struggle; I was hoping that somebody could 
> possibly help me, given this additional information.
> 
> In the domain log (on the DC), I see an ldap_search_ext operation 155 with a 
> timeout of 60:
> 
> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with 
> [(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))][cn=Default
>  Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu].
> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 155
> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_add] 
> (0x2000): New operation 155 timeout 60
> 
> Then, on the slapd log for the server, I see the incoming request.  Based on 
> the result, it looks like the request completes successfully within 1 second:
> 
> [20/Jan/2017:08:42:35.179890746 -0600] conn=1518 op=31 SRCH base="cn=Default 
> Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu" scope=2 
> filter="(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))" 
> attrs=ALL
> [20/Jan/2017:08:42:35.683485674 -0600] conn=1518 op=31 RESULT err=0 tag=101 
> nentries=0 etime=1 notes=U
> 
> Then, subsequently, in the domain log (on the DC), I see the same operation 
> 155 time out (expectedly, almost a minute later).

So it looks like ns-slapd sent the response but it never reached SSSD in
time. Can you send what is happening in the SSSD logs between 08:42:34
and 08:43:34 (if you prefer you can send it to me directly)?  Maybe there
is an operation which blocks SSSD for too long, so that it returns too
late to the main loop to process the response?
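
One way to narrow this down would be to watch which calls the be process spends
its time in while you reproduce the issue, e.g. something along these lines
(there is one sssd_be per domain, so pick the right PID; just a sketch):

strace -T -f -o /tmp/sssd_be.strace -p "$(pgrep -f sssd_be | head -n1)"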

bye,
Sumit

> 
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_timeout] 
> (0x1000): Issuing timeout for 155
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [sdap_op_destructor] (0x1000): Abandoning operation 155
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed 
> [110]: Connection timed out
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [ipa_get_ad_override_done] (0x0040): ipa_get_ad_override request failed.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [sdap_id_op_destroy] (0x4000): releasing operation connection
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [ipa_initgr_get_overrides_override_done] (0x0040): IPA override lookup 
> failed: 110
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [ipa_id_get_groups_overrides_done] (0x0040): IPA resolve user groups 
> overrides failed [110].
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [be_mark_dom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu offline
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [be_mark_subdom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu as 
> inactive
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: [110]: 
> Connection timed out.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [110]: 
> Connection timed out.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [sdap_id_op_destroy] (0x4000): releasing operation connection
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [dp_reply_std_set] (0x0080): DP Error is OK on failed request?
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_done] 
> (0x0400): DP Request [Account #4292]: Request handler finished [0]: Success
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [_dp_req_recv] 
> (0x0400): DP Request [Account #4292]: Receiving request data.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [dp_req_reply_list_success] (0x0400): DP Request [Account #4292]: Finished. 
> Success.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [dp_req_reply_std] (0x1000): DP Request [Account #4292]: Returning [Success]: 
> 0,110,Success
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [dp_table_value_destructor] (0x0400): Removing 
> [0:1:0x0001:1:1::bsdad.uchicago.edu:name=user...@domain.uchicago.edu] from 
> reply table
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [dp_req_destructor] (0x0400): DP Request [Account #4292]: Request removed.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
> [dp_req_destructor] (0x0400): Number of active DP request: 1
> 
> It seems like whatever problem I have is a communication issue between sssd 
> on the domain contr

Re: [Freeipa-users] performance scaling of sssd / freeipa

2017-01-20 Thread Sullivan, Daniel [CRI]
Hi,

I have some more information on this issue.  I’m tracing it down through the 
slapd logs and I am continuing to struggle; I was hoping that somebody could 
possibly help me, given this additional information.

In the domain log (on the DC), I see an ldap_search_ext operation 155 with a 
timeout of 60:

(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with 
[(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))][cn=Default
 Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu].
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 155
(Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_add] 
(0x2000): New operation 155 timeout 60

Then, on the slapd log for the server, I see the incoming request.  Based on 
the result, it looks like the request completes successfully within 1 second:

[20/Jan/2017:08:42:35.179890746 -0600] conn=1518 op=31 SRCH base="cn=Default 
Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu" scope=2 
filter="(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))" 
attrs=ALL
[20/Jan/2017:08:42:35.683485674 -0600] conn=1518 op=31 RESULT err=0 tag=101 
nentries=0 etime=1 notes=U

Then, subsequently, in the domain log (on the DC), I see the same operation 155 
time out (expectedly, almost a minute later).

(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_timeout] 
(0x1000): Issuing timeout for 155
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_op_destructor] (0x1000): Abandoning operation 155
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed [110]: 
Connection timed out
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_get_ad_override_done] (0x0040): ipa_get_ad_override request failed.
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_id_op_destroy] (0x4000): releasing operation connection
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_initgr_get_overrides_override_done] (0x0040): IPA override lookup failed: 
110
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_id_get_groups_overrides_done] (0x0040): IPA resolve user groups overrides 
failed [110].
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[be_mark_dom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu offline
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[be_mark_subdom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu as 
inactive
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: [110]: 
Connection timed out.
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [110]: 
Connection timed out.
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[sdap_id_op_destroy] (0x4000): releasing operation connection
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_reply_std_set] 
(0x0080): DP Error is OK on failed request?
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_done] 
(0x0400): DP Request [Account #4292]: Request handler finished [0]: Success
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [_dp_req_recv] 
(0x0400): DP Request [Account #4292]: Receiving request data.
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[dp_req_reply_list_success] (0x0400): DP Request [Account #4292]: Finished. 
Success.
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_reply_std] 
(0x1000): DP Request [Account #4292]: Returning [Success]: 0,110,Success
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
[dp_table_value_destructor] (0x0400): Removing 
[0:1:0x0001:1:1::bsdad.uchicago.edu:name=user...@domain.uchicago.edu] from 
reply table
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_destructor] 
(0x0400): DP Request [Account #4292]: Request removed.
(Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_destructor] 
(0x0400): Number of active DP request: 1

It seems like whatever problem I have is a communication issue between sssd on 
the domain controller and ns-slapd.  I’m starting to venture down the road of 
trace/packet-level debug logging on the LDAP server; this could be very 
time-consuming and frustrating (I already have it enabled and am starting to 
adjust the log settings to yield potentially valuable data), so I’m hoping 
that somebody might be able to provide some insight based on the information 
above.  I can definitely look up the user on both domain controllers, and both 
IPA servers only use themselves as IPA servers.  Thank you so much for reading 
and for your help.
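
For the packet-level side, the kind of capture I have in mind is simply 
recording the LDAP traffic on the DC while reproducing the issue, something 
like this (a sketch; the filter may need adjusting for our setup):

tcpdump -i any -w /tmp/ldap-389.pcap port 389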

Dan


> On Jan 19, 2017, at 4:15 PM, 

[Freeipa-users] performance scaling of sssd / freeipa

2017-01-19 Thread Sullivan, Daniel [CRI]
Hi,

I’ve received incredibly good support from this mailing list previously; I am 
hoping that somebody can help me succeed in my ongoing efforts.  I have spent a 
few days on this at this point and I can’t seem to figure out how to address 
this issue.  On my DCs I am seeing excessive ldap_search_ext and 
sdap_get_generic_ext_recv timeouts created solely by the invocation of the ‘id’ 
command on sssd clients.  This problem seems to present itself only when I 
parallelize lookups for an ‘uncached’ user (i.e. I have never performed an 
initial lookup).  Individual arbitrary one-off lookups for a single uncached 
user on a single system almost always work fine.  This leads me to believe this 
is a performance tuning issue.

We operate in an academic research computing unit (i.e. we have an HPC 
cluster), and I need the ability to look up the same user in parallel (using the 
id command) across a relatively large number of systems, for example to spawn 
jobs that require many CPU cores and/or large amounts of memory.  Right now I am 
doing about 50 parallel lookups for the same user to induce this problem.  
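
As a concrete illustration, the test I am running amounts to something like the 
following (hostnames and user are placeholders):

for n in $(seq -w 1 50); do
    ssh node${n}.example.org "id someuser@xxx.uchicago.edu" &
done
wait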

Here is some background information:

1) I have read Jakub's “Anatomy of an SSSD Lookup” as well as “Performance 
Tuning of SSSD for large IPA-AD deployments”, and have implemented 
recommendations from the performance tuning doc, including moving the sssd 
cache to tmpfs.
2) We are on ipa-server 4.4.0-14.el7_3.4 using a trusted AD domain; all of our 
consumed users and groups are in the AD trusted domain.  We have two domain 
controllers; each is a RHEL 7.3 VM with 6 GB of memory.  Almost all (if not 
all) of our clients are running at least sssd 1.14, and are all RHEL 6/7.  
Neither DC is swapping, and both have 2 CPUs.
3) I have tuned sssd on the DCs and on all clients to include these 
options (the problem persists); see the sketch after this list for where they go:
  a) ldap_opt_timeout = 60
  b) ldap_search_timeout = 60
4) On both DCs, I can clear the SSSD cache and look up all 2000 or so users in 
my environment with 40 concurrent lookups occurring locally on each DC (using 
UNIX job control).  I can process all 2000 lookups in this manner without any 
failures (on either DC), and have ‘pre-populated’ the SSSD cache on both DCs 
by doing this.
5) I have made no additional performance tuning changes other than what has 
been described.
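
The sketch mentioned in 3) above: the timeouts simply live in the domain 
section of sssd.conf on each box, roughly like this (domain name is a 
placeholder):

[domain/xxx.xxx.uchicago.edu]
ldap_opt_timeout = 60
ldap_search_timeout = 60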

Would anybody be able to advise on any potential tuning that would be required 
(presumably on the DCs) to facilitate 50 parallel lookups without experiencing 
sdap_get_generic_ext_recv or ldap_search_ext timeouts?  Should I be able to 
do this sort of thing with relative ease?  I was hoping this would be the sort 
of thing that would just work, but based on my relatively extensive testing it 
doesn’t.  Any advice anybody could provide would be greatly appreciated.

Thank you,

Dan Sullivan



-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project