Re: [Freeipa-users] caching of lookups / performance problem

2017-02-01 Thread Sullivan, Daniel [CRI]
Alright cool, thank you for getting back to me.  I appreciate your input and 
expertise.

Dan

> On Feb 1, 2017, at 9:08 AM, Jakub Hrozek  wrote:
> 
> On Wed, Feb 01, 2017 at 02:35:00PM +, Sullivan, Daniel [CRI] wrote:
>> Jakub,
>> 
>> Thank you for getting back to me.  Yeah, I agree with what you are saying.  
>> The problem that I’m really trying to solve is the “how to get them requested 
>> reasonably often” part.  A good use case for my problem is basically:
>> 
>> 1) Somebody starts an interactive job on a compute node (this is somewhat 
>> unusual in and of itself).  There’s a decent chance that nobody has done this 
>> for weeks or months in the first place.  Since a large number of our 
>> 1000 or so users aren’t compute users, there’s a high probability that we have 
>> a substantial number of expired cached entries, possibly 500 or more for 
>> users in /home.
>> 2) They are navigating around on the filesystem and cd into /home and type 
>> ‘ls -l’
>> 
>> This command will actually take upwards of an hour to execute (although it 
>> will complete eventually).  If an ‘ls -l’ on a Linux system takes more than 
>> a few seconds people will think there’s a problem with the system.
>> 
>> Based on my experience even ‘nowait percentage’ has a difficult time with a 
>> large number of records past the nowait threshold.  For example, if there 
>> are 500 records past the expiration percentage threshold, the data provider 
>> will get ‘busy’, which effectively appears to block the nss 
>> responder, instead of returning all 500 of those records from the cache and 
>> then queueing 500 data provider requests in the background to refresh the 
>> cache.
> 
> Yes, when the cache is totally expired, the request would block.
> 
>> 
>> Right now the only ways I can seem to get around this are to do a regular ‘ls 
>> -l’ to refresh the cache on our nodes, or to defer the problem by setting 
>> a really high entry cache timeout.  The cron approach is a little bit 
>> challenging because we need to randomize invocation times: bulk cache 
>> refreshes across the environment are going to cause high load on our domain 
>> controllers (I know this because a single cache refresh causes ns-slapd to 
>> hit and sustain 100% CPU utilization for the duration of the enumeration).
>> 
>> Is there anything crazy about setting the entry cache timeout on the client 
>> to something arbitrarily high, like 5 years (other than knowing the cache is 
>> not accurate)?  Based on my knowledge, a user’s groups are evaluated at login, 
>> so this should be a non-issue from a security standpoint.
> 
> I think a long expiration together with the nowait percentage might be
> a way to go.


-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project

Re: [Freeipa-users] caching of lookups / performance problem

2017-02-01 Thread Jakub Hrozek
On Wed, Feb 01, 2017 at 02:35:00PM +, Sullivan, Daniel [CRI] wrote:
> Jakub,
> 
> Thank you for getting back to me.  Yeah, I agree with what you are saying.  
> The problem that I’m really trying to solve is the “how to get them requested 
> reasonably often” part.  A good use case for my problem is basically:
> 
> 1) Somebody starts an interactive job on a compute node (this is somewhat 
> unusual in and of itself).  There’s a decent chance that nobody has done this 
> for weeks or months in the first place.  Since a large number of our 
> 1000 or so users aren’t compute users, there’s a high probability that we have a 
> substantial number of expired cached entries, possibly 500 or more for users 
> in /home.
> 2) They are navigating around on the filesystem and cd into /home and type 
> ‘ls -l’
> 
> This command will actually take upwards of an hour to execute (although it 
> will complete eventually).  If an ‘ls -l’ on a Linux system takes more than a 
> few seconds people will think there’s a problem with the system.
> 
> Based on my experience even ‘nowait percentage’ has a difficult time with a 
> large number of records past the nowait threshold.  For example, if there are 
> 500 records past the expiration percentage threshold, the data provider will 
> get ‘busy’, which effectively appears to block the nss responder, 
> instead of returning all 500 of those records from the cache and then 
> queueing 500 data provider requests in the background to refresh the cache.

Yes, when the cache is totally expired, the request would block.

> 
> Right now the only ways I can seem to get around this are to do a regular ‘ls 
> -l’ to refresh the cache on our nodes, or to defer the problem by setting a 
> really high entry cache timeout.  The cron approach is a little bit 
> challenging because we need to randomize invocation times: bulk cache 
> refreshes across the environment are going to cause high load on our domain 
> controllers (I know this because a single cache refresh causes ns-slapd to 
> hit and sustain 100% CPU utilization for the duration of the enumeration).
> 
> Is there anything crazy about setting the entry cache timeout on the client 
> to something arbitrarily high, like 5 years (other than knowing the cache is 
> not accurate)?  Based on my knowledge, a user’s groups are evaluated at login, 
> so this should be a non-issue from a security standpoint.

I think a long expiration together with the nowait percentage might be
a way to go.
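
For illustration, such a combination could look like the following sssd.conf [domain] fragment. The domain name and the concrete values are assumptions, not tested recommendations; entry_cache_timeout is expressed in seconds:

```ini
[domain/ipa.example.com]
# Keep cached entries valid for a long time, e.g. roughly 180 days.
entry_cache_timeout = 15552000
# Once an entry has passed this percentage of its lifetime, serve the
# cached copy immediately and refresh it in the background.
entry_cache_nowait_percentage = 50
```

With a long timeout, most lookups are answered straight from the cache, and the nowait percentage keeps entries that are actually being used refreshed in the background.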


Re: [Freeipa-users] caching of lookups / performance problem

2017-02-01 Thread Sullivan, Daniel [CRI]
Jakub,

Thank you for getting back to me.  Yeah, I agree with what you are saying.  The 
problem that I’m really trying to solve is the “how to get them requested 
reasonably often” part.  A good use case for my problem is basically:

1) Somebody starts an interactive job on a compute node (this is somewhat 
unusual in and of itself).  There’s a decent chance that nobody has done this 
for weeks or months in the first place.  Since a large number of our 
1000 or so users aren’t compute users, there’s a high probability that we have a 
substantial number of expired cached entries, possibly 500 or more for users in 
/home.
2) They are navigating around on the filesystem and cd into /home and type ‘ls 
-l’

This command will actually take upwards of an hour to execute (although it will 
complete eventually).  If an ‘ls -l’ on a Linux system takes more than a few 
seconds people will think there’s a problem with the system.

Based on my experience even ‘nowait percentage’ has a difficult time with a 
large number of records past the nowait threshold.  For example, if there are 
500 records past the expiration percentage threshold, the data provider will 
get ‘busy’, which effectively appears to block the nss responder, 
instead of returning all 500 of those records from the cache and then queueing 
500 data provider requests in the background to refresh the cache.

Right now the only ways I can seem to get around this are to do a regular ‘ls 
-l’ to refresh the cache on our nodes, or to defer the problem by setting a 
really high entry cache timeout.  The cron approach is a little bit challenging 
because we need to randomize invocation times: bulk cache refreshes 
across the environment are going to cause high load on our domain controllers 
(I know this because a single cache refresh causes ns-slapd to hit and sustain 
100% CPU utilization for the duration of the enumeration).

Is there anything crazy about setting the entry cache timeout on the client to 
something arbitrarily high, like 5 years (other than knowing the cache is not 
accurate)?  Based on my knowledge, a user’s groups are evaluated at login, so 
this should be a non-issue from a security standpoint.

Dan



> On Feb 1, 2017, at 1:55 AM, Jakub Hrozek  wrote:
> 
> On Tue, Jan 31, 2017 at 08:05:18PM +, Sullivan, Daniel [CRI] wrote:
>> Hi,
>> 
>> I figured out what was going on with this issue.  Basically cache timeouts 
>> were causing a large number of uid numbers in an arbitrarily-timed directory 
>> listing to have expired cache records, which causes those records to be 
>> looked up again by the data provider (and thus blocking ‘ls -l’).  To work 
>> around this issue we are currently setting the entry_cache_timeout to 
>> something arbitrarily high, i.e. 99, but I’m questioning whether or not this 
>> is the best approach.  I’d like to use something like 
>> refresh_expired_interval, although based on my testing it appears that this 
>> does not update records for a trusted AD domain.  I’ve also tried using 
>> enumeration, and that doesn’t seem to work either.
>> 
>> I suppose my question is this: is there a preferred method to keep cache 
>> records up-to-date for a trusted AD domain?  Right now I am thinking about 
>> cron-tabbing an ‘ls -l’ of /home and allowing entry_cache_nowait_percentage 
>> to fill this function, although that seems hacky to me.
>> 
>> Any advisement that could be provided would be greatly appreciated.
> 
> Hi,
> 
> If the entries are requested reasonably often (typically at least once
> per cache lifetime), then maybe just lowering the
> 'entry_cache_nowait_percentage' value so that the background check is
> performed more often might help.
> 

Re: [Freeipa-users] caching of lookups / performance problem

2017-02-01 Thread Jakub Hrozek
On Tue, Jan 31, 2017 at 08:05:18PM +, Sullivan, Daniel [CRI] wrote:
> Hi,
> 
> I figured out what was going on with this issue.  Basically cache timeouts 
> were causing a large number of uid numbers in an arbitrarily-timed directory 
> listing to have expired cache records, which causes those records to be 
> looked up again by the data provider (and thus blocking ‘ls -l’).  To work 
> around this issue we are currently setting the entry_cache_timeout to 
> something arbitrarily high, i.e. 99, but I’m questioning whether or not this 
> is the best approach.  I’d like to use something like 
> refresh_expired_interval, although based on my testing it appears that this 
> does not update records for a trusted AD domain.  I’ve also tried using 
> enumeration, and that doesn’t seem to work either.
> 
> I suppose my question is this: is there a preferred method to keep cache 
> records up-to-date for a trusted AD domain?  Right now I am thinking about 
> cron-tabbing an ‘ls -l’ of /home and allowing entry_cache_nowait_percentage 
> to fill this function, although that seems hacky to me.
> 
> Any advisement that could be provided would be greatly appreciated.

Hi,

If the entries are requested reasonably often (typically at least once
per cache lifetime), then maybe just lowering the
'entry_cache_nowait_percentage' value so that the background check is
performed more often might help.


Re: [Freeipa-users] caching of lookups / performance problem

2017-01-31 Thread Sullivan, Daniel [CRI]
Hi,

I figured out what was going on with this issue.  Basically cache timeouts were 
causing a large number of uid numbers in an arbitrarily-timed directory listing 
to have expired cache records, which causes those records to be looked up again 
by the data provider (and thus blocking ‘ls -l’).  To work around this issue 
we are currently setting the entry_cache_timeout to something arbitrarily high, 
i.e. 99, but I’m questioning whether or not this is the best approach.  I’d 
like to use something like refresh_expired_interval, although based on my 
testing it appears that this does not update records for a trusted AD domain.  
I’ve also tried using enumeration, and that doesn’t seem to work either.

I suppose my question is this: is there a preferred method to keep cache 
records up-to-date for a trusted AD domain?  Right now I am thinking about 
cron-tabbing an ‘ls -l’ of /home and allowing entry_cache_nowait_percentage to 
fill this function, although that seems hacky to me.
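
For reference, the background-refresh knob mentioned above lives in the [domain] section of sssd.conf. A minimal sketch with hypothetical values follows (and, as observed above, it may not cover users from a trusted AD domain):

```ini
[domain/ipa.example.com]
# Hypothetical values, in seconds. refresh_expired_interval asks the
# back end to refresh expired entries in the background; a common
# suggestion is roughly 75% of entry_cache_timeout.
entry_cache_timeout = 14400
refresh_expired_interval = 10800
```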

Any advisement that could be provided would be greatly appreciated.

Best,

Dan Sullivan

> On Jan 30, 2017, at 10:52 AM, Sullivan, Daniel [CRI] 
>  wrote:
> 
> Hi,
> 
> I have another question about sssd performance.  I’m having a difficult time 
> doing a regularly performant ‘ls -l’ operation against /home, a mounted NFS 
> share of all of our users’ home directories.  There are 667 entries in this 
> folder, and all of them have IDs that are resolvable via freeipa/sssd.  We 
> are using an AD trusted domain.
> 
> It is clear to me why an initial invocation of this lookup should take some 
> time (populating the local ldb cache).  And it does.  Usually around 5-10 
> minutes, but sometimes longer.  After the initial lookups are complete, the 
> output of ‘ls -l’ renders fine, and I can inspect the local filesystem cache 
> using ldbsearch and see that it is populated.  The issue is that if I wait a 
> while, or restart sssd, it appears that I have to go through all of these 
> lookups again to render the directory listing.
> 
> I am trying to find an optimal configuration for sssd.conf that will allow a 
> performant ‘ls -l’ listing of a directory with a large number of different id 
> numbers assigned to filesystem objects to always return results immediately 
> from the local cache (after an initial invocation of this command for any 
> given directory).  I think basically what I want is to have the ldb cache 
> always ‘up-to-date’, or at least have sssd willing to immediately dump what 
> it has without having to do a bunch of lookups while blocking the ‘ls -l’ 
> thread.  If possible, whatever solution implemented should also survive a 
> restart of the sssd process.  In short, aside from an initial invocation, I 
> never want ‘ls -l’ to take more than a few seconds.
> 
> The issue described above is somewhat problematic because it appears to cause 
> contention on the sssd process, effectively allowing a user doing ls -l /home 
> to inadvertently degrade system performance for another user.
> 
> So far I have tried:
> 
> 1)  Implementing 'enumeration = true' for the [domain] section.  This seems 
> to have no impact.  It might be worthwhile to note that we are using an AD 
> trusted domain.
> 2)  Using the refresh_expired_interval configuration for the [domain] section
> 
> I have read the following two documents in a decent level of detail:
> 
> https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/
> https://jhrozek.wordpress.com/2015/03/11/anatomy-of-sssd-user-lookup/
> 
> It almost seems to me like the answer to this would be to keep the LDB cache 
> valid indefinitely (step 4 on 
> https://jhrozek.wordpress.com/2015/03/11/anatomy-of-sssd-user-lookup/).
> 
> Presumably this is a problem that somebody has seen before.  Would somebody 
> be able to advise on the best way to deal with this?  I appreciate your help.
> 
> Thank you,
> 
> Dan
> 

[Freeipa-users] caching of lookups / performance problem

2017-01-30 Thread Sullivan, Daniel [CRI]
Hi,

I have another question about sssd performance.  I’m having a difficult time 
doing a regularly performant ‘ls -l’ operation against /home, a mounted NFS 
share of all of our users’ home directories.  There are 667 entries in this 
folder, and all of them have IDs that are resolvable via freeipa/sssd.  We are 
using an AD trusted domain.

It is clear to me why an initial invocation of this lookup should take some 
time (populating the local ldb cache).  And it does.  Usually around 5-10 
minutes, but sometimes longer.  After the initial lookups are complete, the 
output of ‘ls -l’ renders fine, and I can inspect the local filesystem cache 
using ldbsearch and see that it is populated.  The issue is that if I wait a 
while, or restart sssd, it appears that I have to go through all of these 
lookups again to render the directory listing.
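
The ldbsearch inspection mentioned above can be done roughly as follows. The domain name (and therefore the cache path) is an assumption; adjust it for the actual sssd domain:

```shell
#!/bin/bash
# Hypothetical inspection of the sssd ldb cache (run as root on a client).
# The cache file is named after the sssd domain.
DOMAIN="example.com"
CACHE_DB="/var/lib/sss/db/cache_${DOMAIN}.ldb"

# List cached users with their expiry timestamps; dataExpireTimestamp is
# the epoch second after which sssd treats the entry as expired.
if command -v ldbsearch > /dev/null && [ -r "$CACHE_DB" ]; then
    ldbsearch -H "$CACHE_DB" '(objectCategory=user)' name dataExpireTimestamp
else
    echo "ldbsearch or $CACHE_DB not available on this host"
fi
```

Comparing dataExpireTimestamp against the current time shows which entries an ‘ls -l’ would have to re-fetch from the data provider.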

I am trying to find an optimal configuration for sssd.conf that will allow a 
performant ‘ls -l’ listing of a directory with a large number of different id 
numbers assigned to filesystem objects to always return results immediately 
from the local cache (after an initial invocation of this command for any given 
directory).  I think basically what I want is to have the ldb cache always 
‘up-to-date’, or at least have sssd willing to immediately dump what it has 
without having to do a bunch of lookups while blocking the ‘ls -l’ thread.  If 
possible, whatever solution implemented should also survive a restart of the 
sssd process.  In short, aside from an initial invocation, I never want ‘ls -l’ 
to take more than a few seconds.

The issue described above is somewhat problematic because it appears to cause 
contention on the sssd process, effectively allowing a user doing ls -l /home to 
inadvertently degrade system performance for another user.

So far I have tried:

1)  Implementing 'enumeration = true' for the [domain] section.  This seems to 
have no impact.  It might be worthwhile to note that we are using an AD trusted 
domain.
2)  Using the refresh_expired_interval configuration for the [domain] section

I have read the following two documents in a decent level of detail:

https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/
https://jhrozek.wordpress.com/2015/03/11/anatomy-of-sssd-user-lookup/

It almost seems to me like the answer to this would be to keep the LDB cache 
valid indefinitely (step 4 on 
https://jhrozek.wordpress.com/2015/03/11/anatomy-of-sssd-user-lookup/).

Presumably this is a problem that somebody has seen before.  Would somebody be 
able to advise on the best way to deal with this?  I appreciate your help.

Thank you,

Dan
