Re: [389-users] 389 pauses every 5 minutes under load

2011-10-13 Thread Justin Gronfur
On 10/12/2011 05:41 PM, Rich Megginson wrote:
 Thanks.

 The select/poll are the server basically idle, waiting on a condition 
 variable for new work to perform.

 You can eliminate many of these by decreasing your cn=config 
 nsslapd-threadnum setting.  The default is 30, but you may find better 
 performance by setting to somewhere around 2 times the number of 
 cpus/cores you have on your machine (but at least 8).
Yeah, I knew what the select/poll calls are, but I will look at 
decreasing the number of threads to a more suitable number.  Are these 
threads bound to a specific connection or is it a thread pool that gets 
requests delegated to it from the open connections?

 Do you know if any of these come from a period of time during which 
 the server is consuming a lot of CPU?
Yes these were taken starting ~1 second before the spike and going ~2 
seconds after.  It's easy to start monitoring right before the event 
when the leading edge to leading edge time is exactly 5 minutes!

 No, should not be a problem.  And it is standard - many apps do this 
 (e.g. a web service that uses ldap for auth will not want to 
 open/close a connection for every single user - it will typically use 
 a connection pool of already open and possibly idle connections).
Our app doesn't really use a pool in the traditional sense... just one 
connection bound as a given user for that user's session... but I'm glad 
to hear that it shouldn't be the problem.

Thanks,
Justin
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users


Re: [389-users] 389 pauses every 5 minutes under load

2011-10-13 Thread Richard Megginson
- Original Message -
 On 10/12/2011 05:41 PM, Rich Megginson wrote:
  Thanks.
 
  The select/poll are the server basically idle, waiting on a
  condition
  variable for new work to perform.
 
  You can eliminate many of these by decreasing your cn=config
  nsslapd-threadnum setting.  The default is 30, but you may find
  better
  performance by setting to somewhere around 2 times the number of
  cpus/cores you have on your machine (but at least 8).
 Yeah, I knew what the select/poll calls are, but I will look at
 decreasing the number of threads to a more suitable number.  Are
 these
 threads bound to a specific connection or is it a thread pool that
 gets
 requests delegated to it from the open connections?

It is essentially a thread pool, although it has a bit more logic than that.

 
  Do you know if any of these come from a period of time during which
  the server is consuming a lot of CPU?
 Yes these were taken starting ~1 second before the spike and going ~2
 seconds after.  It's easy to start monitoring right before the event
 when the leading edge to leading edge time is exactly 5 minutes!

Ok, thanks.  I am investigating.

  No, should not be a problem.  And it is standard - many apps do
  this
  (e.g. a web service that uses ldap for auth will not want to
  open/close a connection for every single user - it will typically
  use
  a connection pool of already open and possibly idle connections).
 Our app doesn't really use a pool in the traditional sense... just
 one
 connection bound as a given user for that user's session... but I'm
 glad
 to hear that it shouldn't be the problem.
 
 Thanks,
 Justin
 
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users


Re: [389-users] 389 pauses every 5 minutes under load

2011-10-13 Thread Justin Gronfur
Well, I figured it out!

After removing the Java stack completely from the equation, I noticed 
that I was getting worse, but more consistent performance.  The worse 
performance indicated something to do with caching, so after I dug into 
the Java code for a while I found that we were caching certain object 
classes (therefore making them not need to be rebuilt from ldap).  The 
timeout on that cache was 5 minutes.

So during load testing, the tests run great until everything in that 
cache expires at once making all 15 threads hit 389 to rebuild the cache 
entries at the exact same time.  My data sample used for testing 
involved about 150 objects being cached, which takes 150 queries to 
generate.  So when the cache expires, 15*150=2250 queries hit 389 at the 
exact same time which causes it to seemingly lock up for 20 seconds 
while it handles all of that.

I am still surprised that 389 can't handle that better, but I haven't 
spent much time tweaking settings or indexes for performance.  I'm sure 
that we are running very sub-optimally right now.

Thanks,
Justin
--
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users


Re: [389-users] 389 pauses every 5 minutes under load

2011-10-13 Thread Rich Megginson
On 10/13/2011 12:30 PM, Justin Gronfur wrote:
 Well, I figured it out!

 After removing the Java stack completely from the equation, I noticed 
 that I was getting worse, but more consistent performance.  The worse 
 performance indicated something to do with caching, so after I dug 
 into the Java code for a while I found that we were caching certain 
 object classes (therefore making them not need to be rebuilt from 
 ldap).  The timeout on that cache was 5 minutes.

 So during load testing, the tests run great until everything in that 
 cache expires at once making all 15 threads hit 389 to rebuild the 
 cache entries at the exact same time.  My data sample used for testing 
 involved about 150 objects being cached, which takes 150 queries to 
 generate.  So when the cache expires, 15*150=2250 queries hit 389 at 
 the exact same time which causes it to seemingly lock up for 20 
 seconds while it handles all of that.

 I am still surprised that 389 can't handle that better, but I haven't 
 spent much time tweaking settings or indexes for performance.  I'm 
 sure that we are running very sub-optimally right now.
The stack traces you provided give us some clues.  I would suggest 
lowering nsslapd-threadnumber in cn=config to see if that makes a 
difference - one factor could be excessive thread contention based on 
your stack traces.

You can use the logconv.pl script to analyze your access log to see if 
you need to tune your indexes or other performance related settings such 
as nsslapd-lookthroughlimit or nsslapd-idlistscanlimit.

 Thanks,
 Justin

--
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users


Re: [389-users] 389 pauses every 5 minutes under load

2011-10-12 Thread Rich Megginson
On 10/11/2011 08:42 AM, Justin Gronfur wrote:
 On 10/10/2011 02:46 PM, Rich Megginson wrote:
 I'm just trying to figure out what happens every 5 minutes inside the 
 directory server.  There is a thread that attempts to clean up 
 tombstone entries and other state information.  But if you are not 
 using replication then that is not run.  There are database threads 
 such as checkpointing, log flushing, etc. but they run every 250 
 milliseconds.  That's what this seems like - do you notice high I/O 
 usage during the time when the server is pegged at 100% cpu?

 No high disk I/O, in fact vmstat is reporting substantially less disk 
 IO at that time than normal.  The cpu is 94% user and 6% system during 
 those time.  I'm not sure if attachments are allowed on this list, but 
 I used ptrace to grab the list of current operations (not full traces) 
 and filtered out all of the common select(), poll(), wait() calls.  
 What is left is attached.

 Let me know if you want me to try anything else or gather anymore data.
This is helpful.  Any chance you could paste the entire stack traces?  
For example,
#0  0x003735c3c868 in slapi_get_mapping_tree_node_by_dn@plt () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735c4ad38 in slapi_dn_normalize_ext () from 
/usr/lib64/dirsrv/libslapd.so.0
etc. are nice to have, but much better would be the entire stack traces 
of these calls so we can see where they are called from.

 Thanks,
 Justin

--
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users


Re: [389-users] 389 pauses every 5 minutes under load

2011-10-11 Thread Justin Gronfur

On 10/10/2011 02:46 PM, Rich Megginson wrote:
I'm just trying to figure out what happens every 5 minutes inside the 
directory server.  There is a thread that attempts to clean up 
tombstone entries and other state information.  But if you are not 
using replication then that is not run.  There are database threads 
such as checkpointing, log flushing, etc. but they run every 250 
milliseconds.  That's what this seems like - do you notice high I/O 
usage during the time when the server is pegged at 100% cpu?


No high disk I/O, in fact vmstat is reporting substantially less disk IO 
at that time than normal.  The cpu is 94% user and 6% system during 
those time.  I'm not sure if attachments are allowed on this list, but I 
used ptrace to grab the list of current operations (not full traces) and 
filtered out all of the common select(), poll(), wait() calls.  What is 
left is attached.


Let me know if you want me to try anything else or gather anymore data.

Thanks,
Justin
#0  0x003735c3c868 in slapi_get_mapping_tree_node_by_dn@plt () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735c4ad38 in slapi_dn_normalize_ext () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x7f5459e15b8f in value_cmp () from 
/usr/lib64/dirsrv/plugins/libsyntax-plugin.so
#0  0x003735c388b8 in slapi_sdn_get_ndn@plt () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735c4c398 in dn_ignore_case_to_end () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x7f5457f8f816 in acl_access_allowed () from 
/usr/lib64/dirsrv/plugins/libacl-plugin.so
#0  0x7f5457fa24d1 in acl_match_macro_in_target () from 
/usr/lib64/dirsrv/plugins/libacl-plugin.so
#0  0x003735c38b88 in slapi_attr_basetype@plt () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x7f5457fa2392 in acl_match_macro_in_target () from 
/usr/lib64/dirsrv/plugins/libacl-plugin.so
#0  0x003110414237 in PR_RWLock_Unlock () from /lib64/libnspr4.so
#0  0x7f5459e15a2c in value_normalize_ext () from 
/usr/lib64/dirsrv/plugins/libsyntax-plugin.so
#0  0x003735c4d581 in slapi_sdn_issuffix () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735c3a1a8 in slapi_ch_free@plt () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735c4d175 in slapi_sdn_compare () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735caeef0 in slapi_utf8casecmp () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735c4d5c2 in slapi_sdn_issuffix () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003112cdd2f3 in __db_check_txn () from /lib64/libdb-4.8.so
#0  0x003735c4d579 in slapi_sdn_issuffix () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x00310580a98c in pthread_mutex_unlock () from /lib64/libpthread.so.0
#0  0x7f5457f9ce50 in acllist_get_next_aci () from 
/usr/lib64/dirsrv/plugins/libacl-plugin.so
#0  0x7f5456ea1885 in idl_iterator_dereference_increment () from 
/usr/lib64/dirsrv/plugins/libback-ldbm.so
#0  0x00311040cde8 in pthread_mutex_unlock@plt () from /lib64/libnspr4.so
#0  0x7f5457fa2970 in acl_ht_add_and_freeOld () from 
/usr/lib64/dirsrv/plugins/libacl-plugin.so
#0  0x003735c3a268 in __strdup@plt () from /usr/lib64/dirsrv/libslapd.so.0
#0  0x00310580e01d in read () from /lib64/libpthread.so.0
#0  0x003735cadff6 in slapi_has8thBit () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x0031054c7b57 in sched_yield () from /lib64/libc.so.6
#0  0x003112c257c8 in __db_tas_mutex_unlock () from /lib64/libdb-4.8.so
#0  0x003111401b5a in PL_HashTableRemove () from /lib64/libplds4.so
#0  0x0031054780f2 in _int_malloc () from /lib64/libc.so.6
#0  0x003110423050 in PR_Unlock () from /lib64/libnspr4.so
#0  0x003735caeefc in slapi_utf8casecmp () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735c4cf84 in slapi_sdn_get_dn () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x0031054c7b57 in sched_yield () from /lib64/libc.so.6
#0  0x7f5457f8f895 in acl_access_allowed () from 
/usr/lib64/dirsrv/plugins/libacl-plugin.so
#0  0x003735c41413 in attrlist_find () from /usr/lib64/dirsrv/libslapd.so.0
#0  0x003105477683 in _int_free () from /lib64/libc.so.6
#0  0x003111401be0 in PL_HashTableLookupConst () from /lib64/libplds4.so
#0  0x7f5457f9c283 in ?? () from /usr/lib64/dirsrv/plugins/libacl-plugin.so
#0  0x0031054e0e4a in setsockopt () from /lib64/libc.so.6
#0  0x0031054c7b57 in sched_yield () from /lib64/libc.so.6
#0  0x003735c3aa68 in slapi_isbitset_uchar@plt () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735c457b5 in slapi_ch_malloc () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003735c40231 in attr_done () from /usr/lib64/dirsrv/libslapd.so.0
#0  0x00310547ab14 in malloc () from /lib64/libc.so.6
#0  0x0031054805f5 in strdup () from /lib64/libc.so.6
#0  0x7f5459e0d878 in value_normalize_ext@plt () from 
/usr/lib64/dirsrv/plugins/libsyntax-plugin.so
#0  0x003735c8dbe3 in plugin_call_syntax_filter_ava () from 
/usr/lib64/dirsrv/libslapd.so.0
#0  0x003105527043 in __strlen_sse42 () 

Re: [389-users] 389 pauses every 5 minutes under load

2011-10-10 Thread Rich Megginson
On 10/07/2011 11:56 AM, Justin Gronfur wrote:
 Hello all,
 I need your expertise... please help me!  (Disclaimer:  I am a relative
 newcomer to 389ds)

 I'm running a Java application that keeps user authentication,
 permissions, and preferences in ldap.  And I'm currently load testing
 this application (using Jmeter, 15 concurrent threads, no think time)
 and I'm getting really good performance most of the time.  However every
 5 minutes (from the time I started ldap), 389's CPU usage will spike to
 375% (400% = all 4 processors at 100%, 389 normally sits around
 15-20%).  These pauses last for between 20 - 30 seconds (proportionate
 to the load I'm throwing at it) during which our application will just
 sit.  Since I'm just running the same set of requests at it constantly,
 there isn't anything different in terms of our application during those
 times, which points to 389 as the culprit (or possibly some glassfish
 ldap pool problem).

 Some info:
 Glassfish 3.1 final on Java 1.6.0_26 (64 bit server VM)
 389-Directory/1.2.9.10 B2011.250.1455
 Fedora 15 64-bit (also observed on Centos 5.4 64-bit)

 Have any of you run into this problem?  Do you have any possible config
 changes I could try? Any possible leads at all?
Are you using replication?  Is this a replication master?  Is your load 
tester doing delete operations?
 Thanks,
 Justin
 --
 389 users mailing list
 389-users@lists.fedoraproject.org
 https://admin.fedoraproject.org/mailman/listinfo/389-users

--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users


Re: [389-users] 389 pauses every 5 minutes under load

2011-10-10 Thread Justin Gronfur
On 10/10/2011 12:00 PM, Rich Megginson wrote:
 Are you using replication?  Is this a replication master?  Is your 
 load tester doing delete operations?

I am not using replication on this particular instance.  Load tester is 
not doing any delete operations, almost all reads/searches.  I would 
guess that 30-50% of these requests result in no such element return in 
case that matters at all.

Thanks,
Justin

--
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users


Re: [389-users] 389 pauses every 5 minutes under load

2011-10-10 Thread Rich Megginson
On 10/10/2011 01:34 PM, Justin Gronfur wrote:
 On 10/10/2011 12:00 PM, Rich Megginson wrote:
 Are you using replication?  Is this a replication master?  Is your 
 load tester doing delete operations?

 I am not using replication on this particular instance.  Load tester 
 is not doing any delete operations, almost all reads/searches.  I 
 would guess that 30-50% of these requests result in no such element 
 return in case that matters at all.
I'm just trying to figure out what happens every 5 minutes inside the 
directory server.  There is a thread that attempts to clean up tombstone 
entries and other state information.  But if you are not using 
replication then that is not run.  There are database threads such as 
checkpointing, log flushing, etc. but they run every 250 milliseconds.  
That's what this seems like - do you notice high I/O usage during the 
time when the server is pegged at 100% cpu?

 Thanks,
 Justin


--
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users


[389-users] 389 pauses every 5 minutes under load

2011-10-07 Thread Justin Gronfur
Hello all,
I need your expertise... please help me!  (Disclaimer:  I am a relative 
newcomer to 389ds)

I'm running a Java application that keeps user authentication, 
permissions, and preferences in ldap.  And I'm currently load testing 
this application (using Jmeter, 15 concurrent threads, no think time) 
and I'm getting really good performance most of the time.  However every 
5 minutes (from the time I started ldap), 389's CPU usage will spike to 
375% (400% = all 4 processors at 100%, 389 normally sits around 
15-20%).  These pauses last for between 20 - 30 seconds (proportionate 
to the load I'm throwing at it) during which our application will just 
sit.  Since I'm just running the same set of requests at it constantly, 
there isn't anything different in terms of our application during those 
times, which points to 389 as the culprit (or possibly some glassfish 
ldap pool problem).

Some info:
Glassfish 3.1 final on Java 1.6.0_26 (64 bit server VM)
389-Directory/1.2.9.10 B2011.250.1455
Fedora 15 64-bit (also observed on Centos 5.4 64-bit)

Have any of you run into this problem?  Do you have any possible config 
changes I could try? Any possible leads at all?

Thanks,
Justin
--
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users